Database Normalization in SQL with Examples

  • Comments posted to this topic are about the item Database Normalization in SQL with Examples

  • Nice article!

    My only quibble is your last example, adding a zip code table. The problem is that zip code and city do not have  a 1:1 correspondence. It is entirely possible, especially in the case of small towns that are close together, for a zip code to contain multiple cities. Therefore you need the zip code table to have one of the following:

    1. A primary key independent of zip code. In addition, you should probably add an alternate key for the combination of zip code and city. The foreign key would then be the primary key (probably an identity)
    2. A primary key consisting of both zip code and city. Personally I don't favor multi-part keys for space reasons but YMMV.
    3. Like 1 but use no foreign key just putting the zip code and city in the customer table and have the front end look up the city when the zip code is entered, with a dialog to let the user choose the city if there are multiple for the zip code. This was the choice I went with, it works quite well, being flexible enough to handle the weirdness of the real world.

    Other than that, an excellent summation of normalization. I just wish every developer could read and understand this article. 3NF is such a central and simple concept that leverages the database and its data so well it should absolutely be taught first in college.

  • The debate about normalizing addresses continues! (see https://stackoverflow.com/questions/5530681/normalize-an-address).

    Here in The Netherlands, zip code and house number makes an address unique. With these 2 fields you can determine street, city and province also. So in my case the customer table should only contain a number and a zip code.

    I agree with Roger, every developer should have a knowledge about normalization, this prevents several issue in the databases. Had one recently where we had to cancel an database archiving project due to a non-normalized database.

    Wilfred
    The best things in life are the simple things

  • Assuming there is only one Customer per project, why not delete the 'Feedback' table and move the column feedback to the projects table?  You can then add a foreign key to the projects table that points to  the customerID.

  • I also believe it's not wise to break out the postal code (not every country calls it a zip code) from the rest of the address.  City and postal code have a many to many relationship, and you would be creating too complex a design for almost no benefit, since the Postal table could potentially have no other attributes than the postal code itself.

    It also feels like the relationship between Customers and ContactPersons is backwards.  As defined here, an individual person could be a contact for multiple customers, but it would seem more likely that a customer could have multiple contacts.

  • roger.plowman wrote:

    Nice article!

    My only quibble is your last example, adding a zip code table. The problem is that zip code and city do not have  a 1:1 correspondence. It is entirely possible, especially in the case of small towns that are close together, for a zip code to contain multiple cities. Therefore you need the zip code table to have one of the following:

     

      <li style="list-style-type: none;">

    1. A primary key independent of zip code. In addition, you should probably add an alternate key for the combination of zip code and city. The foreign key would then be the primary key (probably an identity)

     

      <li style="list-style-type: none;">

    1. A primary key consisting of both zip code and city. Personally I don't favor multi-part keys for space reasons but YMMV.

     

      <li style="list-style-type: none;">

    1. Like 1 but use no foreign key just putting the zip code and city in the customer table and have the front end look up the city when the zip code is entered, with a dialog to let the user choose the city if there are multiple for the zip code. This was the choice I went with, it works quite well, being flexible enough to handle the weirdness of the real world.

     

    Other than that, an excellent summation of normalization. I just wish every developer could read and understand this article. 3NF is such a central and simple concept that leverages the database and its data so well it should absolutely be taught first in college.

    Expand this a bit more, a single city could also have multiple zip codes for different areas of the city.

     

  • Lynn Pettis wrote:

    roger.plowman wrote:

    Nice article!

    My only quibble is your last example, adding a zip code table. The problem is that zip code and city do not have  a 1:1 correspondence. It is entirely possible, especially in the case of small towns that are close together, for a zip code to contain multiple cities. Therefore you need the zip code table to have one of the following:

     

      <li style="list-style-type: none;">
      <li style="list-style-type: none;">

     

      <li style="list-style-type: none;">
      <li style="list-style-type: none;">

    1. A primary key independent of zip code. In addition, you should probably add an alternate key for the combination of zip code and city. The foreign key would then be the primary key (probably an identity)

     

     

     

      <li style="list-style-type: none;">
      <li style="list-style-type: none;">

     

      <li style="list-style-type: none;">
      <li style="list-style-type: none;">

    1. A primary key consisting of both zip code and city. Personally I don't favor multi-part keys for space reasons but YMMV.

     

     

     

      <li style="list-style-type: none;">
      <li style="list-style-type: none;">

     

      <li style="list-style-type: none;">
      <li style="list-style-type: none;">

    1. Like 1 but use no foreign key just putting the zip code and city in the customer table and have the front end look up the city when the zip code is entered, with a dialog to let the user choose the city if there are multiple for the zip code. This was the choice I went with, it works quite well, being flexible enough to handle the weirdness of the real world.

     

    Other than that, an excellent summation of normalization. I just wish every developer could read and understand this article. 3NF is such a central and simple concept that leverages the database and its data so well it should absolutely be taught first in college.

    Expand this a bit more, a single city could also have multiple zip codes for different areas of the city.

    True, but the key is zip code. Entering a single zip code will also fill in the city automatically (and state). When you have multiple zip codes per city it makes no difference, since zip code is entered first.

    This does require a slight reordering of data entry fields on the form, but our users have been doing it that way for literally 20 years so they're used to it. 🙂

     

  • Hi,

    I noticed the author used an ID column of IDENTITY as the PK in every table. This appears to be a best practice based on what i've  read online. Can anyone point out when one would not use this a PK in their table? In other words, is there an exception to this rule?

    Thanks for posting the article; I found it informative.

  • pozzolan wrote:

    Hi,

    I noticed the author used an ID column of IDENTITY as the PK in every table. This appears to be a best practice based on what i've  read online. Can anyone point out when one would not use this a PK in their table? In other words, is there an exception to this rule?

    Thanks for posting the article; I found it informative.

    I've always found identities useful as PKs, but there are cases where you wouldn't want to do that, although it's rare. One example might be if you need large detail record counts for a master record, you might want to group the details by the master record's PK, that way you could take advantage of having the details physically contiguous for disk performance reasons.

    In my opinion, cases where you wouldn't want to use an identity for PKs are pretty rare, though.

    This is one of those controversial subjects that can spark long threads, btw. 🙂 I'm sure somebody would just love to point out all the edge cases where a non-identity is a better PK.

  • Here is the thing, I would not separate out the ZIP CODE (postal code) or the City from the address.  Many businesses make use of third parties to validate customer address and these companies use postal information, if not the postal database itself, to complete this validation.  This is also how many companies end up with your zip+4 address even though you may have only provided your zip code initially.  Sometimes what may make sense when it comes to normalization doesn't make sense in the database itself.

     

  • roger.plowman wrote:

    pozzolan wrote:

    Hi,

    I noticed the author used an ID column of IDENTITY as the PK in every table. This appears to be a best practice based on what i've  read online. Can anyone point out when one would not use this a PK in their table? In other words, is there an exception to this rule?

    Thanks for posting the article; I found it informative.

    I've always found identities useful as PKs, but there are cases where you wouldn't want to do that, although it's rare. One example might be if you need large detail record counts for a master record, you might want to group the details by the master record's PK, that way you could take advantage of having the details physically contiguous for disk performance reasons.

    In my opinion, cases where you wouldn't want to use an identity for PKs are pretty rare, though.

    This is one of those controversial subjects that can spark long threads, btw. 🙂 I'm sure somebody would just love to point out all the edge cases where a non-identity is a better PK.

    You can use it as a primary key or possibly as an alternate key.  Although many people may use an identity value for the primary key, there may also be times where you may want to use a natural key as the primary key.  Yes, this topic can bring out a "holy war" between using a synthetic key or a natural key.  It really comes down to a single answer: It depends.

     

  • Lynn Pettis wrote:

    ...Sometimes what may make sense when it comes to normalization doesn't make sense in the database itself. 

    I can't agree more.  While it is possible to have an additional lookup table for determining the possible list of cities for a given postal code such as Roger's example, The U.S. post office would prefer the zip+4 format in addresses so the final cleansed value may be different than the initial 5 digit value the data entry person keys in.

  • Thank you, everyone, for the discussion. I have always involved in multiple debates and discussions with my clients regarding how to store addresses and normalize the tables.

    The answer to this is - "There is no hard and fast rule, it actually depends on how your application architecture is set up and how does it communicate with other 3rd party apps." Based on this, you need to decide what option will be best for your design.

  • pozzolan wrote:

    Hi,

    I noticed the author used an ID column of IDENTITY as the PK in every table. This appears to be a best practice based on what i've  read online. Can anyone point out when one would not use this a PK in their table? In other words, is there an exception to this rule?

    Thanks for posting the article; I found it informative.

    For an example where I personally would not use an identity primary key, imagine building a basic sports database.

    First we'll create a Player table.  An identity primary key is perfect here, since it will likely participate in many foreign key relationships as well and there's no obvious natural key.  Similarly, you may have a Schedule table, where each game is listed, also with an identity GameId primary key that's frequently used.  The Schedule might have a natural composite key you could use, such as GameDateTime + HomeTeam, but since GameDateTime can change, this isn't a great PK choice.  (Also, if you go deep enough into the weeds, it is possible for the same team to play two games at the same time in split-squad exhibition matches.  There's always another edge case in sports!)

    But now, we want to track performance of the players.  We could create a Game_Stat table which includes the statistics for each player in each game.  A natural key here would be a composite of PlayerId + GameId, and I would choose this for my primary key.  You could create a GameStatId identity column for your primary key (and I'm sure many people would), but the reason I wouldn't is because I would never use it.  When querying statistics from that table, I would typically expect to be querying for specific PlayerId's or GameId's, and would be sure to index both of those columns.  And one of them would likely make a good clustered index as well so that the data is organized in an order useful for at least some of my queries.  I would never refer to the data by the GameStatId, so why add an extra 4 bytes to every row if it is never used?  It won't be used in queries or foreign keys, it doesn't make a good clustered index for my use cases, and rather than preventing duplicate data it might increase the chances of duplicate data if you forget to make a unique index/constraint on PlayerId+GameId.

    All that being said, now imagine someone says "Hey, we want the ability to note any records or significant milestones reached with a player's performance in a specific game."  There are lots of ways this might be accomplished, but one possibility would be to create a new table with a foreign key relationship to the Game_Stat table.  All of a sudden, it might make sense again to have an identity key in that Game_Stat table to make the foreign key relationship a bit simpler.  So in conclusion, as with most things in SQL, it depends and YMMV.

  • Even though you could just hold the Zip code (and house number) in ther UK on the customer record that then means that every time you would need to use the address you would have to look it up.

    This would be labourious and prone to issues, in the UK certainly new postcodes are constantly being created and existing properties can have new postcodes assigned to then.

    At the time Circa 2000 we used to new a new address file update 4 times a year!

    Besides, not all houses have a number, many properties have names only and please do not even look at non residential proerties.

    I worked on a project many years age trying to match electricity addresses to Gas addresses. Residential was relatively easy but commercial was an absolute pain. We tried using some third party address software. You could pass in an address in loose format and it would try and resolve it so an address in it's database and return a FULL and property set of address fields.

    Try getting it to search for "bradford metropolitan district council, building 21", just think how many different wasy to spell, misspell and abbreviate that!

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply