Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

A Normalization Primer

By Brian Kelley,

A Normalization Primer

For most DBAs, normalization is an understood concept, a bread and butter bit of knowledge.  However, it is not at all unusual to review a database design by a development group for an OLTP (OnLine Transaction Processing) environment and find that the schema chosen is anything but properly normalized.  In some cases we'll see denormalized structures that will indeed make sense from a performance standpoint, however, these are few and far between.  Usually when we see a denormalized structure, it wasn't done on purpose.

Then it's time to go back to the development group and start talking with some of the personnel about normalizing the schema for integrity and performance reasons.  In some cases we'll get developers that know exactly what we're talking about and after a short meeting, everyone will be on the same page.  But then there are those times when as DBAs we'll say "Let's talk about normalization," and we find that while we're dealing with decent developers, they've not learned this foundation for good database design.  A lot of the time is spent working through normalization basics in order to get this foundation down.  Without the foundation, any improvements we suggest, even if met with acceptance, won't be met with an understanding of why.  As a result, the next time around the developers will commit the same mistakes, meaning we'll go through the process all over again.  And so we come to the purpose of this article: a short primer on normalization concepts to ensure that necessary foundation for good database design. 

What is Normalization and Why Do We Need It?

Normalization solves several issues for us.  First and foremost, when we normalize a database, we attempt to do the following:

  • Eliminate Unnecessary Redundancy of Data
  • Preserve Data Integrity
  • Protect from Data Loss

Let's look at a fictional case study which can demonstrate all of these things.  I and a couple of colleagues have formed the DBA Sports Talent Agency, a newly christened company representing pro athletes.  We've managed to accumulate a small stable of top talent.  Chad, one of the agents, has put together an initial database table (primary keys are underlined throughout the article) with all of our talent and some basic information about each one: 

Table 1 (ClientList)

Client Sport Team Salary Agent
Smith, John Football Tennessee Titans 2500000 Kelley, Brian
Brown, Reggie Baseball Kansas City Royals 750000 Silva, Chad
Johnson, Kyle Football Arizona Cardinals 500000 Kelley, Brian
Scott, Richard Hockey Dallas Stars 350000 Ruelas, Anthony
Ferguson, Martin Baseball New York Mets 125000 Silva, Chad
Elsworth, Robert Football Tennessee Titans 450000 Kelley, Brian

I need to make a clarifying point on redundancy and I'll do it with this example:  If we take a quick look at teams, we see that the Tennessee Titans are listed twice.  This is a text field, and it could have easily have gotten mis-keyed.  I know quite a few people who were voting for the Tennessee Tuxedos (after a cartoon penguin) instead of the Titans and a moment's flashback could have caused a serious issue.  After all, if we were to look at all actual names of teams out there, the Tennessee Tuxedos is not a valid name. As a result, a typo could cause a domain integrity issue.  Large repeated text fields can be a big headache (although a single occurrence could result in a domain violation).  But while this repetition can represent a problem, this isn't the redundancy we're talking about.

Let's say that our client, Mr. Reggie Brown, was a top notch running back coming out of college.  Spurning the NFL, he signs a baseball contract.  A little into his first year, he decides he wants to play both sports, believing he can excel at both.  Our sports agency sees the money and we quickly get him signed with the Oakland Raiders, and he'll pick up a million and a quarter for half a season.  But based on Chad's original design, we have an issue.  Let's assume that the Client is the primary key for the table.  We can't insert another Client entry with Brown, Reggie in it.  We've got one of two options if we stick with this single table design:

  1. Double up within the fields (Table 2)
  2. Create additional fields (Table 3)

Here's an example of the first case.  Notice that we've separated the information in the fields using a semi-colon.  Ok, this doesn't seem so bad, right?  As we start looking to manipulate the data we find out differently.  First and foremost, querying for all we have signed who play baseball isn't straightforward.  I can't simply issue a SELECT Client FROM ClientList WHERE Sport = 'Baseball' any longer. 

Table 2 (ClientList version 2)

Client Sport Team Salary Agent
Smith, John Football Tennessee Titans 2500000 Kelley, Brian
Brown, Reggie Baseball; Football Kansas City Royals; Oakland Raiders 750000; 1250000 Silva, Chad; Kelley, Brian
Johnson, Kyle Football Arizona Cardinals 500000 Kelley, Brian
Scott, Richard Hockey Dallas Stars 350000 Ruelas, Anthony
Ferguson, Martin Baseball New York Mets 125000 Silva, Chad
Elsworth, Robert Football Tennessee Titans 450000 Kelley, Brian

A second issue is I can no longer calculate total salary numbers easily either.  If I want to calculate Reggie's, I have to parse the field to get the two salary amounts.  Chad realizes these two problems and redesigns the table again.  Chad now tries the second option, which ends up looking like this:

Table 3 (ClientList version 3)

Client Sport Team1 Salary1 Agent1 Sport2 Team2 Salary2 Agent2
Smith, John Football Tennessee Titans 2500000 Kelley, Brian        
Brown, Reggie Baseball Kansas City Royals 750000 Silva, Chad Football Oakland Raiders 12500000 Kelley, Brian
Johnson, Kyle Football Arizona Cardinals 500000 Kelley, Brian        
Scott, Richard Hockey Dallas Stars 350000 Ruelas, Anthony        
Ferguson, Martin Baseball New York Mets 125000 Silva, Chad        
Elsworth, Robert Football Tennessee Titans 450000 Kelley, Brian        

This works in that we're not having to parse fields, but now we're looking at a second set of columns just for one athlete.  There's a lot of wasted space.  We can now find out who plays baseball, but we have to do an extra comparison (we have to check Sport2 as well).  This is the redundancy we're worried about.

Another point I made is that good normalization prevents data loss.  Let's say Richard Scott gets a career-ending injury.  He and the Stars (with our help) agree to a lump-sum payment but that takes him off our books.  Notice that Richard is the only athlete whom Anthony manages.  If we drop Richard off the list, we drop any reference to Anthony as well.  This raises a crucial point: without a client, an agent isn't listed.  As a result, if we go and hire Anthony's brother, Henry, to manage our basketball stable, he doesn't show up on the table until he signs the first client.  With this current design, we can run into the case where we're not able to represent everyone on our staff.  Our denormalized database schema, though it is only one table, is causing us some serious issues.  Now let's look at the first three normal forms and what that does for us.

First Normal Form (1NF)

To bring a database schema into first normal form, we ensure that there are no arrays, no redundant columns, no repeated structures representing the exact same data.  The first two attempts Chad made at updating our ClientList table violates first normal form.  ClientList version 2 has a repeating group in several of the fields for Mr. Brown. We've talked about how difficult it would be to query our baseball athletes in this table.  Consider also the difficulty we face preserving integrity.  We can't do simple comparisons against a field, but are faced with parsing the fields to verify integrity.  This was the other point of where normalization can help us.  By ensuring fields are atomic, we simplify the procedures for integrity checks within our database.

ClientList version 3 has a duplicate set of columns which represent the same type of information as the first set. To be first normal form, we need to eliminate this redundancy.  One way to do this is to separate the client specific information from the contract information.  Chad decides to form two tables.  The first is Client, and he adds a column named ClientID.  It's a simple integer assigned for each client.  This'll be a surrogate key instead of making the text field Client (which he renames ClientName) the primary key.  It's easier for us to use, and there's always the possibility of getting two John Smiths, meaning ClientName in its current form can't be guaranteed to be unique.

Client Table

ClientID ClientName
1 Smith, John
2 Brown, Reggie
3 Johnson, Kyle
4 Scott, Richard
5 Ferguson, Martin
6 Elsworth, Robert

Contract Table

ClientID Team Sport Salary Agent
1 Tennessee Titans Football 2500000 Kelley, Brian
2 Kansas City Royals Baseball 750000 Silva, Chad
2 Oakland Raiders Football 1250000 Kelley, Brian
3 Arizona Cardinals Football 500000 Kelley, Brian
4 Dallas Stars Hockey 350000 Ruelas, Anthony
5 New York Mets Baseball 125000 Silva, Chad
6 Tennessee Titans Football 450000 Kelley, Brian

Chad has handled the repetition of field information by simply creating a row for Reggie's second sport in the Contract table.  The primary key is ClientID and Team (because of the complex way salaries work, it's entirely possible for a player to be playing for one team, yet still be owed salary by another team, so we need to key on team and not sport), and we see that this does make all the rows unique. Our database schema is now in first normal form, but we still have some issues.

Second Normal Form (2NF)

Our database design is in second normal form if every attribute describes the entire primary key.  Now, a given attribute could describe a combination of the primary key plus other fields, but the whole primary key is the key (pun intended) to second normal form.  It's obvious that Sport is not dependent on ClientID.  It describes Team.  If we look, we see that each Team in our list has only one agent.  Each client does not.  According to our business rules, agents describe teams (actually sports, but we'll get into that in 3rd normal form), not clients.  Along those lines, Reggie has two agents.  We can see that each team has only one agent (as demonstrated by the Titans), but clients can have more.  Chad takes this into consideration and separates Contracts further: 

Contract Table

ClientID Team Salary
1 Tennessee Titans 2500000
2 Kansas City Royals 750000
2 Oakland Raiders 1250000
3 Arizona Cardinals 500000
4 Dallas Stars 350000
5 New York Mets 125000
6 Tennessee Titans 450000

Team Table

Team Sport Agent
Tennessee Titans Football Kelley, Brian
Kansas City Royals Baseball Silva, Chad
Oakland Raiders Football Kelley, Brian
Arizona Cardinals Football Kelley, Brian
Dallas Stars Hockey Ruelas, Anthony
New York Mets Baseball Silva, Chad

Third Normal Form (3NF):

Third normal form states that all attributes describe only the primary key (no non-key fields). We just discussed above that while Agent describes Team, it also describes Sport.  Each agent concentrates on one sport.  So we should split the Team table even further, in order to bring it to third normal form.  By the way, this is the perfect opportunity to add Henry to represent our basketball interests.

Team Table

Team Sport
Tennessee Titans Football
Kansas City Royals Baseball
Oakland Raiders Football
Arizona Cardinals Football
Dallas Stars Hockey
New York Mets Baseball

Sport Table

Sport Agent
Football Kelley, Brian
Baseball Silva, Chad
Hockey Ruelas, Anthony
Basketball Ruelas, Henry

Finally our database schema is in third normal form.  First we've eliminated the repeated fields and the repeated values in the fields.  That brought us to first normal form.  Then we pulled out any fields that didn't describe the entire primary key (attributes that might have described a combination of the primary key and additional fields are valid because it's the entire primary key that is the requirement).  That got us to second normal form.  Finally, we pulled out anything that depended on the primary key + something else.  That got us to third normal form.  If we want to see the fruits of our labor, here's the final  normalized design:

Client Table

ClientID ClientName
1 Smith, John
2 Brown, Reggie
3 Johnson, Kyle
4 Scott, Richard
5 Ferguson, Martin
6 Elsworth, Robert

Contract Table

ClientID Team Salary
1 Tennessee Titans 2500000
2 Kansas City Royals 750000
2 Oakland Raiders 1250000
3 Arizona Cardinals 500000
4 Dallas Stars 350000
5 New York Mets 125000
6 Tennessee Titans 450000

Team Table

Team Sport
Tennessee Titans Football
Kansas City Royals Baseball
Oakland Raiders Football
Arizona Cardinals Football
Dallas Stars Hockey
New York Mets Baseball

Sport Table

Sport Agent
Football Kelley, Brian
Baseball Silva, Chad
Hockey Ruelas, Anthony
Basketball Ruelas, Henry

Final Thoughts

Normalizing databases is not very difficult, especially when compared with potential issues in the future.  It assists us with data integrity, the elimination of redundancy, and the prevention of data loss.  We've covered the first three normal forms here, because usually when a database is described as normalized it is at least 3NF.  Bringing a database to third normal form may take a little work, but it'll save us some headaches in the end. It is important that any database developer or any developer writing code against a database understand the first three normal forms.  

For performance reasons (too many joins, for instance) we may end up denormalizing data.  However, that is a decision that has to be weighed carefully.  For design, it's best to bring a database to at least 3NF and then denormalize from there (with great caution).

Also, one thing I should point out with respect to teams is that I'm making the assumption that location + name makes the team unique, and I've left them as one field.  A better design would have been to split location and team name.  After all, the Los Angeles Lakers were once the Minneapolis Lakers.  Along those lines, it is entirely possible for two teams with the exact same location and name to exist for two different sports.  For instance: St. Louis Cardinals and New York Giants.  At this point in time, I can't think of any cases where that is true.  So to keep things simple, I left Team as a combination of Location and Name, and left it as a satisfactory primary key.  If I had been really starting a sports agency, I would make split up Team into location and name and I was also add the sport to ensure uniqueness.  But for demonstration purposes, I chose to keep it simple.

Finally, as I stated in the introduction, normalization is important in OLTP environments.  However, when dealing with OLAP (OnLine Analytical Processing) and warehousing solutions, there are other structures that may be more efficient for the tasks that will be performed (star schema, for instance).  Normalization is important in an OLTP environment where changes are occurring on a frequent basis.  A warehousing solution usually involves static data that is denormalized for speed in reporting. It's important to remember what environment we're speaking of with respect to applying normalization.  A simple case of the right tool for the right job.  

Total article views: 17135 | Views in the last 30 days: 8
 
Related Articles
ARTICLE

First Normal Form

Learn the basics of first normal form and what that means to a database designer from Tom Thomson.

FORUM

Normalization

Need Some example to do normalized database

BLOG

First Normal form (INF)

First Normal Form (INF):- A table is said to be in a First Normal Form (1NF)if it satisfy the follow...

BLOG

Normalizing Your Database

If you’ve been working with databases for any length of time, you have heard the term normalization....

FORUM

Unuseable Field Name in Database

Database contains a field with a space eg. First Name

Tags
basics    
database design    
t-sql    
 
Contribute

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

Email address:  

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.

Join us!

Steve Jones
Editor, SQLServerCentral.com

Already a member? Jump in:

Email address:   Password:   Remember me: Forgotten your password?
Steve Jones