A Normalization Primer

, 2003-01-13

A Normalization Primer

For most DBAs, normalization is an understood concept, a bread and butter bit

of knowledge.  However, it is not at all unusual to review a database

design by a development group for an OLTP (OnLine Transaction Processing)

environment and find that the schema chosen is anything but

properly normalized.  In some cases we'll see denormalized structures that

will indeed make sense from a performance standpoint, however, these are few and

far between.  Usually when we see a denormalized structure, it wasn't done

on purpose.

Then it's time to go back to the development group and start talking with

some of the personnel about normalizing the schema for integrity and performance

reasons.  In some cases we'll get developers that know exactly what we're

talking about and after a short meeting, everyone will be on the same page. 

But then there are those times when as DBAs we'll say "Let's talk about

normalization," and we find that while we're dealing with decent

developers, they've not learned this foundation for good database design.  A lot of the time

is spent working through normalization basics in order to get this foundation down.  Without

the foundation, any improvements we suggest,

even if met with acceptance, won't be met with an understanding of why.  As

a result, the next time around the developers will commit the same mistakes,

meaning we'll go through the process all over again.  And so we come to the

purpose of this article: a short primer on normalization concepts to ensure that

necessary foundation for good database design. 

What is Normalization and Why Do We Need It?

Normalization solves several issues for us.  First and foremost, when we

normalize a database, we attempt to do the following:

  • Eliminate Unnecessary Redundancy of Data
  • Preserve Data Integrity
  • Protect from Data Loss

Let's look at a fictional case study which can demonstrate all of these

things.  I and a couple of colleagues have formed the DBA Sports Talent

Agency, a newly christened company representing pro athletes.  We've managed

to accumulate a small stable of top talent.  Chad, one of the agents, has

put together an initial database table (primary keys are underlined throughout

the article) with all of our talent and some basic

information about each one: 

Table 1 (ClientList)

Client Sport Team Salary Agent
Smith, John Football Tennessee Titans 2500000 Kelley, Brian
Brown, Reggie Baseball Kansas City Royals 750000 Silva, Chad
Johnson, Kyle Football Arizona Cardinals 500000 Kelley, Brian
Scott, Richard Hockey Dallas Stars 350000 Ruelas, Anthony
Ferguson, Martin Baseball New York Mets 125000 Silva, Chad
Elsworth, Robert Football Tennessee Titans 450000 Kelley, Brian

I need to make a clarifying point on redundancy and I'll do it with this

example:  If we take a quick look at teams, we see that the Tennessee Titans are listed

twice.  This is a text field, and it could have easily have gotten mis-keyed. 

I know quite a few people who were voting for the Tennessee Tuxedos (after a

cartoon penguin) instead of the Titans and a moment's flashback could have

caused a serious issue.  After all, if we were to look at all actual names

of teams out there, the Tennessee Tuxedos is not a valid name. As a result, a

typo could cause a domain integrity issue.  Large repeated text fields can

be a big headache (although a single occurrence could result in a domain

violation).  But while this repetition can represent a problem, this isn't the redundancy we're talking about.

Let's say that our client, Mr. Reggie Brown, was a top notch running back

coming out of college.  Spurning the NFL, he signs a baseball

contract.  A little into his first year, he decides he wants to play both

sports, believing he can excel at both.  Our sports agency sees the money

and we quickly get him signed with the Oakland Raiders, and he'll pick up a

million and a quarter for half a season.  But based on

Chad's original design, we have an issue.  Let's assume that the Client is

the primary key for the table.  We can't insert another Client entry

with Brown, Reggie in it.  We've got one of two options if we stick with

this single table design:

  1. Double up within the fields (Table 2)
  2. Create additional fields (Table 3)

Here's an example of the first case.  Notice that we've separated the

information in the fields using a semi-colon.  Ok, this doesn't seem so

bad, right?  As we start looking to manipulate the data we find out

differently.  First and foremost, querying for all we

have signed who play baseball isn't straightforward.  I can't simply issue

a SELECT Client FROM ClientList WHERE Sport = 'Baseball' any

longer. 

Table 2 (ClientList version 2)

Client Sport Team Salary Agent
Smith, John Football Tennessee Titans 2500000 Kelley, Brian
Brown, Reggie Baseball; Football Kansas City Royals; Oakland Raiders 750000; 1250000 Silva, Chad; Kelley, Brian
Johnson, Kyle Football Arizona Cardinals 500000 Kelley, Brian
Scott, Richard Hockey Dallas Stars 350000 Ruelas, Anthony
Ferguson, Martin Baseball New York Mets 125000 Silva, Chad
Elsworth, Robert Football Tennessee Titans 450000 Kelley, Brian

A second issue is I can no longer calculate total salary numbers

easily either.  If I want to calculate Reggie's, I have to parse the field

to get the two salary amounts.  Chad realizes these two problems and redesigns

the table again.  Chad now tries the second option,

which ends up looking like this:

Table 3 (ClientList version 3)

Client Sport Team1 Salary1 Agent1 Sport2 Team2 Salary2 Agent2
Smith, John Football Tennessee Titans 2500000 Kelley, Brian        
Brown, Reggie Baseball Kansas City Royals 750000 Silva, Chad Football Oakland Raiders 12500000 Kelley, Brian
Johnson, Kyle Football Arizona Cardinals 500000 Kelley, Brian        
Scott, Richard Hockey Dallas Stars 350000 Ruelas, Anthony        
Ferguson, Martin Baseball New York Mets 125000 Silva, Chad        
Elsworth, Robert Football Tennessee Titans 450000 Kelley, Brian        

This works in that we're not having to parse fields, but now

we're looking at a second set of columns just for one athlete. 

There's a lot of wasted space.  We can

now find out who plays baseball, but we have to do an extra comparison (we have to

check Sport2 as well).  This is the redundancy we're worried about.

Another point I made is that good normalization prevents data

loss.  Let's say Richard Scott gets a career-ending injury.  He and

the Stars (with our help) agree to a lump-sum payment but that takes him off our books.  Notice that Richard is the only athlete

whom Anthony manages.  If we drop Richard off the list, we drop any

reference to Anthony as well.  This raises a crucial point: without a

client, an agent isn't listed.  As a result, if we go and hire Anthony's

brother, Henry, to manage our basketball stable, he doesn't show

up on the table until he signs the first client.  With this current design,

we can run into the case where we're not able to represent everyone on our

staff.  Our denormalized database schema, though it is only one table,

is causing us some serious issues.  Now let's look at the first three

normal forms and what that does for us.

First Normal Form (1NF)

To bring a database schema into first normal form, we ensure

that there are no arrays, no redundant columns, no repeated structures

representing the exact same data.  The first two attempts Chad made at

updating our ClientList table violates first normal form.  ClientList

version 2 has a repeating group in several of the fields for Mr. Brown. We've

talked about how difficult it would be to query our baseball athletes in this

table.  Consider also the difficulty we face preserving integrity.  We

can't do simple comparisons against a field, but are faced with parsing the

fields to verify integrity.  This was the other point of where

normalization can help us.  By ensuring fields are atomic, we simplify the

procedures for integrity checks within our database.

ClientList version 3 has a duplicate set of columns which represent the same

type of information as the first set. To be first normal form, we need to

eliminate this redundancy.  One way to do this is to separate the client

specific information from the contract information.  Chad decides to form two

tables.  The first is Client, and he adds a column named ClientID.  It's a simple integer assigned for each client. 

This'll be a surrogate key instead of making the text field Client (which he renames ClientName) the primary key.  It's easier for us to use, and

there's always the possibility of getting two John Smiths, meaning ClientName in

its current form can't be guaranteed to be unique.

Client Table

ClientID ClientName
1 Smith, John
2 Brown, Reggie
3 Johnson, Kyle
4 Scott, Richard
5 Ferguson, Martin
6 Elsworth, Robert

Contract Table

ClientID Team Sport Salary Agent
1 Tennessee Titans Football 2500000 Kelley, Brian
2 Kansas City Royals Baseball 750000 Silva, Chad
2 Oakland Raiders Football 1250000 Kelley, Brian
3 Arizona Cardinals Football 500000 Kelley, Brian
4 Dallas Stars Hockey 350000 Ruelas, Anthony
5 New York Mets Baseball 125000 Silva, Chad
6 Tennessee Titans Football 450000 Kelley, Brian

Chad has handled the repetition of field information by simply

creating a row for Reggie's second sport in the Contract table.  The

primary key is ClientID and Team (because of the complex way salaries work, it's

entirely possible for a player to be playing for one team, yet still be owed

salary by another team, so we need to key on team and not sport), and we see that this does make all the rows

unique. Our database schema is now in first normal form, but we still have some

issues.

Second Normal Form (2NF)

Our database design is in second normal form if every attribute describes the entire primary key.  Now,

a given attribute could describe a combination of the primary

key plus other fields, but the whole primary key is the key (pun intended) to

second normal form.  It's

obvious that Sport is not dependent on ClientID.  It describes Team. 

If we look, we see that each Team in our list has only one agent.  Each

client does not.  According to our business rules, agents describe teams

(actually sports, but we'll get into that in 3rd normal form), not

clients.  Along those lines, Reggie has two agents.  We can see that

each team has only one agent (as demonstrated by the Titans), but clients can

have more.  Chad takes this into consideration and separates Contracts further: 

Contract Table

ClientID Team Salary
1 Tennessee Titans 2500000
2 Kansas City Royals 750000
2 Oakland Raiders 1250000
3 Arizona Cardinals 500000
4 Dallas Stars 350000
5 New York Mets 125000
6 Tennessee Titans 450000

Team Table

Team Sport Agent
Tennessee Titans Football Kelley, Brian
Kansas City Royals Baseball Silva, Chad
Oakland Raiders Football Kelley, Brian
Arizona Cardinals Football Kelley, Brian
Dallas Stars Hockey Ruelas, Anthony
New York Mets Baseball Silva, Chad

Third Normal Form (3NF):

Third normal form states that all attributes describe only the

primary key (no non-key fields). We just discussed above that while Agent

describes Team, it also describes Sport.  Each agent concentrates on one

sport.  So we should split the Team table even further, in order to bring

it to third normal form.  By the way, this is the perfect opportunity to

add Henry to represent our basketball interests.

Team Table

Team Sport
Tennessee Titans Football
Kansas City Royals Baseball
Oakland Raiders Football
Arizona Cardinals Football
Dallas Stars Hockey
New York Mets Baseball

Sport Table

Sport Agent
Football Kelley, Brian
Baseball Silva, Chad
Hockey Ruelas, Anthony
Basketball Ruelas, Henry

Finally our database schema is in third normal form.  First we've eliminated

the repeated fields and the repeated values in the fields.  That brought us

to first normal form.  Then we pulled out any fields that didn't describe

the entire primary key (attributes that might have described a combination of

the primary key and additional fields are valid because it's the entire primary

key that is the requirement).  That got us to second normal form.  Finally,

we pulled out anything that depended on the primary key + something else. 

That got us to third normal form.  If we want to see the fruits of our

labor, here's the final  normalized design:

Client Table

ClientID ClientName
1 Smith, John
2 Brown, Reggie
3 Johnson, Kyle
4 Scott, Richard
5 Ferguson, Martin
6 Elsworth, Robert

Contract Table

ClientID Team Salary
1 Tennessee Titans 2500000
2 Kansas City Royals 750000
2 Oakland Raiders 1250000
3 Arizona Cardinals 500000
4 Dallas Stars 350000
5 New York Mets 125000
6 Tennessee Titans 450000

Team Table

Team Sport
Tennessee Titans Football
Kansas City Royals Baseball
Oakland Raiders Football
Arizona Cardinals Football
Dallas Stars Hockey
New York Mets Baseball

Sport Table

Sport Agent
Football Kelley, Brian
Baseball Silva, Chad
Hockey Ruelas, Anthony
Basketball Ruelas, Henry

Final Thoughts

Normalizing databases is not very difficult, especially when compared with

potential issues in the future.  It assists us with data integrity, the elimination

of redundancy, and the prevention of data loss.  We've covered the first

three normal forms here, because usually when a database is described as

normalized it is at least 3NF.  Bringing a database to third normal form

may take a little work, but it'll save us some headaches in the end. It is

important that any database developer or any developer writing code against a

database understand the first three normal forms.  

For performance

reasons (too many joins, for instance) we may end up denormalizing data. 

However, that is a decision that has to be weighed carefully.  For design,

it's best to bring a database to at least 3NF and then denormalize from there (with

great caution).

Also, one thing I should point out with respect to teams is that I'm making

the assumption that location + name makes the team unique, and I've left them as

one field.  A better design would have been to split location and team name.  After all, the Los Angeles Lakers were once the Minneapolis

Lakers.  Along those lines, it is entirely possible for two teams with the

exact same location and name to exist for two different sports.  For

instance: St. Louis Cardinals and New York Giants.  At this point in time,

I can't think of any cases where that is true.  So to keep things

simple, I left Team as a combination of Location and Name, and left it as a

satisfactory primary key.  If I had been really starting a sports agency, I would make

split up Team into location and name and I was also add the sport to ensure

uniqueness. 

But for demonstration purposes, I chose to keep it simple.

Finally, as I stated in the introduction, normalization is important in OLTP environments.  However, when dealing with OLAP (OnLine

Analytical Processing) and warehousing solutions, there are other structures

that may be more efficient for the tasks that will be performed (star schema,

for instance).  Normalization is important in an OLTP environment where

changes are occurring on a frequent basis.  A warehousing solution usually

involves static data that is denormalized for speed in reporting. It's important

to remember what environment we're speaking of with respect to applying

normalization.  A simple case of the right tool for the right

job.  

Rate

5 (3)

Share

Share

Rate

5 (3)

Related content