Discussion on Normalization and Database Design Concepts

,

Introduction

Relational database design helps to create a set of relations and tables, which are connected with each other through some defined relationships. The main purpose of relational database design is to preserve the information and minimize data redundancy.

We need to follow some design guidelines for relational schemas to maintain the quality of the database. Also, a formal methodology is required for verifying and improving the quality of the relation schemas.

Let us first have a discussion on Data Redundancy and Update Anomalies.

Data Redundancy

When the same set of data is present multiple times in the database, then it is known as data redundancy. Redundant data causes wastage of storage and maintenance overhead. Redundant data has an impact on performance. There can be problem of update anomalies due to data redundancy.

Let us consider the table, named EmpDept. Here multiple records are present in the table for the same department, NS or FS. So, this is redundant data. If we needed to edit one of these values, we'd need to ensure it was changed in multiple rows.

IF OBJECT_ID('EmpDept','U') IS NOT NULL
DROP TABLE dbo.EmpDept

CREATE TABLE dbo.EmpDept
(
EmpId INT PRIMARY KEY,
EmpName VARCHAR(100) NOT NULL,
EmpAddr VARCHAR(100) NOT NULL,
DeptName VARCHAR(10) NOT NULL,
DeptDesc VARCHAR(100)
)
INSERT INTO dbo.EmpDept (EmpId,EmpName,EmpAddr,DeptName,DeptDesc)
VALUES 
(101,'e1','e1addr1','NS','Network Security'),
(102,'e2','e2addr2','NS','Network Security'),
(103,'e3','e3addr3','FS','Financial Services'),
(104,'e4','e4addr4','FS','Financial Services'),
(105,'e5','e5addr5','HR','Human Resources')
SELECT EmpId,EmpName,EmpAddr,DeptName,DeptDesc
FROM dbo.EmpDept
/*
EmpIdEmpNameEmpAddrDeptNameDeptDesc
101e1e1addr1NSNetwork Security
102e2e2addr2NSNetwork Security
103e3e3addr3FSFinancial Services
104e4e4addr4FSFinancial Services
105e5e5addr5HRHuman Resources
*/

Update Anomalies

When the incorrect database design and redundant data cause issues in the insertion, modification or deletion of records in the table, then these are known as update anomalies. These can be of three types. We will discuss them with the EmpDept table data.

Insertion Anomalies

If a new employee joins the organization and the employee is not assigned to any department yet, then the employee details cannot be inserted in the table when DeptId is a mandatory column in the table. Also, if there is a department without any employee, it is not possible to enter the details in the table, as EmpId cannot be NULL.

When both the employee and department details are available, a user must insert exactly the same values for DeptName and DeptDesc as already available in the table for these same departments to maintain data consistency. A data entry error here would cause an issue.

All the above mentioned issues are known as insertion anomalies.

Modification Anomalies

There are two records for the department named FS. If I need to change the description for this department, both of these records should be modified. Otherwise, data will be inconsistent in the table. This is a modification anomaly.

Deletion Anomalies

There is only one employee in the HR department. If I remove this employee from the table, the department details for HR department will also be deleted. This is deletion anomaly.

Better Database Design

I need to design the tables in such a way that these anomalies can be avoided. I will decompose the EmpDept table in two tables: Emp and Dept. I have the DeptId column in Dept table as primary key and foreign key in Emp table. In the current design, each department has only one occurance in the Dept table. So, data redundancy is reduced here.

Now, department information can be added in the database without employee details available. One department has only one entry in the Dept table. So, there is no need to modify multiple records for updating a particular department's information. Also, if the only employee of a department is deleted, still the department information is available in the database as employee and department details are now maintained in two different tables. So, update anomalies are also minimized with this modified design of the database tables.

Also, I can join the two tables and retrieve information from both Emp and Dept tables together.

IF OBJECT_ID('Dept','U') IS NOT NULL
DROP TABLE dbo.Dept

CREATE TABLE dbo.Dept
(
DeptId INT PRIMARY KEY,
DeptName VARCHAR(10) NOT NULL,
DeptDesc VARCHAR(100)
)
INSERT INTO dbo.Dept (DeptId,DeptName,DeptDesc)
VALUES 
(201,'NS','Network Security'),
(202,'FS','Financial Services'),
(203,'HR','Human Resources')
SELECT DeptId,DeptName,DeptDesc
FROM dbo.Dept
/*
DeptIdDeptNameDeptDesc
201NSNetwork Security
202FSFinancial Services
203HRHuman Resources
*/IF OBJECT_ID('Emp','U') IS NOT NULL
DROP TABLE dbo.Emp

CREATE TABLE dbo.Emp
(
EmpId INT PRIMARY KEY,
EmpName VARCHAR(100) NOT NULL,
EmpAddr VARCHAR(100) NOT NULL,
DeptId INT NOT NULL,
CONSTRAINT FK_Emp_DeptId FOREIGN KEY (DeptId) REFERENCES dbo.Dept(DeptId) 
)
INSERT INTO dbo.Emp (EmpId,EmpName,EmpAddr,DeptId)
VALUES 
(101,'e1','e1addr1',201),
(102,'e2','e2addr2',201),
(103,'e3','e3addr3',202),
(104,'e4','e4addr4',202),
(105,'e5','e5addr5',203)
SELECT EmpId,EmpName,EmpAddr,DeptId
FROM dbo.Emp
--Retrieve records by joining both the tables
SELECT E.EmpId,E.EmpName,E.EmpAddr,D.DeptName,D.DeptDesc
FROM dbo.Emp E
INNER JOIN dbo.Dept D ON E.DeptId = D.DeptId
/*
EmpIdEmpNameEmpAddrDeptNameDeptDesc
101e1e1addr1NSNetwork Security
102e2e2addr2NSNetwork Security
103e3e3addr3FSFinancial Services
104e4e4addr4FSFinancial Services
105e5e5addr5HRHuman Resources
*/

Next, we will discuss about some Database Design concepts like Functional Dependency and Keys.

Functional Dependency

Let us consider the table named Cuboid. It has the following attributes: length, breadth, height, area, volume.

IF OBJECT_ID('Cuboid','U') IS NOT NULL
DROP TABLE dbo.Cuboid

CREATE TABLE dbo.Cuboid
(
CuboidId INT PRIMARY KEY,
Clength INT NOT NULL,
Cbreadth INT NOT NULL,
Cheight INT NOT NULL,
Carea INT,
Cvolume INT
)

Now, let us consider the set of attributes (length, breadth, height) as X and (area) as Y. Area of any surface of the cuboid depends on any two of the attribute set (length, breadth, height). When the values of this attribute set is same for two records, the area for two records will also be the same. This relationship between the two sets of attributes is known as functional dependency, So, the attribute set (area) or (volume) is functionally dependent on (length, breadth, height).

It is written as X --> Y.

Full Functional Dependency

Let us assume that in the functional dependency X --> Y, X is (length, breadth, height) and Y is (volume).

Volume is dependent on length, breadth and height. Now, if any of the attribute is removed from X, Y (volume) cannot be calculated and the functional dependency will not hold anymore.

So, X ((length, breadth, height)  --> Y (volume) is a full functional dependency.

Partial Functional Dependency

Let us assume that in the functional dependency X --> Y, X is (length, breadth, height) and Y is (area).

Area is dependent on any two of length, breadth and height. Now, if any of the attribute is removed from X, Y (area) still can be calculated and the functional dependency still holds.

So, X ((length, breadth, height)  --> Y (area) is a partial functional dependency.

Super Key

It specifies a uniqueness constraint that no two records in a table can be identical. One or more columns in the table can be defined as super key if the columns are able to uniquely identify the records.

IF OBJECT_ID('Student','U') IS NOT NULL
DROP TABLE dbo.Student

CREATE TABLE dbo.Student
(
StudentId INT,
Sname VARCHAR(50),
Saddress VARCHAR(100),
Semail VARCHAR(100),
SadharId INT,
SpanId CHAR(10)
)
INSERT INTO dbo.Student (StudentId,Sname,Saddress,Semail)
VALUES 
(301,'s1','s1addr1','s1@samplemail.com'),
(302,'s2','s2addr1','s2@samplemail.com'),
(303,'s2','s3addr3','s3@samplemail.com')
SELECT StudentId,Sname,Saddress,Semail
FROM dbo.Student
/*
--StudentId values are unique
301
302
303
--(StudentId,Sname) combination values are unique
301s1
302s2
303s2
--(StudentId,Sname,Saddress) combination values are unique
301s1s1addr1
302s2s2addr1
303s2s3addr3
--(StudentId,Saddress) combination values are unique
301s1addr1
302s2addr1
303s3addr3
*/

In the Student table, StudentId can uniquely identify a record. So, StudentId is a superkey. The combination of any column with StudentId is also a superkey, as the combination can also uniquely identify a record.

A few possible super keys in the Student table are: (StudentId, Sname), (StudentId, Saddress), (StudentId, Sname, Saddress), (StudentId, Sname, Saddress, Semail), etc.

It is a superkey where after removing any column from the combination of columns, the remaining column set is not a superkey anymore. In the Student table, let us assume that (StudentId, Sname) combination can uniquely identify a record. Now, let us consider the possible super keys in the table: (StudentId, Sname), (StudentId, Sname, Saddress), (StudentId, Sname, Saddress, Semail) etc. If I remove address from the super key (StudentId, Sname, Saddress), it becomes (StudentId, Sname), which can uniquely identify the record. So, the super key (StudentId, Sname, Saddress) is not a key.

But, if I remove name from the superkey (StudentId, Sname), it becomes (StudentId), which can't uniquely identify a record in the table. So, the super key (StudentId, Sname) is a key.

Candidate Key

The Student table may have multiple keys. Each of the key is known as candidate key. Let us consider that any of the columns StudentId, SadharId, SpanId, Semail can uniquely identify a record in the Student table. So, all of them are candidate keys of the Student table.

Primary Key

We define one of the candidate keys with a single column, or minimum number of columns, as the primary key.

Alternate Key

Except for the primary key, the rest of the candidate keys are called alternate key. We may define them as unique keys in the table.

Prime and non-prime attributes

An attribute or column of a table is known as prime attribute if it is a part of any candidate key of the table. Otherwise, the attribute is known as a non-prime attribute.

Different Normal Forms and Normalization

There are some predefined normal forms. For each table, some conditions can be used on the table schema to verify whether the table satisfies a certain normal form. This process is known as Normalization.

Normalization proceeds in a top-down approach to evaluate each table against the criteria for the normal forms and the tables are decomposed as necessary to meet the requirements for the normal form. Table analysis is being done based on their functional dependencies and candidate keys. The main purpose of normalization is to reduce data redundancy and minimize the update anomalies.

The normalization process was first proposed by Codd. He started with three normal forms: first, second and third. Later, Boyce and Codd together proposed an enhanced version of the third normal form known as Boyce-Codd Normal Form (BCNF).

The normal form of a relation or table refers to the highest normal form condition that it fulfills. When, a table is decomposed to fulfill the criteria for a particular normal form, it needs to maintain two properties:

  • Nonadditive join property - Relations or tables created after decomposition should not generate wrong datasets when joined together.
  • Dependency preservation property - Each functional dependency is represented in one of the tables after decomposition.

First Normal Form (1NF)

The domain of an attribute must include only atomic (simple, indivisible) values and value of any attribute in a tuple must be a single value from the domain of that attribute.

Let us consider the Emp table column named EmpAddr. The EmpAddr column value should be a single value. It cannot contain multiple components like area, city, state. To maintain different components separately, three different columns should be created. Also, one employee may have multiple addresses. But, that is not possible to save in a single record. For multiple addresses of the employee, either multiple address columns should be created (EmpAddr1, EmpAddr2, and so on) or multiple records are to be added for a single employee.

So, both Emp and Dept tables are in First Normal Form (1NF).

Second Normal Form (2NF)

A relation schema or table is in second normal form (2NF) if every non-prime attribute is not partially dependent on any key of the table.

Now, let us consider the table named StudentProjects. We will check if the table is in Second Normal Form or not.

IF OBJECT_ID('StudentProjects','U') IS NOT NULL
DROP TABLE dbo.StudentProjects

CREATE TABLE dbo.StudentProjects
(
StudentId INT,
ProjectId INT,
HrsWorked INT,
Sname VARCHAR(50),
Pname VARCHAR(100),
DeptId INT, 
Dname VARCHAR(50),
)

(StudentId, ProjectId) is the candidate key of the table. Every column of this table is single-valued attribute and multiple values are not allowed in any of the columns. So, the table is in 1NF.

Now, we will check on the functional dependencies existing on the table.

  • FD1: StudentId, ProjectId --> HrsWorked
  • FD2: StudentId --> Sname
  • FD3: ProjectId --> Pname
  • FD4: StudentId, ProjectId --> DeptId
  • FD5: DeptId --> Dname

FD2 and FD3 violates 2NF condition. In FD2, Sname is dependent on StudentId i.e. Sname is partially dependent on the candidate key (StudentId, ProjectId) of the table. Also, in FD3, Pname is dependent on ProjectId i.e. Pname is partially dependent on the candidate key.

StudentProject table is now decomposed in three tables -- StudentProjectDetails, StudentDetails, ProjectDetails

--table 1
IF OBJECT_ID('StudentProjectDetails','U') IS NOT NULL
DROP TABLE dbo.StudentProjectDetails

CREATE TABLE dbo.StudentProjectDetails
(
StudentId INT,
ProjectId INT,
HrsWorked INT,
DeptId INT, 
Dname VARCHAR(50),
)
--table 2
IF OBJECT_ID('StudentDetails','U') IS NOT NULL
DROP TABLE dbo.StudentDetails

CREATE TABLE dbo.StudentDetails
(
StudentId INT,
Sname VARCHAR(50)
)
--table 3
IF OBJECT_ID('ProjectDetails','U') IS NOT NULL
DROP TABLE dbo.ProjectDetails
CREATE TABLE dbo.ProjectDetails
(
ProjectId INT,
Pname VARCHAR(100),
)

StudentProjectDetails 

(StudentId, ProjectId) is the candidate key of the table.

The functional dependencies are:

  • FD1: StudentId, ProjectId --> HrsWorked
  • FD2: StudentId, ProjectId --> DeptId
  • FD3: DeptId --> Dname

All these functional dependencies follow 2NF. So, StudentProjectDetails  table is in 2NF.

StudentDetails

StudentId is the candidate key of the table.

Functional dependency:

  • FD1: StudentId --> Sname
  • FD1 follows 2NF. So, the table StudentDetails is in 2NF.

ProjectDetails

ProjectId is the candidate key of the table.

Functional dependency:

  • FD1: ProjectId --> Pname

FD1 follows 2NF. So, the table ProjectDetails is in 2NF.

Third Normal Form (3NF)

A relation schema or table is in Third normal form (3NF), when a nontrivial functional dependency X --> A holds in the table, then either X is a super key of the table or A is a prime attribute of the table.

Let us consider the StudentProjectDetails table again. This table is already in 2NF and let us check if it is in 3NF or not.

(StudentId, ProjectId) is the candidate key.

The functional dependencies are:

  • FD1: StudentId, ProjectId --> HrsWorked
  • FD2: StudentId, ProjectId --> DeptId
  • FD3: DeptId --> DName

FD3 violates 3NF condition. Because, neither DeptId is the super key of the table nor DName is any prime attribute.

StudentProjectDetails table is decomposed in two tables -- StudentProjectDetailsData, DepartmentDetails

--table 4
IF OBJECT_ID('StudentProjectDetailsData','U') IS NOT NULL
DROP TABLE dbo.StudentProjectDetailsData

CREATE TABLE dbo.StudentProjectDetailsData
(
StudentId INT,
ProjectId INT,
HrsWorked INT,
DeptId INT
)
--table 5
IF OBJECT_ID('DepartmentDetails','U') IS NOT NULL
DROP TABLE dbo.DepartmentDetails

CREATE TABLE dbo.DepartmentDetails
(
DeptId INT, 
Dname VARCHAR(50),
)

StudentProjectDetailsData

(StudentId, ProjectId) is the candidate key.

The functional dependencies are:

  • FD1: StudentId, ProjectId --> HrsWorked
  • FD2: StudentId, ProjectId --> DeptId

Both these functional dependencies follow 3NF. So, StudentProjectDetailsData table is in 3NF.

DepartmentDetails

DeptId is the candidate key.

The functional dependency is:

  • FD1: DeptId --> DName

FD1 follows 3NF. So, DepartmentDetails table is in 3NF.

Boyce Codd Normal Form (BCNF)

A relation schema or table is in Boyce Codd normal form (BCNF), when a nontrivial functional dependency X --> A holds in the table, then X is a super key of the table.

Let us consider the table named Teaching.

IF OBJECT_ID('Teaching','U') IS NOT NULL
DROP TABLE dbo.Teaching

CREATE TABLE dbo.Teaching
(
StudentId INT,
CourseId INT,
InstructorId INT
)

Here, (StudentId,CourseId) combination is the candidate key.

The functional dependencies are:

  • FD1: (StudentId,CourseId) --> InstructorId
  • FD2: InstructorId --> CourseId

FD1 and FD2 both are in 3NF. But, FD2 violates BCNF. Here, InstructorId is not a super key. To maintain BCNF, the table Teaching needs to be decomposed. There are different ways to decompose the table and none of the decomposition preserves the functional dependencies. But, the following decomposition ensures that the nonadditive join property is maintained.

Teaching table is decomposed in two tables: Teaching1 and Teaching2.

--table 1
IF OBJECT_ID('Teaching1','U') IS NOT NULL
DROP TABLE dbo.Teaching1

CREATE TABLE dbo.Teaching1
(
CourseId INT,
InstructorId INT
)
--table 2
IF OBJECT_ID('Teaching2','U') IS NOT NULL
DROP TABLE dbo.Teaching2

CREATE TABLE dbo.Teaching2
(
StudentId INT,
InstructorId INT
)

Teaching1

InstructorId is the candidate key.

The functional dependency is:

  • FD1: InstructorId --> CourseId

FD1 follows BCNF as InstructorId is the super key So, Teaching1 table is in BCNF.

Teaching2

(StudentId,InstructorId) combination is the candidate key.

No non-key attribute is there in this table. So, this table is already in BCNF.

Conclusion

Normalization helps to maintain the quality of the database design. But, it is not always required to normalize the table to the highest possible normal form. As any table is moved to higher normal forms, there are more decompositions causing more number of tables. Too many numbers of tables has impact on the performance and maintainability of the tables. So, sometimes it is recommended to keep the tables in lower normal forms to achieve the required performance though some data redundancy may be introduced. Denormalization is the opposite of normalization where two or more tables in higher normal forms are combined together to form a single table in lower normal form.

Rate

5 (3)

Share

Share

Rate

5 (3)