How best to handle non-normalized data

Question

Post reply

How best to handle non-normalized data

Mike Scalise

SSCrazy Eights

Points: 8145
More actions
September 15, 2016 at 2:21 pm

#313339

Hi,
I am trying to write queries against a vendor's database whose tables are in a very peculiar format. They are neither fully normalized nor de-normalized. Here is an example of the structure:
DECLARE @Person TABLE (
ID int NOT NULL,
FirstName varchar(25) NULL,
LastName varchar(25) NULL,
DOB date NULL
)
INSERT @Person (ID, FirstName, LastName, DOB) VALUES (1, N'John', N'Doe', CAST(N'1980-12-31' AS Date))
INSERT @Person (ID, FirstName, LastName, DOB) VALUES (2, N'Sally', N'Smith', CAST(N'1981-09-15' AS Date))
DECLARE @Person_Favorites TABLE (
ID int NOT NULL,
POS int NOT NULL,
Hobby varchar(50) NULL,
Color varchar(50) NULL,
SQLConference varchar(50) NULL
)
INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (1, 1, N'Basketball', N'Blue', N'PASS Summit')
INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (1, 2, N'Baseball', N'Green', N'SQL Saturday')
INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (1, 3, NULL, N'Yellow', N'SQL Cruise')
INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (1, 4, NULL, N'Black', NULL)
INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (1, 5, NULL, N'Brown', NULL)
INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (2, 1, N'Tennis', N'Red', N'SQL Bits')
INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (2, 2, N'Football', N'Green', N'SQLIntersection')
INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (2, 3, N'Soccer', NULL, NULL)
INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (2, 4, N'Track and Field', NULL, NULL)
SELECT *
FROM @Person
SELECT *
FROM @Person_Favorites
You can see that the @Person_Favorites table has a column for each type of favorite for the person, and the number of rows each person will have is equal to highest count of all of their favorite things.
So, if I wanted to report and filter on any of these, one of the things I thought to do is create views to normalize them. The queries would look something like:
SELECT ID,
POS,
Hobby
FROM @Person_Favorites
WHERE Hobby IS NOT NULL
SELECT ID,
POS,
Color
FROM @Person_Favorites
WHERE Color IS NOT NULL
SELECT ID,
POS,
SQLConference
FROM @Person_Favorites
WHERE SQLConference IS NOT NULL
However, that would mean I'd have to create a ton of views if the person has 100 different favorite categories. In addition, I believe the query optimizer wouldn't be able to best optimize the query if I start joining several views together.
I can't reasonably just join @Person to @Person_Favorites on ID because of how the data is stored in @Person_Favorites, right?
I'm curious to know what others think. Does anyone know of a way to address this from a reporting perspective? Obviously a data warehouse is in the future but there are still real-time reports that need to be run.
Thanks in advance,
Mike
Mike Scalise, PMP
https://www.michaelscalise.com

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 1

What about those two tables is not normalized?

select p.id, p.FirstName, p.LastName, p.dob, pf.pos,pf.Hobby, pf.Color, pf.SQLConference

from @Person p

join @Person_Favorites pf on p.ID = pf.ID

--where pf.SQLConference like 'SQL%'

--where color = 'Green'

where lastname = 'Smith'

The SELECT/FROM/JOIN lines could be your view, if you feel you need one, but that's as simple as a JOIN gets and you can filter on any of the columns.

I really don't see the problem here. Could you elaborate a bit?

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 2

Are you trying to take normalization a step further and create separate tables for possible colors, hobbies, and SQLConferences?

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Alan Burstein SSC Guru Points: 61136 More actions · Answer 3

Vendor Databases are generally not designed by database gurus which is a good thing, it helps keep us database people employed.

For a situation like this, clustered indexes, non-clustered indexes (filtered and unfiltered) and indexed views can be your friend. Below is your code/data stored in perm tables so I could index and create views... Play around with this code in SSMS with "show actual execution plan" turned on. Note my comments.

USE tempdb

GO

CREATE TABLE dbo.person

(

ID int NOT NULL,

FirstName varchar(25) NULL,

LastName varchar(25) NULL,

DOB date NULL

);

CREATE TABLE dbo.person_favorites

(

ID int NOT NULL,

POS int NOT NULL,

Hobby varchar(50) NULL,

Color varchar(50) NULL,

SQLConference varchar(50) NULL

);

-- clustered Index on #person

ALTER TABLE dbo.person ADD CONSTRAINT pk_person PRIMARY KEY CLUSTERED (ID);

ALTER TABLE dbo.person_favorites ADD CONSTRAINT pk_person_favs PRIMARY KEY CLUSTERED (ID, POS);

INSERT dbo.Person (ID, FirstName, LastName, DOB) VALUES

(1, N'John', N'Doe', '1980-12-31'), (2, N'Sally', N'Smith', '1981-09-15');

INSERT dbo.Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES

(1, 1, N'Basketball', N'Blue', N'PASS Summit'),

(1, 2, N'Baseball', N'Green', N'SQL Saturday'),

(1, 3, NULL, N'Yellow', N'SQL Cruise'),

(1, 4, NULL, N'Black', NULL),

(1, 5, NULL, N'Brown', NULL),

(2, 1, N'Tennis', N'Red', N'SQL Bits'),

(2, 2, N'Football', N'Green', N'SQLIntersection'),

(2, 3, N'Soccer', NULL, NULL),

(2, 4, N'Track and Field', NULL, NULL);

GO

CREATE UNIQUE NONCLUSTERED INDEX pk_person_favorites__hobby

ON dbo.Person_Favorites(ID, POS, Hobby)

WHERE Hobby IS NOT NULL;

-- get's nonclustered Index Seek

SELECT ID,

POS,

Hobby

FROM dbo.Person_Favorites

WHERE Hobby IS NOT NULL AND ID = 1;

CREATE UNIQUE NONCLUSTERED INDEX pk_person_favorites__color

ON dbo.Person_Favorites(ID, POS, Color)

WHERE color IS NOT NULL;

-- get's nonclustered Index Seek

SELECT ID,

POS,

Color

FROM dbo.Person_Favorites

WHERE Color IS NOT NULL AND ID = 2;

GO

CREATE VIEW dbo.vw_people_favorites

WITH SCHEMABINDING AS

SELECT

pf.ID,

pf.POS,

p.FirstName,

p.LastName,

p.DOB,

pf.Hobby,

pf.SQLConference

FROM dbo.person p

JOIN dbo.person_favorites pf ON p.ID = pf.ID;

GO

-- Indexed view which can be used for more indexes

CREATE UNIQUE CLUSTERED INDEX vci_people_favorites ON dbo.vw_people_favorites(ID, POS);

-- gets info from clustered index Scan

SELECT

FirstName,

LastName,

DOB,

Hobby

FROM dbo.vw_people_favorites;

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

drew.allen SSC Guru Points: 76948 More actions · Answer 4

The Dixie Flatline (9/15/2016)
What about those two tables is not normalized?
select p.id, p.FirstName, p.LastName, p.dob, pf.pos,pf.Hobby, pf.Color, pf.SQLConference
from @Person p
join @Person_Favorites pf on p.ID = pf.ID
--where pf.SQLConference like 'SQL%'
--where color = 'Green'
where lastname = 'Smith'
The SELECT/FROM/JOIN lines could be your view, if you feel you need one, but that's as simple as a JOIN gets and you can filter on any of the columns.
I really don't see the problem here. Could you elaborate a bit?

It violates Fourth Normal Form. It contains multiple independent multi-valued facts. The fact that John's favorite sport is basketball, is unrelated to the fact that his favorite colo(u)r is blue.

Drew

Edit: It quoted the wrong post.

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 5

The Dixie Flatline (9/15/2016)
Are you trying to take normalization a step further and create separate tables for possible colors, hobbies, and SQLConferences?

Well, it's not that I want to do that, but as drew said, it violates fourth normal form. In the current state, I can neither simply join the two tables and get what I need nor are they completely normalized where I can report off of them that way. They're kind of in-between, which is why I need to do something about it.

Mike Scalise, PMP
https://www.michaelscalise.com

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 6

Alan.B (9/15/2016)
Vendor Databases are generally not designed by database gurus which is a good thing, it helps keep us database people employed.
For a situation like this, clustered indexes, non-clustered indexes (filtered and unfiltered) and indexed views can be your friend. Below is your code/data stored in perm tables so I could index and create views... Play around with this code in SSMS with "show actual execution plan" turned on. Note my comments.
USE tempdb
GO
CREATE TABLE dbo.person
(
ID int NOT NULL,
FirstName varchar(25) NULL,
LastName varchar(25) NULL,
DOB date NULL
);
CREATE TABLE dbo.person_favorites
(
ID int NOT NULL,
POS int NOT NULL,
Hobby varchar(50) NULL,
Color varchar(50) NULL,
SQLConference varchar(50) NULL
);
-- clustered Index on #person
ALTER TABLE dbo.person ADD CONSTRAINT pk_person PRIMARY KEY CLUSTERED (ID);
ALTER TABLE dbo.person_favorites ADD CONSTRAINT pk_person_favs PRIMARY KEY CLUSTERED (ID, POS);
INSERT dbo.Person (ID, FirstName, LastName, DOB) VALUES
(1, N'John', N'Doe', '1980-12-31'), (2, N'Sally', N'Smith', '1981-09-15');
INSERT dbo.Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES
(1, 1, N'Basketball', N'Blue', N'PASS Summit'),
(1, 2, N'Baseball', N'Green', N'SQL Saturday'),
(1, 3, NULL, N'Yellow', N'SQL Cruise'),
(1, 4, NULL, N'Black', NULL),
(1, 5, NULL, N'Brown', NULL),
(2, 1, N'Tennis', N'Red', N'SQL Bits'),
(2, 2, N'Football', N'Green', N'SQLIntersection'),
(2, 3, N'Soccer', NULL, NULL),
(2, 4, N'Track and Field', NULL, NULL);
GO
CREATE UNIQUE NONCLUSTERED INDEX pk_person_favorites__hobby
ON dbo.Person_Favorites(ID, POS, Hobby)
WHERE Hobby IS NOT NULL;
-- get's nonclustered Index Seek
SELECT ID,
POS,
Hobby
FROM dbo.Person_Favorites
WHERE Hobby IS NOT NULL AND ID = 1;
CREATE UNIQUE NONCLUSTERED INDEX pk_person_favorites__color
ON dbo.Person_Favorites(ID, POS, Color)
WHERE color IS NOT NULL;
-- get's nonclustered Index Seek
SELECT ID,
POS,
Color
FROM dbo.Person_Favorites
WHERE Color IS NOT NULL AND ID = 2;
GO
CREATE VIEW dbo.vw_people_favorites
WITH SCHEMABINDING AS
SELECT
pf.ID,
pf.POS,
p.FirstName,
p.LastName,
p.DOB,
pf.Hobby,
pf.SQLConference
FROM dbo.person p
JOIN dbo.person_favorites pf ON p.ID = pf.ID;
GO
-- Indexed view which can be used for more indexes
CREATE UNIQUE CLUSTERED INDEX vci_people_favorites ON dbo.vw_people_favorites(ID, POS);
-- gets info from clustered index Scan
SELECT
FirstName,
LastName,
DOB,
Hobby
FROM dbo.vw_people_favorites;

Alan, I'll definitely take a look at this in SSMS--thank you.

However, if I have 100 different favorite types stored in this way, that's going to be a very wide non-clustered index. I'm a little apprehensive about creating that on an OLTP system, but then again, I haven't really done that before. In the same vein, I have heard that indexed views can seriously affect performance of OLTP systems with a lot of activity, so I've been trying to avoid that too.

In addition, the tables in the database were created in such a way that they don't even meet the criteria to index the views that I have........

Mike

Mike Scalise, PMP
https://www.michaelscalise.com

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 7

drew.allen (9/15/2016)
The Dixie Flatline (9/15/2016)
What about those two tables is not normalized?
select p.id, p.FirstName, p.LastName, p.dob, pf.pos,pf.Hobby, pf.Color, pf.SQLConference
from @Person p
join @Person_Favorites pf on p.ID = pf.ID
--where pf.SQLConference like 'SQL%'
--where color = 'Green'
where lastname = 'Smith'
The SELECT/FROM/JOIN lines could be your view, if you feel you need one, but that's as simple as a JOIN gets and you can filter on any of the columns.
I really don't see the problem here. Could you elaborate a bit?
It violates Fourth Normal Form. It contains multiple independent multi-valued facts. The fact that John's favorite sport is basketball, is unrelated to the fact that his favorite colo(u)r is blue.
Drew
Edit: It quoted the wrong post.

DOH!! I never stopped to question whether or not those favorite columns were UNrelated. I am so ashamed :ermm:

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Jason A. Long SSC-Insane Points: 23780 More actions · Answer 8

Jason A. Long

SSC-Insane

Points: 23780

September 16, 2016 at 1:10 pm

#1901459

-- removed --

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 9

...yeah I couldn't think of any viable workarounds either.

Mike Scalise, PMP
https://www.michaelscalise.com

Alan Burstein SSC Guru Points: 61136 More actions · Answer 10

...if I have 100 different favorite types stored in this way, that's going to be a very wide non-clustered index.

That is a potential design problem - I would consider spreading that data across multiple tables but, if you can't, you can still index in a way that improves performance without slowing other things down too much. If you are stuck with that design I would consider making a copy of that table that is updated regularly using SQL Replication, an SSIS job or the SQL agent to keep the copy up-to-date. Then the original would remain unchanged and the copy could handle non-OLTP reporting types of requests.

It's common to setup Transactional Replication for this kind of thing where the original has minimal indexes but the copy is well indexed and is used for reporting.

In the same vein, I have heard that indexed views can seriously affect performance of OLTP systems with a lot of activity...

Just like tables, Indexes on views have the same pros and cons: they speed up data access queries but slow down data modification queries. To understand the benefits (or lack of) you need to do some testing. Lack of proper indexes can cause queries that would finish in seconds, finish in minutes - when that happens you have more locking/blocking/deadlocking. Correctly designed indexes also reduce I/O and memory pressure. Again - I would do some testing.

This is where data warehouses really help - separating OLTP operations from reporting.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001