How best to handle non-normalized data

  • Hi,

    I am trying to write queries against a vendor's database whose tables are in a very peculiar format. They are neither fully normalized nor de-normalized. Here is an example of the structure:

    DECLARE @Person TABLE (

    ID int NOT NULL,

    FirstName varchar(25) NULL,

    LastName varchar(25) NULL,

    DOB date NULL

    )

    INSERT @Person (ID, FirstName, LastName, DOB) VALUES (1, N'John', N'Doe', CAST(N'1980-12-31' AS Date))

    INSERT @Person (ID, FirstName, LastName, DOB) VALUES (2, N'Sally', N'Smith', CAST(N'1981-09-15' AS Date))

    DECLARE @Person_Favorites TABLE (

    ID int NOT NULL,

    POS int NOT NULL,

    Hobby varchar(50) NULL,

    Color varchar(50) NULL,

    SQLConference varchar(50) NULL

    )

    INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (1, 1, N'Basketball', N'Blue', N'PASS Summit')

    INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (1, 2, N'Baseball', N'Green', N'SQL Saturday')

    INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (1, 3, NULL, N'Yellow', N'SQL Cruise')

    INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (1, 4, NULL, N'Black', NULL)

    INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (1, 5, NULL, N'Brown', NULL)

    INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (2, 1, N'Tennis', N'Red', N'SQL Bits')

    INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (2, 2, N'Football', N'Green', N'SQLIntersection')

    INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (2, 3, N'Soccer', NULL, NULL)

    INSERT @Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES (2, 4, N'Track and Field', NULL, NULL)

    SELECT *

    FROM @Person

    SELECT *

    FROM @Person_Favorites

    You can see that the @Person_Favorites table has a column for each type of favorite for the person, and the number of rows each person will have is equal to highest count of all of their favorite things.

    So, if I wanted to report and filter on any of these, one of the things I thought to do is create views to normalize them. The queries would look something like:

    SELECT ID,

    POS,

    Hobby

    FROM @Person_Favorites

    WHERE Hobby IS NOT NULL

    SELECT ID,

    POS,

    Color

    FROM @Person_Favorites

    WHERE Color IS NOT NULL

    SELECT ID,

    POS,

    SQLConference

    FROM @Person_Favorites

    WHERE SQLConference IS NOT NULL

    However, that would mean I'd have to create a ton of views if the person has 100 different favorite categories. In addition, I believe the query optimizer wouldn't be able to best optimize the query if I start joining several views together.

    I can't reasonably just join @Person to @Person_Favorites on ID because of how the data is stored in @Person_Favorites, right?

    I'm curious to know what others think. Does anyone know of a way to address this from a reporting perspective? Obviously a data warehouse is in the future but there are still real-time reports that need to be run.

    Thanks in advance,

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • What about those two tables is not normalized?

    select p.id, p.FirstName, p.LastName, p.dob, pf.pos,pf.Hobby, pf.Color, pf.SQLConference

    from @Person p

    join @Person_Favorites pf on p.ID = pf.ID

    --where pf.SQLConference like 'SQL%'

    --where color = 'Green'

    where lastname = 'Smith'

    The SELECT/FROM/JOIN lines could be your view, if you feel you need one, but that's as simple as a JOIN gets and you can filter on any of the columns.

    I really don't see the problem here. Could you elaborate a bit?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Are you trying to take normalization a step further and create separate tables for possible colors, hobbies, and SQLConferences?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Vendor Databases are generally not designed by database gurus which is a good thing, it helps keep us database people employed.

    For a situation like this, clustered indexes, non-clustered indexes (filtered and unfiltered) and indexed views can be your friend. Below is your code/data stored in perm tables so I could index and create views... Play around with this code in SSMS with "show actual execution plan" turned on. Note my comments.

    USE tempdb

    GO

    CREATE TABLE dbo.person

    (

    ID int NOT NULL,

    FirstName varchar(25) NULL,

    LastName varchar(25) NULL,

    DOB date NULL

    );

    CREATE TABLE dbo.person_favorites

    (

    ID int NOT NULL,

    POS int NOT NULL,

    Hobby varchar(50) NULL,

    Color varchar(50) NULL,

    SQLConference varchar(50) NULL

    );

    -- clustered Index on #person

    ALTER TABLE dbo.person ADD CONSTRAINT pk_person PRIMARY KEY CLUSTERED (ID);

    ALTER TABLE dbo.person_favorites ADD CONSTRAINT pk_person_favs PRIMARY KEY CLUSTERED (ID, POS);

    INSERT dbo.Person (ID, FirstName, LastName, DOB) VALUES

    (1, N'John', N'Doe', '1980-12-31'), (2, N'Sally', N'Smith', '1981-09-15');

    INSERT dbo.Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES

    (1, 1, N'Basketball', N'Blue', N'PASS Summit'),

    (1, 2, N'Baseball', N'Green', N'SQL Saturday'),

    (1, 3, NULL, N'Yellow', N'SQL Cruise'),

    (1, 4, NULL, N'Black', NULL),

    (1, 5, NULL, N'Brown', NULL),

    (2, 1, N'Tennis', N'Red', N'SQL Bits'),

    (2, 2, N'Football', N'Green', N'SQLIntersection'),

    (2, 3, N'Soccer', NULL, NULL),

    (2, 4, N'Track and Field', NULL, NULL);

    GO

    CREATE UNIQUE NONCLUSTERED INDEX pk_person_favorites__hobby

    ON dbo.Person_Favorites(ID, POS, Hobby)

    WHERE Hobby IS NOT NULL;

    -- get's nonclustered Index Seek

    SELECT ID,

    POS,

    Hobby

    FROM dbo.Person_Favorites

    WHERE Hobby IS NOT NULL AND ID = 1;

    CREATE UNIQUE NONCLUSTERED INDEX pk_person_favorites__color

    ON dbo.Person_Favorites(ID, POS, Color)

    WHERE color IS NOT NULL;

    -- get's nonclustered Index Seek

    SELECT ID,

    POS,

    Color

    FROM dbo.Person_Favorites

    WHERE Color IS NOT NULL AND ID = 2;

    GO

    CREATE VIEW dbo.vw_people_favorites

    WITH SCHEMABINDING AS

    SELECT

    pf.ID,

    pf.POS,

    p.FirstName,

    p.LastName,

    p.DOB,

    pf.Hobby,

    pf.SQLConference

    FROM dbo.person p

    JOIN dbo.person_favorites pf ON p.ID = pf.ID;

    GO

    -- Indexed view which can be used for more indexes

    CREATE UNIQUE CLUSTERED INDEX vci_people_favorites ON dbo.vw_people_favorites(ID, POS);

    -- gets info from clustered index Scan

    SELECT

    FirstName,

    LastName,

    DOB,

    Hobby

    FROM dbo.vw_people_favorites;

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • The Dixie Flatline (9/15/2016)


    What about those two tables is not normalized?

    select p.id, p.FirstName, p.LastName, p.dob, pf.pos,pf.Hobby, pf.Color, pf.SQLConference

    from @Person p

    join @Person_Favorites pf on p.ID = pf.ID

    --where pf.SQLConference like 'SQL%'

    --where color = 'Green'

    where lastname = 'Smith'

    The SELECT/FROM/JOIN lines could be your view, if you feel you need one, but that's as simple as a JOIN gets and you can filter on any of the columns.

    I really don't see the problem here. Could you elaborate a bit?

    It violates Fourth Normal Form. It contains multiple independent multi-valued facts. The fact that John's favorite sport is basketball, is unrelated to the fact that his favorite colo(u)r is blue.

    Drew

    Edit: It quoted the wrong post.

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • The Dixie Flatline (9/15/2016)


    Are you trying to take normalization a step further and create separate tables for possible colors, hobbies, and SQLConferences?

    Well, it's not that I want to do that, but as drew said, it violates fourth normal form. In the current state, I can neither simply join the two tables and get what I need nor are they completely normalized where I can report off of them that way. They're kind of in-between, which is why I need to do something about it.

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Alan.B (9/15/2016)


    Vendor Databases are generally not designed by database gurus which is a good thing, it helps keep us database people employed.

    For a situation like this, clustered indexes, non-clustered indexes (filtered and unfiltered) and indexed views can be your friend. Below is your code/data stored in perm tables so I could index and create views... Play around with this code in SSMS with "show actual execution plan" turned on. Note my comments.

    USE tempdb

    GO

    CREATE TABLE dbo.person

    (

    ID int NOT NULL,

    FirstName varchar(25) NULL,

    LastName varchar(25) NULL,

    DOB date NULL

    );

    CREATE TABLE dbo.person_favorites

    (

    ID int NOT NULL,

    POS int NOT NULL,

    Hobby varchar(50) NULL,

    Color varchar(50) NULL,

    SQLConference varchar(50) NULL

    );

    -- clustered Index on #person

    ALTER TABLE dbo.person ADD CONSTRAINT pk_person PRIMARY KEY CLUSTERED (ID);

    ALTER TABLE dbo.person_favorites ADD CONSTRAINT pk_person_favs PRIMARY KEY CLUSTERED (ID, POS);

    INSERT dbo.Person (ID, FirstName, LastName, DOB) VALUES

    (1, N'John', N'Doe', '1980-12-31'), (2, N'Sally', N'Smith', '1981-09-15');

    INSERT dbo.Person_Favorites (ID, POS, Hobby, Color, SQLConference) VALUES

    (1, 1, N'Basketball', N'Blue', N'PASS Summit'),

    (1, 2, N'Baseball', N'Green', N'SQL Saturday'),

    (1, 3, NULL, N'Yellow', N'SQL Cruise'),

    (1, 4, NULL, N'Black', NULL),

    (1, 5, NULL, N'Brown', NULL),

    (2, 1, N'Tennis', N'Red', N'SQL Bits'),

    (2, 2, N'Football', N'Green', N'SQLIntersection'),

    (2, 3, N'Soccer', NULL, NULL),

    (2, 4, N'Track and Field', NULL, NULL);

    GO

    CREATE UNIQUE NONCLUSTERED INDEX pk_person_favorites__hobby

    ON dbo.Person_Favorites(ID, POS, Hobby)

    WHERE Hobby IS NOT NULL;

    -- get's nonclustered Index Seek

    SELECT ID,

    POS,

    Hobby

    FROM dbo.Person_Favorites

    WHERE Hobby IS NOT NULL AND ID = 1;

    CREATE UNIQUE NONCLUSTERED INDEX pk_person_favorites__color

    ON dbo.Person_Favorites(ID, POS, Color)

    WHERE color IS NOT NULL;

    -- get's nonclustered Index Seek

    SELECT ID,

    POS,

    Color

    FROM dbo.Person_Favorites

    WHERE Color IS NOT NULL AND ID = 2;

    GO

    CREATE VIEW dbo.vw_people_favorites

    WITH SCHEMABINDING AS

    SELECT

    pf.ID,

    pf.POS,

    p.FirstName,

    p.LastName,

    p.DOB,

    pf.Hobby,

    pf.SQLConference

    FROM dbo.person p

    JOIN dbo.person_favorites pf ON p.ID = pf.ID;

    GO

    -- Indexed view which can be used for more indexes

    CREATE UNIQUE CLUSTERED INDEX vci_people_favorites ON dbo.vw_people_favorites(ID, POS);

    -- gets info from clustered index Scan

    SELECT

    FirstName,

    LastName,

    DOB,

    Hobby

    FROM dbo.vw_people_favorites;

    Alan, I'll definitely take a look at this in SSMS--thank you.

    However, if I have 100 different favorite types stored in this way, that's going to be a very wide non-clustered index. I'm a little apprehensive about creating that on an OLTP system, but then again, I haven't really done that before. In the same vein, I have heard that indexed views can seriously affect performance of OLTP systems with a lot of activity, so I've been trying to avoid that too.

    In addition, the tables in the database were created in such a way that they don't even meet the criteria to index the views that I have........

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • drew.allen (9/15/2016)


    The Dixie Flatline (9/15/2016)


    What about those two tables is not normalized?

    select p.id, p.FirstName, p.LastName, p.dob, pf.pos,pf.Hobby, pf.Color, pf.SQLConference

    from @Person p

    join @Person_Favorites pf on p.ID = pf.ID

    --where pf.SQLConference like 'SQL%'

    --where color = 'Green'

    where lastname = 'Smith'

    The SELECT/FROM/JOIN lines could be your view, if you feel you need one, but that's as simple as a JOIN gets and you can filter on any of the columns.

    I really don't see the problem here. Could you elaborate a bit?

    It violates Fourth Normal Form. It contains multiple independent multi-valued facts. The fact that John's favorite sport is basketball, is unrelated to the fact that his favorite colo(u)r is blue.

    Drew

    Edit: It quoted the wrong post.

    DOH!! I never stopped to question whether or not those favorite columns were UNrelated. I am so ashamed :ermm:

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • -- removed --

  • ...yeah I couldn't think of any viable workarounds either.

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • ...if I have 100 different favorite types stored in this way, that's going to be a very wide non-clustered index.

    That is a potential design problem - I would consider spreading that data across multiple tables but, if you can't, you can still index in a way that improves performance without slowing other things down too much. If you are stuck with that design I would consider making a copy of that table that is updated regularly using SQL Replication, an SSIS job or the SQL agent to keep the copy up-to-date. Then the original would remain unchanged and the copy could handle non-OLTP reporting types of requests.

    It's common to setup Transactional Replication for this kind of thing where the original has minimal indexes but the copy is well indexed and is used for reporting.

    In the same vein, I have heard that indexed views can seriously affect performance of OLTP systems with a lot of activity...

    Just like tables, Indexes on views have the same pros and cons: they speed up data access queries but slow down data modification queries. To understand the benefits (or lack of) you need to do some testing. Lack of proper indexes can cause queries that would finish in seconds, finish in minutes - when that happens you have more locking/blocking/deadlocking. Correctly designed indexes also reduce I/O and memory pressure. Again - I would do some testing.

    This is where data warehouses really help - separating OLTP operations from reporting.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply