Eliminating NULLs when selecting from potentially several tables

  • I have several tables in a database that are not even close to being normalized but that's the way an application we have stores the data. I'm trying to get it into a normalized form so that I can join it to other tables.

    Here's an example of its structure:

    PersonID, Degree, Ethnicity, Hobby

    100, BS, Caucasian, Basketball

    100, MS, Hispanic, Boxing

    100, PhD, NULL, Baseball

    100, NULL, NULL, Tennis

    100, NULL, NULL, Soccer

    On a record by record level, the information isn't related. It's like each PersonID + Column pair individually is related and should be its own table (where the column is not null). For example:

    SELECT PersonID, Degree

    FROM table

    WHERE Degree IS NOT NULL -- would return three records (as it should)

    SELECT PersonID, Ethnicity

    FROM table

    WHERE Ethnicity IS NOT NULL -- two records

    SELECT PersonID, Hobby

    FROM table

    WHERE Hobby IS NOT NULL -- five records

    I want to be able to write queries against any one of these derived tables (in a normalized form) and join them to, as I said before, other tables from different data sources. Since I can't change the actual structure of the table, you might think views are the way to go. I could create views, but if I have 10 fields in this table and 20 tables in this form, that's 10 fields x 20 tables = 200 views. And I can tell you I definitely have more than 20 tables like this.

    I thought a generic table-valued function might be the ticket here, but I haven't yet been able to get anything to work. Does anyone have any thoughts on how I might accomplish this?

    Thank you,

    Mike

    Here's T-SQL to create my current situation:

    CREATE TABLE [dbo].

    (

    [PersonID] [int] NOT NULL,

    [Degree] [varchar](50) NULL,

    [Ethnicity] [varchar](50) NULL,

    [Hobby] [varchar](50) NULL

    ) ON [PRIMARY]

    GO

    SET ANSI_PADDING OFF

    GO

    INSERT [dbo].

    ([PersonID], [Degree], [Ethnicity], [Hobby]) VALUES (100, N'BS', N'Caucasian', N'Basketball')

    GO

    INSERT [dbo].

    ([PersonID], [Degree], [Ethnicity], [Hobby]) VALUES (100, N'MS', N'Hispanic', N'Boxing')

    GO

    INSERT [dbo].

    ([PersonID], [Degree], [Ethnicity], [Hobby]) VALUES (100, N'PhD', NULL, N'Baseball')

    GO

    INSERT [dbo].

    ([PersonID], [Degree], [Ethnicity], [Hobby]) VALUES (100, NULL, NULL, N'Tennis')

    GO

    INSERT [dbo].

    ([PersonID], [Degree], [Ethnicity], [Hobby]) VALUES (100, NULL, NULL, N'Soccer')

    GO

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Thanks for posting the script.

    You may not need many views from the same table.

    Will a view like this work for you?

    CREATE VIEW dbo.PersonProfile

    AS

    SELECT t.PersonID, 'Degree', t.Degree

    FROM dbo.

    t

    WHERE t.Degree > ''

    UNION

    SELECT t.PersonID, 'Ethnicity', t.Ethnicity

    FROM dbo.

    t

    WHERE t.Ethnicity > ''

    UNION

    SELECT t.PersonID, 'Hobby', t.Hobby

    FROM dbo.

    t

    WHERE t.Hobby > ''

    _____________
    Code for TallyGenerator

  • Thanks, Sergiy.

    That would reduce the number of views I'd have to create, but I think it'd be a similar amount of typing in anything that joins to the view. For example, as it stands, I would probably query like this

    table1 = normalized table from another data source

    table2 = the table I'm trying to exclude nulls from

    view1 = the view with several tables UNIONed, NULLs excluded, and item names in their own column

    SELECT *

    FROM table1 a

    INNER JOIN (SELECT PersonID, Degree FROM table2 WHERE Degree > '') b

    ON a.PersonID = b.PersonID

    Even if I created the views in the way that you're suggesting, I'd end up with a query that looks like this:

    SELECT *

    FROM table1 a

    INNER JOIN (SELECT PersonID, Degree FROM view1 WHERE item = 'Degree') b

    ON a.PersonID = b.PersonID

    It's like I'm just substituting the removal of NULLs in the WHERE clause for the specifying the name of the "item" in the WHERE clause.

    I have tried something like this table-valued function, where I pass the table that's not in the right form, the name of the ID column I plan to join on, and the field whose non-NULL values I want to return:

    SELECT *

    FROM table1 a

    INNER JOIN tbl_val_fn('table2', 'PersonID', 'Degree') b

    ON a.PersonID = b.PersonID

    The function would look like:

    SELECT @IDField, @ItemField

    FROM @TableName

    WHERE @ItemField IS NOT NULL

    Any thoughts on this and if I'm even heading down a feasible path?

    Thanks,

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • will the following satisfy your need for simplicity?

    create view [table_normalized]

    as

    select t1.PersonID

    , t1.col_name

    , t1.col_value

    from ( select t1.PersonID

    , t.colnumber

    , case

    when t.colnumber = 1 then 'Degree'

    when t.colnumber = 2 then 'ethnicity'

    when t.colnumber = 3 then 'Hobby'

    end as col_name

    , case

    when t.colnumber = 1 then t1.Degree

    when t.colnumber = 2 then t1.Ethnicity

    when t.colnumber = 3 then t1.Hobby

    end as col_value

    from dbo.

    t1

    cross apply (select top 3 row_number() over (order by num) as colnumber

    from (values (1)) t (num)

    group by cube (num, num, num) -- expand and change the top 3 to allow for the number or columns you have

    -- from (values (1),(1),(1),(1),(1),(1)) t (num) -- use this instead of the from and group by if they fail on your version of Sql

    ) t

    ) t1

    where t1.col_value is not null

    and then use it as

    SELECT *

    FROM table1 a

    INNER JOIN table_normalized b

    ON b.col_name = 'Degree'

    and a.PersonID = b.PersonID

  • Frederico,

    That's an interesting approach. Am I correct in assuming that I'd need to create a case statement containing every possibility for field names across all of the tables I have that are in the funky format? There are probably hundreds.

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • yes that would be the case - but easy enough to build the case statements in Excel for example - this is something I had to do on the last projects I worked on the last 3 years.

    One view per table.

    Note that you may eventually consider "materializing" the views for performance reasons - but probably only feasible if they aren't updated frequently.

  • frederico_fonseca (4/22/2016)


    will the following satisfy your need for simplicity?

    create view [table_normalized]

    as

    select t1.PersonID

    , t1.col_name

    , t1.col_value

    from ( select t1.PersonID

    , t.colnumber

    , case

    when t.colnumber = 1 then 'Degree'

    when t.colnumber = 2 then 'ethnicity'

    when t.colnumber = 3 then 'Hobby'

    end as col_name

    , case

    when t.colnumber = 1 then t1.Degree

    when t.colnumber = 2 then t1.Ethnicity

    when t.colnumber = 3 then t1.Hobby

    end as col_value

    from dbo.

    t1

    cross apply (select top 3 row_number() over (order by num) as colnumber

    from (values (1)) t (num)

    group by cube (num, num, num) -- expand and change the top 3 to allow for the number or columns you have

    -- from (values (1),(1),(1),(1),(1),(1)) t (num) -- use this instead of the from and group by if they fail on your version of Sql

    ) t

    ) t1

    where t1.col_value is not null

    and then use it as

    SELECT *

    FROM table1 a

    INNER JOIN table_normalized b

    ON b.col_name = 'Degree'

    and a.PersonID = b.PersonID

    Hi Frederico,

    I'm just curious--why did you use GROUP BY cube instead of GROUP BY rollup. Also, is it necessary to use "num" three times in "group by cube (num, num, num)". Couldn't you just list it once to get the same results? "group by cube (num)"

    Thanks,

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Hi Mike,

    group by cube and group by rollup do not give same results. so cube is the correct one for this purpose.

    as for why the 3 num and not just 1 - again this is how cube works.

    following code should show you what I'm trying to do here - and this is just one of the many ways of generating a sequence of numbers on the fly. not necessarily the fastest but for small numbers difference with other methods is probably negligible. I did try and find a post comparing several methods but could not find one that also uses this method - maybe someone here on the forums has a link to such a compare post.

    select '2 to the power of 1', count(*)

    from (

    select row_number() over (order by num) as colnumber

    from (values (1)) t (num)

    group by cube (num)

    ) t1

    union all

    select '2 to the power of 2', count(*)

    from (

    select row_number() over (order by num) as colnumber

    from (values (1)) t (num)

    group by cube (num, num)

    ) t2

    union all

    select '2 to the power of 3', count(*)

    from (

    select row_number() over (order by num) as colnumber

    from (values (1)) t (num)

    group by cube (num, num, num)

    ) t3

    union all

    select '2 to the power of 4', count(*)

    from (

    select row_number() over (order by num) as colnumber

    from (values (1)) t (num)

    group by cube (num, num, num, num)

    ) t4

    union all

    select '2 to the power of 5', count(*)

    from (

    select row_number() over (order by num) as colnumber

    from (values (1)) t (num)

    group by cube (num, num, num, num, num)

    ) t5

    union all

    select '2 to the power of 6', count(*)

    from (

    select row_number() over (order by num) as colnumber

    from (values (1)) t (num)

    group by cube (num, num, num, num, num, num)

    ) t5

    and you can also try the inner selects on their own to see the numbers being generated

  • Got it. Thanks so much!

    Mike Scalise, PMP
    https://www.michaelscalise.com

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply