Merging pseudo-duplicate records

  • Here's an interesting sql challenge that I haven't encountered before. We have a data warehouse staging database in which we capture change history for hundreds of tables from a source system. In the source system, records are updated in place, but in our data warehouse we capture these changes by "terminating" the existing record and adding a new record reflecting the changes. In the data warehouse we add two columns to every table -- effective_date and expiration_date -- which indicate the dates the record was in effect in the source system. By convention, an expiration_date of 6/6/2079 means the record is currently still active in the source system. Each day we simply compare yesterday's version of the record (in the data warehouse) against today's version (in the source system). If differences are found in any of the columns, we terminate the record and add a new one, setting those dates appropriately.

    Here's a highly simplified example. In this example, the employee_id column is the natural key in the source system. We add the effective_date and expiration_date in the data warehouse, so those three columns together make up the key in the data warehouse. The employee_name, employee_dept, and last_login_date columns all come from the source system as well.

    drop table mytbl

    create table mytbl (

    effective_date smalldatetime,

    expiration_date smalldatetime,

    employee_id int,

    employee_name varchar(30),

    employee_dept int,

    last_login_date smalldatetime

    )

    insert mytbl values ('1/1/2014', '1/15/2014', 1, 'bob', 7, '1/15/2014')

    insert mytbl values ('1/16/2014', '1/20/2014', 1, 'bob', 7, '1/20/2014')

    insert mytbl values ('1/21/2014', '1/31/2014', 1, 'bob', 7, '1/31/2014')

    insert mytbl values ('2/1/2014', '2/9/2014', 1, 'bob', 8, '2/9/2014')

    insert mytbl values ('2/10/2014', '6/6/2079', 1, 'bob', 8, '2/15/2014')

    insert mytbl values ('1/2/2014', '1/16/2014', 2, 'frank', 4, '1/16/2014')

    insert mytbl values ('1/17/2014', '1/21/2014', 2, 'frank', 4, '1/21/2014')

    insert mytbl values ('1/22/2014', '2/1/2014', 2, 'frank', 4, '1/31/2014')

    insert mytbl values ('2/2/2014', '2/10/2014', 2, 'frank', 4, '2/10/2014')

    insert mytbl values ('2/11/2014', '6/6/2079', 2, 'frank', 4, '2/15/2014')

    insert mytbl values ('1/3/2014', '1/17/2014', 3, 'cheryl', 6, '1/17/2014')

    insert mytbl values ('1/18/2014', '1/22/2014', 3, 'cheryl', 6, '1/22/2014')

    insert mytbl values ('1/23/2014', '2/2/2014', 3, 'cheryl', 9, '1/31/2014')

    insert mytbl values ('2/3/2014', '2/11/2014', 3, 'cheryl', 9, '2/11/2014')

    insert mytbl values ('2/12/2014', '6/6/2079', 3, 'cheryl', 6, '2/15/2014')

    select *

    from mytbl

    order by employee_id, effective_date

    In the select output, you can follow the trail of changes for each of these three employees. Bob moved from dept 7 to 8 at some point; Frank didn't change departments at all; Cheryl moved from dept 6 to 9 and later back to 6. However, the last_login_date was updated frequently for all these employees.

    We've tracked hundreds of tables this way for years, some with hundreds of columns. For optimization purposes, I'm now interested in trimming the fat a bit. That is, we track changes in many columns that we don't really need in our data warehouse. Some of these columns are rapidly-changing, causing all sorts of unnecessary terminate/inserts in the data warehouse. My goal is to remove these columns, reclaim the disk space and increase the ETL speed. So in this example, let's get rid of the last_login_date column.

    alter table mytbl

    drop column last_login_date

    select *

    from mytbl

    order by employee_id, effective_date

    Now in the select output, you can see we have many "effective duplicate" records. For example, nothing changed for Bob between 1/1/2014 and 1/31/2014 -- those really should be one record, not three. Here's the challenge: I'm looking for an efficient way to merge these "effective duplicates" together, through set-based sql updates/deletes/inserts (hoping to avoid any RBAR operations). Here's what the table ultimately should look like (cheating to get there):

    create table mytbl2 (

    effective_date smalldatetime,

    expiration_date smalldatetime,

    employee_id int,

    employee_name varchar(30),

    employee_dept int

    )

    insert mytbl2 values ('1/1/2014', '1/31/2014', 1, 'bob', 7)

    insert mytbl2 values ('2/1/2014', '6/6/2079', 1, 'bob', 8)

    insert mytbl2 values ('1/2/2014', '6/6/2079', 2, 'frank', 4)

    insert mytbl2 values ('1/3/2014', '1/22/2014', 3, 'cheryl', 6)

    insert mytbl2 values ('1/23/2014', '2/11/2014', 3, 'cheryl', 9)

    insert mytbl2 values ('2/12/2014', '6/6/2079', 3, 'cheryl', 6)

    select *

    from mytbl2

    order by employee_id, effective_date

    Note that Bob only has two records (he changed department), Frank only has one record (no changes), and Cheryl has three records (two department changes).

    My inclination would be to drop the unwanted columns, then GROUP BY all the remaining columns from the source system, and taking the MIN effective_date and MAX expiration_date. However, this doesn't work for cases like Cheryl's -- she moved to another department, then back again, so that change history needs to be retained.

    As I mentioned, we have hundreds of tables, and I'd like to strip out dozens (maybe hundreds) of unused columns, so ultimately there will be millions of these pseudo-duplicates that need to be merged together. These are huge tables, so I really need to find an efficient set-based approach to this. Can anyone offer up clever ideas? Thanks in advance!

  • Here's something to start with.

    It's based on the gaps'n'island concept.

    ;

    WITH cte AS

    (

    SELECT

    *,

    ROW_NUMBER() OVER(PARTITION BY employee_id ORDER BY effective_date) as pos1,

    ROW_NUMBER() OVER(PARTITION BY employee_id,employee_dept ORDER BY effective_date) as pos2

    FROM mytbl

    )

    SELECT min(effective_date) as effective_date, max(expiration_date) as expiration_date, employee_id,employee_name,employee_dept

    FROM cte

    GROUP BY employee_id,employee_name,employee_dept,pos1-pos2

    ORDER BY employee_id,employee_name,employee_dept



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Thanks much for the push in the right direction! I'll have to read up on the gaps and islands stuff - that's perfect. I'll have to turn this into some ugly dynamic sql, but I think it will work for me. Cheers!

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply