Combining / crunching data?

  • Ok this will take some explaining.

    If I have a table, columns MYKEY, DT, A, B, C, D, E with the following data:

    MYKEY DT A B C D E

    ------------------------------------------------

    1 1Jan 1 2 3 4 5

    1 2Jan 6 7 8 9 10

    1 3Jan null null 3 null 6

    1 4Jan 7 null null null 1

    2 5Jan 4 3 4 2 3

    2 6Jan null null 2 3 1

    What I want is to create a new table where MYKEY is unique and I take the most recent non-null value for each column. So in the above case I would end up with:

    1 4Jan 7 7 3 9 1

    2 6Jan 4 3 2 3 1

    I will be doing this for very large data sets so the performance of the query is important.

    What I have done to date is add a column called COPY, and use ROWNUMBER() partitioned by MYKEY to assign a COPY number to each row, and then in a loop, set each value to the one from the previous row where not null.

    But it's slow slow slow when you work with large data sets, no matter how I try to index it.

    Any ideas?

  • Brian McGee-355525 (12/4/2010)


    Ok this will take some explaining.

    If I have a table, columns MYKEY, DT, A, B, C, D, E with the following data:

    MYKEY DT A B C D E

    ------------------------------------------------

    1 1Jan 1 2 3 4 5

    1 2Jan 6 7 8 9 10

    1 3Jan null null 3 null 6

    1 4Jan 7 null null null 1

    2 5Jan 4 3 4 2 3

    2 6Jan null null 2 3 1

    What I want is to create a new table where MYKEY is unique and I take the most recent non-null value for each column. So in the above case I would end up with:

    1 4Jan 7 7 3 9 1

    2 6Jan 4 3 2 3 1

    I will be doing this for very large data sets so the performance of the query is important.

    What I have done to date is add a column called COPY, and use ROWNUMBER() partitioned by MYKEY to assign a COPY number to each row, and then in a loop, set each value to the one from the previous row where not null.

    But it's slow slow slow when you work with large data sets, no matter how I try to index it.

    Any ideas?

    Yes... I have some ideas... but I need to know, is your real data limited to just 5 columns?

    I also need to know what PK is on the table and which columns the clustered index is on (and, yes, they can be different).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I have done similar through what (at first sight) looks horrendous code.

    Basically you can use an update statement that updates the required "master rows" and each column is derived from a subquery something like:

    Update x

    set

    col1=(select top 1 t2.col1 from table as t2 where t2.col1 is not null and t2.k = x.k order by t2.datecol desc),

    col2=(select top 1 t3.col2 from table as t3 where t3.col2 is not null and t3.k = x.k order by t3.datecol desc)

    where .... "whatever determins the masters"

    where:

    datecol is the column in the table that tells you which is the "most recent" ie something like date_updated

    and k is a column that is common across all the rows you want to merge.

    These can look very complicated but careful layout and ensuring consistent use of correleation names (aliases) can make it readable.

    If you get the indexes right on the underlying tables this can actually be stunningly fast, even though at first look it appears it will not be.

    If you post the exact table structures you will probably get nmore help.

    Something like this should be orders of magnitude faster than cursors etc...

    Mike John

  • Not sure if you've tried this already.

    DECLARE @MyTable TABLE(MYKEY INT,DT VARCHAR(10),A INT,B INT,C INT,D INT,E INT);

    INSERT INTO @MyTable(MYKEY,DT,A,B,C,D,E)

    SELECT 1 ,'1Jan', 1, 2, 3, 4, 5 UNION ALL

    SELECT 1 ,'2Jan', 6, 7, 8, 9, 10 UNION ALL

    SELECT 1 ,'3Jan', null, null, 3, null, 6 UNION ALL

    SELECT 1 ,'4Jan', 7, null, null, null, 1 UNION ALL

    SELECT 2 ,'5Jan', 4, 3, 4, 2, 3 UNION ALL

    SELECT 2 ,'6Jan', null, null, 2, 3, 1;

    WITH CTE AS (

    SELECT MYKEY,DT,

    A,

    ROW_NUMBER() OVER(PARTITION BY MYKEY ORDER BY CASE WHEN A IS NULL THEN 1 ELSE 0 END,DT DESC) AS A_rn,

    B,

    ROW_NUMBER() OVER(PARTITION BY MYKEY ORDER BY CASE WHEN B IS NULL THEN 1 ELSE 0 END,DT DESC) AS B_rn,

    C,

    ROW_NUMBER() OVER(PARTITION BY MYKEY ORDER BY CASE WHEN C IS NULL THEN 1 ELSE 0 END,DT DESC) AS C_rn,

    D,

    ROW_NUMBER() OVER(PARTITION BY MYKEY ORDER BY CASE WHEN D IS NULL THEN 1 ELSE 0 END,DT DESC) AS D_rn,

    E,

    ROW_NUMBER() OVER(PARTITION BY MYKEY ORDER BY CASE WHEN E IS NULL THEN 1 ELSE 0 END,DT DESC) AS E_rn

    FROM @MyTable)

    SELECT MYKEY,

    MAX(DT) AS DT,

    MAX(CASE WHEN A_rn=1 THEN A END) AS A,

    MAX(CASE WHEN B_rn=1 THEN B END) AS B,

    MAX(CASE WHEN C_rn=1 THEN C END) AS C,

    MAX(CASE WHEN D_rn=1 THEN D END) AS D,

    MAX(CASE WHEN E_rn=1 THEN E END) AS E

    FROM CTE

    GROUP BY MYKEY

    ORDER BY MYKEY;

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Mike John (12/6/2010)


    where .... "whatever determins the masters"

    Can you quantify that in code using the example because that is the crux of this problem.

    Also... you're using an UPDATE which modifies the original data which may not be allowed.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks to everyone for the suggestions, I will certainly give them a go. As things go, I've been dragged off on to several other items since starting this one, but again huge thanks for the suggestions - they will all be tried 🙂

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply