Combining / crunching data?

Question

Combining / crunching data?

Atat Rowcount

Default port

Points: 1453
More actions
December 4, 2010 at 8:50 am

#227854

Ok this will take some explaining.
If I have a table, columns MYKEY, DT, A, B, C, D, E with the following data:
MYKEY DT A B C D E
------------------------------------------------
1 1Jan 1 2 3 4 5
1 2Jan 6 7 8 9 10
1 3Jan null null 3 null 6
1 4Jan 7 null null null 1
2 5Jan 4 3 4 2 3
2 6Jan null null 2 3 1
What I want is to create a new table where MYKEY is unique and I take the most recent non-null value for each column. So in the above case I would end up with:
1 4Jan 7 7 3 9 1
2 6Jan 4 3 2 3 1
I will be doing this for very large data sets so the performance of the query is important.
What I have done to date is add a column called COPY, and use ROWNUMBER() partitioned by MYKEY to assign a COPY number to each row, and then in a loop, set each value to the one from the previous row where not null.
But it's slow slow slow when you work with large data sets, no matter how I try to index it.
Any ideas?

Viewing 6 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004683 More actions · Answer 1

Brian McGee-355525 (12/4/2010)
Ok this will take some explaining.
If I have a table, columns MYKEY, DT, A, B, C, D, E with the following data:
MYKEY DT A B C D E
------------------------------------------------
1 1Jan 1 2 3 4 5
1 2Jan 6 7 8 9 10
1 3Jan null null 3 null 6
1 4Jan 7 null null null 1
2 5Jan 4 3 4 2 3
2 6Jan null null 2 3 1
What I want is to create a new table where MYKEY is unique and I take the most recent non-null value for each column. So in the above case I would end up with:
1 4Jan 7 7 3 9 1
2 6Jan 4 3 2 3 1
I will be doing this for very large data sets so the performance of the query is important.
What I have done to date is add a column called COPY, and use ROWNUMBER() partitioned by MYKEY to assign a COPY number to each row, and then in a loop, set each value to the one from the previous row where not null.
But it's slow slow slow when you work with large data sets, no matter how I try to index it.
Any ideas?

Yes... I have some ideas... but I need to know, is your real data limited to just 5 columns?

I also need to know what PK is on the table and which columns the clustered index is on (and, yes, they can be different).

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Mike John SSCertifiable Points: 7216 More actions · Answer 2

I have done similar through what (at first sight) looks horrendous code.

Basically you can use an update statement that updates the required "master rows" and each column is derived from a subquery something like:

Update x

set

col1=(select top 1 t2.col1 from table as t2 where t2.col1 is not null and t2.k = x.k order by t2.datecol desc),

col2=(select top 1 t3.col2 from table as t3 where t3.col2 is not null and t3.k = x.k order by t3.datecol desc)

where .... "whatever determins the masters"

where:

datecol is the column in the table that tells you which is the "most recent" ie something like date_updated

and k is a column that is common across all the rows you want to merge.

These can look very complicated but careful layout and ensuring consistent use of correleation names (aliases) can make it readable.

If you get the indexes right on the underlying tables this can actually be stunningly fast, even though at first look it appears it will not be.

If you post the exact table structures you will probably get nmore help.

Something like this should be orders of magnitude faster than cursors etc...

Mike John

Mark Cowne One Orange Chip Points: 26987 More actions · Answer 3

Not sure if you've tried this already.

DECLARE @MyTable TABLE(MYKEY INT,DT VARCHAR(10),A INT,B INT,C INT,D INT,E INT);

INSERT INTO @MyTable(MYKEY,DT,A,B,C,D,E)

SELECT 1 ,'1Jan', 1, 2, 3, 4, 5 UNION ALL

SELECT 1 ,'2Jan', 6, 7, 8, 9, 10 UNION ALL

SELECT 1 ,'3Jan', null, null, 3, null, 6 UNION ALL

SELECT 1 ,'4Jan', 7, null, null, null, 1 UNION ALL

SELECT 2 ,'5Jan', 4, 3, 4, 2, 3 UNION ALL

SELECT 2 ,'6Jan', null, null, 2, 3, 1;

WITH CTE AS (

SELECT MYKEY,DT,

A,

ROW_NUMBER() OVER(PARTITION BY MYKEY ORDER BY CASE WHEN A IS NULL THEN 1 ELSE 0 END,DT DESC) AS A_rn,

B,

ROW_NUMBER() OVER(PARTITION BY MYKEY ORDER BY CASE WHEN B IS NULL THEN 1 ELSE 0 END,DT DESC) AS B_rn,

C,

ROW_NUMBER() OVER(PARTITION BY MYKEY ORDER BY CASE WHEN C IS NULL THEN 1 ELSE 0 END,DT DESC) AS C_rn,

D,

ROW_NUMBER() OVER(PARTITION BY MYKEY ORDER BY CASE WHEN D IS NULL THEN 1 ELSE 0 END,DT DESC) AS D_rn,

E,

ROW_NUMBER() OVER(PARTITION BY MYKEY ORDER BY CASE WHEN E IS NULL THEN 1 ELSE 0 END,DT DESC) AS E_rn

FROM @MyTable)

SELECT MYKEY,

MAX(DT) AS DT,

MAX(CASE WHEN A_rn=1 THEN A END) AS A,

MAX(CASE WHEN B_rn=1 THEN B END) AS B,

MAX(CASE WHEN C_rn=1 THEN C END) AS C,

MAX(CASE WHEN D_rn=1 THEN D END) AS D,

MAX(CASE WHEN E_rn=1 THEN E END) AS E

FROM CTE

GROUP BY MYKEY

ORDER BY MYKEY;

____________________________________________________

Deja View - The strange feeling that somewhere, sometime you've optimised this query before

How to get the best help on a forum

http://www.sqlservercentral.com/articles/Best+Practices/61537

Jeff Moden SSC Guru Points: 1004683 More actions · Answer 4

Mike John (12/6/2010)
where .... "whatever determins the masters"

Can you quantify that in code using the example because that is the crux of this problem.

Also... you're using an UPDATE which modifies the original data which may not be allowed.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Atat Rowcount Default port Points: 1453 More actions · Answer 5

Thanks to everyone for the suggestions, I will certainly give them a go. As things go, I've been dragged off on to several other items since starting this one, but again huge thanks for the suggestions - they will all be tried 🙂