Getting last good data for each id from each column

  • I have a table keyed by an id and entry date that has a lot of data nulled out before the last row. I would like to get the last id/date with the last non-null value from each column. Here is an example:

    [font="Courier New"]Example Data

    id date col1 col2 col3 col4 col5 col6

    1 7/17/2016 aaaa NULL NULL 2222 1111 NULL

    1 7/18/2016 bbbb xxxx NULL NULL 2222 NULL

    1 7/19/2016 NULL NULL jjjj 1111 3333 NULL

    1 7/20/2016 NULL NULL NULL NULL 4444 8888

    1 7/21/2016 NULL NULL NULL NULL 5555 NULL

    2 7/17/2015 RRRR NULL NULL 9999 2222 NULL

    2 7/18/2015 LLLL qqqq NULL NULL 5555 NULL

    2 7/19/2015 NULL NULL NULL pppp 3737 NULL

    Desired Result

    id date col1 col2 col3 col4 col5 col6

    1 7/21/2016 bbbb xxxx jjjj 1111 5555 8888

    2 7/19/2015 LLLL qqqq pppp 3737 3333 NULL[/font]

    I can get the last non-null row for a column for each id with the following:

    select id, c1

    from (select id, max(date) dte from tbl a where col1 is not null group by id) a

    join tbl b on a.id = b.id and a.dte = b.date

    and I can repeat this in common table expressions and join them all together but am hitting performance and memory issues. I only need to do this once or twice. The real table has about 4 million rows and 30 columns of interest with 360000 ids. I can have hundreds of rows per ID. I've tried using LAG() but can only figure out how to go back a specific number of rows. Doing COALESCE() against a lot of lag values is not practical or probably even supported for several hundred LAG() values.

    I've also tried storing the results in temporary tables and then joining them together but still run in to space issues like the following error:

    Msg 1105, Level 17, State 2, Line 1

    Could not allocate space for object '<temporary system object: 422355172524032>' in database 'tempdb' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

    Next thing I'll try is creating a temporary table and then updating each column to it as it is calculated. Is there are better approach?

  • Here is a two-pass approach to get you the values you want. The first pass gets the maximum date within ID for which each column has a (non-null) value. The second pass then retrieves that value. I don't know what to tell you about your disk allocation, but if you are running into memory contention, I would suggest running this against ranges of IDs, rather than attempting the entire table at one time. For example, run IDs 1 - 99,999 followed by 100,000 to 199,999 etc.

    Hope this helps.

    declare @sampledata Table (ID int, [Date] Date , Column1 char(4), Column2 char(4), Column3 char(4), Column4 char(4), Column5 char(4), Column6 char(4), primary key (ID, [Date]))

    insert into @sampledata

    SELECT '1', '7/17/2016', 'aaaa', NULL, NULL, '2222 ', '1111 ', NULL union all

    SELECT '1', '7/18/2016', 'bbbb', 'xxxx ', NULL, NULL, '2222 ', NULL union all

    SELECT '1', '7/19/2016', NULL, NULL, 'jjjj ', '1111 ', '3333 ', NULL union all

    SELECT '1', '7/20/2016', NULL, NULL, NULL, NULL, '4444 ', '8888' union all

    SELECT '1', '7/21/2016', NULL, NULL, NULL, NULL, '5555 ', NULL union all

    SELECT '2', '7/17/2015', 'RRRR', NULL, NULL, '9999 ', '2222 ', NULL union all

    SELECT '2', '7/18/2015', 'LLLL', 'qqqq ', NULL, NULL, '5555 ', NULL union all

    SELECT '2', '7/19/2015', NULL, NULL, NULL, 'pppp ', '3737 ', NULL

    select * from @sampledata

    -- first pass, identify dates

    select id, max(case when column1 is not null then [Date] end) as Date1

    , max(case when Column2 is not null then [Date] end) as Date2

    , max(case when Column3 is not null then [Date] end) as Date3

    , max(case when Column4 is not null then [Date] end) as Date4

    , max(case when Column5 is not null then [Date] end) as Date5

    , max(case when Column6 is not null then [Date] end) as Date6

    , max([Date]) as MaxDate

    into #temp

    from @sampledata

    group by ID

    create unique clustered index PK_#temp on #temp(ID)

    -- second pass, retrieve values

    select s.ID, max(MaxDate) as [Date], max(case when [Date1] = [Date] then Column1 end) as Column1

    ,max(case when [Date2] = [Date] then Column2 end) as Column2

    ,max(case when [Date3] = [Date] then Column3 end) as Column3

    ,max(case when [Date4] = [Date] then Column4 end) as Column4

    ,max(case when [Date5] = [Date] then Column5 end) as Column5

    ,max(case when [Date6] = [Date] then Column6 end) as Column6

    from @sampleData s

    join #temp on #temp.ID = s.ID

    group by s.ID

    Edited to add: I think your sample data may be a little confused, you are showing 3737 for an expected result in column4, but that value only appears in column5.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • That's because LAG is for a relative position. FIRST_VALUE and LAST_VALUE are for absolute positions. You do have to get a little tricky, because you need to ensure that all the NULL values group on one end of your ORDER BY clause. You should also be aware that FIRST_VALUE and LAST_VALUE require a frame. If you don't specify one, it will use the default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which will not give you the desired results for LAST_VALUE.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Actually, I think that the Quirky Update[/url] method might be the fastest approach to doing this. You have to be very careful to follow ALL of the rules.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • THANK YOU! That worked very well. Instead of using #temp I put it in a CTE and it ran in 11 seconds which is acceptable performance for my needs.

  • Try this. I get the results that you want, but you should try it in your system to check the performance.

    with ct1

    as

    (

    select t.ID,t.Date, case when t.Column1 is null then t.Column1 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+ '-'+t.Column1 end as 'Column1',

    case when t.Column2 is null then t.Column2 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+'-'+ t.Column2 end as 'Column2',

    case when t.Column3 is null then t.Column3 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+ '-'+t.Column3 end as 'Column3',

    case when t.Column4 is null then t.Column4 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+ '-'+t.Column4 end as 'Column4',

    case when t.Column5 is null then t.Column5 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+ '-'+t.Column5 end as 'Column5',

    case when t.Column6 is null then t.Column6 else cast(row_number()over(partition by t.id order by t.id) as varchar(20) )+ '-'+t.Column6 end as 'Column6'

    from #t as t

    )

    ,

    cte2

    as

    (

    select c1.Id, max(c1.date) as Date, Max(column1) as column1, max(column2) as column2, max(column3) as column3,

    max(column4) as column4,max(column5) as column5, max(column6) as column6

    from ct1 as c1

    group by c1.id

    )

    select c2.Id,

    c2.Date,

    Right(c2.column1,len(c2.column1)-2) as column1,

    Right(c2.column2,len(c2.column2)-2) as column2,

    Right(c2.column3,len(c2.column3)-2) as column3,

    Right(c2.column4,len(c2.column4)-2) as column4,

    Right(c2.column5,len(c2.column5)-2) as column5,

    Right(c2.column6,len(c2.column6)-2) as column6

    from cte2 as c2

  • Quick suggestion, slightly different but easy to scale on large number of columns

    😎

    USE TEEST;

    GO

    SET NOCOUNT ON;

    --

    declare @sampledata Table

    (

    ID int

    , [Date] Date

    , Column1 char(4)

    , Column2 char(4)

    , Column3 char(4)

    , Column4 char(4)

    , Column5 char(4)

    , Column6 char(4)

    , primary key (ID, [Date])

    );

    insert into @sampledata

    SELECT '1', '7/17/2016', 'aaaa', NULL, NULL, '2222 ', '1111 ', NULL union all

    SELECT '1', '7/18/2016', 'bbbb', 'xxxx ', NULL, NULL, '2222 ', NULL union all

    SELECT '1', '7/19/2016', NULL, NULL, 'jjjj ', '1111 ', '3333 ', NULL union all

    SELECT '1', '7/20/2016', NULL, NULL, NULL, NULL, '4444 ', '8888' union all

    SELECT '1', '7/21/2016', NULL, NULL, NULL, NULL, '5555 ', NULL union all

    SELECT '1', '7/22/2016', NULL, NULL, NULL, NULL, NULL, NULL union all

    SELECT '2', '7/17/2015', 'RRRR', NULL, NULL, '9999 ', '2222 ', NULL union all

    SELECT '2', '7/18/2015', 'LLLL', 'qqqq ', NULL, NULL, '5555 ', NULL union all

    SELECT '2', '7/19/2015', NULL, NULL, 'abcd ', NULL, NULL, NULL union all

    SELECT '2', '7/20/2015', NULL, NULL, NULL, 'pppp ', '3737 ', NULL;

    ;WITH BASE_DATA AS

    (

    SELECT

    SD.ID

    ,SD.Date

    ,X.COLID

    ,X.COLVAL

    FROM @sampledata SD

    CROSS APPLY

    (

    SELECT 1, SD.Column1 UNION ALL

    SELECT 2, SD.Column2 UNION ALL

    SELECT 3, SD.Column3 UNION ALL

    SELECT 4, SD.Column4 UNION ALL

    SELECT 5, SD.Column5 UNION ALL

    SELECT 6, SD.Column6

    ) X(COLID,COLVAL)

    )

    ,LAST_NOT_NULL_ID_DATE AS

    (

    SELECT

    BD.ID

    ,BD.COLID

    ,MAX(BD.Date) AS LAST_DATE

    ,MAX(BD.COLVAL) AS LAST_VALUE

    FROM BASE_DATA BD

    WHERE BD.COLVAL IS NOT NULL

    GROUP BY BD.ID

    ,BD.COLID

    )

    SELECT

    LD.ID

    ,MAX(LD.LAST_DATE ) AS LAST_DATE

    ,MAX(CASE WHEN LD.COLID = 1 THEN LD.LAST_VALUE END) AS COL1

    ,MAX(CASE WHEN LD.COLID = 2 THEN LD.LAST_VALUE END) AS COL2

    ,MAX(CASE WHEN LD.COLID = 3 THEN LD.LAST_VALUE END) AS COL3

    ,MAX(CASE WHEN LD.COLID = 4 THEN LD.LAST_VALUE END) AS COL4

    ,MAX(CASE WHEN LD.COLID = 5 THEN LD.LAST_VALUE END) AS COL5

    ,MAX(CASE WHEN LD.COLID = 6 THEN LD.LAST_VALUE END) AS COL6

    FROM LAST_NOT_NULL_ID_DATE LD

    GROUP BY LD.ID

    ;

    Output

    ID LAST_DATE COL1 COL2 COL3 COL4 COL5 COL6

    ----------- ---------- ---- ---- ---- ---- ---- ----

    1 2016-07-21 bbbb xxxx jjjj 2222 5555 8888

    2 2015-07-20 RRRR qqqq abcd pppp 5555 NULL

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply