Tricky SQL Query with Values based on previous sequential values

  • I had an odd requirement recently which involved filling in gaps in sequences with values based on the last value before the gap and the first value after the gap. Basically, we needed to fill in missing yields for a date/service combination by assuming a straight line between the last known value and the next known value. Probably easier to explain with data:

    create table mytable (id int identity(1,1) PRIMARY KEY NONCLUSTERED, DATE datetime NOT NULL, service varchar(5) NOT NULL, yield decimal(10,5))

    GO

    CREATE CLUSTERED INDEX IX_myTable_service_date

    ON MyTable (service, date);

    GO

    Insert into mytable (Date, service,yield)

    Select '2009-09-17','A0001',.1

    UNION ALL

    Select '2009-09-18','A0001',NULL

    UNION ALL

    Select '2009-09-19','A0001',NULL

    UNION ALL

    Select '2009-09-20','A0001',.3

    UNION ALL

    Select '2009-09-21','A0001',.2

    UNION ALL

    Select '2009-09-22','A0001',.1

    UNION ALL

    Select '2009-09-23','A0001',NULL

    UNION ALL

    Select '2009-09-24','A0001',NULL

    UNION ALL

    Select '2009-09-25','A0001',NULL

    UNION ALL

    Select '2009-09-26','A0001',.7

    UNION ALL

    Select '2009-09-27','A0001',.6

    UNION ALL

    Select '2009-09-28','A0001',.6

    UNION ALL

    Select '2009-09-29','A0001',.6

    UNION ALL

    Select '2009-09-30','A0001',.4

    UNION ALL

    Select '2009-09-17','A0002',.1

    UNION ALL

    Select '2009-09-18','A0002',NULL

    UNION ALL

    Select '2009-09-19','A0002',NULL

    UNION ALL

    Select '2009-09-20','A0002',.3

    UNION ALL

    Select '2009-09-21','A0002',.2

    UNION ALL

    Select '2009-09-22','A0002',.1

    UNION ALL

    Select '2009-09-23','A0002',NULL

    UNION ALL

    Select '2009-09-24','A0002',NULL

    UNION ALL

    Select '2009-09-25','A0002',NULL

    UNION ALL

    Select '2009-09-26','A0002',.7

    UNION ALL

    Select '2009-09-27','A0002',.6

    UNION ALL

    Select '2009-09-28','A0002',.6

    UNION ALL

    Select '2009-09-29','A0002',.6

    UNION ALL

    Select '2009-09-30','A0002',.4

    The table should look as follows after the update(don't worry about rounding differences):

    12009-09-17 00:00:00.000A00010.10000

    22009-09-18 00:00:00.000A00010.16667

    32009-09-19 00:00:00.000A00010.23333

    42009-09-20 00:00:00.000A00010.30000

    52009-09-21 00:00:00.000A00010.20000

    62009-09-22 00:00:00.000A00010.10000

    72009-09-23 00:00:00.000A00010.25000

    82009-09-24 00:00:00.000A00010.40000

    92009-09-25 00:00:00.000A00010.55000

    102009-09-26 00:00:00.000A00010.70000

    112009-09-27 00:00:00.000A00010.60000

    122009-09-28 00:00:00.000A00010.60000

    132009-09-29 00:00:00.000A00010.60000

    142009-09-30 00:00:00.000A00010.40000

    152009-09-17 00:00:00.000A00020.10000

    162009-09-18 00:00:00.000A00020.16667

    172009-09-19 00:00:00.000A00020.23333

    182009-09-20 00:00:00.000A00020.30000

    192009-09-21 00:00:00.000A00020.20000

    202009-09-22 00:00:00.000A00020.10000

    212009-09-23 00:00:00.000A00020.25000

    222009-09-24 00:00:00.000A00020.40000

    232009-09-25 00:00:00.000A00020.55000

    242009-09-26 00:00:00.000A00020.70000

    252009-09-27 00:00:00.000A00020.60000

    262009-09-28 00:00:00.000A00020.60000

    272009-09-29 00:00:00.000A00020.60000

    282009-09-30 00:00:00.000A00020.40000

    Obviously there a several RBAR solutions that could accomplish this but in reality, this data is approx. 250,000+ rows as it includes all dates from today to today+365 and there are 700+ services.

    There will also be many more services in the future, so I wanted to future proof it as much as possible.

    After much banging my head against a wall, I did come up with a set-based solution as follows:

    update m

    set m.yield=LowerYield.Yield + (((UpperYield.Yield-LowerYield.Yield)/(Datediff(Day,LowerYield.Date1,UpperYield.Date1))) * Datediff(Day,LowerYield.Date1,m.date))

    from mytable m

    inner join (

    select m.date,m.service, m1.date as date1, m1.yield as yield, row_number() over (partition by m.service,m.date order by m1.date desc) as sequence

    from mytable m

    inner join mytable m1 on m1.service=m.service

    where m.yield is null and m1.yield is not null

    and m.date > m1.date

    ) LowerYield on LowerYield.date=m.date and Loweryield.service=m.service and LowerYield.Sequence=1

    inner join (

    select m.date,m.service, m1.date as date1, m1.yield as yield, row_number() over (partition by m.service,m.date order by m1.date) as sequence

    from mytable m

    inner join mytable m1 on m1.service=m.service

    where m.yield is null and m1.yield is not null

    and m.date < m1.date
    ) UpperYield on UpperYield.date=m.date and UpperYield.service=m.service and UpperYield.Sequence=1

    select * from mytable
    drop table mytable[/code]

    I'm happy with the performance from this (approx. 44 secs on the full data set on quad core 2.2Ghz, 16gb RAM, RAID 10 SCSI disks), but would like to see if anyone can come up with a more elegant solution!

    This is a real requirement (with table/column names changed and fake data). If anyone's interested, it's to be used to order a list of web results according to their profitability (and other factors). As the data's relatively static, it makes most sense to do this as a once a day batch job from the sql server rather than attempt this real-time where performance is the key driver.

  • Had I had the full data, maybe I cuold come up with something else, but all in all it looks OK..

    One thing though: Wouldn't it be better if you added an extra WHERE to the last row?

    WHERE m.yield IS NULL

    Because I guess you only want to update empty rows. It might make a big difference when it comes to large sets...

  • Well the Inner Joins restrict the data set to only NULL yields anyway, but I'll give it a spin and see if the execution plan/performance changes after adding that.

    I'll also see if I can provide some sql to give a representative number of rows/spread of data to the real thing.

    Nothing urgent, I'm happy with the solution, I've just spent quite a while looking at this and feel I must have missed a simpler solution.

    It's kind of similar to a running totals problem, but you also have to consider what's 'in front' of the row as well as 'behind' it.

  • UPDATE t SET yield = b.yield + DATEDIFF(d,b.date,t.date)*1. / DATEDIFF(d,b.date,a.date) * (a.yield - b.yield)

    FROM

    dbo.mytable t JOIN

    (SELECT id,

    (SELECT TOP 1 id

    FROM dbo.mytable

    WHERE service = m.service AND date m.date AND yield IS NOT NULL

    ORDER BY date) above

    FROM dbo.mytable m

    WHERE yield IS NULL)n ON t.id = n.id

    JOIN dbo.mytable a ON n.above = a.id

    JOIN dbo.mytable b ON n.below = b.id;

  • Nice one. Hadn't thought of doing that!

    Well, with my data, the original method and with Benyos's clause gave consistently the same time (give or take .5 secs), but the extra clause adds more 'logical reads'. Hans method was very slightly slower in all runs and was more expensive in CPU, but there's not much in it.

    Results below - I dropped and recreated the source table on each run and ran each one 3 times to make sure the results were consistent:

    --Hans

    Table 'mytable'. Scan count 561052, logical reads 2395243, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 4, logical reads 140278, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 61969 ms, elapsed time = 16616 ms.

    --Benyos

    Table 'mytable'. Scan count 144951, logical reads 673414, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 4, logical reads 140286, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 26499 ms, elapsed time = 13803 ms.

    --Mine

    Table 'mytable'. Scan count 82612, logical reads 474057, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 4, logical reads 140286, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 26065 ms, elapsed time = 13586 ms.

  • I love a CTE ... 🙂

    with cteGaps(Date,Service,yield,Rown)

    as

    (

    Select Date,Service,yield,row_number() over (partition by service ,case when yield is null then 1 else 0 end order by date desc)

    from #mytable

    )

    ,

    cteRanges(Service,StartDate,EndDate)

    as

    (

    select service,min(date),max(date)

    from cteGaps where yield is null

    group by service,date+rown

    ),

    CteStepMaths(Service,StartDate,EndDate,StepYieldAmt,PrevYield)

    as

    (

    Select cteRanges.Service,

    Startdate,

    EndDate,

    StepYieldAmt = convert(decimal(10,5),convert(decimal(10,5),post.yield - Prev.Yield) / convert(decimal(10,5),(datediff(dd,StartDate,EndDate)+2.00000000))),

    Prev.Yield

    from cteRanges join #mytable Prev on Prev.service = cteRanges.Service

    and Prev.Date = cteRanges.StartDate - 1

    join #mytable Post on Post.service = cteRanges.Service

    and Post.Date = cteRanges.EndDate +1

    )

    ,

    cteStepsSoln(Service,Date,Yield)

    as

    (

    Select #MyTable.Service,

    #MyTable.Date,

    PrevYield+(StepYieldAmt * (DateDiff(dd,StartDate,#MyTable.date)+1))

    from CteStepMaths

    join #MyTable

    on #MyTable.Service = CteStepMaths.Service

    and #MyTable.Date between CteStepMaths.StartDate and CteStepMaths.EndDate

    )

    Select #myTable.id,#myTable.DATE ,#myTable.service ,coalesce(#myTable.Yield,cteStepsSoln.Yield)

    from #myTable

    left outer join cteStepsSoln

    on #myTable.Service = cteStepsSoln.Service

    and #myTable.Date = cteStepsSoln.Date



    Clear Sky SQL
    My Blog[/url]

  • This makes my head hurt, but trounces my solution 😀 (I've converted yours to update the source table as per the query instead of outputing to select). Results below:

    Table 'mytable'. Scan count 5590, logical reads 865294, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 4, logical reads 574664, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 10059 ms, elapsed time = 5385 ms.

  • Well , the challenging part is sorting out the missing data ranges.

    Its an extension on this technique.

    http://sqlblogcasts.com/blogs/sqlandthelike/archive/2009/08/27/sql-and-contiguous-data-ranges.aspx



    Clear Sky SQL
    My Blog[/url]

  • Ah, yes, I understand now. It was mainly the +2.000000 in the calculation that was confusing me, but I get it now, it's just that you're picking the dates inside the gap rather than outside.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply