Finding the most recent value for many record in a detail table (optimizing)

  • My system has a number of history tables that would give the date that a value has changed.

    I found a very close analogue to this in Adventure works looking up products and seeing what the most recent value was as of a certain date.

    I can think of two ways to do this:

    declare @ReportingDate DATE = '9/7/2004'

    -- Method 1

    SELECT

    Name

    ,ReportDate = @ReportingDate

    ,MostRecentOrderDetailId = (SELECT TOP 1 SalesOrderDetailId FROM sales.SalesOrderDetail SalesOrderDetail WHERE SalesOrderDetail.ProductID = Product.ProductID AND SalesOrderDetail.ModifiedDate <= @ReportingDate ORDER BY ModifiedDate DESC)

    FROM Production.Product Product

    -- Method 2

    SELECT

    Name

    ,ReportDate = @ReportingDate

    ,MostRecent.MostRecentOrderDetailId

    FROM Production.Product Product

    OUTER APPLY (SELECT TOP 1 MostRecentOrderDetailId = SalesOrderDetailId FROM sales.SalesOrderDetail SalesOrderDetail WHERE SalesOrderDetail.ProductID = Product.ProductID AND SalesOrderDetail.ModifiedDate <= @ReportingDate ORDER BY ModifiedDate DESC) MostRecent

    Both produce a near identical query plan. Method 2 would be a clear winner of the two if I wanted to retrieve more than a single value from the detail table (say price and quantity).

    Is there a better way to do this overall without refactoring how the data is stored?

    Best performance I can think of would be to have a daily reporting table that precalculates the requested values cross applied with a dates table to give the value for every date. Then it would be a simple join rather than a subquery/cross apply.

    Even using a reporting table (DailySalesOrderDetail), these queries need to be optimized because they'll be needed to recalculate the the reporting whenever the SalesOrderDetail table changes.

    IF OBJECT_ID('tempdb..#DailyOrderDetail') IS NOT NULL DROP TABLE #DailyOrderDetail

    SELECT DateFull, Name, MostRecentOrderDetailId

    INTO #DailyOrderDetail

    FROM edimain.dbo.DateLookup

    CROSS APPLY (SELECT

    Name

    ,MostRecentOrderDetailId = (SELECT TOP 1 SalesOrderDetailId FROM sales.SalesOrderDetail SalesOrderDetail WHERE SalesOrderDetail.ProductID = Product.ProductID AND SalesOrderDetail.ModifiedDate <= DateFull ORDER BY ModifiedDate DESC)

    FROM Production.Product Product) MostRecent

    WHERE DateFull BETWEEN '1/1/2002' and '1/1/2005'

    CREATE NONCLUSTERED INDEX IX_DailyOrderDetail_DateFull

    ON #DailyOrderDetail ([DateFull])

    INCLUDE ([Name],[MostRecentOrderDetailId])

    declare @ReportingDate DATE = '9/7/2004'

    -- Method 3

    SELECT

    d.Name

    ,ReportDate = @ReportingDate

    ,MostRecentOrderDetailId = d.MostRecentOrderDetailId

    FROM Production.Product Product

    INNER JOIN #DailyOrderDetail D on D.DateFull = @ReportingDate AND D.Name = Product.Name

    Given the reporting table already existing, Method 3 (just a join to a denormalized reporting table) doesn't even register as a single percentage point in comparison to Method 1 and Method 2.

    So if I could get Method 1 or Method 2 optimized further, that would be an interesting exercise. My application uses it quite a bit an it would be useful for optimizing a reporting table to use Method 3 as well.

    Is there a term for what sort of query this is?

  • tl;dr

    1) I think pre-aggregated data will always smoke real-time queries 🙂

    2) since you are on SQL 2012, have a look at LAST_VALUE. It might be a better option than your TOP 1 subquery. NOTE: the default window clause (RANGE BETEEN...) is HORRIBLY inefficient currently. Be sure to use ROWS BETWEEN...

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • The First_Value/Last_Value sounds really promising, but I'm not quite getting it yet.

    -- Method 1.5a

    SELECT DISTINCT

    Product.Name

    ,ReportDate = @ReportingDate

    ,MostRecentOrderDetailId = (SELECT LAST_VALUE(SalesOrderDetailId) OVER (PARTITION BY Product.ProductId order by SalesOrderDetail.ModifiedDate)

    FROM sales.SalesOrderDetail SalesOrderDetail WHERE SalesOrderDetail.ProductID = product.ProductID

    AND SalesOrderDetail.ModifiedDate <= @ReportingDate)

    FROM Production.Product Product

    -- Method 1.5b

    SELECT DISTINCT

    Product.Name

    ,ReportDate = @ReportingDate

    ,LAST_VALUE(SalesOrderDetailId) OVER (PARTITION BY Product.ProductId order by SalesOrderDetail.ModifiedDate)

    FROM Production.Product Product

    LEFT JOIN sales.SalesOrderDetail SalesOrderDetail ON SalesOrderDetail.ProductID = product.ProductID

    1.5a returns the fun error:

    Msg 512, Level 16, State 1, Line 11

    Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.

    When I try to add grouping to either, the window function I get this error:

    Column 'sales.SalesOrderDetail.ModifiedDate' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

    For every product I'm trying to get a single value (the Last_Value before a date), but I seem to be doing rather poorly at using Last_Value and getting it to group or return just a single value.

    Any advice on what I'm overlooking here?

  • This is a variation of the 'Top N per Group' problem, and there are many ways to express it in T-SQL. Whichever you choose, be sure to make the ORDER BY clause deterministic and provide a useful index. A deterministic ORDER BY simply means there should be no ties. In the AdventureWorks example, TOP (1) ... ORDER BY ModifiedDate DESC is not deterministic per product because multiple order detail rows can have the same ModifiedDate. A useful index for that query is:

    CREATE INDEX nc1

    ON Sales.SalesOrderDetail

    (

    ProductID,

    ModifiedDate DESC,

    SalesOrderDetailID DESC

    );

    Writing the query to be deterministic:

    SELECT

    p.Name,

    ReportDate = @ReportingDate,

    MostRecent.SalesOrderDetailID

    FROM Production.Product AS p

    OUTER APPLY

    (

    SELECT TOP (1)

    sod.SalesOrderDetailID

    FROM Sales.SalesOrderDetail AS sod

    WHERE

    sod.ProductID = p.ProductID

    AND sod.ModifiedDate <= @ReportingDate

    ORDER BY

    sod.ModifiedDate DESC,

    sod.SalesOrderDetailID DESC

    ) AS MostRecent;

    The query plan no longer features an index spool or sort, seeking straight to the required row per product:

  • How about this solutions...

    --- Create a InLine Function with ReportingDate Parameter --

    CREATE FUNCTION fnMostRecentOrderDetail(@ReportingDate datetime) RETURNS TABLE

    AS

    RETURN

    (

    SELECT ProductId, SalesOrderDetailId, ModifiedDate,

    ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY ModifiedDate DESC) RowNo

    FROM sales.SalesOrderDetail

    WHERE ModifiedDate < @ReportingDate

    )

    GO

    and then...

    -- Join the function with ReportingDate Parameter --

    declare @ReportingDate DATE = '9/7/2007'

    SELECT

    Name

    ,ReportDate = @ReportingDate

    ,MostRecent.ModifiedDate

    FROM Production.Product Product

    LEFT OUTER JOIN fnMostRecentOrderDetail(@ReportingDate) MostRecent ON Product.ProductID = MostRecent.ProductId and RowNo = 1

  • Add index suggested in Paul's reply, to improve performance of your query.

  • Paul White (5/24/2013)


    This is a variation of the 'Top N per Group' problem, and there are many ways to express it in T-SQL. Whichever you choose, be sure to make the ORDER BY clause deterministic and provide a useful index...

    Thank you for the correction! yes, on the query in my system that looks into history is deterministic, my AdventureWorks example wasn't as well thought out as it could be 🙂

    More importantly giving a good example of what the right index should get you in a query plan

  • Bhaskar.Shetty (5/24/2013)


    How about this solutions...

    >>SQL Removed<<

    Hmm, thank you. I hadn't quite thought of that as a third option to the Sub query and cross apply. In fact if I'd made a ITF, I think my mind would have drifted towards a cross apply

    I like how it's clean and I'm interested to see how it the RowNumber compares to the Top 1 performance wise. I'll post a followup comparison 🙂

  • Bhaskar.Shetty (5/24/2013)


    How about this solutions...

    --- Create a InLine Function with ReportingDate Parameter --

    CREATE FUNCTION fnMostRecentOrderDetail(@ReportingDate datetime) RETURNS TABLE

    AS

    RETURN

    (

    SELECT ProductId, SalesOrderDetailId, ModifiedDate,

    ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY ModifiedDate DESC) RowNo

    FROM sales.SalesOrderDetail

    WHERE ModifiedDate < @ReportingDate

    )

    GO

    and then...

    -- Join the function with ReportingDate Parameter --

    declare @ReportingDate DATE = '9/7/2007'

    SELECT

    Name

    ,ReportDate = @ReportingDate

    ,MostRecent.ModifiedDate

    FROM Production.Product Product

    LEFT OUTER JOIN fnMostRecentOrderDetail(@ReportingDate) MostRecent ON Product.ProductID = MostRecent.ProductId and RowNo = 1

    I have a solution very similar now.

    Currently it's a inline table function that takes a product id and date, then returns the one reference back to the record.

    It uses the Row_Number() ORDER BY... where RowNumber = 1, but inside of the function, so it always just returns a single row.

    Currently it's not in the AdventureWorks db, it's in my actual db. But I'm interested in this sort of thing, so I'll probobly post up here the three different methods I'm seeing and the differences. AdventureWorks might be too small to really differentiate the performance, but we'll see.

    We have the Top 1 Desc (starting), Return Row_Number 1 for a very specific item (my new) and Return Row_Numbers and join on 1 to limit the results (actually returns multiple) that you gave me.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply