APPLY versus ROW_NUMBER versus double JOIN to retrieve most recent record from a foreign key table

  • I've been mulling over this for a while, and there's something about the APPLY operator (OUTER APPLY in this case) that bugs me: it seems to force the contents of the apply to be executed once for every input row. I'm sure I've heard people say this shouldn't be the case, so I've finally got round to putting all the sample code together to find out where I'm going wrong.

    Basically, I have two tables, ServicerHistory and ServicerHistoryCommissionRate. ServicerHistory contains details of an account and who is servicing it (I've only included the ServicerHistoryID in the sample below, as the rest isn't pertinent to this example), whilst ServicerHistoryCommissionRate contains the commission rates (if there are any) applied to that account and their effective dates.

    DDL for the tables and indexes:-

    --create objects
    IF OBJECT_ID('dbo.ServicerHistoryCommissionRate') IS NOT NULL
        DROP TABLE dbo.ServicerHistoryCommissionRate
    IF OBJECT_ID('dbo.ServicerHistory') IS NOT NULL
        DROP TABLE dbo.ServicerHistory

    CREATE TABLE dbo.ServicerHistory
        (ServicerHistoryID int IDENTITY(1,1) PRIMARY KEY)
    GO

    CREATE TABLE dbo.ServicerHistoryCommissionRate
        (ServicerHistoryCommissionRateID int IDENTITY(1,1) PRIMARY KEY
        , ServicerHistoryID int NOT NULL FOREIGN KEY REFERENCES dbo.ServicerHistory (ServicerHistoryID)
        , CommissionRate decimal(5,2) NOT NULL
        , EffectiveDate date NOT NULL)
    GO

    CREATE INDEX idx_ServicerHistoryID_EffectiveDate
        ON dbo.ServicerHistoryCommissionRate (ServicerHistoryID, EffectiveDate DESC)
        INCLUDE (CommissionRate)
    GO

    And some data population (adjust the execution counts as you see fit):-

    --populate with some sample data
    INSERT INTO dbo.ServicerHistory
    DEFAULT VALUES
    GO 100000

    --set up some dates
    DECLARE @Dates TABLE
        (SomeDate date NOT NULL)

    INSERT INTO @Dates (SomeDate)
    SELECT '2011-07-12' UNION ALL
    SELECT '2013-03-02' UNION ALL
    SELECT '2010-08-13' UNION ALL
    SELECT '2011-01-02' UNION ALL
    SELECT '2013-05-03' UNION ALL
    SELECT '2009-12-18'

    --set up commission rates
    DECLARE @CommRates TABLE
        (CommissionRate decimal(5,2))

    INSERT INTO @CommRates (CommissionRate)
    SELECT 32.35 UNION ALL
    SELECT 25 UNION ALL
    SELECT 12.8 UNION ALL
    SELECT 10 UNION ALL
    SELECT 7.4

    --populate some servicer history ids with valid dates and commission rates
    INSERT INTO dbo.ServicerHistoryCommissionRate
    SELECT TOP 20000 sh.ServicerHistoryID
        , CommRate.CommissionRate
        , Dates.SomeDate
    FROM dbo.ServicerHistory sh
    CROSS APPLY (SELECT TOP 1 CommissionRate
                 FROM @CommRates AS cr
                 ORDER BY NEWID()) CommRate
    CROSS APPLY (SELECT TOP 1 SomeDate
                 FROM @Dates AS d
                 ORDER BY NEWID()) Dates
    LEFT OUTER JOIN dbo.ServicerHistoryCommissionRate shcr
        ON sh.ServicerHistoryID = shcr.ServicerHistoryID
        AND Dates.SomeDate = shcr.EffectiveDate
    WHERE shcr.ServicerHistoryCommissionRateID IS NULL
    ORDER BY NEWID()
    GO 50

    The three methods of retrieving the data that I've come up with are below:-

    --retrieve the most recent commission rate for each row using outer apply (there may not be a rate)
    --uh oh, the subquery is executed once for each row in the ServicerHistory table
    SELECT sh.ServicerHistoryID
        , shrc.CommissionRate
    FROM dbo.ServicerHistory AS sh
    OUTER APPLY (SELECT TOP 1 shrc.CommissionRate
                 FROM dbo.ServicerHistoryCommissionRate AS shrc
                 WHERE shrc.ServicerHistoryID = sh.ServicerHistoryID
                 ORDER BY shrc.EffectiveDate DESC
                ) shrc

    --try using ROW_NUMBER; whoo, each table is only touched once
    SELECT sh.ServicerHistoryID
        , shrc.CommissionRate
    FROM dbo.ServicerHistory AS sh
    LEFT OUTER JOIN (SELECT shrc.CommissionRate
                     , ServicerHistoryID
                     , ROW_NUMBER() OVER (PARTITION BY ServicerHistoryID ORDER BY EffectiveDate DESC) RowNum
                     FROM dbo.ServicerHistoryCommissionRate AS shrc
                    ) shrc
        ON sh.ServicerHistoryID = shrc.ServicerHistoryID
        AND shrc.RowNum = 1

    --try the old fashioned double join method; ServicerHistoryCommissionRate is touched twice, so it's more expensive
    SELECT sh.ServicerHistoryID
        , shrc.CommissionRate
    FROM dbo.ServicerHistory AS sh
    LEFT OUTER JOIN (SELECT commid.ServicerHistoryID
                     , MAX(ServicerHistoryCommissionRateID) ServicerHistoryCommissionRateID
                     FROM dbo.ServicerHistoryCommissionRate commid
                     INNER JOIN (SELECT ServicerHistoryID
                                 , MAX(EffectiveDate) MaxDate
                                 FROM dbo.ServicerHistoryCommissionRate
                                 GROUP BY ServicerHistoryID) commdate
                         ON commid.ServicerHistoryID = commdate.ServicerHistoryID
                         --restrict to rows at the latest EffectiveDate; without this predicate the
                         --MAX(ServicerHistoryCommissionRateID) would pick the highest ID regardless of date
                         AND commid.EffectiveDate = commdate.MaxDate
                     GROUP BY commid.ServicerHistoryID) comm
        ON sh.ServicerHistoryID = comm.ServicerHistoryID
    LEFT OUTER JOIN dbo.ServicerHistoryCommissionRate shrc
        ON comm.ServicerHistoryCommissionRateID = shrc.ServicerHistoryCommissionRateID
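    (For anyone reproducing the figures below: the post doesn't show the capture commands, but they were presumably gathered with the standard session options, something along these lines.)

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;
    --run each of the three queries above
    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;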

    STATISTICS TIME and IO return the following:-

    --OUTER APPLY
    (100000 row(s) affected)
    Table 'ServicerHistoryCommissionRate'. Scan count 100000, logical reads 319323, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'ServicerHistory'. Scan count 7, logical reads 484, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 639 ms, elapsed time = 845 ms.

    --ROW_NUMBER
    (100000 row(s) affected)
    Table 'ServicerHistoryCommissionRate'. Scan count 1, logical reads 1852, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'ServicerHistory'. Scan count 1, logical reads 163, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 327 ms, elapsed time = 1040 ms.

    --double join
    (100000 row(s) affected)
    Table 'ServicerHistory'. Scan count 7, logical reads 484, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'ServicerHistoryCommissionRate'. Scan count 14, logical reads 4098, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 767 ms, elapsed time = 1847 ms.

    The execution plans for the above are attached. As you can see, the first query using the APPLY method executes the index seek on ServicerHistoryCommissionRate 100k times (for the real data this comes from, there are several million rows in each table, so it only gets worse!), making it by far the most inefficient method of the three when returning a set.

    Am I missing something about how to implement APPLY when dealing with sets, or is it just not good for that kind of query? I like APPLY from a "readable code" point of view, but performance-wise it seems a bit horrible unless you're working with a very small set of rows.

    Any tips, pointers or observations are most welcome!

    Follow me on twitter @EvoDBA | Check out my blog Natural Selection DBA

  • Matthew Darwin (6/6/2013)


    ...

    Any tips, pointers or observations are most welcome!

    Have you tried using the row_number method with the outer apply so you have an even playing field?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • SELECT sh.ServicerHistoryID
        , shrc.CommissionRate
    FROM dbo.ServicerHistory sh
    OUTER APPLY (
        SELECT d.CommissionRate
        FROM (
            SELECT ishrc.CommissionRate
                , RowNum = ROW_NUMBER() OVER (PARTITION BY ServicerHistoryID ORDER BY EffectiveDate DESC)
            FROM dbo.ServicerHistoryCommissionRate ishrc
            WHERE ishrc.ServicerHistoryID = sh.ServicerHistoryID
        ) d
        WHERE d.RowNum = 1
    ) shrc


  • I've not, but that's partly my point: it isn't an even playing field, and despite APPLY seeming to be designed for this kind of use, in reality it doesn't fare so well. I'd guess that adding ROW_NUMBER to the APPLY will make it worse, because instead of simply running 100k SELECT TOP 1 queries, you're running 100k queries that each generate row numbers for their subquery, whereas with a join you only need a single pass to generate the row numbers (especially as you can't filter on the output of ROW_NUMBER until after it's been produced).

    I'm assuming you mean something like this:-

    SELECT sh.ServicerHistoryID
        , shrc.CommissionRate
    FROM dbo.ServicerHistory AS sh
    OUTER APPLY (SELECT shrc.CommissionRate
                 , ServicerHistoryID
                 , ROW_NUMBER() OVER (PARTITION BY ServicerHistoryID ORDER BY EffectiveDate DESC) RowNum
                 FROM dbo.ServicerHistoryCommissionRate AS shrc
                 WHERE shrc.ServicerHistoryID = sh.ServicerHistoryID
                ) shrc
    WHERE shrc.RowNum = 1


  • Surely this is because the OUTER APPLY contains a filter based on the outer table, which will cause an RBAR operation, as each row has to be evaluated individually.

    The LEFT OUTER JOIN is a set operation.

    SSC Guide to Posting and Best Practices

  • That's entirely my conclusion; however, I've had people tell me that it shouldn't do that (unless they're using it in another way where the filter isn't applied in that manner, which may well be my problem).

    It just seems to make the APPLY operator a little pointless to me from a performance stance.

    I have a sneaky suspicion I may just not quite be getting it though. 😀


  • APPLY is definitely not the fastest method in every scenario, but there are plenty where it can outstrip other approaches.

    In this particular comparison against ROW_NUMBER, although it's not the most efficient, note that it's the only one of the three that enables a seek against your index. With other distributions of test data, it could well end up being the most efficient method.

    Unless the query can be rewritten by the optimiser into a JOIN (in which case it becomes identical), you will always end up with a nested loops plan that looks like this. Remember that APPLY is also an efficient way of calling TVFs (CLR and T-SQL), including XMLQuery, and that's one of its primary uses.
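    For illustration, a minimal sketch of that TVF pattern (the function dbo.fn_LatestCommissionRate is a made-up name, not something from this thread): an inline TVF wrapping the TOP 1 lookup has to be invoked per outer row via APPLY, but because it's inline the optimiser expands it into the plan like a view rather than executing it as a black box.

    CREATE FUNCTION dbo.fn_LatestCommissionRate (@ServicerHistoryID int)
    RETURNS TABLE
    AS
    RETURN
    (SELECT TOP 1 shcr.CommissionRate, shcr.EffectiveDate
     FROM dbo.ServicerHistoryCommissionRate AS shcr
     WHERE shcr.ServicerHistoryID = @ServicerHistoryID
     ORDER BY shcr.EffectiveDate DESC)
    GO

    --usage: one logical invocation per outer row, same nested loops + seek shape as the raw OUTER APPLY
    SELECT sh.ServicerHistoryID, lcr.CommissionRate
    FROM dbo.ServicerHistory AS sh
    OUTER APPLY dbo.fn_LatestCommissionRate(sh.ServicerHistoryID) AS lcr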

    I'm not sure I'd define a plan that uses Nested Loops + Index Seeks as RBAR; a JOIN produces the same shape in many scenarios.

  • If we skew your test data so there are a lot of rows per "ServicerHistoryID" that are not the top 1, you can see a dramatic switch in the cost of the three queries.

    Test data setup:

    --create objects
    IF OBJECT_ID('dbo.ServicerHistoryCommissionRate') IS NOT NULL
        DROP TABLE dbo.ServicerHistoryCommissionRate
    IF OBJECT_ID('dbo.ServicerHistory') IS NOT NULL
        DROP TABLE dbo.ServicerHistory

    CREATE TABLE dbo.ServicerHistory
        (ServicerHistoryID int IDENTITY(1,1) PRIMARY KEY, SomeOtherColumn BIT NULL)
    GO

    CREATE TABLE dbo.ServicerHistoryCommissionRate
        (ServicerHistoryCommissionRateID int IDENTITY(1,1) PRIMARY KEY
        , ServicerHistoryID int NOT NULL FOREIGN KEY REFERENCES dbo.ServicerHistory (ServicerHistoryID)
        , CommissionRate decimal(5,2) NOT NULL
        , EffectiveDate date NOT NULL)
    GO

    CREATE INDEX idx_ServicerHistoryID_EffectiveDate
        ON dbo.ServicerHistoryCommissionRate (ServicerHistoryID, EffectiveDate DESC)
        INCLUDE (CommissionRate)
    GO

    --populate with some sample data
    --GO 100000 was taking ages, so added a meaningless column to do this in a set-based way :)
    INSERT INTO dbo.ServicerHistory
    SELECT TOP 1000 CAST(NULL AS BIT) FROM master.sys.columns a, master.sys.columns b

    --set up some dates (added a lot more dates to skew the data distribution more towards seeks winning)
    DECLARE @Dates TABLE
        (SomeDate date NOT NULL)

    INSERT INTO @Dates (SomeDate)
    SELECT DATEADD(day, n, '20100101')
    FROM (SELECT TOP 3000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) n
          FROM master.sys.columns a, master.sys.columns b) nums

    --populate some servicer history ids with valid dates and commission rates
    INSERT INTO dbo.ServicerHistoryCommissionRate
    SELECT sh.ServicerHistoryID
        , CHECKSUM(NEWID())/100000000.0 CommissionRate
        , Dates.SomeDate
    FROM dbo.ServicerHistory sh
    CROSS APPLY (SELECT TOP 1000 SomeDate
                 FROM @Dates AS d
                 ORDER BY NEWID()) Dates
    LEFT OUTER JOIN dbo.ServicerHistoryCommissionRate shcr
        ON sh.ServicerHistoryID = shcr.ServicerHistoryID
        AND Dates.SomeDate = shcr.EffectiveDate
    WHERE shcr.ServicerHistoryCommissionRateID IS NULL

    Three queries identical to the originals. Plans attached.

    --APPLY
    (1000 row(s) affected)
    Table 'ServicerHistoryCommissionRate'. Scan count 1000, logical reads 3213, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'ServicerHistory'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (1 row(s) affected)
    SQL Server Execution Times:
       CPU time = 16 ms, elapsed time = 15 ms.

    --ROW_NUMBER()
    (1000 row(s) affected)
    Table 'ServicerHistoryCommissionRate'. Scan count 1, logical reads 3137, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'ServicerHistory'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (1 row(s) affected)
    SQL Server Execution Times:
       CPU time = 468 ms, elapsed time = 475 ms.

    --Double Join
    (1000 row(s) affected)
    Table 'ServicerHistoryCommissionRate'. Scan count 1, logical reads 6208, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'ServicerHistory'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (1 row(s) affected)
    SQL Server Execution Times:
       CPU time = 437 ms, elapsed time = 440 ms.

  • OK, now that's definitely interesting; I'll try it on some larger datasets with that sort of skew later and post back (just done this, and it stays consistent with your findings).

    I had also noticed that if I take a small set of ServicerHistoryIDs as an input to the main ServicerHistory table, the APPLY operator worked better; I'm assuming that's because the ROW_NUMBER function still has to be calculated for the entire ServicerHistoryCommissionRate table, whereas the TOP 1 in the APPLY only has to run for the filtered set. In those cases, the double join is also quicker than the ROW_NUMBER version.

    DECLARE @ServicerHistoryIDs TABLE
        (ServicerHistoryID int PRIMARY KEY)

    INSERT INTO @ServicerHistoryIDs
    SELECT TOP 100 ServicerHistoryID
    FROM dbo.ServicerHistory

    --retrieve the most recent commission rate for each row using outer apply (there may not be a rate)
    SELECT sh.ServicerHistoryID
        , shrc.CommissionRate
    FROM dbo.ServicerHistory AS sh
    INNER JOIN @ServicerHistoryIDs AS shid
        ON sh.ServicerHistoryID = shid.ServicerHistoryID
    OUTER APPLY (SELECT TOP 1 shrc.CommissionRate
                 FROM dbo.ServicerHistoryCommissionRate AS shrc
                 WHERE shrc.ServicerHistoryID = sh.ServicerHistoryID
                 ORDER BY shrc.EffectiveDate DESC
                ) shrc

    --try using ROW_NUMBER
    SELECT sh.ServicerHistoryID
        , shrc.CommissionRate
    FROM dbo.ServicerHistory AS sh
    INNER JOIN @ServicerHistoryIDs AS shid
        ON sh.ServicerHistoryID = shid.ServicerHistoryID
    LEFT OUTER JOIN (SELECT shrc.CommissionRate
                     , ServicerHistoryID
                     , ROW_NUMBER() OVER (PARTITION BY ServicerHistoryID ORDER BY EffectiveDate DESC) RowNum
                     FROM dbo.ServicerHistoryCommissionRate AS shrc
                    ) shrc
        ON sh.ServicerHistoryID = shrc.ServicerHistoryID
        AND shrc.RowNum = 1

    --try the old fashioned double join method
    SELECT sh.ServicerHistoryID
        , shrc.CommissionRate
    FROM dbo.ServicerHistory AS sh
    INNER JOIN @ServicerHistoryIDs AS shid
        ON sh.ServicerHistoryID = shid.ServicerHistoryID
    LEFT OUTER JOIN (SELECT commid.ServicerHistoryID
                     , MAX(ServicerHistoryCommissionRateID) ServicerHistoryCommissionRateID
                     FROM dbo.ServicerHistoryCommissionRate commid
                     INNER JOIN (SELECT ServicerHistoryID
                                 , MAX(EffectiveDate) MaxDate
                                 FROM dbo.ServicerHistoryCommissionRate
                                 GROUP BY ServicerHistoryID) commdate
                         ON commid.ServicerHistoryID = commdate.ServicerHistoryID
                         AND commid.EffectiveDate = commdate.MaxDate --as before, restrict to the latest date
                     GROUP BY commid.ServicerHistoryID) comm
        ON sh.ServicerHistoryID = comm.ServicerHistoryID
    LEFT OUTER JOIN dbo.ServicerHistoryCommissionRate shrc
        ON comm.ServicerHistoryCommissionRateID = shrc.ServicerHistoryCommissionRateID

    --OUTER APPLY
    (100 row(s) affected)
    Table 'ServicerHistoryCommissionRate'. Scan count 100, logical reads 300, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'ServicerHistory'. Scan count 0, logical reads 200, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#17B73214'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 0 ms, elapsed time = 0 ms.

    --ROW_NUMBER
    (100 row(s) affected)
    Table 'ServicerHistoryCommissionRate'. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'ServicerHistory'. Scan count 0, logical reads 200, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#17B73214'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 0 ms, elapsed time = 0 ms.

    --double join
    (100 row(s) affected)
    Table 'ServicerHistoryCommissionRate'. Scan count 100, logical reads 601, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'ServicerHistory'. Scan count 0, logical reads 200, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#17B73214'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 0 ms, elapsed time = 0 ms.

    Hm, so as per the norm, the answer to which performs best is "it depends". Ho-hum.


  • Matthew Darwin (6/7/2013)


    I had also noticed that if I take a small set of ServicerHistoryIDs as an input to the main ServicerHistory table, then the APPLY operator worked better; I'm assuming because the ROW_NUMBER function still has to be calculated for the entire ServicerHistory table, whereas you only need to run the TOP 1 in the APPLY for the filtered set. In those cases, the double join is also quicker than the ROW_NUMBER version.

    Yep. Basically, AFAIK filtering on the result of one of the analytic functions (only selecting rows where ROW_NUMBER() = 1) will always result in a scan across all the rows over which the row number is applied. It's therefore likely to be the most efficient method where more than a percent or so of the overall rows match your filter. Seek + key lookup is more expensive than people assume, and will only "win" where you're retrieving less than a percent (approximately; as usual, it depends) of the overall rows.

    In cases where the optimiser can make this decision itself, e.g. with regular joins and predicates, it will choose whether to seek or scan based on the statistics (check out Gail's blog here), but in this kind of scenario it's not able (or not clever enough) to do so. Therefore, knowing the distribution of your data, and being able to make an educated decision about whether to use ROW_NUMBER, APPLY or something else, can be critical.
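    As a rough sketch of that educated decision (using the thread's own tables; the ~1% threshold in the comments is the approximation quoted above), you can gauge which side of the line you're on before picking a method:

    SELECT COUNT(*) AS TotalRows
        , COUNT(DISTINCT ServicerHistoryID) AS DistinctIDs
        --the top-1-per-ID result set has DistinctIDs rows; if that's well under ~1% of TotalRows
        --(i.e. many rates per account), the per-ID seek behind APPLY reads far less than the
        --single scan behind ROW_NUMBER; if most accounts have only a handful of rates, the scan wins
    FROM dbo.ServicerHistoryCommissionRate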
