Linking to the Previous Row

  • I got this CTE approach working for a related case - my first not-entirely-trivial query! Thanks David and all who have contributed via these comments. Next stage is to try and understand Outer Apply and a temp table approach.

    Actually it was my second query, but my first was the RBAR approach 🙂 You live and learn.

  • OK, confession time. Reading Charles' Outer Apply suggestion I can't see why this isn't a RBAR solution. Doesn't the Outer Apply do exactly what a subquery of the type 'Top 1 where Date < Today' would do... slowly?
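By a subquery of that type, I mean something along these lines (a sketch only - the table and column names are borrowed from Charles's example, so treat the exact identifiers as assumptions):

```sql
-- A correlated TOP 1 subquery fetching the previous price for each row.
-- This is the form I suspect OUTER APPLY boils down to.
SELECT
    ph.[itemId],
    ph.[priceStartDate],
    ph.[price],
    (SELECT TOP 1 phPrev.[price]
     FROM priceHistory phPrev
     WHERE phPrev.[itemId] = ph.[itemId]
       AND phPrev.[priceStartDate] < ph.[priceStartDate]
     ORDER BY phPrev.[priceStartDate] DESC) AS [Old Price]
FROM priceHistory ph
```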

  • alistairgthomas (11/11/2010)


    OK, confession time. Reading Charles' Outer Apply suggestion I can't see why this isn't a RBAR solution. Doesn't the Outer Apply do exactly what a subquery of the type 'Top 1 where Date < Today' would do... slowly?

Hi Alistair, would you mind providing a page number or link to Charles's Outer Apply suggestion (or quote it)?

    Thanks,

    David.

  • Thanks - Mr Charlie says Outer Apply might allow more efficient index use:

    charles.gildawie (8/20/2010)


    It's a good introduction for ROW_NUMBER but the query plan it generates is a little horrible. I wouldn't recommend ROW_NUMBER() for this task.

With the ROW_NUMBER() approach you can't use any INDEX when doing your LEFT JOINs back to your CTE.

So what happens is that you end up performing at least 2 CI scans on the priceHistory table, regardless of how selective the rest of your query is. The example you gave actually uses 3 CI scans -- one because you are using the whole of priceHistory, and 2 from the LEFT JOINs back to the CTE.

    Perhaps a better approach would be to use another 2005+ construct -- OUTER APPLY

    Example:

SELECT
    i.[item] AS [Item]
    , previousPrice.[Price] AS [Old Price]
    , ph.[price] AS [RangePrice]
    , ph.[priceStartDate] AS [Startdate]
    , nextPriceStart.[nextPriceStartDate] AS [EndDate]
FROM items AS i
JOIN priceHistory ph ON ph.[itemId] = i.[itemID]
OUTER APPLY (
    SELECT TOP 1 phNext.[priceStartDate] AS [nextPriceStartDate]
    FROM priceHistory AS phNext
    WHERE phNext.[itemId] = ph.[itemID]
      AND phNext.[priceStartDate] > ph.[priceStartDate]
    ORDER BY phNext.[priceStartDate] ASC
) AS nextPriceStart
OUTER APPLY (
    SELECT TOP 1 phPrev.[price] AS [Price]
    FROM priceHistory phPrev
    WHERE phPrev.[itemId] = ph.[ItemID]
      AND phPrev.[priceStartDate] < ph.[priceStartDate]
    ORDER BY phPrev.[priceStartDate] DESC
) AS previousPrice

Which generates the same results but only does 1 CI scan (because we are using every row in priceHistory). If we were more selective (for example, only doing this for vacuum cleaners), there would be no scans, all seeks.

Give it a whirl and compare the execution plans. I bet OUTER APPLY would be a lot faster on a large dataset (and where you are being selective about the products / dates you want to bring back).

    Regards,

    Transact_Charlie.

  • Alistair,

    Just looking at it, I'd tend to agree with you. While I agree with the assertion that the CTE approach can't benefit from indexes (there are none), I'd be surprised if the APPLY approach broke many records. The nature of CROSS APPLY / OUTER APPLY is that the "applied" is applied for every row - and this fits my understanding of RBAR.

    Regards.

  • David McKinney (11/12/2010)


    Alistair,

    Just looking at it, I'd tend to agree with you. While I agree with the assertion that the CTE approach can't benefit from indexes (there are none), I'd be surprised if the APPLY approach broke many records. The nature of CROSS APPLY / OUTER APPLY is that the "applied" is applied for every row - and this fits my understanding of RBAR.

    Regards.

    Not really. The optimiser may determine that an APPLY fits the criteria for a JOIN and construct a plan which does exactly that. Paul White's excellent articles referred to in my sig are well worth a read.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thanks Chris...I guess it could be analogous to a non-equi join or a cross join with a where clause, but I remain somehow unconvinced and I'd still like to see some figures.

    Indeed what strikes me about all the discussion so far around this article is that no-one has actually put these methods seriously to test. (Not even the author!)

    I have thought about doing a follow up article based solely around the different methods proposed in the discussions, and putting each to the test against real volumes of data, but also against likely use cases. It's in my list of articles I really must get around to writing!

    Quizás, Quizás, Quizás.

Thanks - I read Paul's articles (skimmed the trickier bits) and it seems that the optimiser recognises a nested loop when it sees it... but I don't see that it then does something clever with an expression like SELECT TOP 1 of a sorted column? I am new to this! I have a simpler version of the example problem, and I tried the CTE, a temp table and an Outer Apply; the former two were (equally) fast while the Apply was a lot slower. I may have mangled the syntax - I'll refrain from posting it to avoid wasting your time and my own embarrassment. I was just wondering if I'd understood Charlie's concept correctly.

    Many thanks,

    Alistair

  • David McKinney (11/12/2010)


    Thanks Chris...I guess it could be analogous to a non-equi join or a cross join with a where clause, but I remain somehow unconvinced and I'd still like to see some figures.

    Indeed what strikes me about all the discussion so far around this article is that no-one has actually put these methods seriously to test. (Not even the author!)

    I have thought about doing a follow up article based solely around the different methods proposed in the discussions, and putting each to the test against real volumes of data, but also against likely use cases. It's in my list of articles I really must get around to writing!

    Quizás, Quizás, Quizás.

Paul puts APPLY through some rigorous proofs in his articles. Don't trust me, I'm only a doctor 😎


  • Sorry Doc...I meant that specifically the different approaches to the "previous / next row" problem haven't been compared side by side.

    As for Paul's Apply articles, I have read them; indeed it was there where I learned about using this method for xml shredding.

  • David McKinney (11/13/2010)


    I meant that specifically the different approaches to the "previous / next row" problem haven't been compared side by side.

Exactly. Here's code to build a million row table with 5,000 ItemIDs and around 200 price changes for each item over a ten year period. Let the races begin. In the meantime, no claims of performance should be made, because it hasn't been proven on THIS thread.

--===== Conditionally drop and rebuild a test table to make reruns easier.
     -- This whole thing takes about 35 seconds to run.
IF OBJECT_ID('tempdb..#PriceHistory','U') IS NOT NULL
    DROP TABLE #PriceHistory;
GO
WITH
cteCreateData AS
(
    SELECT TOP 1050000
        ItemID = CAST(ABS(CHECKSUM(NEWID()))%5000+1 AS INT),
        PriceStartDate = DATEADD(dd,ABS(CHECKSUM(NEWID()))%(DATEDIFF(dd,'2010','2020')),'2010'),
        Price = CAST((ABS(CHECKSUM(NEWID()))%1000)/100.0 +1 AS DECIMAL(9,2))
    FROM sys.all_columns ac1
    CROSS JOIN sys.all_columns ac2
),
cteDeleteDupes AS
(
    SELECT Occurance = ROW_NUMBER() OVER (PARTITION BY ItemID, PriceStartDate ORDER BY ItemID, PriceStartDate),
        ItemID, PriceStartDate, Price
    FROM cteCreateData
)
SELECT TOP 1000000
    ItemID, PriceStartDate, Price
INTO #PriceHistory
FROM cteDeleteDupes
WHERE Occurance = 1;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Excellent...thanks, Jeff, for getting the ball rolling.

I adapted your code to make it as per the example, i.e. so that the article code would run without modification.

    Specifically, I created the PriceHistory table upfront with PK as per the article, and also created an Items table with a foreign key constraint between the two. (Also I created these as non-temporary tables - for no good reason.) At the end of the code block below is the CREATE VIEW exactly as in the article.

    There are subsequent code blocks to show other methods.

    I'm not finished yet...I just wanted to post some code before someone beat me to it 😀

    The next steps are the following -

    1) post an example of the cross apply method.

    2) benchmark each method for different use cases.

    I propose the following use cases

a) Entire table (as per SELECT * FROM PriceCompare in the article)

b) History for a particular item (e.g. WHERE Item = 'Item 512')

c) Identify all increases in price (e.g. WHERE RangePrice > OldPrice)
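Spelled out against the article's PriceCompare view, those use cases are simply:

```sql
-- (a) Entire table
SELECT * FROM PriceCompare;

-- (b) History for a particular item
SELECT * FROM PriceCompare WHERE Item = 'Item 512';

-- (c) All increases in price
SELECT * FROM PriceCompare WHERE RangePrice > OldPrice;
```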

    While I haven't done any benchmarking in earnest, I reckon for now that the CTE is holding its own on the million row dataset. But maybe with some appropriate indexing etc. the others may prove better.

    Regards,

    David McKinney.

IF EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id = OBJECT_ID(N'[dbo].[FK_PriceHistory_Items]') AND parent_object_id = OBJECT_ID(N'[dbo].[PriceHistory]'))
    ALTER TABLE [dbo].[PriceHistory] DROP CONSTRAINT [FK_PriceHistory_Items]
GO

IF OBJECT_ID('PriceHistory','U') IS NOT NULL
    DROP TABLE PriceHistory;
GO

IF OBJECT_ID('Items','U') IS NOT NULL
    DROP TABLE Items;
GO

CREATE TABLE [dbo].[Items](
    [ItemId] [int] NOT NULL,
    [Item] [varchar](100) NOT NULL,
    CONSTRAINT [PK_Items] PRIMARY KEY CLUSTERED ([ItemId] ASC)
)
GO

CREATE TABLE [dbo].[PriceHistory](
    [ItemId] [int] NOT NULL,
    [PriceStartDate] [datetime] NOT NULL,
    [Price] [decimal](10, 2) NOT NULL,
    CONSTRAINT [PK_PriceHistory] PRIMARY KEY CLUSTERED ([ItemId] ASC, [PriceStartDate] ASC)
)
GO

;WITH
cteCreateData AS
(
    SELECT TOP 1050000
        ItemID = CAST(ABS(CHECKSUM(NEWID()))%5000+1 AS INT),
        PriceStartDate = DATEADD(dd,ABS(CHECKSUM(NEWID()))%(DATEDIFF(dd,'2010','2020')),'2010'),
        Price = CAST((ABS(CHECKSUM(NEWID()))%1000)/100.0 +1 AS DECIMAL(9,2))
    FROM sys.all_columns ac1
    CROSS JOIN sys.all_columns ac2
),
cteDeleteDupes AS
(
    SELECT Occurance = ROW_NUMBER() OVER (PARTITION BY ItemID, PriceStartDate ORDER BY ItemID, PriceStartDate),
        ItemID, PriceStartDate, Price
    FROM cteCreateData
)
INSERT INTO PriceHistory (ItemID, PriceStartDate, Price)
SELECT TOP 1000000
    ItemID, PriceStartDate, Price
FROM cteDeleteDupes
WHERE Occurance = 1;
GO

INSERT INTO dbo.Items (ItemId, Item)
SELECT ItemID, 'Item ' + CAST(ItemID AS varchar) FROM PriceHistory
GROUP BY ItemID
GO

ALTER TABLE [dbo].[PriceHistory] WITH CHECK ADD CONSTRAINT [FK_PriceHistory_Items] FOREIGN KEY([ItemId])
REFERENCES [dbo].[Items] ([ItemId])
GO

/****** Object: View [dbo].[PriceCompare] Script Date: 11/16/2010 14:21:14 ******/
IF EXISTS (SELECT * FROM sys.views WHERE object_id = OBJECT_ID(N'[dbo].[PriceCompare]'))
    DROP VIEW [dbo].[PriceCompare]
GO

CREATE VIEW [dbo].[PriceCompare] AS
WITH PriceCompare AS
(
    SELECT i.Item, ph.ItemId, ph.PriceStartDate, ph.Price,
        ROW_NUMBER() OVER (PARTITION BY ph.ItemId ORDER BY PriceStartDate) AS rownum
    FROM Items i
    INNER JOIN PriceHistory ph ON i.ItemId = ph.ItemId
)
SELECT
    currow.Item,
    prevrow.Price AS OldPrice,
    currow.Price AS RangePrice,
    currow.PriceStartDate AS StartDate,
    nextrow.PriceStartDate AS EndDate
FROM PriceCompare currow
LEFT JOIN PriceCompare nextrow
    ON currow.rownum = nextrow.rownum - 1 AND currow.ItemId = nextrow.ItemId
LEFT JOIN PriceCompare prevrow
    ON currow.rownum = prevrow.rownum + 1 AND currow.ItemId = prevrow.ItemId

The next code block shows the creation of a temporary table exactly equivalent to the PriceCompare CTE in the view. The temporary table is populated once, and then joined with itself.

-- Replacing the CTE with an entirely equivalent temporary table.
IF OBJECT_ID('tempdb.dbo.#NumberedPriceHistory') IS NOT NULL
    DROP TABLE #NumberedPriceHistory
GO

CREATE TABLE #NumberedPriceHistory
(
    PRIMARY KEY (ItemId, rownum),
    Item varchar(100),
    ItemId int,
    PriceStartDate datetime,
    Price decimal(9,2),
    rownum int
)
GO

--CREATE NONCLUSTERED INDEX idxNumberedPriceHistory_Item
--ON [dbo].[#NumberedPriceHistory] ([Item])
--INCLUDE ([ItemId],[PriceStartDate],[Price],[rownum])
--GO

INSERT INTO #NumberedPriceHistory
SELECT i.Item, ph.ItemId, ph.PriceStartDate, ph.Price,
    ROW_NUMBER() OVER (PARTITION BY ph.ItemId ORDER BY PriceStartDate) AS rownum
FROM Items i INNER JOIN PriceHistory ph
    ON i.ItemId = ph.ItemId
GO

SELECT currow.Item, prevrow.Price AS OldPrice, currow.Price AS RangePrice, currow.PriceStartDate AS StartDate, nextrow.PriceStartDate AS EndDate
FROM #NumberedPriceHistory currow
LEFT JOIN #NumberedPriceHistory nextrow
    ON currow.rownum = nextrow.rownum - 1 AND currow.ItemId = nextrow.ItemId
LEFT JOIN #NumberedPriceHistory prevrow
    ON currow.rownum = prevrow.rownum + 1 AND currow.ItemId = prevrow.ItemId
WHERE currow.Price > prevrow.Price
--WHERE currow.Item = 'Item 512'

    The third code block shows a temp table but this time with an identity column to replace the row number. The insert into this table is sorted - is that OK?

-- Replacing the CTE with a table with an identity column
IF OBJECT_ID('tempdb.dbo.#PriceHistoryWithIdentity') IS NOT NULL
    DROP TABLE #PriceHistoryWithIdentity
GO

CREATE TABLE #PriceHistoryWithIdentity
(
    PRIMARY KEY (rownum),
    rownum int identity(1,1),
    Item varchar(100),
    ItemId int,
    PriceStartDate datetime,
    Price decimal(9,2)
)
GO

INSERT INTO #PriceHistoryWithIdentity
    (Item, ItemId, PriceStartDate, Price)
SELECT i.Item, ph.ItemId, ph.PriceStartDate, ph.Price
FROM Items i INNER JOIN PriceHistory ph
    ON i.ItemId = ph.ItemId
ORDER BY i.Item, ph.PriceStartDate
GO

SELECT currow.Item, prevrow.Price AS OldPrice, currow.Price AS RangePrice, currow.PriceStartDate AS StartDate, nextrow.PriceStartDate AS EndDate
FROM #PriceHistoryWithIdentity currow
LEFT JOIN #PriceHistoryWithIdentity nextrow
    ON currow.rownum = nextrow.rownum - 1 AND currow.ItemId = nextrow.ItemId
LEFT JOIN #PriceHistoryWithIdentity prevrow
    ON currow.rownum = prevrow.rownum + 1 AND currow.ItemId = prevrow.ItemId
WHERE currow.Price > prevrow.Price

OK... so here are my results, tidied into a table. Conclusions follow.

Full Resultset

    Method                        | Execution Time         | Client Processing | Total Execution | Wait Time | Notes
    CTE View                      | 13 secs                | 3261              | 5759            | 2498      | 93% in Hash Match
    Temp Table with rownumber     | 33 secs                | 21524             | 25422           | 3898      | Index Scans
    Table Variable with rownumber | 47 secs                | 36798             | 39196           | 2398      |
    Temp Table with identity      | 16 secs                | 3269              | 8753            | 5484      | Insert 81% / Select 19%
    Table Variable with identity  | Cancelled after 1 hour |                   |                 |           |
    Cross Apply                   | 12 secs                | 3795              | 4066            | 271       |

One Item

    Method                        | Execution Time | Client Processing | Total Execution | Wait Time | Notes
    CTE View                      | 2 secs         | 376               | 2179            | 1803      |
    Temp Table with rownumber     | 5 secs         | 170               | 5039            | 4869      | Insert 95%
    Table Variable with rownumber | 4 secs         | 68                | 3380            | 3312      |
    Temp Table with identity      | 6 secs         | 62                | 6585            | 6523      | Insert 86% / Select 14%
    Table Variable with identity  | 83 secs        | 78230             | 82767           | 4537      | Insert = 100%!
    Cross Apply                   | 0 secs         | 14                | 139             | 125       | Wow!

Price Rises

    Method                        | Execution Time | Client Processing | Total Execution | Wait Time | Notes
    CTE View                      | 8 secs         | 1586              | 4357            | 2771      |
    Temp Table with rownumber     | 19 secs        | 11213             | 15927           | 4714      |
    Table Variable with rownumber | 17 secs        | 10231             | 12995           | 2764      | Insert 100%
    Temp Table with identity      | 12 secs        | 2138              | 8066            | 5928      | Insert 83% / Select 17%
    Table Variable with identity  | Not Run        |                   |                 |           |
    Cross Apply                   | 8 secs         | 3857              | 4111            | 254       |

    (The cross apply code was exactly as previously posted by Charles Gildawie.)

    I ran the code on

    Microsoft SQL Server 2008 R2 (RTM) - 10.50.1600.1 (Intel X86) Apr 2 2010 15:53:02 Copyright (c) Microsoft Corporation Developer Edition on Windows NT 6.1 <X86> (Build 7600: )

    Before running each code block I executed the following, aiming to ensure that no caching was taking place.

checkpoint
dbcc freeproccache
dbcc dropcleanbuffers

    My conclusions are

    1) that the Cross Apply solution comes out a clear winner when searching for an individual Item (although the CTE has very acceptable performance relative to the other solutions.)

2) that any solutions using a table variable (as opposed to a temp table) do badly when using an identity column. Are they very slow to create identity values? In any case, the equivalent code with ROW_NUMBER does much better.
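(For anyone wanting to reproduce this: I didn't post the table variable variants. The identity version was roughly along these lines - a sketch rather than the exact code I ran:)

```sql
-- Table variable equivalent of #PriceHistoryWithIdentity (a sketch).
DECLARE @PriceHistoryWithIdentity TABLE
(
    rownum int IDENTITY(1,1) PRIMARY KEY,
    Item varchar(100),
    ItemId int,
    PriceStartDate datetime,
    Price decimal(9,2)
);

INSERT INTO @PriceHistoryWithIdentity (Item, ItemId, PriceStartDate, Price)
SELECT i.Item, ph.ItemId, ph.PriceStartDate, ph.Price
FROM Items i INNER JOIN PriceHistory ph ON i.ItemId = ph.ItemId
ORDER BY i.Item, ph.PriceStartDate;
```

The self-join query is then the same as for the temp table, against @PriceHistoryWithIdentity in the same batch.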

    Final Conclusion (for now!)

The Cross Apply and the CTE solutions stand clearly out from the bunch, with the cup going to Cross Apply for its performance on an individual item, and a special mention for the CTE 'cos it's much prettier 😉

    Next Steps?

    Perf testing isn't my speciality; perhaps you can find fault with my approach. Or perhaps the code used to illustrate each method could be improved e.g. by indexing.

    I haven't talked about the execution plans. I'll leave that perhaps to one better versed.

I hope this enlightens more than a few, and goes some way towards ending speculation on which approach would scale better.

    Regards,

    David McKinney.

  • That's very interesting David - thank you for investigating. Now to try and understand... 🙂

  • Just wanted to return to say I tried using Outer Apply as described and it was indeed faster for me too. Thanks all for your help.
