How to Compare Rows within Partitioned Sets to Find Overlapping Dates

  • Comments posted to this topic are about the item How to Compare Rows within Partitioned Sets to Find Overlapping Dates

  • Here I didn't get the use of this query. Can you give a real-world example of this?

  • If I read this correctly, you just implemented the LAG(...) Over(...) analytic function.

    However, I'm interested... if you do this on a large dataset, what does the execution plan look like?

    Random Technical Stuff

  • I think this would be of use in planning/supply systems where you want to know about future availability of resources.

    However, it is slightly misleading in that I would have thought the PersonId should have three rows, not two, as they will be resourced up until the end of December 2010.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • Shouldn't you have a LEFT JOIN rather than an INNER JOIN?

  • ta.bu.shi.da.yu (10/11/2010)


    If I read this correctly, you just implemented the LAG(...) Over(...) analytic function.

    However, I'm interested... if you do this on a large dataset, what does the execution plan look like?

    Got a BOL clicky-link for this, by any chance? 😉

    "Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms." - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • I don't think you're going to find that in BOL, Chris; unless I am mistaken LEAD and LAG are not supported by SQL Server. Oracle uses them to access values in a previous or next row - see this article. Let's hope Microsoft adds them to a future version of SQL Server.

    Kevin, I too am curious about the performance of your method on large data sets. Doing a self-join with a WHERE a < b condition generally leads to very slow queries as the table gets large - a triangular join has O(N²) performance. But using a partition ought to be much more efficient.
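    For anyone who hasn't seen the Oracle syntax, here is a minimal sketch of what LAG does, assuming the Schedules table and column names used in the examples later in this thread:

    ```sql
    -- Hypothetical sketch of LAG semantics (Oracle syntax at the time of writing;
    -- SQL Server eventually added LAG/LEAD in 2012). For each row, LAG returns the
    -- previous row's value within the partition, or NULL for the partition's first row.
    SELECT PersonID,
           startDate,
           LAG(startDate, 1) OVER (PARTITION BY PersonID ORDER BY startDate) AS prevStartDate
    FROM Schedules;
    ```

    The point is that the "compare each row to its neighbour" work is done in a single pass over the data, with no self-join.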

  • We SOOOOOOO need full Windowing Function support in SQL Server. And yes, performance on this type of query will currently be exceptionally poor and approaching non-functional on increasingly large datasets.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (10/11/2010)


    We SOOOOOOO need full Windowing Function support in SQL Server. And yes, performance on this type of query will currently be exceptionally poor and approaching non-functional on increasingly large datasets.

    Not so. Always check.

    -----------------------------------------------------------
    -- Create a working table to play with.
    -----------------------------------------------------------
    IF OBJECT_ID('tempdb..#Numbers') IS NOT NULL DROP TABLE #Numbers

    SELECT TOP 1000000
        n = ROW_NUMBER() OVER (ORDER BY a.name),
        CalcValue = CAST(NULL AS BIGINT)
    INTO #Numbers
    FROM master.dbo.syscolumns a, master.dbo.syscolumns b

    CREATE UNIQUE CLUSTERED INDEX CIn ON #Numbers ([n] ASC)

    -----------------------------------------------------------
    -- Run a test against the table.
    -----------------------------------------------------------
    SET STATISTICS IO ON
    SET STATISTICS TIME ON

    SELECT a.*, b.n AS Nextrow
    INTO #junk
    FROM #Numbers a
    INNER JOIN #Numbers b ON b.n = a.n + 1
    -- (999999 row(s) affected) / CPU time = 3516 ms, elapsed time = 3538 ms.
    -- Table 'Worktable'. Scan count 2, logical reads 6224, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF

    DROP TABLE #junk

    -----------------------------------------------------------
    -- Run a functionally similar test against a CTE of the table
    -- with ROW_NUMBER() generating "row IDs".
    -----------------------------------------------------------
    SET STATISTICS IO ON
    SET STATISTICS TIME ON

    ;WITH CTE AS
    (
        SELECT NewRowNumber = ROW_NUMBER() OVER (ORDER BY n DESC)
        FROM #Numbers
    )
    SELECT a.*, b.NewRowNumber AS Nextrow
    INTO #junk
    FROM CTE a
    INNER JOIN CTE b ON b.NewRowNumber = a.NewRowNumber + 1
    -- (999999 row(s) affected) / CPU time = 7781 ms, elapsed time = 7808 ms.
    -- Table 'Worktable'. Scan count 2, logical reads 6224, physical reads 0, read-ahead reads 5, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF


  • Real use of this query: Hospitals in the UK are penalised if patients have to wait too long for their operations. If the patient cancels an appointment, the waiting-time clock is reset; if the hospital cancels, it is not. (It's actually a lot more complex, but that's the basic principle.)

    Last year I was asked to sort out some ETL stored procedures used in waiting-time calculations that were taking over 20 hours to run. These procs were using cursors. I replaced them with ones using very similar code to that shown here. The run time dropped from 20+ hours to less than 10 minutes.

    Hope that's real enough 🙂

    Glyn

  • David McKinney (10/11/2010)


    Shouldn't you have a LEFT JOIN rather than an INNER JOIN?

    You're right! In haste, I ran an INNER JOIN so that I could just see the compared rows, but in hindsight, I probably should have kept my base table pure and run a LEFT JOIN instead. This at least would have kept my table counts the same as well as singling out the last row in each partitioned set.
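    For illustration, here is a hedged sketch of what that LEFT JOIN variant might look like, using the Schedules table and column names from the article's example (the row-numbering approach is an assumption, not the article's exact code):

    ```sql
    -- Sketch only: number each person's rows by startDate, then LEFT JOIN each row
    -- to the next one in the same partition. The last row of each partition survives
    -- with NULLs instead of being dropped, keeping the row count equal to the base table.
    WITH Ordered AS
    (
        SELECT ScheduleID, PersonID, startDate, durationDays,
               CalculatedEndDate = DATEADD(day, durationDays, startDate),
               rn = ROW_NUMBER() OVER (PARTITION BY PersonID ORDER BY startDate)
        FROM Schedules
    )
    SELECT a.ScheduleID, a.PersonID, a.startDate, a.CalculatedEndDate,
           b.startDate AS nextStartDate   -- NULL marks the last row in each partition
    FROM Ordered a
    LEFT JOIN Ordered b
        ON  b.PersonID = a.PersonID
        AND b.rn = a.rn + 1;
    ```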

  • ta.bu.shi.da.yu (10/11/2010)


    If I read this correctly, you just implemented the LAG(...) Over(...) analytic function.

    Cool, I didn't even know this existed. But it looks like it's only available in Oracle databases. Microsoft, please bring this over!

    However, I'm interested... if you do this on a large dataset, what does the execution plan look like?

    Let me try to get my execution plan and I'll get back to you. I do remember that our initial process ran painfully slow and we had just 45K rows. It ran for something like 45 seconds (almost 1 sec/1K rows) using a scalar function. This updated process now takes less than one second to complete.

  • Real World Example: Wage and Hour Class Action Lawsuits. I am always working with dates on employment cases and many times the individual has many start and end dates. Part of the scrubbing process is to check for date overlaps or gaps. This is a good start to check those types of things.

    Thanks for the post! 🙂

  • SQL 2012 makes pretty short work of this type of problem:

    WITH PartitionedSchedules AS
    (
        SELECT ScheduleID, PersonID, startDate, durationDays
            ,CalculatedEndDate = DATEADD(day, durationDays, startDate)
            ,row2startDate = LEAD(startDate, 1) OVER (PARTITION BY PersonID ORDER BY startDate)
        FROM Schedules
    )
    SELECT ScheduleID, PersonID, startDate, durationDays, CalculatedEndDate
        ,row2startDate
        ,datedifference
        ,analysis = CASE SIGN(datedifference)
            WHEN 0 THEN 'contiguous'
            WHEN 1 THEN CAST(ABS(datedifference) AS VARCHAR) + ' days overlap'
            ELSE CAST(ABS(datedifference) AS VARCHAR) + ' days gap'
            END
    FROM PartitionedSchedules a
    CROSS APPLY
    (
        SELECT DATEDIFF(day, row2startDate, CalculatedEndDate)
    ) b (datedifference)
    WHERE datedifference IS NOT NULL;

    No more need for a self-join.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
    Since random numbers are too important to be left to chance, let's generate some!
    Learn to understand recursive CTEs by example.

  • And here is (I think) another way to do it in SQL 2005 that avoids the self-join:

    SELECT ScheduleID, PersonID, startDate, durationDays
        ,row2StartDate, CalculatedEndDate, datedifference
        ,analysis = CASE SIGN(datedifference)
            WHEN 0 THEN 'contiguous'
            WHEN 1 THEN CAST(ABS(datedifference) AS VARCHAR) + ' days overlap'
            ELSE CAST(ABS(datedifference) AS VARCHAR) + ' days gap'
            END
    FROM
    (
        SELECT ScheduleID = MAX(CASE WHEN rn2 = 2 THEN ScheduleID END)
            ,PersonID
            ,startDate = MIN(startDate)
            ,durationDays = MAX(CASE WHEN rn2 = 2 THEN durationDays END)
            ,row2StartDate = MAX(CASE rn2 WHEN 2 THEN CalculatedEndDate ELSE [Date] END)
            ,CalculatedEndDate = MAX(CASE rn2 WHEN 2 THEN [Date] END)
            ,datedifference = DATEDIFF(day
                ,MAX(CASE rn2 WHEN 2 THEN CalculatedEndDate ELSE [Date] END)
                ,MAX(CASE rn2 WHEN 2 THEN [Date] END))
        FROM
        (
            SELECT ScheduleID
                ,PersonID
                ,startDate
                ,durationDays
                ,CalculatedEndDate = CASE WHEN rn2 = 1 THEN DATEADD(day, durationDays, [Date]) END
                ,[Date]
                ,rn = ROW_NUMBER() OVER (PARTITION BY PersonID ORDER BY startDate) / 2
                ,rn2
            FROM Schedules a
            CROSS APPLY
            (
                SELECT 1, startDate UNION ALL
                SELECT 2, DATEADD(day, durationDays, startDate)
            ) b (rn2, [Date])
        ) a
        GROUP BY PersonID, rn
        HAVING COUNT(*) = 2
    ) a
    ORDER BY PersonID;

    Note, though, that this one assumes a row does not overlap two or more following rows.

    The benefit of course of not doing a self-join is that the query does a single table or index scan (depending on indexing) instead of two.


