How to Compare Rows within Partitioned Sets to Find Overlapping Dates

  • Comments posted to this topic are about the item How to Compare Rows within Partitioned Sets to Find Overlapping Dates

  • Here I didn't get the use of this query. Can you give a real-world example of this?

  • If I read this correctly, you just implemented the LAG(...) Over(...) analytic function.

    However, I'm interested... if you do this on a large dataset, what does the execution plan look like?

    Random Technical Stuff

  • I think this would be of use in planning/supply systems where you want to know about future availability of resources.

    However, it is slightly misleading in that I would have thought the PersonId should have three rows, not two, as they will be resourced up until the end of December 2010.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • Shouldn't you have a LEFT JOIN rather than an INNER JOIN?

  • ta.bu.shi.da.yu (10/11/2010)


    If I read this correctly, you just implemented the LAG(...) Over(...) analytic function.

    However, I'm interested... if you do this on a large dataset, what does the execution plan look like?

    Got a BOL clicky-link for this, by any chance? 😉

    "Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms." - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • I don't think you're going to find that in BOL, Chris; unless I am mistaken LEAD and LAG are not supported by SQL Server. Oracle uses them to access values in a previous or next row - see this article. Let's hope Microsoft adds them to a future version of SQL Server.

    Kevin, I too am curious about the performance of your method on large data sets. Doing a self-join with a WHERE a < b condition generally leads to very slow queries as the table gets large - a triangular join has O(N²) performance. But using a partition ought to be much more efficient.
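    For anyone who hasn't seen the Oracle syntax, here is a minimal sketch of what LAG does, assuming the Schedules table and column names used in the examples later in this thread:

    ```sql
    -- Hypothetical sketch of LAG semantics (Oracle syntax at the time of writing;
    -- SQL Server eventually added LAG/LEAD in 2012). For each row, LAG returns the
    -- previous row's value within the partition, or NULL for the partition's first row.
    SELECT PersonID,
           startDate,
           LAG(startDate, 1) OVER (PARTITION BY PersonID ORDER BY startDate) AS prevStartDate
    FROM Schedules;
    ```

    The point is that the "compare each row to its neighbour" work is done in a single pass over the data, with no self-join.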

  • We SOOOOOOO need full Windowing Function support in SQL Server. And yes, performance on this type of query will currently be exceptionally poor and approaching non-functional on increasingly large datasets.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (10/11/2010)


    We SOOOOOOO need full Windowing Function support in SQL Server. And yes, performance on this type of query will currently be exceptionally poor and approaching non-functional on increasingly large datasets.

    Not so. Always check.

    -----------------------------------------------------------
    -- Create a working table to play with.
    -----------------------------------------------------------
    IF OBJECT_ID('tempdb..#Numbers') IS NOT NULL DROP TABLE #Numbers

    SELECT TOP 1000000
        n = ROW_NUMBER() OVER (ORDER BY a.name),
        CalcValue = CAST(NULL AS BIGINT)
    INTO #Numbers
    FROM master.dbo.syscolumns a, master.dbo.syscolumns b

    CREATE UNIQUE CLUSTERED INDEX CIn ON #Numbers ([n] ASC)

    -----------------------------------------------------------
    -- Run a test against the table.
    -----------------------------------------------------------
    SET STATISTICS IO ON
    SET STATISTICS TIME ON

    SELECT a.*, b.n AS Nextrow
    INTO #junk
    FROM #Numbers a
    INNER JOIN #Numbers b ON b.n = a.n + 1
    -- (999999 row(s) affected) / CPU time = 3516 ms, elapsed time = 3538 ms.
    -- Table 'Worktable'. Scan count 2, logical reads 6224, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF

    DROP TABLE #junk

    -----------------------------------------------------------
    -- Run a functionally similar test against a CTE of the table
    -- with ROW_NUMBER() generating "row IDs".
    -----------------------------------------------------------
    SET STATISTICS IO ON
    SET STATISTICS TIME ON

    ;WITH CTE AS
    (
        SELECT NewRowNumber = ROW_NUMBER() OVER (ORDER BY n DESC)
        FROM #Numbers
    )
    SELECT a.*, b.NewRowNumber AS Nextrow
    INTO #junk
    FROM CTE a
    INNER JOIN CTE b ON b.NewRowNumber = a.NewRowNumber + 1
    -- (999999 row(s) affected) / CPU time = 7781 ms, elapsed time = 7808 ms.
    -- Table 'Worktable'. Scan count 2, logical reads 6224, physical reads 0, read-ahead reads 5, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF


  • Real use of this query: Hospitals in the UK are penalised if patients have to wait too long for their operations. If the patient cancels an appointment, the waiting-time clock is reset; if the hospital cancels, it is not. (It's actually a lot more complex, but that's the basic principle.)

    Last year I was asked to sort out some ETL stored procedures used in waiting-time calculations that were taking over 20 hours to run. These procs were using cursors. I replaced them with ones using very similar code to that shown here. The run time dropped from 20+ hours to less than 10 minutes.

    Hope that's real enough 🙂

    Glyn

  • David McKinney (10/11/2010)


    Shouldn't you have a LEFT JOIN rather than an INNER JOIN?

    You're right! In haste, I ran an INNER JOIN so that I could just see the compared rows, but in hindsight, I probably should have kept my base table pure and run a LEFT JOIN instead. This at least would have kept my table counts the same as well as singling out the last row in each partitioned set.
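    For illustration, here is a hedged sketch of what that LEFT JOIN variant might look like, using the Schedules table and column names from the article's example (the row-numbering approach is an assumption, not the article's exact code):

    ```sql
    -- Sketch only: number each person's rows by startDate, then LEFT JOIN each row
    -- to the next one in the same partition. The last row of each partition survives
    -- with NULLs instead of being dropped, keeping the row count equal to the base table.
    WITH Ordered AS
    (
        SELECT ScheduleID, PersonID, startDate, durationDays,
               CalculatedEndDate = DATEADD(day, durationDays, startDate),
               rn = ROW_NUMBER() OVER (PARTITION BY PersonID ORDER BY startDate)
        FROM Schedules
    )
    SELECT a.ScheduleID, a.PersonID, a.startDate, a.CalculatedEndDate,
           b.startDate AS nextStartDate   -- NULL marks the last row in each partition
    FROM Ordered a
    LEFT JOIN Ordered b
        ON  b.PersonID = a.PersonID
        AND b.rn = a.rn + 1;
    ```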

  • ta.bu.shi.da.yu (10/11/2010)


    If I read this correctly, you just implemented the LAG(...) Over(...) analytic function.

    Cool, I didn't even know this existed. But it looks like it's only available in Oracle databases. Microsoft, please bring this over!

    However, I'm interested... if you do this on a large dataset, what does the execution plan look like?

    Let me try to get my execution plan and I'll get back to you. I do remember that our initial process ran painfully slow and we had just 45K rows. It ran for something like 45 seconds (almost 1 sec/1K rows) using a scalar function. This updated process now takes less than one second to complete.

  • Real World Example: Wage and Hour Class Action Lawsuits. I am always working with dates on employment cases and many times the individual has many start and end dates. Part of the scrubbing process is to check for date overlaps or gaps. This is a good start to check those types of things.

    Thanks for the post! 🙂

  • SQL 2012 makes pretty short work of this type of problem:

    WITH PartitionedSchedules AS
    (
        SELECT ScheduleID, PersonID, startDate, durationDays
            ,CalculatedEndDate = DATEADD(day, durationDays, startDate)
            ,row2startDate = LEAD(startDate, 1) OVER (PARTITION BY PersonID ORDER BY startDate)
        FROM Schedules
    )
    SELECT ScheduleID, PersonID, startDate, durationDays, CalculatedEndDate
        ,row2startDate
        ,datedifference
        ,analysis = CASE SIGN(datedifference)
            WHEN 0 THEN 'contiguous'
            WHEN 1 THEN CAST(ABS(datedifference) AS VARCHAR) + ' days overlap'
            ELSE CAST(ABS(datedifference) AS VARCHAR) + ' days gap'
            END
    FROM PartitionedSchedules a
    CROSS APPLY
    (
        SELECT DATEDIFF(day, row2startDate, CalculatedEndDate)
    ) b (datedifference)
    WHERE datedifference IS NOT NULL;

    No more need for a self-join.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
    Since random numbers are too important to be left to chance, let's generate some!
    Learn to understand recursive CTEs by example.

  • And here is (I think) another way to do it in SQL 2005 that avoids the self-join:

    SELECT ScheduleID, PersonID, startDate, durationDays
        ,row2StartDate, CalculatedEndDate, datedifference
        ,analysis = CASE SIGN(datedifference)
            WHEN 0 THEN 'contiguous'
            WHEN 1 THEN CAST(ABS(datedifference) AS VARCHAR) + ' days overlap'
            ELSE CAST(ABS(datedifference) AS VARCHAR) + ' days gap'
            END
    FROM
    (
        SELECT ScheduleID = MAX(CASE WHEN rn2 = 2 THEN ScheduleID END)
            ,PersonID
            ,startDate = MIN(startDate)
            ,durationDays = MAX(CASE WHEN rn2 = 2 THEN durationDays END)
            ,row2StartDate = MAX(CASE rn2 WHEN 2 THEN CalculatedEndDate ELSE [Date] END)
            ,CalculatedEndDate = MAX(CASE rn2 WHEN 2 THEN [Date] END)
            ,datedifference = DATEDIFF(day
                ,MAX(CASE rn2 WHEN 2 THEN CalculatedEndDate ELSE [Date] END)
                ,MAX(CASE rn2 WHEN 2 THEN [Date] END))
        FROM
        (
            SELECT ScheduleID
                ,PersonID
                ,startDate
                ,durationDays
                ,CalculatedEndDate = CASE WHEN rn2 = 1 THEN DATEADD(day, durationDays, [Date]) END
                ,[Date]
                ,rn = ROW_NUMBER() OVER (PARTITION BY PersonID ORDER BY startDate) / 2
                ,rn2
            FROM Schedules a
            CROSS APPLY
            (
                SELECT 1, startDate UNION ALL
                SELECT 2, DATEADD(day, durationDays, startDate)
            ) b (rn2, [Date])
        ) a
        GROUP BY PersonID, rn
        HAVING COUNT(*) = 2
    ) a
    ORDER BY PersonID;

    Note, though, that this one assumes a row does not overlap two or more following rows.

    The benefit of course of not doing a self-join is that the query does a single table or index scan (depending on indexing) instead of two.


