Linking to next row using a join on row_number()+1=row_number()

  • Having previously read Jeff Moden's article "Hidden RBAR: Triangular Joins" (http://www.sqlservercentral.com/articles/T-SQL/61539/), and then the comments following the article at http://www.sqlteam.com/article/joining-to-the-next-sequential-row, I decided to try the method of joining on row_number()+1 = row_number() on a real problem of mine which currently relies on a triangular join.

    The aim is to take a series of statuses and dates for documents and produce pairs of status+date with next_status+date, so that further processing can look at turnaround times, etc.

    However, my implementation of the method described turns out to be slower than a simple triangular join (select ..., min(b.x) from t a join t b on a.x < b.x group by ...).

    So my question is: is there a faster way to do this than the triangular join?

    As a second question, can anyone suggest why the performance of the row_number test solution goes down the tubes (on my system) at between 712 and 714 records? (713 sometimes takes ~300ms CPU, other times ~76000ms!)

    P.S. The server has 4 CPUs, hence for 1000 records the elapsed time is only about 25% of the CPU time.
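    One way to see what changes at that boundary (just a sketch, not part of the tests below) is to capture the actual execution plan on either side of the cliff and compare the two:

    -- Sketch: compare actual plans either side of the cliff.
    -- SET STATISTICS XML ON returns the actual plan as XML alongside the
    -- results; run the version 2 query below once with 712 records
    -- generated and once with 714, and diff the two plans.
    set statistics xml on
    -- ... run the version 2 query here ...
    set statistics xml off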

    /*
    100 records generated

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#t'. Scan count 3, logical reads 9, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 16 ms, elapsed time = 29 ms.

    Table '#t'. Scan count 30, logical reads 18, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 46 ms, elapsed time = 35 ms.

    -------------------------------------------
    500 records generated

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#t'. Scan count 3, logical reads 39, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 78 ms, elapsed time = 203 ms.

    Table '#t'. Scan count 30, logical reads 78, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 205 ms, elapsed time = 157 ms.

    ---------------------------------------------
    1000 records generated

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#t'. Scan count 3, logical reads 78, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 297 ms, elapsed time = 716 ms.

    Table '#t'. Scan count 28, logical reads 312, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 8, logical reads 775568, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 152624 ms, elapsed time = 40658 ms.

    --------------------------------------------------
    */

    set statistics io off
    set statistics time off
    set nocount on

    -----------------------------------
    -- create some data
    -- ----------------
    -- This actually models part of a document workflow.
    -- The flow has been simplified to just 3 states.
    -- Documents start in state 'F' and may move to
    -- either 'C' (complete) or 'P' (queried) or may be
    -- amended and stay in state 'F'. Once in state 'P',
    -- the query may be answered, so the document goes
    -- to state 'F', or it may be amended and stay in
    -- state 'P'. The probabilities of moving between
    -- states are arbitrarily set for the model. In
    -- order to simulate repeated queries making it
    -- more likely for the document to be completed,
    -- the probability is increased on each step.
    -- Time intervals are randomly assigned between
    -- 5 minutes and 1 month. In the model, the generated
    -- IDs are sequential. In the real application, they
    -- may not be.
    --
    drop table #t

    create table #t (id int, st char(1), date datetime)

    declare @n int
    declare @st char(1)
    declare @date datetime
    declare @cp float
    declare @p float

    set @n = 1000 -- <<<<<<<<<<<<<<<<< How many ID chains to generate
    print convert(varchar(5),@n)+' records generated'

    while @n > 0
    begin
        set @st = 'F'
        set @date = dateadd(mi,rand()*1440.0*28,convert(datetime,'2008/2/1'))
        insert #t (id,st,date) values(@n,@st,@date)
        set @cp = 0.3
        while @st <> 'C'
        begin
            if @st = 'F'
            begin
                set @p = rand()
                if @p < @cp
                    set @st = 'C'
                else if @p < 0.99
                    set @st = 'P'
            end
            else
            begin
                set @p = rand(1) -- note: rand(1) reseeds on every call and returns the same value each time; rand() was probably intended
                if @p < 0.9
                    set @st = 'F'
            end
            set @date = dateadd(mi,rand()*1440*3+5,@date)
            insert #t (id,st,date) values(@n,@st,@date)
            set @cp = @cp * 1.1
        end
        set @n = @n - 1
    end

    select * from #t

    set nocount off
    set statistics io on
    set statistics time on

    --------------------------------------------
    -- version 1 - triangular join
    ------------------------------
    -- This is the method currently used and
    -- produces the desired result
    --
    select
        id,
        sta,
        min(da) as 'da',
        stb,
        db
    from (
        select
            a.id,
            a.sta,
            a.da,
            b.st as 'stb',
            b.date as db
        from (
            select
                id,
                sta,
                da,
                min(db) as 'db'
            from (
                select
                    a.id,
                    a.st as 'sta',
                    a.date as 'da',
                    b.st as 'stb',
                    min(b.date) as 'db'
                from
                    #t a
                join
                    #t b
                    on a.id = b.id
                    and a.date < b.date
                    and a.st <> b.st
                group by
                    a.id, a.st, a.date, b.st
            ) x
            group by
                id, sta, da
        ) a
        join
            #t b
            on a.id = b.id
            and a.db = b.date
    ) x
    group by
        id, sta, stb, db
    order by
        id, min(da)

    --------------------------------------------
    -- version 2 - using self join on row_number + 1 = row_number
    ------------
    -- this was derived from comments following the article at
    -- http://www.sqlteam.com/article/joining-to-the-next-sequential-row
    --
    ;with t1 as (
        select
            row_number() over(order by id,date) as 'r_id',
            id,
            st,
            date
        from
            #t
    )
    ,t4 as (
        select
            row_number() over(order by id,date) as 'r_id',
            id,
            st,
            date
        from (
            select top 1
                id, st, date
            from
                t1
            order by
                id, date
            union all
            select
                id, st, date
            from (
                select
                    b.id,
                    b.st,
                    b.date
                from
                    t1 a
                join
                    t1 b
                    on a.r_id+1=b.r_id
                where
                    (a.id=b.id and a.st<>b.st)
                    or (a.id<>b.id)
            ) z
        ) x
    )
    select
        a.id,
        a.st as 'sta',
        a.date as 'da',
        b.st as 'stb',
        b.date as 'db'
    from
        t4 a
    join
        t4 b
        on a.r_id+1=b.r_id
        and a.id=b.id
    order by
        id, da

    set statistics time off
    set statistics io off

    Derek

  • I could probably help if you provide the table structure (including indexes), some sample data, and the query you're using.

    Without that, I would hazard a guess that the difference in performance at a specific number of rows is probably due to (a) the statistics being used, (b) the index(es) being used, and/or (c) the join being used.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (4/11/2008)


    I could probably help if you provide the table structure (including indexes), some sample data, and the query you're using.

    Did you read the script? I thought I put everything in there.

    The code section starts with the test results for 100, 500 and 1000 records.

    Then there's a large comment describing what the model is simulating.

    The first executable code creates a temporary table.

    The next step (2 nested loops) fills it with sample data.

    Then there's a query (labelled 'Version 1') which uses a simplified version of the existing query.

    Then finally the last part ('Version 2') tries the new query.

    I've not bothered to mention the 'set statistics...' and 'set nocount...' statements; they occur at the beginning, after the sample creation, and at the end.

    What is missing?

    I haven't tried any indexes yet, as I was so surprised by the difference between the queries, so there aren't any at present. Obviously, they might improve the performance of either version.

    Derek

  • I didn't realize the whole thing was being done on a temp table. Sorry about that.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Can you please post the parameters and settings of your tempdb? Could it be that you are simply hitting an "auto grow" of tempdb during the statement execution?
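    For reference, something like this (a sketch against sys.database_files) lists the current file sizes and growth settings:

    -- Sketch: current tempdb file sizes and autogrow settings.
    -- size and growth are stored in 8 KB pages unless growth is percent-based.
    SELECT name,
           physical_name,
           size * 8 / 1024 AS size_mb,
           CASE WHEN is_percent_growth = 1
                THEN CONVERT(varchar(10), growth) + '%'
                ELSE CONVERT(varchar(10), growth * 8 / 1024) + ' MB'
           END AS growth_setting
    FROM tempdb.sys.database_files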

  • I used your temp table and test data: 9000 rows, 1000 IDs in #t.

    Here's the query I wrote, based on the description you gave in the original post:

    ;with CTE (Row, ID, St, Date) as
    (select dense_rank() over (partition by id order by date),
            id, st, date
     from #t)
    select t1.ID, t1.St, t1.Date, t2.St, t2.Date,
           datediff(minute, t1.date, t2.date)
    from cte t1
    left outer join cte t2
        on t1.id = t2.id
        and t1.row = t2.row - 1
    order by t1.id, t1.date

    SQL Server parse and compile time:
       CPU time = 0 ms, elapsed time = 1 ms.

    (9000 row(s) affected)
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#t'. Scan count 2, logical reads 52, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 78 ms, elapsed time = 292 ms.

    With 100,000 IDs, I had to modify the query, since my desktop computer can't handle displaying the results of a 900,000-row query without running out of memory:

    ;with CTE (Row, ID, St, Date) as
    (select dense_rank() over (partition by id order by date),
            id, st, date
     from #t)
    select count(*)
    from cte t1
    left outer join cte t2
        on t1.id = t2.id
        and t1.row = t2.row - 1

    And got:

    SQL Server parse and compile time:
       CPU time = 16 ms, elapsed time = 24 ms.

    (1 row(s) affected)
    Table '#t'. Scan count 6, logical reads 5144, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 7267 ms, elapsed time = 5087 ms.

    Please test this and see if it gets you the kind of output you need.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Glen Sidelnikov (4/11/2008)


    Can you please post the parameters and settings of your tempdb? Could it be that you are simply hitting an "auto grow" of tempdb during the statement execution?

    I'm at home so can't easily check this (it's Saturday), but I don't see how this could be the case.

    If tempdb were to 'auto grow', wouldn't it then stay extended? The effect I get is that the 712-714 step is reproducible if I do a series of tests:

    712 takes ~300ms
    713 takes ~300ms
    714 takes ~25000ms
    713 takes ~25000ms
    712 takes ~300ms
    713 takes ~300ms
    ...and so on.

    There seems to be some memory effect.

    My suspicion is that it changes choice of execution plan to one that's much worse, and then reuses it on the 'down to 713' test because it's cached. I intend to check this on Monday.
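    A quick way to test that theory (a sketch; both options force the plan to be rebuilt):

    -- Sketch: rule plan caching in or out
    dbcc freeproccache          -- flush all cached plans (server-wide; use with care)
    -- or, less invasively, append to the test query:
    --     option (recompile)   -- force a fresh plan for that statement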

    Derek

    GSquared (4/11/2008)


    I used your temp table and test data: 9000 rows, 1000 IDs in #t.

    Here's the query I wrote, based on the description you gave in the original post:

    ;with CTE (Row, ID, St, Date) as
    (select dense_rank() over (partition by id order by date),
            id, st, date
     from #t)
    select t1.ID, t1.St, t1.Date, t2.St, t2.Date,
           datediff(minute, t1.date, t2.date)
    from cte t1
    left outer join cte t2
        on t1.id = t2.id
        and t1.row = t2.row - 1
    order by t1.id, t1.date

    Sorry, my fault. I've been working with this problem so long that I missed one of the key requirements in the description, although it is in the code of both query versions.

    The only datetime differences that are wanted are those between one state and the next different state.

    So, given an id with the following states...

    insert #t values(37,'F','2008-02-05 09:37')
    insert #t values(37,'F','2008-02-05 09:54')
    insert #t values(37,'P','2008-02-05 10:01')
    insert #t values(37,'P','2008-02-05 10:17')
    insert #t values(37,'P','2008-02-05 10:22')
    insert #t values(37,'F','2008-02-05 10:32')
    insert #t values(37,'F','2008-02-05 10:36')
    insert #t values(37,'C','2008-02-05 10:42')

    ...the result should be

    ID  sta  da                stb  db
    --  ---  ----------------  ---  ----------------
    37  F    2008-02-05 09:37  P    2008-02-05 10:01
    37  P    2008-02-05 10:01  F    2008-02-05 10:32
    37  F    2008-02-05 10:32  C    2008-02-05 10:42

    Your query performs the next step of the process (taking the time differences), but it hasn't eliminated the 'update' entries which don't change state, and eliminating those is what complicates this part of the solution.

    In reality, the differences aren't simple datediffs (which is why I left them out), but actually need to take account of working days (allowing for public holidays in the users' country) and working hours, adjusted for the local timezones. All this logic already works, so it has been omitted.

    I didn't realize the whole thing was being done on a temp table. Sorry about that.

    The real table isn't a temp table, obviously, but it also contains a lot of extra fields that aren't relevant to getting this query to work, such as company code, user name (needed to get the country and timezone), vendor, amount, etc.

    In addition, there are 7 states not 3 (F1, F2, F3, P1, P2, P3, CP) but while substring(st,1,1) will add overhead, it's not relevant to the logic, so I simply stored a char(1) value.

    The full version of the query shown as version 1 is actually at the core of a larger query which does all the time differencing plus adding other required information.

    The real table also has about 15000 records per month and contains at least 15 months' data (currently about 40 months), so the full query takes several minutes to run.

    For the purpose of getting the logic of the query right, I stripped the data down to a subset which illustrates it. This is about as small as I could make it.

    Obviously, I've been here long enough that I don't expect people to write my queries for me; what I'm looking for is some pointers as to the fastest way this can be done.

    I'm quite happy to make the temp table into a permanent one, add indexes, or whatever, if it will help speed up the query. I should then be able to scale it up to the real table and check it works there. Unfortunately, I have a suspicion that the triangular/diagonal join may actually be fastest.
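    For instance, a clustered index supporting the (id, date) ordering that both versions rely on seems the obvious first candidate (untested on my side; the index name is made up):

    -- Sketch (untested): support the (id, date) ordering both queries rely on
    create clustered index ix_t_id_date on #t (id, date)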

    Derek

  • Hi GSquared,

    Thanks for the hint.

    I tried implementing your use of dense_rank instead of row_number and found an omission in my code: in my version 2, CTE t4 should have group by b.id, b.st, b.date just before the close of subquery z. Without it, multiple rows are produced which get eliminated later but add excessive overhead.
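    That is, subquery z becomes:

    -- corrected subquery z in CTE t4
    select
        b.id,
        b.st,
        b.date
    from
        t1 a
    join
        t1 b
        on a.r_id+1=b.r_id
    where
        (a.id=b.id and a.st<>b.st)
        or (a.id<>b.id)
    group by
        b.id,
        b.st,
        b.date
    ) z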

    With that added, and including a version using dense_rank, I get, for 100,000 records:

    (403358 row(s) affected)
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#t'. Scan count 3, logical reads 4278, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 7687 ms, elapsed time = 71055 ms.

    (403358 row(s) affected)
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#t'. Scan count 6, logical reads 5714, physical reads 0, read-ahead reads 3, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 6266 ms, elapsed time = 64598 ms.

    (322812 row(s) affected)
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#t'. Scan count 6, logical reads 5714, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 6094 ms, elapsed time = 54511 ms.

    (This test is on my laptop rather than the server so it takes longer!)

    The conclusion is that dense_rank is slightly faster than row_number, and both are about 20% faster than the triangular/diagonal join.

    On Monday, I'll see if this improves the 'real' code. 🙂

    Derek

  • Try this one...

    ;WITH Yak (id, st, date, rowid, col)
    AS (
        SELECT  id,
                st,
                date,
                ROW_NUMBER() OVER (ORDER BY id, date),
                0
        FROM    #t
    )
    SELECT  id,
            MAX(CASE WHEN col = 0 THEN st ELSE '' END) AS sta,
            MAX(CASE WHEN col = 0 THEN date ELSE '' END) AS da,
            MAX(CASE WHEN col = 1 THEN st ELSE '' END) AS stb,
            MAX(CASE WHEN col = 1 THEN date ELSE '' END) AS db
    FROM    (
                select id, st, date, rowid, col from Yak
                union all
                select id, st, date, rowid - 1, col + 1 from Yak
            ) AS d
    GROUP BY id,
            rowid
    HAVING  count(*) = 2
    ORDER BY rowid

    Results for 1000 sample records

    derek 1 - triangular join
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#t'. Scan count 3, logical reads 78, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 141 ms, elapsed time = 1518 ms.

    derek 2 - cte's
    Table '#t'. Scan count 16, logical reads 208, physical reads 0, read-ahead reads 6, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 4, logical reads 702520, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 55094 ms, elapsed time = 121561 ms.

    peso
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#t'. Scan count 2, logical reads 52, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 47 ms, elapsed time = 565 ms.


    N 56°04'39.16"
    E 12°55'05.25"

  • Sometimes, "old school" works better...

    SET NOCOUNT ON
    SET STATISTICS IO ON
    SET STATISTICS TIME ON

    SELECT IDENTITY(INT,1,1) AS RowNum,
           ID, ST, Date
      INTO #MyHead
      FROM #t
     ORDER BY ID, Date

    SELECT t1.ID,
           t1.ST   AS STa,
           t1.Date AS DateA,
           t2.ST   AS STb,
           t2.Date AS DateB
      FROM #MyHead t1
      LEFT OUTER JOIN #MyHead t2
        ON t1.RowNum+1 = t2.RowNum
     WHERE t2.ST IS NOT NULL
     ORDER BY t1.ID

    SET STATISTICS TIME OFF
    SET STATISTICS IO OFF
    SET NOCOUNT OFF

    DROP TABLE #MyHead

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Add a

    AND t1.id = t2.id

    and you produce the same result too!
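    That is, the join in the "old school" query becomes:

    -- Jeff's join with the ID match added
    ON t1.RowNum + 1 = t2.RowNum
    AND t1.ID = t2.ID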


    N 56°04'39.16"
    E 12°55'05.25"

  • Neither Jeff's nor Peso's query eliminates pairs where the status stays the same. Additionally, as Peso pointed out, Jeff's didn't remove the 'pairs' where the ids don't match. After fixing the ID match in Jeff's query, I get, for 10000 samples...

    derek 1 (diagonal join)
    (39972 row(s) affected)
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#t'. Scan count 3, logical reads 426, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 625 ms, elapsed time = 7687 ms.

    derek 2 (row_number)
    (39972 row(s) affected)
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#t'. Scan count 6, logical reads 576, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 609 ms, elapsed time = 6790 ms.

    derek 3 (dense_rank)
    (39972 row(s) affected)
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#t'. Scan count 6, logical reads 576, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 640 ms, elapsed time = 8898 ms.

    peso
    (41864 row(s) affected)
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#t'. Scan count 2, logical reads 286, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 391 ms, elapsed time = 5726 ms.

    Jeff (copy)
    Table '#t'. Scan count 1, logical reads 143, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 47 ms, elapsed time = 442 ms.
    (51864 row(s) affected)

    Jeff (query)
    (41864 row(s) affected)
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#MyHead'. Scan count 2, logical reads 336, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 172 ms, elapsed time = 4826 ms.

    The number of rows is higher for the last 2 for the reason given.

    At the moment it looks like a join on row_number() over(order by id,date) is fastest.

    Derek

  • Peso (2)

    create table #t (id int not null, st char(1) not null, date datetime not null, seq int null)
    --
    --.... main fill omitted
    --
    DECLARE @Seq INT,
            @Yak INT,
            @st  CHAR(1)

    SET @Seq = -1

    UPDATE  #t
    SET     @Seq = @Seq + 1,
            @Yak = Seq = CASE
                             WHEN st = @st THEN NULL
                             ELSE @Seq
                         END,
            @Seq = CASE
                       WHEN @Yak IS NULL THEN @Seq - 1
                       ELSE @Seq
                   END,
            @st = st

    SELECT  id,
            MAX(CASE WHEN Col = 0 THEN st ELSE '' END) AS sta,
            MAX(CASE WHEN Col = 0 THEN date ELSE '' END) AS da,
            MAX(CASE WHEN Col = 1 THEN st ELSE '' END) AS stb,
            MAX(CASE WHEN Col = 1 THEN date ELSE '' END) AS db
    FROM    (
                SELECT id, st, date, Seq, 0 AS Col
                FROM   #t
                WHERE  Seq IS NOT NULL
                UNION ALL
                SELECT id, st, date, Seq - 1, 1
                FROM   #t
                WHERE  Seq IS NOT NULL
            ) AS d
    GROUP BY id, Seq
    HAVING  COUNT(*) = 2
    ORDER BY id,
            MAX(CASE WHEN Col = 0 THEN date ELSE '' END)

    For 10000 samples...

    SQL Server Execution Times:
       CPU time = 0 ms, elapsed time = 1 ms.
    Table '#t'. Scan count 1, logical reads 171, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 110 ms, elapsed time = 823 ms.

    (52533 row(s) affected)

    (40584 row(s) affected)
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#t'. Scan count 2, logical reads 340, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 313 ms, elapsed time = 6886 ms.

    Including the update, this is 50% faster; excluding the update, it's 100% faster. The only thing I'm dubious about is whether the update will always process the records in the right order; UPDATE...ORDER BY isn't legal SQL, unfortunately. 🙂
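    The usual suggestion for this pattern (which I haven't verified actually guarantees the order) is a clustered index in the required order, plus hints to keep the update to a serial scan:

    -- Sketch (unverified): hints usually suggested for the running-total
    -- style UPDATE; the index name is made up
    create clustered index ix_t_id_date on #t (id, date)

    -- then run the UPDATE as above, written as:
    --     UPDATE #t WITH (TABLOCKX)
    --     SET ...
    --     OPTION (MAXDOP 1)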

    Derek

  • Derek Dongray (4/14/2008)


    Neither Jeff's nor Peso's query eliminates pairs where the status stays the same.

    Sorry... with all the other stuff, I missed that requirement altogether.

    Additionally, as Peso pointed out, Jeff's didn't remove the 'pairs' where the ids don't match. After fixing the ID match in Jeff's query, I get, for 10000 samples...

    Correct... I flat forgot to add that to the criteria.

    So, you all set now or ???

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
