Divide and Conquer - Performance Tuning

  • ClaireM

    SSC Veteran

    Points: 223

    Comments posted to this topic are about the item Divide and Conquer - Performance Tuning

  • a.myasnikov

    SSC Enthusiast

    Points: 147

I wonder if a checksum could be an alternative to the two-step process suggested in the article?

Something to this effect:

;WITH CTE as (
    SELECT worker_id,
        CHECKSUM_AGG(CHECKSUM(rate_code)) as chk
    FROM #worker_rate
    GROUP BY worker_id
)
SELECT worker_id,
    DENSE_RANK() OVER (ORDER BY chk) as SNO
FROM CTE

  • ClaireM

    SSC Veteran

    Points: 223

That would indeed be much faster and simpler if it weren't for the possibility of duplicate checksums.

This link http://ask.sqlservercentral.com/questions/7253/limitations-w-using-checksum-for-comparing-data-wh.html mentions this very problem, and the only way around it is additional processing, which adds back the complexity.
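To see why collisions are more than a theoretical worry: CHECKSUM_AGG is order-independent and folds everything into a single 32-bit value, so by pigeonhole two different rate-code sets must eventually share a checksum. A minimal Python sketch of the idea (SQL Server doesn't document the real CHECKSUM_AGG algorithm; the XOR aggregate here is an illustrative stand-in, and the hash values are made up):

```python
from functools import reduce

def checksum_agg(hashes):
    """Order-independent XOR aggregate, standing in for CHECKSUM_AGG.
    The collision argument holds for any fixed-width aggregate."""
    return reduce(lambda acc, h: acc ^ h, hashes, 0)

# Two different "rate code hash" sets that aggregate to the same value:
set_a = [1, 2, 3]   # 1 ^ 2 ^ 3 == 0
set_b = [0]         # trivially aggregates to 0
print(checksum_agg(set_a) == checksum_agg(set_b))  # True: distinct sets, equal checksum
```

This is why the checksum result would still need a verification pass before the groups could be trusted.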

  • a.myasnikov

    SSC Enthusiast

    Points: 147

    Thank you for the answer, Claire.

    I came to a similar conclusion after a bit of googling, but you were faster.

  • m mcdonald

    Hall of Fame

    Points: 3180

A little off topic, however...

How does one get SQLPrep? Is it even available for public consumption? I don't see a link at http://www.csqls.com/

Thanks

  • ClaireM

    SSC Veteran

    Points: 223

It's not available to buy yet, but if you go to the website, click the "contact us" link and ask to be an alpha tester, you might get to try it early.

  • Dwain Camps

    SSC Guru

    Points: 86873

    Claire,

    Interesting article. I have indeed found cases where divide and conquer can be helpful to solve a really tricky query form. But I'm afraid I'm going to need to point out a few fallacies in your arguments.

1. First off, you're comparing apples to oranges when you compare the timing of building a delimited character string of the rates against that of an encoded integer value. The string concatenation is always going to be slower.

2. Your method of creating the delimited list of rate codes can be improved upon. The reason it was spending so much time in those nasty sorts was (probably) the use of DISTINCT.

SELECT grp=COUNT(*), worker_id, combo=STUFF(
    (
        SELECT ',' + rate_code
        FROM #worker_rate b
        WHERE a.worker_id = b.worker_id
        ORDER BY rate_code
        FOR XML PATH('')
    ), 1, 1, '')
FROM #worker_rate a
GROUP BY worker_id
ORDER BY 1, 2;

3. You can do the same thing as your encoding approach in a single set-based query.

WITH lu AS
(
    SELECT rate_code
        ,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
    FROM
    (
        SELECT rate_code
        FROM #worker_rate
        GROUP BY rate_code
    ) a
)
SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
FROM #worker_rate a
JOIN lu b ON a.rate_code = b.rate_code
GROUP BY worker_id
ORDER BY 1, 2;
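The POWER(2, ...) lookup above is doing set encoding: each distinct rate code becomes a bit flag, so the per-worker SUM is a bitmask that identifies the exact set of codes a worker holds. A small Python sketch of the same idea, with made-up worker data (the worker ids and sets are hypothetical, not the thread's test data):

```python
# Each distinct code gets a power-of-two flag, mirroring
# POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code) - 1); the SUM per
# worker is then a bitmask uniquely identifying the *set* of codes.
worker_rates = {
    101: {"a", "b"},
    102: {"a", "b"},
    103: {"a", "b", "c"},
}

codes = sorted({c for rates in worker_rates.values() for c in rates})
flag = {code: 2 ** i for i, code in enumerate(codes)}   # a=1, b=2, c=4

combo = {w: sum(flag[c] for c in rates) for w, rates in worker_rates.items()}
# Workers with identical rate-code sets get identical combos:
print(combo)  # {101: 3, 102: 3, 103: 7}
```

Because each code contributes a distinct bit, no two different sets can sum to the same combo, which is exactly the property the checksum approach earlier in the thread lacks.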

The timing results I got for this (at 10M rows) were:

Method                                     ElapsedMS
Original concat of rate codes as chars     136593
Improved concat of rate codes as chars     106220
'Optimal' Example of Divide and Conquer    29016
Divide and Conquer but in one step         24006

Note that all this was pre-coffee, so I may have made an error somewhere, but you can check me yourself. Here's the test harness I used.

create table #worker_rate (
    worker_id int
    ,rate_code nvarchar(10) collate database_default
    ,primary key (rate_code, worker_id)
)

declare @rows int = 10000000 --How many rows of data to create

--Create test data
;with tt(x) as (
    select row_number() over (order by (select null)) from
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000000 (n)
)
insert into #worker_rate (worker_id, rate_code)
select top(@rows) tt.x, rc.code
from tt
inner join (values ('a'),('b'),('c')) rc(code)
    on tt.x % 4 <> 0
    or rc.code <> 'c'

DECLARE @StartDT DATETIME = GETDATE();

--Original example
;with c as (
    select distinct w.worker_id, r.combo
    from #worker_rate w
    cross apply (
        select combo = stuff((select ',' + r.rate_code
            from #worker_rate r
            where w.worker_id = r.worker_id
            order by r.rate_code
            for xml path('')),1,1,'')
    ) r
)
select grp = dense_rank() over (order by c.combo), c.worker_id, c.combo
from c
order by 1, 2;

SELECT [Method]='Original concat of rate codes as chars'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();

--Improved example
SELECT grp=COUNT(*), worker_id, combo=STUFF(
    (
        SELECT ',' + rate_code
        FROM #worker_rate b
        WHERE a.worker_id = b.worker_id
        ORDER BY rate_code
        FOR XML PATH('')
    ), 1, 1, '')
FROM #worker_rate a
GROUP BY worker_id
ORDER BY 1, 2;

SELECT [Method]='Improved concat of rate codes as chars'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();

--Optimal example
create table #lookup (
    rate_code nvarchar(10) collate database_default primary key
    ,combo smallint
)

insert into #lookup (rate_code, combo)
select q.rate_code, power(2, q.combo-1)
from (
    select distinct rate_code, combo=dense_rank() over (order by rate_code)
    from #worker_rate
) q

select grp=dense_rank() over (order by combo), worker_id, combo
from (
    select p.worker_id, combo=sum(l.combo)
    from #worker_rate p
    inner join #lookup l on l.rate_code = p.rate_code
    group by p.worker_id
) q
order by 1, 2

SELECT [Method]='''Optimal'' Example of Divide and Conquer'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();

--Divide and conquer in one step
WITH lu AS
(
    SELECT rate_code
        ,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
    FROM
    (
        SELECT rate_code
        FROM #worker_rate
        GROUP BY rate_code
    ) a
)
SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
FROM #worker_rate a
JOIN lu b ON a.rate_code = b.rate_code
GROUP BY worker_id
ORDER BY 1, 2;

SELECT [Method]='Divide and Conquer but in one step'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();
GO

drop table #lookup
drop table #worker_rate


My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.

  • Dwain Camps

    SSC Guru

    Points: 86873

Since there are very few rate codes present, I thought of a way you can have your cake (return delimited strings) and eat it too (get relatively good performance).

1. The all-in-one query:

WITH lu AS
(
    SELECT rate_code
        ,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
    FROM
    (
        SELECT rate_code
        FROM #worker_rate
        GROUP BY rate_code
    ) a
),
UNIQUEnTuples (n, Tuples, ID) AS (
    SELECT combo, CAST(rate_code AS VARCHAR(8000)), rate_code
    FROM lu
    UNION ALL
    SELECT n.n+t.combo, CAST(t.rate_code AS VARCHAR(8000)) + ',' + n.Tuples, rate_code
    FROM UNIQUEnTuples n
    CROSS APPLY (
        SELECT combo, rate_code
        FROM lu t
        WHERE t.rate_code < n.ID) t
)
SELECT grp, worker_id, combo=Tuples
FROM
(
    SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
    FROM #worker_rate a
    JOIN lu b ON a.rate_code = b.rate_code
    GROUP BY worker_id
) a
JOIN UNIQUEnTuples b ON a.combo = b.n
ORDER BY 1, 2;

    2. Divide and conquer (split above into 3 steps)

WITH lu AS
(
    SELECT rate_code
        ,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
    FROM
    (
        SELECT rate_code
        FROM #worker_rate
        GROUP BY rate_code
    ) a
)
SELECT rate_code, combo, combo2=CAST(NULL AS VARCHAR(8000))
INTO #lookup2
FROM lu;

WITH UNIQUEnTuples (n, Tuples, ID) AS
(
    SELECT combo, CAST(rate_code AS VARCHAR(8000)), rate_code
    FROM #lookup2
    UNION ALL
    SELECT n.n+t.combo, CAST(t.rate_code AS VARCHAR(8000)) + ',' + n.Tuples, rate_code
    FROM UNIQUEnTuples n
    CROSS APPLY (
        SELECT combo, rate_code
        FROM #lookup2 t
        WHERE t.rate_code < n.ID) t
)
MERGE #lookup2 t
USING UNIQUEnTuples s
    ON t.combo = s.n
WHEN MATCHED THEN
    UPDATE SET combo2 = s.Tuples
WHEN NOT MATCHED THEN
    INSERT (rate_code, combo, combo2)
    VALUES (s.Tuples, s.n, s.Tuples);

SELECT grp, worker_id, combo=combo2
FROM
(
    SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
    FROM #worker_rate a
    JOIN #lookup2 b ON a.rate_code = b.rate_code
    GROUP BY worker_id
) a
JOIN #lookup2 b ON a.combo = b.combo
ORDER BY 1, 2;

Timings (just my 4 queries at 10M rows):

Method                                     ElapsedMS
Improved concat of rate codes as chars     105503
Divide and Conquer but in one step         24816
Use nTuples to create strings              46926
Divide and Conquer                         30913

    And the test harness:

create table #worker_rate (
    worker_id int
    ,rate_code nvarchar(10) collate database_default
    ,primary key (rate_code, worker_id)
)

declare @rows int = 10000000 --How many rows of data to create

--Create test data
;with tt(x) as (
    select row_number() over (order by (select null)) from
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000000 (n)
)
insert into #worker_rate (worker_id, rate_code)
select top(@rows) tt.x, rc.code
from tt
inner join (values ('a'),('b'),('c')) rc(code)
    on tt.x % 4 <> 0
    or rc.code <> 'c'

DECLARE @StartDT DATETIME = GETDATE();

--Original example (disabled for this run)
--;with c as (
--    select distinct w.worker_id, r.combo
--    from #worker_rate w
--    cross apply (
--        select combo = stuff((select ',' + r.rate_code
--            from #worker_rate r
--            where w.worker_id = r.worker_id
--            order by r.rate_code
--            for xml path('')),1,1,'')
--    ) r
--)
--select grp = dense_rank() over (order by c.combo), c.worker_id, c.combo
--from c
--order by 1, 2;

--SELECT [Method]='Original concat of rate codes as chars'
--    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
--SELECT @StartDT = GETDATE();

SELECT grp=COUNT(*), worker_id, combo=STUFF(
    (
        SELECT ',' + rate_code
        FROM #worker_rate b
        WHERE a.worker_id = b.worker_id
        ORDER BY rate_code
        FOR XML PATH('')
    ), 1, 1, '')
FROM #worker_rate a
GROUP BY worker_id
ORDER BY 1, 2;

SELECT [Method]='Improved concat of rate codes as chars'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();

--Optimal example (disabled for this run)
--create table #lookup (
--    rate_code nvarchar(10) collate database_default primary key
--    ,combo smallint
--)
--insert into #lookup (rate_code, combo)
--select q.rate_code, power(2, q.combo-1)
--from (
--    select distinct rate_code, combo=dense_rank() over (order by rate_code)
--    from #worker_rate
--) q
--select grp=dense_rank() over (order by combo), worker_id, combo
--from (
--    select p.worker_id, combo=sum(l.combo)
--    from #worker_rate p
--    inner join #lookup l on l.rate_code = p.rate_code
--    group by p.worker_id
--) q
--order by 1, 2
--SELECT [Method]='''Optimal'' Example of Divide and Conquer'
--    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
--SELECT @StartDT = GETDATE();

--Divide and conquer in one step
WITH lu AS
(
    SELECT rate_code
        ,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
    FROM
    (
        SELECT rate_code
        FROM #worker_rate
        GROUP BY rate_code
    ) a
)
SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
FROM #worker_rate a
JOIN lu b ON a.rate_code = b.rate_code
GROUP BY worker_id
ORDER BY 1, 2;

SELECT [Method]='Divide and Conquer but in one step'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();

--Use nTuples to create strings
WITH lu AS
(
    SELECT rate_code
        ,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
    FROM
    (
        SELECT rate_code
        FROM #worker_rate
        GROUP BY rate_code
    ) a
),
UNIQUEnTuples (n, Tuples, ID) AS (
    SELECT combo, CAST(rate_code AS VARCHAR(8000)), rate_code
    FROM lu
    UNION ALL
    SELECT n.n+t.combo, CAST(t.rate_code AS VARCHAR(8000)) + ',' + n.Tuples, rate_code
    FROM UNIQUEnTuples n
    CROSS APPLY (
        SELECT combo, rate_code
        FROM lu t
        WHERE t.rate_code < n.ID) t
)
SELECT grp, worker_id, combo=Tuples
FROM
(
    SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
    FROM #worker_rate a
    JOIN lu b ON a.rate_code = b.rate_code
    GROUP BY worker_id
) a
JOIN UNIQUEnTuples b ON a.combo = b.n
ORDER BY 1, 2;

SELECT [Method]='Use nTuples to create strings'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();

--Divide and conquer (3 steps)
WITH lu AS
(
    SELECT rate_code
        ,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
    FROM
    (
        SELECT rate_code
        FROM #worker_rate
        GROUP BY rate_code
    ) a
)
SELECT rate_code, combo, combo2=CAST(NULL AS VARCHAR(8000))
INTO #lookup2
FROM lu;

WITH UNIQUEnTuples (n, Tuples, ID) AS
(
    SELECT combo, CAST(rate_code AS VARCHAR(8000)), rate_code
    FROM #lookup2
    UNION ALL
    SELECT n.n+t.combo, CAST(t.rate_code AS VARCHAR(8000)) + ',' + n.Tuples, rate_code
    FROM UNIQUEnTuples n
    CROSS APPLY (
        SELECT combo, rate_code
        FROM #lookup2 t
        WHERE t.rate_code < n.ID) t
)
MERGE #lookup2 t
USING UNIQUEnTuples s
    ON t.combo = s.n
WHEN MATCHED THEN
    UPDATE SET combo2 = s.Tuples
WHEN NOT MATCHED THEN
    INSERT (rate_code, combo, combo2)
    VALUES (s.Tuples, s.n, s.Tuples);

SELECT grp, worker_id, combo=combo2
FROM
(
    SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
    FROM #worker_rate a
    JOIN #lookup2 b ON a.rate_code = b.rate_code
    GROUP BY worker_id
) a
JOIN #lookup2 b ON a.combo = b.combo
ORDER BY 1, 2;

SELECT [Method]='Divide and Conquer'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();
GO

--drop table #lookup;
drop table #lookup2;
drop table #worker_rate;

    Note that neither of these new queries will do well if you have scores of rate codes.
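That caveat is combinatorial: the recursive UNIQUEnTuples CTE enumerates every non-empty combination of the distinct codes, so its row count is 2^k - 1 for k codes (already over a million rows by k = 20). A quick Python sanity check of that count (a standalone illustration, not the CTE itself):

```python
from itertools import combinations

def n_tuples(codes):
    """Count the subsets the recursive CTE would emit: every
    non-empty combination of the distinct codes."""
    return sum(1 for r in range(1, len(codes) + 1)
                 for _ in combinations(codes, r))

for k in (3, 10, 16):
    codes = [str(i) for i in range(k)]
    assert n_tuples(codes) == 2 ** k - 1
    print(k, n_tuples(codes))   # 3 -> 7, 10 -> 1023, 16 -> 65535
```

So with the thread's three rate codes the lookup stays tiny (7 rows), but the approach stops being practical well before "scores" of codes.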

Links to explain UNIQUEnTuples and how I made it faster.



  • ClaireM

    SSC Veteran

    Points: 223

Sorry for the late reply, but it's been a very busy week for me. I finally got time to look at these posts, and it's always great to see people thinking about how to improve solutions; that's where you get a broader spectrum of knowledge from.

    In response to Post #1559754.

    1. I would prefer apples, pineapples or a tin of tropical fruit and I know which one I prefer landing on my head powered by gravity alone 😉

2. You spotted one of the further optimisations I left out as the article was getting quite large. It's not the DISTINCT that was causing the issue, though; it was the ranking function running over duplicate rows. By moving that to the outer query, as you showed, you gain a little more performance. Well spotted.

    The lookup table query changes to the following to get this performance increase.

insert into #lookup (rate_code, combo)
select q.rate_code, power(2, row_number() over (order by q.rate_code) - 1)
from (select distinct rate_code from #worker_rate) q

3. While it's useful to be able to produce the results in a single query, you lose that important index, so when the number of rate codes increases the performance drops, as you pointed out in your second post. I just wish MS would add a feature to attach indexes to temporary worktables after the fact, doing away with the need to split the query.

So, starting with 1 million rows and the same three rate codes, your single-query approach on my server (a different server this time, so we'll compare both on the same machine) came out at:

1,842ms CPU and 1,177ms elapsed time

The article example in its original form took:

2,356ms CPU and 1,175ms elapsed time (summing the two stages)

The article example with the enhanced lookup table code took:

2,013ms CPU and 1,195ms elapsed time (summing the two stages)

Increasing the number of rate codes (which reduces the number of worker ids, since we are still only taking the top 1 million rows), your single query takes:

966ms CPU and 320ms elapsed time

The enhanced dual-stage query from the article takes:

889ms CPU and 397ms elapsed time

So, great spot with ROW_NUMBER instead of DENSE_RANK, but in my opinion not quite enough of a performance boost to warrant switching to the single query. The result output also becomes non-contiguous, which causes a problem if that matters: you are returning the count of rates for each worker instead of a unique group number showing that each worker in the group has the same rate codes.

grp, which was 1, 2, 3, becomes 3, 13, 14, and two workers with different rate codes but the same number of them will share a group code.

It's always very important to check the output of any optimisation; luckily this utility does that for me, so it's quick to catch, as you can see in the attached image (if it gets attached).
