Divide and Conquer - Performance Tuning

  • ClaireM

    SSC Veteran

    Points: 223

    Comments posted to this topic are about the item Divide and Conquer - Performance Tuning

  • a.myasnikov

    SSC Enthusiast

    Points: 147

I wonder if a checksum could be an alternative to the two-step process suggested in the article?

Something to this effect:

;WITH CTE as (
    SELECT worker_id,
        CHECKSUM_AGG(CHECKSUM(rate_code)) as chk
    FROM #worker_rate
    GROUP BY worker_id
)
SELECT worker_id,
    DENSE_RANK() OVER (ORDER BY chk) as SNO
FROM CTE

  • ClaireM

    SSC Veteran

    Points: 223

That would indeed be much faster and simpler if it weren't for the possibility of duplicate checksums.

This link http://ask.sqlservercentral.com/questions/7253/limitations-w-using-checksum-for-comparing-data-wh.html mentions this very problem, and the only way around it is additional processing, which adds back the complexity.
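To see why collisions are more than a theoretical worry: CHECKSUM_AGG is order-independent and folds everything into a single 32-bit value, so by pigeonhole two different rate-code sets must eventually share a checksum. A minimal Python sketch of the idea (SQL Server doesn't document the real CHECKSUM_AGG algorithm; the XOR aggregate here is an illustrative stand-in, and the hash values are made up):

```python
from functools import reduce

def checksum_agg(hashes):
    """Order-independent XOR aggregate, standing in for CHECKSUM_AGG.
    The collision argument holds for any fixed-width aggregate."""
    return reduce(lambda acc, h: acc ^ h, hashes, 0)

# Two different "rate code hash" sets that aggregate to the same value:
set_a = [1, 2, 3]   # 1 ^ 2 ^ 3 == 0
set_b = [0]         # trivially aggregates to 0
print(checksum_agg(set_a) == checksum_agg(set_b))  # True: distinct sets, equal checksum
```

This is why the checksum result would still need a verification pass before the groups could be trusted.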

  • a.myasnikov

    SSC Enthusiast

    Points: 147

    Thank you for the answer, Claire.

    I came to a similar conclusion after a bit of googling, but you were faster.

  • m mcdonald

    Hall of Fame

    Points: 3180

A little off topic, however...

How does one get SQLPrep? Is it even available for public consumption? I don't see a link at http://www.csqls.com/

Thanks

  • ClaireM

    SSC Veteran

    Points: 223

It's not available to buy yet, but if you go to the website, click the "contact us" link and ask to be an alpha tester, you might get to try it early.

  • Dwain Camps

    SSC Guru

    Points: 86873

    Claire,

    Interesting article. I have indeed found cases where divide and conquer can be helpful to solve a really tricky query form. But I'm afraid I'm going to need to point out a few fallacies in your arguments.

1. First off, you're comparing apples to oranges when you compare the timing of building a delimited character string of the rates against that of an encoded integer value. The string concatenation is always going to be slower.

2. Your method of creating the delimited list of rate codes can be improved upon. The reason it was spending so much time in those nasty sorts was (probably) the use of DISTINCT.

SELECT grp=COUNT(*), worker_id, combo=STUFF(
    (
        SELECT ',' + rate_code
        FROM #worker_rate b
        WHERE a.worker_id = b.worker_id
        ORDER BY rate_code
        FOR XML PATH('')
    ), 1, 1, '')
FROM #worker_rate a
GROUP BY worker_id
ORDER BY 1, 2;

3. You can do the same thing as your encoding approach in a single set-based query.

WITH lu AS
(
    SELECT rate_code
        ,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
    FROM
    (
        SELECT rate_code
        FROM #worker_rate
        GROUP BY rate_code
    ) a
)
SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
FROM #worker_rate a
JOIN lu b ON a.rate_code = b.rate_code
GROUP BY worker_id
ORDER BY 1, 2;
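The POWER(2, ...) lookup above is doing set encoding: each distinct rate code becomes a bit flag, so the per-worker SUM is a bitmask that identifies the exact set of codes a worker holds. A small Python sketch of the same idea, with made-up worker data (the worker ids and sets are hypothetical, not the thread's test data):

```python
# Each distinct code gets a power-of-two flag, mirroring
# POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code) - 1); the SUM per
# worker is then a bitmask uniquely identifying the *set* of codes.
worker_rates = {
    101: {"a", "b"},
    102: {"a", "b"},
    103: {"a", "b", "c"},
}

codes = sorted({c for rates in worker_rates.values() for c in rates})
flag = {code: 2 ** i for i, code in enumerate(codes)}   # a=1, b=2, c=4

combo = {w: sum(flag[c] for c in rates) for w, rates in worker_rates.items()}
# Workers with identical rate-code sets get identical combos:
print(combo)  # {101: 3, 102: 3, 103: 7}
```

Because each code contributes a distinct bit, no two different sets can sum to the same combo, which is exactly the property the checksum approach earlier in the thread lacks.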

The timing results I got for this (at 10M rows) were:

Method                                     ElapsedMS
Original concat of rate codes as chars     136593
Improved concat of rate codes as chars     106220
'Optimal' Example of Divide and Conquer    29016
Divide and Conquer but in one step         24006

Note that all this was pre-coffee, so I may have made an error somewhere, but you can check me yourself. Here's the test harness I used.

create table #worker_rate (
    worker_id int
    ,rate_code nvarchar(10) collate database_default
    ,primary key (rate_code, worker_id)
)

declare @rows int = 10000000 --How many rows of data to create

--Create test data
;with tt(x) as (
    select row_number() over (order by (select null)) from
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000000 (n)
)
insert into #worker_rate (worker_id, rate_code)
select top(@rows) tt.x, rc.code
from tt
inner join (values ('a'),('b'),('c')) rc(code)
    on tt.x % 4 <> 0
    or rc.code <> 'c'

DECLARE @StartDT DATETIME = GETDATE();

--Original example
;with c as (
    select distinct w.worker_id, r.combo
    from #worker_rate w
    cross apply (
        select combo = stuff((select ',' + r.rate_code
            from #worker_rate r
            where w.worker_id = r.worker_id
            order by r.rate_code
            for xml path('')),1,1,'')
    ) r
)
select grp = dense_rank() over (order by c.combo), c.worker_id, c.combo
from c
order by 1, 2;

SELECT [Method]='Original concat of rate codes as chars'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();

--Improved example
SELECT grp=COUNT(*), worker_id, combo=STUFF(
    (
        SELECT ',' + rate_code
        FROM #worker_rate b
        WHERE a.worker_id = b.worker_id
        ORDER BY rate_code
        FOR XML PATH('')
    ), 1, 1, '')
FROM #worker_rate a
GROUP BY worker_id
ORDER BY 1, 2;

SELECT [Method]='Improved concat of rate codes as chars'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();

--Optimal example
create table #lookup (
    rate_code nvarchar(10) collate database_default primary key
    ,combo smallint
)

insert into #lookup (rate_code, combo)
select q.rate_code, power(2, q.combo-1)
from (
    select distinct rate_code, combo=dense_rank() over (order by rate_code)
    from #worker_rate
) q

select grp=dense_rank() over (order by combo), worker_id, combo
from (
    select p.worker_id, combo=sum(l.combo)
    from #worker_rate p
    inner join #lookup l on l.rate_code = p.rate_code
    group by p.worker_id
) q
order by 1, 2

SELECT [Method]='''Optimal'' Example of Divide and Conquer'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();

--Divide and conquer in one step
WITH lu AS
(
    SELECT rate_code
        ,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
    FROM
    (
        SELECT rate_code
        FROM #worker_rate
        GROUP BY rate_code
    ) a
)
SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
FROM #worker_rate a
JOIN lu b ON a.rate_code = b.rate_code
GROUP BY worker_id
ORDER BY 1, 2;

SELECT [Method]='Divide and Conquer but in one step'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();
GO

drop table #lookup
drop table #worker_rate


My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.

  • Dwain Camps

    SSC Guru

    Points: 86873

Since there are very few rate codes present, I thought of a way you can have your cake (return delimited strings) and eat it too (get relatively good performance).

1. The all-in-one query:

WITH lu AS
(
    SELECT rate_code
        ,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
    FROM
    (
        SELECT rate_code
        FROM #worker_rate
        GROUP BY rate_code
    ) a
),
UNIQUEnTuples (n, Tuples, ID) AS (
    SELECT combo, CAST(rate_code AS VARCHAR(8000)), rate_code
    FROM lu
    UNION ALL
    SELECT n.n+t.combo, CAST(t.rate_code AS VARCHAR(8000)) + ',' + n.Tuples, rate_code
    FROM UNIQUEnTuples n
    CROSS APPLY (
        SELECT combo, rate_code
        FROM lu t
        WHERE t.rate_code < n.ID) t
)
SELECT grp, worker_id, combo=Tuples
FROM
(
    SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
    FROM #worker_rate a
    JOIN lu b ON a.rate_code = b.rate_code
    GROUP BY worker_id
) a
JOIN UNIQUEnTuples b ON a.combo = b.n
ORDER BY 1, 2;

    2. Divide and conquer (split above into 3 steps)

WITH lu AS
(
    SELECT rate_code
        ,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
    FROM
    (
        SELECT rate_code
        FROM #worker_rate
        GROUP BY rate_code
    ) a
)
SELECT rate_code, combo, combo2=CAST(NULL AS VARCHAR(8000))
INTO #lookup2
FROM lu;

WITH UNIQUEnTuples (n, Tuples, ID) AS
(
    SELECT combo, CAST(rate_code AS VARCHAR(8000)), rate_code
    FROM #lookup2
    UNION ALL
    SELECT n.n+t.combo, CAST(t.rate_code AS VARCHAR(8000)) + ',' + n.Tuples, rate_code
    FROM UNIQUEnTuples n
    CROSS APPLY (
        SELECT combo, rate_code
        FROM #lookup2 t
        WHERE t.rate_code < n.ID) t
)
MERGE #lookup2 t
USING UNIQUEnTuples s
    ON t.combo = s.n
WHEN MATCHED THEN
    UPDATE SET combo2 = s.Tuples
WHEN NOT MATCHED THEN
    INSERT (rate_code, combo, combo2)
    VALUES (s.Tuples, s.n, s.Tuples);

SELECT grp, worker_id, combo=combo2
FROM
(
    SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
    FROM #worker_rate a
    JOIN #lookup2 b ON a.rate_code = b.rate_code
    GROUP BY worker_id
) a
JOIN #lookup2 b ON a.combo = b.combo
ORDER BY 1, 2;

Timings (just my 4 queries at 10M rows):

Method                                     ElapsedMS
Improved concat of rate codes as chars     105503
Divide and Conquer but in one step         24816
Use nTuples to create strings              46926
Divide and Conquer                         30913

    And the test harness:

create table #worker_rate (
    worker_id int
    ,rate_code nvarchar(10) collate database_default
    ,primary key (rate_code, worker_id)
)

declare @rows int = 10000000 --How many rows of data to create

--Create test data
;with tt(x) as (
    select row_number() over (order by (select null)) from
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000000 (n),
        (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000000 (n)
)
insert into #worker_rate (worker_id, rate_code)
select top(@rows) tt.x, rc.code
from tt
inner join (values ('a'),('b'),('c')) rc(code)
    on tt.x % 4 <> 0
    or rc.code <> 'c'

DECLARE @StartDT DATETIME = GETDATE();

--Original example (disabled for this run)
--;with c as (
--    select distinct w.worker_id, r.combo
--    from #worker_rate w
--    cross apply (
--        select combo = stuff((select ',' + r.rate_code
--            from #worker_rate r
--            where w.worker_id = r.worker_id
--            order by r.rate_code
--            for xml path('')),1,1,'')
--    ) r
--)
--select grp = dense_rank() over (order by c.combo), c.worker_id, c.combo
--from c
--order by 1, 2;

--SELECT [Method]='Original concat of rate codes as chars'
--    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
--SELECT @StartDT = GETDATE();

SELECT grp=COUNT(*), worker_id, combo=STUFF(
    (
        SELECT ',' + rate_code
        FROM #worker_rate b
        WHERE a.worker_id = b.worker_id
        ORDER BY rate_code
        FOR XML PATH('')
    ), 1, 1, '')
FROM #worker_rate a
GROUP BY worker_id
ORDER BY 1, 2;

SELECT [Method]='Improved concat of rate codes as chars'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();

--Optimal example (disabled for this run)
--create table #lookup (
--    rate_code nvarchar(10) collate database_default primary key
--    ,combo smallint
--)
--insert into #lookup (rate_code, combo)
--select q.rate_code, power(2, q.combo-1)
--from (
--    select distinct rate_code, combo=dense_rank() over (order by rate_code)
--    from #worker_rate
--) q
--select grp=dense_rank() over (order by combo), worker_id, combo
--from (
--    select p.worker_id, combo=sum(l.combo)
--    from #worker_rate p
--    inner join #lookup l on l.rate_code = p.rate_code
--    group by p.worker_id
--) q
--order by 1, 2
--SELECT [Method]='''Optimal'' Example of Divide and Conquer'
--    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
--SELECT @StartDT = GETDATE();

--Divide and conquer in one step
WITH lu AS
(
    SELECT rate_code
        ,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
    FROM
    (
        SELECT rate_code
        FROM #worker_rate
        GROUP BY rate_code
    ) a
)
SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
FROM #worker_rate a
JOIN lu b ON a.rate_code = b.rate_code
GROUP BY worker_id
ORDER BY 1, 2;

SELECT [Method]='Divide and Conquer but in one step'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();

--Use nTuples to create strings
WITH lu AS
(
    SELECT rate_code
        ,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
    FROM
    (
        SELECT rate_code
        FROM #worker_rate
        GROUP BY rate_code
    ) a
),
UNIQUEnTuples (n, Tuples, ID) AS (
    SELECT combo, CAST(rate_code AS VARCHAR(8000)), rate_code
    FROM lu
    UNION ALL
    SELECT n.n+t.combo, CAST(t.rate_code AS VARCHAR(8000)) + ',' + n.Tuples, rate_code
    FROM UNIQUEnTuples n
    CROSS APPLY (
        SELECT combo, rate_code
        FROM lu t
        WHERE t.rate_code < n.ID) t
)
SELECT grp, worker_id, combo=Tuples
FROM
(
    SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
    FROM #worker_rate a
    JOIN lu b ON a.rate_code = b.rate_code
    GROUP BY worker_id
) a
JOIN UNIQUEnTuples b ON a.combo = b.n
ORDER BY 1, 2;

SELECT [Method]='Use nTuples to create strings'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();

--Divide and conquer (3 steps)
WITH lu AS
(
    SELECT rate_code
        ,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
    FROM
    (
        SELECT rate_code
        FROM #worker_rate
        GROUP BY rate_code
    ) a
)
SELECT rate_code, combo, combo2=CAST(NULL AS VARCHAR(8000))
INTO #lookup2
FROM lu;

WITH UNIQUEnTuples (n, Tuples, ID) AS
(
    SELECT combo, CAST(rate_code AS VARCHAR(8000)), rate_code
    FROM #lookup2
    UNION ALL
    SELECT n.n+t.combo, CAST(t.rate_code AS VARCHAR(8000)) + ',' + n.Tuples, rate_code
    FROM UNIQUEnTuples n
    CROSS APPLY (
        SELECT combo, rate_code
        FROM #lookup2 t
        WHERE t.rate_code < n.ID) t
)
MERGE #lookup2 t
USING UNIQUEnTuples s
    ON t.combo = s.n
WHEN MATCHED THEN
    UPDATE SET combo2 = s.Tuples
WHEN NOT MATCHED THEN
    INSERT (rate_code, combo, combo2)
    VALUES (s.Tuples, s.n, s.Tuples);

SELECT grp, worker_id, combo=combo2
FROM
(
    SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
    FROM #worker_rate a
    JOIN #lookup2 b ON a.rate_code = b.rate_code
    GROUP BY worker_id
) a
JOIN #lookup2 b ON a.combo = b.combo
ORDER BY 1, 2;

SELECT [Method]='Divide and Conquer'
    ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();
GO

--drop table #lookup;
drop table #lookup2;
drop table #worker_rate;

    Note that neither of these new queries will do well if you have scores of rate codes.
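That caveat is combinatorial: the recursive UNIQUEnTuples CTE enumerates every non-empty combination of the distinct codes, so its row count is 2^k - 1 for k codes (already over a million rows by k = 20). A quick Python sanity check of that count (a standalone illustration, not the CTE itself):

```python
from itertools import combinations

def n_tuples(codes):
    """Count the subsets the recursive CTE would emit: every
    non-empty combination of the distinct codes."""
    return sum(1 for r in range(1, len(codes) + 1)
                 for _ in combinations(codes, r))

for k in (3, 10, 16):
    codes = [str(i) for i in range(k)]
    assert n_tuples(codes) == 2 ** k - 1
    print(k, n_tuples(codes))   # 3 -> 7, 10 -> 1023, 16 -> 65535
```

So with the thread's three rate codes the lookup stays tiny (7 rows), but the approach stops being practical well before "scores" of codes.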

Links to explain UNIQUEnTuples and how I made it faster.



  • ClaireM

    SSC Veteran

    Points: 223

Sorry for the late reply, but it's been a very busy week for me. I finally got time to look at these posts, and it's always great to see people thinking about how to improve solutions; that's where you get a broader spectrum of knowledge from.

    In response to Post #1559754.

    1. I would prefer apples, pineapples or a tin of tropical fruit and I know which one I prefer landing on my head powered by gravity alone 😉

2. You spotted one of the further optimisations I left out as the article was getting quite large. It's not the DISTINCT that was causing the issue, though; it was the ranking function running over duplicate rows. By moving that to the outer query, as you showed, you gain a little more performance. Well spotted.

    The lookup table query changes to the following to get this performance increase.

insert into #lookup (rate_code, combo)
select q.rate_code, power(2, row_number() over (order by q.rate_code) - 1)
from (select distinct rate_code from #worker_rate) q

3. While it's useful to be able to produce the results in a single query, you lose that important index, so when the number of rate codes increases the performance drops, as you pointed out in your second post. I just wish MS would add a feature to attach indexes to temporary worktables after the fact, doing away with the need to split the query.

So, starting with 1 million rows and the same three rate codes, your single-query approach on my server (a different server this time, so we'll compare both on the same machine) came out at:

1,842ms CPU and 1,177ms elapsed time

The article example in its original form took:

2,356ms CPU and 1,175ms elapsed time (summing the two stages)

The article example with the enhanced lookup table code took:

2,013ms CPU and 1,195ms elapsed time (summing the two stages)

Increasing the number of rate codes (which reduces the number of worker ids, since we are still only taking the top 1 million rows), your single query takes:

966ms CPU and 320ms elapsed time

The enhanced dual-stage query from the article takes:

889ms CPU and 397ms elapsed time

So, great spot with ROW_NUMBER instead of DENSE_RANK, but in my opinion not quite enough of a performance boost to warrant switching to the single query. The result output also becomes non-contiguous, which causes a problem if that matters: you are returning the count of rates for each worker instead of a unique group number showing that each worker in the group has the same rate codes.

grp, which was 1, 2, 3, becomes 3, 13, 14, and two workers with different rate codes but the same number of them will share a group code.

It's always very important to check the output of any optimisation; luckily this utility does that for me, so it's quick to catch, as you can see in the attached image (if it gets attached).
