
Divide and Conquer - Performance Tuning


An interesting example came up recently on SQL Server Central that required a combination of grouping and pivoting, and it ended up in some T-SQL that could cripple your server. The example link is here. It required grouping together all workers that shared the same combination of rate codes, regardless of the order of those codes. I’ll try to explain why the posted solution can cause problems and how to avoid them, while demonstrating that sometimes you have to divide a problem in order to conquer it.

Preparation

Firstly, let’s set up the example data in such a way that we can scale it to truly put it through its paces. The code below creates our example data to query.

declare @worker_rate table (
  worker_id int
  ,rate_code nvarchar(10)
)
declare @rows int = 10000 --How many rows of data to create
--Create test data
;with tt(x) as (
  select row_number() over (order by (select null)) from
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100000 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000000 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000000 (n)
)
insert into @worker_rate (worker_id, rate_code)
select top(@rows) tt.x, rc.code
from tt
inner join (values ('a'),('b'),('c')) rc(code) 
  on (tt.x % 4) <> 0 
  or rc.code <> 'c'

This creates 10,000 rows in the table, with every worker having all of the rate codes except that every fourth worker does not have rate code “c”. A small sample is given here.

Worker_id    Rate_Codes
1            a, b, c
2            a, b, c
3            a, b, c
4            a, b
5            a, b, c
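
If you want to sanity check the distribution for yourself, a quick query along the following lines counts how many workers have two codes and how many have three. This is a sketch only; because @worker_rate is a table variable it must be run in the same batch as the setup code above.

--Sanity check (sketch): number of workers per rate-code count
select codes_per_worker = q.cnt, workers = count(*)
from (
  select worker_id, cnt = count(*)
  from @worker_rate
  group by worker_id
) q
group by q.cnt
order by q.cnt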

I’m using one of the fastest ways of generating rows: TOP combined with a Tally Table that only joins in the next level of rows when it needs to. This has been covered in many articles, so I won’t go into detail here other than to say that it is fast when used correctly.

Setting the scene

The task was to group all workers who had the same combination of rate codes, the order of the rate codes being irrelevant. The solution presented was this.

;With cte AS (
  select distinct t1.worker_id,AllRateCodes.SetOfCodes from @worker_rate t1
  cross apply
  (
    SELECT SetOfCodes = STUFF((SELECT ',' + t2.rate_code
    from @worker_rate t2
    where t1.worker_id = t2.worker_id
    ORDER BY rate_code
    FOR XML PATH('')),1,1,'')
  ) AllRateCodes
)
SELECT t1.worker_id, t1.SetOfCodes, DENSE_RANK() OVER (ORDER BY t1.SetOfCodes) AS SNO
FROM cte t1
ORDER BY t1.SetOfCodes

What this query returns for just the first four workers of the example data is shown here. It gives the first three workers the same SNO, which effectively groups them together as they have the same combination of rate codes.

Worker_id    SetOfCodes    SNO
1            a,b,c         1
2            a,b,c         1
3            a,b,c         1
4            a,b           2

Now this is indeed a clever piece of SQL: it uses CROSS APPLY with the FOR XML PATH method to generate an ordered list of comma-separated rate codes for each worker, then groups by that list to achieve the desired results. However, a few problems exist, and I’ll try to explain what they are so you can avoid them. Some were set up by me to highlight known issues with SQL Server.

Investigation of existing solution

First we look at what happens with the original query when we use 10,000 rows. The Actual Execution Plan comes back as shown in the diagram below. It’s really important to understand the Actual Execution Plan (AEP) as it tells you exactly what is happening behind the scenes. I’m using a nice little utility called SQLPRep, which takes your T-SQL and produces a performance report covering all the inner workings of your T-SQL, including any called routines. It’s the AEPs that we are interested in.

The image shows the AEP taken from the report, which is formatted to show key areas. On the left is the number of times an operation is executed followed by how many rows are involved, so “10x50r” means the operation runs ten times over 50 rows. The percentage next to these tells us how much elapsed time is spent on that particular operation as a portion of the whole time, shown underlined at the bottom. Each operation is shown in bold text; if it is coloured then we should pay particular attention to it, as it could be a cause of serious slowdowns.

The AEP is a tree structure with each line operating on the result of the line below it or, in the case of a Nested Loops operation, the two lines below it that are one level deeper; so in the example below the Table Scan and the UDX are joined together in the Nested Loops operation. A good place to start when trying to understand AEPs is this great book, and you may find the diagrammatic form easier to read when starting out, although it is more time consuming once you know what you are looking for at a glance.

The final line tells us how much CPU time on the server was taken up processing this T-SQL and how much elapsed time it actually took to run overall.
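
If you want to capture the same CPU and elapsed figures for your own queries without a report tool, SET STATISTICS TIME is the standard way; the timings quoted later in this article are in the same format that it produces in the Messages tab. A minimal sketch:

set statistics time on
--run the query being measured here
set statistics time off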

Most of the time is taken up by the two sort operations: 78% of the total time, which is over three minutes (188,632 ms, or just over 188 seconds). Alarm bells start ringing. We can see that it is spending the majority of its time trying to sort through many, many rows, as it is showing 10,000 executions over 100 million rows according to the highlighted lines above. This gives us a good idea of what is going wrong: SQL Server thinks it is faster to repeatedly scan the table than to do as few reads as possible. We started with a table variable, which is only designed to work with small numbers of rows; table variables do not have statistics to allow efficient handling, so SQL Server assumes there is only one row in the table and that it can be quickly read. I admit that this was my mistake, included to show you how bad things can get with quite simple mistakes.

Adjusting the angle of approach

So we now change to using a temporary table instead, which has table statistics allowing the query optimiser to choose a better plan. Care should be taken with any character types in temporary tables in case tempdb has a different collation; the “collate database_default” clause ensures that the column’s collation matches the current database.

create table #worker_rate (
  worker_id int
  ,rate_code nvarchar(10) collate database_default
)

This now produces the AEP shown below and we can see that the bottom Table Scan has been reduced to only being scanned once for every row.

Wow, now that is more like it. Nearly a thousand times faster, but a good SQL sniffer knows not to stop at the first treat as there may be more, so let’s keep sniffing.

The huge 77% of the time being spent sorting, highlighted in the AEP above, tells us that we are lacking indexes, and sure enough, in our haste we forgot to put a sensible index on the data. So we will go for a clustered index, and our code now looks like this. Please excuse the coding style, but everyone has their own way of making things easy to read … for them!

create table #worker_rate (
  worker_id int
  ,rate_code nvarchar(10) collate database_default
  primary key (worker_id,rate_code)
)
declare @rows int = 10000 --How many rows of data to create
--Create test data
;with tt(x) as (
  select row_number() over (order by (select null)) from
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100000 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000000 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000000 (n)
)
insert into #worker_rate (worker_id, rate_code)
select top(@rows) tt.x, rc.code
from tt
inner join (values ('a'),('b'),('c')) rc(code) 
  on (tt.x%4) <> 0 
  or rc.code <> 'c'

As you can see above, we have added a primary key, which is clustered by default, and set the indexed fields to be worker_id and rate_code, both defaulting to ascending order. We are ready to try it again, and the AEP produced is shown below, where you can see a few things. First, we have lost that sticky sort completely, as it has been replaced by the index seek, which is ordered correctly by default. This has the extra benefit of reducing the CPU time required for sorting, which is now down to 94ms. Most of our time is now spent on an index seek while doing the XML operation, with a little on that final sort (remember to read the AEP from bottom to top).

This last change nearly halved the elapsed time and most of that is spent retrieving the data with a clustered seek. Most would stop there but I’m curious to see how this handles some big data (not to be confused with Big Data).

Always think big

Let’s try a paltry million rows. Luckily, all I have to do is change the @rows variable to 1,000,000 and we can see how it performs. It’s nice to know we planned ahead.

We can see in the above AEP that the query takes about four seconds to run, but I’m just not happy with those stats. 89% of the total elapsed time is spent sorting a long string, which is just nasty. We can also see that while building the XML it is running the clustered index seek a million times. It’s worth noting that SQL Server has now switched to a parallel plan, so we should be wary, as it can easily hog all of your server’s processors, leaving a multi-user system struggling to cope with high demand. If we don’t want it to run wild on our server we would slap it with a hint, and that hint is MAXDOP, which is explained in detail in this link; effectively it tries to limit how much of the server’s processor resource your query can hog. I say tries because it doesn’t always work as expected; read the provided link for more details.
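
As a minimal sketch of how the hint is applied (the value of 2 here is arbitrary and just for illustration; pick a limit appropriate to your own server), it is simply appended to the statement:

--Limit this statement to at most two parallel schedulers (illustrative value only)
select worker_id, rate_code
from #worker_rate
order by worker_id, rate_code
option (maxdop 2)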

Identifying where to crack first

Here we hit what is, in my opinion, one of the biggest disappointments about the SQL Server optimiser. Having generated a nice temporary result set containing our list of rate codes, we now need to group it, and that requires a sort which is being run on an unindexed result set. This is slow and CPU hungry. The optimiser can’t put a sensible clustered index on our temporary result set mid-query, nor let us define one ourselves. Maybe one day we won’t have to cope with this manually, but until that day we can resolve it using two techniques that are reasonably easy to implement and can reap tremendous benefits when used in the right situation.

The first is to encode the nvarchar string into a faster form by using a bit lookup table. As we can have any combination of rate codes per worker, we assign each unique rate code its own bit, which we can then simply sum to get a single number for comparison. A smallint holds 16 bits, so as long as we don’t have more than 15 rate codes (leaving the sign bit alone) we can use this initial method, but I will show you later how to extend it to cope with more codes; it’s fairly easy.
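
To make that concrete with the three rate codes in our example, “a” gets bit value 1 (001), “b” gets 2 (010) and “c” gets 4 (100), so each combination sums to a unique number. This is an illustration only, not part of the final code:

--Illustration only: each rate code owns one bit, so the sums are unique per combination
select combination = 'a,b,c', encoded = 1 + 2 + 4  --binary 111 = 7
union all
select combination = 'a,b',   encoded = 1 + 2      --binary 011 = 3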

The second is to get around that missing index by splitting the query into the worker part and the results part, using a temporary table with a suitable index on it to hold the intermediate result set. This table is created very quickly and is then very fast to query for the results part.

Just to be efficient we will merge the two methods above into a single lookup table and here is the new code.

Create the lookup table first, remembering to use a temporary table with a suitable index and collation.

create table #lookup (
  rate_code nvarchar(10) collate database_default primary key
  ,combo smallint
)

Now populate the lookup table, assigning each unique rate code a specific bit value using the POWER function.

insert into #lookup (rate_code,combo)
select q.rate_code, power(2, q.combo-1) 
from (
  select distinct rate_code, combo=dense_rank() over (order by rate_code) 
  from #worker_rate
) q

Finally we have the results part, which matches on the sum of each worker’s rate bits and groups all the same values together. This means a smallint sort instead of an nvarchar sort, which is much quicker. The sum works because it adds all of the rate bit values to make one single smallint, with each bit relating to a rate code. For rates “a”, “b” and “c” we have 001 = “a”, 010 = “b” and 100 = “c”. Sum them together to get 111 = “abc”.

select grp=dense_rank() over (order by combo), worker_id, combo
from (
  select p.worker_id, combo=sum(l.combo)
  from #worker_rate p
  inner join #lookup l on l.rate_code = p.rate_code
  group by p.worker_id
) q
order by 1

We are ordering by the first column, so we don’t have to specify its name and the code is neater. We have to remember to drop the temporary table after we have finished with it.

drop table #lookup

We now have two AEPs to get our final results: the first is building the lookup table and the second is the grouping query. As can be seen from the final results, adding the two elapsed times together we are now running in less than half the time of the already optimised version, and using a fifth of the CPU time. We can see that the sort has been reduced and, more importantly, the number of times we execute each operation is down to at most six instead of a million.

But that sort is still taking a whopping 73% of the time, so one last push is to swap the index to put the rate code first, as that is where we are losing time. Doing that gives us the following, which shows that the time-sapping sort has gone and the elapsed time is almost halved again to 934ms in total, with only a third of the CPU time being used.
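
For reference, the only change is the column order of the primary key, which now leads on rate_code (this is the definition used in the final script at the end of the article):

create table #worker_rate (
  worker_id int
  ,rate_code nvarchar(10) collate database_default
  primary key (rate_code,worker_id)
)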

I mean BIG

Now I want to make sure we have this well and truly nailed down so we’ll increase the row count to 10 million rows. The same Actual Execution Plans are returned and the timings now come out as follows …

  •  Original method        CPU time = 175,409 ms, elapsed time = 44,841 ms
  •  Lookup table creation  CPU time = 3,229 ms,   elapsed time = 3,226 ms
  •  Using the lookup       CPU time = 13,354 ms,  elapsed time = 7,618 ms

The new query now runs in a quarter of the time of the original, uses a tenth of the server’s CPU time and takes only eleven seconds to process ten million rows. Not bad; I think we deserve a chocolate biscuit for this.

By dividing the query into two stages we simply take more control of the intermediate step, as we know SQL Server isn’t always great at making the right choices or putting the right conditions in place for optimal performance.

Know when to quit

I’m sure this can be tweaked just a little more, but when optimising you have to weigh the time spent optimising against the time gained. If you only gain a 5% improvement then you have just spent a lot of time for very little gain. This optimisation quartered the elapsed time, tenthed (excuse the new word) the CPU cost and significantly reduced the resources required, as well as demonstrating a number of optimisation techniques to add to your arsenal.

Thinking ahead

Now, to enhance this to cope with more rate codes than a smallint can hold, we can just use a binary type instead, as shown here.

create table #lookup (
  rate_code nvarchar(10) collate database_default primary key
  ,combo binary(32)
)

We are using a binary type which we can adjust to however many bytes we need to contain the unique rate codes.

declare @binary32 char(32) = replicate(0x0,32)

We need to create a buffer as the binary operations only work with an initialised buffer.

insert into #lookup (rate_code,combo)
select q.rate_code
  ,convert(binary(32),stuff(@binary32,q.combo/8+1,1,cast(power(2,q.combo%8) as binary(1))))
from (
  select distinct rate_code, combo=dense_rank() over (order by rate_code) - 1
  from #worker_rate
) q

As before we convert the rank to a bit value, but we also use a byte offset so that we set the bit within a specific byte of our buffer.

select grp=dense_rank() over (order by combo,combo2,combo3,combo4)
  ,worker_id
  ,cast(combo as binary(8))
  ,combo2
  ,combo3
  ,combo4
from (
  select p.worker_id
    ,combo=sum(cast(substring(cte.combo,1,8) as bigint))
    ,combo2=sum(cast(substring(cte.combo,9,8) as bigint))
    ,combo3=sum(cast(substring(cte.combo,17,8) as bigint))
    ,combo4=sum(cast(substring(cte.combo,25,8) as bigint))
  from #worker_rate p
  inner join #lookup cte on cte.rate_code = p.rate_code
  group by p.worker_id
) q
order by 1
drop table #lookup

We now need to decode this in blocks of 8 bytes as bigint types so we can quickly rank the binary field. I’ve cast one of them to a binary so we can see the hexadecimal value with the individual bits set.

The STUFF and SUBSTRING functions are used to operate on the binary type; they are very fast at doing this but are limited in how they can be used. This is a fixed mechanism that could be extended to cope with any number of bytes, but I’m not going to go into that in this article.

Tidy up after yourself

The full, final code once I’ve cleaned it up is here.

create table #worker_rate (
  worker_id int
  ,rate_code nvarchar(10) collate database_default
  primary key (rate_code,worker_id)
)
declare @rows int = 10000 --How many rows of data to create
--Create test data
;with tt(x) as (
  select row_number() over (order by (select null)) from
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100000 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000000 (n),
    (values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000000 (n)
)
insert into #worker_rate (worker_id, rate_code)
select top(@rows) tt.x, rc.code
from tt
inner join (values ('a'),('b'),('c')) rc(code)
  on tt.x % 4 <> 0 
  or rc.code <> 'c'
--Original example
;with c as (
  select distinct w.worker_id, r.combo
  from #worker_rate w
  cross apply (
    select combo = stuff((select ',' + r.rate_code
    from #worker_rate r
    where w.worker_id = r.worker_id
    order by r.rate_code
    for xml path('')),1,1,'')
  ) r
)
select grp = dense_rank() over (order by c.combo), c.worker_id, c.combo
from c
order by 1
--Optimal example
create table #lookup (
  rate_code nvarchar(10) collate database_default primary key
  ,combo smallint
)
insert into #lookup (rate_code,combo)
select q.rate_code, power(2, q.combo-1) 
from (
  select distinct rate_code, combo=dense_rank() over (order by rate_code) 
  from #worker_rate
) q
select grp=dense_rank() over (order by combo), worker_id, combo
from (
  select p.worker_id, combo=sum(l.combo)
  from #worker_rate p
  inner join #lookup l on l.rate_code = p.rate_code
  group by p.worker_id
) q
order by 1
drop table #lookup
drop table #worker_rate

Summary

Although this example was a very small, self-contained situation, it only became apparent that multiple mechanisms would be required to get a sufficiently optimal solution after delving deeper and looking at exactly what was going on in the Actual Execution Plan. By breaking the problem up into smaller steps we were able to produce a clean final solution using readily available techniques. This practice of “Divide and Conquer” can be applied iteratively to hugely complex situations to avoid being overloaded with unnecessary detail. Knowing where to divide is the part you need to perfect, and hopefully I’ve given you an understanding of how to begin doing that on complex T-SQL.

Follow me on Twitter: @csqls

Resources

Actual Execution Plan screenshots courtesy of Consequential Solutions Ltd SQLPRep

