Solving Complex T-SQL Problems, Step-By-Step

What should you do if your first, most intuitive solution to a problem ends up scanning the data more than is necessary, resulting in poor performance? Have you missed a new SQL Server feature that can remove inefficiency from your technique? Alternatively, do you need a little help, and some lateral thinking, to open the path to a different approach? Sometimes, the answer is "both".

Repeatedly on the SQL Server forums, I see people struggling to solve common, complex business problems in T-SQL. More to the point, I see them struggling to solve them in a way that performs well and scales gracefully to large numbers of rows. My systematic approach to writing ‘difficult’ T-SQL is always the same:

  1. Examine the data set carefully, and note down any particular characteristics, especially the appearance of NULL values, the use of dates, and so on.
  2. Make sure I fully understand the expected results and validate them, manually.
  3. Break the problem down into a series of simple steps, write the T-SQL, view the results, adjust if necessary, and move on to the next step.

The key at each step is to understand and accept that the first, most intuitive solution is unlikely to be the fastest or most scalable. As I explained in my previous series, Writing Efficient SQL, we need to look for opportunities to use set-based techniques that will help avoid non-scalable row-by-row processing, work with the smallest data set for any expensive calculations, scan the data as few times as possible, and so on. If my solution still fails to meet my expectations for performance, I try to “pair” with a co-worker, or ask advice from a T-SQL guru in the community, to help me find a better solution.

I recently helped solve a business problem that illustrates this process, and these ideas, perfectly. Based on data from a time clock, the challenge was to calculate, from a series of entry and exit times for employees, the amount of time between each recorded entry and exit, for each person. I present first a solution that will work on SQL Server 2008 and SQL Server 2008 R2, but has to scan the data a number of times to arrive at the result. Since that solution was not acceptable, I enlisted the help of a T-SQL guru, Peter Larsson (known as “SwePeso” or “Peso” on the various community forums), who changed the approach to the problem and found a better solution.

Finally, I’ll demonstrate how new analytic functions in SQL Server 2012, LAG and LEAD, make it possible to find a solution that performs just as well, and requires a little less lateral thinking.

Understanding the Business Requirements and Data

Listing 1 provides the code to create the StaffHours table and populate it with data that fairly represents the time clock data set I worked with. It uses row constructors, introduced with SQL Server 2008, and so the code will not run in any earlier versions (see the code download for the 2005 version).

Listing 1: Creating and Populating the StaffHours table
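
As a minimal sketch, the table and sample data look something like the following. The column names follow those used throughout the article; the data types, and the two intermediate Exit times (10:30 and 17:30), are illustrative assumptions rather than the exact contents of the original listing.

-- Sketch of the StaffHours table; the exact data types may differ from the original
CREATE TABLE dbo.StaffHours
(
    StaffMember   CHAR(1)    NOT NULL,
    EventDateTime DATETIME   NOT NULL,
    EventType     VARCHAR(5) NOT NULL   -- 'Enter' or 'Exit'
);

-- Sample data inserted with row constructors (SQL Server 2008 and later)
INSERT INTO dbo.StaffHours (StaffMember, EventDateTime, EventType)
VALUES ('A', '2013-01-07T08:00:00', 'Enter'),
       ('A', '2013-01-07T10:30:00', 'Exit'),    -- illustrative intermediate Exit
       ('A', '2013-01-07T11:35:00', 'Exit'),
       ('A', '2013-01-07T12:45:00', 'Enter'),
       ('A', '2013-01-07T17:30:00', 'Exit'),    -- illustrative intermediate Exit
       ('A', '2013-01-08T01:00:00', 'Exit'),
       ('B', '2013-01-07T08:10:00', 'Enter'),
       ('B', '2013-01-07T16:45:00', 'Exit'),
       ('C', '2013-01-07T08:33:00', 'Enter'),
       ('C', '2013-01-07T17:35:00', 'Exit');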

Figure 1 shows the “time clock” data set, essentially a series of entry and exit times from each member of staff.


Figure 1: The original time clock data set

The requirement is to calculate the amount of time in hours, minutes and seconds between each Enter and Exit, for each staff member. The first point to note is that while there is usually a matching Exit row for each Enter, sometimes there is an Enter with no Exit or an Exit with no Enter. Here are the business rules:

  • If there are consecutive Enter rows, use the first one.
  • If there are consecutive Exit rows, use the last one.
  • Ignore any Exit rows with no matching Enter row (i.e. if the first row for an employee is an Exit, then ignore it).
  • Ignore any Enter rows with no matching Exit row (i.e. if the last row for an employee is an Enter, then ignore it).
  • The Exit for an Enter could be a subsequent day.

We can validate our expected results manually. Employee A starts a work session at 08:00 on Jan 7. Two Exit times follow and, according to our business rules, we use the last one (11:35), for a work session duration of 3 hours 35 minutes. Employee A enters again at 12:45 on the same day and again we have two Exit entries, the latter falling on the next day, giving a duration of 12:15:00. In a similar fashion, we can hand-calculate the full result set, as shown in Figure 2.

StaffMember   EnterDateTime         ExitDateTime          WorkTime
A             2013-01-07 08:00:00   2013-01-07 11:35:00   03:35:00
A             2013-01-07 12:45:00   2013-01-08 01:00:00   12:15:00
B             2013-01-07 08:10:00   2013-01-07 16:45:00   08:35:00
C             2013-01-07 08:33:00   2013-01-07 17:35:00   09:02:00

Figure 2: Expected results

A Pre-SQL Server 2012 Solution: pairing events using row numbers

Listing 2 shows my first solution, which uses row numbers, and then self-joins to match up rows based on those row numbers. It takes a very systematic approach, using five levels of CTEs, as follows:

  • Level1 – partition the data by StaffMember and apply ROW_NUMBER
  • Level2 – perform two self-joins to find, for each row, the value of the EventType for the previous and then the subsequent rows, in each partition
  • Level3 – filter out unneeded rows and reapply row numbers
  • Level4 – perform another self-join to pair up ENTER and EXIT rows
  • Level5 – use DATEDIFF to calculate work session durations for each employee

Listing 2: Calculating work session duration using row numbers, CTEs and self-joins
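
The following sketch captures that structure, assuming the business-rule filters described above; the level names match the bullets, while aliases such as PrevEventType and NextEventType are names I have assumed, so the original listing may differ in detail.

WITH Level1 AS
(   -- number the rows within each StaffMember partition
    SELECT StaffMember, EventDateTime, EventType,
           ROW_NUMBER() OVER (PARTITION BY StaffMember
                              ORDER BY EventDateTime) AS RowNum
    FROM dbo.StaffHours
),
Level2 AS
(   -- two self-joins to pick up the previous and next event types
    SELECT L.StaffMember, L.EventDateTime, L.EventType,
           COALESCE(Prev.EventType, 'N/A') AS PrevEventType,
           COALESCE(Nxt.EventType, 'N/A')  AS NextEventType
    FROM Level1 AS L
    LEFT JOIN Level1 AS Prev
           ON Prev.StaffMember = L.StaffMember AND Prev.RowNum = L.RowNum - 1
    LEFT JOIN Level1 AS Nxt
           ON Nxt.StaffMember  = L.StaffMember AND Nxt.RowNum  = L.RowNum + 1
),
Level3 AS
(   -- keep the first of consecutive Enters and the last of consecutive Exits,
    -- then renumber the surviving rows
    SELECT StaffMember, EventDateTime, EventType,
           ROW_NUMBER() OVER (PARTITION BY StaffMember
                              ORDER BY EventDateTime) AS RowNum
    FROM Level2
    WHERE (EventType = 'Enter' AND PrevEventType <> 'Enter')
       OR (EventType = 'Exit'  AND NextEventType <> 'Exit')
),
Level4 AS
(   -- self-join to pair each Enter with the Exit on the following row
    SELECT E.StaffMember,
           E.EventDateTime AS EnterDateTime,
           X.EventDateTime AS ExitDateTime
    FROM Level3 AS E
    JOIN Level3 AS X
      ON X.StaffMember = E.StaffMember AND X.RowNum = E.RowNum + 1
    WHERE E.EventType = 'Enter' AND X.EventType = 'Exit'
),
Level5 AS
(   -- calculate the duration of each work session
    SELECT StaffMember, EnterDateTime, ExitDateTime,
           CONVERT(char(8),
                   DATEADD(SECOND,
                           DATEDIFF(SECOND, EnterDateTime, ExitDateTime), 0),
                   8) AS WorkTime
    FROM Level4
)
SELECT StaffMember, EnterDateTime, ExitDateTime, WorkTime
FROM Level5
ORDER BY StaffMember, EnterDateTime;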

I spent quite a bit of time working on this query, taking it a step at a time, checking the results and moving on to the next level. However, as soon as I saw the STATISTICS IO information, I knew there was a problem.

The query scanned the table six times! Of course, with a small amount of data, this was not a problem, but the solution will never scale. Figure 3 visualizes the problem.


Figure 3: Why we need six scans to arrive at the result

While my solution is intuitive, I’d fallen into the trap of thinking about what I needed to do with each row, instead of considering what I could do by looking at the data as a set. This skill does not come easily to most of us, especially those with a background in programming that teaches an iterative approach.

A Better Pre-SQL Server 2012 Approach: Islands and Gaps

When I wrote a series of articles on the SQL Server Speed Phreak competitions (see https://www.simple-talk.com/author/kathi-kellenberger/), I discussed the compromise between solutions that were simple to understand, and “fast enough”, and the absolute fastest solutions possible. I spent time trying to explain how the latter worked, for the benefit of those with only medium-sized SQL brains.

On several occasions, when dissecting the fastest solution, the author was Peter Larsson. I turned to him for help on this problem, and he devised a very elegant solution that returns the answer in just one table scan instead of six. However, even Peter didn’t conjure his best solution straight off the bat. His first attempt, shown in Listing 3, reduces the scan count from six to three.

Listing 3: Peter’s First Solution Eliminates One Self-Join
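
In outline, the approach looks something like the sketch below. The cteGrouped name, the grp column, and the HAVING COUNT(*) = 2 check come from the explanation that follows; cteSource and the prev/next aliases are names I have assumed, and Peter’s actual listing (included in the code download) may differ in detail.

WITH cteSource AS
(
    SELECT StaffMember, EventDateTime, EventType,
           ROW_NUMBER() OVER (PARTITION BY StaffMember
                              ORDER BY EventDateTime) AS RowNum
    FROM dbo.StaffHours
),
cteGrouped AS
(   -- prev/next event types via two self-joins; the simplified WHERE keeps
    -- the first of consecutive Enters and the last of consecutive Exits,
    -- and the surviving rows are renumbered as grp (0, 1, 2, ...)
    SELECT S.StaffMember, S.EventDateTime, S.EventType,
           ROW_NUMBER() OVER (PARTITION BY S.StaffMember
                              ORDER BY S.EventDateTime) - 1 AS grp
    FROM cteSource AS S
    LEFT JOIN cteSource AS Prev
           ON Prev.StaffMember = S.StaffMember AND Prev.RowNum = S.RowNum - 1
    LEFT JOIN cteSource AS Nxt
           ON Nxt.StaffMember  = S.StaffMember AND Nxt.RowNum  = S.RowNum + 1
    WHERE (S.EventType = 'Enter' AND COALESCE(Prev.EventType, 'Exit') = 'Exit')
       OR (S.EventType = 'Exit'  AND COALESCE(Nxt.EventType, 'Enter') = 'Enter')
)
SELECT StaffMember,
       MIN(EventDateTime) AS EnterDateTime,   -- the Enter time of the pair
       MAX(EventDateTime) AS ExitDateTime,    -- the Exit time of the pair
       CONVERT(char(8),
               DATEADD(SECOND,
                       DATEDIFF(SECOND, MIN(EventDateTime), MAX(EventDateTime)), 0),
               8) AS WorkTime
FROM cteGrouped
GROUP BY StaffMember, grp / 2
HAVING COUNT(*) = 2
ORDER BY StaffMember, EnterDateTime;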

Peter’s initial solution contains the two self-joins to determine the previous and next event types just as my solution does, but he manages to eliminate the final self-join (level 4 in my solution), and with it half the scans.

He achieved this by creating groups of rows representing each Enter and Exit pair, using row numbers generated in the cteGrouped level. At this level, he also used a simplified WHERE clause to eliminate most of the invalid rows, keeping every Enter row that had a previous Exit row and every Exit row with a subsequent Enter row. Figure 4 shows how the data looks at the cteGrouped stage.


Figure 4: The values from cteGrouped

Notice that the grp values for all of the remaining Enter rows are even, and those for the remaining Exit rows are odd. Integer division throws away remainders, so when the grp values are divided by two we end up with group 0 for the first Enter/Exit pair, group 1 for the second Enter/Exit pair, and so on. This observation is important because it means we can now find each Enter/Exit pair using that grouping in the final step. The HAVING COUNT(*) = 2 clause ensures that we are working with pairs of rows and not any orphaned Exit events at the beginning, or Enter events at the end.

The other point to remember, at this stage, is that the Enter date and time is the MINIMUM for the group, and the Exit date and time is the MAXIMUM for the group. Peter takes advantage of this in the final step to match up the Enter and Exit times without doing another self-join. Remember that the final self-join, had it been performed, would have doubled the work to six scans of the data.

In the final step, the query performs the grouping as well as the calculations. Peter uses an interesting technique to do the calculation.

The DATEDIFF calculates the seconds between the MIN(EventDateTime), the Enter time, and the MAX(EventDateTime), the Exit time. He then adds those seconds back to time zero (00:00:00). He uses style 8 (hh:mm:ss) to format the value returned. Often, people use the CAST function in preference to the CONVERT function, but the benefit of the latter is that you can specify a style that formats the date.
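
For example, a work session of 44,100 seconds (12 hours 15 minutes) added back to time zero and converted with style 8 comes out as a formatted time string:

SELECT CONVERT(char(8), DATEADD(SECOND, 44100, 0), 8);   -- returns 12:15:00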

However, not happy with three scans of the table, Peter tried again to improve the performance and arrived at a masterful “Islands and Gaps” solution that eliminated the initial self-joins, thus reducing the table scans to just one!

Listing 4: The Islands and Gaps Pre-2012 Solution
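
The sketch below shows the shape of the solution; cteIslands, ByTime, ByType and GRP follow the explanation in the next few paragraphs, cteCollapsed and cteGrouped are names I have assumed, and this simplified version omits the edge-case handling mentioned at the end of this section.

WITH cteIslands AS
(   -- one scan of the table: number the rows two ways and subtract
    SELECT StaffMember, EventDateTime, EventType,
           ROW_NUMBER() OVER (ORDER BY StaffMember, EventDateTime)
         - ROW_NUMBER() OVER (ORDER BY EventType, StaffMember, EventDateTime) AS GRP
    FROM dbo.StaffHours
),
cteCollapsed AS
(   -- collapse each island to a single row: the first of consecutive Enters,
    -- the last of consecutive Exits
    SELECT StaffMember, EventType,
           CASE WHEN EventType = 'Enter'
                THEN MIN(EventDateTime)
                ELSE MAX(EventDateTime) END AS EventDateTime
    FROM cteIslands
    GROUP BY StaffMember, EventType, GRP
),
cteGrouped AS
(   -- renumber the surviving rows so that each Enter/Exit pair shares grp / 2
    SELECT StaffMember, EventDateTime, EventType,
           ROW_NUMBER() OVER (PARTITION BY StaffMember
                              ORDER BY EventDateTime) - 1 AS grp
    FROM cteCollapsed
)
SELECT StaffMember,
       MIN(EventDateTime) AS EnterDateTime,
       MAX(EventDateTime) AS ExitDateTime,
       CONVERT(char(8),
               DATEADD(SECOND,
                       DATEDIFF(SECOND, MIN(EventDateTime), MAX(EventDateTime)), 0),
               8) AS WorkTime
FROM cteGrouped
GROUP BY StaffMember, grp / 2
HAVING COUNT(*) = 2
ORDER BY StaffMember, EnterDateTime;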

The first CTE, cteIslands, applies row numbers to the data sorted in two different ways and subtracts the two values. In order to see what is going on more easily, let’s extract and execute the query from cteIslands and look at the results it returns.


Listing 5: Viewing the Islands
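
Assuming the ordering described in the next paragraph, the extracted query looks something like this:

SELECT StaffMember, EventDateTime, EventType,
       ROW_NUMBER() OVER (ORDER BY StaffMember, EventDateTime)            AS ByTime,
       ROW_NUMBER() OVER (ORDER BY EventType, StaffMember, EventDateTime) AS ByType,
       ROW_NUMBER() OVER (ORDER BY StaffMember, EventDateTime)
     - ROW_NUMBER() OVER (ORDER BY EventType, StaffMember, EventDateTime) AS GRP
FROM dbo.StaffHours
ORDER BY StaffMember, EventDateTime;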

Listing 5 applies the ByTime row numbers in order of StaffMember, and then EventDateTime. It applies the ByType row numbers in order of EventType, StaffMember and then EventDateTime. If an Exit event follows another Exit event, they will end up in the same GRP (ByTime - ByType). The same goes for Enter events.

If the data were perfect, an Exit event would follow every Enter event. If that had been the case, the ByTime values would be odd for the Enter rows and even for the Exit rows, so the difference between the ByTime and ByType values would increase by one with each successive Enter row and by one with each successive Exit row, giving us no duplicate GRP numbers. Figure 6 shows how the results of Listing 5 would look assuming ideal data.


Figure 6: Results of Listing 5, if the data were perfect

In our imperfect data, two adjacent Exit events increase both ByTime and ByType by one (Figure 5), so the difference between ByTime and ByType is the same for the two rows, and those two adjacent rows become a group (an “island”). This enables us to use MIN for Enter rows and MAX for Exit rows to find the valid row for each GRP.

The remainder of this solution uses the same techniques as Listing 3 to pair Enter and Exit rows, and to calculate the difference in time.

Note that Peter submitted a further iteration to his solution, which uses the same basic algorithm but with refinements, including small bug fixes for two edge conditions. It is available as part of the code download at the bottom of the article.

SQL Server 2012-only Solution: New Analytic Functions

While this solution performs far better than my original attempt and will perform well on all SQL Server versions, up to and including SQL Server 2012, a solution that is easier to develop, and has equal performance, takes advantage of some new functions available in SQL Server 2012.

SQL Server 2012 brings us several new functions, three of which will be very helpful here: LAG, LEAD, and TIMEFROMPARTS. LAG and LEAD are two new analytic window functions. TIMEFROMPARTS is one of several new functions used to build a particular date or time value from its parts.

In our previous solution, we created partition “windows” based on StaffMember, and used the ROW_NUMBER function to number each row in each partition sequentially (Listing 2). However, we then had to perform two extra table scans to get the event type values for the previous and next rows. The new LAG and LEAD functions allow us to retrieve these event type values from each “window” without needing to apply row numbers first. The LAG function returns the previous value from a particular column without rescanning the table. With LEAD, we can find the next value. Instead of rescanning, SQL Server performs the operation on data already in memory. This means fewer scans of the data and fewer logical reads.

The LAG and LEAD functions each accept three parameters. The first parameter specifies the column whose value we wish to retrieve. The second parameter, which is optional, is the number of rows to go back or forward, with a default of one. The third parameter, also optional, defines a default value to return in place of NULL.

Listing 6 shows the code using these functions to return the previous and next values of the event type. Notice that we also return N/A instead of NULL, which makes the COALESCE logic we previously needed for this task redundant.


Listing 6: Using LAG and LEAD to find the previous and next event type values
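
A sketch of this step, where PrevEventType and NextEventType are aliases I have assumed:

SELECT StaffMember, EventDateTime, EventType,
       LAG(EventType, 1, 'N/A')
           OVER (PARTITION BY StaffMember ORDER BY EventDateTime) AS PrevEventType,
       LEAD(EventType, 1, 'N/A')
           OVER (PARTITION BY StaffMember ORDER BY EventDateTime) AS NextEventType
FROM dbo.StaffHours
ORDER BY StaffMember, EventDateTime;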

Critically, we’ve achieved this with only one scan of the table, instead of three. The next step is to filter out the unneeded rows. SQL Server processes the WHERE clause before it evaluates the LAG and LEAD functions, so we’ll have to use the CTE strategy again. Listing 7 shows the query for this step.


Listing 7: Filter out unneeded rows
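
A sketch of this step, with the LAG/LEAD query wrapped in a CTE (cteLagLead is a name I have assumed) so that the WHERE clause can reference the previous and next event types:

WITH cteLagLead AS
(
    SELECT StaffMember, EventDateTime, EventType,
           LAG(EventType, 1, 'N/A')
               OVER (PARTITION BY StaffMember ORDER BY EventDateTime) AS PrevEventType,
           LEAD(EventType, 1, 'N/A')
               OVER (PARTITION BY StaffMember ORDER BY EventDateTime) AS NextEventType
    FROM dbo.StaffHours
)
SELECT StaffMember, EventDateTime, EventType
FROM cteLagLead
WHERE (EventType = 'Enter' AND PrevEventType <> 'Enter')   -- first of consecutive Enters
   OR (EventType = 'Exit'  AND NextEventType <> 'Exit')    -- last of consecutive Exits
ORDER BY StaffMember, EventDateTime;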

Crucially, we are still at one table scan. We now have the rows we need, but we can’t do the time calculations because the EventDateTime values are on different rows. We can use the LEAD function again to access the EventDateTime value on the next row. Because we can nest functions, we can use the results of the LEAD function inside the DATEDIFF function to calculate seconds. Listing 8 shows the additional code for this step.


Listing 8: Calculate work session duration in seconds
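
A sketch of this step, repeating the earlier CTEs so that it runs on its own (cteFiltered is a name I have assumed; Seconds is the value the final step converts):

WITH cteLagLead AS
(
    SELECT StaffMember, EventDateTime, EventType,
           LAG(EventType, 1, 'N/A')
               OVER (PARTITION BY StaffMember ORDER BY EventDateTime) AS PrevEventType,
           LEAD(EventType, 1, 'N/A')
               OVER (PARTITION BY StaffMember ORDER BY EventDateTime) AS NextEventType
    FROM dbo.StaffHours
),
cteFiltered AS
(
    SELECT StaffMember, EventDateTime, EventType
    FROM cteLagLead
    WHERE (EventType = 'Enter' AND PrevEventType <> 'Enter')
       OR (EventType = 'Exit'  AND NextEventType <> 'Exit')
)
SELECT StaffMember, EventDateTime, EventType,
       -- LEAD fetches the next row's EventDateTime, nested inside DATEDIFF
       DATEDIFF(SECOND, EventDateTime,
                LEAD(EventDateTime)
                    OVER (PARTITION BY StaffMember ORDER BY EventDateTime)) AS Seconds
FROM cteFiltered
ORDER BY StaffMember, EventDateTime;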

The final step is to turn Seconds into hours, minutes, and seconds. We can use the new TIMEFROMPARTS function to simplify the formula. Each date or time data type in SQL Server 2012 has a corresponding *FROMPARTS function; these allow us to build a date or time value from its component parts, with no casting or appending. Finally, we filter to return just the Enter rows. Listing 9 shows the final 2012-only code.


Listing 9: The final SQL 2012-only code
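
Putting the steps together, a sketch of the final query looks like this; the check for a missing next row is my addition to cover the rule about a trailing Enter with no matching Exit, and the published listing may handle that differently.

WITH cteLagLead AS
(
    SELECT StaffMember, EventDateTime, EventType,
           LAG(EventType, 1, 'N/A')
               OVER (PARTITION BY StaffMember ORDER BY EventDateTime) AS PrevEventType,
           LEAD(EventType, 1, 'N/A')
               OVER (PARTITION BY StaffMember ORDER BY EventDateTime) AS NextEventType
    FROM dbo.StaffHours
),
cteFiltered AS
(
    SELECT StaffMember, EventDateTime, EventType
    FROM cteLagLead
    WHERE (EventType = 'Enter' AND PrevEventType <> 'Enter')
       OR (EventType = 'Exit'  AND NextEventType <> 'Exit')
),
cteSeconds AS
(
    SELECT StaffMember, EventType,
           EventDateTime AS EnterDateTime,
           LEAD(EventDateTime)
               OVER (PARTITION BY StaffMember ORDER BY EventDateTime) AS ExitDateTime,
           DATEDIFF(SECOND, EventDateTime,
                    LEAD(EventDateTime)
                        OVER (PARTITION BY StaffMember ORDER BY EventDateTime)) AS Seconds
    FROM cteFiltered
)
SELECT StaffMember, EnterDateTime, ExitDateTime,
       -- build a time value directly from its parts: hours, minutes, seconds
       TIMEFROMPARTS(Seconds / 3600, Seconds % 3600 / 60, Seconds % 60, 0, 0) AS WorkTime
FROM cteSeconds
WHERE EventType = 'Enter'
  AND ExitDateTime IS NOT NULL   -- ignore a trailing Enter with no matching Exit
ORDER BY StaffMember, EnterDateTime;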

The results are just as expected, matching Figure 2. Thanks to the new functions in SQL Server 2012, the query is simpler and easier to write, and it is much more efficient, scanning the data just once. If you are working with SQL Server 2012, keep the new functions in mind to help you solve those tricky queries. If you’re interested in seeing Peter’s SQL 2012-only solution, which is similar to mine but with some refinements, it’s available in the code download file.

Summary

Writing T-SQL queries to solve complex problems is often difficult. Frequently, I will assist others, either on forums or by direct request, in solving those problems. My methodology involves breaking the problem down and solving it systematically. At each step, I examine the results, adjust if necessary, and move on to the next step. I don’t expect to get the right answer right off the bat; writing a complex query takes time. I hope that the individual asking for help will not just take the answer, but will learn something and be better able to handle the next challenging query that comes along. The same applies to me, of course. Peter managed to find a more efficient pre-2012 solution, and I can take what I learned from that and apply it to future problems.

Luckily for SQL Server developers, T-SQL functionality has improved immensely starting with the introduction of the ranking functions and the OVER clause in 2005, and continuing with the introduction of very useful analytic functions, such as LAG and LEAD, in SQL Server 2012. If you haven’t kept up with these improvements, you owe it to yourself to spend some time learning about them. I found that once I started working with such functions, and Common Table Expressions, I relied on them more and more.

In particular, if you are fortunate enough to run SQL Server 2012, be sure to learn about the new analytic and other functions. The improved performance in many queries may be the tipping point to get your organization to upgrade!