Blog Post

The Best Tool for Comparing Whether Two T-SQL Queries Produce Identical Results

,

There is one software tool that should be in every software developer’s tool chest.  That is knowledge of the platform on which you are developing.  With all the hype that surrounds every new software tool that comes out, it is important to recognize the tools you already have at hand, and the ones that may be hidden gems within your development platform of choice.

Let’s backtrack for a moment into how a T-SQL query is developed.  Remember my prior blog: Make it Work, Make it Fast, Make it Pretty?  That blog (based on the sagacious and immortal words of SQL MVP Jeff Moden) suggests there’s three steps.  In reality there are more, but for the time being let’s focus on two of the three mentioned in that statement.

  • Make it Work – you understand the requirements, you understand the physical and logical models in which the data is stored, you write a query that returns hundreds of rows of results and you painstakingly check that the results indeed match the requirements, and are as you intended, based on the underlying data stored in the tables.
  • Make it Fast – once your query is returning the desired results, at some point in time you may be confronted by the dreaded statement by somebody that “it isn’t fast enough.” Time to break out those performance tuning tools.

Let’s hope that the dreaded “it isn’t fast enough” pronouncement doesn’t come after the query is already in production, but that is probably the most likely scenario.

While I’m into doing the Make it Work part, I have a tendency to rely on known, high-performance code patterns to develop a query which, under most circumstances should probably be of sufficient performance.  Maybe I’m having an off day, or maybe my underlying data changes in such a way that the SQL Optimizer simply starts making bad choices for choosing an execution plan (see: Lessons Learned from a Poor-performing View).  Maybe the darned thing is very, very complex and you’re pushing up against tight delivery timelines, so you just need to get it done.  Whatever the case, there will come a time where you’ll need to rewrite a query to improve its performance.

This is where it might get a little tricky.  Suppose that the working query is highly complex, perhaps complex to the point that you’re afraid to touch it for fear of breaking it.  I’ve seen that happen a lot in cases where you weren’t the original author of the query.  Perhaps you don’t even know all of the outlying test cases that were run through it to prove that it worked correctly in the first place.  Let’s face it, documentation isn’t always all it’s cracked up to be, and easily could be lost with time anyway.

You’ll need a tool that allows you to compare the results of the working query, against any alternative(s) that you may need to construct to improve on performance.

Enter the T-SQL EXCEPT Set Operator

Well, first I suppose we need to have a couple of SQL queries to compare.  Let’s draw upon an old article of mine kindly published by the fine but eccentric editor of Simple Talk entitled: The SQL of Gaps and Islands in Sequences.

-- A solution of my own design for Gaps
WITH Islands AS 
(
    SELECT ID, StartSeqNo=MIN(SeqNo) - 1, EndSeqNo=MAX(SeqNo) + 1
    FROM (
        SELECT ID, SeqNo
            ,rn=SeqNo-ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)
        FROM dbo.GapsIslands) a
     GROUP BY ID, rn
)
SELECT ID, StartSeqNo=MIN(SeqNo), EndSeqNo=MAX(SeqNo)
FROM (
    SELECT ID, SeqNo
        ,m=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)/2
    FROM Islands
    CROSS APPLY (VALUES (StartSeqNo),(EndSeqNo)) a(SeqNo)) a
GROUP BY ID, m
HAVING COUNT(*) = 2
ORDER BY ID, StartSeqNo;
-- A Gaps Solution from SQL MVP Deep Dives
SELECT ID, StartSeqNo=cur + 1, EndSeqNo=nxt - 1
FROM (
    SELECT ID, cur=SeqNo, nxt=(
        SELECT MIN(B.SeqNo)
        FROM dbo.GapsIslands AS B
        WHERE B.ID = A.ID AND B.SeqNo > A.SeqNo)
    FROM dbo.GapsIslands AS A) AS D
WHERE nxt - cur > 1;

Both of these solutions find gaps in a table containing a column of sequence numbers.  You won’t really need to run these two queries to follow along with the discussion, but if you want to try some of the things we’ll discuss that article has all the scripts you’ll need in its resources section (at the top).

Suppose that each of these queries returns about 10,000 rows give or take (they don’t but let’s pretend that they do).  To verify the veracity of the statement “they produce identical results” we just need to do a few steps:

  • Run both of the queries exactly as shown above and check the row counts of each. If they both return 10,650 rows we’re good. If one returns 10,650 rows and the other returns 10,633 then you know they’re not producing identical results (duh!).
  • Comment out the ORDER BY in the first query and add EXCEPT between the two. That should now run because each query produces exactly three columns of results.
  • Note that the first query contains a Common Table Expression (CTE). If the second query does too, you’ll need to move that CTE up (and possibly rename it) to make it cascaded with the first. If both queries use the same CTE(s), simply discard the one for the second query.

Our case should now look like this:

-- A solution of my own design for Gaps
WITH Islands AS 
(
    SELECT ID, StartSeqNo=MIN(SeqNo) - 1, EndSeqNo=MAX(SeqNo) + 1
    FROM (
        SELECT ID, SeqNo
            ,rn=SeqNo-ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)
        FROM dbo.GapsIslands) a
     GROUP BY ID, rn
)
SELECT ID, StartSeqNo=MIN(SeqNo), EndSeqNo=MAX(SeqNo)
FROM (
    SELECT ID, SeqNo
        ,m=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)/2
    FROM Islands
    CROSS APPLY (VALUES (StartSeqNo),(EndSeqNo)) a(SeqNo)) a
GROUP BY ID, m
HAVING COUNT(*) = 2
EXCEPT
-- A Gaps Solution from SQL MVP Deep Dives
SELECT ID, StartSeqNo=cur + 1, EndSeqNo=nxt - 1
FROM (
    SELECT ID, cur=SeqNo, nxt=(
        SELECT MIN(B.SeqNo)
        FROM dbo.GapsIslands AS B
        WHERE B.ID = A.ID AND B.SeqNo > A.SeqNo)
    FROM dbo.GapsIslands AS A) AS D
WHERE nxt - cur > 1;

If the two queries produce identical results, when you run the above it should produce zero rows in the results set!  Then all you’re left with is to verify that the newer query runs faster, and that’s something I covered in The One Million Row Test Harness.

If some rows appear then there is something wrong and the queries are not producing identical results sets.  You could analyze the results by switching the order of the queries around the EXCEPT.

Or you could just forget this tip about using EXCEPT and painstakingly analyze each of the 10,650 rows produced by each result set.  But that is something that I’m just a bit too lazy to do!

Then again, maybe it is precisely the tool you need to get over your queryophobia and dive head first into that complex query written by some other SQL geek, without worrying too much about verifying that the results are correct.

If you liked this neat little T-SQL trick then maybe you should follow me on Twitter: @DwainCSQL because who knows what other cool stuff I might come up with!

© Copyright Dwain Camps 21 May 2015.  All rights reserved.

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating