Divide 'n' Conquer techniques in T-SQL query writing

  • Hi Folks,
    It would be a great help to me if you could suggest some good ideas or blogs for improving our skills in applying Divide 'n' Conquer techniques to SQL Server query writing.

    So far I have only found the old SSC article below, written by Steve.
    http://www.sqlservercentral.com/articles/Miscellaneous/batching/151/

  • To Steve's point in that article, most of the time set-based code will perform better than trying to "divide and conquer", but he was dealing with a trigger, which meant requirements were being added in the background as he tried to do his updates.

    In general, divide and conquer your query if it helps you logically think it through, but many times the end result may be achievable more efficiently by dealing with it as a whole.

    -------------------------------------------------------------------------------------------------------------------------------------
    Please follow Best Practices For Posting On Forums to receive quicker and higher quality responses

  • By the same token, a lot of people think that "set-based" means "all in one query".  While set-based code frequently does result in a single query, it's not always the best thing to do for performance.

    A great recent example of that for me is parsing execution plans to determine which statistics were used in every database for each plan, with the ultimate goal of determining which statistics are unused and can be dropped, simplifying and seriously shortening the semi-weekly statistics rebuild on large, very wide tables.

    The original single-query, set-based code took over 4 hours to execute on just 35,000 initial query plans.

    After an initial pre-process was added, where just the encompassing RECOMPILE tag was extracted and stored in a temp table, the duration dropped to 8 minutes, and processing 75,000 plans only increased that to 10 minutes.

    After the first run of all the available plans, we had a semi-permanent "temp" table, in a special database built for the purpose, of plans that had already been analyzed for which stats had been used.  Since a plan never changes, it was safe not to reprocess the plans that had already been processed.  The total time for each run dropped a whole lot further, to an average of about 6 seconds, with the occasional spike (due to new plans from heavy batch processing runs) to less than 3 minutes.
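
    Just to sketch the shape of that "only process what's new" part (the work database and table below are made up for illustration; this isn't the actual code from that project):

    -- A minimal sketch, assuming a work database/table built for the purpose.
    -- Only plans that haven't already been captured get picked up and processed.
    INSERT INTO PlanAnalysis.dbo.ProcessedPlan (plan_handle, query_plan)
    SELECT cp.plan_handle, qp.query_plan
      FROM sys.dm_exec_cached_plans cp
     CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
     WHERE NOT EXISTS (SELECT 1
                         FROM PlanAnalysis.dbo.ProcessedPlan pp
                        WHERE pp.plan_handle = cp.plan_handle);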

    Compare all of that to the original 4 hours and understand that "Divide'n'Conquer" was the key in this process.

    I'm frequently given the task of improving code for performance.  Frequently, the code is an "all in one" query.  One of my greatest success stories was several years ago when, as a recent new hire, I was asked to stand up a new server and clone a copy of the databases.  Since that involved a fair amount of work back then, I asked "Why".  They said they did it once a year for a single year-end report that would drag the production server to its knees for more than 45 minutes and caused TempDB to grow in an insane fashion.

    I asked to see the code.

    I studied the single-query code for about 2 hours (it had about 40 tables in the joins) along with the estimated execution plan, and played around with the code a bit as a way of studying it to see what was going on.  To make a much longer story shorter, I ended up resolving two of the joins into a separate, rather tiny temp table and then replaced the joins to those two tables with joins to that single temp table.  All the other joins were left in place.  The code produced the expected result in 8 seconds flat, was virtually imperceptible when it ran according to PerfMon graphs, caused no growth of TempDB, and had no long-term locking or blocking (being blocked) at all.
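
    In shape (not the actual code, and every name below is made up), the change looked something like this:

    -- Resolve two of the ~40 joined tables into one tiny temp table first.
    -- dbo.DimA/dbo.DimB and the columns are hypothetical stand-ins.
    SELECT  a.KeyCol, a.SomeAttribute, b.OtherAttribute
      INTO  #TinyLookup
      FROM  dbo.DimA a
      JOIN  dbo.DimB b ON b.KeyCol = a.KeyCol;
    -- The big query then joins once to #TinyLookup in place of the two
    -- original joins; the other ~38 joins stay exactly as they were.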

    Once again, "Divide'n'Conquer" was the key.

    I'll also tell you that a lot of people make the mistake of believing that CTEs are separate entities and are calculated as such.  The ONLY time that's true is if the CTE has some sort of blocking operator in its overall plan to force early materialization of the results of the CTE.  CTEs are really "inline views" that work just like a VIEW works and they become a part of the larger query rather than a separate entity. 

    In other words, CTEs are great for helping with readability and understanding but they are NOT necessarily "Divide'n'Conquer" and usually are not.
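
    A quick illustration of the point (the table is hypothetical and this is just a sketch): reference a CTE twice and its definition is expanded inline twice, exactly like a view would be, so the base table gets scanned and aggregated twice.

    WITH cteTotal AS
    (
     SELECT CustomerID, Total = SUM(Amount)
       FROM dbo.Orders -- hypothetical table
      GROUP BY CustomerID
    )
     SELECT t.CustomerID, t.Total
       FROM cteTotal t
      WHERE t.Total > (SELECT AVG(Total) FROM cteTotal) -- 2nd reference = 2nd scan/aggregate
    ;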

    The bottom line is that there are two types of "Divide'n'Conquer".  One is to solve one problem at a time using CTEs or "Derived Tables" in the FROM clause (which I did a lot before the advent of CTEs).  This is "mental" Divide'n'Conquer and can make writing code a whole lot easier and faster.  The other is "physical" Divide'n'Conquer, where you dictate the order things will be processed in by removing all chance that the optimizer will do otherwise.  This is normally done using a Temp Table (some use Table Variables, which I avoid like the plague for a ton of reasons I won't cover here) or by tricking the optimizer into materializing a CTE using a blocking operator, which is much less common and can unexpectedly and silently change (well, except for the ensuing horrible performance) if someone modifies the code.
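
    And here's the "physical" version of that same sketch: materialize the aggregate exactly once in a Temp Table so the optimizer has no say in the matter.

    -- Physical Divide'n'Conquer (same hypothetical table as above).
     SELECT CustomerID, Total = SUM(Amount)
       INTO #Total
       FROM dbo.Orders
      GROUP BY CustomerID;

     SELECT t.CustomerID, t.Total
       FROM #Total t
      WHERE t.Total > (SELECT AVG(Total) FROM #Total); -- both references now hit the small temp table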

    The reason why you don't find many articles on the subject of "Divide'n'Conquer" is because 1) a whole lot of people just don't understand it or the need and 2) it's anything but a precise science.  It's so imprecise that, in some cases, both physical and mental "Divide'n'Conquer" methods can make things a whole lot worse.

    There are, however, a couple of things I look for that indicate that physical Divide'n'Conquer might be a pretty good option.

    For example, in that conversion of a 45-minute, server-crippling wad of code to a streamlined and efficient 8-second run, the indication was the use of DISTINCT and the fact that the estimated execution plan contained many arrows with rowcounts that greatly exceeded those of the tables themselves.  That was an indication that either the joins were totally on the wrong columns (they weren't, in this case), the database design sucked (this part was seriously true in this case), or the optimizer simply needed to do the DISTINCT a whole lot earlier.  I couldn't get the code to do that in the all-in-one query, so I used physical Divide'n'Conquer to prepare a separate "uniquefied/pre-aggregated" table ("pre-aggregation" is a great Divide'n'Conquer tool and I thank Peter "Peso" Larsson damned near every day for both coining the phrase and demonstrating the advantages many times over the last 15 years or so).
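
    In its simplest form (names made up), pre-aggregation looks like this:

    -- "Uniquefy"/pre-aggregate the many-row side before the big join so the
    -- DISTINCT-style work happens early on a small rowset instead of late on a huge one.
     SELECT ProductID, LastSaleDate = MAX(SaleDate), TotalQty = SUM(Qty)
       INTO #PreAgg
       FROM dbo.SaleDetail -- hypothetical table
      GROUP BY ProductID;
    -- ...and the main query then joins to #PreAgg instead of dbo.SaleDetail.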

    In the case of the XML splitting optimization, Eirikur Eiriksson suggested a quick and dirty isolation of only the RECOMPILE tag set, which is the only part of the XML that actually needed to be examined.  Of course, that went into a Temp Table to prevent the optimizer from having any choice in the matter.  Think of it the same way as avoiding "SELECT *".  The quick copy to a Temp Table provided a MUCH smaller subset of data to be worked on, which means that SQL Server didn't have to work so hard, which almost always results in serious performance gains.  He also suggested that a semi-permanent Temp Table be used to identify what had already been processed and didn't need to be processed again.  Both are great examples of Divide'n'Conquer.
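
    The fragment-isolation idea, as a sketch only (the staging table is the hypothetical one from earlier, and StmtSimple is just a stand-in for whichever encompassing tag you actually need):

    -- Pull only the fragment of interest out of the big plan XML into a Temp
    -- Table; everything downstream then parses the small fragment instead.
     SELECT p.plan_handle,
            Fragment = p.query_plan.query('//*[local-name() = "StmtSimple"]')
       INTO #PlanFragments
       FROM PlanAnalysis.dbo.ProcessedPlan p;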

    There are times when you actually and precisely DO know when to use Divide'n'Conquer methods.  For example, you have a 100 GB table with a lot of history in it and you need to delete some or most of the older stuff by date.  There are three ways to do such a thing.

    1.  Just do the delete... and watch your log file explode and most everything else get pushed out of memory.

    2.  Use a "set based loop" to delete a monthly set of rows one month at a time.

    3.  If most of the data needs to be deleted (say 70% or more), it may be (and usually is) MUCH more effective and quicker (especially if you can bring minimal logging into play) to copy the data you want to keep to an identical but differently named table, drop any FKs from the old table and rebuild them on the new one, drop the old table, and then rename the new table.

    Methods 2 and 3 are Divide'n'Conquer methods and are among the few places where it's a well-known science rather than something imprecise that needs to be pre-tested for value.  A minimal sketch of method 2 appears below.
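
    Here's that sketch, assuming a hypothetical dbo.EventHistory table with an EventDate column:

    -- Method 2: a "set based loop" that deletes one month per iteration so the
    -- log can be reused/backed up between batches instead of exploding.
    DECLARE @Month  date = '2010-01-01' -- oldest month present (assumed)
           ,@Cutoff date = '2015-01-01' -- keep everything from here forward
    ;
    WHILE @Month < @Cutoff
    BEGIN
            DELETE FROM dbo.EventHistory
             WHERE EventDate >= @Month
               AND EventDate <  DATEADD(mm,1,@Month);
            SET @Month = DATEADD(mm,1,@Month);
    END
    ;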

    As a bit of a funny observation, unless you write a single stored procedure with no functions, views or other program objects to run an entire application, you're actually using Divide'n'Conquer methods with every stored procedure, function, or view that you write. 😉  Why should it be any different within separate programming objects?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • A case where I've seen divide and conquer work best is when you have a query with a large number of joins (how many counts as "large" is relative, I suppose; I don't have an exact number).

    Sometimes the SQL Server optimizer will accidentally try to use a join that exists only for presentation logic too early (further to the right in the execution plan), instead of focusing first on the joins that are designed for filtering.  The earlier those presentation joins are done, the more work the rest of the query execution has to do, because the resultset wasn't narrowed down properly.  When I see that happen, I'll run a query into a temp table with all the filtering joins, then a separate query off of that temp table with all the presentation-type joins.
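
    Roughly like this (every table and column name below is a made-up placeholder):

    -- Step 1: the filtering joins only, narrowing the resultset into a temp table.
     SELECT o.OrderID, o.CustomerID, o.Amount
       INTO #Narrowed
       FROM dbo.Orders o
       JOIN dbo.OrderStatus s ON s.OrderID = o.OrderID
      WHERE s.StatusCode = 'SHIPPED'
        AND o.OrderDate >= '20240101';

    -- Step 2: the presentation joins (names, labels, descriptions) run against
    -- the already-small temp table instead of the full tables.
     SELECT n.OrderID, c.CustomerName, r.RegionName, n.Amount
       FROM #Narrowed n
       JOIN dbo.Customer c ON c.CustomerID = n.CustomerID
       JOIN dbo.Region   r ON r.RegionID   = c.RegionID;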
