Dynamically Query a 100 Million Row Table-Efficiently

Hi, first of all, thanks for a great article. I think the following query that you posted would return an incorrect result set when one of the product ids is null and the other is not. For instance, if @product_id1 is null, the whole OR group evaluates to true regardless of @product_id2, and hence all product ids would be returned. Please correct me if I am wrong.

    declare @product_id1 int = 170, @product_id2 int = 169, @class_id int = 1;

    SELECT COUNT(*)
    FROM Fact
    WHERE
    (
        (@product_id1 = product_id OR @product_id1 IS NULL)
        OR
        (@product_id2 = product_id OR @product_id2 IS NULL)
    )
    AND (class_id = @class_id OR @class_id IS NULL)
    OPTION (recompile);

    I have another question regarding the procedure cache. If I go with dynamic SQL, build my query into a string and then execute it using sp_executesql to take advantage of the plan cache, a new efficient plan would be generated and cached for each value passed. Won't we end up with a lot of plans in the cache if the proc is called with many different values, and shouldn't this be a consideration too? Thanks
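    For what it's worth, if the values are passed to sp_executesql as parameters rather than concatenated into the string, the cache holds one plan per distinct query text (i.e. per combination of supplied predicates), not one per value. A minimal sketch, assuming the Fact table and variables from the article:

    -- Hypothetical sketch: append a predicate only for each supplied parameter,
    -- then pass the values as real parameters. Plans are cached per query
    -- *shape* (which predicates were appended), not per parameter value.
    DECLARE @sql nvarchar(max) = N'SELECT COUNT(*) FROM Fact WHERE 1 = 1';

    IF @product_id1 IS NOT NULL
        SET @sql += N' AND product_id = @p1';
    IF @class_id IS NOT NULL
        SET @sql += N' AND class_id = @c';

    EXEC sp_executesql @sql,
         N'@p1 int, @c int',
         @p1 = @product_id1, @c = @class_id;

    So the number of cached plans is bounded by the number of predicate combinations, which is usually far smaller than the number of distinct values.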

  • Great article and easy to follow along - even for a rookie like me. Thanks again.

  • azhuravel3343 (2/24/2015)


    Mr Strange

    I appreciate the information but it also would be helpful to list title/ISBN of Itzik Ben Gan book that you are referring to.

    There are a few:

    http://www.amazon.com/T-SQL-Querying-Developer-Reference-Ben-Gan/dp/0735685045/ref=asap_bc?ie=UTF8

    http://www.amazon.com/Microsoft-Server-Fundamentals-Developer-Reference/dp/0735658145/ref=asap_bc?ie=UTF8

    http://www.amazon.com/Microsoft-High-Performance-Functions-Developer-Reference/dp/0735658366

  • PHYData DBA (2/24/2015)


    To start with, I did not carefully read the entire article.

    I stopped at the 5th time I saw "Select *" used in the code for an article about Query performance.

    At that point I scrolled through, and it seems all the SQL uses "Select *" instead of selecting column names.

    You want to remove the Schema Scan that needs a shared lock to complete and will occur on every execution of your Performance tuned select.

    This was an article about performance tuning right? 😉

    There are times when "SELECT *" is a bad idea but, in the article, it appears that the author wants to return all of the columns. When you need to return all of the columns there is no problem with "SELECT *" from a performance standpoint: if you are returning all the columns, "SELECT *" and "SELECT <column names>" produce the exact same query plan. Using the 120M row sample data, the queries below produce the same query plan and perform exactly the same...

    -- query 1
    SELECT *
    FROM Fact
    WHERE product_id IN (97, 90, 79)
    OPTION (recompile);

    SELECT date_id, shipDate_id, class_id, product_id
    FROM Fact
    WHERE product_id IN (97, 90, 79)
    OPTION (recompile);

    -- query 2
    SELECT *
    FROM Fact
    WHERE shipDate_id = 19
    OPTION (recompile);

    SELECT date_id, shipDate_id, class_id, product_id
    FROM Fact
    WHERE shipDate_id = 19
    OPTION (recompile);

    -- query 3
    SELECT *
    FROM Fact
    WHERE product_id < 100
    OPTION (recompile);

    SELECT date_id, shipDate_id, class_id, product_id
    FROM Fact
    WHERE product_id < 100
    OPTION (recompile);

    The article is good. I would suggest continuing from where you left off 😉

    "I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Good work! appreciate it.

    😎

  • Thanks muhammad.yasir, good spot!

    I've looked into this and perhaps the following would more accurately solve the problem....

    declare @product_id1 int = 170, @product_id2 int = 169, @class_id int = 1;

    SELECT COUNT(*)
    FROM Fact
    WHERE
    (
        (@product_id1 = product_id OR @product_id2 = product_id)
        OR
        COALESCE(@product_id1, @product_id2) IS NULL
    )
    AND (class_id = @class_id OR @class_id IS NULL)
    OPTION (recompile);

  • Thanks for all the feedback, much appreciated.

    As I said previously, writing an article for a SQL Server forum with enough detail to convey an understanding of a subject, while staying short enough to be consumed in someone's coffee break, is a challenge.

    Some of the subjects I had to drop were...

    Symmetric adding

    Fuller, busier query plan cache

    Parameter Sniffing

    Plan parameterization

    But it sounds like you guys want them back, so I guess you've thrown me the challenge of article II. Looks like I need another loan from the time bank 🙂

    For now, I hope you enjoyed the article and that it demonstrated well one of the key points: that data is not uniform, and this presents a problem for the query optimiser and plan caching, especially when the dataset is big.

  • Am I dreaming, or have I read somewhere that TVFs are stored as source code and expanded when the calling query is compiled?

    This fits well with your results: an inline query with constant parameters is compiled each time; the parameterised version utilises the stored plan and so can be sub-optimal.

    For me, there's no magic, the recompile hint works as expected.

    Having said this, I agree with the other commentators that you've written an excellent article.

  • Hi,

    You have suggested reading a book written by Itzik. May I know which book you are referring to? Is it 'Microsoft SQL Server 2012 T-SQL Fundamentals (2012)'?

  • Steve Jones - SSC Editor (2/24/2015)


    azhuravel3343 (2/24/2015)


    Mr Strange

    I appreciate the information but it also would be helpful to list title/ISBN of Itzik Ben Gan book that you are referring to.

    There are a few:

    http://www.amazon.com/T-SQL-Querying-Developer-Reference-Ben-Gan/dp/0735685045/ref=asap_bc?ie=UTF8

    http://www.amazon.com/Microsoft-Server-Fundamentals-Developer-Reference/dp/0735658145/ref=asap_bc?ie=UTF8

    http://www.amazon.com/Microsoft-High-Performance-Functions-Developer-Reference/dp/0735658366

    Check out Steve Jones's earlier post.

    Perhaps start with T-SQL Querying, as it's packed with solutions to common code problems.

  • gary.strange-sqlconsumer (4/21/2015)


    Steve Jones - SSC Editor (2/24/2015)


    azhuravel3343 (2/24/2015)


    Mr Strange

    I appreciate the information but it also would be helpful to list title/ISBN of Itzik Ben Gan book that you are referring to.

    There are a few:

    http://www.amazon.com/T-SQL-Querying-Developer-Reference-Ben-Gan/dp/0735685045/ref=asap_bc?ie=UTF8

    http://www.amazon.com/Microsoft-Server-Fundamentals-Developer-Reference/dp/0735658145/ref=asap_bc?ie=UTF8

    http://www.amazon.com/Microsoft-High-Performance-Functions-Developer-Reference/dp/0735658366

    Check out Steve Jones's earlier post.

    Perhaps start with T-SQL Querying, as it's packed with solutions to common code problems.

    Get yourself a subscription to SafariBooksOnline.com and you have access to Itzik Ben Gan, Kalen Delaney and a host of other SQL Server writers or for that matter writers on all sorts of technology and business.

    Best £300 I ever spent.

  • Just finished this for the first time. Great work Gary! 5 stars.

    "I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • ChrisM@Work (2/24/2015)


    ... Two omissions stood out however - the cost of recompiling, ...

    I will jump through hoops to trade CPU ticks to avoid disastrously bad plans (and the horrible IO and concurrency issues they bring among other things). 🙂 This is clearly such a beneficial case. In the past 2 decades of my SQL Server consulting business I have had hundreds of such opportunities but can't recall a single situation where I removed an OPTION (RECOMPILE) from a query at a client that was causing bad things to happen.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Some time ago I saw an article where the author had noticed a problem with a query whose principal condition was WHERE OrderDate BETWEEN @StartDate AND @EndDate.

    The problem observed was that if the query ran to produce a weekly report first, it performed well, but the daily and monthly reports ran like snails.

    If the monthly report ran first, then that ran fine but the others ran slowly.

    In the end the author decided to introduce a dummy variable into the dynamic SQL executed via sp_executesql. The name of the variable depended on the difference in days between @StartDate and @EndDate. In effect, the daily, weekly and monthly reports each got a separate execution plan even though the working mechanicals were identical.

    SELECT {column list} FROM {tables} WHERE OrderDate BETWEEN @StartDate AND @EndDate AND @Daily=1

    SELECT ... WHERE OrderDate BETWEEN @StartDate AND @EndDate AND @Weekly =1

    ...etc

    The approach solved the problem.

    I'm nervous about an all-encompassing stored procedure with loads of "maybe" parameters. I had a query that qualified as such, but when I used SQL Profiler I found that three permutations accounted for the vast majority of cases.

    What I did was to write procs for those 3 cases and a 4th that simply decided which of the 3 to call.

    There are downsides to this approach as well, but it met the SLAs and was easy to read.
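    The dispatch pattern described above might be sketched like this; the procedure, table and parameter names are hypothetical:

    -- Hypothetical router: each common parameter permutation gets its own
    -- procedure (and therefore its own cached plan); the wrapper only dispatches.
    CREATE PROCEDURE dbo.GetOrders_Router
        @StartDate date, @EndDate date, @CustomerID int = NULL
    AS
    BEGIN
        IF @CustomerID IS NOT NULL
            EXEC dbo.GetOrders_ByCustomer @StartDate, @EndDate, @CustomerID;
        ELSE IF DATEDIFF(DAY, @StartDate, @EndDate) <= 7
            EXEC dbo.GetOrders_ShortRange @StartDate, @EndDate;
        ELSE
            EXEC dbo.GetOrders_LongRange @StartDate, @EndDate;
    END;

    The router itself has a trivial plan, so only the three specialised procedures ever get plans worth worrying about.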

  • David.Poole (5/27/2016)


    Some time ago I saw an article where the author had noticed a problem with a query whose principal condition was WHERE OrderDate BETWEEN @StartDate AND @EndDate.

    The problem observed was that if the query ran to produce a weekly report first, it performed well, but the daily and monthly reports ran like snails.

    If the monthly report ran first, then that ran fine but the others ran slowly.

    In the end the author decided to introduce a dummy variable into the dynamic SQL executed via sp_executesql. The name of the variable depended on the difference in days between @StartDate and @EndDate. In effect, the daily, weekly and monthly reports each got a separate execution plan even though the working mechanicals were identical.

    SELECT {column list} FROM {tables} WHERE OrderDate BETWEEN @StartDate AND @EndDate AND @Daily=1

    SELECT ... WHERE OrderDate BETWEEN @StartDate AND @EndDate AND @Weekly =1

    ...etc

    The approach solved the problem.

    I'm nervous about an all-encompassing stored procedure with loads of "maybe" parameters. I had a query that qualified as such, but when I used SQL Profiler I found that three permutations accounted for the vast majority of cases.

    What I did was to write procs for those 3 cases and a 4th that simply decided which of the 3 to call.

    There are downsides to this approach as well, but it met the SLAs and was easy to read.

    Report queries like you mention are one of the most common wins for OPTION (RECOMPILE). You clearly don't want the same plan used for a 1 day date range as you would for a 10 year date range (or vice-versa). OPTION (RECOMPILE) in this case also protects you against actual data value skew at the same time in cases where you may also filter on a ClientID for example and one of your clients is General Electric and the rest are mom-and-pop corner stores.

    Why bother with fake-variable workarounds when you have a perfect solution at hand? 😎
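    The straightforward alternative Kevin describes might look like this (table and column names are illustrative):

    -- One statement serves the daily, weekly and monthly reports.
    -- OPTION (RECOMPILE) compiles a fresh plan for the actual
    -- @StartDate/@EndDate values on every execution, at the cost
    -- of a compile per call.
    SELECT OrderID, OrderDate, Amount
    FROM dbo.Orders
    WHERE OrderDate BETWEEN @StartDate AND @EndDate
    OPTION (RECOMPILE);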

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

Viewing 15 posts - 16 through 30 (of 36 total)
