Odd issue, DBA says due to parameter sniffing, could use advice, please...

  • I manage one data warehouse for my company (an Active Directory snapshot data warehouse), approx. 40M rows of data total. One app stores data; everything else is query-only. It's normalized (BCNF) with only a few rare denormalizations for performance reasons. There are only a few views (and pretty basic ones, at that), and most access is via sprocs. A dozen or so sprocs are general purpose and have a half dozen (or more) optional parameters so people can search using one or more of them (at least one is always required, though).

    We have over 200 sprocs, but a couple are having odd performance problems. I'm hoping our situation might ring some bells with some of you and you might have some advice which I hadn't thought of yet...

    We have the occasional sproc that performs well thousands of times per day, then after a few weeks starts performing pig-slow (see sample below). If I recompile the sproc, the performance problem goes away, sometimes for hours, sometimes for days or weeks; some never recur at all. The sprocs normally return data in under 1s, with a few heavy-hitters nominally returning in 5-10s; then, out of the blue, some of these will start taking minutes to return, some taking--no joke--45 minutes. Database activity is always good, there are no lengthy data locks, and all indices are rebuilt nightly.

    When a sproc goes wonky on us, a parameter set that previously gave near-instant turnaround now gives us a 5-20 minute turnaround (or more), then is zippy again after the recompile, all with the same parameters. Sometimes we can go weeks without seeing any issue, then we might see several occurrences over a day or two, then nothing again for weeks. Once a sproc starts running slowly, it keeps running slowly until recompiled. I can even take queries from Profiler sessions and see examples where sprocs run normally fast then suddenly start taking minutes, even with the same parameters (e.g. processes doing repetitive scheduled queries, such as activity monitoring).

    A sample sproc (code omitted for legal reasons) would join a Group table, a link table, and a User table to produce a group+user listing. There would be params for several group attributes (name, GUID, OU, path, etc.), same with user params (name, userid, dept, etc.), nothing unusual. Most params are optional with "= NULL" defaults or even static text defaults (e.g. @userType = 'desktop').
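
    For illustration only (since the real code can't be posted), the WHERE clause of such a sproc presumably follows the classic "catch-all" optional-parameter pattern; all table and column names below are invented, not the real schema:

    ```sql
    -- Hypothetical sketch of the optional-parameter ("catch-all") pattern
    -- described above. The plan compiled for the FIRST parameter combination
    -- gets cached and reused for every later combination.
    SELECT g.GroupName, u.UserName
    FROM dbo.[Group] AS g
    JOIN dbo.Member  AS m ON m.GroupID = g.GroupID
    JOIN dbo.[User]  AS u ON u.UserID  = m.UserID
    WHERE (@GroupName IS NULL OR g.GroupName = @GroupName)
      AND (@UserID    IS NULL OR u.UserID    = @UserID)
      AND (@Dept      IS NULL OR u.Dept      = @Dept);
    ```

    That one-cached-plan-for-all-parameter-shapes behavior is exactly where parameter sniffing tends to bite.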

    Any ideas, anyone?

    PS I'm hesitant to try some param sniffing tuning techniques (e.g. local vars, with recompile) until I get a better handle on what the cause is.  I.e. I don't want to just "get lucky", I want to know exactly what the cause(s) is so I can produce a proper solution(s).
    PPS I'm a developer, not a DBA. We have DBAs who've been trying to help, but they're stumped as well, basically out of ideas. I'm trying to help them help me.

  • Do you have Query Store turned on for this database? It might be helpful to look at whether this code shows up in its various views and compare the execution times recorded for it over the last hour or so.  If the execution times are all over the map, especially with one pattern performing well and others suffering, you very well might be looking at a parameter sniff. 

    It's a subtle problem, but basically: the query optimizer figures out how to run your code based on the first set of parameter values it's sent.  When you re-run that code with a wildly different set of parameters, well, shenanigans ensue, and not in a good way.

    If you have a test harness, capture some run-time stats with various executions of the SP.  Should be fairly easy to sniff it out. Initial set, and things that look like the initial set will run well, others may suffer.
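
    Absent a full harness, one quick way to eyeball per-plan run-time spread is the procedure-stats DMV (this exists on 2012, so no Query Store needed; the database is assumed to be the current one):

    ```sql
    -- Min/max/avg elapsed time per cached procedure plan. A huge gap
    -- between min_ms and max_ms for a single plan is a classic
    -- parameter-sniffing signature.
    SELECT OBJECT_NAME(ps.object_id, ps.database_id)      AS proc_name,
           ps.execution_count,
           ps.min_elapsed_time / 1000                     AS min_ms,
           ps.max_elapsed_time / 1000                     AS max_ms,
           ps.total_elapsed_time / ps.execution_count / 1000 AS avg_ms,
           ps.cached_time        -- when this plan entered the cache
    FROM sys.dm_exec_procedure_stats AS ps
    WHERE ps.database_id = DB_ID()
    ORDER BY ps.max_elapsed_time DESC;
    ```

    Note these counters reset whenever the plan is evicted or recompiled, so they describe the life of the current cached plan only.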

  • "Query store"  Is that some form of auditing, like for when a sproc might get re-inserted into the cache?  (I'm just guessing that maybe sprocs get dropped from memory periodically, possibly getting reinserted with a set of unusual params that gives us wonky performance for nominal cases?)

  • Ancient Coder - Monday, February 11, 2019 12:10 PM

    "Query store"  Is that some form of auditing, like for when a sproc might get re-inserted into the cache?  (I'm just guessing that maybe sprocs get dropped from memory periodically, possibly getting reinserted with a set of unusual params that gives us wonky performance for nominal cases?)

    It's a feature that came with SQL Server 2016...

    https://docs.microsoft.com/en-us/sql/relational-databases/performance/monitoring-performance-by-using-the-query-store?view=sql-server-2017
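
    Once the database is on 2016+, turning it on is a one-liner (database name is a placeholder; the settings shown are common starting values, adjust to taste):

    ```sql
    -- Enable Query Store on the target database (SQL Server 2016+).
    ALTER DATABASE YourDatabase
    SET QUERY_STORE = ON
        (OPERATION_MODE = READ_WRITE,    -- capture and persist runtime stats
         QUERY_CAPTURE_MODE = AUTO,      -- skip trivial/infrequent queries
         MAX_STORAGE_SIZE_MB = 1024);
    ```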

  • Parameter sniffing gone wrong can often (almost always?) be related to statistics being out of date. 
    Look at the tables used in the procedure: it only takes a small percentage of inserts/updates/deletes to skew statistics without being enough to trigger an auto-update of statistics. If you have a job that, say, runs on weekends to update stats, you might want a special job that updates stats on those specific tables more often, even multiple times a day.
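
    For example (table names taken from the Group/Member/User join described earlier in the thread, so treat them as placeholders):

    ```sql
    -- Refresh statistics on just the volatile tables, e.g. from an Agent
    -- job run a few times a day. FULLSCAN is thorough but costs more I/O;
    -- the default sampled scan may be enough.
    UPDATE STATISTICS dbo.[Group] WITH FULLSCAN;
    UPDATE STATISTICS dbo.Member  WITH FULLSCAN;
    UPDATE STATISTICS dbo.[User]  WITH FULLSCAN;

    -- How stale are they now? (requires 2008 R2 SP2+ / 2012 SP1+)
    SELECT OBJECT_NAME(s.object_id) AS table_name,
           s.name                   AS stat_name,
           sp.last_updated,
           sp.rows,
           sp.modification_counter  -- changes since the last stats update
    FROM sys.stats AS s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE s.object_id = OBJECT_ID('dbo.Member');
    ```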

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Maybe, but it doesn't have the right smell. The sprocs we've noticed slowing down do so suddenly, going from, for example, 1s to 45s, given the same params.  If the slowdown were caused by CRUD data changes, I'd think it'd be more gradual.  I'm just guessing, here, since I don't know how internal sproc caching functions. Here's an example of a sproc's params, nothing really complicated, body just joins a Group and User table with a Member link table:

    CREATE PROCEDURE dbo.GetUserGroupListing
      @AccountType NVARCHAR(20) = N'Desktop'
     ,@EmployeeID NVARCHAR(64) = NULL
     ,@UserDomain NVARCHAR(20) = N'root'
     ,@UsersAMAccountName NVARCHAR(64) = NULL
     ,@UserNTAccount NVARCHAR(64) = NULL
     ,@UserCN NVARCHAR(64) = NULL
     ,@IncludeDistributionLists BIT = 0   /* set to 1 to include DLs, else only security groups are returned */
    AS
    /* body omitted for legal reasons; it joins Group, Member, and User */

    The tables don't change much over the course of a day, though Member does a bit more often as people are assigned to different groups over time. Still, the table indices are rebuilt nightly for peak performance. Table row counts are approximately 900K, 6M, and 150K, respectively.

  • ManicStar - Monday, February 11, 2019 12:00 PM

    If you have a test harness, capture some run-time stats with various executions of the SP.  Should be fairly easy to sniff it out. Initial set, and things that look like the initial set will run well, others may suffer.

    We're in the midst of upgrading from SQL Server 2012 to 2016, so some enviable features aren't available yet. 🙁
    I benchmark all code objects*, and the sprocs with crappy result times all ran superbly before and now run poorly with the exact same parameters and nearly the same data. Worse, I've recorded individual sprocs performing well upon creation and during spot checks, then going wonky until being recompiled. The state changes between "well" and "wonky" aren't gradual, they're sudden. I've even created Agent jobs which perform semi-random executions of select sprocs & functions, recording execution times in a temp table, so I can see code objects' sudden performance shifts.  Our DBA team says "parameter sniffing", but now I wonder if there's something going on at a deeper level that they're not noticing.

    * I'm a veteran software engineer, not a DBA/DBE, yet I must wear those hats on occasion.

  • I have a question about this:
    When it gets suddenly slow, are you able to reproduce the slowness on your machine from SSMS?

    What I am thinking is to nab the execution plan when it is fast and then grab the execution plan when it is slow and compare the two.  I expect they will look different and will probably help lead you to the actual problem.
    Another thing to check when things start performing slow is how bad do the statistics look when it is slow?
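
    One way to grab the currently cached (possibly bad) plan without having to catch the slowness live in SSMS is the plan-cache DMVs; a sketch against the proc posted earlier in the thread:

    ```sql
    -- Pull the cached execution plan for one stored procedure so the
    -- "slow" plan can be saved off and diffed against a "fast" one later.
    SELECT OBJECT_NAME(ps.object_id) AS proc_name,
           qp.query_plan,            -- XML; click it in SSMS to view graphically
           ps.execution_count,
           ps.last_execution_time
    FROM sys.dm_exec_procedure_stats AS ps
    CROSS APPLY sys.dm_exec_query_plan(ps.plan_handle) AS qp
    WHERE ps.object_id = OBJECT_ID('dbo.GetUserGroupListing');
    ```

    Save the XML before recompiling, since the recompile evicts that plan from the cache.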

    Depending on how the stored proc is usually used too, you could do some tweaks with query hints to get better performance, but you should use those carefully as they can have unintended side effects.

    My first step with that problem would be to look at the execution plan though since you don't have query store.  Step 2 would be to look at how bad the statistics are.

    With your agent jobs, can you notice any sort of pattern with the slowness?  Having a pattern (such as a set amount of time before things behave slowly or it runs fast until Monday at 8:00 AM) can often help identify problems.

    I would ask for a copy of the execution plans BUT they will contain data that you won't want to share with us unless you run it through an obfuscator of some sort first.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • I'm with the DBA team; it sounds like sniffing.  
    Sniffing is caused in part by a cached execution plan being reused with wildly different criteria than what it was compiled with.  Eventually cached plans get rebuilt by the query optimizer, so you can see performance come and go. 

    Create yourself a test harness and get run-time stats from a good sampling of stored procedure calls with different parameter sets.  If they are all over the map, then you are looking at sniffing.   Resolving sniffing is usually a case of re-engineering the code so it can't end up executing off a bad cached plan.  There are different techniques to do it. 
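
    One of the standard mitigations, sketched against the proc posted earlier; the body and column names are hypothetical, not the real schema. Trade-off: OPTION (RECOMPILE) burns CPU compiling on every call, in exchange for never reusing a plan compiled for someone else's parameters.

    ```sql
    -- Sketch only: a recompile-per-call variant of the posted proc.
    CREATE PROCEDURE dbo.GetUserGroupListing_Recompile
      @UserCN NVARCHAR(64) = NULL
     ,@AccountType NVARCHAR(20) = N'Desktop'
    AS
    SELECT g.GroupName, u.UserCN
    FROM dbo.[Group] AS g
    JOIN dbo.Member  AS m ON m.GroupID = g.GroupID
    JOIN dbo.[User]  AS u ON u.UserID  = m.UserID
    WHERE (@UserCN IS NULL OR u.UserCN = @UserCN)
      AND u.AccountType = @AccountType
    OPTION (RECOMPILE);  -- fresh plan every call; no cross-parameter reuse
    ```

    The heavier re-engineering option (the one the article below walks through) is building the query per parameter combination with sp_executesql, so each combination gets and keeps its own cached plan.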

    I recommend you read this article by Gail Shaw about the optimizer and how to confuse the cr*p out of it: 

    https://www.red-gate.com/simple-talk/sql/t-sql-programming/how-to-confuse-the-sql-server-query-optimizer/?article=2280

  • (re: article) "They don’t work so well in T-SQL and they’re often not compatible with how the query optimizer behaves."
    My 6th sense is all aquiver here! I'm a software engineer, so I do OOP w/o thinking about it. It's great for code reuse, but I've run into weirdisms w/ SQL Server before, so I suspect this is a new case of that old problem.

    A few of our problem sprocs fall into the "generic query" category and I suspect others are, too.

    Thx for that article! It's time for reading and refactoring. 🙂

    Oh, and I've looked for patterns in the sproc benchmarking table and there are none apparent; that's what makes this so frustrating and implies that I'm bumping into ignorance, as in "something not-yet-known". 🙂

  • bmg002 - Tuesday, February 12, 2019 9:31 AM

    I have a question about this:
    When it gets suddenly slow, are you able to reproduce the slowness on your machine from SSMS?

    What I am thinking is to nab the execution plan when it is fast and then grab the execution plan when it is slow and compare the two.  I expect they will look different and will probably help lead you to the actual problem...

    Aye, consistent in SSMS. Recompile and slowness disappears but reappears w/in hours or days, no pattern (yet) as to what triggers it. Worse, it occurs at seemingly random times and for different sprocs, e.g. sproc A runs fine for hours then goes wonky at noon, sproc B runs fine for days and goes wonky on Friday at 6am, and sproc C goes wonky at Wed 4pm. No apparent pattern or causal events. To me, it smells like I'm inadvertently causing some conflict w/in SS' internals, e.g. statistics. (I'm looking into "statistics" now regarding SS.)

    I haven't compared plans before and after, so I'll set that up. Great idea!  (And yeah, sorry that I can't post much here due to the nature of our company.)

  • Ancient Coder - Tuesday, February 12, 2019 10:01 AM

    bmg002 - Tuesday, February 12, 2019 9:31 AM

    I have a question about this:
    When it gets suddenly slow, are you able to reproduce the slowness on your machine from SSMS?

    What I am thinking is to nab the execution plan when it is fast and then grab the execution plan when it is slow and compare the two.  I expect they will look different and will probably help lead you to the actual problem...

    Aye, consistent in SSMS. Recompile and slowness disappears but reappears w/in hours or days, no pattern (yet) as to what triggers it. Worse, it occurs at seemingly random times and for different sprocs, e.g. sproc A runs fine for hours then goes wonky at noon, sproc B runs fine for days and goes wonky on Friday at 6am, and sproc C goes wonky at Wed 4pm. No apparent pattern or causal events. To me, it smells like I'm inadvertently causing some conflict w/in SS' internals, e.g. statistics. (I'm looking into "statistics" now regarding SS.)

    I haven't compared plans before and after, so I'll set that up. Great idea!  (And yeah, sorry that I can't post much here due to the nature of our company.)

    You have sniffing.  Time to refactor stuff.

  • You can get temporary relief, if you want, by creating an Agent job:
    1. Find out the max_elapsed_time you see when you have bad performance --- call it X.
    2. Write a simple IF query: if max_elapsed_time > X, then recompile. Run it as an Agent job every 2-5 minutes (depending on your need) at night while your job is running.
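
    A rough sketch of that watchdog step (threshold and proc name are placeholders); sp_recompile just marks the proc so its next call builds a fresh plan:

    ```sql
    -- Agent-job step: if the cached plan for this proc has ever run longer
    -- than the threshold, flag it for recompilation on its next execution.
    DECLARE @threshold_ms BIGINT = 30000;  -- placeholder: tune to your baseline

    IF EXISTS (SELECT 1
               FROM sys.dm_exec_procedure_stats
               WHERE object_id = OBJECT_ID('dbo.GetUserGroupListing')
                 AND max_elapsed_time / 1000 > @threshold_ms)
    BEGIN
        EXEC sp_recompile N'dbo.GetUserGroupListing';
    END
    ```

    Since max_elapsed_time lives with the cached plan, the counter resets once the new plan is cached, so the job won't keep re-firing on stale history.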

  • Thank you guys for your advice, I appreciate it. While waiting to see if our DBA team can help more, I'm expanding my gathering of benchmark metrics and, unless the DBAs can pull a rabbit out of their collective hat, I think the Agent job will do the trick until I can refactor the affected sprocs.
