How did you deal with the pressure of a perf issue? (non-tech question)

  • Hi all,

    I don't think there's a greater test of one being a DBA than being faced with a performance issue on a critical server, and there's nobody EXCEPT the DBA to blame.  I myself have been fortunate to have had the luxury of repointing blame - "The exact same query/workload runs in under a second in Management Studio, obviously the problem is not within SQL", or delaying/deflecting blame,  "This new SSIS package is taking 8 hours to run where its DTS predecessor takes 113 seconds, at least we're still in Dev and have time to look for improvements, which is what Dev/Testing is for".

    Thankfully, I've never been in what must be the AWFUL position of a critical query turning to sludge on a critical system, and having a dozen members of management breathing down your neck as you frantically go through execution plans, asking every 30 seconds, "We need an update now on an eta for resolution", and that because a member or two of the board is breathing down THEIR necks.

    I've never been an 'Engineer Scott' - if I know a problem or task will take an hour to fix, I generally don't quote 3-4 hours so I can look like a genius.  If it's just a matter of more memory or upgrading a virtual disk, well hardware fails eventually and then, the focus comes off you, and onto whoever is provisioning the new hardware

    But what if it's a code rewrite?  That could take minutes, it could take DAYS.  "Which is it??!!!" ask management furiously.  They want a specific answer, understandably.

    How do you push back?  Such a shame there is so little written on this aspect of our job.

  • My take on this kind of situations is never to participate or encourage any kind of a blame game, but instead focus on the problem's solution. Get everyone on board and have them pulling the wagon in the same direction! Once one has demonstrated the ability to provide a resolution, things get easier and everyone will be contempt to the fact that it takes as long as it takes but eventually it will be resolved. 
    Pushing vigorously back will be seen as a responsibility deflection or in other words, ducking the blame. Not worth it.
    😎

    There is only one answer to the question whom which one does not know the answer: "I don't know!" 😉

  • Firstly 'It's not my problem, it's Bob's', is a terrible way of dealing with anything. I would far rather work with someone who takes responsibility even when it's not their fault and works towards a fix.
    p.s. "The exact same query/workload runs in under a second in Management Studio, obviously the problem is not within SQL", is wrong. The problem very likely is within SQL. SSMS and .Net apps use different SET options, so you can easily have a query run by an app take 5 minutes and exactly the same query run by SSMS take 5 seconds, and it's because the app is getting a different (and bad) plan to SSMS.

    I've had a couple of severe crises at a bank. The managers don't generally want to be hovering. They need to know what's going on, so establishing that you will give them regular updates (every 20 or 30 minutes) is often perfectly acceptable. Providing you do, of course.

    If I know that a problem will take 1 hour to resolve, I'll give an estimate of 2 hours at a minimum. Because I, like most people, tend to under-estimate, because production crises are often more complicate than they initially appear, because my first approach may not succeed, and because it's far better to under-promise and over-deliver than the reverse

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I'm with Gail, never, ever, pass the buck.  You will almost always look better in both the boss' and your co-workers eyes, if you help work to resolve the problem, even if it has nothing to do with your responsibilities.

    As for managers breathing down your back, again, they've got people breathing down their necks including the end-users.  You need to keep them in the loop on what's going on with the problem, even if it's nothing more than a "I'm still working on the issue, it looks like it might still be X hours before the cause is identified / we have a solution / we have a work-around."

    I'd also agree, pad your time estimates.  Don't be excessive (while funny, Mr Scott's "take however long you think it will take and multiply by 4" is way excessive,) but put a cushion on there.  If you think it will take an hour, call it "it might take around 2hrs, maybe less depending on what we find."  If it only takes an hour, everyone is happy, if it takes 1:45, they're still OK because they thought it would take 2.  And absolutely, positively, as soon as you find that it's going to take longer then your quoted estimate?  Tell the boss.  Don't wait until 1:59 into your 2hr estimate to tell them "yeah, it's a gnarlier problem than I thought, it's going to be another couple hours" when you knew 30 minutes in it was going to take a lot longer.  That just erodes your boss' trust in you for the next time there's a problem, which means they're more likely to hover pestering for constant updates...

  • JaybeeSQL - Friday, March 24, 2017 5:54 AM

    Hi all,

    I don't think there's a greater test of one being a DBA than being faced with a performance issue on a critical server, and there's nobody EXCEPT the DBA to blame.  I myself have been fortunate to have had the luxury of repointing blame - "The exact same query/workload runs in under a second in Management Studio, obviously the problem is not within SQL", or delaying/deflecting blame,  "This new SSIS package is taking 8 hours to run where its DTS predecessor takes 113 seconds, at least we're still in Dev and have time to look for improvements, which is what Dev/Testing is for".
    ...
    But what if it's a code rewrite?  That could take minutes, it could take DAYS.

    I tend to take a kind of holistic approach to database performance issues.  I look at what the symptoms are, analyze the query that is running, and the environment that it's occurring in.  In this case you describe, I'd agree with Gail, and the fact that it runs fine in management studio but not in an application, or SSIS package, or whatever tells me that it isn't the code that's the problem.

    You can easily see the SET options in sys.dm_exec_sessions or in a trace.  One such parameter that's different in SSMS than the default is ARITHABORT:
    https://msdn.microsoft.com/en-us/library/ms190306.aspx#Anchor_1
    Microsoft says you should always set this to ON, even though the default is OFF.  SSMS explicitly sets this to ON in your queries, but most applications and even SQL Agent jobs or SSIS don't.  You can change the default at the instance level:
    -  right click on the server in SSMS Object Explorer, and select properties
    -  on the Connections page, scroll down in the Default connection options to the "arithmetic abort" setting and make sure it is checked.
    there's also a setting in each database you can use:
    -  right click on the database in SSMS Object Explorer, and select properties
    -  on the Options page in the Other Options, scroll down to the "arithmetic abort enabled" setting and make sure it is set to TRUE

  • Thanks for the replies everyone. 

    Gail, thank you for putting me right on the Set option, I had no idea all this time about it.  Chris, thanks for explaining it,

    I will say my "Repointing blame" comment has been misunderstood - I had simply meant finding out where the problem lay, and explaining the same to management, and if it's not code, then it's unlikely there's anything we as DBA's can do (sans purchasing authority) to expedite a fix EXCEPT liaise with the relevant parties.  Naturally if just the code which has fallen off the wagon, and we hadn't flagged up those nasty 20 unindexed joins earlier, we should of course do the decent thing and own up.

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply