Super long running queries (bad users)

  • I'm not really sure this is a "performance tuning" topic, but I didn't know where else to drop it so I figured I'd start here.

    I remotely manage several data warehouse platforms for Credit Unions around the country. Recently, several of my customers have hired new SQL cowboys that "know SQL"... but don't really know what they are doing, especially in dealing with very large tables.

    These guys seem to think it's ok to build a query that runs for 2-4 hours. Some have even gotten clever and used their book/classroom knowledge to schedule these mega long running queries to kick off in the middle of the night so they don't have to wait for them; which interferes with maintenance plans, backups, and warehouse fill jobs. In every case it is a problem of query construction, and the excessive use of sub selects (these guys are using subselects in one query more than I've used last year)... All you other DBA's out there know what I'm talking about, the problem isn't the database nor indexes...it's the programmers lack of respect for the environment. The environment is tuned...indexes in place, and so far I have been able to rebuild these 4 hour queries to run in a minute or two.

    Anyway, before I get on that soapbox... is there a away as a native SQL Server config setting, to automatically kill any query that runs for more than n minutes? I looked into the Query timeout setting in Adv options, but I'm not entirely certain that will do what I need.

    I've already built a procedure that runs at the beginning and end of my maintenance and ETL jobs to detect open queries, so I know who my cowboys are and I have captured copies of their offensive SQL. My next step is to add KILL statements to that look up to kill those connections during my time. But what about these scheduled things that kick off in the middle of my jobs? (That happened last night, my ETL job was 45 minutes in to a 90 minute process and one of these 4 hour queries locked one of the tables I was trying to fill for four hours, pushing my warehouse fill completion from 4:30 am to 9am).

    I could build a patrol dog sql agent job that cycles every 15-30 minutes looking for queries that have been running for too long and then cut them off (and trying to code for things that DO run for long periods, like backups and index reorgs), but that seems awfully severe...and not without a whole lot of potential negative side effects.

    Any ideas? (besides dealing with the cowboys poor techniques? I'm already on that one...)

  • If you have the Enterprise Edition on this site and it is alt least SQL Server 2008, then you could configure the Resource Governor to assist with the situation. I don't have experience with it myself but just know of it and its overall purpose. In the brief internet searches I did I found info from BOL that indicates it will help with throttling run-away queries and allow you to put those queries in a workgroup by themselves, and allow queries in other workgroups to finish quicker via work load priority settings.

    https://msdn.microsoft.com/en-us/library/bb895232%28v=sql.100%29.aspx

  • It's odd that the developers would be writing SQL involving complex sub-queries in a data warehouse. Perhaps it's not a star-schema or they're not familar with the data model.If you don't currently have a ERM diagram, then create one and share it with the SQL guys. Examine their SQL and give them some direction.

    Also, perhaps discuss with management what the issue is and suggest a policy where the developers build SQL queries based on business requirements, basically just documenting output columns and selection criteria, and have them share that you for input before they starting running ad-hoc queries out of the blue.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Ah... an idealist in the group. 😉

    You are exactly right.. they don't understand the data model, and to the institutions discredit... haven't spent enough time (and money) on training these poor souls simply trying to produce results.

    All of those things are underway... but the 4-8 hour queries persist... because they don't know any better. One of the things I've noticed about new SQL/Analysts is that there is an overwhelming lack of fundamental knowledge (theory) about how things work, and why things do what they do.

    I was working with a young man a few days ago (another that thought waiting an hour for a query was "OK"), and although he knew what a left outer join was, he could not explain what it actually did, or more importantly when it was warranted.

    I think it's just a phase... new comers and coders breaking into the SQL space, the language is not hard to learn, nor use... wizards make it even easier. But when it comes down to responsible query development, there is a core knowledge point that is missing.

    In the meantime however, I still need to find a way to throttle those runaway jobs to the background (or even better yet, shut them down) when they occur.

  • Could you create a group/role to revoke access to the dbs before the maintenance tasks occur and grant the access again after all the night processes end?

    Would that create problems?

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Clever idea... I had not thought of that. The big question would be if that change kicked in immediately (changing the behavior of existing open connections, and possibly running queries), or if I'd have to kill all those connections first anyway.

  • jchapman (2/12/2015)


    Ah... an idealist in the group. 😉

    Well, it's good to hold on to our ideals... but we must also carry a big stick. First try to educate and reason with these guys, even if it's just a pretext, but also setup Resource Governor or perhaps a scheduled job that kills processes running longer than X minutes.

    However, I'm not sure if Resource Governor will apply constraints on runtime duration, or even actually terminate processes under any condition. I think it just forces queries originating from defined workgroups (grouping of user accounts) to run within defined resource pools (allocated slices of CPU and memory), so it essentially prevents runaway queries from hogging resources. It may actually cause these queries to run even longer, and in the meantime the queries will still acquire shared locks on tables which interfere with whatever nightly maintenance jobs you have going on.

    But I have no actual experience with Resource Governor, you'll have to read up on that. I'm mostly just saying that when it comes to dealing with power users who run ad-hoc SQL, educating and reasoning with them may be your only recourse, other than taking away their login.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • jchapman (2/12/2015)


    Clever idea... I had not thought of that. The big question would be if that change kicked in immediately (changing the behavior of existing open connections, and possibly running queries), or if I'd have to kill all those connections first anyway.

    I think the best way to handle this is to temporarily set the database to restricted_user (sysadmin only) mode. The following will initiate termination of all user connections on a database, allowing 120 seconds for them to complete before termination and rollback. In the interim no users can re-connect. In your case, you may want to initiate RESTRICTED_MODE a half-hour before your maintenance window, allowing any reasonable query enough time to complete.

    ALTER DATABASE database-name SET RESTRICTED_USER WITH ROLLBACK AFTER 120 SECONDS;

    Once you're completed your critical maintenance operations, then you can reset database to muli_user mode with the following.

    ALTER DATABASE database-name SET MULTI_USER;

    Obviously this is NOT reccomended for a transactional line of business databases with users who are trying to do "real" work. It's for data warehouses with cowboy users who punch out for the day while leaving some experimental non-mission-critical query running from their desktop.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Killing even the most ridiculous users sessions is a really good way to need an updated resume. I know it takes a whole lot longer to do but proper use of the resource governor and training the cowboys will be a whole lot more effective over time. Management buy-in for all of it in the form of a company policy would help protect you and to set expectations.

    Just killing stuff is going to perpetuate and maybe make the problem worse.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff, your argument is exactly why I haven't put the Kill statement in the middle of my foreign process detective code (that records all open connections, statements, run time, user names, etc.).

    I have enough at this point to be able to show who/when/why this issues occur and can then use those as training cases.... ugh, it's just the next day cleanup when warehouse loads fail (or are mega late) and the ripple down effect of late reports, etc.

    I've also been concerned about a blind Tarantino/Robert Rodriguez style connection killer because there will be that one time when something important and justifiable is killed because of some last minute/crunch request and the planet comes unhinged from there.

    I've not used the resource governor before, but have since started reading up on it since the suggestions in this thread; I do like the idea of how it works, but also agree that it would likely make these 4 hour queries worse... not better.

    My next approach is a new "Coyotes on the Fence"/Wall of shame dashboard to show queries submitted by user with runtimes on them in the hopes that the cowboys will wise up and stop doing it.

    In the end... I guess there is no magic bullet for "this query has been running longer than allowed" master setting and as suggested... simply needs to be handled on cowboy by cowboy basis.

    Thanks to everyone who responded and for the ideas.

  • Jeff Moden (2/12/2015)


    Killing even the most ridiculous users sessions is a really good way to need an updated resume. I know it takes a whole lot longer to do but proper use of the resource governor and training the cowboys will be a whole lot more effective over time. Management buy-in for all of it in the form of a company policy would help protect you and to set expectations.

    Just killing stuff is going to perpetuate and maybe make the problem worse.

    I agree 95%, but in a data warehouse environment, insuring a successful DDL script deployment or critical data reload often times requires a window where the DBA has exclusive use of the database. That's impossible if ad-hoc users have SQL brain farts running 24/7. The DBA should schedule maintenance for a time when management approves, and then give users due notice, but when these maintenance windows arrive, it's time to kick the kids out of the pool.

    Real mission critical querying takes place during regular business hours.

    I've never seen a DBA lose their job because they aborted zombie SQL processes that ran for half and day and into the night. In fact, often times it's management who orders the kill (behind the scenes with a wink and a nod). But I have seen a DBA lose their job because the database was generally no longer available to executive level users due to missed deployments or runaway resource consumption caused by lower level script kiddies.

    In an organization where SQL coders act like cowboys, the DBA has to play the role of Sheriff, and that sometimes means (trouble)shooting... to kill.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • If those queries interfere with maintenance plans, backups, and warehouse fill jobs, then you should make your management aware of that fact. All three are important your customer base in one way or another, so everybody needs to know the consequences of them not running within the negotiated time slots or at all.

  • lptech (2/12/2015)


    then you should make your management aware of that fact

    That's what my wall of shame dashboard is for. The thing that makes it difficult (in this scenario, however) is it's not "my" management, it is my customers; I'm just the OEM vendor for the warehouse tools and defacto DBA because the sites have Cowboys... not DBA's, and when load fails, they a) blame the software and then b) call support. It doesn't take long to toss the support hat and don the DBA hat.

    The real rub is that there isn't a quick fix, this has to fix itself over time. And as my customers hire "qualified on paper" because some guy/gal can spell SQL that don't respect the environment, I'll just have to deal with it, use my DBA voice and educate a new batch of newbees.

    The evolution was that first I had to identify what was causing the delays/deadlocks (with the challenging part being that all this happens in the middle of the night, and the cowboy queries have long since released their locks by the time I figure out there is a problem). Once I figured out that was going on, I created a procedure (and my own audit table) that captures a list of open connections at the beginning and end of my maintenance and ETL jobs. I could use that to show the source of the issue. Then comes the long process to a) stop it and b) fix it. I'm now trying to figure out a clean step c) prevent it from reoccurring... which other than my wall of shame and police reports, I believe is the best I can accomplish.

  • It sounds like in your case the users are actually working for external clients, so unfortunately you're in a position where both you and your managers have to tread lightly. I know it must be hard as a DBA to stand by and watch SQL cowboys running wild and not being able to pull the trigger or even flash a badge. Once upon a time databases had a timeshare model where users payed according to resource consumption. In retrospect that might not be a bad idea today for hosted databases.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (2/13/2015)


    It sounds like in your case the users are actually working for external clients, so unfortunately you're in a position where both you and your managers have to tread lightly. I know it must be hard as a DBA to stand by and watch SQL cowboys running wild and not being able to pull the trigger or even flash a badge. Once upon a time databases had a timeshare model where users payed according to resource consumption. In retrospect that might not be a bad idea today for hosted databases.

    Looks like this is coming 'full circle' in the mainframe days I hear there was this type of pay for resource consumption calculation... Maybe with the 'cloud' where we pay by resource consumption ( compute, io, storage, etc..) it's all coming back and we'll all be better at writing more optimum code... less long running queries for the poster otherwise the bill goes to them 😀 However doubtful the cloud will be for every organization in the future, but a definite % of a companies infrastructure likely will be in the cloud maybe?

    --------------------------------------------------
    ...0.05 points per day since registration... slowly crawl up to 1 pt per day hopefully 😀

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply