Trace Data from major blocking incident

  • My new company has a vendor app that is having major blocking problems. I arrived at the new job and got tossed right it but it is fun.

    Last Thursday I was tracing and running MSSQL Blocks (great tool) when a major blocking incident happened. I correlated the transactions and Sql statments using spid and starttime to get the details. I based my analysis on the trace deteails and the xml file (attached as txt) from the MSSQL Blocks. My conclusions are outlined in the trace detail file

    I'm looking for some feedback on my conclusions before showing them to the vendor. Neither file is particularly big and I think they are formatted fairly well. Please let me know if you have problems reading them.

    I look forward to your review and opinions.

  • An oddity for me, do you happen to still have eventsequence 7003193444, or was it completely unrelated to spids 91 and 94?

    I'll save some researchers a bit of lookup for this event's subevent values:

    event 50's a transaction event.

    ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/udb9/html/4e175aa3-4f3d-4b23-a423-4a7a1bd4e84e.htm in BoL for 2k5.

    Type of event subclass.

    0=Begin

    1=Commit

    2=Rollback

    3=Savepoint

    So, 91 blocked by 94, waiting on an Intent Exclusive lock on the table that 94's blocking.

    94 along the way here is triggering a few events. It opens and rollbacks transaction 509630964 for no good reason: eventsequence 7003169999 and 7003170000, with nothing happening between them? (The end times are exact too: 2010-09-16 13:09:35.273, we're not missing an event of import there.)

    Then it gets its dynamic SQL, does a bunch of one-off commitals, with no obvious declaration of it, implicit transactions then. I'd LOVE to see that ad-hoc in full, if this isn't the entire statement (I'm going by timestamps here.) This shouldn't transaction (the two open/rollbacks that wrap this process) unless the front end code is sending down transaction requests independent of the ad-hocs.

    select Rn_Appointments.Appt_Date, Employee.Last_Name, Employee.First_Name, Rn_Appointments.Start_Time, Rn_Appointments.End_Time, Rn_Appointments.Appt_Description, Rn_Appointments.Access_Type, Rn_Appointments.Appt_Priority, Rn_Appointments.Rn_Appointments_Id, Rn_Appointments.Contact, Rn_Appointments.Opportunity, Rn_Appointments.Company, Rn_Appointments.Lead_Id, Rn_Appointments.Marketing_Project, Rn_Appointments.Notes, Rn_Appointments.Start_Time, Rn_Appointments.Appt_Date, Rn_Appointments.Appt_Date from Rn_Appointments Rn_Appointments join Employee Employee on Rn_Appointments.Rn_Employee_Id = Employee.Employee_Id order by Employee.Last_Name, Rn_Appointments.Appt_Date, Rn_Appointments.Start_Time, Rn_Appointments.End_Time

    Then again, it opens, and rollbacks, a random transaction at 2010-09-16 13:14:03.170, transactionID: 509682352.

    Well, it's blocking on Rn_Appointments, because of that ReadCommitted. I have no idea from this why it's open/rollingback transactions for no good reason, but without access to the source code and the objects they used, you may never know. I know there are coding objects that some people have used that always open a transaction around sql sends to the server (mostly cause they hosed up one of my systems), but not enough about it to determine the problem there.

    I don't think we've got all the pieces yet, but I'm content to be overruled if someone sees something I don't.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Craig

    I've updated on the post with the full sql statement for spid 94.

    The detailed trace file has everything the spid 94 did from the full minute before the block to the full minute after the block.

    Thank you for you insight and analysis.

    David

  • David O (9/20/2010)


    Craig

    I've updated on the post with the full sql statement for spid 94.

    The detailed trace file has everything the spid 94 did from the full minute before the block to the full minute after the block.

    Thank you for you insight and analysis.

    David

    Yeah, that's pretty much the select from the XML you attached (which, btw, helped a lot). I'm getting more from the XML then I am from this log. It's... odd.

    Can you drop this statement into a query window and snag the execution plan? Even though it's a two table join, I'm wondering if it's closing off transactions on index reads, re-ordering, etc. It's interesting though.

    Either way, 94's not being a major problem, directly. Even the random transaction opening/closings aren't affecting anything on the fly, they just exist... for no reason. Your biggest issue is this sql statement is a 5 minute query.

    5 minutes. That's either a LOT of dang data, or it was waiting on something else. At a guess, 94 had its own blockage. The reason I say this is your dynamic SQL came down at 13:09:35.273. It doesn't close any of the resultant transactions until 13:14:01.200. So, either it was sitting there waiting on a lock, or it was chugging along.

    Is it possible to isolate this database from external connections in a test database on the same equipment (ie: backup/restore) and see if this query actually takes 5 minutes to run? I'm starting to think that while 94 was in the chain, it wasn't the starting cause, in spite of the 'blocked by' being 0. That's going to depend on being able to test the query's throughput in an unloaded database, at least lock-wise, though.

    Edit: Alternate option if you can't run this in a new db... use the (NOLOCK) hint on the sql call for both tables, at the very least the RN_Appointments, to avoid locking issues.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Craig Farrell (9/20/2010)


    Edit: Alternate option if you can't run this in a new db... use the (NOLOCK) hint on the sql call for both tables, at the very least the RN_Appointments, to avoid locking issues.

    Just first check that the side effects of nolock (or as Kalen suggested it should be called 'WITH (AllBetsAreOff)') are acceptable.

    See - http://sqlblog.com/blogs/andrew_kelly/archive/2009/04/10/how-dirty-are-your-reads.aspx

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (9/21/2010)


    Craig Farrell (9/20/2010)


    Edit: Alternate option if you can't run this in a new db... use the (NOLOCK) hint on the sql call for both tables, at the very least the RN_Appointments, to avoid locking issues.

    Just first check that the side effects of nolock (or as Kalen suggested it should be called 'WITH (AllBetsAreOff)') are acceptable.

    See - http://sqlblog.com/blogs/andrew_kelly/archive/2009/04/10/how-dirty-are-your-reads.aspx

    Gail is correct, don't do this as a permanent solution without a long hard look. I just want to see if the query takes an age to run on its own when there's little to no locking contention.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Sorry for not updating the thread sooner. New employee orientation, fun, fun, fun.

    The spid 94 query takes almost 4.5 minutes to run on it by itself and pulls back about 800K records. The vendor doesn't seem to see the value of WHERE clauses.

    The random opening and rolling back of trans you point out is extremely frequent and odd. I ran the trace for about 2 hours and that pattern occurs thousands of times. Can't seem to find a pattern as yet.

    I'm going to run that two statements in test and trace for the locks this time. I would expect the Select to take out a shared on the pages as they are read and update take out an exclusive on the page(s) it wants; once those page(s) have been read I would think the update to get the exclusive lock it needs. I'll probably post the lock data later this noon.

    Again, thanks for the outside analysis and opinion.

    David

  • David O (9/21/2010)


    The spid 94 query takes almost 4.5 minutes to run on it by itself and pulls back about 800K records. The vendor doesn't seem to see the value of WHERE clauses.

    My pleasure. Hadn't had to poke in one of these in a while and the refresher was good. I was afraid of that, though, with your query. This may just be the cost of doing business, if the app's that poorly written, if you're not able to actually alter the vendor code, since I agree with the rest of your evaluation. 🙂

    Look forward to your next puzzle. If you find out what's causing the crazy transaction fluttering, though, do please post back. I'm quite curious.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Ran the queries today in test and got some interesting results. Running them unmodified, I can reproduce the original block. The block didn't resolve until the SELECT ended. I traced the trans, batch, and locks. The locks were very extensive on this pass and I couldn't find anything helpful.

    My second test was running the SELECT and UPDATE but taking out the readcommitted hint. I started the update about 30 seconds after the SELECT and got the block. The intersting thing is the update's duration more then 30 seconds shorter than SELECT's duration. So at some point the block resolved letting the update complete before SELECT ended.

    I then traced the update by itself without running the SELECT. The first run without the hint, I get a very clean locking and tran pattern. Re-run the update with the hint and the locking pattern explodes. The readcommitted hint isn't acting as I would expect based on BOL so that confuses me.

    Regardless it is pretty clear to me that their over use of the hint, (I see it in almost all their queries) is the source of my blocking pains. To be fair, queries without WHERE clauses chip in too.:-)

    I've attached the trace data files (txt extension, csv format) from the last two runs. I'd appreciate any feedback or analysis anyone would like to add.

    David

  • I would go with an indexing problem...

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply