SQL Trace causes application to hang

  • Hi,

    We've recently upgraded to Windows 2008 R2, SQL Server 2008 R2 SP + CU4, multi-instance cluster. 32 cores (64 with h/t), 384GB per node. Maxdop 4 server wide, 320GB max server memory. It used to be Win2k3 R2, SQL Server 2005 SP3, 2 x single instance clusters - 8 cores, 96GB RAM, Maxdop 0 server wide.

    We have 6 databases on this server, the main one being an Enterprise Data Management Platform that munges data from lots of different sources. It's a 3rd party product that has something like 18000 encrypted stored procedures, 12000 tables, 3000 views. Good, eh ;-). There are triggers on a lot of these tables that send insert/update/deletes to a local staging database, then on to a repository database across a linked server.

    For the past few years we have been using a SQL monitoring product (name witheld to protect the innocent) against out production cluster (multi-instance) without any problems. However, on the day we upgraded, when we turned on a particular feature that monitors the types of queries being executed against the server, some data load went from completing in 15 mins, to taking 14 hrs to run. When we turned off this feature, it returned to normal. No other databases appear to be affected.

    We contacted the vendor of the monitoring software and found out in essense that this feature runs a SQL trace capturing the following events: SQL:BatchCompleted, SQL:StmtCompleted, SP:Completed for events with a run duration of greater than 5000ms. Sure enough, just running the trace against the server whilst processing data had exactly the same effect as using the query monitoring feature of the software. So, I think we say that although the monitoring software causes the problem, it does so only by leveraging a feature native to SQL Server.

    The EDM vendor was aware of the conflict between their product and the monitoring software but have been unable to explain the cause, or reproduce it in their environment. They've also tried against a version of the DB that doesn't have any encrypted procedures (I don't have that copy).

    During the slow down, I query sys.dm_exec_requests and all of the sessions associated with the hung jobs have a RUNNABLE status with a 'last_wait_type' of SOS_SCHEDULER_YIELD. There's nothing in sys.dm_os_waiting_tasks. No evident blocking. CPU's are completely idle. I tried changing MAXDOP settings, affinity, hyperthreading, etc. as well (to make it more like the old system), but no effect. As soon as I stop the trace, everything resumes at normal speeds and runs through to completion.

    These are the top 10 waitstats:

    wait_typewait_time_mssignal_wait_time_msresource_wait_time_mspercent_total_waitspercent_total_signal_waitspercent_total_resource_waits

    PREEMPTIVE_OS_WRITEFILEGATHER4729935085535490.3086820897807380.0000000000000000.308682089780738

    BACKUPTHREAD4721076246686204460.3111852748087840.0011889896950610.311096281335619

    ASYNC_IO_COMPLETION1543967532187974720.6783859142476040.0002565054816590.678366715331252

    OLEDB10197830935093043.3745814015518420.0000000000000003.374581401551842

    LCK_M_IX1921117074580010.0165539408307750.0003408822848370.016528426481413

    MSQL_DQ167309020999020.0757816594836540.0000000000000000.075781659483654

    LCK_M_IS941286913383340.0483005399289590.0000332685681090.048298049843943

    LCK_M_U867083162701950.0097622519948940.0001523603988800.009750848127286

    LCK_M_S7722316992452450.0089117616534150.0008191782205630.008850447820931

    LCK_M_X6158212072416010.0087625009052360.0005819588653440.008718942461557

    Any ideas what else I might check? Like I said, this used to work fine under SQL 2005/Win2k3, with the same version of the monitoring software.

    Thanks,

    Andy

  • looks specifically to be sp:Completed that's causing the problems. Any ideas?

  • adb2303 (11/22/2012)


    Windows 2008 R2, SQL Server 2008 R2 SP + CU4, multi-instance cluster.

    :w00t:

    adb2303 (11/22/2012)


    32 cores (64 with h/t),

    :w00t:

    adb2303 (11/22/2012)


    384GB per node.

    :w00t:

    adb2303 (11/22/2012)


    Maxdop 4 server wide

    ??

    Why dont you just pull all the CPUs back out except for the one socket in that case.

    Why have you installed 32 cores and then put a choker around SQL Servers neck?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • weeks of testing demonstrated that MAXDOP 4 was most performant for our workload

  • you never considered raising the Cost threshold for parallelism instead?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • yes. We still achieved more consistent results with setting maxdop.

  • anyone seen this before? It looks specifically to be the SP:Completed event that causes the delays. I suspect it's to do with the encrypted procedures, but 2005 worked okay.

  • if i were you Firtly i make maxdop 1 and i set automaticly set processor affinity

    and automaticly set i/o affinity

  • Hi,

    I've already tried a variety of maxdop and affinitiy options, to no avail. I can reproduce this problem in our test environments, which only have 2 vCPUs...

  • I think it is related your encreptyted procedures

  • i'm leaning towards that, too. But, what's different in 2008 R2? It worked running under 2005

  • Have you tried using Extended Events instead of an old-style trace? Does that do the same thing?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I haven't, but only because I'm trying to mimic the behaviour of this 3rd party monitoring software. It uses ye olde style trace under the hood, and I'm not going to get that functionality changed.

    I'll look into it though

  • adb2303 (11/27/2012)


    I haven't, but only because I'm trying to mimic the behaviour of this 3rd party monitoring software. It uses ye olde style trace under the hood, and I'm not going to get that functionality changed.

    I'll look into it though

    Yeah, you may not be able to use Extended Events in your production environment if a trace is what your monitoring software uses. But it'll still help eliminate a variable in the whole thing.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply