Very interested to know what can be done in the case of a long running select query

  • Hi everyone. I'd love to get some help understanding why SQL Server 2008R2 x64 Enterprise Edition does what it does.

    The SQL Server is dedicated to an application that as I understand it processes workloads which take 5-6 days to complete and then begins processing the next workload.

    Throughout the week, there are constant inserts/updates/deletes that take place in different SPIDs. During the weekly workload there is a single select query that stays active for almost the entire time. This query does nothing that I can interperate as database activity but holds several locks which prevent me from doing maintenance etc on the SQL Server.

    The SQL Server has 48 cores and 128GB RAM. We have MDOP set on this instance to 32. Tried MDOP settings of 0, 1, 6, 8, 16, and 32 but have not determined what setting would work best. If I recall, there were a lot of CXPACKET waits when MDOP was set to 0. Parallism threshhold has always stayed at the default value of 5.

    Subtree cost in the execution plan for this log running select query started at 9000+ but with the addition of a few indexes and even an indexed view, we have reduced the cost to ~90. This seems like a big win but the query still stays open for days with no apparent database activity. The common wait type for the query is ASYNC_NETWORK_IO which lasts for about 2 second on average so I believe that the bottleneck may be the application slowly taking in the data collected innitialy by the long running select query. Does that make sense?

    I wonder if there is anything we can do on SQL Server's side to improve things. My thought is that since SQL Server is parallelizing the query to 32 threads and then just sitting there for days, it is really not taking advantage of the parallel threads but instead blocking other operations from using them. Could a plan guide that forces MDOP of 1 for this specific query free up the other threads?

    Mainly, is there anything I can do to prevent the query from staying open for days?

    Please see the below output collected by Spotlight on SQL Server for the long running spid. Is is only a moment in time but it is pretty typical of what I see all week. Wish I knew how to format it better so the header rows lined up with the output.

    SPIDStatusBlocked ByLast CommandMemoryCPU ms/sI/O per secLogical ReadsReadsWritesTotal CPUTotal I/OTotal Logical ReadsTotal ReadsTotal WritesCurrent Wait Time (ms)Last Wait TypeLast Wait ResourceLast Batch TimeTime Since Last BatchTran CountHost ProcessNet LibraryLogin Time

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    93running0SELECT200000247,635137,4351,265,971137,43411,230ASYNC_NETWORK_IO4/18/2013 17:001d 21:47:5103672TCP4/18/2013 16:59

    Thanks for reading and any insight.

    Howard

  • I guess Hyper Treading is enabled on your server. Microsoft recommends to set maxdop=number of physical processors when HT is enabled. However they also recommend to monitor and test it properly to identify what maxdop value suits to your environment. details http://support.microsoft.com/kb/322385

  • Mainly I am interested in understanding why a select query would stay open and running for almost an entire week. CPU use rarely if ever goes above 5 to 10%. The SQL Server is definitely busy doing something but I don't know what.

    The select query joins tables that contain hundreds of millions of rows. Adding non-clustered indexes have made the query just about as optimal as I can get it.

    The whole thing wouldn't bother me that much other than extreme curiosity but for the fact that index maintenance always causes blocking due to locks that are held all week. I spend a good deal of time trying to squeeze performance out of this database server because the application team is very frustrated with the overall performance of the application and I want to feel certain that there is not an optimization on the database that I am missing.

  • Would need to look at the exec plan to say anything useful. Could be a bad execution plan for one reason or another and the query is just running and running.

    Async network IO is usually an application problem, maybe take this up with whoever developed the system that is running the query.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks much. What could I provide that would be useful? The XML of the execution plan? Product has been discontinued and the vendor is no longer actively supporting it as they are encouraging us to start over again with their new archival product.

  • With no specifics we're all shooting in the dark here, but if it stays 'processing' for days and you have no obvious way to tell what it's doing, it may be worth checking for blocking as well.

    Set the blocking process report with sp_configure. A value of 30 should be fine (checking for blocking every 30 secs). Then set a server-side trace to capture the blocking events. If after a few days there is nothing in the trace you will have eliminated blocking as a culprit. If you do get fish in the net, you'll know what holds you back.;-)

    Cheers,

    JohnA

    MCM: SQL2008

  • Thanks for responding John. There is no blocking going on. If I try to run index maintenance on an index that belongs to a table in the long-running select query, it will immediately cause blocking of my maintenance spid and all other processes that involve that table. Once I kill the spid of my index maintenance, all blocking clears up immediately.

    Really, I don't understand why a select query would stay in a running state for several days.

    I don't get anything useful from this SQL Profiler as the statement start was generated when the query began and the statement ending doesn't occur until several days later.

  • Also worth taking snapshots of waiting tasks during these runs. It could bring up resource names that force everything else to wait for.

    If there are many active spids but just one is under scrutiny - Paul Randal had a script to use XEvents and hone into a spid for a few seconds and see all waits incurred during those seconds. It's easy to capture 10s of thousands of events per second, but it is useful for a 'slice' of the query's activity.

    Cheers,

    JohnA

    MCM: SQL2008

  • PHXHoward (4/20/2013)


    Thanks much. What could I provide that would be useful? The XML of the execution plan?

    Table definitions, index definitions and execution plan, as per http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Table variable? User Defined Function? Up-to-date stats? CTE(s)? Improper datatypes? Incomplete join or valid join that has Cartesian? UDFs and Cartesians are my best guess. Will need to see tables, query statement and at least estimated plan though.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply