SQL system SPID experiencing lock timeouts

  • Hey Experts,

    I ran a SQL Trace (to monitor another problem I'm trying to solve) and set a max size of 2GB on the output file, thinking I'd never get close to that size. I checked on the trace this morning and was surprised to find it had maxed out; the trace had stopped after only 24 minutes. There were relatively few (~60,000) events of the types I wanted (SP:Completed, SP:StmtCompleted, etc.), but there were a staggering 12.9 MILLION events of type "Exception".
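
    (For anyone wanting to reproduce the breakdown: a query along these lines, run against the trace output file, groups the events by class. It's only a sketch; the file path is a placeholder for wherever the .trc file actually lives.)

        -- Summarize a trace file by event class (and error number, where present)
        SELECT te.name  AS EventClassName,
               t.Error,
               COUNT(*) AS EventCount
        FROM   fn_trace_gettable(N'C:\Traces\MyTrace.trc', DEFAULT) AS t   -- placeholder path
               JOIN sys.trace_events AS te
                 ON te.trace_event_id = t.EventClass
        GROUP BY te.name, t.Error
        ORDER BY EventCount DESC;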

    Each record has the following attributes (always the same for every event):

    EventClass: Exception

    DatabaseID: 7 (our transactional database; I'm filtering on this so this is the only DBID possible for the trace)

    Application Name:

    LoginName: sa

    SPID: 21

    StartTime: events appear to be clumped in "groups" that occur every 10-15 milliseconds; each group contains maybe 50-150 events (the number in a group is definitely not constant or consistent)

    RequestID: 0

    IsSystem: 1

    TextData: Error: 1222, Severity: 16, State: 18 (i.e., "Lock request time out period exceeded")

    Error/State/Severity columns: same as in the TextData column.

    There are 4 cores on this machine; here are the system SPIDs, broken down by the contents of the "Command" column in sys.dm_exec_requests (a query sketch follows the list):

    Resource Monitor: 1-4

    Lazy Writer: 5, 7, 8, 9 (not sure where SPID 6 went, but probably not relevant)

    Signal Handler: 10

    Lock Monitor: 11

    Task Manager: 12, 16, 18, 19, 20, 21, 22, 23, 24

    Brkr Task: 15, 36

    Brkr Event Hndlr: 35
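
    (As mentioned above, the breakdown comes from sys.dm_exec_requests; a query along these lines produces it. This is a sketch, not necessarily the exact query.)

        -- List background/system sessions and the Command each one is running
        SELECT s.session_id,
               r.command,
               r.status,
               r.wait_type
        FROM   sys.dm_exec_sessions AS s
               LEFT JOIN sys.dm_exec_requests AS r
                 ON r.session_id = s.session_id
        WHERE  s.is_user_process = 0
        ORDER BY s.session_id;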

    I know just a bit about the Resource Monitor and Lazy Writer threads from Slava Oks's blog and Kalen Delaney's SQL Server internals series. However, I don't know anything about the Task Manager threads, or what one of them would be timing out on. Obviously, Googling for "Task Manager" produces a lot of false positives, so I'm at a bit of a loss as to how to do more research.

    Has anyone seen this type of behavior before? Can anyone point me to some explanations on the Task Manager thread's purpose and types of operation?

    thanks,

    Aaron

  • Do you have any scheduled tasks running, perhaps something using SELECT INTO?

  • sturner,

    Thanks for responding.

    We could be running SELECT INTO at or around that time; that's when some of our import processing runs. Unfortunately, my trace window started after the imports began and, because the file filled up, ended before the import window finished, so I can't yet tie the start time of the exceptions to any specific statement. I've reconfigured the trace to start well before the window and use rollover files until the end of the window.
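
    For reference, the rollover setup is a server-side trace along these lines. This is just a sketch: the path, file size, and file count are placeholders, and the sp_trace_setevent / sp_trace_setfilter calls that pick the events and apply the DatabaseID filter are omitted.

        -- Server-side trace that rolls over to a new file when the current one fills up
        DECLARE @TraceID     int;
        DECLARE @MaxFileSize bigint;
        SET @MaxFileSize = 2048;   -- MB per file (placeholder)

        -- @options = 2 is TRACE_FILE_ROLLOVER; @filecount caps how many rollover files are kept
        EXEC sp_trace_create
             @traceid     = @TraceID OUTPUT,
             @options     = 2,
             @tracefile   = N'C:\Traces\ImportWindow',   -- '.trc' is appended automatically
             @maxfilesize = @MaxFileSize,
             @stoptime    = NULL,
             @filecount   = 10;

        -- ... sp_trace_setevent / sp_trace_setfilter calls go here ...

        EXEC sp_trace_setstatus @traceid = @TraceID, @status = 1;   -- start the trace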

    Before the imports kick off, the only thing that runs is a maintenance plan in the middle of the night, followed by a couple of hours of (in theory) complete inactivity before the imports.

    I'm curious to hear more about what you're thinking.

    thanks,

    Aaron M.

    SELECT INTO holds locks on various system tables until the query completes. If the query is a long, drawn-out one, it could cause other processes to time out like that. It was one of the first possibilities I thought of.

    Avoid using SELECT INTO in that database (i.e., use CREATE TABLE followed by INSERT, as sketched below), or use another database to stage your data.
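
    For illustration, the pattern looks something like this (table and column names are made up):

        -- Instead of:  SELECT col1, col2 INTO #Staging FROM dbo.SourceTable WHERE ...
        -- create the table up front, then load it, so the metadata work is separated
        -- from the (potentially long-running) data load:
        CREATE TABLE #Staging
        (
            col1 int         NOT NULL,
            col2 varchar(50) NULL
        );

        INSERT INTO #Staging (col1, col2)
        SELECT col1, col2
        FROM   dbo.SourceTable
        WHERE  col1 > 0;   -- whatever the real predicate is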

  • Hm...I don't think we'd have any SELECT INTOs that would run for 24 minutes, or enough of them to consume that whole range of time. We use that syntax mostly to load temp tables for intermediate processing, and I'm pretty sure those SELECTs run quickly.

    Thanks for the input though...

  • Did you ever get a bead on this? I'm also seeing a rash of 1322's, but only on one server...

  • No, I never did. We ended up solving the problem we were troubleshooting and I never spent much more time trying to figure out the 1222's on SPID 21.

  • Thanks for the update. Right, 1222, not 1322.

    What I found is that a job which used to run in a few seconds is now taking a few minutes. The job step that is taking longer puts a lock on a table that had just been updated in a previous step. The locking step is causing 1222's on system SPIDs. My theory is that the system SPIDs are trying to come in and auto-update statistics or something, since the table had recently changed, but the long job step is blocking them. Don't really know if that's reasonable...
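
    (While the long step is running, something like this should show whether any system SPIDs are stuck behind it; just a sketch.)

        -- Who is waiting, what they're waiting on, and which session is blocking them
        SELECT wt.session_id,
               wt.wait_type,
               wt.wait_duration_ms,
               wt.blocking_session_id,
               wt.resource_description
        FROM   sys.dm_os_waiting_tasks AS wt
        WHERE  wt.blocking_session_id IS NOT NULL
        ORDER BY wt.wait_duration_ms DESC;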

    You know, I've always wondered about the lock isolation level for the stats update process... I guess I'd always assumed that the DB engine just used dirty reads, because the stats are aggregated somewhat into the histogram anyway. But maybe you're right.

    I've captured the T-SQL that the DB engine uses to update stats with Profiler in the past, but I can't remember its exact format, so I can't show you what it looks like. You might run an unfiltered trace on a relatively quiet system, generate lots of updates on a single table, and wait for the auto-update T-SQL statements to be captured. Then you could look for those on the prod box that's having the problems...
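
    Another simple check, if the auto-update stats theory holds, is whether the statistics on that table get refreshed right around the time the blocking clears. Something along these lines would show it (the table name is a placeholder):

        -- When was each statistic on the table last updated?
        SELECT s.name                              AS StatName,
               STATS_DATE(s.object_id, s.stats_id) AS LastUpdated,
               s.auto_created
        FROM   sys.stats AS s
        WHERE  s.object_id = OBJECT_ID(N'dbo.YourTable');   -- placeholder table name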
