January 7, 2015 at 3:44 am
Please forgive the vague nature of this post, but I've been up all night dealing with issues, so I'm likely to leave out some details.
Basically, our application has a procedure that runs every night around 9:15 PM and takes about 4 minutes. It updates 35-40K records in a key table of our application (one procedure call per record). This procedure has run like clockwork for nearly 6 years, every night, with no code changes in that time.
Yesterday our battery backup on our SAN failed during a power cutover, so we lost the storage (which triggered a reboot of all our SQL Servers in the AG). After power was restored all servers came up fine and operations were normal throughout the rest of yesterday. Last night this job ran for around 30 minutes and, most importantly, acquired an exclusive lock on the key table I mentioned above, which it appears not to have released until all 35-40K calls were completed.
Our other DBA and I are racking our brains as to what would cause a procedure to acquire an exclusive lock and run that long when it never had before. We pulled the plan and it matched our other environments, so something particular pushed this procedure over the edge. We're just not sure what it is.
Can anyone help? I can provide whatever details you feel I left out, but I wanted to get this post out while the incident was fresh in my mind. Thanks!
Matt
January 7, 2015 at 4:07 am
Stale statistics
Plan regression due to data volume changes
Lack of lock memory, resulting in a lock escalation
etc
Could be a number of things, hard to say without any details (code, plan, etc)
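If stale statistics are a suspect, something like this would show when the stats were last updated and how many modifications have accumulated since (dbo.KeyTable is a placeholder for your table; sys.dm_db_stats_properties needs 2008 R2 SP2 / 2012 SP1 or later):

```sql
-- When were stats on the table last updated, and how much has changed since?
SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.KeyTable');
```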
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 7, 2015 at 4:11 am
GilaMonster (1/7/2015)
Stale statistics
Plan regression due to data volume changes
Lack of lock memory, resulting in a lock escalation
etc
Could be a number of things, hard to say without any details (code, plan, etc)
Thanks for the quick reply!
The statistics were updated directly before the run and the data volume is largely unchanged (and the plan is identical to the one from before the restart). Can you elaborate on the "lack of lock memory" theory? What would be an easy way to check that? Memory stats on the server looked healthy, but we did not dig into them in any way.
January 7, 2015 at 4:48 am
After the fact, you wouldn't. 🙁
In short, if a single session is holding a lot of locks on a single object (>4000) or lock memory is taking up a large portion of the available memory (and I forget what 'large' is), then row or page locks get escalated to table locks.
I think you're just going to have to dig in and monitor. Waits, locks, latches, that kind of thing. I'd start with what waits the query is incurring, and compare the actual execution plans carefully. Not just what operators are there, but row counts, number of executions, parallelism, etc.
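If you want to catch escalations as they happen rather than after the fact, one option is an Extended Events session on the lock_escalation event (available in 2008 and later; the session name and file target here are just examples):

```sql
-- Capture lock escalations with the statement and session that caused them.
CREATE EVENT SESSION [TrackLockEscalation] ON SERVER
ADD EVENT sqlserver.lock_escalation
(
    ACTION (sqlserver.sql_text, sqlserver.session_id)
)
ADD TARGET package0.event_file (SET filename = N'lock_escalation.xel');

ALTER EVENT SESSION [TrackLockEscalation] ON SERVER STATE = START;
```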
Gail Shaw
January 7, 2015 at 6:21 am
GilaMonster (1/7/2015)
After the fact, you wouldn't. 🙁
In short, if a single session is holding a lot of locks on a single object (>4000) or lock memory is taking up a large portion of the available memory (and I forget what 'large' is), then row or page locks get escalated to table locks.
I think you're just going to have to dig in and monitor. Waits, locks, latches, that kind of thing. I'd start with what waits the query is incurring, and compare the actual execution plans carefully. Not just what operators are there, but row counts, number of executions, parallelism, etc.
There were indeed more than 4000 locks based on my analysis, so I wonder if another "batch job" procedure was running at the same time and provided the memory pressure needed to get this escalated to a table lock. I'm digging through our monitoring logs but I've asked our app team to look into what application jobs were running simultaneously that do not usually run at this time. Many, many thanks for the quick replies on this.
Matt
January 7, 2015 at 6:27 am
Not 4000 locks total. 4000 locks on the same object from the same session. If that did happen, then that would have triggered a lock escalation to table.
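While the job is running, you can see the per-session, per-object lock counts directly (a single session crossing the escalation threshold on one object is the trigger described above; note the documented threshold is approximately 5,000 locks per statement):

```sql
-- Live count of row/page locks held, per session and per object.
SELECT request_session_id,
       resource_database_id,
       resource_associated_entity_id,
       COUNT(*) AS lock_count
FROM sys.dm_tran_locks
WHERE resource_type IN (N'KEY', N'PAGE', N'RID')
GROUP BY request_session_id, resource_database_id, resource_associated_entity_id
ORDER BY lock_count DESC;
```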
Gail Shaw