April 6, 2016 at 5:56 am
Hi great SQL minds!
I'm trying to get to the bottom of the deadlock below. Two transactions: one running under the serializable isolation level (spid 158) and the other (spid 298) under snapshot. I hear you thinking: get rid of serializable! Don't worry, I am :).
But for the sake of learning I would like to understand exactly what is going on and why this leads to a deadlock (and not just range blocking), and be able to reproduce the issue.
I'm scratching my head over the fact that the survivor process (SPID 298, process3626188) holds an X (exclusive) lock but is waiting for a RangeI-N lock, which I thought was only associated with a) INSERTs and b) the SERIALIZABLE isolation level.
The deadlock seems to be caused by a lock on the (nonclustered) index (value 0 or 1)?
And how can the victim process (SPID 158, process55b78a188) already hold a RangeS-U lock and at the same time be waiting for a RangeS-U lock?
Additionally, neither of the two wait resources (KEY: 7:72057594214154240 (bd53abbd2a72) and KEY: 7:72057594214154240 (ea5ababe0ee0)) returns any information I can work with to find out exactly what was being locked (I'm using the usual community scripts to find the resource; they otherwise work fine, but in this case return nothing). 🙁
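For reference, the usual lookup I'm running boils down to something like this (a rough sketch using the undocumented %%lockres%% virtual column; the index hint is just an example, you would point it at whichever index owns the hobt from the wait resource):
-- Sketch of the usual community lock-resource lookup (undocumented %%lockres%%).
-- The index hint and hash value are examples; substitute the index that owns
-- hobt_id 72057594214154240 and the hash from the deadlock graph.
SELECT %%lockres%% AS lock_resource, *
FROM dbo.user_run_clubs WITH (NOLOCK, INDEX([user_run_club_state_type_offline]))
WHERE %%lockres%% = '(bd53abbd2a72)';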
If anyone could provide additional insight on what is happening exactly so I fully understand (and can reproduce)?
Edit: Added index creation statements per suggestion of TheSQLGuru.
Clustered Index part of table creation:
CONSTRAINT [PK_user_run_clubs_id] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
Other Indexes:
CREATE NONCLUSTERED INDEX [server_index] ON [dbo].[user_run_clubs]
(
[server_index] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [user_run_club_state_type_offline] ON [dbo].[user_run_clubs]
(
[user_run_club_state_type_offline] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
GO
Much appreciated SQL Community! 🙂
April 6, 2016 at 6:35 am
I am sorry that I don't have time to really dig into this, but it seems a contributing factor here may be the inclusion of the updated fields in the index definitions. You would be moving a row to a different location in the index structure, which collides with the other query's "position". You also REALLY need to provide the index create statements here, including the table's clustered index definition if one exists.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
April 6, 2016 at 7:54 am
Not a problem TSG! I appreciate your feedback and have added the clustered and 2 other index create statements!
You'll get to it when you get to it, if ever 🙂
I'm following your train of thought:
Survivor - 298 - has an X (exclusive) lock on the record to update by way of the clustered index (id). 298 needs a RangeI-N lock granted in order to INSERT the values of the (now) updated fields (server_index and user_run_club_state_type_offline) into their new positions in the indexes.
Victim - 158 - has a RangeS-U lock, not only on the rows identified where server_index is between 0 and 104, but also completely covering the 'user_run_club_state_type_offline' and 'server_index' indexes.
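Purely to make my mental model concrete (these are made-up statements and values, not the real application code), I picture the two statements roughly like this:
-- Hypothetical sketch only; the column values and the key are invented.

-- Spid 298 (snapshot): update that changes the nonclustered key columns, so the
-- row has to be deleted from its old slot and inserted at a new position in
-- [server_index] and [user_run_club_state_type_offline].
UPDATE dbo.user_run_clubs
SET server_index = 57,                      -- made-up value
    user_run_club_state_type_offline = 1    -- made-up value
WHERE id = 12345;                           -- made-up key

-- Spid 158 (serializable): statement whose range predicate takes RangeS-U locks
-- over server_index 0 to 104.
UPDATE dbo.user_run_clubs
SET user_run_club_state_type_offline = 0    -- made-up value
WHERE server_index BETWEEN 0 AND 104;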
But wouldn't these two be blocking each other rather than deadlocking (the direction of the blocking depending on which transaction ran first)?
April 6, 2016 at 8:54 am
Huh - I thought, based on those crazy-long names, that these were multi-column indexes. 😀
Look at the lock requests in your diagram again. This really does appear to be the "simple", classic deadlock scenario: opposing granted and requested incompatible locks.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
April 6, 2016 at 9:12 am
TheSQLGuru (4/6/2016)
This really does appear to be the "simple", classic deadlock scenario: opposing granted and requested incompatible locks.
Ok, I'm glad that at least so far we're on the same page! How did those ever get granted? :w00t:
April 6, 2016 at 9:14 am
Most likely the RangeS-U locks are on different ranges, with one coming from a previous statement in the transaction.
The fact that the non-Serializable transaction is trying to take a RangeI-N lock is normal (although not particularly well documented, that I can see).
The most likely scenario, then, is this:
1) Spid 158 (Serializable) runs a query that locks one range with a RangeS-U lock, but the range is empty.
2) Spid 298 tries to run an update that would move a row into the locked range. That row is currently outside the range locked by 158, so spid 298 successfully grabs an X lock on that row.
3) However, it cannot complete the update and move the row into the locked range; its attempt to take a RangeI-N lock is blocked by the first RangeS-U lock.
4) Spid 158 runs a second query, this time one that tries to take a RangeS-U lock on a range that includes the row exclusively locked by spid 298, and is thus blocked.
5) Now 298 is blocked by 158 on the first range, and 158 is blocked by 298 on the second range.
A deadlock of this general form is not so hard to reproduce.
In session A, run the following:
CREATE TABLE test_range (ID int);
CREATE CLUSTERED INDEX CI_ID ON test_range(ID);
INSERT INTO test_range VALUES
(6),(7),(8),(9),(10);
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION
UPDATE test_range
SET ID=ID+1
WHERE ID BETWEEN 3 AND 5; --Range is empty
Now, in another session, session B, run this:
SET TRANSACTION ISOLATION LEVEL READ COMMITTED; --Just to make sure we're at a lower isolation level
UPDATE test_range
SET ID=4
WHERE ID=8; --Update tries to move a new row into the range locked by the serializable transaction
Session B will be blocked by session A.
Run this in a third session to see that session B, despite being in READ COMMITTED, is waiting on a RangeI-N lock. It also got an X lock on the row with the key value 8.
SELECT *
FROM sys.dm_tran_locks
WHERE request_session_id = 54; --Substitute the actual spid for your session B, of course.
Now go back to session A, and run this query:
UPDATE test_range
SET ID=ID+1
WHERE ID BETWEEN 7 AND 9; --This will conflict with the X lock granted to session B
That will cause session A to try to take out a new RangeS-U lock on a range including 8, which is exclusively locked by Session B.
This will result in a deadlock of the sort described above.
Your situation may turn out to be different, but this is the most likely scenario I can come up with.
I'd check the code for places where Serializable isolation level is used, and see if they are running multiple queries that could lock different ranges in a single transaction.
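As a quick aside, you can also spot sessions currently set to Serializable on the live system with a simple DMV check, something like:
-- Sessions whose current isolation level is Serializable.
-- transaction_isolation_level: 2 = Read Committed, 3 = Repeatable Read,
-- 4 = Serializable, 5 = Snapshot.
SELECT session_id, login_name, host_name, program_name
FROM sys.dm_exec_sessions
WHERE transaction_isolation_level = 4;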
Cheers!
EDIT: Some slight rewording of sentences I wasn't all that thrilled with.
April 6, 2016 at 10:18 am
Jacob, thanks for taking the time to both clearly explain and give an example. I must say everything you say makes sense and would explain where the RangeI-N comes from. Thanks again!
I'm still trying to replicate the deadlock in my test environment. With just the queries provided in the deadlock graph, I haven't yet found a way to do this consistently. But I'm not done testing yet.
Of course, at the time of the deadlock there was plenty going on on the server, and (as I can see from SQL Diagnostic Manager) a handful of other SPIDs were blocked, cascading into further blocking at that moment.
So could a query not included in the deadlock report be (partly) responsible? Or should I - by design - be able to reproduce everything with just the queries provided in the report?
Thanks again for your time!
April 6, 2016 at 10:49 am
977stijn (4/6/2016)
I'm still trying to replicate the deadlock in my test environment. With just the queries provided from the deadlock graph I haven't yet found a way to consistently do this. But i'm not done testing yet.
I have to say you're the first person I've ever heard say "But I want the system to deadlock!" @=)
It would surprise me if someone hadn't figured out how to cause a deadlock. I'll ask around and see if anyone can help you replicate that.
April 6, 2016 at 11:10 am
The deadlock would be reproducible if you had all the queries run by the sessions involved in the deadlock.
The trick is that the execution stack and input buffer portions of the graph will not show all previous queries run by the session within that transaction.
In my demo above, for example, if you capture the deadlock xml, the first UPDATE (the one that takes the RangeS-U lock for the 3-5 range) is not shown anywhere. Of course, it's central to the deadlock, so you wouldn't be able to reproduce the deadlock with just the two queries shown.
So, while you should have all the sessions involved, you probably don't have all the queries necessary to reproduce it.
You'd need to either dig through the code to find the rest of the offending serializable code, or set up a process to capture running queries (could be XEvents, or a job that just pulls from DMVs every couple seconds) and try to track down the previous queries from that session by correlating the job's results with the times the deadlocks occur.
Of course, the former approach might require a lot of time depending on how big your codebase is, and the latter will have some performance impact since we don't have much information to use for a filter, so nothing is for free, unfortunately.
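For the DMV route, the polling query itself can be as simple as something like this (a minimal sketch; an Agent job would insert the results into a logging table every few seconds):
-- Minimal snapshot of currently executing requests and their statement text.
SELECT GETDATE() AS capture_time,
       r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_resource,
       t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID;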
Cheers!
April 6, 2016 at 11:32 am
977stijn (4/6/2016)
Jacob, thanks for taking the time to both clearly explain and give an example. I must say everything you say makes sense and would explain where the RangeI-N comes from. Thanks again!
I'm still trying to replicate the deadlock in my test environment. With just the queries provided from the deadlock graph I haven't yet found a way to consistently do this. But I'm not done testing yet.
Of course at the time of the deadlock there was plenty going on on the server and (as i can see from SQL Diagnostic Manager) a handful of other SPids were blocked, cascading down to further blockages for that moment.
So can it be a query not included in the deadlock report is (part) responsible? Or should I - by design - be able to reproduce everything with just the queries provided in the report?
Thanks again for your time!
Deadlocks, at their root, are a performance problem. After all, if the locks all cleared really fast, you might never see a deadlock. So, could another query be the root cause of the deadlock? Absolutely. It could be holding a lock on a resource completely unrelated to the locks involved in the deadlock, but still cause one of the processes in the deadlock to wait on a resource. That slowdown causes the deadly embrace to occur, and one of the processes is chosen as a victim and rolled back. Meanwhile, that other, completely unrelated process could still be holding locks, slowing things down.
Part of tuning deadlocks isn't just determining where the deadly embrace came from and working to eliminate that; it's also identifying how and where the process slowed down enough for the embrace to become an issue.
Depending on the procedures involved, you may not reproduce this in an environment that is not under load.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
April 6, 2016 at 11:36 am
Brandie Tarvin (4/6/2016)
977stijn (4/6/2016)
I'm still trying to replicate the deadlock in my test environment. With just the queries provided from the deadlock graph I haven't yet found a way to consistently do this. But I'm not done testing yet.
I have to say you're the first person I've ever heard say "But I want the system to deadlock!" @=)
It would surprise me if someone hadn't figured out how to cause a deadlock. I'll ask around and see if anyone can help you replicate that.
If you really want a deadlock, here are a couple of procs that I guarantee will deadlock within AdventureWorks2014.
CREATE PROC dbo.EmployeeUpdate
AS
UPDATE HumanResources.JobCandidate
SET BusinessEntityID = 42
WHERE JobCandidateID = 11;
WAITFOR DELAY '0:0:10';
UPDATE HumanResources.Employee
SET NationalIDNumber = 42
WHERE BusinessEntityID = 42;
GO
CREATE PROC dbo.JobCandidateUpdate
AS
UPDATE HumanResources.Employee
SET NationalIDNumber = 42
WHERE BusinessEntityID = 42;
UPDATE HumanResources.JobCandidate
SET BusinessEntityID = 42
WHERE JobCandidateID = 11;
GO
Run EmployeeUpdate in one SSMS query window, and then, within 10 seconds, run JobCandidateUpdate. Deadlocks every time.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
April 6, 2016 at 11:53 am
Hi all, already thanks a lot for your insights and feedback! 🙂
I managed to reproduce the deadlock in our test-environment (containing 100+ VMs running the server application(s) and 10000s of bot-clients pretending to be users running all sorts of random scenarios), so I do have the 'system under load' covered that way.
The deadlock occurs on an edge case if one of the servers involved in the environment goes down (crashes). I basically just pull the plug.
The system is designed to cope with servers going down. It doesn't really impact anything, as the transaction(s) that deadlock will be retried and the next time they go through without deadlocking, so yes, it is a timing/short-lived problem. But it annoys me greatly that I can't seem to reproduce the steps in a couple of SSMS windows (like I usually can; I have no problem getting to a state of deadlock in general, just not under this particular scenario with the transactions identified + the mystery ones?? 🙂 ). So yes, I'm definitely missing something.
Filtering the queries is nearly impossible as there is so much going on. I guess I could start filtering on SPID or something, but I'm afraid turning on query logging would cripple the system (I've experienced that in the past).
The good thing is, this afternoon when I reproduced it, SQL Diagnostic Manager didn't even show any blocking going on at that exact moment, just the deadlocks (3 reports generated, to be exact). So I'm hoping there was less noise, and tomorrow with a clear head I'll be able to get to the bottom of this.
From these new deadlock reports I still can't get the exact resource, but at least the hobt_id returns a value this time. Though, not surprisingly, it is what we already know from the graph/report ...
SELECT o.name, i.name
FROM sys.partitions p
JOIN sys.objects o ON p.object_id = o.object_id
JOIN sys.indexes i ON p.object_id = i.object_id
AND p.index_id = i.index_id
WHERE p.hobt_id = 72057594223656960
user_run_clubs
user_run_club_state_type_offline
April 7, 2016 at 6:57 am
Since you're on SQL Server 2012, have you tried using extended events to capture query metrics? They're not free (nothing is, TANSTAAFL always applies), but they are much lower cost than trace events and, because they have much better filtering mechanisms, you can target specific databases and queries much more easily, further reducing their overhead on the system. It might help.
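If you want a rough starting point, a session along these lines should do it (the session name, database name, and file path are placeholders to adjust for your system):
-- Sketch of an Extended Events session filtered to one database.
CREATE EVENT SESSION CaptureQueries ON SERVER
ADD EVENT sqlserver.sql_batch_completed (
    ACTION (sqlserver.sql_text, sqlserver.session_id)
    WHERE sqlserver.database_name = N'YourDatabase'),
ADD EVENT sqlserver.rpc_completed (
    ACTION (sqlserver.sql_text, sqlserver.session_id)
    WHERE sqlserver.database_name = N'YourDatabase')
ADD TARGET package0.event_file (SET filename = N'C:\XEvents\CaptureQueries.xel')
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);
GO
ALTER EVENT SESSION CaptureQueries ON SERVER STATE = START;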
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
April 7, 2016 at 7:59 am
Grant Fritchey (4/7/2016)
Since you're on SQL Server 2012, have you tried using extended events to capture query metrics? They're not free (nothing is, TANSTAAFL always applies), but they are much lower cost than trace events and, because they have much better filtering mechanisms, you can target specific databases and queries much more easily, further reducing their overhead on the system. It might help.
Article or article link? @=)
April 7, 2016 at 9:24 am
Brandie Tarvin (4/7/2016)
Grant Fritchey (4/7/2016)
Since you're on SQL Server 2012, have you tried using extended events to capture query metrics? They're not free (nothing is, TANSTAAFL always applies), but they are much lower cost than trace events and, because they have much better filtering mechanisms, you can target specific databases and queries much more easily, further reducing their overhead on the system. It might help.
Article or article link? @=)
The lower overhead I can agree on, although honestly I don't think I have ever seen a definitive benchmark done that truly documents the overhead of each.
Filtering I will have to disagree on, at least for the given examples. Both database_id and query text are easy filters in a trace.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service