Any info on Deadlock detection algorithms?

  • Hi Gurus,

    I'm trying to find a starting point from which to troubleshoot an issue we are facing where the same operations in one environment cause a deadlock, whereas in a different environment the operations just block indefinitely. One process is a query involving many tables, one of them being "TableA" we'll say. The second process is just one that truncates "TableA".

    Isolation Level of the databases are identical. The 2 environments are configured identically, same hardware, OS, settings etc.

    In one environment, we kick off the 2 processes which will contend with eachother and they start blocking, but SQL Server will shortly declare a deadlock and kill one of them

    In the second environment, we kick of the same 2 processes and they just stay blocking forever.

    Does anyone have a good technical overview of how SQL Server 2005 decides that 2 processes are in a deadlock? I know the "general" scenario of a deadlock:

    ProcessA has a lock on X and wants Y

    ProcessB has a lock on Y and wants X

    But in this case, It doesn't fit nicely into this canned scenario. One of the processes is just trying to truncating a table that the other process is doing a select on. In my simple mind I don't see this as a scenario that should cause a deadlock. The truncate should just keep waiting until the select is done.

    Thanks

  • If it is just a single truncate and a single select (neither in a transaction), that shouldn't deadlock. For a deadlock, the SQL deadlock detector needs to be able to walk the lock request chains and find a loop.

    Can you post the deadlock graph (from traceflag 1222)?

    Are the two environments the same w.r.t. database settings?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks!!!

    Here's the trace. Check out the last deadlock between SPID 113 and 20

  • This one?

    <deadlock victim="process59ee718">

    <process-list>

    <process id="process59ee718" taskpriority="0" logused="0" waitresource="OBJECT: 11:1529108538:3 " waittime="4109" ownerId="1805859773" transactionname="TRUNCATE TABLE" lasttranstarted="2012-07-30T07:00:56.693" XDES="0x18752eef0" lockMode="Sch-M" schedulerid="12" kpid="9428" status="suspended" spid="113" sbid="0" ecid="0" priority="0" transcount="1" lastbatchstarted="2012-07-30T07:00:56.693" lastbatchcompleted="2012-07-30T07:00:27.413" clientapp="Microsoft JDBC Driver for SQL Server" hostname="sworksetlap1" hostpid="0" loginname="SIEBEL" isolationlevel="read committed (2)" xactid="1805859773" currentdb="11" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">

    <executionStack>

    <frame procname="adhoc" line="1" sqlhandle="0x01000b001cd16b34d07bedd00c0000000000000000000000">

    TRUNCATE TABLE WC_EPCR_SALESPERMISSION_D </frame>

    </executionStack>

    <inputbuf>

    TRUNCATE TABLE WC_EPCR_SALESPERMISSION_D </inputbuf>

    </process>

    <process id="processb0276d8" taskpriority="0" logused="0" waitresource="OBJECT: 11:1529108538:2 " waittime="1156" ownerId="1805864888" transactionname="Lookup" lasttranstarted="2012-07-30T07:00:59.803" XDES="0xb4db192f8" lockMode="Sch-S" schedulerid="4" kpid="11692" status="suspended" spid="80" sbid="0" ecid="0" priority="0" transcount="0" lastbatchstarted="2012-07-30T07:00:54.417" lastbatchcompleted="2012-07-30T07:00:37.770" clientapp="Oracle Business Intelligence" hostname="SWORKSOBIAP2" hostpid="1848" loginname="SIEBEL" isolationlevel="read uncommitted (1)" xactid="1805854139" currentdb="11" lockTimeout="4294967295" clientoption1="687865888" clientoption2="128056">

    <executionStack>

    <frame procname="adhoc" line="1" sqlhandle="0x020000006a1a2a045235b1e2b649205d19221596a59e7a9f">

    select max(D4.c2) as c4,

    max(D4.c1) as c5

    from

    (select distinct T307241."PRODUCTROLLUPFS" as c2

    from

    "WC_EPCR_ARREGION_TERR_D" T307909,

    "W_DAY_D" T70600 /* Dim_W_DAY_D_Order_Date */ ,

    "WC_EPCR_SALESPERMISSION_D" T306520,

    "WC_EPCR_ARSALESP_D" T306527,

    "WC_EPCR_FSPRODCOMP_D" T307241,

    "WC_EPCR_BOOKING_DETAIL_F" T307839,

    "WC_EPCR_MKTROLLUP_D" T307886,

    "WC_EPCR_FSPRODCOMP_HIST_D" T311257

    where ( T70600."ROW_WID" = T307839."ORDER_DATE_WID" and T306520."SALESPERSON_CODE" = T306527."SALESPERSON_CODE" and T306527."ROW_WID" = T307839."SALES_PERSON_WID" and T307839."TERRITORY_WID" = T307909."ROW_WID" and T307241."ROW_WID" = T307839."FSPRODCOMP_WID" and T307241."ROW_WID" = T311257."FSPRODCOMP_WID" and T70600."PER_NAME_MONTH" = N'2012 / 06' a </frame>

    <frame procname="unknown" line="1" sqlhandle="0x000000000000000000000000000000000000000000000000">

    unknown </frame>

    </executionStack>

    <inputbuf>

    select max(D4.c2) as c4,

    max(D4.c1) as c5

    from

    (select distinct T307241."PRODUCTROLLUPFS" as c2

    from

    "WC_EPCR_ARREGION_TERR_D" T307909,

    "W_DAY_D" T70600 /* Dim_W_DAY_D_Order_Date */ ,

    "WC_EPCR_SALESPERMISSION_D" T306520,

    "WC_EPCR_ARSALESP_D" T306527,

    "WC_EPCR_FSPRODCOMP_D" T307241,

    "WC_EPCR_BOOKING_DETAIL_F" T307839,

    "WC_EPCR_MKTROLLUP_D" T307886,

    "WC_EPCR_FSPRODCOMP_HIST_D" T311257

    where ( T70600."ROW_WID" = T307839."ORDER_DATE_WID" and T306520."SALESPERSON_CODE" = T306527."SALESPERSON_CODE" and T306527."ROW_WID" = T307839."SALES_PERSON_WID" and T307839."TERRITORY_WID" = T307909."ROW_WID" and T307241."ROW_WID" = T307839."FSPRODCOMP_WID" and T307241."ROW_WID" = T311257."FSPRODCOMP_WID" and T70600."PER_NAME_MONTH" = N'2012 / 06' </inputbuf>

    </process>

    </process-list>

    <resource-list>

    <objectlock lockPartition="2" objid="1529108538" subresource="FULL" dbid="11" objectname="OLAP.dbo.WC_EPCR_SALESPERMISSION_D" id="lock5771500" mode="Sch-M" associatedObjectId="1529108538">

    <owner-list>

    <owner id="process59ee718" mode="Sch-M"/>

    </owner-list>

    <waiter-list>

    <waiter id="processb0276d8" mode="Sch-S" requestType="wait"/>

    </waiter-list>

    </objectlock>

    <objectlock lockPartition="3" objid="1529108538" subresource="FULL" dbid="11" objectname="OLAP.dbo.WC_EPCR_SALESPERMISSION_D" id="lock5d704ba80" mode="Sch-S" associatedObjectId="1529108538">

    <owner-list>

    <owner id="processb0276d8" mode="Sch-S"/>

    </owner-list>

    <waiter-list>

    <waiter id="process59ee718" mode="Sch-M" requestType="wait"/>

    </waiter-list>

    </objectlock>

    </resource-list>

    </deadlock>

    </deadlock-list>

    It looks like the select from the oracle app is part of a user transaction. Can you find out if that's the case? If so, what else is in the transaction?

    Do you have lock escalation configured for the partition level?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Yes, that's the one.

    The SQL is application-generated, but it's from a reporting tool that does not do any DML, only reads from star-schema data models.

    Still, I'll try and dig deeper and see if it's part of a bigger transaction.

    About lock escalation: It's SQL Server 2005, so I don't think partition level lock escalation is possible.

    I will run dbcc tracestatus (-1) on all environments to see if any lock escalation flags are set.

    Thanks so much for diving into this.

  • Yeah, you're right, partition locks were 2008. Lock escalation trace flags aren't relevant here.

    I suspect it's just got to do with the order that locks are acquired. If the select takes a schema stability lock on partition 2, then the truncate requests a schema mod on partition 2, then the truncate takes a schema mod on partition 3 and then the select requests a schema stability on partition 3, you've got a deadlock with just two statements. From the deadlock graph, that looks like what's happening.

    One thing that I would suggest - the Oracle's running read uncommitted, that's not generally a good thing. See if you can get it to default (read committed). Might help as the truncate couldn't start taking schema mod locks with shared locks in place.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks Gail,

    Appreciate all the insights provided so far.

    I'm still trying to understand what the variable could be to explain the difference in behavior in the 2 environments we have.

    In one, we can consistently reproduce the deadlock, in the other the statements just block, but never deadlock.

    Any other variables at all besides order of lock aquisition? Are there any instance or db settings that affect sensitivity of when SQL Server detects a deadlock?

  • Isolation levels, data distribution, execution plans for the select. No direct settings.

    If it's a deadlock, it will be detected as such, so the question is why is it just blocking in one environment and deadlocking in the other. Exact same schema and database settings?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • schema is exactly the same, database settings are the same definitely for IL (default is READ_COMMITTED), and we're trying to weed through all other possible settings to see anything that looks different.

    Data distribution could be the key, because this is certainly different.

    I like your earlier hint to check if we can control how our application connects to the DB - we found it does change isolation level to dirty reads when it establishes a session and that this is configurable, but again it's another thing that is the same in both environments.

    As a side question (sorry if it's a newbie one) : if the session coming from the Oracle app is set to read_uncommitted, why are we blocking the truncate table operation? In other words, why is it taking a Schema lock at all?

  • M_E_K (8/16/2012)


    As a side question (sorry if it's a newbie one) : if the session coming from the Oracle app is set to read_uncommitted, why are we blocking the truncate table operation? In other words, why is it taking a Schema lock at all?

    Short form: A Sch-S lock is required for any form of table access so the query reader can know the metadata of the expected results. What it doesn't do is S lock any of the rows, pages, or the table itself. It sounds a bit silly, but you might be curious about this QotD I did a bit back and the resultant discussion on avoiding the problems that resulted:

    http://www.sqlservercentral.com/Forums/Topic1250898-2864-1.aspx


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • A schema lock is just to prevent the table's structure changing while a query is running against it. If you thought dirty reads was a problem (and it is), imagine if a column could disappear half way through a select's execution.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Ok, could the different be number of CPUs? DOP is set the same on both environments, but one environment has 24 CPUs and the other only has 8 CPU.

    Having a read through this article below and it states that if SQL is running on a server with >= 16 CPUs then it starts to implement lock partitioning and this could cause deadlocks:

    http://aboutsqlserver.com/2012/06/12/locking-in-microsoft-sql-server-part-14-deadlocks-during-ddl-operations-alteration-partition-switch-etc/

    This could explain the difference we're seeing in our 2 environments, no?

    Here's a quote from the article:

    In nutshells when server has 16 or more CPUs (think about dual quad-core with hyperthreading enabled) SQL Server starts to partition intent (IS, IX, IU) and schema stability (SCH-S) object locks on per-cpu basis.

  • Yes, that could be the cause.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Bit of a breakthrough on this I thought I'd share....

    There is an undocumented flag, -T1229 which turns off Lock Partitioning on SQL Servers with >= 16 CPU.

    Once I set this flag, the deadlocks ceased to occur and the 2 processes just blocked as expected. Here is a nice write up the the problem:

    http://blogs.msdn.com/b/psssql/archive/2012/08/31/strange-sch-s-sch-m-deadlock-on-machines-with-16-or-more-schedulers.aspx

    Now... Does anyone see any downside to setting -T1229? I know lock partitioning is an optimization, but I would there really be a significant performance hit if we disable it? There doesn't seem to be much documentation at all on this flag.

    Thanks

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply