System-health extended-event session does not capture latest deadlocks

  • While running the following query to capture the latest deadlocks recorded in the default system-health extended-event session, I noticed that the latest deadlock captured was 3 days ago. However, in the SQL ERRORLOG I see that several deadlocks have occurred as recently as today.

    Has anyone else noticed this issue?

    --http://www.quest.com/whitepaper/how-to-use-sql-servers-extended-events-and-notifications816315.aspx

    ;WITH SystemHealth

    AS (

    SELECT

    CAST ( target_data AS xml ) AS SessionXML

    FROM

    sys.dm_xe_session_targets st

    INNER JOIN

    sys.dm_xe_sessions s

    ON

    s.[address] = st.event_session_address

    WHERE

    name = 'system_health'

    )

    SELECT

    Deadlock.value ( '@timestamp', 'datetime' ) AS DeadlockDateTime

    ,CAST ( Deadlock.value ( '(data/value)[1]', 'Nvarchar(max)' ) AS XML ) AS DeadlockGraph

    FROM

    SystemHealth s

    CROSS APPLY

    SessionXML.nodes ( '//RingBufferTarget/event' ) AS t (Deadlock)

    WHERE

    Deadlock.value ( '@name', 'nvarchar(128)' ) = N'xml_deadlock_report'

    ORDER BY

    Deadlock.value ( '@timestamp', 'datetime' );

    __________________________________________________________________________________
    SQL Server 2016 Columnstore Index Enhancements - System Views for Disk-Based Tables[/url]
    Persisting SQL Server Index-Usage Statistics with MERGE[/url]
    Turbocharge Your Database Maintenance With Service Broker: Part 2[/url]

  • I have not seen deadlock events in the system_health session that I thought should have been there. The session is setup to 'allow single event loss' so it's possible that is the reason but I never tracked it down fully.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I am seeing this same issue. I was trying to evaluate to make a switch from using Event Notification and going through Extended Events due to the "multiple-victim" scenario that seems to be happening in our environment. However, I don't get the same amount of deadlocks and the last deadlock in the extended events happened more than a day ago.

    It seems that the scenario you are seeing is the same that I am facing right now, have you ever find out what could the issue be?

    Thanks a lot!

  • SQLSalas (3/5/2013)


    I am seeing this same issue. I was trying to evaluate to make a switch from using Event Notification and going through Extended Events due to the "multiple-victim" scenario that seems to be happening in our environment. However, I don't get the same amount of deadlocks and the last deadlock in the extended events happened more than a day ago.

    It seems that the scenario you are seeing is the same that I am facing right now, have you ever find out what could the issue be?

    Thanks a lot!

    Still no luck - I'm thinking of opening a case with MS on this.

    Here is a similar query with the same issue:

    --https://www.simple-talk.com/sql/database-administration/handling-deadlocks-in-sql-server/

    select

    DeadlockGraph

    ,[DbId] = DeadlockGraph.value ( '(/deadlock/resource-list//@dbid)[1]', 'int' )

    ,[DbName] = DB_NAME ( DeadlockGraph.value ( '(/deadlock/resource-list//@dbid)[1]', 'int' ) )

    ,[LastTranStarted] = DeadlockGraph.value ( '(/deadlock/process-list/process/@lasttranstarted)[1]', 'datetime' )

    from

    (

    SELECT

    CAST ( event_data.value ( '(event/data/value)[1]', 'varchar(max)' ) AS XML ) AS DeadlockGraph

    FROM

    (

    SELECT

    XEvent.query('.') AS event_data

    FROM

    ( -- Cast the target_data to XML

    SELECT

    CAST( st.target_data AS XML ) AS TargetData

    FROM

    sys.dm_xe_session_targets st

    JOIN

    sys.dm_xe_sessions s

    ON

    s.[address] = st.event_session_address

    WHERE

    name = 'system_health'

    AND target_name = 'ring_buffer'

    ) AS Data

    CROSS APPLY-- Split out the Event Nodes

    TargetData.nodes ( 'RingBufferTarget/event[@name="xml_deadlock_report"]' ) AS XEventData ( XEvent )

    ) AS tab ( event_data )

    ) AS A

    ORDER BY [LastTranStarted];

    __________________________________________________________________________________
    SQL Server 2016 Columnstore Index Enhancements - System Views for Disk-Based Tables[/url]
    Persisting SQL Server Index-Usage Statistics with MERGE[/url]
    Turbocharge Your Database Maintenance With Service Broker: Part 2[/url]

  • I just figured out that I am having this same issue. the system_health EE session has fewer deadlock events that what is logged by trace flags 1204, 1222.

    Has anyone got more info on this?

  • mthumphrey (8/14/2013)


    I just figured out that I am having this same issue. the system_health EE session has fewer deadlock events that what is logged by trace flags 1204, 1222.

    Has anyone got more info on this?

    I've opened a case with Microsoft, and I have gotten them to admit this is an issue.

    They told me that they recommend me using the ERRORLOG info for troubleshooting deadlocks, but I am not letting them off the hook so easily. There is real value in being able to track this info programmatically and save it in a table for historic analysis, and doing that from the ERRORLOG just doesn't cut it. Besides, what is the use of the system_health session then, if it is not reliable? Might as well not have it running at all!

    So I'm trying to get them to escalate this as a bug to their dev team.

    Waiting to hear back...

    Are you also seeing progressively newer deadlocks appearing on a day-by-day basis, but always a few days behind the current date?

    It's as if the info is all in memory but the process writing to the system_health output is very slow and is not able to catch up.

    Also, what version of SQL are you running?

    __________________________________________________________________________________
    SQL Server 2016 Columnstore Index Enhancements - System Views for Disk-Based Tables[/url]
    Persisting SQL Server Index-Usage Statistics with MERGE[/url]
    Turbocharge Your Database Maintenance With Service Broker: Part 2[/url]

  • It's almost like the system_health EE session has stopped collecting deadlock events on this particular server. The last one i have is 5 days ago. Meanwhile, with the trace flags set, I've had a couple each day in the SQL Error log, the latest was just a few hours ago.

    I agree this is a major disappointment. Extended events are great, but if you aren't confident you're capturing all the info, they become useless. I might try to set up a separate EE session that only collects deadlock events, just to see what it does.

    Have you opened a Connect case with Microsoft? If so, let me know the ID and I will +1. I'm having this issue on 2008R2. I wonder if it is still an issue in 2012.

    Thanks

    -Matt

  • mthumphrey (8/15/2013)


    It's almost like the system_health EE session has stopped collecting deadlock events on this particular server. The last one i have is 5 days ago. Meanwhile, with the trace flags set, I've had a couple each day in the SQL Error log, the latest was just a few hours ago.

    I agree this is a major disappointment. Extended events are great, but if you aren't confident you're capturing all the info, they become useless. I might try to set up a separate EE session that only collects deadlock events, just to see what it does.

    Have you opened a Connect case with Microsoft? If so, let me know the ID and I will +1. I'm having this issue on 2008R2. I wonder if it is still an issue in 2012.

    Thanks

    -Matt

    The funny thing is, progressively later deadlocks do appear eventually on the system_health xml, so the info is somewhere in the system, just not recorded in the system_health quickly enough. So it is not as if the events are lost, more that the updating of the system_health ring_buffer repository is not current/fast enough. At least that is my observation.

    I have never opened a connect item with MS before; will try to find out how, but I am currently in the process of a support case with them anyway. Will update this thread when I have more news.

    I was looking at this on a SQL-2012 instance yesterday; in sql 2012 system_health has 2 targets by default: ring_buffer (memory) and a file target (.xel file in the ERRORLOG folder). Perhaps it is better extracting this info from the file target? (but there might be the same issue there.)

    Not sure if this issue is version-specific, but so far I have found it to be an issue mostly on SQL-2008 servers.

    __________________________________________________________________________________
    SQL Server 2016 Columnstore Index Enhancements - System Views for Disk-Based Tables[/url]
    Persisting SQL Server Index-Usage Statistics with MERGE[/url]
    Turbocharge Your Database Maintenance With Service Broker: Part 2[/url]

  • There has already been a connect item on this - opened last year:

    https://connect.microsoft.com/SQLServer/feedback/details/754115/xevents-system-health-does-not-catch-all-deadlocks#tabs

    It was closed with no resolution though, as far as I can tell...

    __________________________________________________________________________________
    SQL Server 2016 Columnstore Index Enhancements - System Views for Disk-Based Tables[/url]
    Persisting SQL Server Index-Usage Statistics with MERGE[/url]
    Turbocharge Your Database Maintenance With Service Broker: Part 2[/url]

  • Jonathan Kehayias has an useful blog post about what is causing this problem:

    https://www.sqlskills.com/blogs/jonathan/why-i-hate-the-ring_buffer-target-in-extended-events/

    It doesn't give any solutions but it does explain why it's happening.

    tl;dr: The missing events may still exist in memory but cannot be displayed because the size of the XML output is limited to 4 MB. Looks like it's the most recent events that are left out. Presumably as new events are added and older ones removed, the "missing" events move down the list and eventually become visible.

    This would be why there appears to be a two or three day delay in seeing the deadlock details.

  • I have this issue on our SQL2008R2 SP2 instance.

    It just didn't register any deadlock data in system_health for the last 11 days ( and counting ).

    To fix it I request down time to bounce the instance :-$

    btw:

    SELECT * FROM sys.dm_xe_sessions;

    addressnamepending_bufferstotal_regular_buffersregular_buffer_sizetotal_large_bufferslarge_buffer_sizetotal_buffer_sizebuffer_policy_flagsbuffer_policy_descflagsflag_descdropped_event_countdropped_buffer_countblocked_event_fire_timecreate_timelargest_event_dropped_size

    0x0000000080D84081system_health0314415870043247610drop_event1flush_on_close0002015-03-20 06:12:13.5100

    dropped_event_countdropped_buffer_count both have a count of Zero !

    Very strange is that it states the create_time =2015-03-20 06:12:13.510 which is in the future !!! ( almost 3 days ahead)

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Here's a Powershell workaround I've been using on 2008 R2. The script is run from a scheduler nightly on each instance, and adhoc as required to retrieve intraday deadlocking info (or query the buffer directly).

    - Run a separate xe session on each instance to capture deadlock data

    - Dump xml events to file (default backup location) and delete old events

    - Recycle buffer

    # Powershell v2.0

    # Print out contents of Deadlock Session buffer, then recycle buffer.

    # Workaround to manage system health session problem: http://www.sqlservercentral.com/Forums/Topic1399315-1550-1.aspx

    # Run once to establish buffer, then schedule to run daily. If sql agent is used as scheduler, remove write-host references as no monitor exists in sqlps

    # Run adhoc to dump latest data for immediate analysis

    # User Variables

    $InstanceName = "server\instance" # SQL instance

    $DaysToRetainFiles = 31 # Archive limit

    $UTCOffset = 10 # Offset of local time from UTC, set to 0 if not required

    $MaxBufferMemory = 4096 # Buffer sized at least double of max anticipated daily deadlocks to allow for expansion of the xml from binary

    #----------------------#

    trap {

    write-host $_

    write-host "Error in deadlock session recycle"

    break

    }

    # Establish connection to sql instance

    $conn = New-Object system.data.sqlclient.sqlconnection -ea stop

    $conn.connectionstring = "server=$InstanceName; database=master; trusted_connection=True"

    $conn.Open()

    $cmd = New-Object system.data.sqlclient.sqlcommand -ea stop

    $cmd.connection = $conn

    $cmd.CommandTimeout = 999999

    # Find default backup location and subfolder "DeadlockSession" (create it if it doesn't exist)

    $SqlStmnt = @"

    EXECUTE [master].dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE', N'SOFTWARE\Microsoft\MSSQLServer\MSSQLServer', N'BackupDirectory'

    "@

    $cmd.commandtext = $SqlStmnt

    $sda = new-object System.Data.SqlClient.SQLDataAdapter($cmd)

    $sds = new-object System.Data.dataset

    $sda.fill($sds) | out-null

    $BackupLocation = $sds.tables[0].rows[0].Data

    $OutputLocation = Join-Path $BackupLocation "DeadlockSession"

    If ((Test-Path $OutputLocation) -eq $False) {

    New-Item -ItemType Directory -Force -Path "$OutputLocation"

    }

    Write-Host "OutputLocation: $OutputLocation"

    # Query existing session (if it exists) to get each xml event

    $SqlStmnt = @"

    DECLARE @HrsFromUTC int = $UTCOffset

    ;WITH cte_data AS (

    SELECT

    DATEADD(hh,@HrsFromUTC,event_data.value('(/event/@timestamp)[1]','DATETIME')) AS LocalTime,

    CAST(event_data.value('(event/data/value)[1]', 'nvarchar(max)') AS XML) AS DeadlockGraph

    FROM ( SELECT XEvent.query('.') AS event_data

    FROM (

    SELECT CAST(target_data AS XML) AS TargetData

    FROM sys.dm_xe_session_targets AS st

    INNER JOIN sys.dm_xe_sessions AS s

    ON s.address = st.event_session_address

    WHERE name = N'deadlock_session'

    AND target_name = N'ring_buffer'

    ) AS Data

    CROSS APPLY TargetData.nodes ('RingBufferTarget/event[@name="xml_deadlock_report"]') AS XEventData (XEvent)

    ) AS tab (event_data))

    SELECT REPLACE(REPLACE(REPLACE(REPLACE(CONVERT(CHAR(23), LocalTime, 121), '-', ''), ':', ''), '.', ''), ' ', '_') AS FileName,

    DeadlockGraph

    FROM cte_data

    "@

    $cmd.commandtext = $SqlStmnt

    $sda = new-object System.Data.SqlClient.SQLDataAdapter($cmd)

    $sds = new-object System.Data.dataset

    $sda.fill($sds) | out-null

    # dump files, using the utc offset time of deadlock as the filename

    foreach ($row in $sds.tables[0].rows) {

    $Filename = $Row.FileName

    $File = Join-Path $OutputLocation "$Filename.xml"

    [xml]$xml = $Row.DeadlockGraph

    $xml.Save($File)

    Write-Host "$Filename.xml created"

    }

    # Drop and recreate xe session (or if this is the first run, create the initial session)

    $SqlStmnt = @"

    IF EXISTS(SELECT * FROM sys.server_event_sessions WHERE name='Deadlock_Session')

    DROP EVENT session Deadlock_Session ON SERVER;

    CREATE EVENT SESSION Deadlock_Session

    ON SERVER

    ADD EVENT sqlserver.xml_deadlock_report

    ADD TARGET package0.ring_buffer(SET max_memory=$MaxBufferMemory)

    WITH (STARTUP_STATE = ON);

    ALTER EVENT SESSION Deadlock_Session

    ON SERVER

    STATE = start;

    "@

    $cmd.commandtext = $SqlStmnt

    $cmd.ExecuteNonQuery()

    Write-Host "Deadlock_Session xe recycled"

    $conn.Close()

    # Delete old deadlock graphs, if any exist

    $ArchiveLimit = (Get-Date).AddDays($DaysToRetainFiles *-1)

    Get-ChildItem -Path $OutputLocation | Where-Object {!$_.PSIsContainer -and $_.CreationTime -lt $ArchiveLimit } | Remove-Item -Force

    Write-Host "Files older than $DaysToRetainFiles days deleted"

  • tl;dr: The missing events may still exist in memory but cannot be displayed because the size of the XML output is limited to 4 MB. Looks like it's the most recent events that are left out. Presumably as new events are added and older ones removed, the "missing" events move down the list and eventually become visible.

    This would be why there appears to be a two or three day delay in seeing the deadlock details.

    If you watch the Pluralsight courses Kehayias does on Extended Events he goes into detail about the events being written to the XEL file. The size of the XML output is not limited to 4MB, the max memory allotted for the ring buffer is 4MB. You will not see an event show up in the event file (XEL) until it has been buffered out because it either hit the limit of max_memory or the limit of max_events_limit. Which are set to 4MB and 5000 respectively for the system_health session.

    It is also worth noting that if you use the "Watch live data" feature in SSMS 2012 to view a session that the ring buffer settings are overwritten while the API reader is actively reading the session stream, what the "Watch live data" utilizes in the background. I wrote up an answer on DBA.StackExchange here that shows an example of this.

    One thing I have used that has been consistent enough for the clients that utilize it is WMI monitoring. I use almost the exact same code shown in MSDN, with a few exceptions on table names and such. One particular client I simply added in a job step that will email a report of the deadlock information captured, providing enough information that they do not actually have to login to the SQL Server instance to pull it.

    Also, reading over a few other post in this forum I do seem to recall there were issues with the system health session capturing multi-victim deadlocks in SQL Server 2008 and R2. I don't remember where I read it but seems like it was fixed in 2012 because of they changes made in the XML format or something...no proof to provide mind you.

    Shawn Melton
    Twitter: @wsmelton
    Blog: wsmelton.github.com
    Github: wsmelton

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply