Blog Post

Debugging query timeouts with X-Events – part 1

,

I like hunting. Not the kind where you are walking around the forrest and trying to find a target. No, I like hunting for SQL queries that are messy, or down right nasty. My latest hunting adventure was to find the root cause for intermittent query timeouts, that could happen sometimes 3 times per day, and sometimes not for days. My starting point was, that I knew from the application, that it was a command timeout that occured, and the timeout value was 30 seconds.

So I knew that the queries did not complete in 30 seconds, but I also knew that the queries were executed thousands of times during a day, with no problem at all.

I didn’t know if the queries were actually hitting the cpu/disks heavily, or if they were blocked by something. If they were waiting for something, I didn’t know if it was physical resources like disk, cpu, network or if it was a lock on an object. If they were hitting the physical resources, it might be a bad execution plan, messed up statistics or a sick index that was the cause.

To answer some of the unknowns, I started a basic sql server trace, and caught all queries with a duration > 25 seconds. After running that trace a couple of days, the timeouts finally happened again. From the trace I could see that the queries had a duration of ~30.000 ms, but a cpu value of only 1-200 ms. The number of reads were also close to zero. This meant, that a bad execution plan wasn’t the cause, but some waiting/blocking was going on. I just couldn’t see what it was from the sql server trace.

X-Events to the rescue!

X-Event Sessions is the new tracing method introduced with SQL Server 2008, which is more lightweight than a classic profiling/tracing, and it is also able to give more detailed information. I won’t go into the details of XEvents in this blog post, but just show you how I used it for this specific task.

The goal

The goal for this X-Event tracing was to be able to find out what the queries were waiting for. It could either be other queries that had a lock on objects, or it could be a lack of physical resources in the server.

The code

To create the X-Event session, you simply need to run this:

CREATE EVENT SESSION [WaitEvents] ON SERVER
ADD EVENT sqlos.wait_info
(
    ACTION(
        sqlserver.database_id,
        sqlserver.session_id,
        sqlserver.sql_text,
        sqlserver.plan_handle
        )
    WHERE
        opcode = 1 AND duration > 200 AND session_id > 50
)
ADD TARGET package0.asynchronous_file_target
    (SET FILENAME = N'C:\XEvents\WaitEvents.xel',
    METADATAFILE = N'C:\XEvents\WaitEvents.xem')
WITH (EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,
    MAX_DISPATCH_LATENCY=5 SECONDS)
GO

 

Just adjust the path to the files to match your server.
The session triggers on events of the type sqlos.wait_info. This triggers an event on every wait event that is happening on the system.
The ACTION part defines the information I want extracted. In this case I just want the database_id, session_id, sql_text as well as the plan_handle to be able to see the actual execution plan for that specific query at that specific time.
The opcode=1 filters out so we only get the “end” events.
session_id > 50 ensures that we only get events caused by user sessions. If the root cause happens to be some SQL Server internal stuff, then we might miss it – but my gut feeling told me that it was ok in this case.
duration > 200 ensures that we are not getting spammed with a lot af small wait events for all io operations. If the root cause is a lot of slow IO’s, then  it might not be caught by this XEvent session. If that’s the case, it’s just a matter of adjusting the values. The lower value, the more events you will collect.

Because I need to have this X-Event session running for a couple of days, I want to store the events in files, so the ADD TARGET simply does that.

The session isn’t running per default, so we need to start it manually:

ALTER EVENT SESSION [WaitEvents] ON SERVER
STATE = START
GO

 

Stopping the session is as simple as:

ALTER EVENT SESSION [WaitEvents] ON SERVER
STATE = STOP
GO

 

And dropping it entirely from the server:

DROP EVENT SESSION [WaitEvents] ON SERVER

 

Like a server side trace, we are able to leave the session definition on the server, and then just starting and stopping it when we need to collect data.

Generating dummy events

While the above XEvent session was running on my test setup, I ran the following from connection 1:

CREATE TABLE DummyTable (col1 INT)
INSERT INTO DummyTable (col1) VALUES (1)
GO
WAITFOR DELAY '00:00:01'
GO
BEGIN TRAN
GO
DELETE FROM DummyTable
GO

 

This leaves an open transaction which holds a lock on the DummyTable. From connection 2 I then ran this:

 

DELETE FROM DummyTable
GO

 

I then aborted the query from connection 2, and committed the transaction from connection 1. I then stopped the X-Event session with the STOP command above, and was ready to analyze the data.

Analyzing the data

To analyze the data we need to do some XQuery magic. To be honest, I’m not too comfortable writing these queries from scratch, but luckily it’s easy to find tons of examples on the internet. If you haven’t found a query that matches your XEvent session 100%, then it’s most often easy to adjust the scripts to do so.

The query I ended up with, was this (Don’t worry, it’s not as nasty as it might seem Smiley)

;WITH MyXEventData AS (
SELECT
DATEADD(hh,
DATEDIFF(hh, GETUTCDATE(), CURRENT_TIMESTAMP),
XEventData.value('(event/@timestamp)[1]', 'datetime2')) AS [timestamp],
COALESCE(XEventData.value('(event/data[@name="database_id"]/value)[1]', 'int'),
XEventData.value('(event/action[@name="database_id"]/value)[1]', 'int')) AS database_id,
XEventData.value('(event/action[@name="session_id"]/value)[1]', 'int') AS [session_id],
XEventData.value('(event/data[@name="wait_type"]/text)[1]', 'nvarchar(4000)') AS [wait_type],
XEventData.value('(event/data[@name="duration"]/value)[1]', 'bigint') AS [duration],
XEventData.value('(event/action[@name="plan_handle"]/value)[1]', 'nvarchar(4000)') AS [plan_handle],
XEventData.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(4000)') AS [sql_text]
FROM (
    SELECT CAST (event_data AS XML) AS XEventData
    FROM sys.fn_xe_file_target_read_file (
        'C:\XEvents\WaitEvents*.xel',
        'C:\XEvents\WaitEvents*.xem', null, null)) RawData
)
SELECT *
FROM MyXEventData

 

The query uses a CTE (Common Table Expression) to wrap the nasty xml stuff, and making it way easier to write simple queries against the CTE in the end.
Let’s break this down a bit. The following part simply reads the content of the files, and returns the entire xml as a column called XEventData:

SELECT CAST (event_data AS XML) AS XEventData
FROM sys.fn_xe_file_target_read_file (
       
‘C:\XEvents\WaitEvents*.xel’,
       
‘C:\XEvents\WaitEvents*.xem’, null, null)) RawData

This sub-select is what we execute the XQuery expressions against, to fetch the relevant data from within the xml document. Since this is not a blog post about XML and XQuerying, let’s just skip the details, and go to the resulting output from the entire query:

image

The first event was my WAITFOR DELAY query, which by design was waiting for 1 second. The wait_type is WAITFOR. The duration column says 1001 ms, which is pretty close to the expected 1 sec.

The second event was our DELETE statement from connection 2. This was waiting for 2166 ms before I manually aborted the query. The wait_type was LCK_M_U.

Back to my hunting experience

Using the XEvent session as demonstrated above, I ran it on the production system to find out what the timed out queries were waiting for. After a bunch of timeouts had happened, I found all the matching wait events, and they all had a matching wait event of type LCK_M_IX with a duration of ~30000 ms. The LCK_M_IX is an intent exclusive lock, which confirms my gut feeling.

Know I knew for certain, that some other query was blocking for a long time period. And since the affected queries were both inserts, updates and deletes, I thought I was dealing with a table lock. But I didn’t know for sure, and I didn’t know which query was the sinner.

Next step

The next is to identify the query that caused the blocking. I will show you how to do that in part 2 of this blog post, so keep and eye out for it.

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating