SQLServerCentral Article

Using Server Side Traces for Dynamic Performance Evaluation


Introduction:

The query execution plan is probably the first tool that DBAs and T-SQL programmers reach for when solving performance issues in their SQL Server applications. In my opinion, the biggest problem with this approach is the isolated context of the performance information gathered: it looks at only one query at a time. A working SQL Server is a dynamic system bombarded by hundreds or thousands of requests at the same time, and the execution time of one query may impact another’s, especially if they access the same tables. In some cases the only way to further improve one query’s performance is to modify a different query.

The technique described in this article may not be a viable solution for every SQL Server deployment but, because it was successful in solving several technical issues in our implementation, I would like to share it with you. Before I begin, I would like to describe the environment we are operating within.

Our System:

We use SQL Server 2005 Express Edition on a medical device. This simplifies things on one end: there is a very limited and predictable number of clients calling the database, and the database is very small by any standard. On the other hand, because the database is part of a physical system, the response time of a query must in some cases stay within a specific number of milliseconds regardless of the database size.

All clients are C# applications, and all database access is through stored procedure calls.

In our architecture the SQL database is one of the focal points and most of the business logic is handled inside the stored procedures. Because of this approach, our stored procedures are quite complex and they often call other stored procedures and views internally.

Our Need to Analyze

In order to analyze the overall system performance and the impact of any changes we introduced during development, we needed a way to gather physical evidence, and we needed hard numbers to compare the “before” and “after” with. Server-side traces gave us a way to collect that data easily and consistently. By analyzing the collected data, we were able to determine the dynamic relationship between different queries.

Because of the way we access the database, the trace was limited to capturing RPC calls only. Of course, you could capture different events according to your own needs; you just need to find the ones that match your environment best.

Creating the Trace

The code below is what I used to create the trace. In this case, the trace output is captured in a file with an automatically dated file name on the local server drive. If the file reaches the maximum defined size, the captured data rolls over to a new file, and so on. Other details of the trace code are explained in the embedded comments.

 -- Declare variables
 DECLARE @rc INT
 DECLARE @TraceID INT
 DECLARE @maxFileSize bigint
 DECLARE @fileName NVARCHAR(128)
 DECLARE @on bit
 DECLARE @off bit
 -- Set values
 DECLARE @CurrentDate NVARCHAR(20)
 SET @CurrentDate = CONVERT(VARCHAR(20),getdate(), 112)
 SET @maxFileSize = 10
 SET @fileName = N'C:\Trace' + @CurrentDate
 SET @on = 1
 SET @off = 0
 PRINT @fileName
 -- Create trace
 EXEC @rc = sp_trace_create @TraceID OUTPUT, 2, @fileName, @maxFileSize, NULL  -- option 2 = TRACE_FILE_ROLLOVER
 -- If an error occurred, end the process
 IF (@rc != 0) GOTO error
 -- Set the events and data to collect
 EXEC sp_trace_setevent @TraceID, 10,  2, @off   --BinaryData
 EXEC sp_trace_setevent @TraceID, 10, 12, @on    --SPID
 EXEC sp_trace_setevent @TraceID, 10, 13, @on    --Duration
 EXEC sp_trace_setevent @TraceID, 10, 14, @on    --StartTime
 EXEC sp_trace_setevent @TraceID, 10, 15, @on    --EndTime
 EXEC sp_trace_setevent @TraceID, 10, 16, @on    --Reads
 EXEC sp_trace_setevent @TraceID, 10, 17, @on    --Writes
 EXEC sp_trace_setevent @TraceID, 10, 18, @on    --CPU
 EXEC sp_trace_setevent @TraceID, 10, 24, @on    --IndexID
 EXEC sp_trace_setevent @TraceID, 10, 34, @on    --ObjectName
 EXEC sp_trace_setevent @TraceID, 10, 48, @on    --RowCounts
 EXEC sp_trace_setevent @TraceID, 10, 4, @on     --TransactionID
 -- Do not collect ADO.NET connection reset calls (TextData <> 'exec sp_reset_connection')
 EXEC sp_trace_setfilter @traceid = @TraceID
         , @columnid = 1             -- TextData
         , @logical_operator = 0     -- AND
         , @comparison_operator = 1  -- not equal
         , @value = N'exec sp_reset_connection'
 -- Start the trace
 EXEC sp_trace_setstatus @TraceID, 1
 -- display the trace id for future reference
 SELECT TraceID=@TraceID 
 GOTO finish 
 -- error trap
 error: 
 SELECT ErrorCode=@rc 
 -- exit
 finish: 
 GO
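
One housekeeping note: a trace created this way keeps running (and holding its files) until it is explicitly stopped and closed. The snippet below is a minimal sketch of that cleanup; it assumes you still have the trace ID displayed by the creation script (shown here as a placeholder value of 2).

 -- Stop the trace and remove its definition when data collection is finished
 DECLARE @TraceID INT
 SET @TraceID = 2 -- placeholder: use the trace ID displayed by the creation script
 EXEC sp_trace_setstatus @TraceID, 0 -- status 0 = stop the trace
 EXEC sp_trace_setstatus @TraceID, 2 -- status 2 = close the trace and delete its definition
 GO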

Importing the Trace Data

For performance and system-dynamics testing, I created a set of simulators (C# apps) that execute database calls in a fashion similar to the live system. Using simulators instead of the live system allows us to run database tests in the same environment regardless of the performance of the physical equipment. I ran the simulators for about 24 hours and collected the trace data. Once that completed, we needed to put the trace data somewhere we could analyze it, and what better place for such an analysis than a SQL table?

Using the code below, the collected trace data was loaded into a table called “Trace1”. Since this uses SELECT...INTO, it’s quick and easy because you don’t even need to create the trace table ahead of time. The reason I used the TOP clause is to capture the same number of calls after each test run, so the statistics could be compared across a comparable number of calls.

SELECT TOP 500000
 ObjectName,
 TransactionID,
 SPID,
 Duration,
 StartTime,
 EndTime,
 Reads,
 Writes,
 CPU,
 RowCounts
 INTO Trace1
 FROM ::fn_trace_gettable('C:\Trace20090918.trc', DEFAULT)
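
As a quick sanity check (this query is my own addition, not part of the original test scripts), you can confirm how many calls were imported and what time window they cover before doing any comparisons:

 -- Confirm the size and time span of the imported trace data
 SELECT
  COUNT(*)       AS CapturedCalls,
  MIN(StartTime) AS FirstCall,
  MAX(EndTime)   AS LastCall
 FROM Trace1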

After that, a simple call like this…

 SELECT
  ObjectName,
  COUNT(*)      AS CallCount,
  AVG(Duration) AS AvgDuration,
  MAX(Duration) AS MaxDuration,
  MIN(Duration) AS MinDuration
  FROM Trace1
  GROUP BY ObjectName

…can provide us with some basic performance numbers.
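
If the data from a second test run is imported into its own table (I will call it Trace2 here purely for illustration), the two runs can be compared and the percentage change calculated in a single query, along these lines:

 -- Compare average durations between two test runs (Trace2 is a hypothetical second import)
 SELECT
  t1.ObjectName,
  t1.AvgDuration AS Run1AvgDuration,
  t2.AvgDuration AS Run2AvgDuration,
  1.0 - (t2.AvgDuration * 1.0 / t1.AvgDuration) AS PctChange
 FROM (SELECT ObjectName, AVG(Duration) AS AvgDuration
         FROM Trace1 GROUP BY ObjectName) AS t1
 JOIN (SELECT ObjectName, AVG(Duration) AS AvgDuration
         FROM Trace2 GROUP BY ObjectName) AS t2
   ON t2.ObjectName = t1.ObjectName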

Analyzing the Trace

The tables below show some of the numbers collected from two independent test runs. They list five of the stored procedures we analyzed, the number of calls made, and the average duration (in microseconds) of those calls.

Test run 1:

Procedure                  Count   Avg Duration
DrawerSamples              37384   50755
SequencerWorldView          9341   74328
LISOrder                   30355    1529
ResultAdd                  77856    1650
MechanismPositionsStatus  117637     550

Test run 2:

Procedure                  Count   Avg Duration   % Change
DrawerSamples              41848   35088          0.31
SequencerWorldView         10474   46930          0.37
LISOrder                   27188    1477          0.03
ResultAdd                  73034    1612          0.02
MechanismPositionsStatus  131843     532          0.03

In both test runs, all the stored procedures were identical and the tests started with a clean database. The only difference was in the way the DrawerSamples stored procedure accessed the data: I changed it to use the WITH (NOLOCK) hint on all tables accessed. As you can see from the tables above, that provided a 31% performance gain for this stored procedure. Because of the way this stored procedure is used in our application, it makes no difference from the business perspective whether the data is read with or without the locks. The performance-critical stored procedure, in our case, is SequencerWorldView. That query had already been tuned as far as we could take it, the required performance was still not there, and additional tuning attempts proved fruitless. But thanks to the change made to the DrawerSamples stored procedure, we got a 37% performance gain on the calls to the critical SequencerWorldView stored procedure.

Now I have to point out that the two stored procedures are totally independent. Both are primarily SELECT statements accessing the same tables, but they retrieve different data. The performance gain in SequencerWorldView came from the underlying tables being more available: after the change, DrawerSamples no longer took the locks it had been taking during the first test run.

The other procedures listed here showed no performance gain (a 2-3% change is statistically negligible in this case).

Without the ability to trace all the database calls and collect hard data to analyze, it would be almost impossible to see the impact of such changes on the overall system performance. Of course, this is just a small sample of the performance data that can be collected this way. The average duration is a critical piece of information in measuring overall query performance, but it is not the only one.

In our case, we also need to look at the duration distribution because we are controlling a physical device and frequent long queries could potentially create hardware lockups.

Analyzing the Distribution of Code Duration

A query like the following can provide some distribution data from the same trace. It uses classic Cross-Tab methodology to count how many calls fall into each duration range:

SELECT
 ObjectName,
 COUNT(*)      AS CallCount,
 AVG(Duration) AS AvgDuration,
 MAX(Duration) AS MaxDuration,
 MIN(Duration) AS MinDuration,
 -- Duration is reported in microseconds, so each bucket below is a range in milliseconds
 SUM(CASE WHEN Duration < 10000 THEN 1 ELSE 0 END)                   AS [0-10 ms],
 SUM(CASE WHEN Duration BETWEEN 10000  AND 50000  THEN 1 ELSE 0 END) AS [10-50 ms],
 SUM(CASE WHEN Duration BETWEEN 50000  AND 100000 THEN 1 ELSE 0 END) AS [50-100 ms],
 SUM(CASE WHEN Duration BETWEEN 100000 AND 200000 THEN 1 ELSE 0 END) AS [100-200 ms],
 SUM(CASE WHEN Duration BETWEEN 200000 AND 300000 THEN 1 ELSE 0 END) AS [200-300 ms],
 SUM(CASE WHEN Duration BETWEEN 300000 AND 400000 THEN 1 ELSE 0 END) AS [300-400 ms],
 SUM(CASE WHEN Duration BETWEEN 400000 AND 500000 THEN 1 ELSE 0 END) AS [400-500 ms],
 SUM(CASE WHEN Duration > 500000 THEN 1 ELSE 0 END)                  AS [>500 ms]
 FROM Trace1
 GROUP BY ObjectName

Test run 1 (original code; durations in microseconds, buckets in milliseconds):

Procedure                  Count   Avg Duration  Max Duration  Min Duration    0-10   10-50  50-100  100-200  200-300  300-400  400-500  >500
DrawerSamples              37384   50755         1206951       4274             422   22454   13932      365       65      128       13     6
SequencerWorldView          9341   74328          731776       4558              19    1445    7001      837       34        3        1     1
LISOrder                   30355    1529          185064        927           30294      30      28        3        0        0        0     0
ResultAdd                  77856    1650          133453        657           74993    2834      28        1        0        0        0     0
MechanismPositionsStatus  117637     550          352133        367          117478     126      31        1        0        1        0     0

Test run 2 (modified DrawerSamples code):

Procedure                  Count   Avg Duration  Max Duration  Min Duration    0-10   10-50  50-100  100-200  200-300  300-400  400-500  >500
DrawerSamples              41848   35088          603626       4078             454   39781    1368      151       16       10        4    82
SequencerWorldView         10474   46930          440528       4467              57    6128    4049      223       16        0        1     0
LISOrder                   27188    1477          196738        916           27124      39      22        3        0        0        0     0
ResultAdd                  73034    1612          158325        661           71041    1885      23       85        0        0        0     0
MechanismPositionsStatus  131843     532          139235        356          131690     150       2        1        0        0        0     0

The tables above show the duration distribution for the two test runs. The first test run used the original code. The second used the modified DrawerSamples code (with the WITH (NOLOCK) hint). As you can see, before the modification there were 5 calls to the SequencerWorldView stored procedure that took longer than 300 ms, including one taking more than 0.5 seconds. After the modification that dropped to 1 call longer than 300 ms and no calls longer than 0.5 seconds. Our system guys were pleased with such results.

In the past (using a different software architecture on the same device) we noticed that if the decision-making process took too long, it had to be reset and started again to accommodate the changes in the physical world. Occasional long queries could be absorbed by the system, but when the ratio of long-running queries reached a certain level, the system would simply freeze. The distribution of query durations gave us the critical piece of information we needed to confirm that database performance would not introduce any system stability issues.
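
When the distribution shows a long tail for a particular procedure, the same trace table lets you pull out the individual offending calls and see when they happened and what they cost. A simple drill-down along these lines (the 300 ms threshold is just the limit that mattered for our device) is usually the next step:

 -- List the individual long-running calls; Duration is in microseconds
 SELECT
  ObjectName,
  StartTime,
  EndTime,
  Duration,
  Reads,
  Writes,
  CPU
 FROM Trace1
 WHERE Duration > 300000 -- longer than 300 ms
 ORDER BY Duration DESC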

Conclusion

Although the analysis of the execution plan for a given query will help you tune that one query, there are times when you need to look at an entire system of queries and the impact they can have on each other. This article demonstrated how analyzing the data collected by the SQL Trace could help solve certain performance issues.

Depending on your needs, you can collect different pieces of information, which can be used to produce additional valuable performance reports. Look up the sp_trace_setevent system stored procedure in Books Online to get an idea of what can be collected this way. If you experiment with traces, you can probably come up with other creative ways they can help you address technical and business needs.
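
Besides Books Online, the server itself can list what is available: the catalog views sys.trace_events, sys.trace_columns, and sys.trace_event_bindings describe every trace event, every data column, and which columns each event supports. For example, a query like the one below lists everything that can be captured for the RPC:Completed event (event ID 10) used in this article:

 -- List the data columns available for RPC:Completed (trace event 10)
 SELECT
  c.trace_column_id,
  c.name,
  c.type_name
 FROM sys.trace_event_bindings AS b
 JOIN sys.trace_columns AS c
   ON c.trace_column_id = b.trace_column_id
 WHERE b.trace_event_id = 10
 ORDER BY c.trace_column_id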

Thank you for reading.
