When I was young, I went to nursing school. In school, we were taught that in order to come
up with a proper diagnosis, the doctor first had to get a patient history. Based on what that revealed, the appropriate tests
were run, labs were reviewed and a theory of what was wrong was formulated. The doctor would comb the literature and consult
colleagues and experts if needed until a diagnosis was made. When the doctor arrived at a diagnosis, he or
she found that in many cases, the cure was documented and it was usually a
matter of following standardized steps of care while carefully monitoring the
patient to prove that they were back on the road to health.
So now, all these years later, instead of being a nursing
student, I’m basically a database doctor, and this should all be analogous,
Yes, it should. No, it wasn’t.
I had a history of the servers and workloads in question,
and a list of the problems that had come up.
The problem was, I couldn’t find good documentation of similar cases
anywhere, much less a consult for a diagnosis or a solution. So, now what?
Now, I decided, I go back to the days of experimental medicine (I mean, database doctoring).
The first step was to go back and review what I had learned so far. I had previously done some limited test runs on our current SAN environment that ran somewhat slower than the real-time trace, but nothing approaching what I had just seen on the test SAN. As I mentioned in my previous blog, when I was first getting familiar with the DRU, I had tested with several different types of architecture: one where the controller and the client server were on the same instance, and another where the client and the target servers were on the same instance, and a third with the default architecture of one controller: one client: one target. I even tried making the controller, client and target the same server. There was no noticeable difference in replay times in those initial tests. Architecture didn’t seem to be the issue.
Next, I looked at the quality and quantity of the workloads.
I had two completely preprocessed workloads to work with: The controlRun and the HeavyIORun. In my controlRun
testing, I’d used a number of scripts for my trace, utilizing some resource-intensive
queries like CHECKDB, and a script using a nested cursor– but nothing that
actually modified data. The HeavyIORun,
however, processed millions of rows of data at a time and modified it. I had been
comparing apples to watermelons. That
would need to be fixed.
I felt convinced that the DRU was dividing replay events up
as evenly as it could to do the second step of preprocessing, but it could only
go as fast as its configuration would allow. It might go pretty quickly with a few hundred
thousand events to divvy up, or even take about 20 minutes for a little over a
million, but clearly, the CriticalSystemProcessingRun (topping out at over 55
million events) had not scaled at all. Given
this, I had to believe that the DRU itself was a bottleneck. That, I could do nothing about.
But how to make the replays go anywhere close to real time…. that was the question to which I had no answer.
The first step would be to compare the runs in each environment. I had captured the metrics on the test SAN, which was taken down shortly afterward. All future work would have to be done on the existing SAN, with similar (if not more detailed) metrics being recorded.
As I reviewed the initial trace that I had wound up using as
a control, I recalled that I had never done that trace with any intention of
using it for that purpose. It was only done
as a means to get to know the utility and to play around with its capabilities
and quirks. I had wanted something that
I could use as a gauge to give me some idea of how long it might take to do a
replay “for real”. Now, it seemed clear
that while my controlRun might have been interesting (and certainly insightful
for helping to determine how the preprocessing works to divide the workload),
it only served to see if the DRU worked. It wasn’t very helpful in and of itself to
get any idea of how it worked.
I also began to wonder if the fact that my controlRun had
not modified data had anything to do with the replay time. I similarly wondered if filtering the trace
to just one database would have had any effect on the replay.
It was time to come up with a set of better control traces. First, I would need a trace which would cause
physical changes on the server, and as closely as possible, try to match the
HeavyIORun run (since I was able to at least complete preprocessing with that
workload). This was challenging, because
the HeavyIORun process was run by a middle tier and was transparent to me. However, my team had traced part of the
process once while trying to troubleshoot replication issues, so I knew that it
delivered large batches of data in spurts of activity from an outside source. I could roughly mimic the activity by
creating a database (for this, I used Kimberly Tripp’s Credit database,
weighing in at roughly 1.4 GB) and then importing all the data into it from a
copy of the same database on another instance using SSIS. Once the database was
full of data, I would then drop it, recreate it, and reimport the data all over
again. I repeat this process until the trace time was up, which would be for about
the same amount of time as the HeavyIORun trace: 20 minutes. That would
simulate the idea of the bursts of IO from another process and give me some
idea of how the DRU handled it. It also
offered the advantage of using the same repeated activity throughout the trace,
which could help if it hit any bumps in the road on replay. I named this
control process the DMLRun.
To address the question of whether traces that didn’t modify
data would perform better on replay, I needed a process that would run for the
same amount of time as the DMLRun, which didn’t modify any data, but still
generated workload on the server. This process could then be compared to the
DMLRun to see the difference in execution time and preprocessing metrics. To accomplish that, I ran Glenn Berry’s
database-specific diagnostic queries, followed by a CHECKDB of all
databases. I called that trace
The Search for Answers
I duplicated the DRU architecture from the test SAN. I also duplicated the default replay settings.
I decided to begin by addressing the questions of whether a trace that modified no data really ran any faster on replay, or whether filtering the trace made any difference. To do that, I used the LimitedRunNoDML script on our existing SAN. In the first scenario, I would run the script with the trace filtered to just the one database that Glenn’s database-specific queries were hitting, and in the second run, I would run the same script with the default trace (no filters). I hoped that observing the contrast in this controlled trace might provide some insight. At least, it offered ideas.
The second run took about 3 seconds less to complete preprocessing (even though it was now running against the entire server). It also wound up with fewer preprocessed events, but those events took longer to actually replay. I suspect that the change in events could be due to the DRU caching some events that it “knows” that it will reuse. But why did it take almost a minute longer to replay fewer events the second time? And why was the pass rate slightly lower on the second run? Looking through the output files for the second run, I did find a number of interesting entries which I have sampled below (I say interesting, because it obviously knew where the trace was, and because the controller, the client, and myself – the person executing the scripts – were all sysadmin):
Without doing an actual row count of the trace itself as
opposed to the number of replay provider errors that were recorded, it did seem
likely that there were enough occurrences to equate to .03%. It also seemed possible that the “under the
covers” activity that was recorded might at least contribute to the lowered
pass rate and extra execution time. The
reason for the low pass rate was unclear until I did some reading on the SQL
Server Profiler Replay (which is basically a manual version of the DRU, using
the same trace template). Transactional
replication is skipped with this, and the use of linked servers would also have
some implications here.
Still, the trace ran between 2-3 minutes under real time in
both cases, and it didn’t make much of a difference whether the trace was
filtered to one database or not.
Next, I did the DMLRun. I ran the process 10 full times, and was about halfway through the 11th run when the trace stopped. Each completed run imported 1,676,574 rows of data.
This replay was moving in nearly real time until it reached the last 150 events. Then it hung for hour after hour, finally finishing in over 15 hours. I wondered if running it a second time would make any difference. It did. Just not the one I was hoping for:
Something had to give. I was just about out of ideas, until I decided to try changing the DRU settings. I tried changing the sequencing mode from the default of stress to synchronization. Synchronization mode replays events in the same order they occurred (vs. the default stress mode, which follows the sequence used in the StressScaleGranularity setting
(which in my case, was set to the default of “SPID”). There’s a caveat here: Microsoft warns that “if you actually have 8192+ sessions that require synchronization with the 8193rd, 8194th, ….sessions(s) the replay can stall because the 8193rd, 8194th, … session won’t have a replay worker until one of the 8192, previous sessions completes.” I checked my logs to be sure that wouldn’t be an issue for me on the DMLRun replay before I switched the setting.
From 15 hours (on the previous best case run) to 18.5+ hours
to 21+ minutes! This was progress! Had I found the fix? It was tempting to think so.
I thought I’d try a comparison of the unfiltered LimitedRunNoDML replay I’d done earlier in stress mode vs. synchronization modes to see if I got the same sort of dramatic results I’d gotten on the DMLRun:
The answer was: not
really. The preprocessing time was
roughly the same; it was just a tradeoff of which step I wanted to spend the
most time on. Still, the replays are
faster than the trace actually took.
Now to test the fix on the HeavyIORun, which was going to take over 12 hours using the default settings on the newer, faster test SAN. Here are the runs in Stress Mode vs. Synchronization Mode:
In this case, using synchronization mode actually takes longer overall, but that is still a huge
improvement over the test SAN, which had projected a 12+ hour run time. Both runs are slightly over the actual trace
time, but by a matter of minutes.
- Running replays on a different SAN environment
presents real performance challenges on replay.
One main difference that must be mentioned between the environments is
that both the CriticalSystemProcessingRun and the HeavyIORun were
application-driven, and the applications were not pointed to the new SAN.
processes were done through the application rather than through SQL Server, and
it was impossible to tell what might have been missed. Given another opportunity, I
would request more time for testing. I
would first run the control processes (DMLRun and LimitedRunNoDML), then I
would test to determine whether pointing the applications to the new
environment made a difference, and if so, how much.
- It seems clear that tweaking the configurations
of the DRU can have a real effect on
replay, depending on the workload involved.
However, documentation on the implications of changing the replay
configurations are difficult to find (if they exist at all). It would be helpful if more information on
this was provided for this Microsoft recommended utility, and optimally, a
mechanism embedded within the utility itself in the preprocessing phase to
recommend configuration settings for better replay performance.
- Filtering the trace to just one database vs.
running it wide open on the server seemed to have very little real effect on
replay, aside from allowing for a slightly better pass rate. More testing should be done with different
workloads in the future to provide more confirmation.
- The utility itself has the potential to be a
bottleneck for preprocessing. It does
not scale well during the second phase of preprocessing (although this may well
- The pass rate mechanism is unclear unless you
read up on the older Profiler Trace Replay mechanism. The Profiler Trace Replay is basically a
manual version of the DRU. It doesn’t
work with transactional replication (used in both runs), and of course, data
processed over linked servers would have some issues here as well.
- With Profiler traces deprecated, it would be
great to see the DRU work with Extended Events.
- A GUI interface would provide a faster and more
convenient way to run the utility.