The Short Answer? (In at least a couple of places)
When I first learned about the DRU, I was very excited about the possibilities. I began to play with the utility in our test and stage environments, which are on a different SAN than our production environment. I experimented with the architecture: putting the controller and client on one server, putting the client and target on one server, and trying the default architecture of one controller, one client, and one target. I modified the configuration files and used different switch options. I ran traces of myself running horrible queries that didn't physically change a thing, and traces where I was actually modifying data (usually just adding a database, then adding a table to that database), to assure myself that the replay was really physically modifying the target server.
After doing a number of smaller, limited practice runs to
assure myself that I had a basic understanding of the utility and how it worked,
I learned that our Infrastructure team wanted to use the DRU as part of
performance testing for the new SAN they wanted to buy. Once we were down to a couple of different
vendors, it was decided that we would use the DRU to test several different
workloads: one that was more CPU
intensive, one that was heavy on IO, one critical system processing run, and
one run to test the processing of a tabular cube.
This went well beyond anything I had ever used the utility
for. Additionally, because the test was being done in what was essentially an isolated
environment (i.e., the existing SAN could
talk to the new test SAN, but they wouldn’t during workload testing), it
presented a unique challenge. To further
complicate matters, both the IO-heavy run and critical system processing runs
used transactional replication in production, so the performance of replication
would need to be assessed.
I began by prepping the environment on our existing SAN. I set up the DRU controller in our stage environment and ran the needed traces on client test instances there.

Then, the question of how to set up the isolated test SAN environment for the replay had to be resolved. Another DRU controller was set up on a SQL Server instance on the new SAN; this controller also served as the client and the target. This time, only the databases necessary to run the needed processes were restored. Logins were copied over to the new test SAN environment, and permissions were confirmed. One of my teammates then set up a distribution server and a server for the replicated data on the test SAN. Replication was set up and ready to go. Lastly, the traces were copied over.

Infrastructure would use vMotion to move the storage from one SAN candidate to another, making that part of the process transparent to us. Between tests, we just needed to tear down replication, do database restores, fire replication back up, and we'd be ready to go again.
I used one of my last test runs from when I was first playing with the utility on the existing SAN as a control, naming it controlRun. It had used processes such as nested cursors and CHECKDBs to generate load on the server, but modified no data, and it clocked in as a 5-minute trace. I hoped it would give me some idea of what to expect in terms of preprocessing and playback times. I had the controller set to all the default settings for the replay.

Next, I began preprocessing data. I had two of our production workloads to work with: the CriticalSystemProcessingRun and the HeavyIORun. The CriticalSystemProcessingRun, while generating IO at the beginning and end of its process, was generally more CPU intensive and a much longer process, usually taking about 2 hours. The HeavyIORun, however, lived up to its name, typically processing millions of rows but taking only about 20 minutes or so to run.
Clearly, both the CriticalSystemProcessingRun and the
HeavyIORun are much larger workloads than the very limited runs I had
previously done. It also seems clear
that the test SAN is much faster. I saw the
difference reviewing the first step of preprocessing, where the utility is
sucking up the data to prepare for replay.
Our existing SAN took a full 30 seconds to preprocess the first step of the controlRun, whereas the test SAN took only a minute to process over a million events on the HeavyIORun.
It’s also interesting to note how consistent the second step of preprocessing is across workloads and SAN environments (Figure 1):
Each ten percent interval on the
CriticalSystemProcessingRun took exactly 73 minutes. Likewise, the HeavyIORun
took exactly four minutes to process each 20 percent interval. Meanwhile, the controlRun had taken two
minutes to process all 143,443 events. I decided to look a little deeper at the
average number of events processed per minute (Figure 2). There was some
difference there. This leads me to believe
that while this step may be influenced by the type of events being processed, it
is also driven by the way the DRU itself works. The utility seems to look
at the total number of events and divide them up as evenly as possible for
processing. This consistency remained
through all three runs despite the fact that the three workloads in question
were very different in time, quality, and what they did on the server.
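The arithmetic behind Figure 2 can be sketched as follows. Note that only the controlRun's event count (143,443) comes from the text; the HeavyIORun count is an approximation (it is described only as "over a million events"), and the CriticalSystemProcessingRun count is a purely hypothetical placeholder, since no count was given for it:

```python
def events_per_minute(events: float, minutes: float) -> float:
    """Average throughput for the second preprocessing step."""
    return events / minutes

# controlRun: exact figures from the text (143,443 events in 2 minutes).
control = events_per_minute(143_443, 2)

# HeavyIORun: ~1M events (approximation), five 20% intervals x 4 min each.
heavy_io = events_per_minute(1_000_000, 5 * 4)

# CriticalSystemProcessingRun: placeholder event count for the 30% that was
# observed, at three 10% intervals x 73 min each.
critical = events_per_minute(225_000, 3 * 73)

for name, rate in [("controlRun", control),
                   ("HeavyIORun", heavy_io),
                   ("CriticalSystemProcessingRun", critical)]:
    print(f"{name}: ~{rate:,.0f} events/min")
```

The spread in per-minute rates, despite the near-perfectly even interval times, is what suggests the utility divides the total event count into equal chunks up front rather than pacing itself by event type.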
It may be apparent that I stopped looking at the
preprocessing on the CriticalSystemProcessingRun at 30%. That is because it had already taken well
over 3 hours to get that far, and I
could see that the events being processed weren’t going to go any faster as
time went on. That clearly wasn’t going to work for performance testing.
Trying the HeavyIORun instead was more productive – it
finished in 20 minutes. I thought we
were back on track until the replay began, and the DRU told me that the HeavyIORun
replay was going to take over 12 hours (for a 20-minute trace)! Contrasting that with my previous testing, the controlRun
trace replay had also been slow (10 minutes for a trace that took about 5
minutes), but 12+ hours of replay for a 20-minute trace was a deal breaker for
server testing, and disproportionately long compared to what I had come to
expect. I didn’t have time to troubleshoot
the issue; our Infrastructure team was nearly at its time limit to finish testing
on the new storage candidates.
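A quick way to see why that estimate was a deal breaker is to express each replay as a multiple of its trace length. A minimal sketch, using the durations from the text (the 12-hour figure is the DRU's own estimate, not a measured result):

```python
def replay_ratio(replay_minutes: float, trace_minutes: float) -> float:
    """Minutes of replay per minute of captured trace."""
    return replay_minutes / trace_minutes

control = replay_ratio(10, 5)         # controlRun: 10 min replay, 5 min trace
heavy_io = replay_ratio(12 * 60, 20)  # HeavyIORun: 12 hr estimate, 20 min trace

print(f"controlRun: {control:.0f}x real time")   # 2x
print(f"HeavyIORun: {heavy_io:.0f}x real time")  # 36x, and that was the floor
```

A 2x slowdown was tolerable for a control run; an estimated 36x or worse made the replay useless for time-boxed SAN testing.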
Hoping against hope that the DRU’s estimate was inaccurate
(although its stats for the preprocessing had been pretty accurate), I gave it
3 hours before I gave up and recommended doing actual test runs instead.
How disappointing! Where did it go wrong?
I quickly found that I wasn’t alone. Questions of the same nature were easily
found on the internet; unfortunately, answers were not. It quickly became clear that if I wanted to
know why the DRU wasn’t performing as expected, I would need to try to find out for myself.