The Distributed Replay Utility: Replaying for Real (Where Could It Possibly Go Wrong?)

The Short Answer? (In at least a couple of places.)

When I first learned about the DRU, I was very excited about the possibilities. I began to play with the utility in our test and stage environments, which are on a different SAN than our production environment. I experimented with the architecture, putting the controller and client on one server, and putting the client and target on one server. I tried the default architecture of one controller, one client, and one target. I modified the configuration files and used different switch options. I ran traces of myself running horrible queries that didn't physically change a thing, and traces where I was actually modifying data (usually just adding a database, then adding a table to that database), to assure myself that the replay was really, physically modifying the target server.
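For anyone who hasn't worked with the utility yet, those switch options live on the DReplay.exe command line. The commands I was running looked roughly like this (server names, paths, and the trace file are placeholders, and I'm only showing the switches I touched most often):

    rem Preprocess the captured trace into the controller's working directory
    dreplay preprocess -m DRUController01 -i "D:\Traces\controlRun.trc" -d "D:\DRU\WorkingDir"

    rem Replay the preprocessed events against the target, using the named client
    dreplay replay -m DRUController01 -d "D:\DRU\WorkingDir" -s TargetSQL01 -w DRUClient01 -o

    rem Poll the controller for progress every 60 seconds
    dreplay status -m DRUController01 -f 60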

After doing a number of smaller, limited practice runs to assure myself that I had a basic understanding of the utility and how it worked, I learned that our Infrastructure team wanted to use the DRU as part of performance testing for the new SAN they wanted to buy. Once we were down to a couple of different vendors, it was decided that we would use the DRU to test several different workloads: one that was more CPU intensive, one that was heavy on IO, one critical system processing run, and one run to test the processing of a tabular cube.

This went well beyond anything I had ever used the utility for. Additionally, because the test was being done in what was essentially an isolated environment (i.e., the existing SAN could talk to the new test SAN, but they wouldn't during workload testing), it presented a unique challenge. To further complicate matters, both the IO-heavy and critical system processing runs used transactional replication in production, so the performance of replication would need to be assessed.

I began by prepping the environment on our existing SAN. I set up the DRU controller in our stage environment and ran the needed traces on the client test instances there. Then, the question of how to set up the isolated test SAN environment for the replay had to be resolved. Another DRU controller was set up on a SQL Server instance on the new SAN; this new controller also served as the client and the target. This time, only the databases necessary to run the needed processes were restored. Logins were copied over to the new test SAN environment, and permissions were confirmed. One of my teammates then set up a distribution server and a server for the replicated data on the test SAN, and replication was set up and ready to go. Lastly, the traces were copied over. Infrastructure would vMotion the storage from one SAN candidate to another, making that part of the process transparent to us. Between tests, we just needed to tear down replication, do database restores, fire replication back up, and we'd be ready to go again.
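The between-tests reset lent itself to a simple script. What follows is just a sketch of the idea, with placeholder server, database, and path names, and with the replication re-setup hand-waved into a call to the scripts my teammate had already written:

    rem 1. Strip replication off the database so the restore isn't blocked
    sqlcmd -S TestSAN-SQL01 -E -Q "EXEC sp_removedbreplication N'HeavyIODB';"

    rem 2. Restore the databases the workload needs
    sqlcmd -S TestSAN-SQL01 -E -Q "RESTORE DATABASE HeavyIODB FROM DISK = N'D:\Backups\HeavyIODB.bak' WITH REPLACE, RECOVERY;"

    rem 3. Fire replication back up using the existing publication/subscription scripts
    sqlcmd -S TestSAN-SQL01 -E -i "D:\Scripts\SetupReplication.sql"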

I used one of my last test runs from when I was first playing with the utility on the existing SAN as a control, naming it controlRun. It had used processes such as nested cursors and CHECKDBs to generate load on the server, but modified no data, and it clocked in as a 5-minute trace. I hoped it would give me some kind of idea what to expect in terms of preprocessing and playback times. I had the controller set to all the default settings for the replay. Next, I began preprocessing data. I had two of our production workloads to work with, the CriticalSystemProcessingRun and the HeavyIORun. The CriticalSystemProcessingRun, while generating IO at the beginning and end of its process, was generally more CPU intensive and a much longer process, usually taking about 2 hours. The HeavyIORun, however, lived up to its name: it typically processed millions of rows, but took only about 20 minutes or so to run.
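For reference, "default settings" means I hadn't touched the replay configuration on the controller. From memory, the relevant section of DReplay.Exe.Replay.config looks roughly like the following; treat the element names and default values as a sketch to check against your own controller rather than as gospel (the target server name is a placeholder):

    <Options>
      <ReplayOptions>
        <Server>TargetSQL01</Server>                     <!-- placeholder target name -->
        <SequencingMode>stress</SequencingMode>          <!-- stress or synchronization -->
        <ConnectTimeScale>100</ConnectTimeScale>         <!-- % of captured connect times -->
        <ThinkTimeScale>100</ThinkTimeScale>             <!-- % of captured think times -->
        <HealthmonInterval>60</HealthmonInterval>
        <QueryTimeout>3600</QueryTimeout>
        <ThreadsPerClient>255</ThreadsPerClient>
        <EnableConnectionPooling>Yes</EnableConnectionPooling>
      </ReplayOptions>
      <OutputOptions>
        <ResultTrace>
          <RecordRowCount>Yes</RecordRowCount>
          <RecordResultSet>No</RecordResultSet>
        </ResultTrace>
      </OutputOptions>
    </Options>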

Comparing preprocessing on the new test SAN vs. the controlRun, which was done on the existing SAN

Clearly, both the CriticalSystemProcessingRun and the HeavyIORun are much larger workloads than the very limited runs I had previously done. It also seems clear that the test SAN is much faster. I saw the difference reviewing the first step of preprocessing, where the utility is sucking up the data to prepare for replay. Our existing SAN took a full 30 seconds to preprocess the first step of the controlRun, whereas the test SAN took a minute to process over a million events on the HeavyIORun.

It’s also interesting to note how consistent the second step of preprocessing is across workloads and SAN environments (Figure 1):

Figure 1) Metrics for test SAN preprocessing. Step 2 info taken from DRU replay screen.
Figure 2) Looking at event division in step 2 of preprocessing. Since I stopped the CriticalSystemProcessingRun when it was only 30% finished, I took that into consideration in the first column by multiplying the total events (55,105,119) by 0.30 and then dividing by the minutes.

Each ten percent interval on the CriticalSystemProcessingRun took exactly 73 minutes. Likewise, the HeavyIORun took exactly four minutes to process each 20 percent interval. Meanwhile, the controlRun had taken two minutes to process all 143,443 events. I decided to look a little deeper at the average number of events processed per minute (Figure 2), and there was some difference there. This leads me to believe that while this step may be influenced by the type of events it processes, it is also a matter of how the DRU itself works: the utility seems to look at the total number of events and divide them up as evenly as possible for processing. This consistency held through all three runs despite the fact that the three workloads in question were very different in duration, in character, and in what they did on the server.
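As a back-of-the-envelope illustration of that even split (this is my arithmetic from the numbers above, not a screen from the utility):

    CriticalSystemProcessingRun: 55,105,119 events / 10 intervals ≈ 5.5 million events per 10% interval
                                 5.5 million events / 73 minutes  ≈ 75,000 events per minute
    controlRun:                  143,443 events / 2 minutes       ≈ 72,000 events per minute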

It may be apparent that I stopped looking at the preprocessing on the CriticalSystemProcessingRun at 30%. That is because it had already taken well over 3 hours to get that far, and I could see that the events being processed weren't going to go any faster as time went on. That clearly wasn't going to work for performance testing on a deadline!

Trying the HeavyIORun instead was more productive: its preprocessing finished in 20 minutes. I thought we were back on track until the replay began and the DRU told me that the HeavyIORun replay was going to take over 12 hours (for a 20-minute trace)! Compared to my previous testing, the controlRun trace replay had also been slow (10 minutes for a trace that took about 5 minutes), but 12+ hours of replay for a 20-minute trace was a deal breaker for server testing, and disproportionately long compared to what I had come to expect. I didn't have time to troubleshoot the issue; our Infrastructure team was nearly at its time limit to finish testing on the new storage candidates.

Hoping against hope that the DRU's estimate was inaccurate (although its stats for the preprocessing had been pretty accurate), I gave it 3 hours before I gave up and recommended doing actual test runs with the applications instead.

How disappointing! Where did it go wrong?

I quickly found that I wasn't alone. Questions of the same nature were easily found on the internet; unfortunately, answers were not. It became clear that if I wanted to know why the DRU wasn't performing as expected, I would need to find out for myself.
