Distributed Replay

  • I've got a trace of some end-of-month activity (it is time sensitive and I can't just recapture it) that is about 75 GB.
    It took 14 hours to preprocess yesterday.
    I tried to replay it today (using 4 clients), and in the dispatch phase it only dispatches 100 or so items (a different number each time)
    and then says all complete, 0% pass rate, etc.
    The log shows this:
     CRITICAL  [Controller Service] Exception - UnAdviseCallback
     CRITICAL  [Controller Service] **** Critical Error ****
     CRITICAL  [Controller Service] Machine Name: xxxxxxx
     CRITICAL  [Controller Service] Error Code: 0xC8502200
     OPERATIONAL [Controller Service] Event replay completed.

    For giggles, I replayed a prepared trace I worked with on the same set of machines a couple of weeks ago (one that is significantly smaller), and it ran through the entire dispatch phase without any issues (same clients as well). This set, a controller and 4 clients, has been used successfully in the last couple of weeks, so I know it is not a setup or inter-server security issue. Do you think something went wrong in the prepare phase and it needs to be redone? Is my trace simply too big for this service to read the results of?
    I've restarted the controller and all client services, to no avail.
    This trace replay and its outcome have a good bit riding on them, and I will keep looking across the googles, but if anyone's seen this and has some insight to share, I'd sure appreciate it!
    Thanks!

  • For good measure, and to make sure I'm not losing my mind, I ran a new trace this morning and capped it at 1 GB. I preprocessed it, and it is replaying now without incident using the same clients/controller and command line I used earlier (save for the working directory path).

    The only differences are the size of the trace file and the resulting IRF file, and that the big file contains the load I actually need to replay.

  • I ended up opening a case with MS. I proved to them it was all configured properly by running a trace, then preparing and replaying it in this configuration.
    They ran some procmon captures of a prepared trace that runs and of my bad one. It's been over a week. I have very little hope this leads to anything other than the customer understanding this is an MS problem and not a me problem...
    I'm going to try to replay a smaller capture of the load today, about 22 GB worth of trace. Fingers crossed this works and gets us over...

  • I have done a lot with Distributed Replay. Some things you should know:

    Distributed Replay is 32-bit code from the SQL2005 days, even if you pulled it off the 2016 install media. Have fun with the support case.

    The only way you're going to get that trace replayed is if you stick to some (unpublished) hard numbers:
    - Each trace rollover file should be kept at 200MB or less. The full trace may be larger, but each individual file must be at 200MB or less.
    - Break your trace files into smaller sets, and perform multiple runs to get through your whole trace. I don't recall the full-trace cap; it was either 40 files or 40GB.

    Eddie Wuerch
    MCM: SQL
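A minimal sketch of how the two rules above could be turned into a pre-flight check before preprocessing. This is only an illustration: the 200 MB per-file limit and the 40-file cap are the unpublished, unverified numbers from the post (and the cap may really be 40 GB rather than 40 files), the function names `plan_runs` and `scan_trace_dir` are hypothetical, and ordering files by modification time is an assumption about how the rollover files were written.

```python
from pathlib import Path

# Numbers taken from the post above; unpublished and unverified.
MAX_FILE_BYTES = 200 * 1024 * 1024  # per-rollover-file limit
MAX_FILES_PER_RUN = 40              # assumed reading of "40 files or 40GB"


def plan_runs(files, max_file_bytes=MAX_FILE_BYTES, max_files=MAX_FILES_PER_RUN):
    """Split (name, size_bytes) pairs into ordered batches of at most max_files.

    Returns (batches, oversized): batches is a list of lists of names in the
    original order; oversized lists names larger than max_file_bytes.
    """
    oversized = [name for name, size in files if size > max_file_bytes]
    names = [name for name, _ in files]
    batches = [names[i:i + max_files] for i in range(0, len(names), max_files)]
    return batches, oversized


def scan_trace_dir(trace_dir):
    """Collect (name, size) for *.trc files, oldest first.

    Assumption: modification time reflects rollover order.
    """
    paths = sorted(Path(trace_dir).glob("*.trc"), key=lambda p: p.stat().st_mtime)
    return [(p.name, p.stat().st_size) for p in paths]
```

Calling `plan_runs(scan_trace_dir(some_dir))` would give one batch per preprocess/replay run plus a list of files that exceed the per-file limit and would need to be recaptured or split with a smaller rollover size.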

  • Thanks for the reply. This is the first mention I've come across of any file size limits. My MS support rep didn't seem aware of any of these guidelines when he and I spoke yesterday.

    I've successfully replayed a (single) 5 GB trace.

    Why would the number of files or the max trace file size matter (vs. a maximum TOTAL size), as the preprocess rolls it all into one IRF file? The preprocess works fine; my error occurs in the replay dispatch phase, seconds into it. It will dispatch one group of events, then bork.

  • LAW1143 wrote (Friday, May 11, 2018 4:40 AM):

    "The preprocess works fine; my error occurs in the replay dispatch phase, seconds into it. It will dispatch one group of events, then bork."

    I encountered a similar error trying to do a replay in my environment last week, and I ultimately figured out it was because one of the DReplay client machines didn't have enough disk space for its piece of the replay event data during the event dispatch phase. Once I resolved that, replay would proceed OK.

    In my case, I'm replaying ~130 GB worth of event data, which was originally captured with trace file rollover at 1 GB. Preprocessing worked OK (though it took ~3 days!), and replay was going OK once I got past the initial failure in ~8 seconds because of client disk space. I haven't yet let the replay proceed for longer than about 30 minutes as I'm still tuning the target server.
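A rough sketch of the kind of free-space check that could be run on each client before starting a replay. The thread does not say how much space a client needs for its share of the event data, so `required_bytes` is an estimate you supply yourself; the helper names, the 20% headroom, and the idea of pointing it at the client's working directory are all assumptions for illustration.

```python
import shutil


def enough_space(free_bytes, required_bytes, headroom=1.2):
    """True if free_bytes covers required_bytes plus a safety margin."""
    return free_bytes >= required_bytes * headroom


def check_working_dir(path, required_bytes, headroom=1.2):
    """Check the volume that holds `path` (a hypothetical client working directory).

    Returns (ok, free_bytes) so the caller can log how much room is left.
    """
    free = shutil.disk_usage(path).free
    return enough_space(free, required_bytes, headroom), free
```

Running this against each client's working directory before the dispatch phase would at least rule out the disk space failure described above, which otherwise only shows up a few seconds into the replay.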

  • I don't have any testing on limits, but I can give you another data point to play with. My typical setup for Replay is 50 GB broken into 5 GB rollover files. I haven't tried it with a single 50 GB file, though. Since this is how I set up my first one, I haven't varied it a lot; I just kept reusing the same script.

    Glad to see some other people using DReplay. I love using it to predict performance for different environments and configurations, but I don't see it in use often.
