I apologize in advance for the store, but it is the only way I can tell what happened.
On Thursday of last week, our largest customer was running some stress test for a new infomercial they are getting ready to run. Nothing strenuous 5 concurrent create customer/create orders for 50 new customers. During this run, they returned approximately 25% deadlocks. 🙁 Now, as my company really really really likes to nest procs (12 deep in some places), I decide that I am going to run a profiler trace.
So I setup a trace with the following events
SQL Batch: Starting
My columns are pretty standard
DBName, DBID, EventSequence, TextData, Duration, Start/Endtime, ObjectName, OjbectID, ObjectID2, and TransactionID (I might be forgetting one, but it was last week)
Of course I setup the trace on another machine and had it write to a file on my personal computer so there was as little performance hit as possible on the server.
Had the customer run the process again and BOOM, deadlocks. Trace captures it, all is fine and dandy. Followed the chain and figured out what is going on. Get about 15K rows returned, which seemed like a lot, but I got the data I wanted and who was I to complain, I found the issue and fixed it.
Of course, just to verify, I setup the same trace and have the customer re-run the process. Good news, no more deadlock issues. Bad new. We are now getting timeout issues. Approximately same number of rows returned. Time to start optimizing. Dropped queries that had durations of 120-130 seconds and dropped them to below 30 seconds (in answer to your question, now we don’t have baselines, which I have been arguing for. Now I have a little more in my pocket).
Skip forward a couple of days to last night. My manager restores the database from the 32 bit machine it was on and sticks it on a 64 bit machine just for testing. Of course everything goes through great on the one run they did last night. Switches them back to the 32 bit machine and everything runs perfect as well. No timeouts, no deadlocks, longest duration is 10-15 seconds. Sweet, progress. Maybe it is just because production database (which is located on thee same server as the staging database 🙁 ) had nothing going through it. At this point they say they are going to do some more testing on the old machine, I get called to setup a trace and see if we can figure out why things are going bad. I set it up and go watch some TV at 8p.m. 11 p.m. rolls around and BOOM, start getting script errors out the rear end, and have to restart the instance. Call it a night
Today, start testing on the 64 bit. Setup the trace and everything that ran the night before failed due to timeouts. WTF! We finally realize it is failing due to the trace. Cut the trace down to just SP:Completed and RPC:Completed and everything runs like a street louger on a strait downhill slop.
Now, my questions:
1) why is it the trace caused the timeout issues on both the 32 and 64 bit servers. I know that a profiler trace can cause some performance degradation, but I have never seen/heard of anything this bad
2) when I setup the trace file last night and restart the server I checked the files on my harddrive and had almost 400mb worth of files from about 3 hours of actually trace. While I can see how this would slow down the network, what I am trying to determine is why it would shutdown sql server. Specifically, what happens when a profiler trace is getting so much information that it bottle necks on the side of the server. Does the data get stored into a temporary trace file in the instance, or what.
3) Could doing this, what could have caused the script errors that eventually lead to the database shutting down and needing to be restarted?
Let me know if there is anything that you need clarification on.
By the way, without running a trace, they were able to run 15 concurrent create customer/orders for 1000 orders with everything running less than 2 seconds.