• Funny thing is that it crept into our main cluster a few months ago... at that time it wasn't impacting anyone... but it's gotten progressively worse... We have gone through the exact same motions you did... except considering rebuilding the LUNs. 🙂

    We have sent this info off to Microsoft (we have an ongoing open case about this) to get some more clarification as to why this would work... and we're still going to work with our SAN vendor 'just to be sure'... but my team and I were talking, and to us it is PLAUSIBLE that maybe it really is a network/disconnect thing... and that the spike in disk activity is secondary... i.e. a disconnect forces SQL Server to commit or roll back... potentially causing a momentary disk spike... (a thought anyway).

    How we replicated the error... hmmm... well... we are getting a new/upgraded system in here, and last week they went through a mock conversion. One part of the process runs a stored procedure that creates/runs dynamic SQL against the server... so, it was running against a node in the cluster... and it kept failing for most of the day... until it finally ran clean -- it took about 2 hours. I took the database to another newly-made node, re-did my test, and it failed every time. I applied the poke, and it ran cleanly right off the bat. I removed the poke, and the proc failed... then SQL Profiler started getting disconnects, and so did another query window I was using to check the status of the running proc. Put the poke back in, and the node was happy. I'm not sure I could tell you how to make a generalized script/test out of it... if you're interested, I'll see what I can do.

    -- Mike