• So after talking with MS and EMC for over two weeks, we've gotten down to what our problem was/is.

    When we upgraded our machine from 6 GB to 36 GB of memory, we could keep all our tables in memory. The side effect was that our maintenance job now floods the disks with write requests from those in-memory tables, at more IOPS than the disks can handle. Once the SAN's 4 GB write cache fills up, I/O comes to a screeching halt on the machine, as it has to start retrying and work through the large queue of requests waiting on the server. This makes our node less than fully responsive, because it pegs the first CPU (which, ironically, in a default setup is the only CPU that handles network requests). So for us the "fix" was a few things:

    1. We turned off the write cache on the LUN the DB was attached to. That made the disks slower, so the host could no longer throw thousands of write requests at the LUN at once, and it throttled better.

    2. We also applied a registry change, as requested by MS, that allows the network load to be assigned to any open processor rather than relying on one proc (I'll paste the article number and some more info below). The single-processor default that this change addresses may by itself be one of the root causes of your cluster "Network" errors.

    3. We also tuned the query to reduce the I/O load.

    4. We changed the LUN from RAID 5 to RAID 10 (with a few extra disks for I/O performance) and turned the write cache on the SAN back on.
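    To put rough numbers on why the RAID change in step 4 helps with a write-heavy job, here is a minimal sketch using the commonly cited rule-of-thumb write penalties (about 4 back-end I/Os per host write on RAID 5, about 2 on RAID 10). The workload figures are made up for illustration and are not measurements from our system.

```python
# Rule-of-thumb back-end I/Os generated per host write for each RAID level.
# These are textbook approximations, not values measured on any real array.
WRITE_PENALTY = {"raid5": 4, "raid10": 2}


def backend_iops(read_iops, write_iops, raid_level):
    """Estimate the disk-level IOPS produced by a given host workload."""
    return read_iops + write_iops * WRITE_PENALTY[raid_level]


# Hypothetical maintenance-job workload: 500 reads/s and 2000 writes/s.
for level in ("raid5", "raid10"):
    print(level, backend_iops(500, 2000, level))
```

    With those made-up numbers the same host workload costs the RAID 5 set close to twice the disk I/O of the RAID 10 set, before even counting the extra spindles.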

    I'd check whether you're getting any I/O requests taking longer than 15 seconds, as that also points to a disk issue. As a test, lowering our SQL Server memory setting back down to 6 GB eliminated the issue for us too, because the disks then had to serve reads, which slowed down the flood of writes.
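    If you want to tally those 15-second warnings rather than eyeball them, something like the sketch below could work. It assumes you have already read the SQL Server error log into a list of text lines, and the message wording in the pattern is as I remember it from our logs, so adjust it if yours differs. The function name and sample lines are made up for illustration.

```python
import re

# Assumed wording of the SQL Server stalled-I/O warning; verify against your own log.
SLOW_IO = re.compile(
    r"encountered (\d+) occurrence\(s\) of I/O requests taking longer than "
    r"15 seconds to complete on file \[(.+?)\]"
)


def slow_io_by_file(lines):
    """Sum the reported slow-I/O occurrences per database file."""
    totals = {}
    for line in lines:
        match = SLOW_IO.search(line)
        if match:
            count, path = int(match.group(1)), match.group(2)
            totals[path] = totals.get(path, 0) + count
    return totals


# Made-up sample lines in the assumed format.
sample = [
    "SQL Server has encountered 12 occurrence(s) of I/O requests taking longer "
    "than 15 seconds to complete on file [E:\\data\\app.mdf] in database [app] (5).",
    "Some unrelated log line.",
    "SQL Server has encountered 3 occurrence(s) of I/O requests taking longer "
    "than 15 seconds to complete on file [E:\\data\\app.mdf] in database [app] (5).",
]
print(slow_io_by_file(sample))
```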

    So while we really improved performance for our users by adding the additional RAM, we didn't take into account the additional load that could now be placed on the disks.

    Fun stuff.

    -Greg

    MS info:

    The processor load is not distributed across multiple processors on a computer that is running Windows Server 2003, Windows 2000 Server, or Windows NT 4.0

    System error 64 has occurred. The specified network name is no longer available.

    http://support.microsoft.com/kb/892100

    Procs  HEX   BIN
    2      0x3   0b11
    3      0x7   0b111
    4      0xF   0b1111
    8      0xFF  0b11111111
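    The values in that table are just one bit set per processor, so for a processor count that isn't listed you can work the mask out yourself. A minimal sketch (the function name is mine, not from the MS article):

```python
def processor_mask(num_procs):
    """Return a bitmask with the lowest num_procs bits set, one bit per processor."""
    if num_procs < 1:
        raise ValueError("need at least one processor")
    return (1 << num_procs) - 1


# Reproduce the rows of the table for 2, 3, 4 and 8 processors.
for n in (2, 3, 4, 8):
    mask = processor_mask(n)
    print(n, hex(mask), bin(mask))
```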

    Also, some additional notes from our MS case:

    Apply the MANDATORY Microsoft hotfix 946448, required for all Storport driver installations (for Windows 2003 SP1/SP2) - http://support.microsoft.com/kb/946448