Serious Performance Issues with Compaq SAN and SQL

  • We had a similar scenario on our SAN. The SAN support folks said 'The array is nearly idle', yet Win2K clearly seemed to be waiting on the SAN. It turned out there was a bottleneck in sending all I/Os to a single LUN: the SAN was presenting one large LUN that Windows assigned a single drive letter. We could only get around 260 I/Os per second against a large RAID 10 array before queuing started, and we often saw more than 1,000 queued I/Os. We had 16 GB of cache and two FC adapters on the server (8-way, 8 GB), but weren't seeing anywhere near the performance we would expect against a similar number of ordinary SCSI drives.

    When we did nothing but separate the array into two LUNs (either by presenting two LUNs from the SAN and assigning two drive letters, or by using dynamic disks to assign a single drive letter), we doubled our throughput. All of our vendors just pointed fingers at each other. We were fortunate enough to be able to drastically decrease the load the application was putting on the SAN, so the issue went away. As we left it with our vendors, it was last thought to be some sort of limitation in the Win2K SCSI port drivers or in how the FC adapters interacted with them. If you have any luck with MS, please let us know!

    A couple of thoughts that don't mean much: generally, we like to see queued I/O at <2 per physical drive for any sustained period of time, and by that measure you wouldn't seem to be that far out of line. You may also want to check '% Disk Idle Time' instead of '% Disk Busy Time', as the latter is usually incorrect in a SAN environment. However, if your end users are seeing slow response times and there appears to be no other bottleneck, then I would guess you are likely seeing the same issues we saw. Good luck!
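
    One way to see that kind of stall from the SQL Server side rather than from Perfmon is the virtual file stats function. A minimal sketch, assuming SQL Server 2000 and a hypothetical database named MyDatabase (file 1 is the primary data file, file 2 is typically the log):

    ```sql
    -- Cumulative I/O counts and total stall time, per database file.
    -- MyDatabase is a hypothetical name; substitute your own.
    DECLARE @db int
    SET @db = DB_ID('MyDatabase')

    SELECT FileId, NumberReads, NumberWrites, IoStallMS
    FROM ::fn_virtualfilestats(@db, 1)   -- primary data file
    UNION ALL
    SELECT FileId, NumberReads, NumberWrites, IoStallMS
    FROM ::fn_virtualfilestats(@db, 2)   -- log file
    ```

    If IoStallMS keeps climbing while the array reports being nearly idle, the host really is waiting on the disk path somewhere.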

  • I ran into a problem where one drive in our RAID 5 array was bad, but not in such a way that the controller would fail the drive. The drive was responding very slowly to write requests and kept telling the controller that it needed more time to service a request, so the controller never threw a timeout error. In fact, it never threw an error at all; it was just slow. During troubleshooting, we became suspicious after disbanding the array, configuring the drives as just a bunch of disks, and writing 2 GB files to each of them: one drive was noticeably slower. So we tried the diagnostic from the manufacturer and it *failed*. I learned that sometimes only the manufacturer's tests can detect a bad drive. For my Compaq, that meant running Seagate's SeaTools: http://www.seagate.com/support/seatools/index.html
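
    A rough way to run that same kind of per-drive write comparison from inside SQL Server is to time an identical backup written to each suspect disk. This is only a sketch; TestDB and the probe paths are hypothetical, and the target directories must already exist:

    ```sql
    -- Write the same backup to two different physical drives and compare
    -- the elapsed times; a quietly failing drive should stand out.
    DECLARE @t datetime

    SET @t = GETDATE()
    BACKUP DATABASE TestDB TO DISK = 'D:\probe\testdb_d.bak' WITH INIT
    SELECT DATEDIFF(ms, @t, GETDATE()) AS [D: write time (ms)]

    SET @t = GETDATE()
    BACKUP DATABASE TestDB TO DISK = 'E:\probe\testdb_e.bak' WITH INIT
    SELECT DATEDIFF(ms, @t, GETDATE()) AS [E: write time (ms)]
    ```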

  • Thanks everyone for the insight on this issue.

    I have spoken to the SAN Administrator(s) and we have decided to try the multiple LUN approach and see if that alleviates some of our issues.

    I will post the results here.

  • I am using similar technology but on a smaller scale: we are running HP/Compaq's (whatever) MSA1000. A couple of things I watch very carefully are that the disk queue never exceeds 2x the physical disk count and that my average I/O rate never exceeds 80% of the physical disks' capacity; Compaq 10K drives are rated at about 130 IOs/sec. The other issue we faced when creating our disk subsystem was the difference between RAID 5 and RAID 10. With RAID 5 it was often better to have multiple smaller physical arrays for performance and fault tolerance, but with RAID 10, massive drive counts in a single physical array blew the RAID 5 arrays off the map in terms of performance. This might be a consideration for you. A quick sanity check of those rules of thumb is sketched below.
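
    Here is the arithmetic for a hypothetical array of ten 130 IOs/sec drives (illustrative numbers only, not my MSA1000's actual configuration):

    ```sql
    -- Back-of-the-envelope thresholds for a hypothetical 10-drive array.
    DECLARE @drives int, @rated_ios int
    SET @drives = 10          -- physical drives in the array
    SET @rated_ios = 130      -- rated IOs/sec per Compaq 10K drive

    SELECT
        @drives * 2                 AS [Max sustained disk queue],    -- 2x disk count = 20
        @drives * @rated_ios * 0.8  AS [I/O rate ceiling (IOs/sec)]   -- 80% of 1,300 = 1,040
    ```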

  • We had nearly the same issue with a Compaq MA8000 array and ProLiant servers running W2K and SQL Server 2000 Enterprise.

    It turned out that W2K SP3 was buggy in environments with Compaq disk arrays (some hotfixes that were issued for SP2 are missing from SP3).

    After we rolled back to SP2 on W2K, the problem disappeared.

    I would suggest checking whether one of your upgrades could have caused this issue.

    As a test, I would roll back the recent upgrades step by step to find out which one was buggy.

    Bye,
    Gabor

  • After reviewing the previous suggestions to add more LUNs to the environment, I decided to test spreading one database across two LUNs, then three, then four.

    The results were very definitive. The Performance Monitor metrics for each phase are posted below so you can see for yourselves; a sketch of one way to lay out the files follows the tables.

    Phase 1: One LUN for data (as configured in the production environment)

    Averages
    PhysicalDisk Counter              Drive D
    Avg. Disk Queue Length              1.548
    Avg. Disk Read Queue Length         0.145
    Avg. Disk Write Queue Length        1.403
    Current Disk Queue Length           1.581

    Maximums
    PhysicalDisk Counter              Drive D
    Avg. Disk Queue Length             30.988
    Avg. Disk Read Queue Length         3.327
    Avg. Disk Write Queue Length       30.988
    Current Disk Queue Length          97.000

    Phase 2: Two LUNs for data

    Averages
    PhysicalDisk Counter              Drive D  Drive J
    Avg. Disk Queue Length              1.000    0.986
    Avg. Disk Read Queue Length         0.000    0.000
    Avg. Disk Write Queue Length        1.000    0.986
    Current Disk Queue Length           0.974    0.795

    Maximums
    PhysicalDisk Counter              Drive D  Drive J
    Avg. Disk Queue Length             20.574   20.610
    Avg. Disk Read Queue Length         0.000    0.000
    Avg. Disk Write Queue Length       20.574   20.610
    Current Disk Queue Length          53.000   52.000

    Phase 3: Three LUNs for data

    Averages
    PhysicalDisk Counter              Drive D  Drive J  Drive K
    Avg. Disk Queue Length              0.424    0.421    0.421
    Avg. Disk Read Queue Length         0.000    0.000    0.000
    Avg. Disk Write Queue Length        0.424    0.421    0.421
    Current Disk Queue Length           0.323    0.213    0.210

    Maximums
    PhysicalDisk Counter              Drive D  Drive J  Drive K
    Avg. Disk Queue Length              8.567    8.856    8.844
    Avg. Disk Read Queue Length         0.000    0.000    0.000
    Avg. Disk Write Queue Length        8.567    8.856    8.844
    Current Disk Queue Length          25.000   29.000   27.000

    Phase 4: Four LUNs for data

    Averages
    PhysicalDisk Counter              Drive D  Drive J  Drive K  Drive L
    Avg. Disk Queue Length              0.426    0.423    0.423    0.422
    Avg. Disk Read Queue Length         0.000    0.000    0.000    0.000
    Avg. Disk Write Queue Length        0.426    0.423    0.423    0.422
    Current Disk Queue Length           0.443    0.429    0.451    0.463

    Maximums
    PhysicalDisk Counter              Drive D  Drive J  Drive K  Drive L
    Avg. Disk Queue Length              8.681    8.446    8.862    8.697
    Avg. Disk Read Queue Length         0.000    0.000    0.000    0.000
    Avg. Disk Write Queue Length        8.681    8.446    8.862    8.697
    Current Disk Queue Length          19.000   23.000   21.000   21.000
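
    For anyone repeating this test, here is a minimal sketch of how the data can be spread across the extra LUNs (the database name, logical file names, and paths are hypothetical). Adding one data file per new drive letter to the default filegroup lets SQL Server's proportional fill distribute the write load across all of them:

    ```sql
    -- Hypothetical example: add a data file on each new LUN (J:, K:, L:)
    -- so writes are spread across all four drive letters.
    ALTER DATABASE MyDatabase
    ADD FILE (NAME = MyDatabase_Data2, FILENAME = 'J:\Data\MyDatabase_Data2.ndf', SIZE = 1024MB)

    ALTER DATABASE MyDatabase
    ADD FILE (NAME = MyDatabase_Data3, FILENAME = 'K:\Data\MyDatabase_Data3.ndf', SIZE = 1024MB)

    ALTER DATABASE MyDatabase
    ADD FILE (NAME = MyDatabase_Data4, FILENAME = 'L:\Data\MyDatabase_Data4.ndf', SIZE = 1024MB)
    ```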
