High Disk Queue Length

  • I'm running SQL 2005 on Win2k3 and we are getting really high avg. disk queue length that can spike up over 1000 (I set the scale to be 1). The RAID is setup as follows.

    RAID 1 (two disks)

    --Drive C (System)

    ----Windows Installation

    --Drive D (Data)

    ----User Database Log Files

    RAID 5 (three disks)

    --Drive D

    --User Database files, System Database log files

    I'm wondering what kind of performance increase I would see if I added two more hard drives to the RAID 5 or should I try something else before adding more drives? An avg. disk queue length of 1000 is REALLY high, so I'm wondering if something else might be going on like a failing disk or controller. Any suggestions on how to narrow this down would be greatly appreciated. Thanks in advance for any help.

  • kmauser (3/19/2009)


    I'm running SQL 2005 on Win2k3 and we are getting really high avg. disk queue length that can spike up over 1000 (I set the scale to be 1). The RAID is setup as follows.

    RAID 1 (two disks)

    --Drive C (System)

    ----Windows Installation

    --Drive D (Data)

    ----User Database Log Files

    RAID 5 (three disks)

    --Drive D

    --User Database files, System Database log files

    I'm wondering what kind of performance increase I would see if I added two more hard drives to the RAID 5 or should I try something else before adding more drives? An avg. disk queue length of 1000 is REALLY high, so I'm wondering if something else might be going on like a failing disk or controller. Any suggestions on how to narrow this down would be greatly appreciated. Thanks in advance for any help.

    If this is a Local Raid configuration having your User data file on a RAID 5 configuration is bad idea. Because a Raid 5 configuration bad for writes; check to see your avg. disk queue length for writes. I bet that is off the charts and your reads might okay (not perfect but okay).

    Another issue with the above configuration is sharing OS with SQL Logs. If you don't have a choice yes this works, but remember SQL writes to those files before they get written to database files.

    Thanks.

    [font="Arial"]---

    Mohit K. Gupta, MCITP: Database Administrator (2005), My Blog, Twitter: @SQLCAN[/url].
    Microsoft FTE - SQL Server PFE

    * Some time its the search that counts, not the finding...
    * I didn't think so, but if I was wrong, I was wrong. I'd rather do something, and make a mistake than be frightened and be doing nothing. :smooooth:[/font]

    How to ask for help .. Read Best Practices here[/url].

  • Look at some of the other counters for phys and logical disk before concluding that there is a problem. Go check the Avg. Disk Sec/Read, write, and transfer during production hours and tell me what those are. Get the total % disk time as well and let us know what those counters are doing.

  • I've done some more performance monitoring on the server and I've attached a screenshot that shows the server during a relatively tame period and as you can see it is thru the roof. The problem definitely seems to be on the E drive (RAID 5), because disk queuing reads/writes are really high. The C and D drives (mirror) are pretty calm and disk queuing almost always stays below 1 and never higher than 2.

    Notice the scale on the perfmon chart. Thanks again for the help.

  • This is a good counter, but again, you need to look at others as well. Did you divide the number of disks by your value to get the true queue length? Queue length counters require some calculation... I'd still request that you sent me the counters in my previous post to get a quick estimate of how well your disks are responding.

    Thanks much

  • As Lee said you really have to look at other counters; to get indication of throughput. He also made a good note those look scary but you have to divide that number by # of disks to get actual value. It is very easy to get mislead by the graph because it does not do some of the manual calculations required.

    Another thing to note is the duration for the spike; was it 1 second 2, 3 ? Prefmon only plots 100 points if you have more then that it starts averaging. If you ran the tool on default setting of every second for 1 Min and 40 Seconds then you are not really getting a good view of your server. Some skips are expected, question/problem you have to identify is how long the spike is lasting? This is where benchmarks come in, if a spike of 10 min does not degrade your performance then it isn't bad. But next time you get a spike of 15 seconds? Now you got to ask what changed? Bad query? New users? new databases? Why did your spike increase by factor of 50%? Ideally you start looking into it but don't worry about it right away; but if spikes start lasting 5+ min then you have issues; even 30+ seconds depending on your databaes. Thanks.

    Mohit.

    [font="Arial"]---

    Mohit K. Gupta, MCITP: Database Administrator (2005), My Blog, Twitter: @SQLCAN[/url].
    Microsoft FTE - SQL Server PFE

    * Some time its the search that counts, not the finding...
    * I didn't think so, but if I was wrong, I was wrong. I'd rather do something, and make a mistake than be frightened and be doing nothing. :smooooth:[/font]

    How to ask for help .. Read Best Practices here[/url].

  • Thanks again for the help. I've attached a screenshot of the disk sec counters that you asked for, sorry I forgot them them in my last post.

    Also, I was talking to one of our end users on the phone while I took the attached screenshot and she was watching a progress bar of a process that was updating inventory batches. Every time the perfmon chart shot up the progress bar would stop moving and once it dropped back down it would continue advancing. I could perdict what was happening with her progess bar like clockwork just by looking at the chart in real time.

  • Also, it is important to understand that the server isn't always pegged. Sometimes it has very low usage and then it will spike for a few seconds, other times it will be pegged out for 30+ min.

  • Heh this looks very much like a server I finished fixing recently LOL. But I see you are checking _Total; it is not a good idea to do that because sometime individual disks are doing better and it is only some disks having bad performance. And since we know it is your RAID 5 disk that is giving hard time look at those stats only so your stats don't get skewed by the other hdd counters being combined in.

    As for solution ... check the database in question:

    1) Index fragmentation, how fast they become fragmented after you reorganize the pages? Indicates if the fill factor is wrong? Or if you need to implement padding.

    2) Cluster indexes; do all major tables have cluster indexes?

    3) Check memory constraints; if server is under memory constraint it will have to constantly go to disk to get information because it can't keep it in cache.

    Thanks.

    Mohit.

    [font="Arial"]---

    Mohit K. Gupta, MCITP: Database Administrator (2005), My Blog, Twitter: @SQLCAN[/url].
    Microsoft FTE - SQL Server PFE

    * Some time its the search that counts, not the finding...
    * I didn't think so, but if I was wrong, I was wrong. I'd rather do something, and make a mistake than be frightened and be doing nothing. :smooooth:[/font]

    How to ask for help .. Read Best Practices here[/url].

  • This is what i have for measuring Disk.

    Using PHYSCIAL disk counters I assume this is correct and not LOGICAL DISK counter.

    Avg Disk Queue length

    -->How long have to wait and how many threads are waiting to be resolved.

    Avg Disk Sec/Read --->Time in second requests read of data from disk < 11-15ms

    Avg Disk Sec/Write ---> Time in second request write of data to disk < 12 ms

    Disk Reads/Sec

    Disk Write / Sec

    (Then use this equation)

    Raid 0 -- I/Os per disk = (reads + writes) / number of disks

    Raid 1 -- I/Os per disk = [reads + (2 * writes)] / 2

    Raid 5 -- I/Os per disk = [reads + (4 * writes)] / number of disks

    Raid 10 -- I/Os per disk = [reads + (2 * writes)] / number of disks

    Get number of iops from ISOEXE test or by specs of drives

    Example

    RAID 5 Configuration with four physical disks4

    Disk Read /Sec120

    Disk Write / Sec150

    Avg. Disk Queue Length12

    Avg. Disk Sec / Read0.035

    Avg. Disk Sec/Write0.045

    (You can use the above RAID calculations to determine the I/Os per disk. 120 reads + (4(number of disks) * 150 writes) / 4 physical disks = 180 I/Os per disk. This is over the general rule of 125 I/Os per disk capacity. This system has a disk bottleneck.

    No sure what the physical disk equation is if you are on a SAN (would this be number of LUNS)...

  • Hi,

    this could be because of some unknown blocking issue. Some Indexes could reduce disk queue length considerably. I suggest u to take a trace for 10 mins and analyse queries causing huge IO.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply