Speeding up database access Part 2 - Pinpointing Other Bottlenecks

  • Comments posted to this topic are about the item Speeding up database access Part 2 - Pinpointing Other Bottlenecks

  • In the Fragmentation section, where it instructs you to:

    DECLARE @DatabaseName sysname

    SET @DatabaseName = 'mydatabase' --use your own database name

    Then it uses it:

    FROM sys.dm_db_index_physical_stats(DB_ID(@DatabaseName), NULL, NULL, NULL, 'Sampled') ips

    The query only actually works against the current database, so there is no need to "DECLARE @DatabaseName sysname" - just use DB_ID() with no parameter:

    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'Sampled') ips
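
    A minimal sketch of how the whole query might look with that change (the columns selected here are illustrative, not necessarily the article's exact script):

    SELECT OBJECT_NAME(ips.object_id) AS table_name,
           ips.index_id,
           ips.avg_fragmentation_in_percent,
           ips.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'Sampled') ips
    ORDER BY ips.avg_fragmentation_in_percent DESC;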

  • The Disk Usage section is not relevant to SAN disks. We have systems that run at over 100% disk time all day. I think this is because a SAN has multiple disks and perfmon adds the percentages together, so you can quite happily have a system running with a disk time of 150% all day. It is more important to look at the average and current queue lengths.

  • Disk usage

    SQL Server is heavily disk bound, so solving disk bottlenecks can make a big difference. If you found memory shortages in the previous section, fix those first, because a memory shortage can lead to excessive disk usage in itself. Otherwise check these counters to see if there is a disk bottleneck for some other reason.

    Categories: PhysicalDisk and LogicalDisk

    % Disk Time - Percentage of elapsed time that the selected disk was busy reading or writing.

    Avg. Disk Queue Length - Average number of read and write requests queued during the sample interval.

    Current Disk Queue Length - Current number of requests queued.

    If % Disk Time is consistently over 85%, the disk system is stressed. Avg. Disk Queue Length and Current Disk Queue Length refer to the number of tasks that are queued at the disk controller or are being processed. You want to see a counter value of 2 or less. If you use a RAID array where the controller is attached to several disks, you want to see counter values of 2 times the number of individual disks or less.

    Part 8 will show how to fix disk issues.

    It is very important to look at each PhysicalDisk individually, otherwise you get an average.

    In my opinion the numbers above indicate a severely stressed disk subsystem.

    At half the above levels the disks are probably affecting performance.

    The Disk Usage section is not relevant to SAN disks

    Actually it is. First see above "look at each disk individually."

    If all you can get from your SAN is over 100%, complain to the vendor to fix their driver.

    More important is the fact that a SAN is just another way of connecting disks to the computer. It does not make any one disk spin faster, nor the heads settle onto a track any faster, nor does it change the performance advantages and disadvantages of the various RAID configurations. The only performance attribute the SAN brings to the table is higher total bandwidth. The rest of a SAN's advantages are administrative. Configuring a three-disk RAID 5 for the primary data is equally bad on a SAN, a high-end RAID controller, or JBOD.

    Having the transaction log file on dedicated spindles (e.g. dedicated heads) is still just as important.

  • Sigh - another article mentioning 300 as a threshold for Page Life Expectancy. Go read this article that explains why 300 is nonsense.

    Also, low PLE or BCHR (buffer cache hit ratio) does NOT mean SQL doesn't have enough memory - it means something is causing the buffer pool to thrash - rarely is it a memory issue. It's usually some change in what SQL Server is doing - e.g. large parallel table scans instead of index seeks because of a bad plan.

    Looking at fragmentation for tables with less than a few thousand pages is usually a waste of time, and your statement that fragmentation levels < 20 aren't worth dealing with is quite wrong. What if the fragmentation is focused on the 'in use' portion of a table or index? You should use the numbers from Books Online (which I wrote). Your query is also broken because you're not filtering on allocation unit type - all indexes/tables with LOB or ROW_OVERFLOW data will show up as fragmented. Fragment_count is a pretty useless number we made up to try to make fragmentation easier to understand - you should ignore it.
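
    A minimal sketch of the filtering being described (the 1,000-page cut-off is an illustrative value, not a number taken from the article or Books Online):

    SELECT OBJECT_NAME(ips.object_id) AS table_name,
           ips.index_id,
           ips.avg_fragmentation_in_percent,
           ips.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'Sampled') ips
    WHERE ips.alloc_unit_type_desc = 'IN_ROW_DATA' -- LOB/ROW_OVERFLOW units always look fragmented
      AND ips.page_count > 1000                    -- skip tiny tables and indexes
    ORDER BY ips.avg_fragmentation_in_percent DESC;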

    Thanks

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog: www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • This piece is just wrong.

    If % Disk Time is consistently over 85%, the disk system is stressed. Avg. Disk Queue Length and Current Disk Queue Length refer to the number of tasks that are queued at the disk controller or are being processed. You want to see a counter value of 2 or less. If you use a RAID array where the controller is attached to several disks, you want to see counter values of 2 times the number of individual disks or less.

    Firstly, with the number of layers between SQL and the disks with SAN or similar, it's near-impossible to diagnose a queue length as good or bad.

    The second reason is that there are operations in SQL Server that intentionally drive the queue length high: they submit large numbers of IOs and then carry on working while they wait.

    Queue length just isn't a useful metric these days.

    The problem with disk time is that it is not a value between 0 and 100. It can go to several times 100, depending on what's beneath that physical drive. A value of 500 may be perfectly fine.

    Rather stick with the latencies (Avg sec/read and Avg sec/write) for diagnosing disk-related bottlenecks.
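
    A minimal sketch of one way to get those latencies from inside SQL Server rather than Perfmon, using sys.dm_io_virtual_file_stats (the values are cumulative since the files were opened, so treat them as long-run averages rather than a point-in-time reading):

    SELECT DB_NAME(vfs.database_id) AS database_name,
           vfs.file_id,
           vfs.io_stall_read_ms  * 1.0 / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
           vfs.io_stall_write_ms * 1.0 / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    ORDER BY avg_read_ms DESC;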

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • On the compiles and recompiles...

    These counters will show high values at server start-up as every incoming query needs to be compiled. The plan cache sits in memory, so it doesn't survive a restart. During normal operation, you would expect compilations per second to be less than 100, and re-compilations per second to be close to zero.

    It depends, really. If I've got a system with very volatile data, where the statistics are changing fast, I want to see relatively high recompiles/sec as it means I'm getting execution plans optimal for the current state of the data. SQL recompiles for two reasons - accuracy and performance. Accuracy meaning that something in the schema has changed, performance meaning the stats have changed. If the stats have changed significantly I'd rather have a recompile than a non-optimal execution plan being reused.
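
    A minimal sketch for checking these from inside SQL Server; note that the values in sys.dm_os_performance_counters are cumulative since start-up, so sample twice and take the difference over the interval to get a per-second rate:

    SELECT counter_name, cntr_value
    FROM sys.dm_os_performance_counters
    WHERE object_name LIKE '%SQL Statistics%'
      AND counter_name IN ('SQL Compilations/sec', 'SQL Re-Compilations/sec');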

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi Gail and Paul,

    I've read several articles like this one, and they are frequently followed by posts saying that this metric or another is

    a) Pointless

    b) Out of date

    c) Just SQL folklore based on something Paul once said 😉

    d) All of the above

    Are there any metrics that can be trusted in Perfmon for diagnosing system performance problems? Do either of you know of, or have you written similar articles that are available online?

    Thanks,

    Chris

  • This is a fantastic discussion, slightly old but not seriously so:

    http://runasradio.com/default.aspx?showNum=81 (30 min or so podcast)

    Many of the metrics can be trusted and used, not in terms of hard thresholds, but in terms of what's normal. Take Paul's PLE example: say my production server usually has a PLE between 10000 and 15000 during business hours; I check perfmon today and find that it's sitting at 4000 - that is a cause for concern. Not because it's below some magic number, but because it's far from what's normal for my system.

    p.s. The PLE threshold at 300 was relatively sensible 10 or so years ago when servers had small amounts of memory. These days, not so much. A PLE of 300 means your entire buffer pool is getting discarded and fetched from disk in 5 minutes. On servers that can easily have > 64GB of memory, that's not great.
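
    A minimal sketch for capturing that kind of baseline straight from the DMVs (on NUMA machines the per-node Buffer Node values can differ from the overall Buffer Manager figure):

    SELECT object_name, counter_name, cntr_value AS ple_seconds
    FROM sys.dm_os_performance_counters
    WHERE counter_name = 'Page life expectancy'
      AND (object_name LIKE '%Buffer Manager%' OR object_name LIKE '%Buffer Node%');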

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I totally disagree with people saying that PLE (Page Life Expectancy) of 300 is an irrelevant number and how we should all watch our servers and look for deviations from usual PLE. BS! If your normal PLE is in 10-15 thousand range and it drops below 4000, then no, it's NOT a problem! The server is still performing well. While if it drops below 300 - you have a memory issue regardless of what your usual PLE is. I'm not saying 299 is bad, and 301 is ok, but if you're somewhere below that range, your server bogs down considerably. :w00t::w00t:

  • pkrudysz (7/30/2011)


    I totally disagree with people saying that PLE (Page Life Expectancy) of 300 is an irrelevant number and how we should all watch our servers and look for deviations from usual PLE. BS! If your normal PLE is in 10-15 thousand range and it drops below 4000, then no, it's NOT a problem! The server is still performing well. While if it drops below 300 - you have a memory issue regardless of what your usual PLE is. I'm not saying 299 is bad, and 301 is ok, but if you're somewhere below that range, your server bogs down considerably. :w00t::w00t:

    So you're saying you're totally happy if you're discarding and refreshing several hundred GB of cache in 10 minutes (PLE of 600), when normally that takes 20 hours?

    Think what PLE means. It means that on average a page in memory is going to stay there X seconds (where X is the PLE). Now back when we had maybe 1 GB of memory for SQL, I wouldn't have worried if my entire 1 GB buffer pool got discarded and replaced (from disk) in 5 minutes (PLE of 300). If I'm working with a modern server with 1 TB of memory and my PLE is less than a few thousand, I'm going to be very, very concerned.

    A sudden and dramatic drop in PLE means that something is causing the data cache to thrash (read, discard, read, discard) far, far faster than normally. If you think that's not something to be concerned with, well....
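
    As a rough sanity check on the arithmetic here (illustrative figures only): if the buffer pool turns over completely once every PLE seconds, the sustained read rate is roughly buffer pool size divided by PLE. A 1 TB buffer pool at a PLE of 600 would be pulling in around 1024 GB / 600 s, or about 1.7 GB/s, from disk just to keep the cache turning over - exactly the kind of thrash being described.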

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • pkrudysz (7/30/2011)


    I totally disagree with people saying that PLE (Page Life Expectancy) of 300 is an irrelevant number and how we should all watch our servers and look for deviations from usual PLE. BS! If your normal PLE is in 10-15 thousand range and it drops below 4000, then no, it's NOT a problem! The server is still performing well. While if it drops below 300 - you have a memory issue regardless of what your usual PLE is. I'm not saying 299 is bad, and 301 is ok, but if you're somewhere below that range, your server bogs down considerably. :w00t::w00t:

    Wrong. You're not reading what people including me are saying.

    If your PLE is 10000 and drops to 4000, who cares? It's if it drops and *stays there* that it's a problem. If it drops to zero and then climbs back up - who cares? But if it drops to zero and stays there, it's a problem.

    It comes down to a fundamental misunderstanding by most people of what PLE is telling you. It's an instantaneous measure of pressure on the buffer pool - it's nothing to do with memory. Pressure on the buffer pool could be because someone just started a large scan or ran sys.dm_db_index_physical_stats. Low PLE is only a problem when it stays low. Low PLE does not mean "add memory" - that's the typical knee-jerk reaction. It means the buffer pool is under sustained pressure. Usually that's from unexpected plan changes forcing buffer pool churn. The fix is not to add more memory - it's to correct the query plan.
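
    A minimal sketch of one way to look for that kind of plan-driven churn - assuming the plan-cache DMVs are a reasonable starting point, this lists the cached statements doing the most reads:

    SELECT TOP (20)
           qs.total_logical_reads,
           qs.total_physical_reads,
           qs.execution_count,
           SUBSTRING(st.text, 1, 200) AS query_text_snippet
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_logical_reads DESC;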

    Again, the 300 number as a worry threshold is very outdated, and those who persist in using it display a lack of knowledge behind their recommendations.

    Thanks

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog: www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • GilaMonster and Paul,

    I understand the theory behind PLE well. I also agree that a temporary drop even to 0 is ok (seen those many times). Paul, I know that queries might need to be corrected, and not memory added (you're such a genius). Now please tell me about a server that you've worked with that had a PLE well above 300, say 1000, and had a performance issue. By performance issue I don't mean you crying, I mean a performance degradation observed at the user application level.

  • Dude - you start flinging insults around and you expect the conversation to continue?

    I've seen plenty of servers with high PLE (tens of thousands) that had perf issues which killed user app performance. Low PLE isn't the only thing that can be a problem - for instance, IO subsystem latency, blocking; the list goes on and on. Really, do you have any performance troubleshooting experience/knowledge at all?

    I'm not sure what you're trying to argue here - apart from arguing for the sake of arguing.

    I'm done with this conversation.

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog: www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • Paul,

    Again, what you're saying is right, again, you're stating the obvious (the IO subsystem could cause latency - who would have thought?), and again, you're missing my point. Or, OK, lemme rephrase myself: a PLE of 1000, and the server doesn't have enough memory for its needs - you've seen that?
