SAN Response Times

  • Hi

    Our San guy has come to me in regards to a production SQL 2008 R2 cluster (physical) on windows server 2008 r2. He has monthly reports and has come to me to report a LUN that shows service time (6ms) but response times (30ms). This concerns him as his reports flag this as a warning. He advises that this indicates the san is performing well however the 30ms is waiting for the response back from the server and wants me to investigate the sql server. The lun in question stores the data file for a critical, highly utilized database and nothing else, apart from a couple of smaller database datafiles.

    I have had a very brief look at the server and nothing immediate stands out. The server is highly spec'd. CPU is no where near under pressure nor is memory, no disk queuing. I believe I will need to dig in to sql server waits. I am hoping I can be pointed in the right direct to efficiently investigate this issue, if it is in fact and issue... its been like this for a long time and is not a reported problem from the customer. However if this is a sign things can be improved then I would like to. I understand this post has minimal specific information so to avoid posting lots of pointless stats, what info would assist in this investigation?

    Thanks

  • you may try this for starters, your wait times could be the key, over 30 does sound like an issue

    http://www.brentozar.com/archive/2012/03/how-fast-your-san-or-how-slow/

  • It is a production 24/7 system. I wont be able to run that tool on there.

  • don't be ridiculous. you can run it on production. it's doing what your server is already doing; in this case it creates a four gig file and tests how fast the disk is performing. if you are ever doing a backup, or copying a backup from disk to another location is exactly what you'd be doing anyway...throwing large files around.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell (1/11/2015)


    don't be ridiculous. you can run it on production. it's doing what your server is already doing; in this case it creates a four gig file and tests how fast the disk is performing. if you are ever doing a backup, or copying a backup from disk to another location is exactly what you'd be doing anyway...throwing large files around.

    Careful now. It says right in the article "Run it on an idle server (not your live SQL Server, because itโ€™ll slow things down while it runs.) "

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (1/11/2015)


    Lowell (1/11/2015)


    don't be ridiculous. you can run it on production. it's doing what your server is already doing; in this case it creates a four gig file and tests how fast the disk is performing. if you are ever doing a backup, or copying a backup from disk to another location is exactly what you'd be doing anyway...throwing large files around.

    Careful now. It says right in the article "Run it on an idle server (not your live SQL Server, because itโ€™ll slow things down while it runs.) "

    Thanks Jeff that's exactly right. If I ran this tool on the production server, this is a change in the server operation and effectively should go through change control. I don't think the customer would be happy with me if I ran that tool and it impacted the performance of their critical production system or worse.

    I would prefer if I could simply query the sql wait stats but not 100% sure on what I should be looking for.

  • I notice on the activity monitor, I regularly see high wait times on Buffer I/O however there is substantial amount of memory for this server....

  • High buffer I/O just means high logical IO which is probably just missing indexes or poor performing queries due to the amount of data being processed.

    For example, you might have 32GB RAM with most of your database in buffer, run a large select on 10GB of data and you'll see massive buffer I/O with low CPU and low disk I/O.

    Run through the performance check scripts one at a time - by Glenn Berry, use the correct version (SQL 2008): http://www.sqlskills.com/blogs/glenn/

    Don't worry about storing the results, just check the numbers.

    There is a script in there with all the exclusions for wait stats and other queries for disk performance and average response times per db.

    Post back your results here. Also check the missing indexes "index advantage" numbers, what range are they in (hint - millions is bad and is likely the problem).

  • Andrew G (1/19/2015)


    High buffer I/O just means high logical IO which is probably just missing indexes or poor performing queries due to the amount of data being processed.

    For example, you might have 32GB RAM with most of your database in buffer, run a large select on 10GB of data and you'll see massive buffer I/O with low CPU and low disk I/O.

    Run through the performance check scripts one at a time - by Glenn Berry, use the correct version (SQL 2008): http://www.sqlskills.com/blogs/glenn/

    Don't worry about storing the results, just check the numbers.

    There is a script in there with all the exclusions for wait stats and other queries for disk performance and average response times per db.

    Post back your results here. Also check the missing indexes "index advantage" numbers, what range are they in (hint - millions is bad and is likely the problem).

    Our SAN Admins always tell us bad queries are the cause, I agree though sometimes they are correct ๐Ÿ™‚ Most of the time I disagree but have to to suck it up because we need them they dont need us.... :-Don some systems but we have no choice but leave the bad queries because it's vendor database. (like voiding the warranty if we touch it)

    Maybe you can ask if you could run perfmon capture

    avg disk read/write per sec

    maybe every 10 sec intervals or 5 sec

    run it maybe 1-2 hours during your busiest times for enough data.

    That would give a metric that 'speaks' to server admins/san admins in a format they understand information.

    As Andrew mentioned if it's a large result being returned maybe you need higher throughput not response time as those can be mutually exclusive.

    avg disk byte writes/reads per sec (I hope I'm remembering right, you should be able validate search online)

    --------------------------------------------------
    ...0.05 points per day since registration... slowly crawl up to 1 pt per day hopefully ๐Ÿ˜€

  • Thanks. The high waits are CXPacket from report queries which I want to advise them to offload to a secondary copy but need the evidence that this would be an improvement to performance for them. Maxdop is already set to 6 but perhaps I should lower it to 4. Also OLEDB is a high wait which I have identified a badly performing linked server update query which the developers are looking in to fixing so I guess I will re-baseline once that is fixed.

    There are several missing indexes in the millions index advantage. I have passed some of these on. Others though have a lot of included columns and this database size right now, half of it is due to indexes. There are also some unused indexes however this database is based on a third party product (peopletools).

  • Maybe consider setting the cost threshold for parallelism higher but as long as queries arent waiting then no problem with cxpacket.

    OLEDB is typical, doesnt affect the server itself much, just the user waiting on the query.

    The recommended indexes always overestimates the included columns part. When it recommends too many includes, just use the primary key and any obvious foreign keys. Don't include any large columns. The missing index dmv also recommends adding indexes which already exist if you have duplicates and I've also found an issue where some indexes are highly fragmented are also returned as recommendations. Check your existing indexes first before creating any.

    Also did you see the query for average IO stall in milliseconds? It's an average value since the server was started, but can be useful way to know if theres an issue. It measures the round trip to disk on average. If you record the value, measure the difference over a day after adding indexes. The perfmon method mentioned by sqlsurfing will give you an instantaneous read which you might be able to use to identify the problem queries with a trace.

  • Andrew G (1/19/2015)


    High buffer I/O just means high logical IO which is probably just missing indexes or poor performing queries due to the amount of data being processed.

    For example, you might have 32GB RAM with most of your database in buffer, run a large select on 10GB of data and you'll see massive buffer I/O with low CPU and low disk I/O.

    Run through the performance check scripts one at a time - by Glenn Berry, use the correct version (SQL 2008): http://www.sqlskills.com/blogs/glenn/

    Don't worry about storing the results, just check the numbers.

    There is a script in there with all the exclusions for wait stats and other queries for disk performance and average response times per db.

    Post back your results here. Also check the missing indexes "index advantage" numbers, what range are they in (hint - millions is bad and is likely the problem).

    The drive level latency information (query 19)?

    Results below. The H drive is the on in question and the one our SAN guy is talking about

    DriveRead LatencyWrite LatencyOverall LatencyAvg Bytes/ReadAvg Bytes/WriteAvg Bytes/Transfer

    E:15239489922418910

    I:282310349262927256934

    F:254584176450961419

    G:944331626139761397

    H:531936949331614454647

    So it seems my best method going forward is to encourage the developers to tune their queries where they can?

  • Yes you should try to where possible, but it is usually the DBA responsibility to either add or recommend indexes. The devs have no visibility over how the app is being used and vendors usually roll out the same explanation. I usually prefix our indexes with the company initials then up to 3 column names and use a script to remove them before sending to the vendor. Make sure test/dev/prod environments are kept up to date with the same indexes. Good luck

  • I hate to ask this, but is your canned system running an application server? At a prior job we had one that required no stored procedures in the database. The system started slowing down as DB file size grew and the numbers on the server showed it was not working hard. The solution was a faster server, which really ticked me off.

    The vendor wouldn't admit that their model of using an application server and forcing RBAR over millions of rows was a bad thing.

    There's a reason why that vendor is not mentioned on my resume.

    -----
    [font="Arial"]Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson[/font]

  • Yes it is an application with separate application servers and it is my least liked system at this site for many many reasons reasons some of which are...

    - 9448 tables, 12590 views, 13754 indexes, 23 stored procs :/

    - The developers have a very bad habit of building views upon views upon views and then query said view for one column making the underlying query unnecessarily big. I have expressed my recommendations to query the tables directly where possible

    - Over kill on indexes in some tables, half the datafile size is tables, the other half is indexes

    - would not surprise me if a lot of operations are RBAR

    - The developers are not the best at sql development - I have to explain a lot to them about why certain things they do are bad. Included in this is having a lock on a table to get a unique incrementing ID, do all the processing for a certain function, before unlocking that table so another process can do the same thing. This means a lot of blocking... albeit in short time frames but still not good. I believe it is because their code must rely on the ids to be in incremental fashion and cannot have one id missing in the order... if that makes sense. Apparently they cannot use auto increment id columns?!

    - The underlying vendor app requires sql password to be no more than 8 characters... wtf

    - The developers "resolution" to issues with the application process on the app server maxing the cpu is to reboot the server... no investigation...

    - Developers say the require 13 environments of this application for their development. I beg to differ.

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply