IO Performance Problem

  • We have a periodic problem on our sql server that I'm totally stumped on. This is on SQL Server 2012, clustered server, attached to a SAN. All latest updates and patches are installed. Every so often (maybe once a month totally random) for multiple hours the Disk IO will start spiking on and off. This has no correlation to load on the server. The IO will spike for roughly 90 seconds and then go back down for about the same period of time and repeat this over and over until it just stops. Failing over the cluster and restarts have no affect on the problem. I've narrowed it down to one database on one drive. This particular db has a lot of blob objects and is our largest db (2 TB) . When montioring perf mon the Avg. Disk sec/Transfer for this drive spikes up from the normal .010 to around 0.8. It will stay up around 0.8 for 90 seconds and then drop back to normal. We have checked all the SAN hardware even switching to totally new hardware to try to isolate but it doesn't fix it. Eventually the problem just goes away as mysteriously as it appears. I'm totally stumped any random thoughts appreciated, thanks.

  • Anything in the default trace or sys.dm_exec_requests when this is happening?
    😎

  • No it is completely unrelated to any sql transactions, load is unrelated.  It spikes for about 90 seconds then goes back down for about the same time rinse and repeat.  I've already checked for any scheduled processes, db and application.

  • huntersimon - Friday, February 10, 2017 7:27 AM

    No it is completely unrelated to any sql transactions, load is unrelated.  It spikes for about 90 seconds then goes back down for about the same time rinse and repeat.  I've already checked for any scheduled processes, db and application.

    What about the default trace?
    😎

  • We traced the db wide open for over an hour while the problem persisted and found nothing unusual.

  • Windows updates, anti virus scans, monthly reports?  Just thinking outside of the box...

    What type of SAN do you have?  Anything going on like adaptive optimization / tiering?

  • adb2303 - Friday, February 10, 2017 7:35 AM

    Windows updates, anti virus scans, monthly reports?  Just thinking outside of the box...

    What type of SAN do you have?  Anything going on like adaptive optimization / tiering?

    Its an HP p2000 g3 controller.  We have restarted all the controllers and everything and eventually we installed all new SAN hardware and moved one db to its own drive.  This has isolated the problem to that drive so we feel we have eliminated all the SAN possibilities.  I'll look into adaptive optimization / tiering that is not something I've researched.

  • Use sp_whoisactive in differential mode to see the cumulative amount of work done by spids that run for the duration. I might try 30 or 60 seconds. Also use it in interactive mode in another spid to see real-time work. Be sure to include system spids. 

    I would also do a profiler trace to see if I can capture things doing a lot of IO. The problem is that it could be bunch of little things that a large-IO filter wouldn't catch. You can find this by doing an aggregate profiler trace analysis. Perhaps something small is getting called iteratively for some reason.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Is this a physical server or a VM? If a VM is there anything on the VM host layer going on during that timeframe (VM snapshots, vMotions, storage migrations, etc.)?

    Joie Andrew
    "Since 1982"

  • Its physical, problem has vanished again, still didn't figure out the root cause.  Thanks for the ideas.

  • huntersimon - Friday, February 17, 2017 6:09 AM

    Its physical, problem has vanished again, still didn't figure out the root cause.  Thanks for the ideas.

    Does EVERY piece of hardware in the server and along the ENTIRE IO stack have updated drivers and firmware? I have seen some VERY bad things due to both of those issues.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Haven't checked to see what the others have said but do you , by any chance, have automatic updates or automatic downloads turned on in Windows or your anti-virus software?  If so, you need to turn that off.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Could it be caused by DB file auto-grow?

    How much free space do you have in the files?

    Do you shrink DB files overnight?

    _____________
    Code for TallyGenerator

  • Sergiy - Tuesday, February 21, 2017 1:45 AM

    Could it be caused by DB file auto-grow?How much free space do you have in the files?Do you shrink DB files overnight?

    DOH!! I HATE it when I miss the obvious possibilities (file growth)!! :hehe:

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply