Odd performance issues

  • I am trying to figure what is going on? I have a SQL 2012 enterprise cluster running on 64 processors and 786 gig of ram. Our company was hit by ransomware and we had to build a whole new network. The server is twice as powerful then the old one running on a all flash san it is seems like it is crippled. Queries that run in 10 seconds on the old server, not take 4 minutes and it has been like this for the last 2 months. I know there is a ton of intrusion stuff they put on our new network since the attack, so  trying to determine the root cause?

    I run  this:

    SELECT latch_class, wait_time_ms,waiting_requests_count, 100.0 * wait_time_ms / SUM

    (wait_time_ms) OVER() AS '% of latches'

    FROM sys.dm_os_latch_stats

    WHERE latch_class NOT IN ('BUFFER')

    AND wait_time_ms > 0

    Not sure what the DBCC_CHECK_AGGREGATE is? no checkdb's and I can't find much documentation on it?

    image

    I am pretty sure it something it our network that is slowing it down, but I need to prove it, so any help would be appreciated.

     

     

     

  • First thing I'd do is figure out what that latch is.  Quick google brought me to this:

    https://www.sqlskills.com/help/latches/dbcc_check_aggregate/

    which states:

    This latch class is when a thread is waiting for access to the structure that controls a running DBCC consistency check.

    So, to me, that wait is due to a DBCC CHECKDB call.  The numbers, if I remember right, are NOT the "currently running wait" metrics, but the metrics since the server was started.

    What I'd be doing is looking at the execution plan to see if the estimates and actual values are similar as my first step.  If they are similar enough, then statistics are likely not the problem.  Next, I'd be checking for blocking while that query is running.  If no blocking is happening, my next thought would be to check on server resources.  Does the CPU or memory spike while that 10 second/4 minute query runs?

    Another thing to look for is the antivirus.  If it is scanning on file changes, the database files may need to wait for the AV to finish its scan before it can do any writes to the database.

    To test if it is the network causing problems, run your 4 minute query directly on the database (ie localhost rather than over the network).  If it comes back in 10 seconds, the it is likely network causing the slowness.  If it still takes 4 minutes, then it is likely not the network.

    But it could be any number of things.  Your new hardware MAY have worse single-core performance which may make your queries slower.  The server may have more stuff running on it than the old one, so even though it is "beefier", it may have fewer resources for SQL.  Could be bad statistics.  I would also check sp_who2 to see if anything is running CHECKDB that you are not aware of.

     

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • What version of SQL Server were you using on the old server and what are you using on the new server?  This is really important because a new "Cardinality Estimator" came into play on version 2014.  It crushed us for performance and we had to use a start-up trace flag to disable it server wide.  Depending on what was and what is, there could be more "gotcha's" and so we need to know both.

    Also, have you configured things like MAXDOP, Cost Threshold of Parallelism, etc, and (possibly) any "start up" trace flags that the old server had?  On the latter, if you know what they were, let us know.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I tried to get them to temporary shut all virus scanning off just to see if it made a difference, but they are against it. Should database servers even have virus scanning? We found huge issues on the VM database servers running virus scanners.

  • SQL server 2012 enterprise, and the app won’t running on anything higher. Already tried 2014, 2016 and 2019 with no luck! The never version only runs on Oracle. It is app that hooks into 500 retail brick and mortar stores.

  • You could try looking at waits:

    http://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/

    As the direction of travel with the App is Oracle, and SQL 2012 goes out of extended support next year, I would also try to create a test environment with Oracle ASAP.

     

     

  • Microsoft says you can, but recommends excluding most folders and file types associated with SQL Server: https://support.microsoft.com/en-us/topic/how-to-choose-antivirus-software-to-run-on-computers-that-are-running-sql-server-feda079b-3e24-186b-945a-3051f6f3a95b#:~:text=You%20can%20run%20antivirus%20software,is%20a%20cluster%2Daware%20version.&text=C%3A%5CWindows%5CCluster

    ... as well as the following on cluster servers:

    Q:\ (Quorum drive)

    C:\Windows\Cluster

    MSDTC directory in the MSDTC drive

  • Here is another thing I noticed running the query and maxdops is set to 0 and cost threshold for parallelism is set to 5. I would of thought it would spread the load across all processors? Unless HP is not a good for SQL servers?

    cpu

     

  • I think that is expected for some queries.  If the query has no need to go parallel or can't go parallel, then you may get a single core spiking like that.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • with that server I would change maxdop to 8 (and try out other combinations - 4 6 12)

    and CTFP change to 50 - never below 25 unless fully tested and results indicate lower value is better.

    and what is the max memory setting on the server?

     

    if this is a physical server make sure that power settings are set to MAX - both in Bios and Windows - and in some machines (HP models are one if not mistaken) there is another setting on machine setup that is also required to allow for CPU to be on max power.

     

    If this is a VM machine make sure to follow VM best practices with SQL Server - there are a few settings that need to be set correctly or performance will go down the drain.

  • Thanks! I adjusted the settings and the server is a physical server with 786 gig of ram and I have the sql max memory set at 778.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply