Performance Problem Migrating DW from Tin to VM - Suspected CPU Issue

  • Hello

    TL;DR:
    Would increasing the number of logical threads that a machine has increase the performance of SSIS and the SQL Server engine warehouse given that in the query plans of many long running queries many parallelism operations are present and in the design of many ETLs, 24 threads were considered? How would I prove this if I can't up the number of threads available in a machine?

    We are migrating our DW from on-premise tin to a hosted VM with the goal of expanding our storage on demand and lowering the time of our morning batch run from three hours, which is too long, to much less than that. We have been through a bit of an ordeal justifying and fighting for the performance specs of the box. We now have equivalence on the VM to our physical machine in terms of IOPS and RAM, but we're down from a 12 Core/24 Thread machine to 4C8T on the VM. Our processes are running very, very slowly. We've seen improvements after upping the IOPS and RAM, but the time taken is still five hours or more.

    IMHO, the number of threads is the issue. Our infrastructure team is saying that the VM is overspecced because the CPUs are hardly every at 100% load.

    I would like to clarify my understanding of the SSIS and SQL Server engines please, especially seeing as parallelism is present in all of the long running queries and the ETLs were designed to run on a machine with 24 threads - many, many operations run in parallel. Would SQL Server and SSIS run one thread up to 100% usage before starting to run on another thread, or do they take advantage of as many threads as are available? MAXDOP is not specified in any of the queries nor in any of the SSIS packages. Many packages run in parallel, as well as many of the components inside the packages.

    If my understanding is correct, how would I be able to prove this please, assuming that I cannot increase the number of logical threads on the VM?

  • There are a lot of things you can check:
    While the process is running, look at the CPU's (and other resoureces) in resource monitor.  How many cores are being fully utilized while it runs and how do the spikes look if there are any?
    Simulate a reduced core load by dropping the MAXDOP to 0 and see how it affects the performance of the queries <-- this one is not the "best" way to check as you already say it is using multiple threads, but you could run it at half the threads (4 threads) to see if that helps.
    Check for blocking
    Since you are using SSIS, make sure you have enough memory for SSIS.  SSIS operates in separate memory space from SQL Server Engine.  So if SQL Server is configured to use 2 TB (the default if memory serves) and you don't have more than 2 TB in the system, you will likely run into SSIS paging to disk and causing horrendous slowdowns.

    And I am sure there is more things you can check, but those are the first things that come to mind.  And those are in no particular order.
    I am expecting blocking to be a big one here, but it could also be memory pressure.  Or I could be completely out to lunch.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Thank you for your reply, bmg002. I'm leaning towards asking to up the cores to 24 for a day and try and see what that run looks like.

    Here are a few snapshots of performance on the VM in the middle of today's morning run that I took:

    10:14AM
     

    10:28AM

  • bmg002 - Thursday, May 24, 2018 12:08 PM

    There are a lot of things you can check:
    While the process is running, look at the CPU's (and other resoureces) in resource monitor.  How many cores are being fully utilized while it runs and how do the spikes look if there are any?
    Simulate a reduced core load by dropping the MAXDOP to 0 and see how it affects the performance of the queries <-- this one is not the "best" way to check as you already say it is using multiple threads, but you could run it at half the threads (4 threads) to see if that helps.
    Check for blocking
    Since you are using SSIS, make sure you have enough memory for SSIS.  SSIS operates in separate memory space from SQL Server Engine.  So if SQL Server is configured to use 2 TB (the default if memory serves) and you don't have more than 2 TB in the system, you will likely run into SSIS paging to disk and causing horrendous slowdowns.

    And I am sure there is more things you can check, but those are the first things that come to mind.  And those are in no particular order.
    I am expecting blocking to be a big one here, but it could also be memory pressure.  Or I could be completely out to lunch.

    MAXDOP to 0 - this is not a reduced core load - this is a "use all available cores".

  • AndrewJacksonZA - Friday, May 25, 2018 2:45 AM

    Thank you for your reply, bmg002. I'm leaning towards asking to up the cores to 24 for a day and try and see what that run looks like.

    Here are a few snapshots of performance on the VM in the middle of today's morning run that I took:

    10:14AM
     

    10:28AM

    Apart from CPU there is nothing major there.
    would it be possible to put also the top active processes ordered by cpu on the busy time.

    And I would change maxdop to 4 on this server (while you still have 4c/8t) to see the impact of it.

    Is this a Vmware or Hiper-V setup?
    And how many processors has the physical host have.
    and is Numa enabled on host and on vm? 
    These if not set correctly could have a negative performance impact

  • frederico_fonseca - Friday, May 25, 2018 3:37 AM

    bmg002 - Thursday, May 24, 2018 12:08 PM

    There are a lot of things you can check:
    While the process is running, look at the CPU's (and other resoureces) in resource monitor.  How many cores are being fully utilized while it runs and how do the spikes look if there are any?
    Simulate a reduced core load by dropping the MAXDOP to 0 and see how it affects the performance of the queries <-- this one is not the "best" way to check as you already say it is using multiple threads, but you could run it at half the threads (4 threads) to see if that helps.
    Check for blocking
    Since you are using SSIS, make sure you have enough memory for SSIS.  SSIS operates in separate memory space from SQL Server Engine.  So if SQL Server is configured to use 2 TB (the default if memory serves) and you don't have more than 2 TB in the system, you will likely run into SSIS paging to disk and causing horrendous slowdowns.

    And I am sure there is more things you can check, but those are the first things that come to mind.  And those are in no particular order.
    I am expecting blocking to be a big one here, but it could also be memory pressure.  Or I could be completely out to lunch.

    MAXDOP to 0 - this is not a reduced core load - this is a "use all available cores".

    Right... I meant 1... been a hectic week at work and had a brain fart moment.  That shoudl be 1 not 0 :/.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • OK, we bumped the virtual cores up to 16 last night. Performance has gone down slightly for today's import, but because we bounced the box all of the cached data was obviously gone, so this morning's import process was run cold. We're going to warm the cache up a bit like it would normally be just before we do the run tomorrow morning.

    Possibly related qustion on Twitter:
    In order to resolve an issue with a query that I am currently developing, I changed the database compatibility levels of several databases from 2008 and 2014 up to 2017. That particular query was taking 1 second to run on the tin with 24 cores and 30m on the VM with 16 vcores because the query plan was entirely serial. It took 2m30s when I added OPTION(USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE')) Now that I upped the compatibility level on the database on the VM, the query takes 1 second to run again.

    Link to Twitter: https://twitter.com/AndrewJacksonZA/status/1001847877721870338

  • Is this a live server?
    If so, my first step would be to get a TEST server set up so you can do changes during company uptime and not have to wait for the next day to see how performance is.

    Get the test system set up then monkey around with things like MAXDOP and ask your VM guys if they see any balooning issues with the VM, and if possible, make sure they have that disabled on your SQL boxes (if it isn't already).
    Both the test and the live system.

    Then on test, I'd play around with things.  MAXDOP being the big one.
    As for that HINT about enabling the parallel plan preference, I am pretty sure this is because they have their cost threashold for parallelism set too high for that particular query.  In your scenario, it looks like it is already using a parallel plan as you can see all of your CPU's being hit, so that query hint likely won't make a difference.  
    The compatability settings might make a difference though.  Probably wouldn't hurt to see what those are at and change it up to 2017 (again, on TEST... you don't want to change it on live only to find out your query now takes 48 hours to complete).

    I VERY rarely make changes to a live system without first testing them on the test box which I know I can kill the instances with only a few grumpy developers at my desk instead of taking the company down.  And I will never recommend making changes to a LIVE system that have not been done on a test system first.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply