SSIS 4 Minute Job Takes Over 19 *Hours* on New Server

  • Just thinking out loud...

    Is it possible that you are getting a different execution plan on the source system?  Since that source system is Sybase, the code being executed could generate different execution plans based on various factors such as connection settings.  And, at least for SQL Server, if the code is not schema qualified you can also get different execution plans.

    I would also check the number of parallel processes running - and make sure you are not saturating the source system with too many extracts running at the same time.

    On the destination server - are you loading data into staging tables that have no indexes?  Are you disabling nonclustered (NC) indexes during the load process?
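
    If it helps, here is a minimal sketch of disabling a nonclustered index before the load and rebuilding it afterwards (the table and index names are made up for illustration):

    -- Hypothetical names - disable the NC index before the load
    ALTER INDEX IX_Person_LastName ON dbo.Person DISABLE;

    -- ...run the load, then rebuild the index afterwards
    ALTER INDEX IX_Person_LastName ON dbo.Person REBUILD;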

    I just noticed that your max memory setting is the default...on an x86 system that would not cause any issues because of the 2GB memory limit, but on an x64 system it would allow SQL Server to take all available memory and starve the OS (and other processes - of course).  And...if you have locked pages in memory it would not allow SQL Server to release that memory back to the OS (and other applications).  You may have this enabled without specifically setting it - if you are running SQL Server with an account that is part of the local administrators group.
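
    For reference, a rough sketch of checking and capping max server memory with sp_configure - the 8192 MB value is only an example, pick a number that leaves headroom for the OS:

    -- Show the current max server memory setting
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)';

    -- Example value only: cap SQL Server at 8 GB
    EXEC sp_configure 'max server memory (MB)', 8192;
    RECONFIGURE;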


    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Jeffrey Williams:  Thanks for the additional thoughts!  I can respond to a couple points and will have to research/test the rest.

    As for the destination database/tables, none of the tables are joined together, but they do have primary keys.   There are no indexes beyond the primary keys.  I suppose I could turn the tables into heaps, but given that the job runs fine on our old server and from Visual Studio on the new server, I have to wonder if this avenue is worth pursuing.

    Another point I think I can address is about parallel processes.  I did a test where I forced the package to execute only one task at a time, and the same problem still occurred.  So, I think that rules out swamping the Sybase system with too many requests at once?

    For memory, I found that the amount of memory being used by the package when run from Visual Studio is no different than when the package is run from within SQL Server.  Sometimes the memory usage even went down a little when run from SQL Server.  I also had one of our network administrators look at memory and I got this back in an email: "I am seeing the RAM usage is well within comfort zones. I see about 40 – 50 KBps (fluctuating of course) for network traffic."

    While it may make sense to adjust memory settings from a best practice perspective (though that wasn't clear from my research either way), I was thinking that memory problems were not affecting this particular job. FYI: One way I looked at memory was by running this query that I found on a Microsoft page:

    SELECT 
    physical_memory_in_use_kb/1024 AS sql_physical_memory_in_use_MB,
    large_page_allocations_kb/1024 AS sql_large_page_allocations_MB,
    locked_page_allocations_kb/1024 AS sql_locked_page_allocations_MB,
    virtual_address_space_reserved_kb/1024 AS sql_VAS_reserved_MB,
    virtual_address_space_committed_kb/1024 AS sql_VAS_committed_MB,
    virtual_address_space_available_kb/1024 AS sql_VAS_available_MB,
    page_fault_count AS sql_page_fault_count,
    memory_utilization_percentage AS sql_memory_utilization_percentage,
    process_physical_memory_low AS sql_process_physical_memory_low,
    process_virtual_memory_low AS sql_process_virtual_memory_low
    FROM sys.dm_os_process_memory;

    I definitely don't know enough about memory to say for sure, though.  So, memory problems are a possibility I am still keeping in mind.  I can also try to pursue the idea you raised of different query plans occurring because the job is running under DTExec.exe vs DtsDebugHost.exe.  Thanks!
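
    (For reference, one intermediate test would be running the package with DTExec.exe directly from a command prompt on the new server, since that exercises the same executable Agent uses; a sketch, with a made-up package path:)

    REM Hypothetical path - /F runs a file-deployed package
    dtexec /F "D:\SSIS\LoadFromSybase.dtsx"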

  • I have been migrating TSQL code from SQL 2008 to later versions. The code consists of complex SPs that call more SPs that call functions. My tests were based on taking a backup of a SQL 2008 DB, restoring it into an Azure SQL Server 2019 VM, and testing the restored DB under different compatibility levels. I would start by benchmarking the code in the Azure SQL VM under SQL 2008 compatibility level and checking its run time. When I re-tested the same code using different compatibility levels I found:

    1. When testing in SQL 2019 mode the 'function inlining' optimisation (FROID) had a very harmful effect, probably because of the complex nature of the TSQL code. With the initial release version of SQL 2019 I found queries that previously took seconds would take much longer, then fail with obscure memory errors.
    2. I upgraded the Azure SQL VM from SQL 2019 RC0 to SQL 2019 CU4, which fixed problem 1. The complex TSQL code then failed with different errors: for example, if a function 'f' that was eligible for inlining called another function 'g' that wasn't eligible, the optimiser assumed 'f' could be inlined and produced an invalid plan. The fix was to run sp_refreshsqlmodule so that the optimiser knew correctly which functions could be inlined.
    3. Although you don't mention XML, I found the SQL 2014 optimiser changes had a large effect on XML queries, making some queries 10x - 100x slower. This was a frustrating problem because the query plan produced by the XQuery expressions is too complex to understand, and trivial rearrangements of the XQuery would produce wildly different running times. The fix was to rearrange the XQuery and re-test under 2008, 2014 and the other compatibility levels until the running time was similar in each.

    It may be this isn't relevant because you're running SSIS. But if your SSIS package is running some complex SQL code, it may be worth ensuring SQL 2019 is on the latest CU, and running sp_refreshsqlmodule over all your functions.
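
    A rough sketch of that sp_refreshsqlmodule pass over the user-defined functions (adjust the type filter if you also want views or procedures):

    -- Refresh metadata for all user-defined functions so the optimiser
    -- correctly re-evaluates which ones are eligible for inlining
    DECLARE @name nvarchar(776);
    DECLARE c CURSOR LOCAL FAST_FORWARD FOR
        SELECT QUOTENAME(SCHEMA_NAME(o.schema_id)) + N'.' + QUOTENAME(o.name)
        FROM sys.objects AS o
        JOIN sys.sql_modules AS m ON m.object_id = o.object_id
        WHERE o.type IN ('FN', 'IF', 'TF')   -- scalar, inline and multi-statement functions
          AND m.is_schema_bound = 0;         -- schema-bound modules cannot be refreshed
    OPEN c;
    FETCH NEXT FROM c INTO @name;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC sys.sp_refreshsqlmodule @name;
        FETCH NEXT FROM c INTO @name;
    END;
    CLOSE c;
    DEALLOCATE c;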

  • William Rayer:  I'm saving your valuable post, because I think it is going to be relevant and *very* helpful in a couple of months.  I'll tell you why I don't think it is relevant now.  Please correct me if I get this wrong:

    1. In order to avoid query optimizer issues in the first phase of the migration, I'm leaving the compatibility levels at 100.
    2. I don't think most of the queries that get the data are all that complicated.  Some are union queries with a handful of tables, but some queries are as simple as: get me these 2 columns from a single table with a few hundred rows, and the query does not even have a WHERE clause.
    3. The queries are being run against a Sybase database outside our system.  Would the SQL Server query optimizer matter if the query is against a Sybase database?
    4. The data is being loaded into empty tables in our local SQL Server.  In other words, there is nothing about the query on the SQL Server end which is complicated as far as I know.  I'm guessing that the SSIS data flow task is simply running an INSERT INTO statement.

    Let me know if you think I missed one or more of your points.  Otherwise, I'm going to keep your tips around for that scary time in a few months when I will move the compatibility level to 150 (the level that goes with SQL Server 2019).  I do have more complicated queries in my databases/apps.  I suspect I will be going through a fixing process much like you have been going through.  Thanks for taking the time to share.

  • William Rayer:  I should also add for interest sake:  Out of desperation, I did a test earlier where I changed the compatibility level of the local SQL Server database to 150 just to see if the new Query Optimizer would fix the problem.  It was a long shot and sadly, the compatibility level made no difference to this problem.  🙁
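
    (For anyone curious, that test was just the standard compatibility-level switch, along the lines of the sketch below with a placeholder database name, and then back to 100 afterwards.)

    -- Placeholder database name; 150 = SQL Server 2019, 100 = SQL Server 2008
    ALTER DATABASE [OurLocalDatabase] SET COMPATIBILITY_LEVEL = 150;
    -- ...re-run the package, then revert
    ALTER DATABASE [OurLocalDatabase] SET COMPATIBILITY_LEVEL = 100;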

  • Have you actually checked what code is running on the server during this problem?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden:  I don't know what you mean by "check what code is running on the server."

    If you mean looking at Profiler, then yes.  After the first few seconds, Profiler shows nothing happening - there are no outstanding commands waiting to be fulfilled.  When a query does go out in those first few seconds, we get the data back fairly fast.  After that, no queries go out.

    If you mean looking at the log table, msdb.dbo.sysssislog, during the job run, yes.  The log shows the same thing Profiler shows: a few tasks run and then everything stops for about an hour.
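
    (For reference, I'm reading that log table with a simple query along these lines:)

    -- Most recent SSIS log entries
    SELECT TOP (200) starttime, endtime, [event], [source], [message]
    FROM msdb.dbo.sysssislog
    ORDER BY starttime DESC;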

    I have also investigated memory (which I *think* is OK), CPU which goes through the roof and stays there, and network traffic (network connections stay open, but nothing flows).

    If I'm missing an avenue of research or not properly understanding avenues I have already tried to investigate, I'd be happy to hear about it.
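
    (The only other check I know of on the local server is the standard "what is running right now" DMV query - something like the sketch below, if I have that right:)

    -- What is currently executing on the local SQL Server?
    SELECT r.session_id,
           r.status,
           r.command,
           r.wait_type,
           r.wait_time,
           t.text AS running_sql
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.session_id <> @@SPID;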

    Thanks.  🙂

  • Does it pause at the same point each time?

    I'd start to break apart the package and test things separately. Pick a certain item, make a new package to do just that, and test it. Repeat with subsequent items. If you need to, drop the results into a duplicate table. Schemas make it easy to keep the same code - just duplicate the SQL table into a test schema.

    It sounds like some things run fine, and then some don't. You need to find the "don't" and track down that way.
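
    Something like this is all I mean by the schema trick - the table name is just an example, and note that SELECT ... INTO copies the columns but not the keys or indexes:

    -- One-time setup
    CREATE SCHEMA test AUTHORIZATION dbo;
    GO
    -- Example table: WHERE 1 = 0 copies the structure without any rows
    SELECT * INTO test.Person FROM dbo.Person WHERE 1 = 0;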

  • Is the new VM in Azure by any chance?  How does the 2 core new VM compare to the old server (2 cores and 12GB seems small for a production system)?  The only times I've seen this sort of behavior change recently have been with systems migrated to Azure or the new system has fewer cores than the old.

    For the Azure IaaS servers the problem turned out to be disk I/O limits, which were based on the server type chosen.  Azure VMs have individual drive I/O limits based on disk type and size, but there is another limit based on server type for all drives combined.  Hitting those limits resulted in the kind of symptoms you describe.  The easiest way to diagnose whether you are running into disk throttling is to use Resource Monitor on the server (open Task Manager, show More Details, go to the Performance tab, then click Open Resource Monitor).  Once in Resource Monitor, go to the Disk tab and look at the Storage section.  The key metric is Disk Queue Length.  If this value is high for the drive(s) where your data or log files live while the process is running, you have disk throttling.
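
    Another way to sanity-check I/O from inside SQL Server, if watching Resource Monitor is awkward, is sys.dm_io_virtual_file_stats.  A rough sketch - the numbers are cumulative since the instance started, so compare a snapshot before and during the run:

    -- Cumulative reads/writes and stall times per database file
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.physical_name,
           vfs.num_of_reads,
           vfs.num_of_writes,
           vfs.io_stall_read_ms,
           vfs.io_stall_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
        ON mf.database_id = vfs.database_id
       AND mf.file_id = vfs.file_id;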


  • Steve Jones:  At first the job seemed to be stopping at different points.  However, over the last couple of days, it seems to always load the same first two tables and then stop.

    Your suggestion makes sense.  I've been contemplating something like that, but then got stuck in my head about what I could do if I knew that such and such a table is a problem.  In a way, I already kind of know which tasks/tables are a problem, since the log shows which pre-execute events started before everything stops.  For example, I know the job loads two smaller tables just fine and fires the pre-execute event for several larger tables, such as the Person table, before everything stops for a long time.  Let's say I create a package that *just* downloads the Person table and I find that the same problem happens in the test package - i.e., the job hangs when run from Agent but runs fine in Visual Studio on the same server.  What would that tell me that I don't already know?  I've already tried to compare the data flow tasks that work against the data flow tasks that don't, like the one for the Person table, and I don't see a pattern.  Of course I could be missing something, but this is why I haven't tried your suggestion so far.  I was stuck on the question: what would I do even if I could precisely pinpoint all the tables which are failing?

    That said, your suggestion is a way forward/at least something to do.  I will try it and see if the process illuminates anything.  I will report back.  Maybe it is just a single table that is the problem and knowing that will help.

    Thanks!

  • TL:  Neither the old system nor the new system is Azure.

    Here's what I've been told: The old system has 2 cores and the new one has 2 or 4.  (I haven't gotten a definitive answer on that question yet, but I've been assured that none of our servers are less than 2 cores.)  I've also been told the following about our setup:

    "The “physical” computer is a very large blade center that hosts almost all of our 40+ virtual servers. Some of those servers are SQL instances, but they shouldn’t have any impact on the virtual instance that hosts OurNewLocalServer unless we had just over provisioned the blade center and all of the resources were maxed. Each server on the Blade Center think and act like they are a physical server. So as far as OurNewLocalServer is concerned, it thinks and acts like a physical machine. We aren’t running multiple sandboxed SQL server instances on that server either."

    Pretty much everything you wrote in your post is way over my head, and I don't understand it.  However, I do understand what you wrote about using Resource Monitor.  I will give that a try!  Thanks for the tip!!

  • JJ B:  Re-reading my post, I wasn't as clear as I should have been.  Sorry about that.  What I was trying to say boils down to:

    1.)  Double check that the new VM isn't vastly smaller than the old server.

    2.)  If the new box is in Azure, there may be some disk throttling issues.

    That second point may still be valid to some degree on a non-Azure VM if this was previously all on a standalone physical server.


  • I think you need to be looking at the Sybase system to see what is executing on that system.

    My guess is that the extract query is hanging.
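
    If you have access to the Sybase side, a quick check along these lines would show whether the extract query is still active or blocked (a sketch, assuming Sybase ASE - the exact syntax may vary by version):

    -- Run on the Sybase ASE source, not on SQL Server
    EXEC sp_who

    -- Or look at the process table for the extract's session
    SELECT spid, status, blocked, cmd, hostname, program_name
    FROM master..sysprocesses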

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • My guess too, which is why I'd like to break something out to check.

    If it hangs in SSIS, can you get it to hang in SSMS?

  • TL:  I found "Disk Queue Length" and watched the number before the package ran.  Before the package ran, the number was mostly zero, with some fluctuations between 0.01 to 0.30 popping up for a second every little bit.    Here's how the throttle research went when the package was run:

    When Run From Visual Studio

    This is the situation where the package runs successfully.   After the package started running: I would see lots of zeros, but then also higher numbers from 0.3 to maybe 1.0?  (The numbers went by fast.)  Most of the time, the Disk Queue Length was much less even if it was above zero.

    When Run From A Scheduled Job In Agent

    This is the situation where the package seems to hang.  In the first few seconds of the package running, I seemed to see a similar pattern to the test above, except that the numbers went a little higher, I think.  For example, for a brief second, I saw a 1.98!  (Is that high?)  BUT (this is all very dramatic), after a few seconds of the job running, the Disk Queue Length went down to pretty much zero and occasionally popped up to 0.01 to 0.05 or something like that.  It stayed at zero quite a bit.  The point where the Disk Queue Length stayed very low corresponded in time to the Reads per Second going down to zero that I reported previously above.  I.e., it matches the same story I was getting elsewhere.

    --------

    If I understood the gist of your post above correctly, since the Disk Queue Length stays low/zero when the problem is happening, then that means that throttling is not happening and I should look elsewhere for the problem.  Let me know if I misunderstood.  Otherwise, it was a nice easy test to try, and I'm grateful for the suggestion.

    So now I'll get cracking on Steve Jones' suggestion of trying to pinpoint exactly which tasks are a problem, in the hopes that knowing which tables fail vs. which succeed provides some illumination on what to do.  Fingers crossed.
