Troubles with In-Memory OLTP performance degradation

  • I recently started with a new company, and they use In-Memory OLTP, which is something I haven't used before, so I'm excited to try a new technology. I've been there about a month now, and I'm reading/watching everything I can on the feature, but they're seeing some strange behavior that I could use some help with.

    We are running SQL Server on Azure virtual machines, on a mix of SQL Server 2017 and 2019, but all DBs are in 2017 compatibility mode. Due to the nature of the business, they shut the servers down each night and turn them back on in the morning through automation.

    • Original problem: When the app started in the morning, it was initially slow. I think this makes sense: a SQL Server restart has happened, and the natively compiled stored procedure DLLs are not recompiled until first execution.

      • Workaround 1: To prevent the initial application slowness, they created a "warm-up job" that executed the stored procedures to generate the DLLs before the users got on the system. Over time, the warm-up job got slower and slower.
      • Workaround 2: They then replaced the warm-up job with a SQL Agent job that drops and recreates all procs/triggers/functions. SQL Server compiles a DLL for these native objects at create time. That worked for a bit, but over time that job began to slow down as well. On disk, when I look at the files in the XTP folder, I can see that the number of files is growing over time. I think this aligns with the processes slowing down, and it is possibly an indication of the root problem. Is the compiler failing? How can I monitor that?
      • Workaround 3: To get around that issue, they began to back up and restore (overwrite) the database, which clears out the XTP directory; the native objects then get recreated by the SQL job after the server starts.
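
    On the "Is the compiler failing? How can I monitor that?" question above, here are a couple of generic T-SQL starting points (a sketch, not anything specific to this environment). The DMV query lists the natively compiled DLLs actually loaded by the instance, and the error-log scan looks for XTP-related messages; the search string is an assumption and may need adjusting.

    ```sql
    -- Natively compiled module DLLs currently loaded by this instance.
    -- Each one maps back to a memory-optimized table or native module,
    -- so objects missing here after a rebuild may hint at compile failures.
    SELECT name, description
    FROM sys.dm_os_loaded_modules
    WHERE description = N'XTP Native DLL';

    -- Scan the current error log (0 = current log, 1 = SQL Server log type)
    -- for XTP-related messages around the rebuild job's run time.
    EXEC sys.xp_readerrorlog 0, 1, N'xtp';
    ```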

    This all feels a bit off. I don't love the idea of the warm-up job, I definitely don't like dropping and recreating all the procs, and I really hate the idea of restoring the database to alleviate the problem. In the end, we're just addressing the symptoms and not the root cause. I know a lot of the workarounds that have been implemented are custom, but has anyone seen this sort of behavior with In-Memory OLTP before? SQL Server is supposed to manage the files in the XTP folder. MS states:

    "If compilation fails or is interrupted, some generated files are not removed. These files are intentionally left behind for supportability and are removed when the database is dropped."

    In trying to determine why the SQL job that recreates database objects slows down over time, I looked at machine resources. We don't seem to be hitting any CPU, memory, or disk limits. The disks are a mixed bag of P15–P30. I understand that MS recommends at least P30 for production systems, and I intend to address that as soon as I can. We have one system with all P30 disks and it still experiences this issue. There are 4 in-memory files spread across 4 disks. The P30 disks are capable of 5,000 IOPS, but we also have (read-only) disk caching turned on, which should allow for greater performance.

    Could it be that compilation is failing often and leaving tons of files behind? If so, how would I track that down? Is the nightly shutdown causing a problem with the In-Memory OLTP feature? I'm considering adding a CHECKPOINT to the shutdown process. Does anyone have additional recommendations for troubleshooting next steps?

    TIA!

    • This topic was modified 1 year, 3 months ago by  tommiwan.

  • tommiwan wrote:

    Due to the nature of the business they shut the servers down each night and turn them back on in the morning through automation.

    I have to ask... what "nature of the business" could require such a thing?  IMHO, this is one of the worst things you can do to a database server of any kind.

    Also, it's not likely that the recompiles are the slow part... getting the data from disk to memory is probably the slow part.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Tend to agree with Jeff. I'm not sure shutting down makes sense, and compilation isn't an issue.

    If you think the queries are running slower, capture some execution plans and stats. I'd like to see this proven somehow. Typically we wouldn't see slowdowns for the same queries unless there are drastic data changes, poor indexing, or plan changes.
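
    One lightweight way to gather that evidence, sketched here with no thread-specific object names: snapshot the cached per-procedure stats at the end of a slow morning run. Note that for natively compiled procedures these stats are off by default and must first be enabled with sys.sp_xtp_control_proc_exec_stats.

    ```sql
    -- Average elapsed and CPU time per execution for cached procedures,
    -- worst first. This DMV is cleared on restart, so capture it during
    -- the slow warm-up window, before the next shutdown.
    SELECT
        OBJECT_NAME(ps.object_id, ps.database_id) AS proc_name,
        ps.execution_count,
        ps.total_elapsed_time / ps.execution_count / 1000.0 AS avg_elapsed_ms,
        ps.total_worker_time  / ps.execution_count / 1000.0 AS avg_cpu_ms
    FROM sys.dm_exec_procedure_stats AS ps
    ORDER BY avg_elapsed_ms DESC;
    ```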

    As far as warm-up jobs go, there's nothing wrong with them. I like those, especially for procs/functions that might be costly to compile. Granted, compilation still ought to take less than a second, but getting a quick first response and warming the buffer pool can be a good idea.

    Dig in with specifics and post back

  • Thanks for the replies so far!  Here are some answers to a few specific questions.

    The reason the servers get shut down each night is that the application is used by county governments and is only needed during normal business hours. The servers are SQL Server on Azure VMs, so the shutdown is for cost-saving purposes. I do tend to agree that I would prefer not to shut them down each night, but this process was already in place. It may be something I can change over time.

    As far as the startup process and getting the data from disk to memory: I understand that it can take some time and depends on the size of your database (all tables and sprocs are being loaded into memory). When a server starts up, there is a short time that the database goes into "in recovery," and I believe that this is when data is being streamed to memory. Someone please correct me if I'm wrong. Only once the database has been recovered do we run our SQL job that drops and recreates all sprocs/triggers/functions. That process is specifically the thing that slows down over time.

    As that process slows down, I've noticed that the number of files in the XTP folder grows each day. Normally there are around 6k files in that folder, but I recently looked at one of our servers and it had 230k files. This may just be a symptom, but it definitely seems to indicate a problem. It makes me wonder if something is going wrong during the recompile process, so that it fails and leaves objects behind. One thing they noticed (before I started) was that when they restored the database, things would get better. I also noticed that when we restore the database, the number of files in the XTP folder gets reset. This lines up with the quote from MS documentation that I posted above. The application will run fine as long as we can get the DROP/CREATE SQL job to complete, but it's that job which begins to take a long time.

    We've opened a ticket with Microsoft on this as well and will begin troubleshooting with them tomorrow. Thanks for the ideas so far. One specific question I had was this: since they are shutting the servers down each night, is there anything I should be doing to help ensure the database is in a good state? Should I be running a CHECKPOINT before shutdown?

  • To your points:

    - shut down vm at night - makes perfect sense, worth just doing this for cost reasons.

    - files in XTP folder - I don't know enough about this, but I think this is where the compiled in-memory objects are stored. There might be a need for some archive process to remove old DLLs here. I wouldn't be surprised if these aren't properly cleaned up by SQL Server.

    - in recovery on startup - this is nothing to do with the data in memory or the objects. This is the transactional process that ensures that committed transactions are written to disk and uncommitted ones are rolled back.

    - rebuild objects every day - no reason for this. My guess is that someone started this process when they couldn't troubleshoot other problems and decided to rebuild the in-memory objects. This isn't required.

  • Thanks again for the replies so far. I think it may take some time to work through this with MS. One specific question I have, though...

    Since we shut the DB servers down each night, do I need to do anything special, like a CHECKPOINT, to help the databases shut down and start up cleanly, given that these databases use In-Memory OLTP with SCHEMA_AND_DATA durability?

  • Although SQL Server itself should issue CHECKPOINT commands when being shut down, I've found that explicitly issuing them myself prior to shutdown often allows SQL Server to restart faster. So, personally, yes, I would issue a CHECKPOINT for each DB yourself before shutdown.
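
    A minimal sketch of that advice, assuming it runs as the last job step before the VM is deallocated (no error handling shown):

    ```sql
    -- Issue a manual CHECKPOINT in every online user database before shutdown.
    DECLARE @db  sysname,
            @sql nvarchar(300);

    DECLARE dbs CURSOR LOCAL FAST_FORWARD FOR
        SELECT name
        FROM sys.databases
        WHERE database_id > 4               -- skip the system databases
          AND state_desc = N'ONLINE';

    OPEN dbs;
    FETCH NEXT FROM dbs INTO @db;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @sql = N'USE ' + QUOTENAME(@db) + N'; CHECKPOINT;';
        EXEC sys.sp_executesql @sql;
        FETCH NEXT FROM dbs INTO @db;
    END;

    CLOSE dbs;
    DEALLOCATE dbs;
    ```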

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Steve Jones - SSC Editor wrote:

    To your points: - shut down vm at night - makes perfect sense, worth just doing this for cost reasons.

    If such a savings makes such shutdowns "worth doing", you just added another notch to my gun-belt of reasons why I have a strong dislike for the smoke'n'mirrors that some refer to as "the Cloud" and prefer to refer to it as "the Clowned". 😀 😀 😀

    --Jeff Moden



  • I wanted to provide some follow-up in case this helps anyone in the future. We've worked with MS over the past few months and found a pattern that allowed us to identify a workaround for this issue.

    One of the main symptoms of the problem was that the number of files in the \<Data>\<Database_ID>\XTP folder would grow over time. Additionally, we shut our servers down each night, as they are only needed during business hours and this allows us to save money since they are Azure SQL VMs. MS found that they could reproduce the issue if they shut the server down while there were still active connections to the DB. We were able to reproduce this as well: when we stopped the SQL Server service while there were still active DB connections, the files in the XTP folder remained in place. When we performed a clean SQL Server service shutdown with no users in the DB, the files in the XTP folder were removed.

    We are implementing a clean shutdown process which will shut down all application services, kick out any remaining DB connections, run garbage collection, and then shut down the Azure SQL VM. If we do this, we can see all the files in the XTP folder get removed, and when the server starts back up it begins to compile the native objects again from scratch. I'm not positive that the garbage collection step is needed.

    I currently have all of these shutdown procedures configured to run as a SQL job, but I think we're going to set them up to run as part of an Azure pipeline. Either way, the clean shutdown helped to resolve our issue. Hope this helps someone in the future.
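
    For anyone wanting to script the same sequence, here is a rough per-database sketch (the database name is a placeholder; SINGLE_USER is just one way to kick out remaining connections, and, as noted above, the forced garbage-collection step may not be strictly necessary):

    ```sql
    USE [YourAppDb];  -- placeholder name, substitute your database

    -- 1. Kick out remaining connections, rolling back their open work.
    ALTER DATABASE [YourAppDb] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

    -- 2. Optionally force In-Memory OLTP garbage collection.
    EXEC sys.sp_xtp_force_gc N'YourAppDb';

    -- 3. Flush dirty pages so recovery has less to do on startup.
    CHECKPOINT;

    -- 4. Return the database to normal access before stopping the service.
    ALTER DATABASE [YourAppDb] SET MULTI_USER;
    ```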

     


