Transaction Log Growing Despite Simple Recovery

  • I have a data warehouse database that is set to simple recovery. The transaction log was originally set to 1G, but it has grown to 9G. If I shrink the file, it grows back to 9G the next day during the data load process. Once the process is complete, none of the transaction log is in use, but the file still takes up a great deal of space on disk.

    Through monitoring, I was able to track down the cause of the transaction log bloat: a single stored procedure that runs at the very end of the data load process.

    The stored procedure is fairly simple. It has two INSERT/SELECT statements, which move data from the temporary data load tables to the actual data warehouse fact table. The first INSERT statement moves about 800,000 rows from these tables. The second INSERT statement creates about 200,000 rows in the fact table by querying the fact table itself. All of the transaction log bloat appears to be happening during the first INSERT statement.

    The first INSERT statement also performs a couple of lookups against tables in other databases. The data for our warehouse comes from multiple systems; system-dependent lookups and conversions are performed earlier in the process, and the data from those systems is unioned together just before this step. The purpose of this step is to move the data while performing the lookups/conversions that are system-independent. There are only a couple.

    However, for consistency across applications, one of the lookups is implemented as a table-valued UDF, which is OUTER APPLY'd against the data. Since this is the one thing that truly differentiates this query from the others, I suspect it is the cause, but I can't be sure.

    The inline UDF is little more than a few joins wrapped into a function (which should, in theory, be expanded inline by the optimizer). It returns all of the data from those joins (about 8 columns), all of which are used in the data warehouse. I could expand those joins manually in the query, but that wouldn't meet our business need very well. The business rules behind this lookup change often, and other applications and reports use this same data. I needed a single point where I could change the business rules and have the change affect every application that uses them; the inline UDF serves that purpose. Also, I don't know that this is the actual problem, nor can I be sure that expanding the function inline would actually solve it. I am hesitant to make such a large change without gathering more information first.

    Has anyone run into something like this before? Does it sound like the OUTER APPLY is causing the transaction log bloat, or could it be something else?

    Specifically (without being too verbose or giving away confidential information), the query is an INSERT/SELECT with a long list of fields. A couple of fields have simple CASE/COALESCE conversions. The FROM clause is as follows:

    FROM temp.OrderPrep OP
        LEFT JOIN dim.Customer C ON C.CustomerID = OP.CustomerID
        OUTER APPLY Lookup.dbo.fn_SalesRegion(
            CASE
                WHEN C.CustomerType = 'OEM-N' THEN 'NAT'
                WHEN C.Channel = 'EXPORT' THEN 'EXP'
                ELSE ShipToState
            END, ShipToCountry) SR
        OUTER APPLY Lookup.dbo.fn_IRRegion(ShipToState, ShipToZipcode, ShipToCountry) IRR

    There is no WHERE clause; it loads the entire table.

    I can post the UDFs if need be, but they are just inline UDFs that are fairly simple SELECT/FROM/WHERE queries.
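
    For anyone unfamiliar with the pattern, they look roughly like the sketch below (this is not the real fn_SalesRegion; the table and column names are invented for illustration). Because the body is a single SELECT, the optimizer can expand it into the calling query rather than executing it row by row:

    -- Hypothetical sketch only; a function like this would live in the Lookup database
    CREATE FUNCTION dbo.fn_SalesRegion_Sketch
    (
        @StateOrCode varchar(10),
        @Country     varchar(10)
    )
    RETURNS TABLE
    AS
    RETURN
    (
        -- The single-statement body is what makes the function "inline"
        SELECT R.RegionCode,
               R.RegionName
        FROM   dbo.SalesRegionMap M
               JOIN dbo.SalesRegion R ON R.RegionID = M.RegionID
        WHERE  M.StateOrCode = @StateOrCode
          AND  M.Country     = @Country
    );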

  • All of this:

    The first INSERT statement moves about 800,000 rows from these tables. The second INSERT statement creates about 200,000 rows in the fact table by querying the fact table itself.

    is written to the log file before it is written to the database (not completely true, as some data may get written to the database before it is all written to the transaction log). The reason for this is so that the transaction can be rolled back if a failure (such as a server crash) occurs before it completes.

    If the transaction log has grown to 9 GB during this process, then that is how large it needs to be. If you are concerned about the size of the transaction log, can this stored procedure be modified to complete the load in smaller batches, say of 100,000 rows each instead of 800,000 and 200,000 respectively?

  • I had a similar issue in our production environment where our transaction log was growing to unacceptable sizes after bulk data loads. In my case, the transaction log was not needed so I would manually truncate it after the inserts to get it back down to an acceptable size. I added this query at the end of my stored procedure to perform this:

    EXEC sp_dboption 'db_name', 'trunc. log on chkpt.', 'TRUE'
    USE db_name
    CHECKPOINT
    GO
    DBCC SHRINKDATABASE (db_name, 10)

    Danny Sheridan
    Comtekh, Inc.

  • Danny Sheridan (4/19/2010)


    EXEC sp_dboption 'db_name', 'trunc. log on chkpt.', 'TRUE'

    That option was deprecated back before SQL 2000. It should not be used any more; the replacement is the SIMPLE recovery model.
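
    For reference, the current way to do that (using the db_name placeholder from the snippet above) is simply:

    ALTER DATABASE db_name SET RECOVERY SIMPLE;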

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • It sounds like it's doing everything through a single, logged operation. As long as you do that, you'll need this much space, as Lynn says. You either have to live with it or change something. You could break the procedure up into a number of smaller procedures, each of which commits its data move when it's done, so that the transactions are smaller and more discrete. Another option would be to use a minimally-logged operation such as bcp or BULK INSERT.
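
    For completeness, on SQL Server 2008 and later an INSERT...SELECT can also qualify for minimal logging, but only under fairly strict conditions. A sketch of the pattern follows; the table and column names are hypothetical, not the poster's schema:

    -- Minimal logging for INSERT...SELECT generally requires SQL Server 2008+,
    -- the SIMPLE or BULK_LOGGED recovery model, a TABLOCK hint, and a heap
    -- (or empty clustered index) target with no nonclustered indexes;
    -- a fully indexed fact table will not qualify.
    INSERT INTO dw.FactOrder_Heap WITH (TABLOCK)
           (OrderID, CustomerID, SalesRegionCode)
    SELECT OP.OrderID,
           OP.CustomerID,
           OP.SalesRegionCode
    FROM   temp.OrderPrep OP;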

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Lynn Pettis (4/19/2010)


    All of this:

    The first INSERT statement moves about 800,000 rows from these tables. The second INSERT statement creates about 200,000 rows in the fact table by querying the fact table itself.

    is written to the log file before it is written to the database (not completely true, as some data may get written to the database before it is all written to the transaction log). The reason for this is so that the transaction can be rolled back if a failure (such as a server crash) occurs before it completes.

    If the transaction log has grown to 9 GB during this process, then that is how large it needs to be. If you are concerned about the size of the transaction log, can this stored procedure be modified to complete the load in smaller batches, say of 100,000 rows each instead of 800,000 and 200,000 respectively?

    This makes sense. I've had really large INSERT/SELECT statements fail 3/4 of the way through, and the destination was still empty. I figured it was using tempdb, but my tempdb is 5G and typically never uses more than 3G.

    I suppose I could build a loop into the process; I would probably have to add an identity key to the intermediate table. Not a huge problem, and it shouldn't add much time or effort to the creation of the intermediate table.

    What confuses me is that the creation of the intermediate table doesn't cause the same bloat. The intermediate table is created using two different procedures, each adding data from one system. My numbers are a little off, though: the intermediate table is actually slightly over a million rows, and 800,000 of those come from one of the two systems. Those 800,000 rows are moved in a single INSERT/SELECT (very similar to the procedure I'm having a problem with). I have confirmed that the transaction log was at 2G after this intermediate step ran. It grew from 2G to 9G during the final step of processing, in about five minutes. That seemed strange, so I didn't think the data volume alone was causing the problem...

  • Danny Sheridan (4/19/2010)


    I had a similar issue in our production environment where our transaction log was growing to unacceptable sizes after bulk data loads. In my case, the transaction log was not needed so I would manually truncate it after the inserts to get it back down to an acceptable size. I added this query at the end of my stored procedure to perform this:

    The size of the file after the load is of no consequence. The fact that it grew to 9G in the first place means that I need to have 9G (or more) available on a drive where space is at a premium.

    In simple recovery mode, "CHECKPOINT" alone should be enough to make the inactive portion of the log reusable. I've actually added it at several points in my code while troubleshooting. This is how I know that a single INSERT statement is causing the problem.

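    For anyone following along, a quick way to watch this while troubleshooting (standard commands, nothing specific to this system):

    -- Report log size and percent of the log in use for every database
    DBCC SQLPERF (LOGSPACE);

    -- In simple recovery, a manual checkpoint makes the inactive portion of
    -- the log reusable; it does not shrink the physical file
    CHECKPOINT;

    DBCC SQLPERF (LOGSPACE);
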
  • Grant Fritchey (4/19/2010)


    Another option would be to use a minimally-logged operation such as bcp or BULK INSERT.

    I'm confused as to how BCP or BULK INSERT would help, considering the data is already in a table.

    Perhaps using SSIS would minimally log the transfer? I've avoided using it in the past because I am unsure about the UDF call.

  • I don't think BCP/Bulk Import would help here.

    If space is at a premium, I would look to resolve that issue. Isn't data likely to grow? Meaning more log space needed?

    You can break up the insert, with the idea that you allow commits to occur, so a checkpoint can whack the log records and free up space.

  • jvanderberg (4/19/2010)


    Grant Fritchey (4/19/2010)


    Another option would be to use a minimally-logged operation such as bcp or BULK INSERT.

    I'm confused as to how BCP or BULK INSERT would help, considering the data is already in a table.

    Perhaps using SSIS would minimally log the transfer? I've avoided using it in the past because I am unsure about the UDF call.

    I suspect they won't help as things currently exist. But if you want to change the behavior of the system, you're going to have to change your process. It's really a question of picking and choosing where you make those changes.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • The easiest way seems to be looping based on an identity key. I've broken up the transaction into groups of 50,000 rows, and a CHECKPOINT is issued manually (just to be sure) after each load. The first part of the transaction seems to take up the most log space, at slightly over 1G. The remaining data groups seem to take only a few hundred megs of log space. It's certainly a great deal better than 9G. Thank you all for your help.
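
    For the curious, this is roughly what that loop looks like; the table, column, and variable names in this sketch are placeholders rather than the real procedure:

    DECLARE @BatchSize int, @MinID int, @MaxID int;
    SET @BatchSize = 50000;

    SELECT @MinID = MIN(RowID), @MaxID = MAX(RowID)
    FROM   temp.OrderPrep;    -- RowID = the identity key added to the intermediate table

    WHILE @MinID <= @MaxID
    BEGIN
        INSERT INTO dw.FactOrder (OrderID, CustomerID, SalesRegionCode)    -- placeholder columns
        SELECT OP.OrderID, OP.CustomerID, OP.SalesRegionCode
        FROM   temp.OrderPrep OP
        WHERE  OP.RowID >= @MinID
          AND  OP.RowID <  @MinID + @BatchSize;

        -- Each batch commits on its own; in simple recovery the checkpoint
        -- lets the log space used by the previous batch be reused
        CHECKPOINT;

        SET @MinID = @MinID + @BatchSize;
    END;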

    An interesting thing I've learned from this: the dramatic effect of file resizes on query performance. Typically, the stored procedure runs in 6-9 minutes (when there is enough log space to accommodate it). When I was forcing it to resize the log (1G at a time), it took 16-20 minutes, and that's just 8 resizes. Imagine if my growth increment were set to 10%! I could be waiting an hour for it to complete. File resizes truly do kill a query; you will notice the difference.
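
    A side note for anyone hitting the same thing: pre-sizing the log once to what the nightly load needs, and using a fixed growth increment rather than a percentage, avoids paying that autogrow cost every night. Database and logical file names below are placeholders:

    -- Grow the log once to a size the nightly load is known to need,
    -- and use a fixed increment instead of a percentage for any further growth
    ALTER DATABASE WarehouseDB
    MODIFY FILE (NAME = WarehouseDB_log, SIZE = 4096MB, FILEGROWTH = 512MB);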

  • Here is a guess at why you see a larger growth from stage to production than from source to stage: indexes on the production table(s). Do you currently drop and recreate the indexes or are they in place during the transfer?

  • Lynn Pettis (4/19/2010)


    Here is a guess at why you see a larger growth from stage to production than from source to stage: indexes on the production table(s). Do you currently drop and recreate the indexes or are they in place during the transfer?

    Ah ha! It all makes sense now. I keep the indexes in place during the load. I had considered drop/recreate, but up until now I could see no specific reason to do so; the process itself didn't take that long, so I didn't think I'd get much of a performance boost from dropping and recreating indexes. There are about twenty indexes on the fact table. Most are on INT columns, but even so the index space for the table is nearly as large as the data space.

    On the other hand, there are virtually no indexes on the intermediate table; only an identity key, added today.

  • I wrote a set of procedures for a previous employer, run before and after the load process into their data warehouse, that dropped and recreated the indexes and foreign key references on specified tables. Since it dynamically captured the index information when run, there was no need to maintain drop and create scripts; if a new index was created on a table, it was automagically captured during the next load.
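
    A lighter-weight variant of the same idea on SQL Server 2005 and later is to disable the nonclustered indexes before the load and rebuild them afterward; the definitions stay in the metadata, so there are no scripts to maintain. Index and table names below are placeholders, and the clustered index must stay enabled or the table becomes unreadable:

    ALTER INDEX IX_FactOrder_CustomerID ON dw.FactOrder DISABLE;
    ALTER INDEX IX_FactOrder_ShipDate   ON dw.FactOrder DISABLE;

    -- ... run the load here ...

    ALTER INDEX ALL ON dw.FactOrder REBUILD;    -- REBUILD also re-enables the indexes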

  • Lynn Pettis (4/19/2010)


    I wrote a set of procedures for a previous employer that dropped and recreated the indexes and foreign key references on specified tables that were used before and after the load process into their data warehouse. Since it dynamically captured the index information when run there was no need to maintain drop and create scripts so if a new index was created on a table, it was automagically captured during the next load.

    I'll try it out and see what the results are. This seems like it might actually give me a performance boost, whereas breaking up the query is almost certainly a performance hit.
