Efficient way to delete historical data

  • I've been asked to help with an issue one of our factories is experiencing. They have a machine which prints labels, and the data on these labels is written to a database during print runs. During one run of over 24 hours, the amount of data generated is in the millions of rows. The company developed software that runs an archive-and-delete process each evening, which backs up the database and then deletes data over 5 days old.

    This software does the delete in a single transaction, and they quickly found that it locked the database for a good few hours while the tlog grew to well over 300GB. The database is only about 50GB. They were deleting around 30 million rows from multiple tables.

    I've been asked to come up with a better solution to delete the data. I can delete it in batches, which would solve the tlog growth, but it wouldn't make the deletion process any quicker. I'm trying to think of a better solution that would allow the database to still be used when the factory wants to start another print run, while also clearing out the old data and leaving the last 5 days' worth still queryable. They can't have the machine down for hours whilst this process is running.

    What kind of solutions should I be looking at?

  • We remove multi-million rows after a variety of time-periods on several systems using partitioning.

    It isn't an easy thing to create for an existing system that already contains a lot of data, but I have done it.

    There are a couple of articles on my blog site to explain partition switching and setting it up to automatically clear data.

  • BrainDonor (3/30/2015)

    We remove multi-million rows after a variety of time-periods on several systems using partitioning.

    It isn't an easy thing to create for an existing system that already contains a lot of data, but I have done it.

    There are a couple of articles on my blog site to explain partition switching and setting it up to automatically clear data.

    Thanks. I would have gone this route too, but they are using SQL Server 2008 R2 Workgroup Edition! So I cannot use partitioning, unfortunately.

  • Breaking it into pieces and making sure that the delete statements use indexes instead of scanning are really your only choices if you can't partition the data.
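    A minimal sketch of the indexing side of this, assuming a hypothetical dbo.Labels table with a PrintedAt column (these names are illustrative, not from the original posts):

    -- Hypothetical names. An index on the purge column lets the DELETE seek
    -- straight to the old rows instead of scanning the whole table.
    CREATE NONCLUSTERED INDEX IX_Labels_PrintedAt
        ON dbo.Labels (PrintedAt);

    -- Keep the predicate sargable: compare the column to a precomputed cutoff.
    DECLARE @cutoff DATETIME = DATEADD(DAY, -5, GETDATE());

    DELETE TOP (50000) FROM dbo.Labels
    WHERE PrintedAt < @cutoff;

    (The TOP clause is there because deleting all the old rows in one statement brings back the original log problem; the batching loop further down the thread wraps this kind of delete in a WHILE.)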

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks.

    Can I just check: BOL says, about deleting data, "By default, a DELETE statement always acquires an exclusive (X) lock on the table it modifies".

    If I break the deletes down into transactional batches - say 100,000 rows per batch - would that cause blocking for the duration of all the batches, or would users' queries (inserts/selects) run between each batch, effectively blocking the next deletion batch? In other words, everything will run, but possibly more slowly due to waits, with everyone's work processed in the order the statements are issued?

    I hope I've explained that clearly! 🙂

  • That doesn't mean it will lock the entire table (unless you're deleting a pretty substantial portion of it, but that's part of why you break it into smaller batches). It will cause blocking on the pages that it's deleting from, but that should be expected behavior: "You can't read the data I'm deleting." It mainly depends on whether you're getting scans for your deletes, in which case, oh yes, it's pretty likely to block people. A delete, like an update, is a read and a write, and deletes certainly impact other processes because of that.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Concur with Grant's explanation.

    Here's what I'd look at. Do you create an hour's worth of data every day? Or on any given day, maybe, is a better question. Could you schedule the delete process (perhaps with an archive step) on a regular basis so that each deletion takes less time?

    I could envision here that you do something like schedule a deletion of data over 5 days old every hour. You'd always be deleting an hour's worth of data, though you could certainly pause this if you encountered a busy time.

    The other thing you could consider is automating this with a regularly running process, even more frequent, which isn't run (or doesn't check for batches to remove) if a semaphore is set. The easiest way to do this is to have someone click a button when they start a print run, which puts a value in a table. Your deletion process reads that and, if it's set, doesn't do anything. If it isn't (you need an "unset" method as well), the process deletes xx rows. There's a rough sketch of this after this post.

    I've worked with batches from 1,000 to 100,000 rows. The indexing and hardware have dictated what a good size is, and it's varied. You need to test this yourself.
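    A minimal sketch of the semaphore idea above, assuming a hypothetical dbo.PurgeControl flag table and a dbo.Labels table with a PrintedAt column (all names are illustrative):

    -- One-row flag table: the print-run screen sets PrintRunActive = 1 at the
    -- start of a run and back to 0 when it finishes.
    CREATE TABLE dbo.PurgeControl
    (
        PrintRunActive BIT NOT NULL
    );
    INSERT INTO dbo.PurgeControl (PrintRunActive) VALUES (0);

    -- Scheduled job step: skip the purge entirely while a run is in progress.
    IF NOT EXISTS (SELECT 1 FROM dbo.PurgeControl WHERE PrintRunActive = 1)
    BEGIN
        DELETE TOP (50000) FROM dbo.Labels
        WHERE PrintedAt < DATEADD(DAY, -5, GETDATE());
    END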

  • One other thing, perhaps you can do a simple "switch" for your table. Rename your table before print jobs are generated and recreate a new, blank one. Then you can delete from one without impacting (too much) the one used for print jobs.

    Be careful here, as you need to manage the process, and DRI (foreign keys) could cause issues, but it's a thought.
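    A rough sketch of the rename-and-recreate approach, with an entirely hypothetical table definition (the real schema will differ):

    -- Run during a quiet window, before the next print run starts.
    EXEC sp_rename 'dbo.Labels', 'Labels_Old';

    -- Recreate an empty table with the same definition (columns here are made up).
    CREATE TABLE dbo.Labels
    (
        LabelID    BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        PrintJobID INT          NOT NULL,
        PrintedAt  DATETIME     NOT NULL,
        LabelData  VARCHAR(500) NULL
    );

    -- Old rows can now be archived and deleted from dbo.Labels_Old without
    -- blocking inserts into the new dbo.Labels.

    Any foreign keys, views, or permissions on the original table would have to be recreated or re-pointed as part of the switch, which is where the DRI caution above comes in.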

  • Thanks all, very interesting.

    Unfortunately, a continual delete isn't possible, as the data is loaded over the course of the print run. Reports are then generated on that data, and it is used to generate other statistical information. Runs can last 1 to 2 days, so a lot of data is accumulated over that time. They keep the data for a few days because the reports don't have to be generated right after a run, i.e. a new run could be in progress whilst the reports for the previous runs are being generated.

    I've thought about a table-switching process, but again I'm not too sure how quickly I could switch the data, i.e. rename the live tables to something else and recreate new ones with the same names. The factory has about an hour of downtime every Sunday morning, so that could be a possible time to do it, when no one would be using the database. I'm thinking I would have to copy the "live" data back into the new empty tables, but that may take a while (more than the hour of downtime) and cause timing issues again. It would have the benefit, though, that I could then delete the historic data at will without worrying about blocking other transactions.

    You're right though: the biggest thing here is testing, and that's the only way I am going to be able to see what works well and what generates lots of log growth, blocking, etc.

  • You could consider clustering the table on the datetime involved. Then you can delete old data with minimal effect on current rows, particularly if you also reduce the number of non-clustered indexes to the bare minimum.

    How is the table normally processed? By that datetime? Or by some type of machine_id, if there are multiple machines rather than just one? In that case, we might need to consider clustering by ( machine_id, datetime ).

    Edit: Also, that table should not require the full recovery model. If you can, run it in simple recovery, even moving that table to a separate database if needed to do so.
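    A minimal sketch of both suggestions, again with hypothetical names (dbo.Labels, MachineID, PrintedAt, and a LabelData database), and assuming the existing clustered index or primary key can be rearranged:

    -- Cluster on how the data is purged/queried, e.g. (machine, datetime).
    CREATE CLUSTERED INDEX CIX_Labels_Machine_PrintedAt
        ON dbo.Labels (MachineID, PrintedAt);

    -- If the label data sits in its own (hypothetical) database, simple recovery
    -- keeps the log from ballooning, at the cost of point-in-time restore.
    ALTER DATABASE LabelData SET RECOVERY SIMPLE;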

    SQL DBA, SQL Server MVP (07, 08, 09). A socialist is someone who will give you the shirt off *someone else's* back.

  • Maddave (3/30/2015)

    ...

    I've been asked to come up with a better solution to delete the data. I can delete it in batches, which would solve the tlog growth, but it wouldn't make the deletion process any quicker. I'm trying to think of a better solution that would allow the database to still be used when the factory wants to start another print run, while also clearing out the old data and leaving the last 5 days' worth still queryable. They can't have the machine down for hours whilst this process is running.

    What kind of solutions should I be looking at?

    I can tell you from experience that deleting from a table in multiple moderate-sized batches (i.e. 10k-100k rows) is much faster than deleting in one single huge transaction (10 million rows). When you perform these huge delete operations, the bulk of the I/O and duration is consumed by transaction logging, and from what I've seen the duration is not linear in the number of rows deleted but grows disproportionately with it. The difference can be several minutes versus several hours.
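    A minimal sketch of that batching pattern, again using the hypothetical dbo.Labels/PrintedAt names rather than the real schema:

    DECLARE @cutoff DATETIME = DATEADD(DAY, -5, GETDATE());
    DECLARE @rows   INT = 1;

    WHILE @rows > 0
    BEGIN
        -- Each iteration is its own (implicit) transaction, so log space can be
        -- reused or backed up between batches instead of growing unchecked.
        DELETE TOP (50000) FROM dbo.Labels
        WHERE PrintedAt < @cutoff;

        SET @rows = @@ROWCOUNT;
    END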

    Also, SQL Server just won't hold millions of individual row locks. When you're deleting millions of rows from a table, lock escalation kicks in at a certain point, and you will eventually see table locking.

    Inserting millions of rows into a table and then deleting them again is just messy. It's not really transactional data, so the overhead of the relational engine and storage just gets in the way. Consider writing your audit records to something like a tab-delimited or CSV text file. Your reporting tool should be able to work with that. You can use a YYYYMMDD naming convention for the files, so they automatically get partitioned that way. You can also use the OPENROWSET() function to query the text files from SQL Server, if needed. If the sole purpose of this SQL Server instance is holding onto temporary audit records, then you may be wasting money on a license.
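    For example, something along these lines (the file and format-file paths are made up; on 2008 R2 the BULK provider needs a format file describing the columns):

    SELECT lbl.*
    FROM OPENROWSET(
            BULK 'D:\LabelAudit\Labels_20150330.csv',
            FORMATFILE = 'D:\LabelAudit\Labels.fmt',
            FIRSTROW = 2
         ) AS lbl;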

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Thanks. I've written my procedure which deletes the data from the tables in batches. Initial testing shows it is much quicker and the log hardly grows at all, plus data can still be inserted and queried whilst the process is running, which is perfect!

    I'll be load testing the system with real-life data tomorrow, so it will be interesting to see whether the process can complete in less than 5 hours!

    My only issue is that the owners of the system have said they are concerned about what happens if there is a problem with the deletion halfway through and it stops: they can't leave the data in an inconsistent state, i.e. some tables with just 5 days' worth of data and some with data older than 5 days. The data has proper referential integrity, so there wouldn't be orphaned records, but I think they are concerned about having parent records with no children, which wouldn't be correct.

    I've said I don't think there is any way to avoid this problem; to recover from it, they would have to restore from a backup taken before the deletion. Otherwise, they will have to live with doing the whole lot in one transaction and all the problems they already have!

    I'll post an update with the chosen solution once this is tested and successfully implemented if anyone is interested.

  • Really, your disposable audit records that you purge after five days must have referential integrity and must be totally consistent? Let's consider that over-engineering on the part of the ISV has actually led to poor performance and a higher risk of failure.

    The deletion of rows, when done in small batches, is a straightforward operation. This type of thing typically only fails when it's deadlocked or the transaction log fills up, and that's exactly what your plan of deleting in batches avoids.

    If for some reason it should fail, you can simply re-run the batch deletion stored procedure to pick up where it left off. Take that into consideration when coding the procedure: it should do practically the same thing as the initial delete process, only in batches. With a little more attention it can be coded to delete parent and related child rows within the same batch, so each transaction is consistent.
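    A sketch of that last point, assuming hypothetical dbo.PrintJobs (parent) and dbo.Labels (child) tables keyed on PrintJobID, with a JobDate column to purge on:

    DECLARE @cutoff DATETIME = DATEADD(DAY, -5, GETDATE());
    DECLARE @batch  TABLE (PrintJobID INT PRIMARY KEY);

    WHILE 1 = 1
    BEGIN
        DELETE FROM @batch;

        -- Grab one batch of expired parent keys.
        INSERT INTO @batch (PrintJobID)
        SELECT TOP (10000) PrintJobID
        FROM dbo.PrintJobs
        WHERE JobDate < @cutoff;

        IF @@ROWCOUNT = 0 BREAK;

        -- Children and parents are deleted in the same transaction, so a failure
        -- part way through leaves no half-deleted jobs behind.
        BEGIN TRAN;
            DELETE l
            FROM dbo.Labels AS l
            JOIN @batch AS b ON b.PrintJobID = l.PrintJobID;

            DELETE p
            FROM dbo.PrintJobs AS p
            JOIN @batch AS b ON b.PrintJobID = p.PrintJobID;
        COMMIT;
    END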

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Yes, you're correct; if the procedure fails, running it again clears up whatever is left, so I am not worried about the process. I think the business owners are worried mainly because they want to sell the machines to other people, so if the deletion job fails and has only deleted half the records, there potentially wouldn't be a DBA on hand to sort out the issue.

    That's a good point about deleting parent and child records together. I have actually done it in table order, committing after clearing out each table and then moving on to the next. I'll change it to handle all the tables within each batch. That makes much better sense!

  • One other thing you could try is a small table (or series of tables) that gives the oldest valid date for each table. Then join to it.

    In other words, suppose you have the PrintJobs table and the Labels table, each with information you need. You create a Valid_Dates table with a single column. My queries go from:

    select printjobs, jobdate, job
    from PrintJobs

    to

    create table Valid_Dates ( valid_Date datetime );

    select printjobs, jobdate, job
    from PrintJobs
    inner join Valid_Dates
        on jobdate >= valid_Date

    Then you can adjust this date in Valid_Dates as you archive data. Perhaps before you remove data, reset the valid date for queries.

    You could even start replacing the current tables with views that use this technique. We used to do this with financial information, but in the opposite direction. All queries looked at valid dates, but we could load new data into the tables without it being considered until we were ready. That way, if we needed to change or delete/reload a data set, we could. When we were ready, we updated the date in the table to let queries see the new data.

    It's really a poor man's partitioning, but within a single table, letting us mark certain "rows" as valid with a join condition.
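    For instance, a view along these lines (using the illustrative column names above) could be dropped in where the table is queried today:

    create view dbo.vPrintJobs
    as
    select pj.printjobs, pj.jobdate, pj.job
    from PrintJobs as pj
    inner join Valid_Dates as vd
            on pj.jobdate >= vd.valid_Date;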
