Running DELETE on a Very Large Table

  • Hi All,

     

    I am currently testing an application that I have just finished writing. The application needs to purge old records from a table based on a date column.

    i.e. Delete from personprofile where join_date < getdate() - 7

    I am thinking of storing getdate() - 7 in a variable, instead of calling it directly in the query...

    PersonProfile currently has around 200m records in it. At the moment the delete takes ages, even though join_date has a clustered index on it:

    CREATE  CLUSTERED  INDEX [INDEX_join_date] ON [dbo].[PersonProfile]([join_date]) WITH  FILLFACTOR = 85

    ON [PERSON_FILEGROUP]

    At the moment it has to delete around 35m rows per delete operation.

    I am thinking of the following, but not sure which route I will follow:

    Batching the delete, so it removes around 100,000 rows at a time (rough sketch below).

    Removing the clustered index prior to deleting, though I'm not sure this will be needed.

    At the moment, the idea of using BCP to export what I need is ruled out, as there is no space.

    I also have non-clustered indexes on the table; I'm not sure whether removing them will do any good, i.e. make the deletes faster.
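
    Roughly what I have in mind for the batching option is the following (just a sketch in SQL 2000-style syntax; the 100,000 batch size is only a first guess):

    SET ROWCOUNT 100000                  -- batch size, to be tuned
    WHILE 1 = 1
    BEGIN
        DELETE FROM personprofile WHERE join_date < GETDATE() - 7
        IF @@ROWCOUNT = 0 BREAK          -- nothing left in range, so stop
    END
    SET ROWCOUNT 0                       -- always reset afterwards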

     


    Kindest Regards,

    John Burchel (Trainee Developer)

  • Another thing to check is that you do not have any foreign key constraints referencing this table, because the referencing tables need to be checked prior to the delete. If you do, set them to NOCHECK, or drop and recreate them. I can't think of a reason that dropping your clustered index would make this any faster; it would probably take considerably longer once you factor in rebuilding the index.
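
    For example, something like this (the referencing table and constraint names here are made up):

    -- disable checking on a constraint that references PersonProfile
    ALTER TABLE dbo.PersonOrders NOCHECK CONSTRAINT FK_PersonOrders_PersonProfile

    -- ... run the delete ...

    -- re-enable it afterwards (WITH CHECK re-validates the existing rows)
    ALTER TABLE dbo.PersonOrders WITH CHECK CHECK CONSTRAINT FK_PersonOrders_PersonProfile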

  • Keep the clustered index

    Delete in 25K-row batches.

    Back up the log more regularly (maybe even each loop if you are very low on space).

     

    I don't think removing a single index will really speed this up in the end (it's a lot of work to recreate it for 90% of the total rows, vs. just updating it as you go).
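
    For the log backup inside each loop, something along these lines (database name and path are placeholders, and this assumes full or bulk-logged recovery; in simple recovery a CHECKPOINT lets the space be reused instead):

    BACKUP LOG PersonDB TO DISK = N'E:\LogBackups\PersonDB_log.trn'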

  • Hi All,

    I did as above, i.e. deleted in 25k batches, and it was still taking ages to delete around 30m records out of a table with 300m records.

    The delete is done using a query that checks the date are older than a certain date.

    Is there anything else that can be done? I thought about selecting the records I want to delete into a temporary table and then matching on a unique key to do the delete, but thinking about that idea, the temporary table would be really massive size-wise.

    I also can't BCP out, as BCP'ing 270m rows of records would prove unpopular...

    Not sure whether the slowness is due to the clustered index on the table, as the clustered index needs to be maintained every time rows are deleted.
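
    What I had in mind for the temp-table route is roughly this (a sketch only; person_id is a made-up unique key, and the temp table would still hold around 30m keys):

    SELECT person_id
    INTO   #to_delete
    FROM   dbo.PersonProfile
    WHERE  join_date < GETDATE() - 7

    CREATE CLUSTERED INDEX IX_to_delete ON #to_delete (person_id)

    SET ROWCOUNT 25000
    WHILE 1 = 1
    BEGIN
        DELETE p
        FROM   dbo.PersonProfile p
        JOIN   #to_delete d ON d.person_id = p.person_id
        IF @@ROWCOUNT = 0 BREAK
    END
    SET ROWCOUNT 0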

     

     


    Kindest Regards,

    John Burchel (Trainee Developer)

  • Assuming the index on join_date is being used for the delete (check the plan), then I suspect you are just hitting the discs really hard during the delete. But also check other tables that reference this one with any foreign keys, as that will require a lot of checking for those constraints during the delete.
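
    For example, both of these can be run without touching the data (the DELETE is not executed while SHOWPLAN is on):

    SET SHOWPLAN_TEXT ON
    GO
    DELETE FROM dbo.PersonProfile WHERE join_date < GETDATE() - 7
    GO
    SET SHOWPLAN_TEXT OFF
    GO

    -- list any foreign keys that reference PersonProfile
    EXEC sp_fkeys @pktable_name = N'PersonProfile'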

    Do you have adequate log file space for the delete, or is it growing the log incrementally as it goes? That slows it down a lot.

    Put the log on a separate disc to the data, and think about the RAID level for the log: definitely get it off the same array as the data if they are shared, and preferably get the logs off RAID 5, which is expensive for writes. RAID 1 gives better performance, and RAID 1+0 better still.

    Mike

     

     

  • Thanks for your post Mike. I do agree with you on the issue of hardware. I also forgot to mention that there are no foreign keys referencing this table.

    Lastly, the db is in simple recovery mode, so logging shouldn't be an issue. I know the delete is hitting the disks hard, but that is normal.

    The problem now is how to speed up the delete operation...

    thanks

     


    Kindest Regards,

    John Burchel (Trainee Developer)

  • John,

    A couple of things. First, simple recovery does clear the transaction log, but only of committed transactions! So if you run 100,000 deletes as a "set rowcount 100000  delete from mytable where id = x", that is one transaction, and the log won't be cleared of it until it commits. If you're having big performance problems, try a series of batch sizes (100, then 1000), run each for 10 minutes, and see how quickly they delete. You can tune this to fit your system.

    I'd definitely avoid calling getdate() - 7 in the delete itself; just calculate it before you start your looping and use the variable. The index should help you perform the deletes here, since it means you only hit the pages that are relevant.
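
    Something along these lines, for example (a sketch only; each DELETE is its own transaction, so the log can clear between batches, and you swap the ROWCOUNT value between test runs):

    DECLARE @cutoff datetime, @start datetime
    SET @cutoff = GETDATE() - 7          -- calculated once, before the loop
    SET @start  = GETDATE()

    SET ROWCOUNT 1000                    -- try 100, 1000, 10000 and compare
    WHILE DATEDIFF(minute, @start, GETDATE()) < 10
    BEGIN
        DELETE FROM dbo.PersonProfile WHERE join_date < @cutoff
        IF @@ROWCOUNT = 0 BREAK
    END
    SET ROWCOUNT 0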

    Let us know if any of this works or what your results are.

  • I would also suggest you increase the fillfactor on your clustered index build. Assuming that, as people sign up, they are added at the end of the index based on the date value, you don't need free space in the other index pages; splits will always happen at the tail end. I would set the fillfactor to 100%, and on a 300m row table that should be a significant reduction in the total pages being searched.
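
    For example, rebuilding the index you posted in place (same definition, just a 100% fillfactor):

    CREATE CLUSTERED INDEX [INDEX_join_date] ON [dbo].[PersonProfile]([join_date])
        WITH FILLFACTOR = 100, DROP_EXISTING
        ON [PERSON_FILEGROUP]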

    Let us know how it works out.

    Tom

  • I don't know the reason you want to delete 7 days of data from the app, when it could be done using a SQL job...

    If I wanted to do it, here is my approach using the app...

    1. Create a DeleteHistory table with TableName, DeleteDays, DelSubmittedDate, DelCompletedDate, UserName, EmailId, etc.

    2. When there is a request to delete data from the app, insert a record into the table created in step 1.

    3. Create a procedure to delete the data requested in step 2. Make sure you delete 10,000-20,000 rows a batch, and after every batch use a WAITFOR DELAY to pause briefly so that anyone waiting on this table can get in. At the end of the proc, update the DelCompletedDate column so that you can track how long the delete is taking (see the sketch after this list).

    4. Create a job to execute the procedure created in step 3 and schedule it to run every minute.

    5. If you want, add an email alert to the user stating the delete completed...
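
    A rough sketch of steps 1 to 3 (all object names, batch sizes and delays are just examples):

    -- step 1: request/history table
    CREATE TABLE dbo.DeleteHistory (
        DeleteHistoryId  int IDENTITY(1,1) PRIMARY KEY,
        TableName        sysname NOT NULL,
        DeleteDays       int NOT NULL,
        DelSubmittedDate datetime NOT NULL DEFAULT GETDATE(),
        DelCompletedDate datetime NULL,
        UserName         sysname NOT NULL DEFAULT SUSER_SNAME(),
        EmailId          varchar(255) NULL
    )
    GO

    -- step 3: procedure the job runs; deletes in batches and pauses between them
    CREATE PROCEDURE dbo.usp_ProcessDeleteRequests
    AS
    BEGIN
        DECLARE @id int, @days int, @cutoff datetime

        SELECT TOP 1 @id = DeleteHistoryId, @days = DeleteDays
        FROM   dbo.DeleteHistory
        WHERE  TableName = 'PersonProfile' AND DelCompletedDate IS NULL
        ORDER  BY DelSubmittedDate

        IF @id IS NULL RETURN

        SET @cutoff = GETDATE() - @days

        SET ROWCOUNT 10000
        WHILE 1 = 1
        BEGIN
            DELETE FROM dbo.PersonProfile WHERE join_date < @cutoff
            IF @@ROWCOUNT = 0 BREAK
            WAITFOR DELAY '00:00:01'     -- short pause so blocked sessions can get in
        END
        SET ROWCOUNT 0

        UPDATE dbo.DeleteHistory
        SET    DelCompletedDate = GETDATE()
        WHERE  DeleteHistoryId = @id
    END
    GO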

     

     

    MohammedU
    Microsoft SQL Server MVP

  • John,

    Some more info would help. You say deleting 30m rows is taking a long time, but it is very difficult to judge what counts as "too long".

    Some info on row size would help us do some basic sums. Let's assume, for the sake of argument, a 2 KB row length. If that is right you will get about 4 rows per 8 KB page, so 30m rows works out at roughly 7.5m pages. Now, regardless of the recovery model, every one of those deletes MUST be written to the log before the relevant data page can be updated. So you will be doing at least 7.5m page reads and 7.5m page writes, plus the same again to the log. Regardless of your disc system, that number of accesses is going to take a while, and if you have the log on the same spindles as the data the disc drives will be rattling around like crazy.

    I'm also concerned that you have a db this size in simple recovery. I hate to generalise, but if your disc goes pop, can you really afford to go back to your last full or diff backup and lose all the updates since then? In simple mode that is effectively your exposure.

    So if you could post details of the row length, the disc system and RAID architecture, plus the time you are getting to delete the 30m rows, you may find what you are getting is actually quite reasonable. It's difficult to tell at the moment.

    Mike

     

     

  • In place of using the date range: if you had an identity key on the table indexed as the clustered index, you could use getdate() - 7 to find the key of the last record that should be deleted, and then delete everything prior to that. That way your range would be using the clustered index, which would be quicker.
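
    Something like this, for example (person_id is a made-up identity column used as the clustered key):

    DECLARE @maxId int

    -- highest surrogate key among the rows older than 7 days
    SELECT @maxId = MAX(person_id)
    FROM   dbo.PersonProfile
    WHERE  join_date < GETDATE() - 7

    -- range delete on the clustered key (still worth batching, as discussed above)
    DELETE FROM dbo.PersonProfile WHERE person_id <= @maxId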

    Carolyn

     

    Facts are stubborn things, but statistics are more pliable - Mark Twain
    Carolyn
    SQLServerSpecialists

  • One thing I haven't seen asked is how big is your log file? Do you shrink it after the delete operation? If so, what are your initial size and growth factor? If the file has to grow during the delete, this will be your biggest point of contention, outside of the other comments made.

    Furthermore, are you doing the deletes at a time when insert transactions are very low (preferably when none are taking place)? If not, consider changing your start time to avoid contention between processes. Consider other design options if possible.

    Example: I have a system with a lot of inserts during the week, Mon-Fri, and no record is updated afterwards. I need to clear transactions older than 13 months, but I want to avoid impact on the server as much as possible. What I devised was a routine using a bit column named "Deleted", defaulted to 0. Daily, I would set the column to 1 where the transaction was older than 13 months (alternatively you could code around these based on date, but we chose the bit column). The app code was adjusted to ignore any record with Deleted = 1, and at the weekend, when nothing was happening, we deleted those records from the system the same way you do. This way the issue is offset to a more appropriate time.
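
    A sketch of that approach (the table, column names and cut-off are illustrative):

    -- one-off: add the flag column
    ALTER TABLE dbo.Transactions ADD Deleted bit NOT NULL DEFAULT 0

    -- daily: flag rows past the retention period instead of deleting them
    UPDATE dbo.Transactions
    SET    Deleted = 1
    WHERE  Deleted = 0
      AND  TranDate < DATEADD(month, -13, GETDATE())

    -- weekend: physically remove the flagged rows, in batches
    SET ROWCOUNT 25000
    WHILE 1 = 1
    BEGIN
        DELETE FROM dbo.Transactions WHERE Deleted = 1
        IF @@ROWCOUNT = 0 BREAK
    END
    SET ROWCOUNT 0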

  • Yes, store the current date in a variable; you should get a boost just from that.

  • Have you considered the option of simply ignoring the dates older than X months on the app side? Since you were already doing some recoding, either way would have been viable (I think). Also, you wouldn't have to flag them at all, and you would still be able to run that delete any time you wanted!

     

    What's your take on that solution?

  • That is an option as well. I just found that with ours the Deleted column gave better performance on these items, plus I had the records available for recovery for that just-in-case scenario, which came up a few times.

    There are all kinds of options. Another is to delete them hourly versus once a day.
