Deleting Large Number of Records

  • I guess you could look at setting trace flags (1211 springs to mind) to prevent lock escalation from happening at all?

    But that could create another problem with insufficient lock memory, since every individual lock then has to be held...
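
    For anyone curious, a minimal sketch of what that would look like, purely for illustration; trace flag 1211 disables lock escalation entirely, which is exactly why the memory caveat above is real:

    -- Enable globally (-1), run the delete, then turn it back off.
    DBCC TRACEON (1211, -1);
    -- ... large batched delete here ...
    DBCC TRACEOFF (1211, -1);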

  • timothyawiseman (9/15/2009)


    Lynn Pettis (9/15/2009)


    I have to make two observations about this article. As of my latest check of the main page, there have been 7,992 views recorded for this article here, compared to 963 views in 90 days where it was originally published. Plus, there has been a nice discussion along with it, where there was none on the other site.

    I am quite happy! :w00t:

    Out of curiosity, where was it originally published?

    I had originally published this and one other article on sswug.org.

  • Interesting article. I took a slightly different approach when needing to delete almost 99.5% of rows from two tables with a total of 200 million rows between them. The tables are log tables used for history and troubleshooting.

    Because the rows to be kept were much fewer than the ones to be deleted, and because there is no need to log the operation (will take place at night after a full backup and with very low activity), and the criticality of individual rows is minimal, I chose to truncate the table and reinsert the rows to be kept. In my case, I get the "keep" rows from a query inserting into a temp table, but I could get them from a backup, which would also allow minimizing the use of tempdb. After gathering/obtaining these "keep" rows (takes a couple of hours in test), I add any rows that are added since the start of the gathering query (takes about 3 minutes in test), and then I truncate the table, insert the keepers, and reset the Identity value.

    Originally, I had considered a similar approach using BCP/SSIS, but wound up simply using temp tables, as the amount of data to be kept was manageable in our environment.


    Regards,

    Tore Bostrup
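
    A minimal sketch of the pattern Tore describes, using hypothetical names (dbo.EventLog with an identity column LogId); it assumes no foreign keys reference the table, since TRUNCATE would otherwise fail:

    -- 1. Capture the small set of rows to keep. The CAST strips the identity
    --    property so the temp table accepts plain inserts in step 2.
    SELECT  LogId = CAST(LogId AS bigint),
            EventDate,
            Message
    INTO    #Keep
    FROM    dbo.EventLog
    WHERE   EventDate >= DATEADD(DAY, -30, GETDATE());

    -- 2. Pick up any rows added while step 1 was running.
    INSERT INTO #Keep (LogId, EventDate, Message)
    SELECT  LogId, EventDate, Message
    FROM    dbo.EventLog
    WHERE   LogId > (SELECT MAX(LogId) FROM #Keep);

    -- 3. Truncate, reload the keepers with their original ids, and reseed.
    TRUNCATE TABLE dbo.EventLog;

    SET IDENTITY_INSERT dbo.EventLog ON;
    INSERT INTO dbo.EventLog (LogId, EventDate, Message)
    SELECT LogId, EventDate, Message FROM #Keep;
    SET IDENTITY_INSERT dbo.EventLog OFF;

    DBCC CHECKIDENT ('dbo.EventLog', RESEED);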

  • I agree; how you delete a large number of records also depends on how many records are being deleted. In the article I was deleting approximately 10% of the total number of records. I would definitely use a different process if the goal was to KEEP only 10% (or in your case much less) of the records.

  • Tore Bostrup-382308 (9/16/2009)


    Interesting article. I took a slightly different approach when needing to delete almost 99.5% of rows from two tables with a total of 200 million rows between them. The tables are log tables used for history and troubleshooting.

    Because the rows to be kept were much fewer than the ones to be deleted, and because there is no need to log the operation (will take place at night after a full backup and with very low activity), and the criticality of individual rows is minimal, I chose to truncate the table and reinsert the rows to be kept. In my case, I get the "keep" rows from a query inserting into a temp table, but I could get them from a backup, which would also allow minimizing the use of tempdb. After gathering/obtaining these "keep" rows (takes a couple of hours in test), I add any rows that are added since the start of the gathering query (takes about 3 minutes in test), and then I truncate the table, insert the keepers, and reset the Identity value.

    Originally, I had considered a similar approach using BCP/SSIS, but wound up simply using temp tables, as the amount of data to be kept was manageable in our environment.

    Nice article, Lynn, it's in my briefcase now!:-)

    Tore Bostrup,

    I completely agree with this approach. I find myself in this situation very often, and it basically applies to log tables. (I have wondered many times in the past whether I'm logging too much information...)

    I think that write-intensive, non-business-critical data (i.e. logs) fits better in a separate database with simple recovery: it makes everything easier to maintain.

    -- Gianluca Sartori
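
    A quick sketch of the separate-database idea, with a hypothetical database name and everything else left to defaults:

    CREATE DATABASE AppLogs;
    GO
    -- Simple recovery: the log is truncated at checkpoints, so purging the
    -- logging tables never has to be squeezed through a FULL-recovery log chain.
    ALTER DATABASE AppLogs SET RECOVERY SIMPLE;
    GO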

  • Sorry for double posting, looks like the site has some issues.

    It says an error has occurred, but the post is added.

    -- Gianluca Sartori

  • kjs714 (9/15/2009)


    I have not run across anything more definitive than what is in BOL. I recommend that you set up a test database and run Profiler, looking for the Lock Escalation event to find out when it occurs. I've seen it at strange numbers like 1889, even though the number of row locks taken was well below what BOL states. In the context of the article, 10,000 will escalate to a table lock if no other processes have a blocking lock.

    SQL Server Storage Engine Team: Lock Escalation

    That should clear things up for you.

    edit: regarding trace flags - the same article also compares TF 1211 with 1224. The latter is often preferred in the infrequent cases where it is required. Taking and holding a conflicting lock in another session can be problematic, especially in the circumstances Lynn describes in his article: the open transaction will prevent the log backup from achieving its goals...
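
    For reference, a hedged sketch of the two alternatives mentioned there; the table name is a placeholder:

    -- Trace flag 1224 (global): escalation by lock count is disabled, but the
    -- engine can still escalate under lock-memory pressure, which is why it is
    -- usually preferred over 1211.
    DBCC TRACEON (1224, -1);

    -- SQL Server 2008 and later: control escalation per table instead.
    ALTER TABLE dbo.EventLog SET (LOCK_ESCALATION = DISABLE);  -- hypothetical table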

  • So, overall, a good article and another interesting contribution to the SSC archives from Lynn 🙂

    Paul

  • Ol'SureHand (9/15/2009)


    rob.lobbe (9/15/2009)


    This works if you are the ONLY one using the database.

    Running in a production system you can't just go about changing the recovery model.

    Not so sure about it.

    MSDN (http://msdn.microsoft.com/en-us/library/ms190203(SQL.90).aspx) says nothing to imply one cannot change recovery models in full flight. I have changed production DBs' recovery models with no adverse effects... OK, I had to, as someone created them in FULL mode but did not set up transaction log backups...

    Does anyone have reliable info to the contrary (except for general principles etc)?

    Of course it's possible to change the recovery model in mid-flight. It's also possible to nail your foot to the floor. Doesn't mean either one is a good idea.

    If a database needs to be in full recovery so that point-in-time restores can be done, changing it to simple recovery in order to expedite a large delete process may not be a wise thing to do. In that case, "can't" doesn't refer to physical possibility, it refers to business needs.
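
    To make that concrete, here is a hedged sketch of what the switch involves (hypothetical database name and backup path); the point is that the log chain is broken the moment you go to SIMPLE:

    -- No point-in-time restore is possible for the period spent in SIMPLE recovery.
    ALTER DATABASE Sales SET RECOVERY SIMPLE;

    -- ... run the large, batched delete here ...

    -- Switch back, then take a full (or differential) backup immediately so that
    -- subsequent log backups have a base to restore from.
    ALTER DATABASE Sales SET RECOVERY FULL;
    BACKUP DATABASE Sales TO DISK = N'X:\Backups\Sales_after_purge.bak';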

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Simon-413722 (9/15/2009)


    I've been using pretty much the same concept for data deletion for quite some time now. It does the same job as Lynn's code, just done a bit differently. I'll post the code anyway:

    while 1 = 1
    begin
        delete top (???) from tablexxx   -- (???) = batch size of your choice
        if @@ROWCOUNT = 0 break          -- exit from loop (or do something)
    end

    The code above doesn't require the row count to be stored in a variable and read back.

    Simon

    You can also use:

    select 1;
    while @@rowcount > 0
        delete top (1000)
        from table
        where x = y;

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Hi Lynn,

    Nice article and an interesting discussion.

    I am wondering if any folks who have been following this discussion can explain or corroborate this phenomenon:

    Last year I was working on a large(ish) ETL project. I was experimenting with large transactions and using batching techniques to control the batch sizes. We had SIMPLE recovery mode and single-user access. I found that there seems to be an optimal batch size beyond which performance degrades. The mystery to me is that performance degrades as batch size increases even when there is plenty of free disk space for the transaction log. The degradation seemed to begin at transaction volumes between about 500 MB and 4 GB, and it was beyond what would be expected from disk file allocation for the transaction log.
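
    A rough way to compare batch sizes (hypothetical staging table and filter): time a full purge at several batch sizes and compare the elapsed times.

    DECLARE @batch int, @start datetime;
    SET @batch = 50000;          -- vary this between runs
    SET @start = GETDATE();

    WHILE 1 = 1
    BEGIN
        DELETE TOP (@batch) FROM dbo.Staging WHERE Processed = 1;
        IF @@ROWCOUNT = 0 BREAK;
    END

    SELECT DATEDIFF(SECOND, @start, GETDATE()) AS ElapsedSeconds;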

  • Hi, why not create an empty table and insert into it only the data that must be kept, together with all supporting objects like triggers and so on? Then do a name swap between the two tables and that's it. Of course, there will be a short "blank" moment for users, but if it is done together with other maintenance jobs... Best regards. Michel
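
    A hedged sketch of the swap Michel describes, with hypothetical names; in practice you would also recreate the indexes, triggers, and constraints on the new table before swapping:

    -- Build the replacement table containing only the rows to keep.
    SELECT *
    INTO   dbo.EventLog_new
    FROM   dbo.EventLog
    WHERE  EventDate >= DATEADD(DAY, -30, GETDATE());

    -- (Recreate indexes, triggers, and constraints on dbo.EventLog_new here.)

    -- Swap the names inside a short transaction; users see only a brief gap.
    BEGIN TRANSACTION;
        EXEC sp_rename 'dbo.EventLog', 'EventLog_old';
        EXEC sp_rename 'dbo.EventLog_new', 'EventLog';
    COMMIT TRANSACTION;

    -- Drop dbo.EventLog_old once you are satisfied nothing was missed.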

  • Hi guys and girls

    I'd like to know why no-one's considered the delete cursor...

    DELETE
        [ FROM ]
        { table_name WITH ( < table_hint_limited > [ ...n ] )
          | view_name
          | rowset_function_limited
        }
        [ FROM { < table_source > } [ ,...n ] ]
        [ WHERE
            { < search_condition >
              | { [ CURRENT OF
                    { { [ GLOBAL ] cursor_name }
                      | cursor_variable_name
                    }
                  ] }
            }
        ]
        [ OPTION ( < query_hint > [ ,...n ] ) ]

    Is there no batchability?

    So long, and thanks for all the fishpaste 😉
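
    For completeness, a sketch of the WHERE CURRENT OF form that syntax allows, against a hypothetical dbo.EventLog table. It works, but it removes one row per fetch, so there is no real batchability and it is usually far slower than the set-based TOP (n) loops discussed earlier:

    DECLARE @EventDate datetime;

    DECLARE del_cur CURSOR LOCAL FORWARD_ONLY
    FOR SELECT EventDate
        FROM   dbo.EventLog
        WHERE  EventDate < DATEADD(YEAR, -1, GETDATE())
    FOR UPDATE;

    OPEN del_cur;
    FETCH NEXT FROM del_cur INTO @EventDate;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        DELETE FROM dbo.EventLog WHERE CURRENT OF del_cur;  -- one row at a time
        FETCH NEXT FROM del_cur INTO @EventDate;
    END

    CLOSE del_cur;
    DEALLOCATE del_cur;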

  • Hi Lynn, just now reading this article due to the re-publish. If you have a table with a high rate of transactions, what do you think of using the READPAST query hint to reduce locking even further?
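
    A hedged sketch of what that might look like (hypothetical table and retention filter). READPAST skips rows that other sessions currently have locked instead of waiting on them, so each batch deletes whatever it can reach and the skipped rows are picked up on a later pass; it requires READ COMMITTED or REPEATABLE READ isolation:

    WHILE 1 = 1
    BEGIN
        DELETE TOP (1000)
        FROM   dbo.EventLog WITH (ROWLOCK, READPAST)
        WHERE  EventDate < DATEADD(YEAR, -1, GETDATE());

        IF @@ROWCOUNT = 0 BREAK;  -- nothing left that this pass could lock
    END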
