Options for deleting millions of rows from a heap?

Question

Options for deleting millions of rows from a heap?

jasona.work

SSC Guru

Points: 50083
More actions
March 27, 2013 at 8:22 am

#275637

One of the Devs I work with "inherited" a DB that started life back in the '80s with ISAM and such. It's only a couple tables, but one has ~500M rows, and weekly he needs to delete several million rows from it.
This table has NO PK / Clustered Index, nor are any of the columns (or group of columns) a good choice for such (every column has duplicates) Also thankfully, there are no FKs on any of the tables.
Right now, the Dev is modifying his delete routine to read in a couple thousand items from one table, delete them, then go and grab the next couple thousand, wrapping the delete in a transaction to help keep the log from ballooning out of control...
Myself and another Dev have been pushing him to add an identity column as a PK / CI to the table, so that he could populate his delete table with the identity value (hopefully speeding up the process) Of course, with a couple hunded million rows, it would take a while to populate that ID column...
Thanks all,
Jason

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

SQLRNNR SSC Guru Points: 281334 More actions · Answer 1

The ident column is a good option. The one time build of it would be a tradeoff in time that will eventually be recovered by the process running faster.

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

jasona.work SSC Guru Points: 50083 More actions · Answer 2

Thanks, I figured that was going to be the best option, now to get the Dev to eat the pain of taking care of this...

With help...

🙂

Jason

Erin Ramsay SSCrazy Eights Points: 9728 More actions · Answer 3

What's the criteria by which the rows are being deleted? I'm note sure I see the point of putting an identity column on the table if he's grabbing them by some other criteria. Unless he's grabbing the rows BY the identity column initially I don't see much gain here. It's going to be

1) grab all the rows that match his criteria

2) grab the identities of those columns to populate another table

3) delete from the main table with a join to the table holding the identity table

Wouldn't it make more sense to put an ordered index on the columns by which his criteria selects rows for deleltion and then just delete the top 10000 at a time? (or a number that doesn't over-tax the system, obviously)

Having said that.. and Identity column is a good thing and should be there but I'm not sure the process you want to put in place would be the most efficient.