delete millions of rows

  • I have a table which has 5 million records and I need to delete 4 million of them in production.

    What is the best approach to do it?

  • You might want to have a backup copy of that table before you do such drastic surgery, but if you're sure you want to get rid of 80% of it....

    I like to delete small sets of rows at a time, using a WHILE loop. In this case the rows to be deleted have a definite flag, but the delete condition could obviously be based on multiple columns. When the row count of deletions falls below the amount specified in @DeletionBatchSize, all deletions have been completed and the WHILE loop ends. You can use the OUTPUT clause to save the deletions, or just save the maximum key value deleted during each loop.

    create table #ToBeDeleted (N int primary key, DeleteFlag bit);

    insert into #ToBeDeleted
    select top (500000) N, case when N % 2 = 0 then 1 else 0 end -- flag all even rows for deletion
    from dbo.vTally;

    declare @rc int = 5001                 -- primed so the loop body runs at least once
          , @DeletionBatchSize int = 5000
          , @TotalRowsDeleted int = 0;

    while @rc >= @DeletionBatchSize
    begin
        delete top (@DeletionBatchSize) from #ToBeDeleted
        where DeleteFlag = 1;

        set @rc = @@ROWCOUNT;
        set @TotalRowsDeleted = @TotalRowsDeleted + @rc;
    end;

    select @TotalRowsDeleted as TotalRowsDeleted;
    select count(*) as RowsRemaining from #ToBeDeleted;
    select top (100) * from #ToBeDeleted;
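
    If you do want to keep a record of what was removed (the OUTPUT option mentioned above), a minimal sketch would be to create an archive table before the loop and swap the DELETE inside the loop for something like the following. The #DeletedArchive table and its DeletedAt column are just placeholders for illustration:

    create table #DeletedArchive (N int primary key, DeleteFlag bit, DeletedAt datetime2 not null default sysdatetime());

    delete top (@DeletionBatchSize) from #ToBeDeleted
    output deleted.N, deleted.DeleteFlag into #DeletedArchive (N, DeleteFlag)  -- capture each batch of deleted rows
    where DeleteFlag = 1;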

    Be sure to test the WHERE clause that selects the rows to be deleted thoroughly before unleashing the Delete. Use it in a SELECT to be sure that the rows being selected are ALL rows you want to delete.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • coool_sweet (9/9/2016)


    I have a table which has 5 million records and I need to delete 4 million of them in production.

    What is the best approach to do it?

    DELETING 4 million rows is much more arduous than INSERTing 1 million rows and that's just the start. Think about the hell that will be paid by your log file even if you're not in the FULL Recovery Model.

    My recommendation would be to copy the million rows you're keeping to a new table that already has its Clustered Index, using Minimally Logged methods (read about those in "Books Online") with INSERT INTO ... WITH (TABLOCK)/SELECT, and then add any Non-Clustered Indexes (also Minimally Logged if not in the FULL Recovery Model). Then do a quick rename of the tables so that the new table carries the old name, do a final "true up" from the old table (which has been ONLINE and fully functional the whole time and could have picked up new rows/updates), and simply drop the old table after some period of time, once you're sure there are no user complaints. Don't forget to change back to the FULL Recovery Model, if that's what the original condition of the database was.
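
    To make that concrete, here's a rough sketch of the sequence under those assumptions. The database name, the table names, the columns (ID, CreatedOn, Payload), and the KEEP criteria below are all made up for illustration; this is not the OP's actual schema.

    -- Bracket the minimally logged work with log backups to limit the point-in-time exposure.
    BACKUP LOG MyDb TO DISK = 'MyDb_before_purge.trn';      -- placeholder path
    ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED;

    -- New table with the clustered index already in place, then a minimally logged load.
    CREATE TABLE dbo.BigTable_New
    (
        ID        int          NOT NULL,
        CreatedOn datetime2    NOT NULL,
        Payload   varchar(500) NULL
    );
    CREATE UNIQUE CLUSTERED INDEX CIX_BigTable_New ON dbo.BigTable_New (ID);

    INSERT INTO dbo.BigTable_New WITH (TABLOCK) (ID, CreatedOn, Payload)
    SELECT ID, CreatedOn, Payload
    FROM   dbo.BigTable
    WHERE  CreatedOn >= '20160101'      -- i.e. the ~1 million rows you want to KEEP
    OPTION (RECOMPILE);                 -- works around the undocumented minimal-logging blockers mentioned below

    -- Add any nonclustered indexes here, then swap the names.
    BEGIN TRAN;
        EXEC sp_rename 'dbo.BigTable',     'BigTable_Old';
        EXEC sp_rename 'dbo.BigTable_New', 'BigTable';
    COMMIT;

    -- "True up" from dbo.BigTable_Old here, then DROP it later once you're sure nothing is missing.

    ALTER DATABASE MyDb SET RECOVERY FULL;                   -- assuming FULL was the original setting
    BACKUP LOG MyDb TO DISK = 'MyDb_after_purge.trn';        -- placeholder path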

    There is an impact on the ability to do a Point-in-Time restore for any log file backup that contains any Minimally Logged action, but you can minimize that possibility by taking a log file backup just before you change from FULL to BULK LOGGED recovery and then taking another log file backup right after you've changed back to FULL Recovery.

    Note that this method builds the Clustered Index as the data is copied to the new table. That slows the copy down a bit but, especially for wide tables, it can be quite a bit faster than using SELECT/INTO followed by the addition of the Clustered Index, and it won't temporarily "double the space" used the way converting a HEAP into a Clustered Index Table does.

    Here are some links concerning what is needed for Minimal Logging.

    https://technet.microsoft.com/en-us/library/ms191244(v=sql.105).aspx

    http://sqlmag.com/t-sql/minimally-logged-inserts

    There ARE some undocumented things that can and will prevent Minimal Logging during this evolution but the workaround that I've discovered is that using OPTION(RECOMPILE) in the INSERT/SELECT query easily overcomes those undocumented problems without adding any extra overhead.

    Ah... don't forget about any foreign keys pointing at the table, although I can't imagine why anyone would point FKs at such a large and volatile table.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff, wouldn't you have to stop activity on the ORIGINAL copy of the table to bring it current with transactions that occur not only during the initial copy but also during the "trueing up" process?

    I made the assumption that it is a fairly active transaction table. Should have asked for more detail.

    1. Does this table have active transactions taking place against it?

    2. Will the cleanup take place during active hours or during a maintenance window?

    3. Is the table partitioned? If so, on what column?

    4. What is the clustered index key?

    5. What are the deletion criteria?

    6. Does this table have foreign key relationships with other tables?

    I feel brain-tired for not thinking about simply making a new subset copy of the table.

    Given a maintenance window that would be ideal.

    Edited to change "new copy" to "ORIGINAL copy" and to add questions.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (9/9/2016)


    Jeff, wouldn't you have to stop activity on the ORIGINAL copy of the table to bring it current with transactions that occur not only during the initial copy but also during the "trueing up" process?

    No... not during the initial copy. For the true up, you will have effectively taken the old table offline and brought the new table online through the renaming process. That's when you can resolve any differences that occurred during the copy.

    To be honest, though, and if the system could withstand it, I'd lock both tables for the 20-30 seconds and just get it done. The 20-30 seconds comes from a demo I wrote 2 days ago where I did something similar to 7 million 1,040-byte-wide rows in 2 minutes and 34 seconds. YMMV, especially if the Minimal Logging fails, either because you didn't do it right or because you've run into one of the undocumented problems that I've hit in the past (most of which can be overcome by including OPTION(RECOMPILE) in the INSERT/SELECT). The really cool part was that I didn't have to wait the 20 minutes it took the previous demo to rebuild the clustered index, didn't blow out the MDF doing so, and the piddly 100MB log file only grew to 200MB during the entire process. Because it only logged extent usage, it was nasty fast for all that it did.

    And I agree with your other questions but I'd bet credits to Navy beans that this is being done on some form of temporal audit table or something like an invoice table with similar qualities (WORM table) with a temporal column of some sort. If the table was partitioned, the OP probably wouldn't be asking the question but, I agree... I could be wrong there.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • As a bit of a sidebar, the use of a table-wide EXCEPT with temporal criteria is VERY quick at resolving differences between two tables for true ups and is one of those rare places where NULL = NULL without making any settings changes.
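
    A minimal sketch of what that true-up query could look like, using made-up table and column names (and a hypothetical ModifiedOn column to supply the temporal criteria):

    DECLARE @CopyStartTime datetime2 = '2016-09-09T08:00:00';   -- when the bulk copy began (placeholder)

    -- Rows added or changed in the still-live old table since the copy started
    -- and not present in the new one; feed these into an INSERT or MERGE as appropriate.
    -- EXCEPT treats NULL as equal to NULL, so no ISNULL gymnastics are needed.
    SELECT ID, CreatedOn, Payload
    FROM   dbo.BigTable_Old
    WHERE  ModifiedOn >= @CopyStartTime       -- temporal criteria keep the compare small
    EXCEPT
    SELECT ID, CreatedOn, Payload
    FROM   dbo.BigTable;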

    Another advantage of doing it this way is that you can keep the original table for a day or two, just to be sure that nothing was missed. It does take more planning than the fine delete-loop code you posted, though.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I'll admit the loop is going to run a long time, but you don't have to worry about missing a constraint, nonclustered index, foreign key, etc. On the other hand, the loop may produce a LOT of fragmentation that will need to be cleaned up.

    Have you tried EXCEPT on wide tables (200-300 columns)?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (9/10/2016)


    I'll admit the loop is going to run a long time, but you don't have to worry about missing a constraint, nonclustered index, foreign key, etc.

    I absolutely agree. As with all else, "It Depends" a whole lot on what type of table you're working with. If it's a WORM table, such as an audit table, you don't have to worry so much about constraints or FKs because there should never be such things on audit tables.

    Of course, there should be some form of partitioning on such tables so that you can easily drop old, no-longer-wanted data. Since audit tables are usually also the biggest tables in the database, partitioning would help in other ways such as reduction of index maintenance and elimination of nightly backups on partitions that have not had any data changes (I normally set all but the current and next month partitions to READ_ONLY after a final backup).
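
    For anyone not familiar with that pattern, a skeletal (and entirely hypothetical) sliding-window purge looks something like this; the staging table must have the same structure and sit on the same filegroup as the partition being switched out:

    -- Switch the oldest month out of the big table; this is a metadata-only operation.
    ALTER TABLE dbo.AuditLog
        SWITCH PARTITION 1 TO dbo.AuditLog_Purge;

    DROP TABLE dbo.AuditLog_Purge;      -- the whole month disappears without row-by-row deletes

    -- Remove the now-empty boundary so the partition function slides forward.
    ALTER PARTITION FUNCTION pf_AuditLogMonthly()
        MERGE RANGE ('20160101');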

    On the other hand, the loop may produce a LOT of fragmentation that will need to be cleaned up.

    If it's an audit table or something like an invoice detail table, then those are normally temporally based and probably won't cause any fragmentation or wasted disk space. The looping DELETE method works just fine for those and will usually not cause any blocking unless escalation has made it to the table level.

    As you say, though, they will be slower and they will be more intensive, especially for the log file and, if the database is in something other than the SIMPLE Recovery Model, will also increase the size of backups.

    As a bit of a sidebar, against my initial better judgment (we've all been brainwashed here), I've not rebuilt or reorganized any indexes on my primary server since the 17th of January. Not only has performance not degraded, it has actually improved. Brent Ozar has a whole series of articles and YouTube videos on the subject and, since we had major blocking every Monday after the Sunday index maintenance (caused by page splits of the newly defragmented tables... even some that had an 80% Fill Factor), I reluctantly gave it a try and, by-gum-and-by-golly, it worked great for me. Of course, you absolutely need to continue rebuilding stats, which is what the index maintenance usually did. I found it wasn't the defragmentation that worked so well... it was the rebuilding of stats caused by the index maintenance job that was doing the trick.
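
    If you do drop the rebuild/reorganize step, the statistics piece can run on its own schedule; something as simple as the following (the table name is a placeholder) covers it:

    -- Refresh statistics without touching index fragmentation at all.
    EXEC sys.sp_updatestats;

    -- Or, for a specific hot table, a targeted full-scan update:
    UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;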

    Of course, I'm also pretty religious about giving high-volume tables Clustered Indexes whose keys are narrow, ever-increasing, immutable, unique, and not null. Only my totally static reference tables violate the "ever-increasing" rule. That pretty much keeps me from ever having to worry about large amounts of "empty space" in a clustered table. For the NCIs on larger tables, NOT rebuilding indexes has caused a sort of "natural FILL FACTOR" due to the natural page splits that were killing performance on Monday mornings. It's been a truly amazing experiment. CPU usage and disk I/O dropped significantly of their own accord in just the first two weeks after I started the experiment.

    Have you tried EXCEPT on wide tables (200-300 columns)?

    I have not. My widest table is "only" 138 columns wide and I'm pretty ticked off at that, especially considering all the bloody columns that have 0 or very few entries. 😉 I can't imagine how EXCEPT would be worse than trying to hand-code a compare to the same level of effectiveness as EXCEPT, though. I'd also hate to be the poor slob that had to write such code. Imagine all the ISNULLs and ORs. :sick: You'd also still need a key to do a join with. Although I can't imagine not having such a key in the data, EXCEPT doesn't require it or a join because it looks at the whole row auto-magically. Not to sound like a kid on candy, but it's pretty cool.

    I have recently used EXCEPT to find differences between downloaded "snapshot" files of 129K rows of 40 columns each. It took 9 seconds. Considering how easy it was to write the code to do those snapshot comparisons compared to writing explicit code, I stamped it as "Hell yeah! Good enough" and went on to the next problem. 😀

    As a bit of a sidebar there, they (folks where I work) were talking about writing some managed code to do the same thing, or trying to figure out how to get SSIS to do the differential comparison. It took me about 6 hours, womb to tomb, to write the fully productionalized code to automatically find the latest file and the previous latest file, import the two files (complete with row- and column-level error reporting in case they changed the format of the import file or had some bad data), make the differential comparison, and update the final table (total run time is 11 seconds for two 129K-row files). They've (the same people that identified the requirement) decided that they really want to do it using either SSIS or managed code (even though I've demonstrated the code I wrote to them), but they still haven't even identified a plan to do it, never mind actually writing any code or building a Proof-of-Principle in SSIS... and the requirement for this differential comparison reared its ugly head two months ago. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    I have not. My widest table is "only" 138 columns wide and I'm pretty ticked off at that, especially considering all the bloody columns that have 0 or very few entries. Wink I can't imagine how EXCEPT would be worse than trying to hand-code a compare to the same level of effectiveness as EXCEPT, though. I'd also hate to be the poor slob that had to write such code. Imagine all the ISNULLs and ORs.

    We are going to test EXCEPT this week, against a job that is using HASH totals to compare rows. I'm not comfortable with it, because only one HASH is being calculated and I fear false positives. I'll let you know how that works out.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (9/11/2016)


    I have not. My widest table is "only" 138 columns wide and I'm pretty ticked off at that, especially considering all the bloody columns that have 0 or very few entries. Wink I can't imagine how EXCEPT would be worse than trying to hand-code a compare to the same level of effectiveness as EXCEPT, though. I'd also hate to be the poor slob that had to write such code. Imagine all the ISNULLs and ORs.

    We are going to test EXCEPT this week, against a job that is using HASH totals to compare rows. I'm not comfortable with it, because only one HASH is being calculated and I fear false positives. I'll let you know how that works out.

    Perhaps a little of both would work. Done properly, HASH is pretty fast and, although I'd also be uncomfortable with the possibility of false positives, you might be able to use the false positives against an EXCEPT for final determination. Best of both worlds, so to speak.
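
    A sketch of that "little of both" idea (the table and column names are invented, and note this only double-checks the rows the hash flags; a hash collision that hides a real change would still slip through, which is the risk being described):

    -- Step 1: cheap pre-filter - keys where the single hash says the rows differ.
    SELECT s.ID
    INTO   #Candidates
    FROM   dbo.Snapshot_New s
    JOIN   dbo.Snapshot_Old o ON o.ID = s.ID
    WHERE  HASHBYTES('SHA2_256', CONCAT(s.Col1, '|', s.Col2, '|', s.Col3))
        <> HASHBYTES('SHA2_256', CONCAT(o.Col1, '|', o.Col2, '|', o.Col3));

    -- Step 2: EXCEPT makes the final determination on just those candidate keys.
    SELECT ID, Col1, Col2, Col3
    FROM   dbo.Snapshot_New
    WHERE  ID IN (SELECT ID FROM #Candidates)
    EXCEPT
    SELECT ID, Col1, Col2, Col3
    FROM   dbo.Snapshot_Old
    WHERE  ID IN (SELECT ID FROM #Candidates);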

    Of course, it would also be interesting to see how EXCEPT fared on such wide tables. Got any stats on how long it takes to set up the HASH and then use it?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    4 million rows is actually not a lot, but it depends on the speed of your storage system and on whether the table needs to remain online and usable while the deletion is in process. Essentially, you want to delete the rows in batches to reduce the size of active transaction logging. The following example deletes from a table in 100,000-row batches and waits for 1 minute between each delete operation. That wait window gives the storage system additional time to keep up and prevents extended periods of blocking by allowing other user processes an opportunity to start and complete. When there are no more rows to be deleted (the last delete operation affected no rows), it breaks out of the loop. The number of rows per batch and the wait time between batches can be adjusted as needed.

    WHILE 1 = 1
    BEGIN;
        DELETE TOP (100000) FROM MyTable WHERE <delete criteria>;
        IF @@ROWCOUNT = 0 BREAK;
        WAITFOR DELAY '00:01:00';
    END;

    Also, it helps a lot if the delete criteria is indexed.
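
    For example, if the criteria were a simple date cutoff on a hypothetical CreatedOn column, a supporting index keeps each TOP(100000) batch from scanning the whole table:

    CREATE NONCLUSTERED INDEX IX_MyTable_CreatedOn ON dbo.MyTable (CreatedOn);

    DELETE TOP (100000) FROM dbo.MyTable WHERE CreatedOn < '20150101';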

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • The Dixie Flatline (9/10/2016)


    I'll admit the loop is going to run a long time, but you don't have to worry about missing a constraint, nonclustered index, foreign key, etc. On the other hand, the loop may produce a LOT of fragmentation that will need to be cleaned up.

    Have you tried EXCEPT on wide tables (200-300 columns)?

    I'd suggest setting it up to run permanently.

    At the end of the day, it's not about deleting 4 million rows out of 5 million.

    It's about deleting "historical data" which has become irrelevant.

    So, in the DELETE statement, define the criteria for the "irrelevant" data and set it running as a job every hour or so (a sketch follows below).

    As soon as another chunk of data becomes irrelevant, the job will take care of removing it.

    If such a chunk happens to be big, the loop will delete it block by block - slowly but steadily.

    As for fragmentation - it's not about the loop.

    It's about choosing a correct clustered index.

    If the clustered index definition matched the definition of "old" data, there would be barely any fragmentation caused by the looping deletion.
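
    A sketch of that permanent job, wrapping the batched delete in a procedure that a SQL Agent job could run every hour. The table, the CreatedOn column, and the 12-month retention rule are all placeholders:

    CREATE PROCEDURE dbo.PurgeIrrelevantRows
    AS
    BEGIN
        SET NOCOUNT ON;

        DECLARE @Cutoff datetime2 = DATEADD(MONTH, -12, SYSDATETIME());  -- "irrelevant" = older than a year

        WHILE 1 = 1
        BEGIN
            DELETE TOP (5000) FROM dbo.MyTable
            WHERE CreatedOn < @Cutoff;

            IF @@ROWCOUNT = 0 BREAK;      -- nothing qualifies right now; the next hourly run picks up newly "old" rows

            WAITFOR DELAY '00:00:05';     -- brief pause so other sessions get a turn
        END;
    END;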

    _____________
    Code for TallyGenerator

    If the clustered index definition matched the definition of "old" data, there would be barely any fragmentation caused by the looping deletion.

    That's a big IF, Sergiy. Old data is often not defined by the date a row was created, but rather by when it was flagged as inactive. Nobody would build a clustered key on when a row is flagged as inactive.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • you might be able to use the false positives against an EXCEPT for final determination

    Jeff, how am I going to know if it's a false positive or not, without doing a column by column comparison?

    By the way, on a dedicated production-strength development server, EXCEPT took an hour and a half to handle about 2.5 million rows. The rows in question consisted of 150 columns mixed between character data, medium to long varchars, dates, and integers, with many of the columns being sparse.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • All forum discussions boil down to a debate over clustering keys. 🙂

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
