Mistake #1 - Downtime
Definitely my fault, and I should have known better. I had come to Denver to work at a small company, and a relatively small database. This was in 1999, and we had a 20 or 30GB database, a decent size at the time, but not huge. However we had performance issues, and we had fires to fight every day.
We had a few very large tables that were used for many joins and written to constantly. On SQL Server 6.5, there was a limited amount of things we could do to improve things, especially in the short term, and about half my time was patching things. The other half was slowly improving things. I thought I was getting a handle on things, which is probably the wrong attitude. I should have been looking to see what I didn't know.
One of the issues was archiving old data. I finally convinced someone that we could archive some data to another database and use a view to join the current table with the archive table when we needed to. My plan was to archive the data over and get it done quickly. I decided over lunch I could move some number of hundreds of thousands of rows from one database to another. We had fairly fast servers, and based on testing, I figured it would be 20-30 minutes tops.
I should have tested it out in the QA environment. They were busy, but I should have kicked them off and really stressed the system. I had wrapped an insert and a subsequent delete in a transaction to ensure things moved. They did, but I found the delete was a problem. It ran slowly, which didn't make logical sense, but deleting 100k rows > 10 * (delete 10k rows). Much, much larger, by a factor of 10 at least.
Needless to say I had things locked so I couldn't check progress. When people came back from lunch (20 minutes past my deadline), I kept telling them it would be any minute. Eventually the disks filled with logs, things rolled back, and we barely got the system back 90 minutes later. All of our customer people had to scramble to get work done by the end of the day. I didn't make many friends that day.
If I'd tested it at anything close to scale I would have realized that there was an issue and batched things up in groups of 10k, or even 5k, rows.