SQLServerCentral Article

Worst Day as a DBA Story Finalists


Update: Thanks to everyone who voted; UPS Karate Kid is the winner.

Following on from the first article in the 5 Worst Days of a DBA article series, we asked DBAs to share their worst day as a DBA with us. Responses flew in, and we've whittled them down to our 5 favorites so the community can vote for the best story. Understandably, many of the entrants have chosen to remain anonymous; please don't let that affect your vote. Voting closes Friday, March 28.

Story 1 - UPS Karate Kid

The 10TB database I was supporting required almost 24 hours to process the last 24 hours of data. A storm knocked out power to the server facility for several days. The data center folks had not properly tested their UPS battery backup solution, so they were not aware that the UPS batteries were out of warranty (and failed to come online for most of the racks). When power was restored, 2 of the 3 drive arrays for my database server had multiple drive failures (in one 14-bay array we lost 8 drives; in the other we lost 7). This required a server rebuild, but since this had occurred on multiple racks the rebuilds were triaged (and due to the service contract my server was at the bottom).

When I found out that the server would not be rebuilt for another 3-4 days, I attempted to perform the rebuild personally. I had access to the facility, so I went there with replacement drives to rebuild. Unfortunately, when the facility folks found me working on the server they got really upset (read this as: they called security, because they felt I should wait until they got to the work themselves). My stressed-out response was something along the lines of "f** off". Security came and informed me that my access had been revoked and I was "invited" to leave. My second stressed-out response was something along the lines of "if my access had been revoked, then the facility had violated the support contract, and if so I was taking the server with me".

NOTE: I hadn't slept for a couple of days due to the power outage. When security heard this, one enterprising security fellow informed me that he would eject me physically. My witty response was "whatever". He took this as an invitation to try. He placed a hand on my shoulder and I (in my exhausted state) responded without thinking. The response involved a judo throw of the enterprising security fellow into a rack and a power conduit pillar, which knocked out the power to that half of the facility (again). The rest of the story involved police, handcuffs, and a couple of lawyers. I was "relieved" of my position on that project and placed on another one. When I left the company (I found another position, I wasn't fired) about 6 months later, the database still had not caught up with processing the data.

By Logitestus

Story 2 - Here a Jong, There a Jong

In my early days as an accidental DBA, I was in charge of making updates to our timesheet system. This involved a number of direct table updates, as the admin interface had not been developed.

On this particular morning, I was working on four hours of sleep and excited to get away for a week of holidays. I was asked to update an employee's name in the system, which was not an unusual request. I used my template script, entered the new first name, selected the code, and ran it. Except I forgot the WHERE clause... and renamed the entire company Jong... Luckily, our system had a trigger that captured all changes to an audit table. Fifteen minutes later, we were able to restore everyone's name to the original. All is well, crisis averted. But then I did it again. (D'oh!)

After two very public failures in under half an hour, the entire process was changed. And I learned that you should always use a transaction, even for a simple, well-known process.
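For what it's worth, here is a minimal sketch of the kind of guard that would have helped, assuming a hypothetical dbo.Employee table with EmployeeID and FirstName columns (the names are illustrative, not from the original system): wrap the single-row rename in a transaction and verify the row count before committing.

-- Guarded single-row update: a forgotten WHERE clause gets rolled back
-- instead of renaming the whole company.
BEGIN TRANSACTION;

UPDATE dbo.Employee
SET    FirstName = 'NewName'
WHERE  EmployeeID = 1234;        -- the clause that was forgotten

IF @@ROWCOUNT = 1
    COMMIT TRANSACTION;          -- exactly one row changed, as intended
ELSE
BEGIN
    ROLLBACK TRANSACTION;        -- anything else means the script went wide
    RAISERROR('Unexpected row count - update rolled back.', 16, 1);
END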

By Jason Joesephson

Entry 3 - Early Morning Drive Disappearance

A junior report writer contacts me, the DBA, and says he can't connect to our Reporting Services instance. It happens that this instance is sitting on a Dev machine, for many poor reasons I refuse to divulge at this point. Anyway, he can't connect to it. As I receive this news, standing in my bedroom in my underwear, not quite ready for the work day to start, I want so badly to push it off as a 'junior' employee's mistake. However, I gather my courage, and some clothing, and head to my computer downstairs, while talking him down from the ledge he apparently is standing on as this potential early morning disaster looms. Once connected to our VPN, I too am unable to reach said Reporting Services instance. Hmm. Odd. A little more digging and we find out that not 1 but 2 of the drives in its RAID have gone bad. One recently, while the previous one had been bad for a few weeks, and no one was the wiser.

As the awful truth sinks in, I realize that ALL our internal corporate reports exist on this instance; all 4000 scheduled reports that are sent internally, and externally to customers, reside within this now-defunct system. Drives are gone. DBs are absent. Reporting has vanished. And since it is a DEV box, backups are as prevalent as fields full of wild unicorns. It is here that I wish the dream would end, and I would simply wake up to a Monday and chalk it up to an overactive imagination. I'm still waiting to wake up. Please, wake up. It's not really happening. WAKE UP!!

By TJay Belt

Entry 4 - Executive Crap

I reported to work early, before anyone else, and was working in the server room when I heard a strange dripping of water. There is no water anywhere near the server room, and there was no sign of leaks in the ceiling. I finally noticed the stack of papers on the CPU was wet. The culprit was a new executive "toilet" that had been installed by the tenants above us. It was guaranteed never to leak, with a bond almost large enough to back the claim.

I had to shut down the company system and call our computer maintenance to explain that our computer on the 6th floor had flooded out (the tech had responded to hurricane flooding of computers two years prior). The system was down until quitting time, and I had to explain to execs how the s___ had hit the system, flooding it and causing the corporate data to go down the ________. The jokes and punch lines did not stop for at least two months. (No loss of data or hardware, only a full day of IT business and a lot of red faces.)

By Anonymous

Entry 5 - Saturday Night Partition Party

We have an 8 TB database; all the big tables are partitioned and spread across multiple disks and filegroups. We store only 1 month of data, and the tables have 32 partitions (1 day per partition). We were asked to extend the data retention from 1 month to 45 days, so we added the partitions to all the tables. Everything was fine until the infrastructure team found one of the filegroups running out of space.

Upon further investigation, the new partitions had been added to the PRIMARY filegroup instead of a different filegroup, and that filegroup filled up as data poured into the new partitions. The DBA prematurely merged all the new partitions and tried to split them again, placing them in the proper filegroup. Splitting a partition with data to a new filegroup on another disk blocked everything, as that partition held over a billion records. The DBA killed the split operation after 30 minutes; it took 1 hour to roll back, and the database went into recovery mode. Customers couldn't see the latest data for the past 3 hours and couldn't connect to the database anymore while it was in recovery.
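For context: when a partition function is split, the new partition lands on whatever filegroup is currently marked NEXT USED on the partition scheme; if that still points at PRIMARY, that is where new partitions go. A minimal sketch of the intended sequence, using hypothetical names (pfDaily, psDaily, FG_Day32) and an illustrative boundary date:

-- Point the scheme at the correct filegroup BEFORE splitting,
-- otherwise the new partition lands on the last NEXT USED filegroup.
ALTER PARTITION SCHEME psDaily
    NEXT USED [FG_Day32];

-- Splitting an empty boundary is a quick metadata operation;
-- splitting a populated partition moves every row and can block for hours.
ALTER PARTITION FUNCTION pfDaily()
    SPLIT RANGE ('2014-03-15');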

Finally the database came back up after staying in recovery for 1 hour. It was back up, but the partitions were messed up: 7 days of data were going into a single partition as opposed to 1 day per partition. Fortunately we had some space left on the disk, so we expanded the filegroup, switched that partition out to a swap table on the same drive, and split the now-empty partition to the new filegroup. The swap table held the data for a day, and a new partition was created on the same filegroup and the data swapped back into the table for that day. It all took 5 hours on a Saturday night. Ultimately, the DBA learned never to split huge populated partitions to different disks.
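A minimal sketch of that switch-out/split/switch-in recovery, again with hypothetical names (dbo.BigTable, dbo.BigTable_Swap, pfDaily, psDaily, FG_Day33) and illustrative partition numbers; SWITCH is metadata-only, provided the swap table has an identical structure, sits on the same filegroup as the partition, and, for the switch back in, carries a CHECK constraint matching the target boundary range:

-- 1. Switch the oversized partition out to an empty swap table on the same filegroup.
ALTER TABLE dbo.BigTable
    SWITCH PARTITION 32 TO dbo.BigTable_Swap;

-- 2. With partition 32 now empty, point new boundaries at the right filegroup
--    and split cheaply (no data movement).
ALTER PARTITION SCHEME psDaily NEXT USED [FG_Day33];
ALTER PARTITION FUNCTION pfDaily() SPLIT RANGE ('2014-03-16');

-- 3. Switch the data back into the partition that matches its boundary range.
ALTER TABLE dbo.BigTable_Swap
    SWITCH TO dbo.BigTable PARTITION 32;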

By Anonymous
