The Worst Day

  • Comments posted to this topic are about the item The Worst Day

  • I work at a large law firm with 20 offices across the country.  After a grueling six-month conversion from OpenText to iManage (document management software), users started having problems saving documents last Monday at 9 AM, and the problem quickly spread across the company: 70 users encountered it and the support tickets kept coming in.  We got on the phone with support and determined that duplicate workspaces were being generated.  I was given SQL scripts to "patch" the problem, but as soon as I fixed one user, another would hit the same issue - a classic game of whack-a-mole.  Finally, after 10 grueling hours, it was determined that Active Directory sync was not turned on.  Support wasn't able to help, but enabling AD sync seemed to correct the issue.  I spent the next day monitoring for the problem, then created an SSIS package to check for it and send email notifications if it occurs again (a rough sketch of that kind of check is below).  We are still closing the tickets created by this event.  In the email notification I included the SQL to patch/identify the issue in case it comes up again six months from now.
    Not the worst day of my life, but it was pretty miserable.
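
    For anyone curious, here is a rough sketch of that kind of duplicate check done directly in T-SQL with Database Mail (the SSIS package does essentially the same thing).  The table, column, profile, database, and address names are all made up, not the real iManage schema, so treat it as a pattern rather than a drop-in script:

        -- Hypothetical example: the real iManage tables and columns will differ.
        -- Flag any user who owns more than one workspace and email the details.
        IF EXISTS (SELECT 1
                   FROM dbo.Workspaces              -- assumed table name
                   GROUP BY OwnerUserID             -- assumed column name
                   HAVING COUNT(*) > 1)
        BEGIN
            EXEC msdb.dbo.sp_send_dbmail
                @profile_name = 'DBA Alerts',                 -- assumed Database Mail profile
                @recipients   = 'dba-team@example.com',
                @subject      = 'Duplicate workspaces detected',
                @body         = 'One or more users have duplicate workspaces; details attached.',
                @query        = N'SELECT OwnerUserID, COUNT(*) AS WorkspaceCount
                                  FROM dbo.Workspaces
                                  GROUP BY OwnerUserID
                                  HAVING COUNT(*) > 1;',
                @execute_query_database      = 'iManageDB',   -- assumed database name
                @attach_query_result_as_file = 1;
        END;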

  • Working for a large international company when we got hit by ransomware. Kissed goodbye to 60,000 laptops, 10,000 servers and then spent the next 3 months rebuilding all the SQL instances in our data centres (approx 500 instances of SQL).
    How I'd love to go into more detail, but I can't. Needless to say, I wouldn't wish that on anyone and wouldn't want to go through it ever again.

  • In one of my earlier roles, when I was learning SQL, we were working on a billing project and had a contractor working for us.  He was an interesting character, to say the least.

    At the time I had very little SQL experience and we were reliant on the contractor to advise us.  The project involved running two processes a month, triggered on a certain day, and the way this was being tested was by changing the dates on the server... feel free to cringe now!

    Anyway, my then manager and I had been working on some data cleansing which had taken a fair amount of time to complete over the previous week or so.

    The first problem came when the contractor ran what was supposed to be a delete script against the database but was actually a truncate script, and all the tables we had worked on were very quickly emptied!

    The second problem was that we had no backups, because the server date had been moved forward and consequently no overnight backups had run since that change (a simple check for exactly that situation is sketched below).

    The contractor was asked to leave at that point.  My manager and I then just worked through to get the data back and then told the operations director.  This led to my manager being suspended and subsequently dismissed, even though we had recovered the data.  And by the way, this was in a development area; the directors still wanted to get rid of my manager regardless.

    So we worked hard to recover the data, which didn't matter in the end, as my manager was still dismissed...

    Plenty of learning from that exercise!
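
    One takeaway from stories like this is to monitor whether the backups actually ran rather than assuming they did.  A minimal sketch of such a check, assuming SQL Server and its msdb backup history (the 24-hour threshold is arbitrary):

        -- List databases whose most recent full backup is older than 24 hours,
        -- or that have never been backed up at all.
        SELECT d.name                      AS database_name,
               MAX(b.backup_finish_date)   AS last_full_backup
        FROM sys.databases AS d
        LEFT JOIN msdb.dbo.backupset AS b
               ON b.database_name = d.name
              AND b.type = 'D'             -- 'D' = full database backup
        WHERE d.name <> 'tempdb'
        GROUP BY d.name
        HAVING MAX(b.backup_finish_date) IS NULL
            OR MAX(b.backup_finish_date) < DATEADD(HOUR, -24, GETDATE());

    Scheduled as an agent job with an alert, a query like this would have flagged the missing overnight backups as soon as the server date was moved forward.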

  • Yes the ride has been overall good for me. I consider myself fortunate that I am both willing and able to do database work and that it pays relatively well. I joke that 100 or 200 years ago I would have been much more useless.

    Possibly the worst day in my career was the first time I got laid off. I felt like such a loser. However, once the emotions were gone and a new job with higher pay was obtained, I realized that I likely had nothing to do with the decision. It was likely a financial decision by someone who doesn't even know me, made to trim x% so the quarter would look better. The rest of my team got blown away too.

    Even with that I consider my time there good - a learning experience for sure. I am not angry anymore.

  • OK, it wasn't SQL Server and I was a developer and not a DBA but (without going into details) a lot of historical production data was lost because:
    1) I was afraid to admit what I had done and spent a lot of time trying to reconstruct what data I could.
    2) No one ever let us know about the disaster recovery procedure.
    We could have saved the data because everything was redundant (see Tandem Computers) if I had stopped and admitted what I had done immediately instead of trying to fix it.
    If anyone out there remembers the "August Fiasco" I fully apologize for the mess I made.....

  • I was pretty new to a company working for a technically challenged DP manager, and was responsible for code I had not written.  One day things crashed and production was down.  A coworker and I  were working at my desk trying to find the problem, when our manager sauntered in, hands in pockets, and said:  'Haven't you guys got that fixed yet?  It should be pretty easy.' 

    My response, without thinking, was 'Fred, if it is so damned easy, why don't you fix it yourself?'  That got me a very long meeting with HR and the owner of the company.

    But guess what.  From the perspective of another thirty years, I'd do the same thing today.

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

    Trying to restore data for comparison (before and after), only to find out that the backup process had missed the archive logs needed for a point-in-time restore.
    Luckily it could be reconstructed from other sources.

  • I quite enjoy being a DBA overall.  Every now and then I get hit with odd problems that make me think, and hopefully I have time to think and resolve them when it isn't an "everything is broken" moment.

    My worst day was actually a 2 day process.  Wednesday, in the middle of my vacation, I get a phone call that nobody can connect to the SQL Servers.  I have no laptop, no internet access, nothing that I can use to be helpful.  My suggestion - reboot server X.  We have 3 servers, reboot 1 of them and see if it comes back up.  Normally I would have suggested checking the logs, but this was taking the whole company offline so get it back up ASAP and do root cause analysis after it is working again.  They rebooted it, waited 5 min for it to come back up and things were working again.  So reboot the other 2 boxes and call it a day.  I'll look at it more when I get back to work.  I do mention that they should check the logs just to see why everything died, but was told I could look at it when I got back.
    Thursday it was rainy, so I cut the trip short and went home for the remainder of my vacation so I could still relax and have some fun.
    Friday, I get a call that everything is down again and rebooting didn't help this time.  I go into work, poke through the logs... all of the SQL disks were corrupt.  I spent part of the weekend on the phone with support and ended up needing to format the disks, so the remainder of the weekend was spent getting everything restored from backup (essentially the full-plus-log restore sketched below).  But come Monday morning, everything was back up and running 100%, and thankfully we only had about 15 minutes of data loss, which was within our acceptable loss window.

    Not as bad as some of the others posted, but not exactly a good way to have a vacation either.  No critical data was lost, so I was happy with how it turned out, just not happy to have my vacation cut short.  It was a good learning experience and a way (not a good way) to test the DR plan.

    Another fun one I had was a database that had broken the 1 TB mark at a company that doesn't need THAT much data, so I got tasked with working with the data owners to reduce the size.  After roughly a month of meetings and analysis, it was determined that the data in the tables was unneeded as long as the table structure remained, since a 3rd party tool needed the tables to exist.  So the weekend comes around and, a bunch of truncate scripts later, the database is empty; I shrink it and get IT to reclaim some of the disk space (they were the ones pressuring me to shrink things to get some disk back).  Come Monday, I get into work to 5 emails about things not working properly.  Long story short: it turns out about 5 small tables were required by some in-house software, and when those got blown away we could no longer get anything to pass our internal tests.  So I restored the database onto the dev system, moved the tables over, and we were back up and running before lunch.  Flash forward about 2 months (so now the backups are on tape and off-site, thus not easily restored): somebody loads up the 3rd party tool and cannot log in.  I go to the database, dig through some tables and views, and realize that in my cleanup I blew away the username table.  Thankfully they told me they don't need the 3rd party tool.
    The bad part of that story: they told me they would change their code to stop using the database, since nobody looks at that data and nothing should be reading it.  That was 3 years ago and I still see the database autogrow every now and then...  Gotta love data graveyards, eh?

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
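
    For what it's worth, the "15 minutes of data loss" recovery in the vacation story above boils down to a standard full-plus-log restore stopped at a point in time.  A minimal sketch, with a made-up database name, backup paths, and timestamp:

        -- Point-in-time restore sketch; database name, paths, and times are invented.
        RESTORE DATABASE SalesDB
            FROM DISK = N'\\backupshare\SalesDB_full.bak'
            WITH NORECOVERY, REPLACE;

        RESTORE LOG SalesDB
            FROM DISK = N'\\backupshare\SalesDB_log_1.trn'
            WITH NORECOVERY;

        -- Final log backup: stop just before the disks went bad, then bring it online.
        RESTORE LOG SalesDB
            FROM DISK = N'\\backupshare\SalesDB_log_2.trn'
            WITH STOPAT = N'2019-06-07 08:45:00', RECOVERY;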

  • I'm not sure any of my worst days at work involved databases outside of getting caught in a layoff or two.

    One of my first jobs was working on the family farm, so we had all sorts of stressful incidents and accidents with equipment and weather. While working at a restaurant I got burned by someone handling hot grease. Lots of close calls with death while in the military, firefighting, and working security. Some extremely stressful days working in semiconductor fabs. Some very bad issues with management at a telecom company over ethics and criminal activity.

  • On another occasion many years ago, in a shop that did online data entry from CRT terminals 24 hours a day into flat file structures, our process was periodically shut down to take intermittent backups for rollback purposes.  At one point, our main file got corrupted.  Thinking that I would keep a copy for research, I ran the backup process ...and... you guessed it...  wiped the most recent backup.  Older backups were on magnetic tapes stored off premises across town.  This caused a complete halt to warehouse operations loading trucks with food orders for hospitals, nursing homes, schools, restaurants, and other such institutions.  And of course, after the file restoration, we had to determine which orders had been lost, and these had to be re-keyed by our Teamsters Union CRT operators on overtime while the warehouse crews waited, also on overtime.  Not a good day for me.

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • Hmm.... you know that scene in Office Space when Peter is getting hypnotized and they ask him what the worst day of his life was, and he says every day is worse than the one before?

    Fortunately nothing that bad, just a stupid recurring event at the last company I worked for that got worse every time simply because it kept happening.  Since they refused to do anything to fix their extremely shady sales process and credit card processing, they were consistently getting in trouble for it, to the point of getting cease-and-desist letters from Amex and Discover and having multiple banks shut down chunks of their merchant accounts.  Every time that happened, it was always some big fiasco to find all the customers impacted and do something about it.

  • This wasn't the worst day, but it exemplifies the environment during one of the longest-running worst times in my work life.  I was working at a university, and we had just gotten a new VP to head our department.  As new employees are wont to do, he was making lots of changes to prove how necessary he was.  One of the changes this VP asked for was a report for each of the colleges (17) within the university.  It was a complex report and I had other duties, so it took me a while to automate it.  While I was working on that, we were compiling the report by hand; it ran about 40-50 pages every month.  One month, they found one mistake on one page and blew up, despite us having produced the report successfully for a couple of months.  He just had very little tolerance for human error, and we always felt like we were walking on eggshells.  Shortly after that I found another job, and within six months, four more of the remaining five IT people in that department had also left.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Interesting topic.
    I always believed that DBAs are addicted to peril, adrenaline, and the caffeine shot. If not, I can't find another reason why we cut vacations short and work on weekends, Christmas Day, or any other long holiday period to perform tasks that we usually cannot do during office hours.
    I really enjoy my work as a DBA. It is when our wisdom is challenged, and that is good. I end the day feeling like House, MD, when I have correctly diagnosed a "disease" and applied the right medicine, eager to see the next hard case to solve.
    I know it is childish thinking, but it has helped me a lot.

  • My worst day involved an important application whose database (not a SQL Server one) was replicated to a shadow copy for DR. The replication only ran in one direction at a time but was defined as bi-directional to make failover easier.

    Although I had ceased to be involved in day to day support of that system, I arrived at work one morning to be told that data was mysteriously disappearing from the production database. I logged on, and saw it vanishing record by record but with no clear indication of how. None of the other DBAs was aware of any out of the ordinary events, but then someone mentioned that a copy of production had been restored onto another server the previous day and was being 'cleaned up' ready for some testing.

    Root cause found. As data was deleted from the restored copy, the deletions were being replicated to the DR copy (because nobody had thought to disable replication from the copy after the restore), and DR faithfully did its thing and replicated the deletions to its twin, i.e. production. Of course we stopped the 'clean up' as soon as we realised what had happened, and I spent the rest of the working day developing a method to reconstruct the deleted data from the audit logs (essentially the kind of re-insert sketched below). This was a 24x7 system with a lot of user-input data, so although we had backups, restoring to an earlier point in time was not a desirable option. I spent the evening and night rebuilding the missing data, ran out of time to complete everything before the start of the next online day, went home for a few hours' sleep and returned that evening to finish the job.

    Yes it was very satisfying to have diagnosed and fixed the problem without losing any data. No, I wouldn't like another day or two like that again.
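
    The system above wasn't SQL Server, but in generic SQL terms "reconstruct the deleted data from the audit logs" amounts to re-inserting the rows the audit trail says were deleted and that are no longer in the live table.  A purely hypothetical sketch (table and column names invented):

        -- Re-insert rows recorded as deleted after the bad 'clean up' began,
        -- skipping any that already exist in the live table.  All names are invented.
        INSERT INTO dbo.Orders (OrderID, CustomerID, OrderDate, Amount)
        SELECT a.OrderID, a.CustomerID, a.OrderDate, a.Amount
        FROM dbo.Orders_Audit AS a                 -- assumed audit table of deleted rows
        WHERE a.audit_action = 'DELETE'
          AND a.audit_time  >= '2010-03-02 09:00'  -- assumed start of the clean-up
          AND NOT EXISTS (SELECT 1
                          FROM dbo.Orders AS o
                          WHERE o.OrderID = a.OrderID);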
