From the Real World: On Call Again


On call, again?

It’s Monday morning, and I find myself on-call again, for the second week in a row. This is so not fair. However, I’m a DBA and I have a duty. I feel I must explain the back story to the worst on-call week ever. We recently hired a TechOps DBA who would handle the day-to-day on-call duties. This was a brilliant move to spread out the workload of our two-DBA team. It would also give the two of us a third person to rotate with for after-hours on-call duty.

This plan worked great as the new DBA was trained and duties shifted. Over the last few months, we have made great strides in cleaning up our processes, and our systems have been a lot more stable and friendly, especially to the on-call DBA. During this time of improvement, I noticed that the Lead DBA seemed to be getting more and more frustrated, even though the systems had become less burdensome to us. A couple of weeks ago he made his desires known to management and put in his two-week notice. That leaves only me and the TechOps DBA to support the on-call rotation. My turn was last week. At the end of that week, the TechOps DBA also put in a two-week notice, leaving me alone to cover this week and next week of the rotation. No sense in making the quitters perform on-call; they wouldn’t care if something broke on their watch, and besides, it’s a security risk.

So, I start the day with normal on-call duties, most of which you, dear reader, are already familiar with. A review of past issues is always a good starting point for any on-call rotation, so I start reading the logs and reports of past occurrences. Of course, I was on-call last week, so most of this is a rehash. We have had some replication issues in the past month, but the other DBA (who is much more adept than I at replication) has been handling those. At each occurrence, he has deftly solved the issue, gotten the ticket resolved with ProdOps, and we’ve moved on. We’ve also had some odd backup failures in the past few weeks; those were resolved by the TechOps DBA when his rotation occurred. Single occurrences of backup failures are worrisome, but easily resolved.

As I peruse the weekend reports from ProdOps, I notice a few instances of groups complaining that data is not as expected in some of the non-OLTP systems. I will have to dig into these soon. I am hoping that we can get another DBA hired soon, because next week will be my third on-call rotation in a row, and I dread this house of cards collapsing on me during my rotation. Pushing negative thoughts out of my mind, I dig into the issues of the day. The day turns out to be smooth, but that night, that’s another story.

The backups failed Monday night. This is always a great way to start the on-call rotation: a call late in the night about a failed whatever. As I dig into the backup failures, I notice that the drives that hold our backups are almost maxed out, and new backups cannot be created. There are quite a few large files spread out all over the place. They all appear to be backup files, but they are massive. Why are they not going away? We have maintenance plans that remove these after a period of time, and the tape backup system pulls these files nightly as well.

After bringing in a few folks to discuss and research, I find that the large files scattered all over the backup drives appear to be backup files, only they are not. Somehow, each day, for who knows how long, files have been copied and renamed, copied and renamed, creating these bogus backup files. Why, who, and so on start screaming through my mind. Which files are real backups and which are not? We do not have a practice of restoring each night’s backups to test them, so it’s anyone’s guess. Confirmation comes from the tape operators that backups have been occurring, and we have bogus files on tape as well; it’s not the tape system causing this oddity. A little more research shows that the maintenance plans have some unexpected SQL added. Each plan is not removing old files as it should, but instead creating copies of some file, over and over. This has been going on for weeks now. No wonder we had failures on the backup systems in the past few weeks. After testing quite a few of these files, I find that few, if any, contain real backups of our systems. Wow. A bad week just got worse.
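If you ever need to tell a genuine backup from a renamed copy, SQL Server can inspect the file itself. A minimal sketch, assuming SQL Server (the file path here is a made-up placeholder):

```sql
-- Quick sanity check: is this file actually a readable backup?
-- This raises an error if the file is not a valid backup (e.g., a renamed copy).
RESTORE VERIFYONLY
FROM DISK = N'X:\Backups\SomeDatabase.bak';

-- If it passes, see what the backup actually contains:
-- database name, backup finish date, backup type, LSNs, etc.
RESTORE HEADERONLY
FROM DISK = N'X:\Backups\SomeDatabase.bak';
```

Note that VERIFYONLY only checks that the file is readable and complete; a test restore is the only real proof a backup is good, which is exactly why the "we never restore to test" habit in this story hurts so much.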

This puts into jeopardy everything that the TechOps DBA has done recently. I start combing through emails, reports, and the like to track down any other nefarious activity he may have performed. Hopefully it was only the backups. We now stand at 2+ weeks of nothing backed up, since the maintenance jobs were stripped of real backup code and replaced with this fancy file-copy crap.
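To confirm exactly how long the real backups have been missing, msdb’s backup history is the place to look. A hedged sketch using the standard msdb tables (any dates it returns reflect what the backup commands recorded, so a tampered job that never ran real backups simply leaves a gap):

```sql
-- Most recent full backup per database, according to msdb history.
-- If the maintenance plans were replaced with file copies weeks ago,
-- these dates will stop at that point (or be NULL entirely).
SELECT d.name AS database_name,
       MAX(b.backup_finish_date) AS last_full_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
       ON b.database_name = d.name
      AND b.type = 'D'          -- 'D' = full database backup
GROUP BY d.name
ORDER BY last_full_backup;
```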

The next day, while I am recreating the maintenance plans on all our production servers, cleaning the folders of bogus backup files, and sincerely hoping that I can get this fixed before anything else happens, ‘it’ happens. I get word from ProdOps that we are receiving more reports, lots more reports, about missing replicated report data. It seems we have crossed some daily threshold where no data exists for a timeframe from a few weeks back. We gather a lot of data and replicate it from the OLTP system to report and search servers. Often the data needs to be current to the day or minute of processing, but some systems do not query that often. It appears that these groups of folks have started running their reports, and the reports came up empty.

Could this be because of the backup issue? 

I doubt it. I start digging into Replication Monitor, and all appears to be online. No failed jobs. No latent replication. Yet the data is not in the subscriber. How can this be? A little more investigation shows that for the past few weeks, these subscriptions have all been online and working. I next try to find the publication that holds the data for one of the tables with missing data. As I dig in, I realize that the first few publications have very few tables in them, and the tables that are listed are test tables. We publish some test tables so we can quickly verify data flowing. These are the only tables in all the publications. How can this be? I look into the history and realize that a few weeks ago, these publications were restarted. This must have been when the other DBA ‘fixed’ the replication issues.
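Checking what a publication actually contains is a quick way to catch this kind of surprise. A minimal sketch using SQL Server’s documented replication procedures (the publication name is hypothetical):

```sql
-- Run in the publication (publisher) database.

-- Shows the publication's properties and status.
EXEC sp_helppublication @publication = N'ReportData';

-- Lists every article (published table) in the publication.
-- In this story, the output would show only the test tables.
EXEC sp_helparticle @publication = N'ReportData';
```

Comparing the article list against the tables the reports depend on would have exposed the gutted publications weeks before the reports came up empty.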

Now I have no backups for the past few weeks. I also have no replicated data on other systems. My data is only in the production system, and not anywhere else. I’m in a world of hurt. Redoing the publications and subscriptions will take a while. Re-snapshotting all the databases will take a long time. Performing backups will take a long time as well. I still have to remove the bogus backup files … I guess it’s a good thing this is simply another bogus story for April Fools’ Day. No actual data was in jeopardy, no DBAs were harmed in the creation of this fable, nor is there a fictional system without active backups. Whew.

April Fools.

