SQLServerCentral Article

Distributed Backup Checking

,

Having just read Santveer Singh’s article Which databases are being backed up (another look), I thought the following might add something to it. Our situation is slightly more complex. We develop software applications that make use of replicated databases. Our users are distributed across Europe, and in some cases consist of small offices with no real system administration personnel. Consequently we typically manage their servers for them remotely. Part of this is the backups, and with the number of servers involved this is a bit of a nightmare. I came up with the following solution which hopefully might be of use to others as well. In addition to checking back up status we also use it to monitor disk space for critical disks, and replication status, though for conciseness I’ll ignore that part for now.

We use maintenance plans to manage our backups. Each system database is backed up once daily, as is each application database. Transaction log backups for application databases are taken every 15 minutes. Backups are done to a separate live disk on the machine, and then backed up to tape as part of the sites daily tape backup. The maintenance plan is set to keep a single full backup and 24 hours of transaction log backups (in the event that anything earlier is needed we can ask the site to pull the relevant tape, but this has not been needed to date). The system I’ve put in place tells us if the right number of backups exist, if any backups have failed, if any verifies have failed, or if any of the deletes have failed – the failed delete is particularly important as the databases are large, and failures to delete could rapidly fill a disk.

My system is simple enough. It uses msdb.dbo.sysdbmaintplan_history as the source of its information. This table is queried hourly to extract the statistics we require, and the results placed in a table in the database SysAdmin (a database we use on all our servers for holding system management related data and procedures). This table is replicated hourly to one of our local servers. We could have avoided replication since one of the maintenance plan options is to send history information to a remote server. We decided against this for the following reasons:

  • We had replication in place any way
  • It was something that could be missed when setting up the maintenance plan
  • The current approach generates less network traffic
  • We had no idea how reliable the alternative would be

Once the information is in place on our local server we’re free to extract any statistics we require from it. We currently do this in two ways. There is a daily task scheduled on the server that extracts information on problems and mails a report to designated users. Also I have written an MMC snap-in that displays the current status of all sites and highlights any problems.

The various tables, views and procedures to implement this are shown below.

BackupList is used to specify which databases should be checked on each server. The server column should match the value returned by @@SERVERNAME for the server in question.

CREATE TABLE [dbo].[BackupList] (
   [guid] uniqueidentifier ROWGUIDCOL NOT NULL ,
   [ID] [int] IDENTITY (1, 1) NOT NULL ,
   [Server] [varchar] (50) NOT NULL ,
   [DatabaseName] [varchar] (50) NOT NULL 
) ON [PRIMARY] 

BackupLog is used to store the results of our queries of the maintenance plan histories.

CREATE TABLE [dbo].[BackupLog] (
   [guid] uniqueidentifier ROWGUIDCOL NOT NULL ,
   [TestDate] [datetime] NOT NULL , 
-- Default GetDate() – records time of last test
   [DatabaseName] [varchar] (50) NOT NULL ,
   [Server] [varchar] (50) NOT NULL ,
   [IsSystemDB] [bit] NOT NULL ,
-- Set if the database is a system database
   [BackupCount] [int] NOT NULL ,
-- Number of full backups in last 24 hours
   [BackupFails] [int] NOT NULL ,
-- number of failed backups in last 24 hours
   [TransCount] [int] NOT NULL ,
-- number of transaction logs in last 24 hours
   [TransFails] [int] NOT NULL ,
-- number of failed transaction logs in last 24 hours
   [VerifyFails] [int] NOT NULL ,
   -- number of failed verifies in last 24 hours
   [DeleteFails] [int] NOT NULL 
-- number of failed deletes in last 24 hours<
) ON [PRIMARY]

Code

The following procedure performs the actual checks for a specific database

CREATE PROC dbo.adm_CheckBackups
  @dbname varchar(50), -- database to check
  @IsSystem bit=0      -- indicates a system database
AS
DECLARE @tlog_fails int, @tlog_count int, @verify_fails int
DECLARE @backup_fails int, @backup_count int, @delete_fails int
-- Return count of failed transaction log backups in the last 24 hours
SELECT @tlog_fails=COUNT(*) 
 FROM msdb.dbo.sysdbmaintplan_history
 WHERE database_name LIKE @dbname
 AND activity LIKE 'Backup transaction log' 
 AND start_time > DATEADD(hour, -24, getdate())
 AND succeeded=0
-- Return count of transaction log backups in the last 24 hours, 
-- whether they succeeded or not
SELECT @tlog_count=COUNT(*) 
 FROM msdb.dbo.sysdbmaintplan_history
 WHERE database_name LIKE @dbname
 AND activity LIKE 'Backup transaction log' 
 AND start_time > DATEADD(hour, -24, getdate())
-- Return count of failed verifies in the last 24 hours
SELECT @verify_fails=COUNT(*) 
 FROM msdb.dbo.sysdbmaintplan_history
 WHERE database_name LIKE @dbname
 AND activity LIKE 'Verify Backup' 
 AND start_time > DATEADD(hour, -24, getdate())
 AND succeeded=0
-- Return count of failed full backups in the last 24 hours
SELECT @backup_fails=COUNT(*) 
 FROM msdb.dbo.sysdbmaintplan_history
 WHERE database_name LIKE @dbname
 AND activity LIKE 'Backup Database' 
 AND start_time > DATEADD(hour, -24, getdate())
 AND succeeded=0
-- Return count of full backups in the last 24 hours, whether they succeeded or failed
SELECT @backup_count=COUNT(*) 
 FROM msdb.dbo.sysdbmaintplan_history
 WHERE database_name LIKE @dbname
 AND activity LIKE 'Backup Database' 
 AND start_time > DATEADD(hour, -24, getdate())
-- Return count of failed deletes in the last 24 hours
SELECT @delete_fails=COUNT(*) 
 FROM msdb.dbo.sysdbmaintplan_history
 WHERE database_name LIKE @dbname
 AND activity LIKE 'Delete old%' 
 AND start_time > DATEADD(hour, -24, getdate())
 AND succeeded=0
BEGIN TRANSACTION
-- Clear the previous results for this database on this server
DELETE FROM BackupLog
 WHERE Server=@@SERVERNAME
 AND DatabaseName=@dbname
-- Create a new record with the current information
INSERT BackupLog(DatabaseName, Server, IsSystemDB, TransCount, 
  TransFails, VerifyFails, BackupCount, BackupFails, DeleteFails)
 SELECT @dbname, @@SERVERNAME, @IsSystem, @tlog_count, @tlog_fails, @verify_fails, 
        @backup_count, @backup_fails, @delete_fails 
-- If there are any problems put things back as they were
IF @@ERROR<>0
    ROLLBACK TRANSACTION 
ELSE 
    COMMIT TRANSACTION

The next procedure calls the check procedure for each applicable database – it is this procedure that is called hourly as an sql server agent job:

CREATE PROCEDURE adm_CheckAllBackups
AS
-- First do the user databases specified in the control table
DECLARE cr CURSOR READ_ONLY FOR
 SELECT DatabaseName
 FROM BackupList 
 WHERE Server LIKE @@SERVERNAME
DECLARE @dbname varchar(50)
OPEN cr
FETCH NEXT FROM cr INTO @dbname
WHILE (@@fetch_status <> -1) BEGIN
IF (@@fetch_status <> -2) 
 BEGIN
   EXEC adm_CheckBackups @dbname, 0
 END
FETCH NEXT FROM cr INTO @dbname
END
CLOSE cr
DEALLOCATE cr
-- finally do the system databases – these are done automatically, 
-- and do not need to be specified in the BackupList table
EXEC adm_CheckBackups 'master',1
EXEC adm_CheckBackups 'model',1
EXEC adm_CheckBackups 'msdb',1
-- The distribution database will not exist on all servers, so a
-- check for its existence is performed first
IF EXISTS (SELECT * FROM master..SysDatabases
              WHERE name LIKE 'distribution')
   EXEC adm_CheckBackups 'distribution',1

Almost done now – the following view is used to support the summary which is emailed to staff. Notice that the first error checked for is if an update hasn’t been received. This would indicate either a replication failure or a failure to run the statistics gathering job on the remote server.

CREATE VIEW vBackupReport
AS
select 1 AS Severity, 'Update not received' AS Error
  , Server, DatabaseName, TestDate 
 from backuplog where testdate<DATEADD(day, -1, getdate())
union
select 5 AS Severity, 'System Backup not taken' AS Error
  , Server, DatabaseName, TestDate 
 from backuplog
 where backupcount=0 and issystemdb=1
union
select 5 AS Severity, 'System Backup failed' AS Error
   , Server, DatabaseName, TestDate 
 from backuplog
 where backupfails>0 and issystemdb=1
union
select 9 AS Severity, 'Application Backup failed' AS Error
 , Server, DatabaseName, TestDate 
 from backuplog
 where backupfails>0 and issystemdb=0
union
select 9 AS Severity, 'Application Backup not taken' AS Error
 , Server, DatabaseName, TestDate 
 from backuplog 
 where backupcount=0 and issystemdb=0
union
select 6 AS Severity, 'Application Backup (transaction) failed' AS Error
   , Server, DatabaseName, TestDate 
 from backuplog 
 where transfails>0 and issystemdb=0
union
select 6 AS Severity, 'Application Backup (transaction) not taken' AS Error
   , Server, DatabaseName, TestDate 
 from backuplog 
 where transcount<90 and issystemdb=0
union
select 4 AS Severity, 'Backup Verify fails' AS Error
   , Server, DatabaseName, TestDate 
 from backuplog 
 where verifyfails>0
union
select 2 AS Severity, 'Backup Delete fails' AS Error
   , Server, DatabaseName, TestDate 
 from backuplog 
 where deletefails>0

The output from this view is formatted a bit and send by email to a distribution list on our site.

Conclusion

Hopefully the above might prove useful to someone. It’s all fairly simple stuff, but it has made checking of remote system states far easier for us, and has given us a much greater degree of confidence that every site is working as it should be.

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating