Blog Post

How many Backups is Too Many?

,

Does this title seem strange?  Of course we know this is the true answer:

https://cdn.meme.am/cache/instances/folder633/58097633.jpg

But here is a story of a client that may have too many backups after all.

--

Our monitoring system found that the backups on a client server had suddenly started running very long.  When I signed on to look I saw that the nightly backups were still running even the next morning!  These were the three currently running processes related to backups:

--
LINE

dd hh:mm:ss.mss

session_id

sql_text

login_name

1

00 15:24:36.674

134

backup database [BobData] to virtual_device = N'CA_BAB_MSSQL_a786700_10002_4c59b51f_BobData' with differential, blocksize = 65536, buffercount = 1, maxtransfersize = 2097152

DOMAIN1\sa_arcbackup

2

00 11:27:32.043

165

BACKUP DATABASE [BobData] TO  DISK = N'B:\SQL_BAC\BobData\BobData_backup_2017_02_23_020011_9582786.bak' WITH  RETAINDAYS = 10, NOFORMAT, NOINIT,  NAME = N'BobData_backup_2017_02_23_020011_9582786', SKIP, REWIND, NOUNLOAD, COMPRESSION,  STATS = 10

DOMAIN1\sqlcluster

3

00 02:58:20.540

84

backup database [FredData] to virtual_device = N'CA_BAB_MSSQL_a786080_10001_4f06c827_FredData' with blocksize = 65536, buffercount = 1, maxtransfersize = 2097152

DOMAIN1\sa_arcbackup



LINE

wait_info

CPU

tempdb_allocations

tempdb_current

blocking_session_id

1

(55475564ms)ASYNC_IO_COMPLETION

236

0

0

NULL

2

(41252214ms)LCK_M_U

0

0

0

134

3

(10699785ms)ASYNC_IO_COMPLETION

357

0

0

NULL



LINE

percent_complete

host_name

database_name

program_name

start_time

login_time

1

79.8087

Server5

BobData

Arcserve Backup

2/22/2017 22:02

2/22/2017 22:02

2

NULL

Server1

BobData

Microsoft SQL Server Management Studio

2/23/2017 2:00

2/23/2017 2:00

3

18.4672

Server5

FredData

Arcserve Backup

2/23/2017 10:30

2/23/2017 1:05

The middle process (Line 2 – session ID/SPID 165) is a regular BACKUP DATABASE statement that had been running since 2am local server time.  As you can see from the wait info, the entire life of this SPID has been spent waiting for an update lock (LCK_M_U).  Also notice that the Percent Complete is still NULL because it is being blocked by the first SPID (134) and hasn’t done anything.

The first and third SPID’s (134 and 84) are Arcserve backups of the BobData and FredData databases respectively.  Looking in the backup history in msdb for the BobData database I see this:

database_name

physical_device_name

type

backup_start_date

backup_finish_date

BobData

B:\SQL_BAC\BobData\BobData_backup_2017_02_22_020015_7258925.bak

D

02/22/2017 04:29:47

02/22/2017 04:40:41

BobData

CA_BAB_MSSQL_8ea33d0_10002_4732d2d8_BobData

I

02/21/2017 22:01:34

02/22/2017 04:29:44

BobData

B:\SQL_BAC\BobData\BobData_backup_2017_02_21_020010_3116691.bak

D

02/21/2017 11:54:41

02/21/2017 12:05:22

BobData

CA_BAB_MSSQL_8c6ff40_10002_420ce264_BobData

I

02/20/2017 22:02:02

02/21/2017 11:54:39

BobData

B:\SQL_BAC\BobData\BobData_backup_2017_02_18_020007_3945711.bak

D

02/19/2017 23:39:43

02/19/2017 23:50:38

BobData

CA_BAB_MSSQL_8d4b360_10001_329872eb_BobData

D

02/17/2017 22:00:34

02/19/2017 23:39:41

The Arcserve backup was kicking off around 10pm local server time, and taking a widely variable amount of time to complete – the first backup (start date 02/17 22:00) was a FULL backup (type D) and took just over *2 days* to complete (02/17 22:00 to 02/19 23:39)!

https://s-media-cache-ak0.pinimg.com/736x/25/b2/82/25b2825a9b57026e1932ba48909699e9.jpg

The catch is the next row up – the regular FULL backup to the B: drive looks like it kicks off right when the Arcserve backup completes, at 23:39 – this is most likely because it was being blocked by the Arcserve backup (just like the current activity above) and when the Arcserve backup completes, the FULL backup to the B: drive begins.

This history shows this same pattern of the B: drive backup kicking off right when the Arcserve backup completes through the following days.

The SQL “regular” Agent job kicks off at 2am, so it is hanging as a blocked SPID each day until the Arcserve job completes and then running in a very reasonable amount of time – BobData is a 340GB database that backs up into a 36GB file (nice compression!) and as seen in the start/finish times above only takes 10-30 minutes to run, which by my experience is very good for a database this size.

From the DBA end we can’t speak to why the Arcserve backup takes such a long time – completely guessing I would assume it isn’t compressing and therefore is sending the full 340GB across the network pipe, but this still seems very slow.

The SPID’s for the Arcserve backups are waiting on “ASYNC_IO_COMPLETION” which means it is waiting on storage read/writes, in this case certainly *writes*.

--

There are two pieces to this for follow-up – one very important and another less so.

First and foremost – as a DBA I strongly recommend not having “competing” backups like this.  The backups should either run to disk (very *very* strongly preferred) or to a tool like Arcserve.  Running to two different tools leads to exactly the kind of problem seen here.

https://imgflip.com/i/1k9t97

Note that this is not related to Arcserve in particular - any third-party backup tool that uses a database agent to directly run backups will display this behavior.

A conversation showed that the Arcserve backups had been recently added and it was definitely a match to the problems.

The workaround to this is to back the databases up to one tool and then to *copy* those backup files to the other tool.  The best recommendation I have in these situation is always to run “regular” SQL Server Agent job backups to a file location (either a local or network drive) and then to have your third-party tool use file backup for the actual BAK/TRN files from that location *rather* than running the “database agent” on the third party tool to backup the database directly.

In this model the third party tool never directly touches SQL Server – in this client's environment you would run SQL Server Agent jobs similar to the current maintenance plan (although something better than maintenance plans, such as Ola Hallengren's scripts or maybe MinionBackup) to backup to actual disk (in thiscase the B: drive) and then use Arcserve to backup the files that have been written on the B: drive.

Another advantage to this backup model is it makes it easier for the DBAs managing your SQL Servers to access the backups for disaster recovery/data copy/etc.  If you backup to the third-party tool alone it usually requires some third party (usually a server admin or SAN admin) to perform a database restore – with regular SQL backups directly to disk, the DBA can perform that work by themselves which can greatly decrease recovery time in case of a disaster.

If you absolutely *must* backup directly to a tool like Arcserve, it needs to be made more reliable than what this client was seeing – wide swings of duration from 6 hours to 13 hours to 49 hours can’t be acceptable – and the bottleneck must be determined and resolved – there is no way it can be acceptable for a backup to take over 2 days on a 340GB database.  Fixing this requires work across your network, storage, and admin teams.

--

Second – as you can see from the limited number of rows in my history grid above, the current SQL Maintenance Plan is purging backup history, etc. that is older than 4 days

I understand a desire to have backup file cleanup of 4 days, as seen in the next item in the maintenance plan, as the actual backups can take up a lot of space:

However I strongly recommend editing the “Clean Up History” task the retain records for a much longer interval, such as 35-40 days.  There are often processes that happen monthly that require troubleshooting (for example, an end-of-month data load that slows the system down), and only keeping 30-31 days can often cause an issue where something happens on a Friday or Saturday but isn’t investigated until Monday (or Tuesday) – by the time the investigation begins, the 30th or 31st day of history has rolled off and the troubleshooting is hampered by lack of data...

http://i0.kym-cdn.com/entries/icons/original/000/019/630/ihnmotp.jpg
Maintenance/Backup history takes up very little space and can easily be retained for even several months (such as 90-100 days) without disk space issues.
--

At the end of the day watch out for duplicative backups - some times they are necessary, but in most cases you can meet your requirements to offshore your backups, send them to a DR site, etc. using a simple file copy of BAK and TRN files.

Also watch your history retentions closely - don't swing the other way - I can't count the number of servers I have found with huge msdb's because no cleanup is happening - but set it to a realistic number that won't hamper your troubleshooting.  Consider the frequency of your key processes; in almost all environments there are at least monthly considerations, but often there are also bi-monthly or quarterly concerns that would require you to keep up to 100 days of history for effective troubleshooting.

--

Hope this helps!





Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating