SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 

How many Backups is Too Many?

Does this title seem strange?  Of course we know this is the true answer:

https://cdn.meme.am/cache/instances/folder633/58097633.jpg

But here is a story of a client that may have too many backups after all.

--

Our monitoring system found that the backups on a client server had suddenly started running very long.  When I signed on to look I saw that the nightly backups were still running even the next morning!  These were the three currently running processes related to backups:

--

LINE
dd hh:mm:ss.mss
session_id
sql_text
login_name
1
00 15:24:36.674
134
backup database [BobData] to virtual_device = N'CA_BAB_MSSQL_a786700_10002_4c59b51f_BobData' with differential, blocksize = 65536, buffercount = 1, maxtransfersize = 2097152
DOMAIN1\sa_arcbackup
2
00 11:27:32.043
165
BACKUP DATABASE [BobData] TO  DISK = N'B:\SQL_BAC\BobData\BobData_backup_2017_02_23_020011_9582786.bak' WITH  RETAINDAYS = 10, NOFORMAT, NOINIT,  NAME = N'BobData_backup_2017_02_23_020011_9582786', SKIP, REWIND, NOUNLOAD, COMPRESSION,  STATS = 10
DOMAIN1\sqlcluster
3
00 02:58:20.540
84
backup database [FredData] to virtual_device = N'CA_BAB_MSSQL_a786080_10001_4f06c827_FredData' with blocksize = 65536, buffercount = 1, maxtransfersize = 2097152
DOMAIN1\sa_arcbackup



LINE
wait_info
CPU
tempdb_allocations
tempdb_current
blocking_session_id
1
(55475564ms)ASYNC_IO_COMPLETION
236
0
0
NULL
2
(41252214ms)LCK_M_U
0
0
0
134
3
(10699785ms)ASYNC_IO_COMPLETION
357
0
0
NULL



LINE
percent_complete
host_name
database_name
program_name
start_time
login_time
1
79.8087
Server5
BobData
Arcserve Backup
2/22/2017 22:02
2/22/2017 22:02
2
NULL
Server1
BobData
Microsoft SQL Server Management Studio
2/23/2017 2:00
2/23/2017 2:00
3
18.4672
Server5
FredData
Arcserve Backup
2/23/2017 10:30
2/23/2017 1:05


The middle process (Line 2 – session ID/SPID 165) is a regular BACKUP DATABASE statement that had been running since 2am local server time.  As you can see from the wait info, the entire life of this SPID has been spent waiting for an update lock (LCK_M_U).  Also notice that the Percent Complete is still NULL because it is being blocked by the first SPID (134) and hasn’t done anything.

The first and third SPID’s (134 and 84) are Arcserve backups of the BobData and FredData databases respectively.  Looking in the backup history in msdb for the BobData database I see this:

database_name
physical_device_name
type
backup_start_date
backup_finish_date
BobData
B:\SQL_BAC\BobData\BobData_backup_2017_02_22_020015_7258925.bak
D
02/22/2017 04:29:47
02/22/2017 04:40:41
BobData
CA_BAB_MSSQL_8ea33d0_10002_4732d2d8_BobData
I
02/21/2017 22:01:34
02/22/2017 04:29:44
BobData
B:\SQL_BAC\BobData\BobData_backup_2017_02_21_020010_3116691.bak
D
02/21/2017 11:54:41
02/21/2017 12:05:22
BobData
CA_BAB_MSSQL_8c6ff40_10002_420ce264_BobData
I
02/20/2017 22:02:02
02/21/2017 11:54:39
BobData
B:\SQL_BAC\BobData\BobData_backup_2017_02_18_020007_3945711.bak
D
02/19/2017 23:39:43
02/19/2017 23:50:38
BobData
CA_BAB_MSSQL_8d4b360_10001_329872eb_BobData
D
02/17/2017 22:00:34
02/19/2017 23:39:41

The Arcserve backup was kicking off around 10pm local server time, and taking a widely variable amount of time to complete – the first backup (start date 02/17 22:00) was a FULL backup (type D) and took just over *2 days* to complete (02/17 22:00 to 02/19 23:39)!

https://s-media-cache-ak0.pinimg.com/736x/25/b2/82/25b2825a9b57026e1932ba48909699e9.jpg
The catch is the next row up – the regular FULL backup to the B: drive looks like it kicks off right when the Arcserve backup completes, at 23:39 – this is most likely because it was being blocked by the Arcserve backup (just like the current activity above) and when the Arcserve backup completes, the FULL backup to the B: drive begins.

This history shows this same pattern of the B: drive backup kicking off right when the Arcserve backup completes through the following days.

The SQL “regular” Agent job kicks off at 2am, so it is hanging as a blocked SPID each day until the Arcserve job completes and then running in a very reasonable amount of time – BobData is a 340GB database that backs up into a 36GB file (nice compression!) and as seen in the start/finish times above only takes 10-30 minutes to run, which by my experience is very good for a database this size.

From the DBA end we can’t speak to why the Arcserve backup takes such a long time – completely guessing I would assume it isn’t compressing and therefore is sending the full 340GB across the network pipe, but this still seems very slow.

The SPID’s for the Arcserve backups are waiting on “ASYNC_IO_COMPLETION” which means it is waiting on storage read/writes, in this case certainly *writes*.

--

There are two pieces to this for follow-up – one very important and another less so.

First and foremost – as a DBA I strongly recommend not having “competing” backups like this.  The backups should either run to disk (very *very* strongly preferred) or to a tool like Arcserve.  Running to two different tools leads to exactly the kind of problem seen here.

https://imgflip.com/i/1k9t97
Note that this is not related to Arcserve in particular - any third-party backup tool that uses a database agent to directly run backups will display this behavior.

A conversation showed that the Arcserve backups had been recently added and it was definitely a match to the problems.

The workaround to this is to back the databases up to one tool and then to *copy* those backup files to the other tool.  The best recommendation I have in these situation is always to run “regular” SQL Server Agent job backups to a file location (either a local or network drive) and then to have your third-party tool use file backup for the actual BAK/TRN files from that location *rather* than running the “database agent” on the third party tool to backup the database directly.

In this model the third party tool never directly touches SQL Server – in this client's environment you would run SQL Server Agent jobs similar to the current maintenance plan (although something better than maintenance plans, such as Ola Hallengren's scripts or maybe MinionBackup) to backup to actual disk (in thiscase the B: drive) and then use Arcserve to backup the files that have been written on the B: drive.

Another advantage to this backup model is it makes it easier for the DBAs managing your SQL Servers to access the backups for disaster recovery/data copy/etc.  If you backup to the third-party tool alone it usually requires some third party (usually a server admin or SAN admin) to perform a database restore – with regular SQL backups directly to disk, the DBA can perform that work by themselves which can greatly decrease recovery time in case of a disaster.

If you absolutely *must* backup directly to a tool like Arcserve, it needs to be made more reliable than what this client was seeing – wide swings of duration from 6 hours to 13 hours to 49 hours can’t be acceptable – and the bottleneck must be determined and resolved – there is no way it can be acceptable for a backup to take over 2 days on a 340GB database.  Fixing this requires work across your network, storage, and admin teams.

--

Second – as you can see from the limited number of rows in my history grid above, the current SQL Maintenance Plan is purging backup history, etc. that is older than 4 days


I understand a desire to have backup file cleanup of 4 days, as seen in the next item in the maintenance plan, as the actual backups can take up a lot of space:


However I strongly recommend editing the “Clean Up History” task the retain records for a much longer interval, such as 35-40 days.  There are often processes that happen monthly that require troubleshooting (for example, an end-of-month data load that slows the system down), and only keeping 30-31 days can often cause an issue where something happens on a Friday or Saturday but isn’t investigated until Monday (or Tuesday) – by the time the investigation begins, the 30th or 31st day of history has rolled off and the troubleshooting is hampered by lack of data...

http://i0.kym-cdn.com/entries/icons/original/000/019/630/ihnmotp.jpg

Maintenance/Backup history takes up very little space and can easily be retained for even several months (such as 90-100 days) without disk space issues.

--

At the end of the day watch out for duplicative backups - some times they are necessary, but in most cases you can meet your requirements to offshore your backups, send them to a DR site, etc. using a simple file copy of BAK and TRN files.

Also watch your history retentions closely - don't swing the other way - I can't count the number of servers I have found with huge msdb's because no cleanup is happening - but set it to a realistic number that won't hamper your troubleshooting.  Consider the frequency of your key processes; in almost all environments there are at least monthly considerations, but often there are also bi-monthly or quarterly concerns that would require you to keep up to 100 days of history for effective troubleshooting.

--

Hope this helps!









Nebraska SQL from @DBA_ANDY

I’m a forty-something Microsoft SQL Server DBA of 15+ years, a devoted husband, and a father of three young boys. I have been a DBA at a public university, at a major bank, at a healthcare system, and I now work as a remote DBA with customers across the United States. I write and speak primarily about the tips and tricks that I discover along my SQL Server journey.

Comments

Leave a comment on the original post [nebraskasql.blogspot.com, opens in a new window]

Loading comments...