Impact of Striped Backups on Data Deduplication

  • We are implementing Commvault backup software as part of an effort to eliminate tape backups and reduce the storage size of our backups. Commvault deduplicates our SQL Server backups, which greatly reduces the size of the data stored off-site and in Azure. However, last week we modified a few of our SQL Server backup jobs to perform striped backups, and since then deduplication no longer appears to work. Backups that used to take 8 hours in Commvault now take about 2 days. Before contacting the vendor, I wanted to see if anyone has some insight into what striped SQL backups are doing that would hurt data deduplication.

    Thanks,  Dave

  • Keeping this relatively "simple" and avoiding technical terms: deduplication in applications like Commvault relies largely on the assumption that two successive versions of a given file differ minimally, or at least that the changes (deltas) affect only limited portions of the file rather than the whole file. The application then uses various algorithms to eliminate anything it has backed up before and back up only the deltas. For most flat files this is reasonable, but database backups are less predictable in general because the original database can be modified almost anywhere, potentially changing many data pages and therefore the resulting backup. As a simple scenario, suppose the DBA rebuilds the clustered index on a large table with a poorly designed cluster key (e.g. Customer_Name). This could change so many pages, and as a result so much of the backup, that the deltas relative to previous backups are almost equivalent to a full backup. Depending on how Commvault handles this, the result can be longer and bigger backups.
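
    To make the delta idea concrete, here is a minimal sketch of fixed-size block deduplication. It is an illustration only, not Commvault's actual algorithm: the 4096-byte block size, the SHA-256 fingerprints, and the helper names `block_hashes` and `count_new_blocks` are all assumptions made for the example, and the "backup file" is just pseudo-random bytes.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # hypothetical dedup block size, chosen only for illustration


def block_hashes(data, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and return one SHA-256 digest per block."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def count_new_blocks(previous, current):
    """Count blocks in `current` whose digest does not appear in `previous`."""
    seen = set(block_hashes(previous))
    return sum(1 for digest in block_hashes(current) if digest not in seen)


rng = random.Random(42)
size = 100 * BLOCK_SIZE
# Stand-in for last night's backup file: 100 blocks of pseudo-random bytes.
previous = rng.getrandbits(8 * size).to_bytes(size, "little")

# Tonight's file: identical except for 10 bytes overwritten in place.
tonight = bytearray(previous)
tonight[5000:5010] = bytes(b ^ 0xFF for b in previous[5000:5010])
current = bytes(tonight)

new_blocks = count_new_blocks(previous, current)
total_blocks = len(block_hashes(current))
print(f"{new_blocks} of {total_blocks} blocks need to be stored again")
```

    With a small in-place change like this, only the one block containing the changed bytes is new (1 of 100 here), which is the favorable case deduplication relies on. An index rebuild that rewrites many pages would touch many blocks instead.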

    With striped backups, we can think of each backup file as containing data starting from a different point in the database. A change in the database can therefore leave almost every backup file looking significantly different from the original set, even if only one page near the "start" of the database changed, resulting in apparently massive deltas and a correspondingly long backup. I would definitely expect the Commvault backup of the first striped backup to take longer than the previous Commvault backup, simply because Commvault must do one full backup to establish the baseline. After that, the time depends on how the data changes affect the backup files.
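
    The effect described above can be sketched with a toy model. The assumptions are loud ones: this does not model SQL Server's real stripe layout, it simply assumes that the point where one stripe file ends and the next begins moves by 100 bytes between two nights, and that the deduplicator uses fixed 4096-byte blocks. The helper names `block_hashes` and `split_at` are hypothetical.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # hypothetical dedup block size


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return the set of SHA-256 digests of the fixed-size blocks in `data`."""
    return {
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    }


def split_at(data, cut_points):
    """Cut one byte stream into consecutive stripe files at the given offsets."""
    bounds = [0] + list(cut_points) + [len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]


rng = random.Random(7)
size = 400 * BLOCK_SIZE
# Stand-in for the database contents; identical on both nights.
database = rng.getrandbits(8 * size).to_bytes(size, "little")

# Night 1: four stripe files, cut on block boundaries.
night1 = split_at(database, [k * 100 * BLOCK_SIZE for k in (1, 2, 3)])
# Night 2: same data, but every cut point lands 100 bytes later (assumed).
night2 = split_at(database, [k * 100 * BLOCK_SIZE + 100 for k in (1, 2, 3)])

# Everything the deduplicator stored on night 1, across all four files.
seen = set().union(*(block_hashes(f) for f in night1))

new_per_file = [len(block_hashes(f) - seen) for f in night2]
for number, new in enumerate(new_per_file, start=1):
    print(f"stripe file {number}: {new} new blocks")
```

    In this model the data never changed, yet only the first file deduplicates well (1 new block); files two through four each show 100 new blocks because every block in them is shifted by 100 bytes relative to what was stored before. Whether real striped backups shift data this way depends on how SQL Server distributes the stripes, which the sketch does not attempt to reproduce.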

    Note that you can get a similar result if someone enabled compression on the SQL native backups. The recommendation is to use either SQL native compression or Commvault deduplication, but never both on the same files. As a first check, I would see whether someone turned on compression by default, in the Maintenance Plan, or in a script.
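
    A rough way to see why compression and block-level deduplication can work against each other: compress two nearly identical inputs and compare them block by block. This sketch uses a single zlib stream and made-up CSV-like rows as stand-ins; SQL Server's backup compression format is not modeled, and the 4096-byte block size and the helper `new_block_fraction` are assumptions for illustration.

```python
import hashlib
import random
import zlib

BLOCK_SIZE = 4096  # hypothetical dedup block size


def new_block_fraction(previous, current, block_size=BLOCK_SIZE):
    """Return (new, total): blocks of `current` not found among blocks of `previous`."""
    def digests(data):
        return [
            hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)
        ]

    seen = set(digests(previous))
    current_digests = digests(current)
    new = sum(1 for d in current_digests if d not in seen)
    return new, len(current_digests)


rng = random.Random(1)
names = ["Adams", "Baker", "Clark", "Davis", "Evans", "Frank", "Green", "Hayes"]
# Made-up, compressible rows standing in for database pages.
rows = [
    f"{i:08d},{rng.choice(names)},{rng.randint(0, 9999):04d}\n"
    for i in range(100_000)
]
previous = "".join(rows).encode("ascii")

# Overwrite 200 bytes in place, early in the file, with pseudo-random bytes.
tonight = bytearray(previous)
tonight[10_000:10_200] = rng.getrandbits(8 * 200).to_bytes(200, "little")
current = bytes(tonight)

plain_new, plain_total = new_block_fraction(previous, current)
comp_new, comp_total = new_block_fraction(
    zlib.compress(previous), zlib.compress(current)
)
print(f"uncompressed: {plain_new} of {plain_total} blocks are new")
print(f"compressed:   {comp_new} of {comp_total} blocks are new")
```

    Uncompressed, the change stays inside a single block. Compressed, the bytes after the change point no longer line up with the earlier stream, so most blocks look new to a fixed-block deduplicator. Real products may chunk data differently, so treat this as a reason to test with and without compression rather than as a measurement.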

    My preference as a DBA is to use SQL native compression, as this gives me faster, smaller backups, but I know the System Admins tend to prefer that their backup app do the compression.

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • Thanks for the reply. We suspected the striping was causing more changes at the block level, and Commvault, like similar products, performs block-level data deduplication. Going from one backup file to four means Commvault has four different files to compare for similar blocks rather than one large file, resulting in fewer common blocks. I didn't realize SQL backup compression should be turned off when using Commvault data deduplication. I'll read about that today and run some tests this week to verify the differences in changed data. Prior to striping, and with SQL backup compression enabled, Commvault was able to take 650GB+ of backups and reduce them to 150GB+, so even with compression enabled we saw a significant reduction in backup size (after the initial full backup, of course). But now I'm curious whether we will see greater gains by turning off SQL backup compression.

    We may end up scrapping striped backups and just going with differentials, along with possibly changing BUFFERCOUNT, BLOCKSIZE and MAXTRANSFERSIZE, which should help. We need the data deduplication to reduce our off-site backup footprint.

    Thanks again, Dave
