SQL Backup Strategy Solution

  • What are people using for their entire SQL backup strategy solution?

    We use Ola Hallengren’s maintenance solution to backup all of our SQL databases.  We write those backups to a physical server that has multiple drives on it.  Our network folks then use Rubrik to create a VSS snapshot each night of each drive of the physical backup server.  When we run Ola’s backup scripts we delete files older than 23 hours on the physical server.  If we need a file after it’s been removed from the backup server, we will get it from Rubrik from one of the VSS snapshots it took.  We then have a retention schedule of those files in Rubrik that were captured with the VSS snapshot.

    We recently discovered that Rubrik is failing to back up some files when it’s doing its VSS snapshot.  The scenario is as follows: Rubrik runs one of its VSS snapshots at 4 a.m. on one of the drives on the physical backup server.  Between the time that Rubrik starts its snapshot and ends it snapshot, tlog files are deleted.  The tlog files that were present when Rubrik started its snapshot may not be present anymore when the snapshot finishes because when we run our backups (full and tlog) we delete any file older than 23 hours.

    This situation has underscored the need for us to re-evaluate our SQL backup strategy solution.  Feedback from you on what your company uses and what your SQL backup strategy solution looks like will be appreciated.

  • Our solution is a bit different than yours, but our RTO and RPO drives it differently than yours does.

    We used to use maintenance plans, but have switched over to direct TSQL coded in-house for our backup solution.  We investigated Ola's scripts, but they were too lengthy and complex for our small team to support and we only have 1 3rd party backup solution in place so it didn't make sense to grab a backup script like Ola's that handled multiple backup solutions.  By using our own scripts, we can tweak and tune things as needed.  We went with a multi-stored procedure approach where one is for backup, one is for index maintenance, one is for statistics maintenance, etc and a final "master" script that can call these other stored procedures based on parameters.  Having the smaller scripts made debugging almost trivial as each stored procedure can be run stand-alone and it is easy to test and fix any problems that come up.

    Next, all of our backups go out to a disk on the SAN.  The disk is hosted on one of our SQL Servers, so 1 out of 3 servers has direct access to the disk while the other 2 need to go through the 3rd to handle the backups.  We have that disk set up in DxEnterprise to fail over to the other 2 servers in the event the primary one goes offline.  This way our backups of any SQL instance should never fail due to disk issues except if the backup is in progress when the failover occurs.

    We keep the backups on disk for 2 weeks.  This is so in the event we need to do a restore, we nave data we can grab going back 2 weeks from today without needing to get the data off of a long-term backup solution.  The backups are copied to tape nightly that are moved off-site weekly.

     

    The reason we keep the backup files on disk for 2 weeks is restoring from tape is slow, plus we then have multiple tapes we can grab the backup from in the event a tape goes bad or the tape backup reported success but actually failed.  Results in duplicate data on tapes, but redundancy is a very big "nice to have" for me.

    If I was you, I'd be trying to keep the data on disk for longer even if it means wasted disk as I'd rather waste a few hundred GB of disk than have a backup get "lost".

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • This was removed by the editor as SPAM

  • Quick Question, do you have clear RPO and RTO?

    😎

    Without those, designing a DR solution will be a futile exercise!

  • Mary wrote:

    Between the time that Rubrik starts its snapshot and ends it snapshot, tlog files are deleted.

    I don't use Rubrik but any decent tool will turn off the "A" (archive) attribute on a file once it's done it's thing to the file.

    The key is, tell whatever is doing your deletes to not delete anything where the archive attribute on the file is still set.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply