Sudden failure of a long-standing scheduled db backup with OS error 5 / Error 3201 / Error 3013

  • A backup job for one of my prod dbs failed last night. The job has not been changed since August of last year. It runs at 10pm every day and backs up the db to a (SAN) mapped drive. Because of the db size, I back it up to multiple backup devices. I run a virtually identical job (which calls the same usp that runs the backup) for a different db within a few minutes of this one, and it succeeded. So it should not be a matter of permissions; if it were, both jobs should have failed.

    Here is the feedback from the job step:

    Msg 3201, Sev 16, State 1, Line 1 : Cannot open backup device 'U:\folder\dbname_20120409_3_5.bak'. Operating system error 5(Access is denied.). [SQLSTATE 42000]

    Msg 3013, Sev 16, State 1, Line 1 : BACKUP DATABASE is terminating abnormally. [SQLSTATE 42000]

    Msg 0, Sev 16, State 1, Line 95 : ---------------------- [SQLSTATE 01000]

    Msg 0, Sev 16, State 1, Line 96 : THE BACKUP COMMAND [SQLSTATE 01000]

    Msg 0, Sev 16, State 1, Line 97 : ---------------------- [SQLSTATE 01000]

    Msg 0, Sev 16, State 1, Line 98 : BACKUP DATABASE [dbname] TO DISK = N'U:\folder\dbname_20120409_1_5.bak', DISK = N'U:\folder\dbname_20120409_2_5.bak', DISK = N'U:\folder\dbname_20120409_3_5.bak', DISK = N'U:\folder\dbname_20120409_4_5.bak', DISK = N'U:\folder\dbname_20120409_5_5.bak' WITH NOFORMAT, INIT, SKIP, REWIND, NOUNLOAD, STATS = 5, NAME = N'dbname_backup' [SQLSTATE 01000]

    ...any ideas?


    Cursors are useful if you don't know SQL

  • That is strange.

    Is the job that calls the first one owned by the same user that owns the others? Then again, that wouldn't really matter; backups are executed in the context of the SQL Server service itself. What kind of user does the SQL Server service log in as?

    CEWII

  • >The job that calls the first one is owned by the same user that owns the others?

    yes

    >What kind of user does SQL login as?

    a domain user... not sure if you are looking for something more specific.

    2 other backup anomalies from yesterday: 2 tlog backups failed, one at 8:30am and one at 6:30pm. I back up tlogs every 30 minutes via a maintenance plan, which has also been running without incident for a long time. In both cases, it appears that the tlog backup for a single db failed, and the two failures were for 2 different dbs, neither of them the db from the failed full backup. On disk, I find the failed dbname.trn file, showing as 0 bytes in length.

    Here is one of the errors for the tlog fails. The other is pretty much identical:

    Executing the query "BACKUP LOG [differentdb] TO DISK = N'U:\\folder\\differentdb_backup_201204090830.trn' WITH NOFORMAT, NOINIT, NAME = N'differentdb_backup_20120409083002', SKIP, REWIND, NOUNLOAD, STATS = 10" failed with the following error: "Cannot open backup device 'U:\\folder\\differentdb_backup_201204091830.trn'. Operating system error 5(Access is denied.).

    BACKUP LOG is terminating abnormally.". Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly.

    The commonality for all these is the target drive letter; I suspect this could indicate a communication problem with the SAN, but I don't have any real proof of that.
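
One way to check the drive-letter hunch is to tally the failures from the saved job output. Below is a minimal Python sketch, assuming the job-step and maintenance-plan output has been saved as plain text. The function name `summarize_failures` and the regex are illustrative and not part of any SQL Server tooling; the two sample messages are the ones quoted in this thread.

```python
import re
from collections import Counter

# Matches the "Cannot open backup device" text that accompanies Msg 3201.
DEVICE_ERROR = re.compile(
    r"Cannot open backup device '(?P<path>[^']+)'\.\s*"
    r"Operating system error (?P<code>\d+)"
)

def summarize_failures(log_text):
    """Count failed backup devices per (drive letter, OS error code)."""
    counts = Counter()
    for match in DEVICE_ERROR.finditer(log_text):
        # Assumes a drive-letter path such as U:\..., not a UNC path.
        drive = match.group("path")[:2].upper()
        counts[(drive, int(match.group("code")))] += 1
    return counts

# Error text copied from the job output and maintenance-plan log above.
sample = r"""
Msg 3201, Sev 16, State 1, Line 1 : Cannot open backup device 'U:\folder\dbname_20120409_3_5.bak'. Operating system error 5(Access is denied.). [SQLSTATE 42000]
"Cannot open backup device 'U:\\folder\\differentdb_backup_201204091830.trn'. Operating system error 5(Access is denied.).
"""

summary = summarize_failures(sample)
for (drive, code), n in sorted(summary.items()):
    print(f"{drive} OS error {code}: {n} failure(s)")
```

If every failure lands on the same drive with the same OS error while backups to other targets succeed, that points at the target rather than at the jobs or the databases. It does not say which layer (permissions, SAN, hardware) is at fault.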



  • Just throwing out ideas here.

    Is the U: drive a mapped drive or a local drive?

    Have there been group policies enforced recently?

    Are you using mount points in any places?

    And lastly, as you mentioned, network maintenance/SAN maintenance?


  • Thinking along the same lines as calvo... I think the issue is outside of SQL Server:

    Could the password have expired for the local/domain user that is used to run the SQL service? Then that user no longer has access to network objects until the password is reset, right? That happens to me at work every 30 days.

    The U: drive really exists and is accessible, right?

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • All excellent posts and points..

    Did you see anything in the windows eventlog during those periods?

    CEWII

  • Mass reply:

    >Is the U: drive a mapped drive or a local drive?

    Local drive-- from a SAN

    >Have there been group policies enforced recently?

    No info yet; I sent a "did anybody change anything" email but no indication that's the case.

    >Are you using mount points in any places?

    erm.. not that i know of

    >And lastly, as you mentioned, network maintenance/SAN maintenance?

    >could the password have expired for the local/domain user that is used to run the SQL Service?

    >The U: drive really exists and is accessible, right?

    Remember-- as I mentioned in the op, "...I run a virtually identical job (which calls the same usp that runs the backup) for a different db within a few minutes-- and it succeeded..."; so it'd be hard to believe it truly is a permission issue.

    >Did you see anything in the windows eventlog during those periods?

    I haven't had time to check that yet; I've been doing a mad restore from the previous day's backup, with an extra day of tlogs applied, to get our reporting environment up.

    An additional observation... a while back we had a few days in a row of 3013 errors when RESTORING the large prod db on our reporting sql box. For lack of better ideas, we rebooted the reporting box... and all was well. I'm wondering if we may have recently patched ourselves up to a memory-leaky level. For reference the prod sql box is SQL 2005 enterprise, 9.0.5057.



  • Curiouser and curiouser...

    Remember: I have a tlog backup job (created as a maintenance plan) that runs every 30 minutes. It does tlog backups of 9 dbs.

    1:30 - It ran fine; no errors. backups done.

    2:00 - It got an error (as shown in earlier posts and below) on ONE of the log backups... the .trn file for it has 0 length. The other 8 all backed up successfully.

    2:30 - It ran fine; no errors. backups done.

    For each of the 3 cases of errors on the tlog backup-- it was a different one of the 9 dbs.

    Executing the query "BACKUP LOG [thirddb] TO DISK = N'U:\\folder\\thirddb_backup_201204101400.trn' WITH NOFORMAT, NOINIT, NAME = N'thirddb_backup_20120410140002', SKIP, REWIND, NOUNLOAD, STATS = 10" failed with the following error: "Cannot open backup device 'U:\\folder\\thirddb_backup_201204101400.trn'. Operating system error 5(Access is denied.).

    BACKUP LOG is terminating abnormally.". Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly.
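
The 0-byte .trn files left behind by the failed log backups can also be found by scanning the backup folder, which gives a second check alongside the job history. This is a rough Python sketch under stated assumptions: `find_empty_backups` is a made-up helper name, and the demo runs against a throwaway temp directory with one file name from this thread plus a hypothetical `otherdb` file standing in for a good backup.

```python
import tempfile
from pathlib import Path

def find_empty_backups(folder, suffixes=(".trn", ".bak")):
    """Return backup files in `folder` that are 0 bytes long, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in suffixes and p.stat().st_size == 0
    )

# Demo against a temporary directory rather than the real backup drive.
with tempfile.TemporaryDirectory() as tmp:
    # A failed log backup as described in the thread: file exists, 0 bytes.
    Path(tmp, "thirddb_backup_201204101400.trn").touch()
    # Hypothetical non-empty file standing in for a successful backup.
    Path(tmp, "otherdb_backup_201204101400.trn").write_bytes(b"x")
    empty_names = [p.name for p in find_empty_backups(tmp)]

print(empty_names)
```

Pointing the function at the real backup folder instead of the temp directory lists every empty file. The timestamps in the names then show which half-hour runs and which dbs were affected.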



  • At this point it's looking like a hardware problem. The server crashed and failed over (it's in a cluster; I hadn't mentioned that yet) after 5pm last night, and after the node restarted-- we got this:

    Subject: HpEventLog: Partition being reset due to watchdog timeout expiring (1E5DH)

    The system has detected the following event:

    SNMP Trap: 7773

    Date time: 04/10/2012 06:39:03 PM

    Computer: TheProblemServerName

    Source: HpEventLog

    Type: Error

    Category: (0)

    Description:

    The watchdog mechanism triggers the MP to reset a partition if its OS becomes unresponsive. An unresponsive OS is detected when the OS fails to refresh the watchdog timer before it expires. PA systems refresh the watchdog timer by emitting an event with data field set to activity level/timeout, and the timeout field specifies the desired timeout. IPF systems refresh the watchdog timer using the IPMI clear watchdog command. The MP emits this event when timer expiration triggers resetting the partition. OS-specific and platform-specific procedures are used to enable/disable the watchdog timer from resetting the partition. See platform and OS documentation for details. Find out why the partition's OS had hung. The cause could be bad HW that crashed the partition, or in rare cases, a combination of events that caused the OS to be unable to refresh the watchdog timer. Look for other events preceding the timeout for clues to the root cause of the partition being unresponsive



  • What may be the final post here: Yesterday we started getting errors writing backup files on the 2nd node of the SQL cluster, the node we failed over to on day 1.

    We were running out of logical explanations, so (as a minimally impactful swap-out) we created a new drive on our SAN, moved all data from the possibly failing drive to the new drive, and took the original offline. The new drive was then mapped to the same drive letter as the original. This was done around 2pm yesterday; as of now, about 10am, we've not encountered additional errors writing db or tlog backups.


