SQL Server closes database backup request to NAS share in middle of restore

  • We have an issue with our Dev and Test servers for some of our larger database backups.

    Whether we are running the restore job from SQL Server Agent or trying to copy the file via Windows Explorer down to the local database server, we keep seeing NAS share timeouts. Our network guy took a sniffer to the network and determined that the database server itself is saying "I have everything I need, goodbye" in the middle of the restore. (haven't traced the file copy yet).

    When we manually restore the database on the GUI (even when logged RDP into the server itself) we can get the database 50-80% restored before we lose the connection to the NAS.

    The NAS share is in the same datacenter as the SQL Server. Permissions are fine. Smaller databases restore just fine (manually or through the job) from the same NAS share to the same SQL Server.

    This is beyond frustrating. We've resorted to using robocopy to bring the file local and then do the restore, but we don't have a lot of space to play with and it takes hours to restore a database (sometimes days if we keep losing the connection) that should only take maybe 1 hour to restore.

    Any thoughts on what server settings (Windows or SQL) I can check so we can stop the server from telling the NAS it no longer needs data in the middle of the restore?

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • What type of connection (iscsi or fiber)?

    What is your packet size setting in SQL Server? And with that, what is your MTU?

    Do you have tcp chimney offload disabled or enabled?

    How large is your swap file (it affects large file copies)?

    When he looked at the trace, did your sysadmin take note of the packet sequence to see if maybe SQL Server received a packet out of sequence (it most likely did)?

    Have him do the trace for the file copy. If the file copy is also puking at about 80%, you can pretty much eliminate SQL Server as the culprit but more of a configuration issue or a transmission issue between the NAS and the Server.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • SQLRNNR (1/26/2015)


    When he looked at the trace, did your sysadmin take note of the packet sequence to see if maybe SQL Server received a packet out of sequence (it most likely did)?

    Network guy was looking at this, not Sysadmin. He doesn't know boo about SQL Server, so I don't know if he can tell if the server received a packet out of sequence. But I will ask.

    Have him do the trace for the file copy. If the file copy is also puking at about 80%, you can pretty much eliminate SQL Server as the culprit but more of a configuration issue or a transmission issue between the NAS and the Server.

    Running that as we speak. He left the trace up so we can check these things.

    EDIT:

    What is your packet size setting in SQL Server?

    EDIT: 4096

    What type of connection (iscsi or fiber)?

    What is your packet size setting in SQL Server? And with that, what is your MTU?

    Do you have tcp chimney offload disabled or enabled?

    How large is your swap file (it affects large file copies)?

    Ummmm. I have to look for all that other stuff. What is MTU?

    We have fiber between the Server and the SAN, but we're not sure about the server and the NAS. The network guy is assuming fiber, but my coworker (ex-server admin) is saying he doesn't think it is fiber.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Brandie Tarvin (1/26/2015)


    SQLRNNR (1/26/2015)


    Have him do the trace for the file copy. If the file copy is also puking at about 80%, you can pretty much eliminate SQL Server as the culprit but more of a configuration issue or a transmission issue between the NAS and the Server.

    Running that as we speak. He left the trace up so we can check these things.

    And it fails after only 2 minutes. The DB doesn't send the FIN command, but the server essentially handshakes off. Also, the network guy doesn't see any packets going out of sequence from his side.

    EDIT:

    What is your packet size setting in SQL Server? And with that, what is your MTU?

    EDIT: 4096

    Ummmm. I have to look for all that other stuff. What is MTU?

    Ahh, found it. Maximum Transmission Unit. BTW, SQLSoldier has a fantastic article on MTU here. So I did testing and the largest number I could send was 1472 (like his first test). #Headdesk. I'm sure that has something to do with it.

    Do you have tcp chimney offload disabled or enabled?

    We have multiple NICs with IPv4 and IPv6, both Enabled on all NICs.

    How large is your swap file (it affects large file copies)?

    18431 MB

    EDIT: Regarding the connection to the NAS, 10 GB Fiber with a 10Gbase-SR type connection.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Brandie Tarvin (1/26/2015)


    Brandie Tarvin (1/26/2015)


    SQLRNNR (1/26/2015)


    Have him do the trace for the file copy. If the file copy is also puking at about 80%, you can pretty much eliminate SQL Server as the culprit but more of a configuration issue or a transmission issue between the NAS and the Server.

    Running that as we speak. He left the trace up so we can check these things.

    And it fails after only 2 minutes. The DB doesn't send the FIN command, but the server essentially handshakes off. Also, the network guy doesn't see any packets going out of sequence from his side.

    EDIT:

    What is your packet size setting in SQL Server? And with that, what is your MTU?

    EDIT: 4096

    Ummmm. I have to look for all that other stuff. What is MTU?

    Ahh, found it. Maximum Transmission Unit. BTW, SQLSoldier has a fantastic article on MTU here. So I did testing and the largest number I could send was 1472 (like his first test). #Headdesk. I'm sure that has something to do with it.

    Do you have tcp chimney offload disabled or enabled?

    We have multiple NICs with IPv4 and IPv6, both Enabled on all NICs.

    How large is your swap file (it affects large file copies)?

    18431 MB

    EDIT: Regarding the connection to the NAS, 10 GB Fiber with a 10Gbase-SR type connection.

    K, change the Packet size in SQL Server. But that won't help any with the fact that a file copy also fails.

    Disable TCP chimney offload.

    Is this a VM or physical?

    (I know I know, lots and lots of questions.)

    You may benefit from bumping up the swap file size (and make sure it is contiguous).

    Can you copy a file that is about 1/2 the size of the backup you are trying to copy?

    I would also disable tcpv6. I have run into problems with v6 causing odd behaviors. I'd have to find some other resources supporting that (they are out there just don't remember them off the top of my head).

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • SQLRNNR (1/26/2015)


    K, change the Packet size in SQL Server. But that won't help any with the fact that a file copy also fails.

    Change it to the same size as the MTU?

    Disable TCP chimney offload.

    Is this a VM or physical?

    VM

    Can you copy a file that is about 1/2 the size of the backup you are trying to copy?

    Haven't tried exactly half, but I know I can copy smaller files. I will see if I have a half sized file to copy.

    You may benefit from bumping up the swap file size (and make sure it is contiguous).

    I would also disable tcpv6. I have run into problems with v6 causing odd behaviors. I'd have to find some other resources supporting that (they are out there just don't remember them off the top of my head).

    If you find those resources, would you let me know? We don't have control over the servers and network, just over SQL Server. And if I want to change something from a corporate standard, I have to have documentation proving why something needs to be shut off.

    And how do I verify the swap file is contiguous? Is there something on Disk Manager that shows the swap file?

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Brandie Tarvin (1/26/2015)


    SQLRNNR (1/26/2015)


    Is this a VM or physical?

    VM

    I have just been corrected. Our servers are physical.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Brandie Tarvin (1/26/2015)


    SQLRNNR (1/26/2015)


    K, change the Packet size in SQL Server. But that won't help any with the fact that a file copy also fails.

    Change it to the same size as the MTU?

    Disable TCP chimney offload.

    Is this a VM or physical?

    VM

    Can you copy a file that is about 1/2 the size of the backup you are trying to copy?

    Haven't tried exactly half, but I know I can copy smaller files. I will see if I have a half sized file to copy.

    You may benefit from bumping up the swap file size (and make sure it is contiguous).

    I would also disable tcpv6. I have run into problems with v6 causing odd behaviors. I'd have to find some other resources supporting that (they are out there just don't remember them off the top of my head).

    If you find those resources, would you let me know? We don't have control over the servers and network, just over SQL Server. And if I want to change something from a corporate standard, I have to have documentation proving why something needs to be shut off.

    And how do I verify the swap file is contiguous? Is there something on Disk Manager that shows the swap file?

    Yes same size as MTU.

    Contiguous - check a disk defragging utility to see how many chunks that swap file is in.

    IPv6

    Don't disable unless you have a good reason...

    http://sqlha.com/2010/01/12/ipv6-windows-server-2008-rtm-and-r2-and-failover-clustering/

    SQL Browser Service issues when IPv6 enabled...

    https://support.microsoft.com/kb/2526552/en-us

    Google-fu is failing me for solid stuff but still looking.

    If nothing else, start with disabling the chimney offload.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Thank you for your assistance, Jason. We are testing out a few options.

    Fortunately, these are all non-prod environments, so we can test (almost) with abandon.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • UPDATE:

    We have some issues with changing the TCP Chimney Offset (corporate owns these servers and our server guy is leery of playing with the settings). The suggestion of changing the packet size in SQL Server didn't work. And given the earlier comment of not turning off IPv6, we opted not to play with that (also another corporate issue involved here).

    A coworker, however, was following a different tack.

    Given that the problem exists not just with SQL Server, but with the OS copying of the files, he went searching and found a registry key (Session Info) that could be causing the problem. So he went in and changed the key last week. This weekend we had a bunch of restores happening in Dev and lo and behold, nothing failed!

    So he seems to be on to something.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Brandie Tarvin (2/2/2015)


    UPDATE:

    We have some issues with changing the TCP Chimney Offset (corporate owns these servers and our server guy is leery of playing with the settings). The suggestion of changing the packet size in SQL Server didn't work. And given the earlier comment of not turning off IPv6, we opted not to play with that (also another corporate issue involved here).

    A coworker, however, was following a different tack.

    Given that the problem exists not just with SQL Server, but with the OS copying of the files, he went searching and found a registry key (Session Info) that could be causing the problem. So he went in and changed the key last week. This weekend we had a bunch of restores happening in Dev and lo and behold, nothing failed!

    So he seems to be on to something.

    What did he change it from/to?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (2/2/2015)


    Brandie Tarvin (2/2/2015)


    UPDATE:

    We have some issues with changing the TCP Chimney Offset (corporate owns these servers and our server guy is leery of playing with the settings). The suggestion of changing the packet size in SQL Server didn't work. And given the earlier comment of not turning off IPv6, we opted not to play with that (also another corporate issue involved here).

    A coworker, however, was following a different tack.

    Given that the problem exists not just with SQL Server, but with the OS copying of the files, he went searching and found a registry key (Session Info) that could be causing the problem. So he went in and changed the key last week. This weekend we had a bunch of restores happening in Dev and lo and behold, nothing failed!

    So he seems to be on to something.

    What did he change it from/to?

    He created a new key at \HKLM\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\ParametersValue type: Dword

    Value name: SessTimeout

    Value: 360

    He mentioned that it could be any value, but he chose 360 because all the research he did indicated that was a good consistent value.

    This change was made in all our environments, Dev, Test and QC, but we've only tested it so far in Dev. I still had issues in Test right after he changed it, so I don't know if that had anything to do with Test needing to be restarted or not. We have NOT restarted any of our servers (that I am aware of), so I'll probably try Test again as soon as we're done with month end support.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Brandie Tarvin (2/2/2015)


    Jeff Moden (2/2/2015)


    Brandie Tarvin (2/2/2015)


    UPDATE:

    We have some issues with changing the TCP Chimney Offset (corporate owns these servers and our server guy is leery of playing with the settings). The suggestion of changing the packet size in SQL Server didn't work. And given the earlier comment of not turning off IPv6, we opted not to play with that (also another corporate issue involved here).

    A coworker, however, was following a different tack.

    Given that the problem exists not just with SQL Server, but with the OS copying of the files, he went searching and found a registry key (Session Info) that could be causing the problem. So he went in and changed the key last week. This weekend we had a bunch of restores happening in Dev and lo and behold, nothing failed!

    So he seems to be on to something.

    What did he change it from/to?

    He created a new key at \HKLM\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\ParametersValue type: Dword

    Value name: SessTimeout

    Value: 360

    He mentioned that it could be any value, but he chose 360 because all the research he did indicated that was a good consistent value.

    This change was made in all our environments, Dev, Test and QC, but we've only tested it so far in Dev. I still had issues in Test right after he changed it, so I don't know if that had anything to do with Test needing to be restarted or not. We have NOT restarted any of our servers (that I am aware of), so I'll probably try Test again as soon as we're done with month end support.

    Good to hear.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply