Snapshotting our database causing application timeouts!

  • I’ve had an open ticket with Microsoft on a timeout issue we are having for over a month now with no resolution. I’m throwing the problem out to see if anybody else is experiencing this and any suggested work-arounds.

    SS2005, SP3 database running on WinSrvr2003 cluster with NETAPP SAN.

    Problem is that we are experiencing dotnet application timeouts when doing a NetAPP SnapManager full database snapshot backup. The NetApp SnapManager usings the standard VSS API to get a snapshot image. The timeouts are intermittent in nature and affect several of our dotnet applications including web services and our internal document processing (batch mode). More than 95% of the application timeouts occur during the “prepare” phase of the snapshot backup just prior to freezing the database. The timeouts do not occur during the freeze phase, but just prior to that as the VSS method is trying to get the database in a consistent state.

    We do about 20 full database snapshot backups a day for a rolling ETL process to update our data warehouse. The primary advantage of the snapshot backup is that we had a “consistent” database and do not have to worry about transit data modifications during the data extract of the ETL phase.

    The timeout stack traces indicate both connection timeouts and sql timeouts. The sql timeouts are from very simplistic stored procedure calls that should never time out.

    Both my boss and I think that there is some sort of blocking process (that I cannot capture in a trace) from the VSS method that is causing the problem.

    I’ve listed a typical combined portion of the NETApp SnapManager log and our application systemlog below to illustrate the sequence of events:

    [14:07:46.176] Starting VDI backup thread for database #1: [EXCL1VS1\NGSCORE : NgsCore]...

    [14:07:46.179] Sending SQL command to instance EXCL1VS1\NGSCORE for database NgsCore : device NTAPDEV__NGSCORE_NgsCore...

    [14:07:46.179] Receiving by server [EXCL1VS1\NGSCORE]...

    [14:07:46.180] BACKUP DATABASE [NgsCore] TO VIRTUAL_DEVICE='NTAPDEV__NGSCORE_NgsCore' WITH SNAPSHOT

    [14:07:46.180] Waiting for SQLServer to respond...

    [14:07:46.189] Performing data transfer of [EXCL1VS1\NGSCORE : NgsCore]...

    From application systemlog:

    2010-07-21 14:08:13.4 Stack trace: System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection) at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection) at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj) at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error) at System.Data.SqlClient.TdsParserStateObject.ReadSni(DbAsyncResult asyncResult, TdsParserStateObject stateObj) at System.Data.SqlClient.TdsParserStateObject.ReadNetworkPacket() at System.Data.SqlClient.TdsParserStateObject.ReadBuffer() at System.Data.SqlClient.TdsParserStateObject.ReadByte() at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj) at System.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest) at System.Data.SqlClient.SqlInternalConnectionTds.PropagateTransactionCookie(Byte[] cookie) at System.Data.SqlClient.SqlInternalConnection.EnlistNull() at System.Data.SqlClient.SqlInternalConnection.Enlist(Transaction tx) at System.Data.SqlClient.SqlInternalConnectionTds.Activate(Transaction transaction) at System.Data.ProviderBase.DbConnectionInternal.ActivateConnection(Transaction transaction) at System.Data.ProviderBase.DbConnectionPool.GetConnection(DbConnection owningObject) at System.Data.ProviderBase.DbConnectionFactory.GetConnection(DbConnection owningConnection) at System.Data.ProviderBase.DbConnectionClosed.OpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory) at System.Data.SqlClient.SqlConnection.Open() at Newgistics.Common.Persistence.DalCommon.SetOpenConnection(SqlCommand Command) at Newgistics.Entity.DalEntity.ReadReferencesByURN(String ClassName, StringDictionary URNs, DataSet References) at Newgistics.Entity.Entity.GetByURN[BusinessKeyType](Type ClassType, StringDictionary URNs, ReferenceList EntityReferences) at Newgistics.Entity.Org.GetByURN(StringDictionary URNs, ReferenceList EntityReferences) at Newgistics.Entity.Newgistics..ctor() --- End of inner exception stack trace --- at System.RuntimeTypeHandle.CreateInstance(RuntimeType type, Boolean publicOnly, Boolean noCheck, Boolean& canBeCached, RuntimeMethodHandle& ctor, Boolean& bNeedSecurityCheck) at System.RuntimeType.CreateInstanceSlow(Boolean publicOnly, Boolean fillCache) at System.RuntimeType.CreateInstanceImpl(Boolean publicOnly, Boolean skipVisibilityChecks, Boolean fillCache) at System.Activator.CreateInstance(Type type, Boolean nonPublic) at Newgistics.Application.Services.ServiceCoordinator.GetProvider(Service service, Org org)

    [14:08:18.540] SQL Server is prepared to freeze the database now [EXCL1VS1\NGSCORE : NgsCore]...

    [14:08:18.692] Ready to prepare to create snapshot...

    [14:08:18.693]

    [14:08:18.693] SNAPSHOT DATABASE

    In almost every case where we have an application timeout (dotnet default is 30 seconds) it occurs between the start of the “Waiting for SQL Server to respond…” step and the “SQL Server is prepared to freeze the database now” step. Note that the time between the 2 events is just over 32 seconds.

    So I am wondering if anybody else has seen the problem and second if anybody has any ideas about a work-around. My company is not prepared to reduce the number of backups needed during the day.

    Mike Byrd

  • How long do your snapshots take? Apps need to be written in such a way as to be able to handle an occasional timeout no matter what is causing it. You apps should retry the transaction and fail out after X number of retries.

    The probability of survival is inversely proportional to the angle of arrival.

  • The creation of a snapshot takes virtually no time whatsoever. No data from the database is copied until it is changed. Only the snapshot header is written when a snapshot is created. It should take less than a second and there's pretty much no reason it should time something else out.

  • not an expert here by any means, but i think "NetAPP SnapManager ", and what it is doing behind the scenes is more than jsut a SQL Server Enterpise Snapshot; as jvandenburg stated, the SQL Snapshot takes no time at all...i think the issue is the application "NetAPP SnapManager ", i suspect it takes some time for it to do other stuff before /during the snapshot of the database(is that what it is doing?)

    the "Snap" is misleading, i think....

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Yeah, I probably should have read a little further into it. Our SAN provides snapshots too, which are largely analogous with SQL Server database snapshots. Only the snapshot header is written to disk, then the changed pages are written to disk as they are changed. This is the generally accepted definition of "snapshot" when it refers to data. If your application is doing anything but that, it makes perfect sense why it would time out.

    How well does your SAN application understand SQL Server? Does it communicate with it in any way? If not, it's just grabbing data files which SQL Server could very well be writing to at that exact moment. It might not care, and just do a dirty read, or it might care and wait.

    How often does it fail? Is this a consistant problem or periodic? If periodic, does it fail more at certain times?

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply