Replication Reliability?

  • I just set up snapshot replication as a test between Dev servers to see if that will fulfill my needs to duplicate the same DB across many DB servers. This is a reference DB that the majority of my SQL Servers need to have available on them. Because of the nature of the network and distance between servers, management deems it necessary to have a copy on each server. My goal is to find a solution that allows me to have the DB updated in one place and then duplicated on as many as 65 SQL Servers at various locations on my network. Since I have SQL versions from 2000 to 2008, I cannot just do backups and restores of the DB, which is the preferred method.

    Anyway, so I set up a publisher from a SQL 2008 R2 Enterprise server, with the agents location on the same, and push subscriptions to 3 different servers, two 2005 servers and one 2000 server.

    Not much to the subscription, just replicate the entire DB to each server every day.

    The initial synchronization of the subscriptions on all boxes goes like clock work. After the first on though, I have been getting random errors. The subscription to the SQL 2000 box failed so much that I deleted that subscription for now, just to see if I could get the 2005 subscriptions to work OK.

    This morning I found that both 2005 subscriptions had failed. The error pointed to a file in the publication as missing.

    2012-02-15 13:05:10.605 Bulk copying data into table 'Zip'

    2012-02-15 13:05:10.605 Bulk copying data into table 'ZipCodes'

    2012-02-15 13:05:10.605 Agent message code 20037. The process could not bulk copy into table '"dbo"."physn"'.

    2012-02-15 13:05:10.683 Category:NULL

    Source: Microsoft SQL Server Native Client 10.0

    Number:

    Message: The system cannot find the file specified.

    2012-02-15 13:05:10.698 Category:NULL

    Source:

    Number: 20253

    Message: To obtain an error file with details on the errors encountered when initializing the subscribing table, execute the bcp command that appears below. Consult the BOL for more information on the bcp utility and its supported options.

    2012-02-15 13:05:10.698 Category:NULL

    Source:

    Number: 20253

    Message: bcp "Reference2"."dbo"."physn" in "D:\SQLData\Replication\unc\RICHMSSQL11_REFERENCE_REFERENCEPUBLICATION\20120215070002\physn_17#16.bcp" -e "errorfile" -t"<x$3>" -r"<,@g>" -m10000 -SRICHISSQL01\DEVSQL -T -w

    When I went to the folder where the publication is stored I found a single file,

    physn_17#15.bcp.

    I also noticed that the latest folder in my replication storage location has what appears to be all of the articles in it, whereas I have a bunch of older folders that have just one file in them, on the order of what what I mentioned above. Examples:

    physn_17#8.bcp

    physn_17#5.bcp

    physn_17#6.bcp

    I am trying to figure out 1. did the publication not get generated properly, so my subscriptions failed when they ran, 2. if yes to 1, why, and 3. does the replication process just leave files in folders when it is done if there is an error, or is this another problem with replication?

    Thanks,

    Chris

    Learning something new on every visit to SSC. Hoping to pass it on to someone else.

  • Just to be clear you don't have a file in the location specified?

    "D:\SQLData\Replication\unc\RICHMSSQL11_REFERENCE_REFERENCEPUBLICATION\20120215070002\physn_17#16.bcp"

    If not then I would definitely be suspicious of the subscription.

    David

    @SQLTentmaker

    “He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

  • This morning the publication was generated at 7:00 and the subscriptions were run at 8:00. I got the following errors:

    2012-02-16 13:05:01.919 Bulk copying data into table 'Zip'

    2012-02-16 13:05:01.919 Bulk copying data into table 'ZipCodes'

    2012-02-16 13:05:01.919 Agent message code 20037. The process could not bulk copy into table '"dbo"."wac_price"'.

    2012-02-16 13:05:02.028 Category:NULL

    Source: Microsoft SQL Server Native Client 10.0

    Number:

    Message: The system cannot find the file specified.

    2012-02-16 13:05:02.028 Category:NULL

    Source:

    Number: 20253

    Message: To obtain an error file with details on the errors encountered when initializing the subscribing table, execute the bcp command that appears below. Consult the BOL for more information on the bcp utility and its supported options.

    2012-02-16 13:05:02.028 Category:NULL

    Source:

    Number: 20253

    Message: bcp "Reference2"."dbo"."wac_price" in "D:\SQLData\Replication\unc\RICHMSSQL11_REFERENCE_REFERENCEPUBLICATION\20120216070002\wac_price_28#9.bcp" -e "errorfile" -t"<x$3>" -r"<,@g>" -m10000 -SRICHISSQL01\DEVSQL -T -w

    and the other subscription got this error:

    2012-02-16 13:05:09.515 Bulk copying data into table 'Zip'

    2012-02-16 13:05:09.515 Bulk copying data into table 'ZipCodes'

    2012-02-16 13:05:09.515 Agent message code 20037. The process could not bulk copy into table '"dbo"."physn"'.

    2012-02-16 13:05:09.531 Category:NULL

    Source: Microsoft SQL Server Native Client 10.0

    Number:

    Message: The system cannot find the file specified.

    2012-02-16 13:05:09.547 Category:NULL

    Source:

    Number: 20253

    Message: To obtain an error file with details on the errors encountered when initializing the subscribing table, execute the bcp command that appears below. Consult the BOL for more information on the bcp utility and its supported options.

    2012-02-16 13:05:09.562 Category:NULL

    Source:

    Number: 20253

    Message: bcp "Reference2"."dbo"."physn" in "D:\SQLData\Replication\unc\RICHMSSQL11_REFERENCE_REFERENCEPUBLICATION\20120216070002\physn_17#16.bcp" -e "errorfile" -t"<x$3>" -r"<,@g>" -m10000 -SRICHMSSQLDEV01 -T -w

    And the files left in the publication folder are:

    physn_17#15.bcp

    wac_price_28#8.bcp

    I do not know if the file mentioned in the error message was ever there.

    The publication generates every hour.

    Since it generated a new publication at 8:00, when the two subscription jobs were running, I decided to run the two subscription jobs around 8:20, when the publication job was not running, and both were successful.

    I wonder if the subscriptions jobs are failing because the Publisher is generating a new publication when the subscription jobs run. I did see in the logs that the subscriptions were pulling files from the publication that was generated an hour before, not the one that was being generated at the same time the subscription jobs were running.

    thanks,

    Chris

    Learning something new on every visit to SSC. Hoping to pass it on to someone else.

  • To test my question in the last post, I have changed the schedules to generate the publication on the hour, and the subscriptions run at quarter after, every hour.

    Thanks,

    Chris

    Learning something new on every visit to SSC. Hoping to pass it on to someone else.

  • It will be good to hear how things go with that change.

    Just a note but when you say publication, I think you are referring to the snapshot agent. You really have two different agents in snapshot replication, the snapshot agent which will develop the snapshot bcp file, and the distribution agent that will share that with the appropriate server.

    David

    @SQLTentmaker

    “He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

  • OK, here's what I have, after some more testing.

    I now believe that if the subscription jobs are running at the same time that a NEW snapshot is being created, the snapshot creator attempts to cleanup the old snapshot, which the subscription jobs are currently using, and deletes the script files before the subscription jobs can complete their work.

    My test case was a snapshot generated at 9:00, and the subscription jobs running at 9:15. No errors.

    Schedules changed to all run on the hour.

    At 10:00 all job ran at the same time. The subscription jobs did nothing because they found that there was no newer snapshot to push in the subscriptions. New snapshot was generated at same time, but was not looked at by subscription jobs because it was just starting to be generated when they ran.

    At 11:00 all jobs ran again. New snapshot was created and both subscription jobs failed. Errors were like I saw before, and the folder for the previous subscription was left there, with various files in it. My hypothesis is that these files were left because when the publisher cleanup job attempted to delete them they were in use by the subscription jobs and could not be deleted so the cleanup job just left them.

    What I think this boils down to is a design flaw, in that the snapshot cleanup and subscription jobs should coordinate to make sure that no one is using the snapshot before it gets deleted.

    I will schedule my jobs so they cannot step on each other in the future.

    Thanks,

    Chris

    Learning something new on every visit to SSC. Hoping to pass it on to someone else.

  • With the jobs set for snapshot generation on the hour and subscription jobs on the quarter hour I have had no errors all night.

    Chris

    Learning something new on every visit to SSC. Hoping to pass it on to someone else.

  • That's great news and thanks for posting back.

    David

    @SQLTentmaker

    “He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply