Last week, in our production, we noticed some records are missing in the subscriber side while existing on the publisher side. The detailed issue description can be found at:
MS online support suggests to apply the latest hotfix and then check again. I can understand online support's practice but I do not believe this should be a recommended approach. Because most of the time, the issue occurs in a production environment, which for whatever reason is hard to copy to the test environment, for example, production environment is EE edition, production is clustered and some other specific configurations for production etc, etc, all these factors make us hard / impossible to create a comparable test environment. On the other hand, it is not easy to apply the latest hotfix to production because there usually exists a whole test cycle to test the hotfix from the application perspective, esp. when MS cautions that users should weigh the risks of applying not-fully tested hotfix.
After almost one week's debugging and investigation with other team members, I finally come up with a test script that can repeat the issue and thus points out the conditions under which this issue can occur.
To repeat the issue, we need to first create the environment as the following (assume you have SQL 2K5 Developer):
1. Prepare two sql server instances DB1 and DB2, assume you install SQL 2K5 Developer with edition to 3161
2. On server DB1, create a database TestDB and then create a table dbo.T and then create a publication UpdatePub (using dbo.T as the article for UpdatePub) with immediate updatable subscription on DB2's TestDB database
The script to create table dbo.T is:
Use TestDB
go
create table dbo.T (a int primary key, b varchar(100))
3. Create a linked server on DB2 that points to DB1 using the following script
/****** Object: LinkedServer [DB1] Script Date: 01/15/2008 23:46:54 ******/
EXEC
master.dbo.sp_addlinkedserver @server = N'DB1', @srvproduct=N'SQL Server'/* For security reasons the linked server remote logins password is changed with ######## */EXEC
master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'DB1',@useself=N'False',@locallogin=NULL,@rmtuser=N'sa',@rmtpassword='########'GO
EXEC
master.dbo.sp_serveroption @server=N'DB1', @optname=N'collation compatible', @optvalue=N'false'EXEC
master.dbo.sp_serveroption @server=N'DB1', @optname=N'data access', @optvalue=N'true'EXEC
master.dbo.sp_serveroption @server=N'DB1', @optname=N'dist', @optvalue=N'false'EXEC
master.dbo.sp_serveroption @server=N'DB1', @optname=N'pub', @optvalue=N'false'EXEC
master.dbo.sp_serveroption @server=N'DB1', @optname=N'rpc', @optvalue=N'false'EXEC
master.dbo.sp_serveroption @server=N'DB1', @optname=N'rpc out', @optvalue=N'false'EXEC
master.dbo.sp_serveroption @server=N'DB1', @optname=N'sub', @optvalue=N'false'EXEC
master.dbo.sp_serveroption @server=N'DB1', @optname=N'connect timeout', @optvalue=N'0'EXEC
master.dbo.sp_serveroption @server=N'DB1', @optname=N'collation name', @optvalue=nullEXEC
master.dbo.sp_serveroption @server=N'DB1', @optname=N'lazy schema validation', @optvalue=N'false'EXEC
master.dbo.sp_serveroption @server=N'DB1', @optname=N'query timeout', @optvalue=N'0'EXEC
master.dbo.sp_serveroption @server=N'DB1', @optname=N'use remote collation', @optvalue=N'true'GO
4. Now from SSMS, open a query window pointing to DB2 server, and then run the following script (call it script_A)
Use TestDB
go
set xact_abort on
begin distributed traninsert
into DB1.TestDB.dbo.T (a, b) values (1, 'from publisher side x1')commit
tranAfter running the script, and then we can do: select * from testdb.dbo.T on BOTH DB1 and DB2, we should see there is a record in dbo.T on both DB1 and DB2, this means the record inserted into the DB1.TestDB.T is replicated to DB2.TestDB.T, just as expected, although the insert transaction is initiated from the DB2 side.
Now we add one more sql statement in the above script and the new script, let's call it Script_B
USE TestDB
go
set
xact_abort on begin distributed traninsert
into T (a, s) values (2, 'from subscriber side x1') -- this is to insert into the local TestDB.T on DB2, and is expected to replicated back to DB1.TestDB.dbo.T because this is a publication with updatable subscriptioninsert
into DB1.TestDB.dbo.T (a, s) values (3, 'from publisher side x2') -- expect this to be replicated back to DB2.TestDB.dbo.Tcommit
tranAfter running this script on DB2, we expect to see three records in DB2.TestDB.dbo.T as follows
a | b |
1 | from publisher side x1 |
2 | from subscriber side x1 |
3 | from publisher side x2 |
the record with a=1 is there because of the script_A, the record with a=2 is a result of script_B, the direct insert sql statement, and the record with a=3 is what we expect because in script_B, we insert the record to the publisher table, and the newly inserted record should be replicate back to the subscriber just as it is done in script_A.
However, the result is: we DO NOT see the record with a=3 in DB2.TestDB.dbo.T, i.e. when we do the query on DB2
select * from TestDB.dbo.T
we can only see two records instead of three records, the real records are as follows:
a | b | msrepl_tran_version |
1 | from publisher side x1 | 10019F52-4C3E-4828-B150-F6CE7842648F |
2 | from subscriber side x1 | 84275778-460F-44C5-9A2E-AC260CA44CD0 |
From replication side, I do not see any error/warning reported, even if I turned on -HistoryVerboseLeve to 3 for distribution agent.
With all these said / tested, I believe SQL 2K5 has a "bug" which will repeat itself if the following minimum conditions are met:
In a SS2k5 publication with updatable subscription, if the following conditions are met,the replication will not work properly even though there is no error reported.
1. There is a distributed transaction started from the subscriber side
2. The distributed transaction will contain two sql statement, one is to insert into a local replicated table (call it sub_table) which is the subscription table of a publication table (called it pub_table), another sql statement is to insert into the remote pub_table on the publisher side.
When these two conditions are met, the insert sql for remote pub_table will succeed, but the record will not be replicated back from the pub_table to the sub_table as expected.
I welcome your comments / feedback if you have any different thoughts.
Note (2008-01-21) Adams Qu from MS has confirmed this is an issue with SQL 2K5 and it is good to know why, please see his comment here: