Transactions in Transactional Replication

,

Article by Paul Ibison (www.replicationanswers.com)

Introduction

Occasionally in the replication newsgroups questions come

up regarding transactions in transactional replication. People want to know what

happens if a transaction on the publisher fails - does it still get applied at

the subscriber? Or what if it runs on the publisher and fails on the subscriber

- does it always/ever roll back? Then there are questions regarding the

transaction size. If an update statement affects say 100000 rows on the

publisher, this will be logged as 100000 updates in the transaction log. In some

such cases the log reader agent may not be able to cope, and even if the log

reader succeeds, there may be similar problems for the distribution agent. Users

are interested in the possible workarounds for these situations. There is only a

little information in BOL to help clarify and solve these types of query, so I

have tied my various findings into an article to hopefully help out.

Replication of Simple Transactions

We start with a simple case. Suppose we add 4 records in a

transaction at the publisher.

set xact_abort on

begin tran

    insert 1

    insert 2

    insert 3

    insert 4

commit tran

The log reader agent reads the transaction log and adds these records to the

distribution database in the MSrepl_commands and MSrepl_transactions tables. If

we use sp_browsereplcmds to check the actual text of these inserts we see that

the transactional nature of this batch is maintained and we'll see something

like this:

xact_seqno Command Command_Id
seq_aaa Command1 1
seq_aaa Command2 2
seq_aaa Command3 3
seq_aaa Command4 4

The constant xact_seqno identifies the transaction, and

the incrementing command_id indicates the commands in the batch. So,

transactions on the publisher are applied as

transactions on the subscriber.

Failing Transactions on the Publisher

What happens to the replication setup when a transaction

fails at the publisher - does it still get replicated? Suppose we perform the

same transaction as above, but "Command4" fails. On the publisher there

will be an error message, the transaction is rolled back because of the

xact_abort setting and no rows are added. Using sp_browsereplcmds, none of these

commands are visible and they are not to be found on the subscriber, showing

that only committed transactions are replicated to the

subscriber.

Failing Transactions on the Subscriber

What if the transaction completes at the publisher and

subsequently fails at the subscriber? This could occur because we haven't been

vigilant, and someone has been allowed to edit the subscriber data (yes - it

happens :)). In this case the distribution agent fails and raises an error

message. The failure itself is expected, but the inbuilt error handler has a

hard coded exception command: "if @@trancount > 0 rollback tran"

(which is actually independant of the xact_abort setting on the publisher) meaning that

transactions failing at the subscriber are automatically

rolled back. So, until the problem is fixed, the error will prevent

the distribution agent from processing any separate part of the transaction -

the ACID properties are maintained. Note that this is the default behaviour

which may be overridden (see later).

Large Transactions which cause Log Reader Agent Timeouts

The next thing to look at is the issue of large

transactions. Remember that all commands are implicitly run as transactions so

although a TSQL statement might look innocuous at first glance - eg a

simple update statement - if it updates 100000 rows, this will be logged as:

begin tran

    update row 1

    update row 2

    update row 3

    ...

    update row 100000

commit tran

The log reader agent will try to read each of these commands and transfer them

to the distribution database. During this process, what problems can occur?

Basically we might have some sort of timeout of the log reader agent. The

resulting messages are various and include:

"The process could not execute 'sp_replcmds' on 'xxxxxx'."

"Status: 2, code: 0, text: 'Timeout expired'"

"A time out occurred while waiting for memory resources to execute the query."

"Agent 'xxx' is retrying after an error. 0 retries attempted. See agent job

history in the Jobs folder for more details."

"The step was cancelled (stopped) as the result of a stop job request"

These errors are each related to the number of records marked for replication in

the log which the log reader has to parse and process. There are a few profile

parameters which can be used to modify the process and avoid the error:

(a) increase the QueryTimeout. This is

the usual solution. It won't decrease the latency of a big transaction, but it

is often sufficient on its own to get things working again.

(b) change the MaxCmdsInTran. If the problem is that the

transaction contains a hugh amount of commands, then it could be an advantage to

chop it up into several smaller transactions. This will massively decrease

latency and prevent associated timeouts. However it will cause a problem if

there is a command within the transaction which fails. Remember that a

transaction gets implicitly rolled back on the subscriber when there is an

error. If we use this MaxCmdsInTran parameter, the child batch which has the

problem row will error causing the distribution agent to fail. However if there

are previous

child batches they will have already been committed and hence broken the ACID

properties of our original transaction. Therefore there is a definite risk

associated with using this parameter.

(c) change the ReadBatchSize.  This is the maximum number of

transactions read out of the transaction log of the publishing database "per

processing cycle". As there is no supported way to configure the "processing

cycle" (changing -LogScanThreshold is going to be unsupported) we can equate

this to saying "it is the number of transactions read when running the log

reader agent". The lowest this figure can be is 1, so if there is one huge

transaction causing the problem, this setting will not help at all to remove a

timeout. However if the timeout is as a result of a lot of small transactions

being processed, setting this to a smaller figure than the default of 500 to

temporarily clear the backlog, along with increasing the QueryTimeout, is known

to remove the problem.

(d) change the ReadBatchThreshold. This is very similar to the -ReadBatchSize

parameter above but rather than relating to transactions, it relates to commands ie it

is the number of replication commands to be read from the transaction log in a

processing batch. Likewise, setting this to a small value along with increasing

the QueryTimeout can help remove a backlog, especially if the problem is caused

by large transactions.

Large Transactions which cause Distribution Agent

Timeouts or Slowdowns

Just like with the log reader agent we can use the QueryTimeout parameter,

and increasing it's value will be the first option to try. In the case of the

distribution agent though we have 2 extra parameters: CommitBatchSize which is

"the number of transactions to be issued to the Subscriber before a COMMIT

statement is issued. The default is 100." and CommitBatchThreshold

which is "the number of replication commands to be issued to the Subscriber

before a COMMIT statement is issued. The default is 1000." - both definitions

coming from BOL.

These definitions are really quite confusing. Does this mean that the ACID

properties are necessarily broken if we have a transaction that has > 1000

commands in it? What is the "COMMIT" being referred to? Empirically I find that

is that this is not an option to break up a transaction into a series of smaller

transactions, so the word "COMMIT" in the BOL explanations is misleading and

really doesn't refer to the original transaction (or any transaction!).

Consider an example where 4 rows are added and the 4th row causes an error:

begin tran

    insert row 1

    insert row 2

    insert row 3

    insert row 4 -- causes an error

commit tran

If we set the CommitBatchSize and

CommitBatchThreshold to both be 2, and the final insert fails then what

happens? If the ACID properties were broken and replication had produced 2

transactions at the subscriber, there would be 2 rows from the initial

successful transaction at the subscriber. In practice no rows at all are

inserted at the subscriber.  So, these 2

parameters are purely ways of configuring throughput for the distribution agent,

and the logic of the original transaction remains. In other words, the "COMMIT"

in the 2 BOL definitions above refers to the application of a batch of statements

and is nothing at all to do with transactions.

It seems to me that that the usefulness of these 2

commands relates to those cases involving blocking issues at the subscriber.

Smaller batches incur reduced locking to be applied in each separate batch at

the subscriber. If there was no subscriber access other than that of the

distribution agent, the transaction will overall take longer to apply if we use

small batches. However in the presence of other processes needing to perform

reads or updates of committed data at the subscriber (ie I'm not referring to

dirty reads) then blocking issues might mean that smaller batches will overall

work more quickly. Clearly you'd need to really understand the usage profile of

the subscriber to determine if these parameters are going to improve throughput

and concurrency. In practice I've never used these parameters but I suppose I

know they're there if required.

Conclusions

Hopefully this article helps clarify how transactions are

applied in transactional replication, and how this knowledge can be used to

optimise the system

Rate

4.44 (18)

Share

Share

Rate

4.44 (18)