Latency issues in Transactional Replication

Question

Latency issues in Transactional Replication

rogelio.vidaurri

SSCrazy

Points: 2117
More actions
January 17, 2017 at 3:02 pm

#333982

Hi,
We have two SQL server 2012 bidirectional transactional replications working correctly.
Instance A (active) <-> Instance B (passive - not in use)
Instance A (active) <-> Instance C (passive - not in use)
This means we have the same database in 3 instances. 2 of them (A and C) within the same LAN and 1 of them (B) on Internet.
We are adding a new table/article to the database/replication that gets lots of inserts (this new table will replace one of the existing replicated articles), but as soon as we start inserting into it, the latency alert is triggered.
If we insert 10, or 80, it works okay, if we insert 2,000 or more, the latency issue is triggered.
A <-> B starts having issues
However:
A <-> C works correctly
On reviewing the processes on Instacce A via sp_whoisactive, we noticed that everytime we get the issue this process shows up:
PREEMPTIVE_OS_WAITFORSINGLEOBJEC
for that specific subscription, and as soon as it finishes, the undistributed commands go down to 0 as usual.
What can be happening? We did the exact same thing last year and it worked perfectly.
Thanks in advance

Viewing 15 posts - 1 through 15 (of 44 total)

You must be logged in to reply to this topic. Login to reply

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 1

External sync wait.

What is the latency and bandwidth capability between A and B? The last time it worked was it set up EXACTLY like it is now? Did the network pipe change capabilities? Did the B machine change in capabilities?

Test doing a linked server insert from a local table of 2-5K similar rows to the table that is having issues from A to B and see how long that takes.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

rogelio.vidaurri SSCrazy Points: 2117 More actions · Answer 2

"What is the latency and bandwidth capability between A and B?"
I'll ask our server hosting providers.

"The last time it worked was it set up EXACTLY like it is now? Did the network pipe change capabilities? Did the B machine change in capabilities?"
Yes. No that we are aware of. No that we are aware of.

"Test doing a linked server insert from a local table of 2-5K similar rows to the table that is having issues from A to B and see how long that takes."
Do you mean to execute an insert query on A that inserts 2-5K similar rows to B in the table having issues? Will do.
Big question here is:
Why does the table to be replaced (table2016) work perfectly from A to B even when it has millions of rows and is active? and the new table (table2017) doesn't

Thanks!

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 3

The number of rows in a replicated table isn't of importance from a replication performance perspective. It is how much DML activity occurs.

Check the 2017 table for foreign keys (especially unindexed ones), triggers or anything else that could be different compared to the 2016 table.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

rogelio.vidaurri SSCrazy Points: 2117 More actions · Answer 4

both tables are exactly the same in constraints, keys, indexes, no triggers, etc.
the only difference is that one has been replicated for almost a year, and the new one was added recently to the replication

I'll get back here as soon as I have answers

regards<b

sjimmo SSChampion Points: 11139 More actions · Answer 5

rogelio.vidaurri - Tuesday, January 17, 2017 3:02 PM
Hi,
We have two SQL server 2012 bidirectional transactional replications working correctly.
Instance A (active) <-> Instance B (passive - not in use)
Instance A (active) <-> Instance C (passive - not in use)
This means we have the same database in 3 instances. 2 of them (A and C) within the same LAN and 1 of them (B) on Internet.
We are adding a new table/article to the database/replication that gets lots of inserts (this new table will replace one of the existing replicated articles), but as soon as we start inserting into it, the latency alert is triggered.
If we insert 10, or 80, it works okay, if we insert 2,000 or more, the latency issue is triggered.
A <-> B starts having issues
However:
A <-> C works correctly
On reviewing the processes on Instacce A via sp_whoisactive, we noticed that everytime we get the issue this process shows up:
PREEMPTIVE_OS_WAITFORSINGLEOBJEC
for that specific subscription, and as soon as it finishes, the undistributed commands go down to 0 as usual.
What can be happening? We did the exact same thing last year and it worked perfectly.
Thanks in advance

Many things impact the latency of replication, including every component involved in the overall system.

It sounds first like the Latency Warning parameters are set as the default, which will definitely give the warnings such as you are describing over the internet connection.

Internet connections are not as efficient as local networks. Computers do not have a dedicated port connection to a router. Multiple computers share a port. Depending on the number of active computers down line from you, and what they are doing can have an impact on the speed you are seeing. Additionally, in-house, what is your overall configuration? Do you use a dedicated distributor or is the distributor also on the publisher server. This can impact the performance on your server as well when there are latency issues. Are you using one publication with subscriptions to each location? This too can have a performance impact causing your faster lines at times to wait while the distributor is working on the slower line.

In order to make the latency alert more realistic, preform pings to your subscriber on the internet during various times of the day. Do this for a few days and determine the average time as your baseline. Then adjust your Latency warning to this. You may have to play with the settings in order to come up with a true figure. Now you should receive more reliable warnings.

If this is causing issues on the publisher, and the distributor is also on the publisher, separate them. In doing so, the publisher sends the data to the distributor which is going to work at it's own speed. But you have removed a burden on the OLTP server. If there is still an issue then create 2 packages, one to be used to the subscriber(s) locally, and the other to be used for those via internet. This will make the load on the distributor more efficient as well.

Check with your ISP and see if they can/will tell you about the load on the line you are coming in on as well as the load going out. They can, if they wish to, get that information from a variety of means at the ISP.

Good luck. Replication works well when it works. But to identify a problem, especially latency, can cause one to take a bottle of Tylenol. I know.

Steve Jimmo
Sr DBA
“If we ever forget that we are One Nation Under God, then we will be a Nation gone under." - Ronald Reagan

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 6

rogelio.vidaurri - Wednesday, January 18, 2017 9:10 AM
both tables are exactly the same in constraints, keys, indexes, no triggers, etc.
the only difference is that one has been replicated for almost a year, and the new one was added recently to the replication
I'll get back here as soon as I have answers
regards

One more thing: did you check for blocking on the target machine? Maybe the 2017 table is seeing more read or other activity there and replication replay is being clocked. I would use sp_whoisactive for this.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

rogelio.vidaurri SSCrazy Points: 2117 More actions · Answer 7

TheSQLGuru - Wednesday, January 18, 2017 11:46 AM
rogelio.vidaurri - Wednesday, January 18, 2017 9:10 AM
both tables are exactly the same in constraints, keys, indexes, no triggers, etc.
the only difference is that one has been replicated for almost a year, and the new one was added recently to the replication
I'll get back here as soon as I have answers
regards
One more thing: did you check for blocking on the target machine? Maybe the 2017 table is seeing more read or other activity there and replication replay is being clocked. I would use sp_whoisactive for this.

sp_whoisactive shows the PREEMPTIVE_OS_WAITFORSINGLEOBJEC process for that subscription as soon as the 2017 table starts to have lots of inserts
nothing else unusual can be seen

we did the same yesterday but creating an exact same table structure with a different name, same results and behavior

not sure if creating a new database on A and B, and new susbcriptions just for it would work (in case there's something wrong on our current subscription right now
I guess we'll give it a try

rogelio.vidaurri SSCrazy Points: 2117 More actions · Answer 8

"It sounds first like the Latency Warning parameters are set as the default, which will definitely give the warnings such as you are describing over the internet connection."
That's right, but it has been working correctly for years with default values. It all began when trying to substitute a table with million of rows with a new one (same structure, indexes, etc.).

"Do you use a dedicated distributor or is the distributor also on the publisher server."
Yes, we do.

"Are you using one publication with subscriptions to each location?"
Complete configuration:
A has a push subscription to B (Internet)
B has a push subscription to A (Internet)
A has a push subscription to C (LAN)
C has a push subscription to A (LAN)

ICMP requests are blocked on both ends, we'd need to request them be opened.
BTW, our server hosting providers said everything looks normal re the load on the lines.

"If this is causing issues on the publisher, and the distributor is also on the publisher, separate them."
A SQL instance for the distributor and another SQL instance for the rest? Wouldn't it require more SQL licensing?

"If there is still an issue then create 2 packages, one to be used to the subscriber(s) locally, and the other to be used for those via internet. This will make the load on the distributor more efficient as well."
I didn't get this idea. Could you please explain?

Thanks for sharing!

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 9

rogelio.vidaurri - Wednesday, January 18, 2017 1:20 PM
TheSQLGuru - Wednesday, January 18, 2017 11:46 AM
rogelio.vidaurri - Wednesday, January 18, 2017 9:10 AM
both tables are exactly the same in constraints, keys, indexes, no triggers, etc.
the only difference is that one has been replicated for almost a year, and the new one was added recently to the replication
I'll get back here as soon as I have answers
regards
One more thing: did you check for blocking on the target machine? Maybe the 2017 table is seeing more read or other activity there and replication replay is being clocked. I would use sp_whoisactive for this.
sp_whoisactive shows the PREEMPTIVE_OS_WAITFORSINGLEOBJEC process for that subscription as soon as the 2017 table starts to have lots of inserts
nothing else unusual can be seen
we did the same yesterday but creating an exact same table structure with a different name, same results and behavior
not sure if creating a new database on A and B, and new susbcriptions just for it would work (in case there's something wrong on our current subscription right now
I guess we'll give it a try

I was hoping there would be some blocking in addition to the preemptive waits. REALLY screwy that it is just one table and not the other.

Sadly I have exhausted what I can do for you on this one, certainly via a forum. I suggest you reach out to Hillary Cotter for help. He's the best consultant I know of for dealing with replication on SQL Server. Tell him I sent you.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

sjimmo SSChampion Points: 11139 More actions · Answer 10

rogelio.vidaurri - Wednesday, January 18, 2017 1:30 PM
"It sounds first like the Latency Warning parameters are set as the default, which will definitely give the warnings such as you are describing over the internet connection."
That's right, but it has been working correctly for years with default values. It all began when trying to substitute a table with million of rows with a new one (same structure, indexes, etc.).
"Do you use a dedicated distributor or is the distributor also on the publisher server."
Yes, we do.
"Are you using one publication with subscriptions to each location?"
Complete configuration:
A has a push subscription to B (Internet)
B has a push subscription to A (Internet)
A has a push subscription to C (LAN)
C has a push subscription to A (LAN)
ICMP requests are blocked on both ends, we'd need to request them be opened.
BTW, our server hosting providers said everything looks normal re the load on the lines.
"If this is causing issues on the publisher, and the distributor is also on the publisher, separate them."
A SQL instance for the distributor and another SQL instance for the rest? Wouldn't it require more SQL licensing?
"If there is still an issue then create 2 packages, one to be used to the subscriber(s) locally, and the other to be used for those via internet. This will make the load on the distributor more efficient as well."
I didn't get this idea. Could you please explain?
Thanks for sharing!

Separating the load across multiple packages will lift some of the load on the distributor. The latency over the internet line may impact the local subscriber and publisher stacking up commands waiting to be distributed.
Yes, I would expect the ISP to say everything is OK. Are you using a dedicated line from your company to the ISP and from the ISP to the subscriber? If so, then it sounds like there may be noise on the line. The only way to isolate it is end to end out of service testing by the ISP. Otherwise take the bandwidth and split it up between all of the locations using that line.
Good luck.

BTW, I have an extremely large replication system using internet/broadband and local lan without a latency problem with 9 publishers, 2 distributors and 85 subscribers pushing several million transactions per day. It is all OLTP. Yes, I have had discussions with our ISP's for similar issues, and they resolved it. My latency at worse is still under 10 sec over the internet connections.

Steve Jimmo
Sr DBA
“If we ever forget that we are One Nation Under God, then we will be a Nation gone under." - Ronald Reagan

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 11

sjimmo - Wednesday, January 18, 2017 3:18 PM
rogelio.vidaurri - Wednesday, January 18, 2017 1:30 PM
"It sounds first like the Latency Warning parameters are set as the default, which will definitely give the warnings such as you are describing over the internet connection."
That's right, but it has been working correctly for years with default values. It all began when trying to substitute a table with million of rows with a new one (same structure, indexes, etc.).
"Do you use a dedicated distributor or is the distributor also on the publisher server."
Yes, we do.
"Are you using one publication with subscriptions to each location?"
Complete configuration:
A has a push subscription to B (Internet)
B has a push subscription to A (Internet)
A has a push subscription to C (LAN)
C has a push subscription to A (LAN)
ICMP requests are blocked on both ends, we'd need to request them be opened.
BTW, our server hosting providers said everything looks normal re the load on the lines.
"If this is causing issues on the publisher, and the distributor is also on the publisher, separate them."
A SQL instance for the distributor and another SQL instance for the rest? Wouldn't it require more SQL licensing?
"If there is still an issue then create 2 packages, one to be used to the subscriber(s) locally, and the other to be used for those via internet. This will make the load on the distributor more efficient as well."
I didn't get this idea. Could you please explain?
Thanks for sharing!
Separating the load across multiple packages will lift some of the load on the distributor. The latency over the internet line may impact the local subscriber and publisher stacking up commands waiting to be distributed.
Yes, I would expect the ISP to say everything is OK. Are you using a dedicated line from your company to the ISP and from the ISP to the subscriber? If so, then it sounds like there may be noise on the line. The only way to isolate it is end to end out of service testing by the ISP. Otherwise take the bandwidth and split it up between all of the locations using that line.
Good luck.
BTW, I have an extremely large replication system using internet/broadband and local lan without a latency problem with 9 publishers, 2 distributors and 85 subscribers pushing several million transactions per day. It is all OLTP. Yes, I have had discussions with our ISP's for similar issues, and they resolved it. My latency at worse is still under 10 sec over the internet connections.

From my understanding Steve this can't be a problem with connectivity because one table in same database replicates just fine and the other does not. And it is repeatably so apparently.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

rogelio.vidaurri SSCrazy Points: 2117 More actions · Answer 12

TheSQLGuru - Wednesday, January 18, 2017 4:07 PM
sjimmo - Wednesday, January 18, 2017 3:18 PM
rogelio.vidaurri - Wednesday, January 18, 2017 1:30 PM
"It sounds first like the Latency Warning parameters are set as the default, which will definitely give the warnings such as you are describing over the internet connection."
That's right, but it has been working correctly for years with default values. It all began when trying to substitute a table with million of rows with a new one (same structure, indexes, etc.).
"Do you use a dedicated distributor or is the distributor also on the publisher server."
Yes, we do.
"Are you using one publication with subscriptions to each location?"
Complete configuration:
A has a push subscription to B (Internet)
B has a push subscription to A (Internet)
A has a push subscription to C (LAN)
C has a push subscription to A (LAN)
ICMP requests are blocked on both ends, we'd need to request them be opened.
BTW, our server hosting providers said everything looks normal re the load on the lines.
"If this is causing issues on the publisher, and the distributor is also on the publisher, separate them."
A SQL instance for the distributor and another SQL instance for the rest? Wouldn't it require more SQL licensing?
"If there is still an issue then create 2 packages, one to be used to the subscriber(s) locally, and the other to be used for those via internet. This will make the load on the distributor more efficient as well."
I didn't get this idea. Could you please explain?
Thanks for sharing!
Separating the load across multiple packages will lift some of the load on the distributor. The latency over the internet line may impact the local subscriber and publisher stacking up commands waiting to be distributed.
Yes, I would expect the ISP to say everything is OK. Are you using a dedicated line from your company to the ISP and from the ISP to the subscriber? If so, then it sounds like there may be noise on the line. The only way to isolate it is end to end out of service testing by the ISP. Otherwise take the bandwidth and split it up between all of the locations using that line.
Good luck.
BTW, I have an extremely large replication system using internet/broadband and local lan without a latency problem with 9 publishers, 2 distributors and 85 subscribers pushing several million transactions per day. It is all OLTP. Yes, I have had discussions with our ISP's for similar issues, and they resolved it. My latency at worse is still under 10 sec over the internet connections.
From my understanding Steve this can't be a problem with connectivity because one table in same database replicates just fine and the other does not. And it is repeatably so apparently.

our last resort would be to remove everything (distributor, subscriptions, etc.) and create all from scratch, hopefully it would work
we had to do this last year when replication starting failing because of "violation of primary key" errors (off-topic)

still waiting for some answers from our ISPs

before that, we'll try what you've both kindly suggested

thanks

sjimmo SSChampion Points: 11139 More actions · Answer 13

TheSQLGuru - Wednesday, January 18, 2017 4:07 PM
sjimmo - Wednesday, January 18, 2017 3:18 PM
rogelio.vidaurri - Wednesday, January 18, 2017 1:30 PM
"It sounds first like the Latency Warning parameters are set as the default, which will definitely give the warnings such as you are describing over the internet connection."
That's right, but it has been working correctly for years with default values. It all began when trying to substitute a table with million of rows with a new one (same structure, indexes, etc.).
"Do you use a dedicated distributor or is the distributor also on the publisher server."
Yes, we do.
"Are you using one publication with subscriptions to each location?"
Complete configuration:
A has a push subscription to B (Internet)
B has a push subscription to A (Internet)
A has a push subscription to C (LAN)
C has a push subscription to A (LAN)
ICMP requests are blocked on both ends, we'd need to request them be opened.
BTW, our server hosting providers said everything looks normal re the load on the lines.
"If this is causing issues on the publisher, and the distributor is also on the publisher, separate them."
A SQL instance for the distributor and another SQL instance for the rest? Wouldn't it require more SQL licensing?
"If there is still an issue then create 2 packages, one to be used to the subscriber(s) locally, and the other to be used for those via internet. This will make the load on the distributor more efficient as well."
I didn't get this idea. Could you please explain?
Thanks for sharing!
Separating the load across multiple packages will lift some of the load on the distributor. The latency over the internet line may impact the local subscriber and publisher stacking up commands waiting to be distributed.
Yes, I would expect the ISP to say everything is OK. Are you using a dedicated line from your company to the ISP and from the ISP to the subscriber? If so, then it sounds like there may be noise on the line. The only way to isolate it is end to end out of service testing by the ISP. Otherwise take the bandwidth and split it up between all of the locations using that line.
Good luck.
BTW, I have an extremely large replication system using internet/broadband and local lan without a latency problem with 9 publishers, 2 distributors and 85 subscribers pushing several million transactions per day. It is all OLTP. Yes, I have had discussions with our ISP's for similar issues, and they resolved it. My latency at worse is still under 10 sec over the internet connections.
From my understanding Steve this can't be a problem with connectivity because one table in same database replicates just fine and the other does not. And it is repeatably so apparently.

I was under the impression that all coming across the internet connection was an issue.

If not, I'm sorry.

Steve Jimmo
Sr DBA
“If we ever forget that we are One Nation Under God, then we will be a Nation gone under." - Ronald Reagan

rogelio.vidaurri SSCrazy Points: 2117 More actions · Answer 14

sjimmo - Wednesday, January 18, 2017 4:57 PM
TheSQLGuru - Wednesday, January 18, 2017 4:07 PM
sjimmo - Wednesday, January 18, 2017 3:18 PM
rogelio.vidaurri - Wednesday, January 18, 2017 1:30 PM
"It sounds first like the Latency Warning parameters are set as the default, which will definitely give the warnings such as you are describing over the internet connection."
That's right, but it has been working correctly for years with default values. It all began when trying to substitute a table with million of rows with a new one (same structure, indexes, etc.).
"Do you use a dedicated distributor or is the distributor also on the publisher server."
Yes, we do.
"Are you using one publication with subscriptions to each location?"
Complete configuration:
A has a push subscription to B (Internet)
B has a push subscription to A (Internet)
A has a push subscription to C (LAN)
C has a push subscription to A (LAN)
ICMP requests are blocked on both ends, we'd need to request them be opened.
BTW, our server hosting providers said everything looks normal re the load on the lines.
"If this is causing issues on the publisher, and the distributor is also on the publisher, separate them."
A SQL instance for the distributor and another SQL instance for the rest? Wouldn't it require more SQL licensing?
"If there is still an issue then create 2 packages, one to be used to the subscriber(s) locally, and the other to be used for those via internet. This will make the load on the distributor more efficient as well."
I didn't get this idea. Could you please explain?
Thanks for sharing!
Separating the load across multiple packages will lift some of the load on the distributor. The latency over the internet line may impact the local subscriber and publisher stacking up commands waiting to be distributed.
Yes, I would expect the ISP to say everything is OK. Are you using a dedicated line from your company to the ISP and from the ISP to the subscriber? If so, then it sounds like there may be noise on the line. The only way to isolate it is end to end out of service testing by the ISP. Otherwise take the bandwidth and split it up between all of the locations using that line.
Good luck.
BTW, I have an extremely large replication system using internet/broadband and local lan without a latency problem with 9 publishers, 2 distributors and 85 subscribers pushing several million transactions per day. It is all OLTP. Yes, I have had discussions with our ISP's for similar issues, and they resolved it. My latency at worse is still under 10 sec over the internet connections.
From my understanding Steve this can't be a problem with connectivity because one table in same database replicates just fine and the other does not. And it is repeatably so apparently.
I was under the impression that all coming across the internet connection was an issue.
If not, I'm sorry.

not discarded yet
however, we can ping now

A -> B 72ms
B -> A 72ms
A -> C 1ms
C -> A 1ms