July 23, 2018 at 4:17 am
It's a really, really, REALLY good thing to have a staging environment. Had we not had one, my company would have suffered at least a couple of days of downtime in our worldwide operations this week. Here's the story...
Just another update from our Development department to be released into our rather complex environment, one like many we had released before. Nothing in it we hadn't done before. And yet, once we ran this one in our staging environment, the log reader agent (responsible for feeding cdc data into our interface platform, our data warehouse, and our transactional replication environment) stopped and refused to go even one step further. Only some not very informative messages in the Replication Monitor, even with the highest verbose level set. The Replication Monitor refers us to the Agent log, and the Agent log refers us back to the Replication Monitor. What the **** is going on?
I remembered we'd had the log reader crash and get stuck once before, after we altered a column, so I dug up what information I could find from that crash: it happened when we altered a column in a table that cdc was enabled for. When you alter a column type in such a way that a value that fit into the old type may no longer fit into the new type, the log reader may crash while applying the column change to the cdc change tracking table. For example: altering a varchar(512) column to varchar(256), or an int column to tinyint, can cause this. However, an additional condition applies: the table's change tracking table (cdc.<capture instance name>_CT) must already contain at least one value that can't be implicitly converted into the new type. If you don't currently have such a value, the problem still exists, but it won't show itself until the moment you do try to store one. In this case, however, any value that fits into the old column type easily fits into the new type: we're altering nvarchar(255) to nvarchar(512). So today's problem is nothing like that previous crash.
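For reference, a rough sketch of that earlier failure mode (the table, column and capture instance names below are made up for illustration, and this is not this week's bug):

-- Assumes a cdc-enabled table dbo.Orders whose OrderRef column is varchar(512),
-- with capture instance dbo_Orders.
INSERT INTO dbo.Orders (OrderRef) VALUES (REPLICATE('x', 400));  -- the 400-character value is captured into cdc.dbo_Orders_CT
DELETE FROM dbo.Orders WHERE OrderRef = REPLICATE('x', 400);     -- base table is clean again, but the _CT rows keep the long value
GO
ALTER TABLE dbo.Orders ALTER COLUMN OrderRef varchar(256);       -- succeeds against the base table...
GO
-- ...but the log reader can crash later, when it applies this column change to the
-- change tracking table while that table still holds a value longer than 256 characters.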
All the Replication Monitor gives us to work with is this information:
Error messages:
The process could not execute 'sp_replcmds' on '<SERVERNAME>'. (Source: MSSQL_REPL, Error number: MSSQL_REPL20011)
Get help: http://help/MSSQL_REPL20011
Log scan process failed in processing a ddl log record. Refer to previous errors in the current session to identify the cause and correct any associated problems. (Source: MSSQLServer, Error number: 22860)
Get help: http://help/22860
Log Scan process failed in processing log records. Refer to previous errors in the current session to identify the cause and correct any associated problems. (Source: MSSQLServer, Error number: 22859)
Get help: http://help/22859
The statement has been terminated. (Source: MSSQLServer, Error number: 3621)
Get help: http://help/3621
The process could not execute 'sp_replcmds' on '<SERVERNAME>'. (Source: MSSQL_REPL, Error number: MSSQL_REPL22037)
To start my debugging, I stop the log reader agent (only one log reader can be active at any time, even if it crashes), and from SSMS I run sp_replcmds. The result: the same error message. Good, I can reproduce it. The next step (and this is another good reason why you need to catch these problems in staging) is to fire up Profiler, set it to catch all statements started from my session's spid plus Exceptions, and execute sp_replcmds again. A whole load of statements starts rolling into my Profiler window and finally there is an exception: 536, "Invalid length parameter passed to the RIGHT function.". Well, that's something. What code could possibly be calling RIGHT() with a negative number? Let's see which statements came just before this exception...
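For anyone who wants to reproduce that first step themselves, it looks roughly like this (the database name is a placeholder); stop the Log Reader Agent job first, since only one reader can be active:

USE StagingPublishedDb;  -- placeholder for the cdc-enabled / published database
GO
EXEC sys.sp_replcmds;    -- returns the pending commands; in our broken state it throws the same errors the agent reported
GO
-- Disconnect (or run EXEC sys.sp_replflush;) when you're done, so the real log reader agent can take over again.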
Great, the error is in encrypted code 🙁 . Thank you, MS... With no way to see what exactly triggers this bug, my only option is to try to reproduce the problem in a less complicated situation, so I can then experiment with finding a way around it. I'll spare you the lengthy details, but I first wrote a set of scripts to quickly create and destroy, on one test server, a simplified environment that has all the roles our server has in our acceptance and production environments: a merge replication publisher sending data to and receiving data from a subscriber that is at the same time cdc-enabled and a transactional re-publisher, plus finally a transactional subscriber. Once this was done I could verify that I could reproduce the error in this environment too. Thank God I could!

Then, by systematically rerunning the scripts and leaving one role out each time, I determined that I could eliminate transactional replication from the equation: without transactional replication the problem still occurred. However, when I ran the same alter statement without merge replication (against the cdc-enabled database itself, because merge replication is no longer involved), all processes continued as intended: the problem was gone.

My next step was to find an alternative series of steps with the same effect as the original alter table statement. I replaced it by adding a temporary new column, updating all rows to copy the old value into the temporary column, dropping the original column, adding the new column, running another update to copy the values from the temporary column into the new column, and finally dropping the temporary column (see the sketch below). Using my test scripts I could easily verify that this series of steps did work as intended, so I now also have a work-around.
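A minimal sketch of that work-around, assuming the real change was widening an nvarchar(255) column to nvarchar(512) (dbo.SomeTable and SomeColumn are placeholder names):

-- Avoid ALTER COLUMN on the cdc-enabled, merge-replicated table; swap the column instead.
ALTER TABLE dbo.SomeTable ADD SomeColumn_tmp nvarchar(255) NULL;
GO
UPDATE dbo.SomeTable SET SomeColumn_tmp = SomeColumn;            -- preserve the existing values
GO
ALTER TABLE dbo.SomeTable DROP COLUMN SomeColumn;
GO
ALTER TABLE dbo.SomeTable ADD SomeColumn nvarchar(512) NULL;     -- the new, wider type
GO
UPDATE dbo.SomeTable SET SomeColumn = SomeColumn_tmp;            -- copy the values back
GO
ALTER TABLE dbo.SomeTable DROP COLUMN SomeColumn_tmp;
GO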
From the names of the functions and procedures logged in Profiler, I can guess that the encrypted code containing the bug is meant to determine the ddl statement behind the change and store it in cdc.ddl_history. Indeed, when the bug is triggered that table does not contain the ddl, whereas when the alter completes successfully the ddl can be found in cdc.ddl_history.
I will attach my scripts to create the simplified environment and run the test, for others (and possibly MS) to try and reproduce the bug in their environment. Instructions: on a test server, run the scripts numbered 1, 2, and 4 in that order (the missing numbers were for transactional replication) to create the test environment. Then, from the test\ folder, run scripts 2 and 3 to trigger the bug. Now see what happened by running the script test\10 see the result.sql: sys.dm_cdc_log_scan_sessions will show the errors, and cdc.ddl_history will show that no ddl is stored when the error occurs. Also in the attached zip is a .trc file containing all commands executed when I ran sp_replcmds in the faulty situation.
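If you just want to eyeball the result without my script, these two queries against the cdc-enabled database show roughly the same information:

SELECT session_id, start_time, end_time, error_count   -- error_count climbs when the log scan fails
FROM sys.dm_cdc_log_scan_sessions;

SELECT ddl_time, ddl_command                            -- the failed alter never shows up here
FROM cdc.ddl_history;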
My conclusion: to trigger this issue, the database must be both a merge replication subscriber and cdc-enabled. I will log a case with MS, so hopefully they can have a look at that encrypted code and fix the issue. When an astrophysicist discovers a new star, he gets to name it. I think I can name this bug now 🙂 . I will call this one the "replicated alter column ddl_history bug".