RE: Suggestion/solution for using Hashbytes across entire table

  • To make the situation a little more concrete...

    I have developed a lot of data-conversion T-SQL to convert data from a third-party database whose tables contain NO reliable fields for detecting changes. I cannot alter those tables.

    A full conversion takes about 3 days to run (it covers over 30 years of historical data, including transactional histories, all of which must be converted), but I have designed it so that, as long as I can detect "changes" to the source data tables, an incremental update a month later can run in 10-15 minutes.

    With this approach, the customer can take one snapshot of their current live system, run a full conversion into their new system (which is also live for a lot of other functional components), and validate all the converted data over a few weeks or so. Then, when they are confident they are ready, we shut down their old system, copy the tables in their current state, and run the incremental conversion. That means they are only down for an hour or two, instead of shutting down a 24/7 operation for 3+ days. (Plus, should anything be found in testing after the initial conversion, we can correct it and re-run if necessary.)

    Also - if needed, we can do this more than once. We can do the full convert, test, an incremental convert, test, another incremental convert, etc.

    So - I am using parallel tables that contain the primary keys of the source tables and a computed checksum, which is used to detect changes to the source tables since the last conversion run (a sketch is below).
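
    To make that concrete, here is a minimal sketch, assuming a hypothetical source table ThirdPartyDb.dbo.SourceOrders keyed on OrderID (the real table and column names obviously differ):

        -- Parallel tracking table: the source PK plus the checksum
        -- computed at the last conversion run (names are hypothetical).
        CREATE TABLE dbo.ConversionTracking
        (
            OrderID  int NOT NULL PRIMARY KEY,  -- PK of the source row
            RowCheck int NOT NULL               -- CHECKSUM(*) at last run
        );

        -- Rows that are new or changed since the last run:
        WITH src AS
        (
            SELECT OrderID, CHECKSUM(*) AS RowCheck
            FROM   ThirdPartyDb.dbo.SourceOrders
        )
        SELECT src.OrderID
        FROM   src
        LEFT JOIN dbo.ConversionTracking AS t
               ON t.OrderID = src.OrderID
        WHERE  t.OrderID IS NULL              -- new since last run
           OR  t.RowCheck <> src.RowCheck;    -- changed since last run

    (Deleted rows fall out of the same join run in the other direction.)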

    The CHECKSUM function is *probably* sufficient to detect virtually all of these changes, and it has worked so far (as far as we can tell), but I would like to eliminate the "*probably*" part. 🙂 Plus, in future projects I would like to have another arrow in my quiver of useful techniques...
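
    For what it's worth, the direction I am experimenting with is something like the sketch below (same hypothetical table as above). HASHBYTES with SHA2_256 makes an undetected collision practically impossible, unlike CHECKSUM's 32-bit result, and the correlated FOR JSON subquery (SQL Server 2016+, which also lifts the old 8000-byte HASHBYTES input limit) serializes every column of the row so nothing has to be listed by hand:

        -- One possible HASHBYTES-per-row approach (hypothetical names).
        -- INCLUDE_NULL_VALUES keeps NULL columns distinguishable from
        -- omitted ones in the serialized row.
        SELECT s.OrderID,
               HASHBYTES('SHA2_256',
                         (SELECT s.* FOR JSON PATH, INCLUDE_NULL_VALUES)
                        ) AS RowHash
        FROM   ThirdPartyDb.dbo.SourceOrders AS s;

    The alternative of concatenating the columns into one string and hashing that also works, but then delimiter choice and NULL handling need care to avoid false matches between different rows.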