Moving Data with Possible Changes from Different Locations

  • I have a new project and need some input on the best approach to handle it. We have production servers at multiple sites running SQL 2000. I have been tasked with copying data from multiple tables nightly to a SQL 2008 server at our central office; once it's there, we will run our analysis and reports against it. Some of these tables have millions of rows. Here is the catch: on occasion, some of the older data can change, and I need to account for that, so I will have both updates and inserts for those tables when I bring them across each night. Something is in place now that makes a copy of the data at every site and does some transformations, but it drops the tables every night and starts over, which is not efficient at all. The purpose of moving this work off the site servers is to lessen the load the production servers have to carry for analysis.

    One other issue is that the IT staff is concerned about the performance impact of bringing that much data across a WAN each night.

    Thoughts on how to address the issues and concerns?

  • Are you in a position to make sure tables in the source databases have RowVersion columns? (Called Timestamp in most SQL 2000 databases, but it's the same thing.)

    That makes it really easy to know what to include in your delta load. (Delta = Only what's been changed, in case that term isn't familiar.)
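
    In rough T-SQL terms, the pattern looks something like this. It's only a sketch of the idea; the table, column, and control-table names are all made up for illustration:

        -- One-time change on the source table ("timestamp" is the SQL 2000 name;
        -- later versions call the same data type rowversion).
        ALTER TABLE dbo.Orders ADD RowVer timestamp;

        -- Nightly delta pull. @LastSync is the high-water mark saved from the
        -- previous run (a one-row-per-table control table is a typical home for it).
        DECLARE @LastSync binary(8), @ThisSync binary(8);

        SELECT @LastSync = LastRowVer
        FROM dbo.SyncControl
        WHERE TableName = 'Orders';

        SET @ThisSync = @@DBTS;   -- highest row version used so far in this database

        SELECT *
        FROM dbo.Orders
        WHERE RowVer > @LastSync
          AND RowVer <= @ThisSync;

        -- Store the new high-water mark for tomorrow's run.
        UPDATE dbo.SyncControl
        SET LastRowVer = @ThisSync
        WHERE TableName = 'Orders';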

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Another method of doing delta checks, which I've been investigating recently, is to compute a CHECKSUM() per primary key to determine the delta, and then do a comparison between the source and the destination. Once I have the list of primary keys that need to be updated/inserted, I can grab only the necessary records.

    It's basically a way to work around the lack of a rowversion column that Gus mentioned in his previous post.
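
    For instance, a bare-bones version of the idea might look like the following. The table and key names are made up, and for simplicity it pretends both tables are reachable from the same server:

        -- One checksum per primary key on the table being watched,
        -- staged into a temp table for the comparison.
        SELECT OrderID, CHECKSUM(*) AS RowCheck
        INTO #SourceChecksums
        FROM dbo.Orders;

        -- Keys that are missing at the destination, or whose checksum no longer
        -- matches, are the delta: only those records need to be pulled across.
        SELECT src.OrderID
        FROM #SourceChecksums AS src
        LEFT JOIN dbo.Orders_Destination AS dst
               ON dst.OrderID  = src.OrderID
              AND dst.RowCheck = src.RowCheck
        WHERE dst.OrderID IS NULL;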


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • We have been told by the vendor of the software that runs on the database that they don't support rowversion. Evidently they have a number of procedures in their code that run SELECT *. Nice, huh?

    As far as the computed CHECKSUM goes, I've investigated that, but even BOL says it isn't 100% reliable. In some checks over the past couple of days I've found anywhere from 100 to 400 records where the CHECKSUM values differ between a backup of the table and the current table, but when I look at the individual records they are identical. I can't figure out why the CHECKSUM values are different.

  • ehlinger (2/17/2011)


    We have been told by the vendor of the software that runs on the database that they don't support rowversion. Evidently they have a number of procedures in their code that run SELECT *. Nice, huh?

    As far as the computed CHECKSUM goes, I've investigated that, but even BOL says it isn't 100% reliable. In some checks over the past couple of days I've found anywhere from 100 to 400 records where the CHECKSUM values differ between a backup of the table and the current table, but when I look at the individual records they are identical. I can't figure out why the CHECKSUM values are different.

    :blink: Usually the problem goes the other way: you get false 'unchanged' results, not false 'changed' ones. That's really odd. I'd have to see the values themselves to try to work out what's happening.

    The usual problem with CHECKSUM is that different values can produce the same checksum. There's an alternative function, BINARY_CHECKSUM, that you could look into as well. The primary difference there is case sensitivity:

    From BOL: http://msdn.microsoft.com/en-us/library/ms173784.aspx

    CHECKSUM and BINARY_CHECKSUM return different values for the string data types, where locale can cause strings with different representation to compare equal. The string data types are char, varchar, nchar, nvarchar, or sql_variant (if the base type of sql_variant is a string data type). For example, the BINARY_CHECKSUM values for the strings "McCavity" and "Mccavity" are different. In contrast, in a case-insensitive server, CHECKSUM returns the same checksum values for those strings. CHECKSUM values should not be compared with BINARY_CHECKSUM values.
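
    A quick way to see that difference for yourself, using the same strings as the BOL example (this assumes a case-insensitive collation):

        -- On a case-insensitive server, CHECKSUM treats these strings as equal...
        SELECT CHECKSUM('McCavity') AS cs_mixed, CHECKSUM('Mccavity') AS cs_lower;

        -- ...while BINARY_CHECKSUM works on the underlying bytes, so the values differ.
        SELECT BINARY_CHECKSUM('McCavity') AS bcs_mixed, BINARY_CHECKSUM('Mccavity') AS bcs_lower;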


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Got the CHECKSUM issue taken care of, and it appears to be working correctly now. I will probably use it to indicate whether data needs to be transferred by keeping a staging table on the site servers with just each table's primary key and CHECKSUM value, and then moving only the records where the CHECKSUM has changed from the preceding day.
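
    Roughly what I have in mind on each site server is something like the following (the table and column names are just placeholders):

        -- Staging table holding only each row's key and last night's checksum.
        CREATE TABLE dbo.Orders_ChecksumStage
        (
            OrderID  int NOT NULL PRIMARY KEY,
            RowCheck int NOT NULL
        );

        -- Nightly: rows that are new, or whose checksum has moved since the prior
        -- run, are the only ones that get shipped to the central server.
        SELECT o.*
        FROM (SELECT *, CHECKSUM(*) AS RowCheck FROM dbo.Orders) AS o
        LEFT JOIN dbo.Orders_ChecksumStage AS s
               ON s.OrderID = o.OrderID
        WHERE s.OrderID IS NULL
           OR s.RowCheck <> o.RowCheck;

        -- After a successful transfer, refresh the staging table for tomorrow's run.
        TRUNCATE TABLE dbo.Orders_ChecksumStage;

        INSERT INTO dbo.Orders_ChecksumStage (OrderID, RowCheck)
        SELECT OrderID, CHECKSUM(*)
        FROM dbo.Orders;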

    Has anyone ever run into the occasional case where the CHECKSUM value doesn't change even though the row has changed? Should I be OK going this route, or is there a way in SSIS to compare the CHECKSUM values between two tables and avoid the staging table?

  • ehlinger (2/17/2011)


    Got the CHECKSUM issue taken care of, and it appears to be working correctly now. I will probably use it to indicate whether data needs to be transferred by keeping a staging table on the site servers with just each table's primary key and CHECKSUM value, and then moving only the records where the CHECKSUM has changed from the preceding day.

    Has anyone ever run into the occasional case where the CHECKSUM value doesn't change even though the row has changed?

    Yes, but it's very rare. When I implement this method I usually make sure there's at least a once-a-month full refresh to catch the one or two missed updates per million that might slip through.

    Should I be OK going this route, or is there a way in SSIS to compare the CHECKSUM values between two tables and avoid the staging table?

    Well, I personally keep the CHECKSUM() as a persisted computed column at the target (where I can adjust the schema to my heart's delight), and left join the source to the target on PK + checksum. Any row that doesn't find a match in the target gets put into the queue for an update/insert.
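
    In rough T-SQL that looks something like the following; the table and column names are mine, purely for illustration (note that a computed column has to list its columns explicitly, CHECKSUM(*) isn't allowed there):

        -- At the target (SQL 2008), where the schema can be adjusted freely:
        ALTER TABLE dbo.Orders_Target
            ADD RowCheck AS CHECKSUM(OrderDate, CustomerID, Amount) PERSISTED;

        -- Left join the incoming source rows to the target on PK + checksum.
        -- Anything with no match is new or changed and goes into the work queue.
        SELECT s.*
        FROM dbo.Orders_Incoming AS s
        LEFT JOIN dbo.Orders_Target AS t
               ON t.OrderID  = s.OrderID
              AND t.RowCheck = CHECKSUM(s.OrderDate, s.CustomerID, s.Amount)
        WHERE t.OrderID IS NULL;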


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • If you end up with a false positive on the checksum in this case, saying a row has changed when it hasn't, that can be handled by a second step of comparing the actual rows, or you can ignore it and just update the row anyway, whichever is "cheaper" to implement. After all, updating a row to make it the same as it already was doesn't hurt the data; it just takes a few unnecessary CPU cycles and some I/O overhead.

    The idea is that it will narrow down the list of rows to check for updates, instead of having to check all columns of all rows ever generated.
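
    If you do go with the second step of comparing the actual rows, and assuming that check runs on the SQL 2008 side where EXCEPT is available, it can be as short as this (table and column names are only illustrative):

        -- Candidate rows flagged by the checksum compare, minus any that are
        -- already identical in the target: only the true changes remain.
        SELECT OrderID, OrderDate, CustomerID, Amount
        FROM dbo.Orders_Candidates
        EXCEPT
        SELECT OrderID, OrderDate, CustomerID, Amount
        FROM dbo.Orders_Target;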

    Too bad on the rowversion thing. That's what it's made for, so it's really efficient at it. But I do understand that a lot of software vendors produce really junky database code, and then want you to upgrade your server when their code runs like a glacier.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thanks for all your input.

    Craig, in your opinion could I use the Lookup or Merge Join transforms to compare between the source and the target? If so, which one would you recommend?

    As I stated before, I think I could also do the check on the source server by comparing a staging table to the actual table, with a status field indicating whether each row is an update, an addition, or unchanged, and then move only the addition and change records over to the target server. If it can all be done in SSIS using the transforms, I would think that would be better; then again, if I identify the changes on the source and only move those records across, maybe not.

  • ehlinger (2/17/2011)


    Thanks for all your input.

    Craig, in your opinion could I use the Lookup or Merge Join transforms to compare between the source and the target? If so, which one would you recommend?

    Lookup, always Lookup. You can redirect on error (failed lookup), and you'll only want the failure rows. Merge Join, in my limited experience, is a twice-in-a-lifetime kind of object as far as any kind of optimization is concerned, and then only when you want to end up multiplying rows in the data stream for some reason.

    If it can all be done in SSIS using the transforms, I would think that would be better; then again, if I identify the changes on the source and only move those records across, maybe not.

    This may require some testing on your part. Obviously, if your PK is two columns, a three-column vs. three-column check (PK plus checksum) would run the fastest. However, you're going to have to read the table twice. It may or may not be faster to just grab everything from the source into the data stream, use a tight lookup against the target, find anything that doesn't match (errors on the lookup), and discard the matches.

    What I would recommend for the SSIS-only test is to do the lookup on the PK alone but return the checksum field. Any errors here are direct inserts, and you can feed that result to your insert. After that, do a Conditional Split, testing whether sourceChecksum = targetChecksum. If they match, send them to the 'conditional' output and let the unmatched rows come out as the default.

    Feed your default output into an update statement of your choosing, and abandon the conditional split's other results to the aether.

    Which of these options is faster is going to depend pretty heavily on table width. If you're looking at (rough guess) ~200 bytes per row or less, try the SSIS version first. If it's larger, try the staging table with the list of appropriate records first, then feed that, joined to your main table, into SSIS. You're still going to have to find an appropriate MERGE method, however.
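
    Since the central server is 2008, MERGE is available there for that final apply step. A rough sketch, with made-up table and column names:

        -- Apply the identified delta (new + changed rows) to the main table.
        MERGE dbo.Orders_Target AS t
        USING dbo.Orders_Delta AS s            -- the rows identified upstream
           ON t.OrderID = s.OrderID
        WHEN MATCHED THEN
            UPDATE SET t.OrderDate  = s.OrderDate,
                       t.CustomerID = s.CustomerID,
                       t.Amount     = s.Amount
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (OrderID, OrderDate, CustomerID, Amount)
            VALUES (s.OrderID, s.OrderDate, s.CustomerID, s.Amount);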


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Thanks Craig. I'll be looking at this later tonight and most of the day tomorrow. I appreciate your insight.
