Checksum function???

  • Hi

    Can anyone guide me on using a checksum to compare millions of records between source and target?

    I have to update a dimension if something changes, and I can't use a lookup as it takes a lot of time.

    If anyone has ever used the checksum() function, can you guide me on how to use it?

    Any help is appreciated.

    thanks

    [font="Comic Sans MS"]
    ---------------------------------------------------

    Thanks [/font]

  • Any help?

    [font="Comic Sans MS"]
    ---------------------------------------------------

    Thanks [/font]

  • Here are my two cents: don't use it. It is not guaranteed to identify all changes to your data; it is possible for different combinations of field values to produce the same checksum. The best way to identify changes is, in my opinion, a column-by-column comparison.

  • You can use a checksum to narrow down the search. It can make a good leading edge for what would otherwise be a multi-column index.

    Checksum isn't exact, but it can reduce the number of rows that need to be compared, which can reduce I/O on large data sets.

    What you do is work out what columns you are most likely to need to compare (might be all of them, might be a subset), and generate checksums for them. Persist and index those. Then, join the source table and target table on the checksum columns, as well as other criteria.
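    A minimal sketch of that setup, with made-up table and column names:

    -- Persist a checksum over the tracked columns and index it
    ALTER TABLE dbo.DimCustomer
        ADD RowChecksum AS CHECKSUM(FirstName, LastName, City) PERSISTED;

    CREATE INDEX IX_DimCustomer_RowChecksum
        ON dbo.DimCustomer (RowChecksum);

    -- Join on the narrow checksum plus the key; a matching checksum can
    -- still be a false positive, so verify the real columns before
    -- trusting a match
    SELECT d.CustomerKey
    FROM dbo.StageCustomer AS s
    JOIN dbo.DimCustomer AS d
      ON d.RowChecksum = CHECKSUM(s.FirstName, s.LastName, s.City)
     AND d.CustomerKey = s.CustomerKey;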

    I've done speed tests with this in very large tables (one was half a petabyte for one table with hundreds of millions of rows), and it can result in much faster returns and much less I/O, when used on multi-column seeks/compares.

    That's the general idea. Are you looking for specifics? If so, please clarify what specifics you need help on.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Gus, did you use CHECKSUM or BINARY_CHECKSUM? If I were going this direction, I would look at the latter. Of course, I don't have the volume of data you are talking about either.

  • I'm pretty sure it was CHECKSUM, not BINARY_CHECKSUM, since I didn't care about differences in case, which is the primary difference between the two.

    select checksum('A'), checksum('a'), binary_checksum('A'), binary_checksum('a');

    Results:

    142 142 65 97

    With what I was doing, case differences didn't matter, only actual content differences. That's the decision that determines which you should use. Of course, it's also critical that you use the same one on both sides of the equation.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • How wide is the dimension, and are we talking type 1 or type 2 SCD columns?

  • Thanks a lot GSquared...

    From what I understand, I will have to add a checksum column for each column that may change, in both places...

    Source as well as target...

    But my source is on a different server (DB2), so I will have to stage it first, add checksums for the 10 or so changing columns, and similarly add 10 checksum columns in my target and compare on them...?

    Is this how checksum works...?

    Actually my source (DB2) has 7 million records, so I can't do a row-by-row comparison (it takes more than an hour).

    Please let me know where to add the columns: in the source staging table or in the target?

    Thanks

    [font="Comic Sans MS"]
    ---------------------------------------------------

    Thanks [/font]

  • Hi there,

    Not sure whether this will help you out, but in a project I'm working on, we decided to go with an MD5 hash that is computed via a script component in our data flow (there are a bunch of ways to do this; that's the one we went with because we could easily switch to another scheme). The hash is saved in the dimension. When we get rows from the operational system, we compare the hashes and, based on that, decide whether an update, an insert, or no action is needed.
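    Sketching the compare-and-decide step in T-SQL (names here are invented, and HASHBYTES stands in for the hash our script component computes; delimiters and NULLs need care in real code):

    -- Update rows whose stored hash no longer matches the incoming data
    UPDATE d
    SET    d.FirstName = s.FirstName,
           d.LastName  = s.LastName,
           d.City      = s.City,
           d.RowHash   = HASHBYTES('MD5', s.FirstName + '|' + s.LastName + '|' + s.City)
    FROM dbo.DimCustomer AS d
    JOIN dbo.StageCustomer AS s
      ON s.CustomerKey = d.CustomerKey
    WHERE d.RowHash <> HASHBYTES('MD5', s.FirstName + '|' + s.LastName + '|' + s.City);

    -- Insert brand-new rows; unchanged rows fall through with no action
    INSERT dbo.DimCustomer (CustomerKey, FirstName, LastName, City, RowHash)
    SELECT s.CustomerKey, s.FirstName, s.LastName, s.City,
           HASHBYTES('MD5', s.FirstName + '|' + s.LastName + '|' + s.City)
    FROM dbo.StageCustomer AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.DimCustomer AS d
                      WHERE d.CustomerKey = s.CustomerKey);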

    The thing is that with checksums, hashes, etc. you can have collisions; you just have to assess how high the risk is and whether you can ever afford one. One could argue that it's the same with GUIDs: while a duplicate is highly improbable, it's not impossible. If you can't afford that, then you're better off with either a column-by-column comparison or an extra column in your source data that tells you the last time something was changed.

    Hope this helps,

    Greg

  • SQL Learner (8/28/2009)


    Thanks a lot GSquared...

    From what I understand, I will have to add a checksum column for each column that may change ...

    Source as well as target...

    You need only a single checksum per row, based on all the columns whose values you are tracking for changes, e.g.

    select checksum('A','B','Z')

    34391

    That's how you get the speed benefit - you need only compare two columns (PK and checksum) when looking for changes, rather than all columns.
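    A minimal sketch of that comparison, assuming both the staging table and the dimension carry such a checksum column (names invented):

    -- One checksum per row in staging, covering every tracked column
    ALTER TABLE dbo.StageCustomer
        ADD RowChecksum AS CHECKSUM(FirstName, LastName, City);

    -- Changed rows: same business key, different checksum
    SELECT s.CustomerKey
    FROM dbo.StageCustomer AS s
    JOIN dbo.DimCustomer AS d
      ON d.CustomerKey = s.CustomerKey
    WHERE d.RowChecksum <> s.RowChecksum;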

    Note also Gus' comments about Checksum and Binary_Checksum

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • I did this recently to determine if a row needed to be updated. I put a checksum column in the table using the values that could possibly be updated. The stored procedure had a checksum variable based on the input variables. The input variables for the proc are also the DB columns that could possibly be updated.

    By comparing the DB checksum calculation with the stored procedure checksum variable I knew whether the row needed updating or not.
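    A rough sketch of that pattern (hypothetical names; RowChecksum is assumed to be a computed column over the same updatable columns):

    CREATE PROCEDURE dbo.UpdateCustomer
        @CustomerKey int,
        @FirstName   varchar(50),
        @LastName    varchar(50),
        @City        varchar(50)
    AS
    BEGIN
        -- Checksum of the incoming values, in the same column order
        -- used by the RowChecksum computed column
        DECLARE @NewChecksum int;
        SET @NewChecksum = CHECKSUM(@FirstName, @LastName, @City);

        -- Only touch the row if the stored checksum differs
        UPDATE dbo.DimCustomer
        SET    FirstName = @FirstName,
               LastName  = @LastName,
               City      = @City
        WHERE  CustomerKey = @CustomerKey
          AND  RowChecksum <> @NewChecksum;
    END;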

  • Mark Pratt (9/1/2009)


    I did this recently to determine if a row needed to be updated. I put a checksum column in the table using the values that could possibly be updated. The stored procedure had a checksum variable based on the input variables. The input variables for the proc are also the DB columns that could possibly be updated.

    By comparing the DB checksum calculation with the stored procedure checksum variable I knew whether the row needed updating or not.

    Makes sense.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Mark Pratt (9/1/2009)


    I did this recently to determine if a row needed to be updated. I put a checksum column in the table using the values that could possibly be updated. The stored procedure had a checksum variable based on the input variables. The input variables for the proc are also the DB columns that could possibly be updated.

    By comparing the DB checksum calculation with the stored procedure checksum variable I knew whether the row needed updating or not.

    Hi,

    Yep, this is exactly the point I was trying to make. Note that this is very easily doable using a script component in a data flow and, from what we've tried so far, it goes pretty fast.

    Regards,

    Greg

  • If you are using SQL 2005, and the columns you are interested in are varchar, nvarchar, or varbinary, you can try making a computed column out of the HashBytes function. There are a few limits, one being that the total input size cannot be greater than 8K (I think this was fixed in either SP3 or 2008). I also believe it is deterministic, so you can put an index on it. Use the SHA1 algorithm, as the chance of collision is 1 in 2^80.
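    A rough sketch of that idea (table and column names are invented; NULL handling and the delimiter are up to you):

    -- SHA1 hash as a deterministic computed column, then index it
    ALTER TABLE dbo.DimCustomer
        ADD RowHash AS HASHBYTES('SHA1',
            ISNULL(FirstName, '') + '|' + ISNULL(LastName, '') + '|' + ISNULL(City, ''));

    CREATE INDEX IX_DimCustomer_RowHash
        ON dbo.DimCustomer (RowHash);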

    Josef Richberg
    2009 Exceptional DBA
    http://www.josef-richberg.squarespace.com
    http://twitter.com/sqlrunner

  • Is there any sort of date column in the source table that could show you which rows have been updated? If you could do something to avoid comparing all 7 million rows, that may help.

    Also, have you tried the SCD transformation? Between that and proper indexing on your dimension, you should still get decent performance out of the compare.

    How much free rein do you have on the source system? Would it be possible to make changes there to aid in making this easier?

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
