Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Checksum function???


Checksum function???

Author
Message
NoraG
NoraG
Old Hand
Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)

Group: General Forum Members
Points: 354 Visits: 916
Hi

Can anyone guide me on using checksum to compare million records in source and target...

I have to update a dimension if sumthing changes, I cant use lookup as it takes lot of time...

If anyone has ever used checksum() function, can you guide me on how to use it...

Any help is appreciated.

thanks


---------------------------------------------------

Thanks

NoraG
NoraG
Old Hand
Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)

Group: General Forum Members
Points: 354 Visits: 916
any help?


---------------------------------------------------

Thanks

Lynn Pettis
Lynn Pettis
SSC-Insane
SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)

Group: General Forum Members
Points: 24150 Visits: 37920
Here is my two cents. Don't use it. It is not a guarantee that it will identify all changes to your data. It is possible for a combination of fields to result in the same checksum value. Your best way to identify changes is, in my opinion, a column by column comparision.

Cool
Lynn Pettis

For better assistance in answering your questions, click here
For tips to get better help with Performance Problems, click here
For Running Totals and its variations, click here or when working with partitioned tables
For more about Tally Tables, click here
For more about Cross Tabs and Pivots, click here and here
Managing Transaction Logs

SQL Musings from the Desert Fountain Valley SQL (My Mirror Blog)
GSquared
GSquared
SSChampion
SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)

Group: General Forum Members
Points: 14363 Visits: 9729
You can use a checksum to narrow down the search. It can make a good leading-edge for what would otherwise be a multi-column index.

Checksum isn't exact, but it can reduce the number of rows that need to be compared, which can reduce I/O on large data sets.

What you do is work out what columns you are most likely to need to compare (might be all of them, might be a subset), and generate checksums for them. Persist and index those. Then, join the source table and target table on the checksum columns, as well as other criteria.

I've done speed tests with this in very large tables (one was half a petabyte for one table with hundreds of millions of rows), and it can result in much faster returns and much less I/O, when used on multi-column seeks/compares.

That's the general idea. Are you looking for specifics? If so, please clarify what specifics you need help on.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
Lynn Pettis
Lynn Pettis
SSC-Insane
SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)

Group: General Forum Members
Points: 24150 Visits: 37920
Gus, are (did) you use CHECKSUM or BINARY_CHECKSUM? If I was going this direction, I would look at the later. Of course, I don't have the volume of data you are talking about either.

Cool
Lynn Pettis

For better assistance in answering your questions, click here
For tips to get better help with Performance Problems, click here
For Running Totals and its variations, click here or when working with partitioned tables
For more about Tally Tables, click here
For more about Cross Tabs and Pivots, click here and here
Managing Transaction Logs

SQL Musings from the Desert Fountain Valley SQL (My Mirror Blog)
GSquared
GSquared
SSChampion
SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)

Group: General Forum Members
Points: 14363 Visits: 9729
I'm pretty sure it was checksum, not binary. Since I didn't care about the difference in case, which is the primary difference between the two.
select checksum('A'),checksum('a'), binary_checksum('A'), binary_checksum('a');


Results:
142 142 65 97

With what I was doing, case differences didn't matter, only actual content differences. That's the decision that determines which you should use. Of course, it's also critical that you use the same one on both sides of the equation.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
Michael Ross-468051
Michael Ross-468051
SSC-Enthusiastic
SSC-Enthusiastic (167 reputation)SSC-Enthusiastic (167 reputation)SSC-Enthusiastic (167 reputation)SSC-Enthusiastic (167 reputation)SSC-Enthusiastic (167 reputation)SSC-Enthusiastic (167 reputation)SSC-Enthusiastic (167 reputation)SSC-Enthusiastic (167 reputation)

Group: General Forum Members
Points: 167 Visits: 154
How wide is the dimension, and are we talking type 1 or type 2 SCD columns?
NoraG
NoraG
Old Hand
Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)Old Hand (354 reputation)

Group: General Forum Members
Points: 354 Visits: 916
Thanks a lot GSquared....

From what I got is

I will have to add checksum column for each column that may change in my source in both the places....

Source as well as Target...

But my source is in different server (DB2) so I will have to stage it first and then add checksum for like 10 changing columns and similarly in my target i will add 10 checksum columns add will compare on them.????

Is this how checksum works...?

Actually my sroucre (DB2) has 7 million records...so i can't do row by row comparision(takes more than an hour)

Please let me know whr to add columns ? in SRC STG or in TARGET?

Thanks


---------------------------------------------------

Thanks

ducon
ducon
SSC-Enthusiastic
SSC-Enthusiastic (135 reputation)SSC-Enthusiastic (135 reputation)SSC-Enthusiastic (135 reputation)SSC-Enthusiastic (135 reputation)SSC-Enthusiastic (135 reputation)SSC-Enthusiastic (135 reputation)SSC-Enthusiastic (135 reputation)SSC-Enthusiastic (135 reputation)

Group: General Forum Members
Points: 135 Visits: 286
Hi there,

Not sure this might help you out but in a project I'm working on, we have decided to go with an MD5 hash that is computed via a script task in our data flow (there are a bunch of ways to do this, that's the one we went with because we could easily switch for another scheme). The hash is saved in the dimension. When we get rows from the operational system, we compare the hash and based on that decide whether an update/insert/no action is needed.

The thing is that with checksums, hashes etc. you can have collisions, you just have to check how high the risk and if you can ever afford one. One could argue that it's the same with GUIDs, while it's highly improbable, it's not impossible. If you can't afford that then you're better off with either a column-by-column comparison or an extra column in your source data that can tell you the latest time something was changed or something like that.

Hope this helps,

Greg
Phil Parkin
Phil Parkin
SSCrazy Eights
SSCrazy Eights (8.3K reputation)SSCrazy Eights (8.3K reputation)SSCrazy Eights (8.3K reputation)SSCrazy Eights (8.3K reputation)SSCrazy Eights (8.3K reputation)SSCrazy Eights (8.3K reputation)SSCrazy Eights (8.3K reputation)SSCrazy Eights (8.3K reputation)

Group: General Forum Members
Points: 8293 Visits: 19443
SQL Learner (8/28/2009)
Thanks a lot GSquared....

From what I got is

I will have to add checksum column for each column that may change ...

Source as well as Target...

--edit


You need only a single checksum per row, based on all the columns whose values you are tracking for changes, eg

select checksum('A','B','Z')

34391



That's how you get the speed benefit - you need only compare two columns (PK and checksum) when looking for changes, rather than all columns.

Note also Gus' comments about Checksum and Binary_Checksum


Help us to help you. For better, quicker and more-focused answers to your questions, consider following the advice in this link.

If the answer to your question can be found with a brief Google search, please perform the search yourself, rather than expecting one of the SSC members to do it for you.

Please surround any code or links you post with the appropriate IFCode formatting tags. It helps readability a lot.
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search