Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 12»»

Checksum function??? Expand / Collapse
Author
Message
Posted Monday, August 24, 2009 1:12 PM
Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Friday, September 12, 2014 11:45 AM
Points: 336, Visits: 867

Hi

Can anyone guide me on using checksum to compare million records in source and target...

I have to update a dimension if sumthing changes, I cant use lookup as it takes lot of time...

If anyone has ever used checksum() function, can you guide me on how to use it...

Any help is appreciated.

thanks



---------------------------------------------------

Thanks
Post #776261
Posted Tuesday, August 25, 2009 11:47 AM
Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Friday, September 12, 2014 11:45 AM
Points: 336, Visits: 867
any help?


---------------------------------------------------

Thanks
Post #776928
Posted Tuesday, August 25, 2009 11:54 AM


SSC-Insane

SSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-Insane

Group: General Forum Members
Last Login: Today @ 8:11 PM
Points: 20,864, Visits: 32,902
Here is my two cents. Don't use it. It is not a guarantee that it will identify all changes to your data. It is possible for a combination of fields to result in the same checksum value. Your best way to identify changes is, in my opinion, a column by column comparision.





Lynn Pettis

For better assistance in answering your questions, click here
For tips to get better help with Performance Problems, click here
For Running Totals and its variations, click here or when working with partitioned tables
For more about Tally Tables, click here
For more about Cross Tabs and Pivots, click here and here
Managing Transaction Logs

SQL Musings from the Desert Fountain Valley SQL (My Mirror Blog)
Post #776933
Posted Tuesday, August 25, 2009 12:18 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Thursday, December 18, 2014 9:58 AM
Points: 13,872, Visits: 9,600
You can use a checksum to narrow down the search. It can make a good leading-edge for what would otherwise be a multi-column index.

Checksum isn't exact, but it can reduce the number of rows that need to be compared, which can reduce I/O on large data sets.

What you do is work out what columns you are most likely to need to compare (might be all of them, might be a subset), and generate checksums for them. Persist and index those. Then, join the source table and target table on the checksum columns, as well as other criteria.

I've done speed tests with this in very large tables (one was half a petabyte for one table with hundreds of millions of rows), and it can result in much faster returns and much less I/O, when used on multi-column seeks/compares.

That's the general idea. Are you looking for specifics? If so, please clarify what specifics you need help on.


- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
Post #776958
Posted Tuesday, August 25, 2009 12:30 PM


SSC-Insane

SSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-Insane

Group: General Forum Members
Last Login: Today @ 8:11 PM
Points: 20,864, Visits: 32,902
Gus, are (did) you use CHECKSUM or BINARY_CHECKSUM? If I was going this direction, I would look at the later. Of course, I don't have the volume of data you are talking about either.



Lynn Pettis

For better assistance in answering your questions, click here
For tips to get better help with Performance Problems, click here
For Running Totals and its variations, click here or when working with partitioned tables
For more about Tally Tables, click here
For more about Cross Tabs and Pivots, click here and here
Managing Transaction Logs

SQL Musings from the Desert Fountain Valley SQL (My Mirror Blog)
Post #776970
Posted Tuesday, August 25, 2009 12:42 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Thursday, December 18, 2014 9:58 AM
Points: 13,872, Visits: 9,600
I'm pretty sure it was checksum, not binary. Since I didn't care about the difference in case, which is the primary difference between the two.
select checksum('A'),checksum('a'), binary_checksum('A'), binary_checksum('a');

Results:
142 142 65 97

With what I was doing, case differences didn't matter, only actual content differences. That's the decision that determines which you should use. Of course, it's also critical that you use the same one on both sides of the equation.


- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
Post #776993
Posted Wednesday, August 26, 2009 1:10 PM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Monday, December 9, 2013 10:10 AM
Points: 167, Visits: 154
How wide is the dimension, and are we talking type 1 or type 2 SCD columns?
Post #777784
Posted Friday, August 28, 2009 9:26 AM
Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Friday, September 12, 2014 11:45 AM
Points: 336, Visits: 867
Thanks a lot GSquared....

From what I got is

I will have to add checksum column for each column that may change in my source in both the places....

Source as well as Target...

But my source is in different server (DB2) so I will have to stage it first and then add checksum for like 10 changing columns and similarly in my target i will add 10 checksum columns add will compare on them.????

Is this how checksum works...?

Actually my sroucre (DB2) has 7 million records...so i can't do row by row comparision(takes more than an hour)

Please let me know whr to add columns ? in SRC STG or in TARGET?

Thanks




---------------------------------------------------

Thanks
Post #779198
Posted Sunday, August 30, 2009 7:04 PM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Friday, September 19, 2014 1:52 PM
Points: 122, Visits: 274
Hi there,

Not sure this might help you out but in a project I'm working on, we have decided to go with an MD5 hash that is computed via a script task in our data flow (there are a bunch of ways to do this, that's the one we went with because we could easily switch for another scheme). The hash is saved in the dimension. When we get rows from the operational system, we compare the hash and based on that decide whether an update/insert/no action is needed.

The thing is that with checksums, hashes etc. you can have collisions, you just have to check how high the risk and if you can ever afford one. One could argue that it's the same with GUIDs, while it's highly improbable, it's not impossible. If you can't afford that then you're better off with either a column-by-column comparison or an extra column in your source data that can tell you the latest time something was changed or something like that.

Hope this helps,

Greg
Post #779665
Posted Monday, August 31, 2009 3:48 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Today @ 12:24 AM
Points: 5,317, Visits: 12,355
SQL Learner (8/28/2009)
Thanks a lot GSquared....

From what I got is

I will have to add checksum column for each column that may change ...

Source as well as Target...

--edit


You need only a single checksum per row, based on all the columns whose values you are tracking for changes, eg

select checksum('A','B','Z')

34391

That's how you get the speed benefit - you need only compare two columns (PK and checksum) when looking for changes, rather than all columns.

Note also Gus' comments about Checksum and Binary_Checksum




Help us to help you. For better, quicker and more-focused answers to your questions, consider following the advice in this link.

When you ask a question (and please do ask a question: "My T-SQL does not work" just doesn't cut it), please provide enough information for us to understand its context.

It is better to keep your mouth shut and appear stupid than to open it and remove all doubt. (Mark Twain)
Post #779757
« Prev Topic | Next Topic »

Add to briefcase 12»»

Permissions Expand / Collapse