RE: Speeding Up Incremental ETL Proceses in SSIS by Using MD5 Hashes

SSC Journeyman

Points: 84

June 8, 2010 at 9:48 am

belldb (6/8/2010)
I see what you're trying to point out, but keep in mind that the basis for checking individual records is 1:1 because the lookup is joined on the primary key, therefore, you're not checking against the entire dataset, which could create the behavior you describe.
I am interested in seeing which specific changes to a record could result in generating the identical MD5 checksum.

Cool, I don't have the sample database loaded, so didn't know for certain you're joining on the primary key column, since ContactID doesn't have a primary key defined in the code sample. In light of this new information, the 1:1 should make MD5 just fine.

About the only other thing I have to add is that most hashes output a fixed length latin value (CHAR 40 for SHA1), so I'd probably use CHAR instead of either VARCHAR, NCHAR, or NVARCHAR for storing the hash value - this post is about maximizing speed after all.

Even with 1:1 comparisons, the wider your input for the hash becomes, the greater the probability for the hash collision. Although I don't have a reference for a speed difference in SQL Server between MD5 versus SHA1 hashing, SHA1 is plenty fast enough especially since it's safer. We should all try to be as close to ACID as possible, so I'll simply never use MD5 anymore.