Jeff Moden wrote:
With that, I have to ask, since MD5 returns a 16 Byte result and you convert that to BIGINT, which is an 8 Byte result (which is a whopping 8 orders of magnitude less), have you had any problems with collisions between non-duplicate values?
You're right, Jeff, it is a truncated MD5 hash. Still BIGINT is 64 bits or -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807). In practice, I'm only using the MD5 surrogate keys on meta-data tables, so while collisions are possible, it's not likely to occur across a recordset numbering in a few thousand.
What I'm actually doing is developing a CCPA Update engine that's responsible for executing updates across 20+ servers and 1000+ columns we're identified as containing PII. The CPColumnID is hashed across ServerName, DatabaseName, SchemaName, TableName, ColumnName. I needed some type of computed surrogate key, because CPColumnID is used as a foreign key in several different tables used for mapping, tasks, and auditing. I liked hashing because the ID values are portable and consistent across multiple development and QA environments.
I actually did encounter collisions initially when casting MD5 as INT and generating IDs for a table containing a million randomly generated records. However, it doesn't happen with BIGINT, even when repeating the test many times. But, yeah, I do sometimes think about whether I should have just gone with the ubiquitous INT IDENTITY(1,1).
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho