Generating Surrogate-Keys for Type 1 and Type 2 dimensions using SSIS

  • Comments posted here are about the content posted at http://www.sqlservercentral.com/columnists/hbhaiya/2783.asp

  • Why don't you use GUIDs for the surrogate-keys?

  • I think this was a good and thorough article on SCD's.

    One thing of note if you get the highest key value and manually set surrogates versus using identity columns then you cannot load a dimension in parallel.

    Also be wary of using the checksum to determine record changes, it is possible, however unlikely, that two different records could produce the same checksum.  Here is a quote from BOL:

    "CHECKSUM satisfies the properties of a hash function: CHECKSUM applied over any two lists of expressions returns the same value if the corresponding elements of the two lists have the same type and are equal when compared using the equals (=) operator. For the purpose of this definition, NULL values of a given type are considered to compare as equal. If one of the values in the expression list changes, the checksum of the list also usually changes. However, there is a small chance that the checksum will not change."

    GUID's may be great in that you don't need to worry about uniqueness, but there are a couple of reasons not to use GUID's as surrogates.  One is that they are 4 times larger then an integer.  May not sound like a big deal but when you are dealing with potentially hundreds of millions of records with an expectation of sub-second response time on reports, every little bit helps.  Also you may want to create clustered indexes on the surrogate in the fact tables.  Since GUID's are not sequential, at least in not in SQL2K, then you will quickly fragment your fact tables.

    David

  • Thank you for writing this article, Harsh. It's a good article. I learned about the version number from your article; I've used effective and expiry date and current version column but I have come across the version number, so thanks for the valuable input.

    I tend to agree with David that using identity is better than GUID or max+1. As David correctly say if we use max+1 we could have problem when loading in parallel. I've seen an application (not DW though, it was an OLTP) used max+1 to generate ID and we experienced contention and locking problems. Last year when building a DW in a project somebody suggested to use GUID and we decided not too because of the same reason as David mentioned, i.e. performance. Not on the dimension tables, but performance of the fact table as fact tables have many SKs columns. I used big int sometimes for customer and product dimensions but all other dimensions normal int usually sufficient.

    Looking forward to your next article.

    Kind regards,

    Vincent

  • can I download the code in your article .

    both the database and the ssis package.

     

     

    thanks in advance

    Ran

     

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply