My fifth data warehousing tip is to always use surrogate keys for your dimension and fact tables.
Surrogate keys are keys that are generated internally by the data warehouse. They have nothing to do with the business keys. The business keys are the unique row identifiers from your source systems. You definitely want these too, as otherwise you won’t be able to map the data in your data warehouse back to your source system (see data warehousing tip #1). However, you shouldn’t be relying on the business keys as your unique identifier in the data warehouse.
One of the underlying reasons that surrogate keys are a must for a data warehouse is that data warehouses can consume from multiple source systems. If you were to rely on the business keys as your unique identifier you could end up with overlapping key values and you certainly could never assume that you wouldn’t.
I’ve had complaints before that a particular data warehouse doesn’t need surrogate keys because it only has one source system. Whilst in principle this sounds reasonable here are some reasons why you should always use surrogate keys regardless of how many data sources you have.
You can make your keys smaller and more efficient
You could have composite keys in your source system. Where the uniqueness of the record is formed from more than one column. If this is the case and you use the business key as your unique identifier in the data warehouse you’d have to propagate all of the columns in the composite key to all of your child tables. Then when users are querying they’ll also have to include all columns in the join. This in itself makes the data warehouse slower to query and more cumbersome to use.
You may have some business keys that are BIGINT whereas you know that you can use INT in the data warehouse. You may have some business keys that are strings. By creating a surrogate key as a suitable data type such as a 1, 2, 4, or 8 byte integer you can have a key that is as small as possible, which means that storage is reduced, and query times are improved. Just make sure that you are catering for future growth when deciding on the size of your key.
Slowly changing dimensions
So, lets say you’ve built your shiny new data warehouse using business keys because there’s only one source system. Awesome. Now, someone comes up to you and asks you to retain a user’s address history in the user table. Fine, no problem. We’ll just change DimUser to a type-2 slowly changing dimension. Erm, not that easy if you only have the business key. Why? Well your business key (let’s say UserId) won’t be unique any more if you add another user record to capture their change of address. You’d then have to create some sort of composite key on the UserId and some unique part of the address, and then propagate this composite key into your fact table. Hmm…
Ideally we want our new data warehouse to have some sort of shelf life. We know that systems get replaced all of the time, but ideally not ours, at least not for a good while. If the business loves your data warehouse (and why wouldn’t they?) they might get excited about the possibilities. They might want to bring in different data sources. If your dimension and fact tables only have business keys as primary keys then you’ve got an issue. You can’t just pull in data from a different source if you are planning on adding it to some of the existing dimensions and facts. You’ll need a complete overhaul of your data warehouse if you want to do this. Your ETL/ELT will change. So will your staging environment, your dimensions and facts, your stored procedures, and your reports and dashboards.
It is best practice to use surrogate keys in a data warehouse. Hopefully the arguments above are compelling enough reasons as to why they are so important. It really is no extra work to implement them. So please get it done right first time as you’ll never know what the future may hold.