Data Warehouse Design Issue

Question

Data Warehouse Design Issue

mar.ko

Default port

Points: 1451
More actions
July 7, 2015 at 12:24 pm

#299871

Guys - I got no response in the database design subforum, so I thought I would cross post this here:
We are stuck with the task of building a warehouse where the major source of data will be SalesForce.
SalesForce uses these large GUID strings (15 to 32 chars).
Would performance be improved significantly if every time we imported into the warehouse, we would first convert the GUID's to unique long integers ?
In that manner, the linkages across the dimensions and fact tables should be much faster.
On top of that, a considerable amount of space would be saved as well.
Is this worth looking into ? Has anyone faced this as an issue before ?

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

stevefromOZ SSC-Forever Points: 43646 More actions · Answer 1

The SF GUIDs are a max of 18, so CHAR(18) although I've seen different systems (DBAmp OLE DB driver for one) declare them as NCHAR(18). You will almost certainly want to keep/store these so you can have lineage/reference to the source record, so you're going to be hit with the 18bytes (or 36 if you use NCHAR) per row anyway. Adding a bigint will use 8bytes.

For me, I would assume that the bigint PK/FK and indexing performance *would* be better than using the natural key but you will have some significant overhead in your ETL in adding the surrogate, also every lookup will come down to you with the PK GUID of the lookup object.

It almost comes down to how much data, which would impact the performance of an indexed string versus the performance of bigint.

Steve.

mar.ko Default port Points: 1451 More actions · Answer 2

Thanks Steve.

re: "lineage back to the source record".

Good point, but to do the conversion to BigInt, we would have a reference table with the GUID's and new BigInt references.

There would be a small hit every time the fact table is loaded: We would have to fetch the BigInt if the GUID existed in the reference table, if not, insert a new GUID/BigInt pair.

stevefromOZ SSC-Forever Points: 43646 More actions · Answer 3

sortof (ok, I mean, I don't think I agree 😀 )

That "reference" table you refer to, is the dimension table. And the process of checking for existance (with a system that has tight referential integrity like SF) is not too taxing as long as you load your dimensional data first (but then again, you will *always* want to do that, regardless of whether you are using surrogate keys or not).

e.g. Load Account to DimAccount, then when loading transactions related to Account, perform a lookup using the ID (GUID) on the DimAccount and output a surrogatye key value that's a bigint.

Just so you feel better, tis is not a new issue to solve :hehe:

From 2002 - article

From a few days ago - blog post

Steve.

mar.ko Default port Points: 1451 More actions · Answer 4

I see what you mean, but what about the FACT table ?

The idea was to place the converted BigInt keys to the dimension tables in there.

Thus, when loading either FACT or DIMENSION table, a lookup of the BigInt key via GUID must be performed.

Technically with 50 dimensions, I am saving 500 bytes per row or roughly 1/2 gigabyte per million rows.

More importantly, index resolution/lookup is much faster on numbers vs. strings.

stevefromOZ SSC-Forever Points: 43646 More actions · Answer 5

Your last reply seems to suggest support for both approaches (the pain of a lookup would make using natural (GUID) keys more appealing), but then you seem concerned about the data storage.

Taking on the storage question first, your anticipated row counts are what's important to know - say if you anticipate 500 transactions a day then let's call it ~200k/yr, you're half GB storage saving doesn't mean much as it would take 5 yrs to get to the 1MM row mark (for a single fact table). If you're expecting more data than that, then the sums change obviously.

Thinking of the lookups, unless you've got a modified SFDC environment, and one that you're assured won't be changed back, the GUID is typically the unique identifier as different records are allowed to have the same name (e.g. Account; custom objects can be different though). So you're going to need to store that GUID in the dimension. And yes, the dimension load will require a lookup to determine if it's an insert or update, but this can work pretty well (ie fast) if you've got the GUID indexed. And yes, the fact load will require a lookup (I'd use a join personally) to convert the GUID to a int based key value.

Lastly, you mention that the index performance is much better int vs string - generally true but i'd suggest testing it, as while these are strings, they're char(18) (not varchar, there is no variability). I'd also ask - where/when is this important to you? If you're using this data store to load a cube, then index performance is a one time hit (ie load, say daily) but if you anticipate having users who'll be running reports directly against the DB, then it may come into play again here.

I've got some SFDC data in SQL, using the GUIDs as keys/FK's (ie indexed) with counts in the low millions and performance is generally acceptable, but of course, your milage may vary 🙂

Steve.

mar.ko Default port Points: 1451 More actions · Answer 6

Thanks so much for your feedback Steve.

It would really be a great study in design and performance....I wish someone ELSE had already done it !!!

Eric M Russell SSC Guru Points: 125522 More actions · Answer 7

mar.ko (7/7/2015)
Guys - I got no response in the database design subforum, so I thought I would cross post this here:
We are stuck with the task of building a warehouse where the major source of data will be SalesForce.
SalesForce uses these large GUID strings (15 to 32 chars).
Would performance be improved significantly if every time we imported into the warehouse, we would first convert the GUID's to unique long integers ?
In that manner, the linkages across the dimensions and fact tables should be much faster.
On top of that, a considerable amount of space would be saved as well.
Is this worth looking into ? Has anyone faced this as an issue before ?

Joining fact with dimension tables using sequential integer based surrogate keys is a widely used data warehousing practice. I'd suggest reading this book.

http://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

stevefromOZ SSC-Forever Points: 43646 More actions · Answer 8

Joining fact with dimension tables using sequential integer based surrogate keys is a widely used data warehousing practice.

It absolutely is, and I would use it 9 out of 10 times. But with this SF data, you already have a unique surrogate key. It comes down to whether you want to i) store it in the fact (because you have to store it in the dim/s); and ii) whether you'd prefer to use INT's (or bigints) because of a perceived or actual performance gain.

Steve.

Eric M Russell SSC Guru Points: 125522 More actions · Answer 9

stevefromOZ (7/9/2015)
Joining fact with dimension tables using sequential integer based surrogate keys is a widely used data warehousing practice.
It absolutely is, and I would use it 9 out of 10 times. But with this SF data, you already have a unique surrogate key. It comes down to whether you want to i) store it in the fact (because you have to store it in the dim/s); and ii) whether you'd prefer to use INT's (or bigints) because of a perceived or actual performance gain.

Those GUIDs may be surrogate keys within the context of the source system, but they're not suitable for the fact table. The point of having integer based surrogate keys in the fact table is join performance and minimizing the width of each fact row and size of the fact table.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

RonKyle SSC-Dedicated Points: 31537 More actions · Answer 10

I have a similar issue with one cube. I insert them into a reference table in the final staging area. The essential columns are the unique incrementing id and the GUID. This id is inserted into the fact table. The joins become a little more complex, but the space savings really adds up.

mar.ko Default port Points: 1451 More actions · Answer 11

Thanks for chiming in Eric....that was my feeling all along.

mar.ko Default port Points: 1451 More actions · Answer 12

re: "but they're not suitable for the fact table"

That was my feeling all along.

Thanks for chiming in Eric...and thanks for the reference to the book.