Building ProductDim from normalized tables.

  • touching back into DW so a bit rusty. Im wanting to build a product dimension in my DW, the product information in my source db has been normalised so you have table like Product with forign keys to ProductType, ProductColor, ProductSize etc.. My question is

    When im bulding the product dimension should i be bringing over the keys as well as the description into the dimension so that i can i identify an exact product when building my FactOrders table by key when doing the lookup?

    so my product dim would look like

    ProductKey | ProductID | ProductName | ProdColorKey | ProdColor | ProdSizeID | ProdSize

    1 1233312 Vest 8876423 Green 9976552 Small

    or can i leave the keys out and just do the lookup on Name

    ProductKey | ProductName | ProdColor | ProdSize

    1 Vest Green Small

  • No, all you need are ProductKey, and maybe an alternate key.

    The rest of the columns in a Dimension table are just for grouping when you do your analysis.

  • ok so when building my OrderFact table how do i get the relevant ProductKey from the product dimension to use in the fact table for each flavour of product that was sold? Would my lookup based on the names of the product attributes like

    "Select ProductKey From DimProduct Where ProductName=Vest And ProdColor=Green And ProdSize=small"

  • At the top level, you may want to absorb the primary keys from the source system for validation purposes only and generate your own primary keys based on the source system attributes. This allows you to fully integrate multiple data sources into one system controlled by the data warehouse versus designing around all source system keys that could cause conflicts as your data warehouse grows.

    If you choose the road of generating your own primary keys based on the source attributes, if the attribute for Vest changes one day, you will need the source primary key for Vest to validate the change. Otherwise, when Vest changes, it becomes a new record in that production dimension you created because you cannot validate if this new record is truly new or an update to an existing record.

    I would bring over all the attributes and use the source keys for validation only. Then generate new keys controlled by the data warehouse for each attribute absorbed from any source system. This allows me to integrate all data sources into one unified data model that looks and feels like the one source of truth.

    That being said, if you take the approach of denormalizing three tables into one table, then you also have to go from three source primary keys to one primary key generated by your data warehouse. You have to track and validate that consolidation to truly validate each attribute for Name, Color and Size. That way if any of those attributes change, you can correctly address them in your new denormalized product dimension.

    You can achieve this at the easiest level by bringing over those three primary keys and treating them as the alternate keys in the dimension like you suggested (ProductID | ProductName | ProdColorKey | ProdColor | ProdSizeID | ProdSize). Then generate that new primary key by your data warehouse as ProductKey. Then your FactOrder contains only the ProductKey foreign key that allows you to look at your order by Name, Color or Size.

    -- Showing all FactOrder records where the order had a Green product color.

    SELECT

    o.[OrderID]

    ,p.[ProdName]

    ,p.[ProdColor]

    ,p.[ProdSize]

    FROM [DW].[dbo].[FactOrder] o

    INNER JOIN [DW].[dbo].[DimProduct] p ON p.[ProductKey] = o.[ProductKey]

    WHERE p.[ProdColor] = 'Green'

    If the same product has multiple colors and sizes, then obviously you will have redundant data for ProductName. You may want to address that by splitting out color and sizes into their own dimensions. Size for example could actually be treated as a metric and embedded on the fact where you only have DimProduct and DimProductColor where three keys are assigned. ProductKey, ProductColorKey and ProductSize.

    -- Showing all FactOrder records where the order had a Green product color.

    SELECT

    o.[OrderID]

    ,p.[ProdName]

    ,c.[ProdColor]

    ,o.[ProdSize]

    FROM [DW].[dbo].[FactOrder] o

    INNER JOIN [DW].[dbo].[DimProduct] p ON p.[ProductKey] = o.[ProductKey]

    INNER JOIN [DW].[dbo].[DimProductColor] c ON c.[ProductColorKey] = o.[ProductColorKey]

    WHERE c.[ProdColor] = 'Green'

  • Great reply thanks.

    I was thinking exactly the two options you mentioned. I was even thinking my table structure would be the cases when you go to a snowflak structure., But youve cleared this up.

    Cheers Pal.

  • ps_vbdev (8/22/2016)


    Great reply thanks.

    I was thinking exactly the two options you mentioned. I was even thinking my table structure would be the cases when you go to a snowflak structure., But youve cleared this up.

    Cheers Pal.

    No problems.

    Snowflake is still fine to go here too. For example, color would make sense to still separate and branch off products. Either way, I think it's good to not consolidate the product dimension with every possible color and size per product. If there is a lot of products with many colors and sizes, then you're looking at a very large product dimension table. You should try to minimize that and look into separation into that star or snowflake.

    The size is really up to you. I deal with advertising myself and have creative sizes in dimensions. While it's not a true metric to the database, it's still a measurement and I treat 100x100 as a metric that only lives on the fact.

    This should all work fine if you are going to make a product fact table on sales or something similar. Obviously if the product does not have color and size, you can leave those NULL. If you have other requirements to track something like services with products, then it makes since to split services out into it's own fact where you have FactProductSales and FactServiceSales for example.

  • If the same product has multiple colors and sizes, then obviously you will have redundant data for ProductName. You may want to address that by splitting out color and sizes into their own dimensions.

    Tread carefully here. While to some extent the path is going to be dictated by business needs, my initial reaction would be to have this information in the same dimension available as independent attributes. If you split it out, it seems off hand that several questions won't be easily answered, such as which product/size/color is not selling. Color and size are attributes of the product, not independent items related to the order in the same manner as product.

    I was thinking exactly the two options you mentioned. I was even thinking my table structure would be the cases when you go to a snowflak [sic] structure., quote[/]

    This would not be a case for the snowflake structure. It's a straightforward hierarchy. Snowflakes are not good ideas, but occasionally necessary when joining additional information to an attribute. For example, if you had the hierarchy Country-State-Country-City-Zip, and another table with additional information about counties, a snowflake might be a better option than to integrate these attributes for something that is midway in the hierarchy. Too many OLAP designers use snowflake because they can't get past the paradigm shift from having designed OLTP databases.

  • RonKyle (8/22/2016)


    If the same product has multiple colors and sizes, then obviously you will have redundant data for ProductName. You may want to address that by splitting out color and sizes into their own dimensions.

    Tread carefully here. While to some extent the path is going to be dictated by business needs, my initial reaction would be to have this information in the same dimension available as independent attributes. If you split it out, it seems off hand that several questions won't be easily answered, such as which product/size/color is not selling. Color and size are attributes of the product, not independent items related to the order in the same manner as product.

    Indeed, that is a consideration too for sure. You can still measure what is not selling if they are split, but you don't know for sure if the color and or size listed is offered in the specific product. It may show false negatives where the listing of products, sizes and colors are not real combinations and are showing a negative impact of not selling. If you stuck with the consolidation and the redundant data of products, sizes and colors, then you would get true negatives where the listing of product, sizes and colors are REAL combinations and are showing a negative impact of not selling. True is better than false and only if it makes sense for the business to as RonKyle pointed out, is part of the business requirements on answering with reporting.

    Good stuff. Thanks Ron.

  • I would have to say resist the temptation to think about space first and the simple layout of a denormalized design second. You are thinking like a designer of a operational system here and not a warehouse. Keeping the color and size in the product dimension makes sense unless proven otherwise. Such when you are dealing with frequently changing attributes.(<-- Here the use of a minidimension is employed. Ralph Kimball addresses this scenario in his works.) The dimension is self learning /growing based on valid values encountered in the operational data. If you instead create two tables like CarMake and CarModel ... you can theoretically end up with combination like toyota accord and ford prius, how do you know which cross tab is valid? You have to create the extra links to the fact table to know. And it would not cover the cases where there are zero sales for a valid make and model combination (where all of this could be prepopulated to a flat dimension).

    ----------------------------------------------------

  • Well i included color and size in the product dimension which worked fine until i discovered that we can have draft orders that have unsized product that we may want to report on. so as the size key are in the product dim and we dont have sizes for some i can only point my nomatch row at the -1 record in ProductDim"No Valid Product" to indicate an no matching product even tho its only the size missing.

    if i take the size out and make it its own dimension i can then match on product/color/type in the productDim and then do another lookup on the SizeDim so for the unsized i can set the nomatch row accordingly.

    this gives me more info for unsized orders so instead of just haveing a key in the fact table pointing to the nomatch row in the productdim - "No valid product" I will be able to match the product/color/type and then return the nomatch row in the sizedim for the unsized products. I think this is better?

    thoughts guys?

  • You make unsized a size for each product that needs it.

  • Ditto for the other attributes as well. You may very well have limitation on what is possible. You may have it so that you cannot have a certain size in a certain color. In which case the size dimension will not be independent, as a FK to the fact table. These members should not form their own dimension.

    ----------------------------------------------------

  • These members should not form their own dimension.

    Seconded. Do not go down that route. The size, and lack of size, is an attribute of the product, not the order.

  • Ok thank for the advice guys very helpful. Is this approach off having a sized and unsized record (and others if needed) preferably even tho it will double the rows to load into the dim?

  • " even tho it will double the rows to load into the dim"... so? If you have a normal data warehouse, the number of rows in a dimension table is usually orders of magnitude smaller than a fact table.

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply