Star Schema Design - Multiple Sources

  • Hi,

    I am planning the design of a star schema for a data warehouse. The data will be loaded from 2 separate staging tables (ClientTbl and SystemTbl).

    The end user will need to be able to compare any differences between the 2 sets of data. They will also need to be able to match items from either source. e.g. ItemA from SourceA matches ItemA from SourceB.

    For the design, I am considering having both sources on one row in the Fact Table. So for SourceA (ClientTbl source), there would be NULL values for SourceB columns (SystemTbl source) and vice versa. The 'Source' column is then used to flag whether the data is Client source or System source.

    I am thinking it will be easier to do this design especially when we may need to match items. So on a match, the NULLs would be replaced with the matched item's values.

    Am I correct with this design or is there a better solution? Thanks!

    Screenshots of table data and schema design attached.

  • If you're just doing a comparison between two data sources that have the same attributes and the only difference is the source, then why not combine both of these attributes into one and add the SourceCode in the dimension not the fact?

    For example:

    DimProducts

    ProductID

    SourceCode

    ProductName

    The SourceCode would be:

    N = No Source

    C = Client Source

    S = System Source

    Then you can remove some of those other Key fields from your Fact and just focus on assigning your new dim like DimProducts back onto the table with the unique identifier like ProductID? This should allow you to query the Fact and filter between Client and System by filtering the dimension like SourceCode = C.

    The only problems in my example is what happens if you have 100,000 products? This gets duplicated per source. If you have 2 sources, then that dimension now has 200,000 records. For each source you add, it will add 100,000 records to that table.

  • xsevensinzx (9/30/2016)


    If you're just doing a comparison between two data sources that have the same attributes and the only difference is the source, then why not combine both of these attributes into one and add the SourceCode in the dimension not the fact?

    For example:

    DimProducts

    ProductID

    SourceCode

    ProductName

    The SourceCode would be:

    N = No Source

    C = Client Source

    S = System Source

    Then you can remove some of those other Key fields from your Fact and just focus on assigning your new dim like DimProducts back onto the table with the unique identifier like ProductID? This should allow you to query the Fact and filter between Client and System by filtering the dimension like SourceCode = C.

    I agree with this design. The only thing I'd disagree with is the following:

    The only problems in my example is what happens if you have 100,000 products? This gets duplicated per source. If you have 2 sources, then that dimension now has 200,000 records. For each source you add, it will add 100,000 records to that table.

    The design proposed by the OP also has a row per source per product just with a bunch of NULL columns.

    I'd go with:

      SourceAccountKey

      Source

      SourceIdentifierKey

      PeriodKey

      CurrencyKey

      Quantity

      Price

    And all the columns should be not null.

    For reporting if you need to see the Client and System data side by side in one row instead of a 2 rows you can just pivot the data.

  • Hi,

    How do I allow for an item in SourceA to be matched to an item in SourceB?

    So in my report output, if the user has indicated these 2 items are to be matched, then they should appear one one row.

    Thanks!

  • kevin.obrien 66193 (10/9/2016)


    Hi,

    How do I allow for an item in SourceA to be matched to an item in SourceB?

    So in my report output, if the user has indicated these 2 items are to be matched, then they should appear one one row.

    Thanks!

    I am not able to see the images you attached. Perhaps you provide a sample of what your data looks like for the product dimension. My immediate hunch is that source is not really an attribute of a product. If you only get productA from source 1 and productB from source 2 , etc.. then I can go with keeping the source attribute in the product dimension. As you describe however, a product can come from both sources thus are to be 'matched.' I assume this means to see if they are the same product. So I would separate out the source especially if the number of sources can range in the hundreds.

    ----------------------------------------------------

  • kevin.obrien 66193 (10/9/2016)


    Hi,

    How do I allow for an item in SourceA to be matched to an item in SourceB?

    So in my report output, if the user has indicated these 2 items are to be matched, then they should appear one one row.

    Thanks!

    I think all reporting platforms have the ability to PIVOT/Cross Tab the data in report, which is where I would to the pivoting because it is simpler there. But, you can PIVOT in SQL something like this:

    DECLARE @Table TABLE

    (

    ProductKey INT,

    SOURCE VARCHAR(20),

    SourceIdentifierKey INT,

    PeriodKey INT,

    CurrencyKey INT,

    Quantity INT,

    Price DECIMAL(18, 4)

    );

    INSERT INTO @Table

    (ProductKey, SOURCE, SourceIdentifierKey, PeriodKey, CurrencyKey, Quantity, Price)

    VALUES

    (1, -- SourceAccountKey - int

    'A', -- SOURCE - varchar(20)

    1, -- SourceIdentifierKey - int

    1, -- PeriodKey - int

    1, -- CurrencyKey - int

    10, -- Quantity - int

    15.20 -- Price - decimal(18, 4)

    ),

    (1, -- SourceAccountKey - int

    'B', -- SOURCE - varchar(20)

    1, -- SourceIdentifierKey - int

    1, -- PeriodKey - int

    1, -- CurrencyKey - int

    17, -- Quantity - int

    15.80 -- Price - decimal(18, 4)

    ),

    (2, -- SourceAccountKey - int

    'A', -- SOURCE - varchar(20)

    1, -- SourceIdentifierKey - int

    1, -- PeriodKey - int

    1, -- CurrencyKey - int

    7, -- Quantity - int

    11.20 -- Price - decimal(18, 4)

    ),

    (2, -- SourceAccountKey - int

    'B', -- SOURCE - varchar(20)

    1, -- SourceIdentifierKey - int

    1, -- PeriodKey - int

    1, -- CurrencyKey - int

    27, -- Quantity - int

    10.80 -- Price - decimal(18, 4)

    );

    ;

    SELECT

    *

    FROM

    (

    SELECT

    P.ProductKey,

    P.PeriodKey,

    P.CurrencyKey,

    P.A,

    P.B,

    'Price for ' + T2.SOURCE AS Source,

    T2.Price

    FROM

    (

    SELECT

    T.ProductKey,

    T.SOURCE,

    T.PeriodKey,

    T.CurrencyKey,

    T.Quantity

    FROM

    @Table AS T

    ) AS T1 PIVOT ( SUM(Quantity) FOR SOURCE IN ([A], ) ) AS P

    JOIN @Table AS T2

    ON P.ProductKey = T2.ProductKey

    ) AS P1 PIVOT ( AVG(Price) FOR SOURCE IN ([Price For A], [Price For B]) ) AS P1;

    If you need to be more dynamic and can't pivot in the reporting layer, you should read the 2 Crosstab & Pivot articles linked in my signature

  • kevin.obrien 66193 (10/9/2016)


    Hi,

    How do I allow for an item in SourceA to be matched to an item in SourceB?

    So in my report output, if the user has indicated these 2 items are to be matched, then they should appear one one row.

    Thanks!

    You might consider a table with surrogate key, source code, and sourceID.

    When they are matched, you assign the same surrogate key.

    The surrogate key is best a key only in the Data Warehouse. You may add other sources, and some Items will not exist in all systems.

    This can get rather interesting, as an Item may have several suppliers, each with different suppliers and pricing.

    And likely an Item number unique to each supplier.

    Be sure to talk through and point out with the users some of the design considerations.

  • The issue I see is that you can have one product out of many that only uses source D for example, while the rest are varied among the other sources. This case will give the table a swiss cheese appearance with all other products in the table containing null (or the equivalent value) there. If there is no significant relationship between the product and its source, then there is nothing wrong with having a source dimension. This is done in a normal star schema with the concept of a supplier. In short, the attributes of a table should relate to the key field.

    Of course... It really comes down to mirroring this business procurement process. If indeed there is a relationship between what provides what product, then that is a valid attribute of a productID.

    ----------------------------------------------------

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply