Loading 150 million rows of flat structure data into a star schema

  • Hi folks,

    I have a scenario where I am receiving 150 million rows in a flat structure table (156 columns). We are migrating from this flat structure to a star schema (26 dimensions and 1 fact table).

    Our ETL expert has designed the migration architecture as below, with an incremental load option.

    1. Get unique values from the flat structure table to load the 26 dimension/master tables, using T-SQL.

    2. Check whether the unique values are already present in the respective dimension. If not, load them into the dimension table; if yes, skip the record.

    3. Back-update the dimension keys (auto-generated in the dimension tables) into additional key columns on the flat structure table, using a batch update (see the sketch below).

    4. Load the fact table with a straightforward insert (bulk load).
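
    Roughly, steps 1 to 3 look like the T-SQL below for a single dimension (the table and column names are placeholders for illustration only, not our real schema):

        -- Steps 1 and 2: insert distinct values that are not yet in the dimension
        INSERT INTO dbo.DimCustomer (CustomerName)
        SELECT DISTINCT f.CustomerName
        FROM dbo.FlatSource AS f
        WHERE NOT EXISTS (SELECT 1
                          FROM dbo.DimCustomer AS d
                          WHERE d.CustomerName = f.CustomerName);

        -- Step 3: back-update the generated surrogate key into the flat table, in batches
        DECLARE @rows INT = 1;
        WHILE @rows > 0
        BEGIN
            UPDATE TOP (100000) f
            SET    f.CustomerKey = d.CustomerKey
            FROM   dbo.FlatSource AS f
            JOIN   dbo.DimCustomer AS d ON d.CustomerName = f.CustomerName
            WHERE  f.CustomerKey IS NULL;

            SET @rows = @@ROWCOUNT;
        END;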

    Problem areas:

    1. Taking unique values from the flat structure and loading the dimensions is done with a T-SQL query and a Lookup transformation with the cache option. This process dies on a bulk load of 30+ million rows, and even with 5 million rows the execution time is long.

    2. Back-update - is there a more efficient way to back-update the dimension keys?

    3. The complete process is time consuming; it took more than a day.

    Server Info:

    The package is deployed on Server 1 - 10 GB RAM for SQL Server, 8 processors.

    The source and target databases are on Server 2 - 100 GB RAM for SQL Server, 16 processors.

    Quick help is much appreciated! Let me know in case you need additional info.

    Abhijit - http://abhijitmore.wordpress.com

  • First, some quick questions:

    1) Data sizes?

    2) Available high performance storage space?

    3) Type of storage available?

    I suggest using a staging area on the target SQL Server and replacing the Lookup transformation with a MERGE statement, along the lines of the sketch below.
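
    A minimal sketch of such a MERGE for one dimension (the table and column names here are made up for illustration):

        MERGE dbo.DimCustomer AS d
        USING (SELECT DISTINCT CustomerName
               FROM dbo.StagingFlat) AS s
            ON d.CustomerName = s.CustomerName
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (CustomerName)
            VALUES (s.CustomerName);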

    Forgive me if I'm a tad slow here, but would the back-update be late arriving?

    150M rows isn't a large number; a full day for loading it is.

    😎

  • Thanks Eirikur,

    1) Data sizes? - 30 GB

    2) Available high-performance storage space? 1 TB disk, 97 GB SQL Server memory, 16 processors

    3) Type of storage available? - RAID 1

    > Already using a staging area.

    > Do you mean a MERGE statement in T-SQL, a Merge Join, or a Merge transformation?

    > You are correct that the major problem area is the back-update, but it is required to update the dimension keys.

    The data volume could vary from 5 million to 150 million rows.

    Abhijit - http://abhijitmore.wordpress.com

  • Abhijit More (10/9/2014)


    Thanks Eirikur,

    1) Data sizes? - 30 GB

    2) Available high-performance storage space? 1 TB disk, 97 GB SQL Server memory, 16 processors

    3) Type of storage available? - RAID 1

    > Already using a staging area.

    > Do you mean a MERGE statement in T-SQL, a Merge Join, or a Merge transformation?

    > You are correct that the major problem area is the back-update, but it is required to update the dimension keys.

    The data volume could vary from 5 million to 150 million rows.

    So you have 30 GB of data, you use a lookup with full cache, and your ETL server has only 10 GB of RAM.

    Hmmmmm.... what could be the issue here?

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Eirikur Eiriksson (10/8/2014)


    I suggest using a staging area on the target SQL Server and replacing the Lookup transformation with a MERGE statement.

    +1

    Go for a staging area. Back-updating the flat file seems unnecessarily complicated to me.

    If you have Enterprise edition, use page compression on the staging table if you are worried about storage space.
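
    For example (the staging table name is hypothetical):

        -- Rebuild the staging table with page compression (Enterprise feature)
        ALTER TABLE dbo.StagingFlat
        REBUILD WITH (DATA_COMPRESSION = PAGE);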

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • We do not have flat files; as I mentioned, the source and target are both SQL Server.

    Abhijit - http://abhijitmore.wordpress.com

  • Abhijit More (10/9/2014)


    We do not have flat files; as I mentioned, the source and target are both SQL Server.

    Right, I got confused by the "flat structure table".

    You mentioned "incremental load" in your original question. How is this architecture incremental? Or is the 150 million row table the result of an incremental load check?

    I would drop the update of the source table. Just load everything and use lookups with full caches against the 26 dimensions in an SSIS data flow. Maybe raise the number of EngineThreads that are available for the data flow.
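
    One thing that can help keep 26 full-cache lookups manageable (a common practice, not something specific to your package) is to give each lookup a source query that returns only the surrogate key and the business key rather than the whole dimension. A sketch with made-up names:

        -- Lookup source query: cache only the two columns the lookup needs
        SELECT CustomerKey, CustomerName
        FROM   dbo.DimCustomer;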

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Using the Lookup transformation with full cache will add memory overhead, and we have other applications accessing the same server. Another concern is that we would be using 26 lookups with full cache.

    From an incremental load perspective, we could receive approximately 15 million rows on each run.

    Abhijit - http://abhijitmore.wordpress.com

  • Abhijit More (10/10/2014)


    Using the Lookup transformation with full cache will add memory overhead, and we have other applications accessing the same server. Another concern is that we would be using 26 lookups with full cache.

    From an incremental load perspective, we could receive approximately 15 million rows on each run.

    Depends on how big your dimensions are (we have no information on that whatsoever) and how wide their business keys are.

    Maybe you should put more RAM in your ETL server. 10 GB is way too little. My laptop has more memory than that.

    Since the source table and the destination tables are all in SQL Server, you could also do everything in T-SQL.

    Load the dimensions first. Then write one big SQL statement that joins the source table against the 26 dimensions and dumps the result into the fact table; see the sketch below. If you properly index your dimensions, it should not be that hard. 15 million rows is not that much.
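
    A rough sketch of that statement with only three of the 26 dimensions shown (all table and column names are placeholders):

        INSERT INTO dbo.FactSales (CustomerKey, ProductKey, DateKey, SalesAmount)
        SELECT dc.CustomerKey,
               dp.ProductKey,
               dd.DateKey,
               f.SalesAmount
        FROM   dbo.StagingFlat AS f
        JOIN   dbo.DimCustomer AS dc ON dc.CustomerName = f.CustomerName
        JOIN   dbo.DimProduct  AS dp ON dp.ProductCode  = f.ProductCode
        JOIN   dbo.DimDate     AS dd ON dd.CalendarDate = f.SaleDate;
        -- ...and so on for the remaining dimensions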

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

