Lookups with huge datasets

  • Hi everybody-

    I have a package trying to load a fact table in my DW, and the dataset from one of the dimensions is too large for my computer/the lookup. I'm already using a SQL command instead of the table selection, so I'm only pulling in the 2 columns I need, but I'm having trouble caching all of them, since it's about 30 million records. I've read some hints about people using merges instead of lookups, but couldn't find anything in-depth enough to really help me understand. Can anybody else who works with a dataset this large point me in the right direction?

    Thanks-

    Matt

  • Why would you want to look up all the columns when you only need the dimension key in the fact table? You can just look up the business key and the primary (surrogate) key from the dimension table (see the sketch below)...

    That will be enough.

    [font="Comic Sans MS"]
    ---------------------------------------------------

    Thanks [/font]
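    A minimal sketch of that reduced lookup query, assuming a hypothetical DimCustomer dimension where CustomerBK is the business key and CustomerSK is the surrogate key (these names are placeholders, not from the thread):

        SELECT CustomerSK,   -- surrogate key that goes into the fact table
               CustomerBK    -- business key used to match incoming fact rows
        FROM   dbo.DimCustomer;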

  • The Lookup transform adds all of the records to the lookup cache before the data flow starts processing rows. Execute your package and then go to the "Progress" tab. As you look down the log you will see the number of rows the Lookup transform is caching. If you have a lot of rows, it will bring your computer to a crawl because it has to load all of that data into memory.

  • For larger dimensions, in my experience it is usually more efficient to load a staging table that has your key columns nullable and then do an update with a stored procedure, roughly like the sketch below.
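    A minimal sketch of that pattern, assuming a hypothetical staging table stg.FactSales with a nullable CustomerSK column and a dbo.DimCustomer dimension (all of these names are placeholders, not from the thread):

        -- Set-based key resolution wrapped in a stored procedure (illustrative names).
        CREATE PROCEDURE dbo.usp_ResolveCustomerKeys
        AS
        BEGIN
            UPDATE s
            SET    s.CustomerSK = d.CustomerSK
            FROM   stg.FactSales AS s
            JOIN   dbo.DimCustomer AS d
                   ON d.CustomerBK = s.CustomerBK
            WHERE  s.CustomerSK IS NULL;   -- only rows not yet resolved
        END;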

  • If SSIS 2005:

    You can use a left outer join in your data source: join your source (staging) table with the destination (dimension/fact) table using the NOLOCK hint. If your indexes are placed correctly, it should perform well. Select the surrogate key from the destination table.

    After the source, use a conditional split to see if a record is an update or an insert.

    Insert --> SK_Destination IS NULL

    Update --> SK_Destination IS NOT NULL

    Write the inserts immediately to the destination (hence the NOLOCK hint; otherwise you could run into locking issues).

    Write the updates to a temp table.

    After the data flow, perform a set-based update by matching the temp table to the destination table using the surrogate key you found. (this is much more performant than using an OLE DB Command in the data flow!)
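    A rough sketch of both queries, with purely illustrative names (stg.FactSales as the staging source, dbo.FactSales as the destination, #UpdatedRows as the temp table the update branch writes to):

        -- Source query for the data flow.
        SELECT s.SaleBK,
               s.SaleAmount,
               f.SaleSK AS SK_Destination      -- NULL = insert, NOT NULL = update
        FROM   stg.FactSales AS s
        LEFT OUTER JOIN dbo.FactSales AS f WITH (NOLOCK)
               ON f.SaleBK = s.SaleBK;

        -- After the data flow: one set-based update from the temp table
        -- that the "update" branch of the conditional split wrote to.
        UPDATE f
        SET    f.SaleAmount = u.SaleAmount
        FROM   dbo.FactSales AS f
        JOIN   #UpdatedRows AS u
               ON u.SK_Destination = f.SaleSK;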

    If SSIS 2008:

    use the MERGE statement
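    On 2008 the same insert/update logic collapses into one statement. A minimal sketch with the same illustrative names as above; the real matching columns and update list depend on your tables:

        MERGE dbo.FactSales AS tgt
        USING stg.FactSales AS src
              ON tgt.SaleBK = src.SaleBK
        WHEN MATCHED THEN
            UPDATE SET tgt.SaleAmount = src.SaleAmount
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (SaleBK, SaleAmount)
            VALUES (src.SaleBK, src.SaleAmount);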

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • da-zero (10/6/2010)


    ...

    Join your source (staging) table with the destination (dimension/fact table)

    ...

    If your data source is on the same server as the dimension table, then I agree with you. However, if you are pulling from another server or loading from a flat file, loading a staging table has been quicker in my experience.

  • cmbrooks08 (9/30/2010)


    I have a package trying to load a fact table in my DW, and the dataset from one of the dimensions is too large for my computer/the lookup.

    ...

    First of all, tell me:

    What happens when you execute the SSIS package?

    How much time does it take to complete, or does it not complete at all?

    Also, for the lookup you can try the partial cache option, where you can set the amount of memory the lookup uses to cache rows.

    Regards,

    Amar Sale

    BI Developer
    SSRS, SSIS, SSAS, IBM Cognos, IBM Infosphere Cubing services, Crystal reports, IBM DB2, SQL Server, T-SQL
    Please visit... ApplyBI

  • cliffb (10/6/2010)


    da-zero (10/6/2010)


    ...

    Join your source (staging) table with the destination (dimension/fact table)

    ...

    If your data source is on the same server as the dimension table, then I agree with you. However, if you are pulling from another server or loading from a flat file, loading a staging table has been quicker in my experience.

    I always use staging tables, no matter what the set-up. Among other reasons, they are ideal for debugging (even in production environments).

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Sorry it took so long to respond to this everybody.

    We ended up pulling the max date out of the dimension and then only loading records newer than that. For the fact table's lookup, we pulled only those records into a staging lookup table, so there was much less data to cache. 2 hours just went down to 10 minutes - awesome!
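    Roughly, the filter looks like the sketch below; every table and column name here is a placeholder, since the exact schema wasn't posted:

        -- Newest date already loaded into the destination.
        DECLARE @MaxDate datetime;

        SELECT @MaxDate = MAX(LoadDate)
        FROM   dbo.DimCustomer;

        -- Extract only records newer than that date, so the downstream
        -- lookup (fed from a small staging lookup table) has far less
        -- data to cache.
        SELECT CustomerBK, SaleAmount, SaleDate
        FROM   src.Sales
        WHERE  SaleDate > @MaxDate;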

  • cmbrooks08 (10/8/2010)


    ...

    2 hours just went down to 10 minutes - awesome!

    Welcome to the wonderful world of incremental loads! 🙂

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • If you can use third-party solutions, check the commercial CozyRoc Lookup Plus component. It can use Distributed Cache, based on the Windows AppFabric caching service. The solution can potentially handle millions of records.

    ---
    SSIS Tasks Components Scripts Services | http://www.cozyroc.com/

  • How well do you know your data? Does your system allow history to change? If you're pulling out a max date and only selecting records >= that date, is there any chance you're missing updates to records older than your max date? If you are sure that can't happen, then great.
