Lookups with huge datasets

  • cmbrooks08

    SSC Eights!

    Points: 861

    Hi everybody-

    I have a package trying to load a fact table in my DW, and the dataset from one of the dimensions is too large for my computer/the lookup. I'm already using the SQL command instead of the table selection, so I'm only pulling in the 2 columns I need, but I'm having trouble caching all of the data, since it's about 30 million records. I've read some hints about people using merges instead of lookups, but couldn't find anything in-depth enough to really help me understand. Can anybody else who works with a dataset this large point me in the right direction?

    Thanks-

    Matt

  • sql_learner29

    SSCrazy Eights

    Points: 9107

    Why would you want to look up all the columns when you only need the dimension KEY in the fact table? You can just look up the business key and the primary key from the dimension table...

    that will be enough
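
    A minimal sketch of that kind of lookup query (the table and column names here are only placeholders, not from the original post):

        -- Pull only the surrogate key and the business key into the lookup cache;
        -- every extra column makes the cached rowset bigger for no benefit.
        SELECT
            CustomerSK,          -- surrogate key that goes into the fact table
            CustomerBusinessKey  -- business key used to match incoming fact rows
        FROM dbo.DimCustomer;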

    ---------------------------------------------------

    Thanks

  • jjbear11

    Default port

    Points: 1497

    The lookup transform adds all of the records to the lookup cache before the data flow starts processing your rows. Execute your package and then go to the "Progress" tab; as you look down the log you will see the number of rows the Lookup transform is caching. If you have a lot of rows of data, it's going to bring your computer to a crawl, because it has to load all that data into memory.

  • cliffb

    SSCarpal Tunnel

    Points: 4547

    For larger dimensions, in my experience it is usually more efficient to load a staging table that has your key columns nullable, and then fill in the keys with a set-based update in a stored procedure.
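
    A rough sketch of that staging-table pattern (object names are hypothetical): load the incoming rows with the surrogate key column left NULL, then resolve it with one set-based UPDATE.

        -- Staging table: the surrogate key is nullable and filled in afterwards
        CREATE TABLE stg.FactSales
        (
            CustomerBusinessKey INT   NOT NULL,
            CustomerSK          INT   NULL,     -- resolved by the update below
            SaleDate            DATE  NOT NULL,
            Amount              MONEY NOT NULL
        );

        -- One set-based update resolves the key for every staged row
        UPDATE s
        SET    s.CustomerSK = d.CustomerSK
        FROM   stg.FactSales AS s
        JOIN   dbo.DimCustomer AS d
               ON d.CustomerBusinessKey = s.CustomerBusinessKey;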

  • Koen Verbeeck

    SSC Guru

    Points: 258965

    If SSIS 2005:

    You can use a left outer join in your data source query: join your source (staging) table to the destination (dimension/fact) table with the NOLOCK hint. If your indexes are placed correctly, it should perform well. Select the surrogate key from the destination table.
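
    A hedged example of such a source query, using hypothetical table names (the destination here is a dimension, but the same shape works for a fact table):

        -- SK_Destination comes back NULL for rows that do not yet exist
        -- in the destination, which drives the conditional split below.
        SELECT
            s.CustomerBusinessKey,
            s.CustomerName,
            s.City,
            d.CustomerSK AS SK_Destination
        FROM stg.DimCustomer AS s
        LEFT OUTER JOIN dbo.DimCustomer AS d WITH (NOLOCK)
            ON d.CustomerBusinessKey = s.CustomerBusinessKey;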

    After the source, use a conditional split to see if a record is an update or an insert.

    Insert --> SK_Destination IS NULL

    Update --> SK_Destination IS NOT NULL

    Write the inserts immediately to the destination (hence the NOLOCK hint, or you could get locking issues).

    Write the updates to a temp table.

    After the data flow, perform a set-based update by matching the temp table to the destination table using the surrogate key you found. (this is much more performant than using an OLE DB Command in the data flow!)
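
    A sketch of that set-based update, again with made-up names (the "update" rows were written to a temp/staging table during the data flow):

        -- Apply all captured updates in a single statement, joining on the
        -- surrogate key returned by the source query.
        UPDATE d
        SET    d.CustomerName = u.CustomerName,
               d.City         = u.City
        FROM   dbo.DimCustomer AS d
        JOIN   stg.DimCustomer_Updates AS u
               ON u.CustomerSK = d.CustomerSK;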

    If SSIS 2008:

    use the MERGE statement
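
    For illustration, a minimal MERGE sketch against the same hypothetical tables:

        -- SQL Server 2008+: insert new rows and update existing ones in one statement
        MERGE dbo.DimCustomer AS tgt
        USING stg.DimCustomer AS src
            ON tgt.CustomerBusinessKey = src.CustomerBusinessKey
        WHEN MATCHED THEN
            UPDATE SET tgt.CustomerName = src.CustomerName,
                       tgt.City         = src.City
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (CustomerBusinessKey, CustomerName, City)
            VALUES (src.CustomerBusinessKey, src.CustomerName, src.City);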

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • cliffb

    SSCarpal Tunnel

    Points: 4547

    da-zero (10/6/2010)


    ...

    Join your source (staging) table with the destination (dimension/fact table)

    ...

    If your data source is on the same server as the dimension table, then I agree with you. However, if you are pulling from another server or loading from a flat file, loading a staging table has been quicker in my experience.

  • H4K

    Hall of Fame

    Points: 3271

    cmbrooks08 (9/30/2010)


    ...

    I have a package trying to load a fact table in my DW, and the dataset from one of the dimensions is too large for my computer/the lookup.

    ...

    First of all, tell me:

    What happens when you execute the SSIS package?

    How much time does it take to complete, or does it not complete at all?

    Also, for the lookup you can try the partial cache option, where you can set the amount of memory the lookup uses to cache rows.

    Regards,

    Amar Sale

    BI Developer
    SSRS, SSIS, SSAS, IBM Cognos, IBM Infosphere Cubing services, Crystal reports, IBM DB2, SQL Server, T-SQL
    Please visit... ApplyBI

  • Koen Verbeeck

    SSC Guru

    Points: 258965

    cliffb (10/6/2010)


    da-zero (10/6/2010)


    ...

    Join your source (staging) table with the destination (dimension/fact table)

    ...

    If your data source is on the same server as the dimension table, then I agree with you. However, if you are pulling from another server or loading from a flat file, loading a staging table has been quicker in my experience.

    I always use staging tables, no matter the set-up. Amongst other reasons, they are ideal for debugging (even in production environments).

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • cmbrooks08

    SSC Eights!

    Points: 861

    Sorry it took so long to respond to this, everybody.

    We ended up pulling the max date out of the dimension and then only loading records newer than that. For the fact table lookup, we pulled only those records into a staging lookup table, so there was much less data to cache. The load went from 2 hours down to 10 minutes - awesome!
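
    One way to read that pattern, sketched with hypothetical names (as a later reply points out, it assumes rows older than the high-water mark never change):

        -- High-water mark: the newest date already loaded into the fact table
        DECLARE @MaxLoadedDate DATE;

        SELECT @MaxLoadedDate = MAX(SaleDate)
        FROM   dbo.FactSales;

        -- Refresh a small staging lookup table holding only the newer dimension
        -- rows, so the SSIS lookup caches far less data than the full dimension.
        TRUNCATE TABLE stg.Lookup_Customer;

        INSERT INTO stg.Lookup_Customer (CustomerBusinessKey, CustomerSK)
        SELECT d.CustomerBusinessKey, d.CustomerSK
        FROM   dbo.DimCustomer AS d
        WHERE  d.ValidFromDate > @MaxLoadedDate;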

  • Koen Verbeeck

    SSC Guru

    Points: 258965

    cmbrooks08 (10/8/2010)


    Sorry it took so long to respond to this, everybody.

    We ended up pulling the max date out of the dimension and then only loading records newer than that. For the fact table lookup, we pulled only those records into a staging lookup table, so there was much less data to cache. The load went from 2 hours down to 10 minutes - awesome!

    Welcome to the wonderful world of incremental loads! 🙂

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • COZYROC

    One Orange Chip

    Points: 28499

    If you can use third-party solutions, check the commercial CozyRoc Lookup Plus component. It can use Distributed Cache, based on the Windows AppFabric caching service. The solution can potentially handle millions of records.

    ---
    SSIS Tasks Components Scripts Services | http://www.cozyroc.com/

  • wildh

    SSCarpal Tunnel

    Points: 4310

    How well do you know your data? Does your system allow for history to change? If you're pulling out a max date and only selecting records >= that date, is there any chance you're missing updates on records < your max date? If you are sure that can't happen, then great.
