Lookups with huge datasets

  • Hi everybody-

    I have a package trying to load a fact table in my DW, and the dataset from one of the dimensions is too large for my computer/the lookup. I'm already using a SQL command instead of the table selection, so I'm only pulling in the 2 columns I need, but I'm having trouble caching all of them, since it's about 30 million records. I've read some hints about people using merges instead of lookups, but couldn't find anything in-depth enough to really help me understand. Can anybody else who works with a dataset this large point me in the right direction?

    Thanks-

    Matt

  • Why would you want to look up all the columns when you only need the dimension key in the fact table? You can just look up the business key and the primary (surrogate) key from the dimension table (see the sketch below)...

    That will be enough.

    [font="Comic Sans MS"]
    ---------------------------------------------------

    Thanks [/font]
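    A minimal sketch of that reduced lookup query, assuming a hypothetical DimCustomer dimension where CustomerBK is the business key and CustomerSK is the surrogate key (these names are placeholders, not from the thread):

        SELECT CustomerSK,   -- surrogate key that goes into the fact table
               CustomerBK    -- business key used to match incoming fact rows
        FROM   dbo.DimCustomer;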

  • The Lookup transform adds all of the records to the lookup cache before the data flow starts processing rows. Execute your package and then go to the "Progress" tab. As you look down the log you will see the number of rows the Lookup transform is caching. If you have a lot of rows, it will bring your computer to a crawl because it has to load all of that data into memory.

  • For larger dimensions, in my experience it is usually more efficient to load a staging table that has your key columns nullable and then do an update with a stored procedure, roughly like the sketch below.
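    A minimal sketch of that pattern, assuming a hypothetical staging table stg.FactSales with a nullable CustomerSK column and a dbo.DimCustomer dimension (all of these names are placeholders, not from the thread):

        -- Set-based key resolution wrapped in a stored procedure (illustrative names).
        CREATE PROCEDURE dbo.usp_ResolveCustomerKeys
        AS
        BEGIN
            UPDATE s
            SET    s.CustomerSK = d.CustomerSK
            FROM   stg.FactSales AS s
            JOIN   dbo.DimCustomer AS d
                   ON d.CustomerBK = s.CustomerBK
            WHERE  s.CustomerSK IS NULL;   -- only rows not yet resolved
        END;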

  • If SSIS 2005:

    You can use a left outer join in your data source: join your source (staging) table with the destination (dimension/fact) table using the NOLOCK hint. If your indexes are placed correctly, it should perform well. Select the surrogate key from the destination table.

    After the source, use a conditional split to see if a record is an update or an insert.

    Insert --> SK_Destination IS NULL

    Update --> SK_Destination IS NOT NULL

    Write the inserts immediately to the destination (hence the NOLOCK hint; otherwise you could run into locking issues).

    Write the updates to a temp table.

    After the data flow, perform a set-based update by matching the temp table to the destination table using the surrogate key you found. (this is much more performant than using an OLE DB Command in the data flow!)
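    A rough sketch of both queries, with purely illustrative names (stg.FactSales as the staging source, dbo.FactSales as the destination, #UpdatedRows as the temp table the update branch writes to):

        -- Source query for the data flow.
        SELECT s.SaleBK,
               s.SaleAmount,
               f.SaleSK AS SK_Destination      -- NULL = insert, NOT NULL = update
        FROM   stg.FactSales AS s
        LEFT OUTER JOIN dbo.FactSales AS f WITH (NOLOCK)
               ON f.SaleBK = s.SaleBK;

        -- After the data flow: one set-based update from the temp table
        -- that the "update" branch of the conditional split wrote to.
        UPDATE f
        SET    f.SaleAmount = u.SaleAmount
        FROM   dbo.FactSales AS f
        JOIN   #UpdatedRows AS u
               ON u.SK_Destination = f.SaleSK;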

    If SSIS 2008:

    use the MERGE statement
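    On 2008 the same insert/update logic collapses into one statement. A minimal sketch with the same illustrative names as above; the real matching columns and update list depend on your tables:

        MERGE dbo.FactSales AS tgt
        USING stg.FactSales AS src
              ON tgt.SaleBK = src.SaleBK
        WHEN MATCHED THEN
            UPDATE SET tgt.SaleAmount = src.SaleAmount
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (SaleBK, SaleAmount)
            VALUES (src.SaleBK, src.SaleAmount);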

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • da-zero (10/6/2010)


    ...

    Join your source (staging) table with the destination (dimension/fact table)

    ...

    If your data source is on the same server as the dimension table, then I agree with you. However, if you are pulling from another server or loading from a flat file, loading a staging table has been quicker in my experience.

  • cmbrooks08 (9/30/2010)


    I have a package trying to load a fact table in my DW, and the dataset from one of the dimensions is too large for my computer/the lookup.

    ...

    First of all, tell me:

    What happens when you execute the SSIS package?

    How much time does it take to complete, or does it not complete at all?

    Also, for the lookup you can try the partial cache option, where you can set the amount of memory the lookup uses to cache rows.

    Regards,

    Amar Sale

    BI Developer
    SSRS, SSIS, SSAS, IBM Cognos, IBM Infosphere Cubing services, Crystal reports, IBM DB2, SQL Server, T-SQL
    Please visit... ApplyBI

  • cliffb (10/6/2010)


    da-zero (10/6/2010)


    ...

    Join your source (staging) table with the destination (dimension/fact table)

    ...

    If your data source is on the same server as the dimension table, then I agree with you. However, if you are pulling from another server or loading from a flat file, loading a staging table has been quicker in my experience.

    I always use staging tables, no matter what the set-up. Among other reasons, they are ideal for debugging (even in production environments).

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Sorry it took so long to respond to this everybody.

    We ended up pulling the max date out of the dimension and then only loading records newer than that. For the fact table's lookup, we pulled only those records into a staging lookup table, so there was much less data to cache. 2 hours just went down to 10 minutes - awesome!
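    Roughly, the filter looks like the sketch below; every table and column name here is a placeholder, since the exact schema wasn't posted:

        -- Newest date already loaded into the destination.
        DECLARE @MaxDate datetime;

        SELECT @MaxDate = MAX(LoadDate)
        FROM   dbo.DimCustomer;

        -- Extract only records newer than that date, so the downstream
        -- lookup (fed from a small staging lookup table) has far less
        -- data to cache.
        SELECT CustomerBK, SaleAmount, SaleDate
        FROM   src.Sales
        WHERE  SaleDate > @MaxDate;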

  • cmbrooks08 (10/8/2010)


    ...

    2 hours just went down to 10 minutes - awesome!

    Welcome to the wonderful world of incremental loads! 🙂

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • If you can use third-party solutions, check the commercial CozyRoc Lookup Plus component. It can use Distributed Cache, based on the Windows AppFabric caching service. The solution can potentially handle millions of records.

    ---
    SSIS Tasks Components Scripts Services | http://www.cozyroc.com/

  • How well do you know your data? Does your system allow history to change? If you're pulling out a max date and only selecting records >= that date, is there any chance you're missing updates to records older than your max date? If you are sure that can't happen, then great.
