
Using SSIS lookup transformation editor with millions of rows

If you are using an SSIS Lookup transformation with a large reference dataset (millions of rows) and the lookup is set to “Full cache” (the default), you might run into a problem.  With this setting all of the reference rows are loaded into memory, and with that many rows you can run out of memory on the machine; the data flow will then fail, because the Lookup transformation will not spool its memory overflow to disk.

There are two solutions to this.  One is to use one of the other two cache modes: “Partial cache” or “No cache”.  These prevent all or most of the reference rows from being loaded into memory, but the trade-off is many more round trips to the database.  If your reference dataset is tens of millions of rows, the task could take hours or even days. Read more about the different cache modes: SSIS – Lookup Cache Modes – Full, Partial, None and Lookup cache modes.

A better solution is to replace the Lookup transformation and the corresponding OLE DB Command transformations with a MERGE statement (requires SQL Server 2008 or later) in an Execute SQL task.  Read about it at Using MERGE in Integration Services Packages.  Basically, you create a Data Flow task that loads, transforms, and saves the source data to a staging table; create an Execute SQL task that contains the MERGE statement; connect the Data Flow task to the Execute SQL task; and use the staging table as the source for the MERGE statement.
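As a rough sketch of what the Execute SQL task might contain (the table and column names below are invented for illustration; they are not from the article):

```sql
-- Upsert the staged rows into the target table in one set-based statement.
-- dbo.DimCustomer and dbo.StagingCustomer are hypothetical names.
MERGE dbo.DimCustomer AS tgt
USING dbo.StagingCustomer AS src
    ON tgt.CustomerKey = src.CustomerKey        -- this join does the "lookup"
WHEN MATCHED AND tgt.CustomerName <> src.CustomerName THEN
    UPDATE SET tgt.CustomerName = src.CustomerName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerKey, CustomerName)
    VALUES (src.CustomerKey, src.CustomerName);
```

Because MERGE is a single set-based statement, SQL Server can optimize the join as a whole instead of issuing one lookup query per row, as the “No cache” mode does.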

One SSIS package that I was working on was using “No cache” and the reference dataset was 5 million rows.  It was taking 6 hours to run.  After replacing it with the MERGE statement, it took only 15 minutes to complete.

One side note: “Full cache” mode adds startup time to your data flow, because all of the caching happens before any rows are read from the data flow source(s).  So if you execute a package and there is a long delay before the first task appears to start, it could be because many rows are being loaded into memory.

James Serra's Blog

James is a big data and data warehousing technology specialist at Microsoft. He is a thought leader in the use and application of Big Data technologies, including MPP solutions involving hybrid technologies of relational data, Hadoop, and private and public cloud. Previously he was an independent consultant working as a Data Warehouse/Business Intelligence architect and developer. He is a prior SQL Server MVP with over 30 years of IT experience. James is a popular blogger (JamesSerra.com) and speaker, having presented at dozens of PASS events including the PASS Business Analytics conference and the PASS Summit. He is the author of the book “Reporting with Microsoft SQL Server 2012”. He received a Bachelor of Science degree in Computer Engineering from the University of Nevada-Las Vegas.


Posted by mmattson on 16 August 2011

Another option is to use the Cache Transform (Data Flow). This lets you load your reference data into a cache file, and subsequent Lookup transformations hit the cache file rather than the database.

One consideration is that you still need enough memory to load all of the rows before the cache file can be created. Once created, though, the cache file performs well and reduces both memory use and database hits.

Posted by Lempster on 17 August 2011

This could come in very handy, but how does one insert all the columns from non-matching rows into the target table without having to specify the full column list twice (once for the INSERT column list and once for the VALUES column list)? Is there an 'INSERT INTO SELECT *' option?


Posted by Koen Verbeeck on 17 August 2011

Is the T-SQL MERGE statement a viable alternative for the Lookup component in every scenario? In the data warehousing scenario, probably so, but there are other scenarios where you are not merging data but still need to do a lookup.

Furthermore, the MERGE statement by itself doesn't replace the Lookup functionality; it is the JOIN inside the MERGE statement (the ON clause) that performs the lookup.

So I would say that using JOINs in T-SQL is the best alternative for the Lookup component. (A Merge Join with a Conditional Split will also do the trick, but performance-wise this won't be as fast as the T-SQL JOINs.)

Posted by DanielBowlin on 17 August 2011

Another useful technique: if the lookup will primarily access a small subset of the reference data, for example customers active in the last year versus all customers in the database, use a Lookup with a partial cache setting to load all of the commonly accessed records (customers active in the last year), and then route all records failing that lookup to a second, no-cache Lookup for the remaining records.

One thing I have always wondered about is how to use the Merge technique with multiple lookups.  Is it possible to avoid having to add a Sort transform, which slows everything down, before the second merge?

Posted by dbowlin on 17 August 2011

Oops, not a partial cache, a full cache with a query that constrains the result set to the active customers.
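In other words, instead of pointing the Lookup at the whole table, give it a query that returns only the active subset, something like the following (table and column names invented for illustration):

```sql
-- Hypothetical example: load only customers active in the last year
-- into the Lookup's full cache, instead of caching the entire table.
SELECT CustomerID, CustomerName
FROM dbo.Customer
WHERE LastOrderDate >= DATEADD(YEAR, -1, GETDATE());
```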

Posted by Lempster on 17 August 2011

I agree with verbeeckkoen; in a scenario where you simply want to pass matching or non-matching rows to another task in the data flow, e.g. another Lookup transformation, using MERGE would involve multiple staging tables and hence multiple data flows. I'll have to do some testing to see whether that still works out quicker and less resource-intensive than caching with a Lookup transformation.

Posted by Lempster on 17 August 2011

In fact, I don't think MERGE would work at all in the scenario above, because at least one of the MATCHED/NOT MATCHED clauses must be specified; there is no option to do nothing with both matching and non-matching data.

Posted by Kenneth Wymore on 17 August 2011

I ran into performance issues when trying to bounce roughly 10K rows against 18M rows in a lookup using full cache. Changing it to no cache reduced the load time from hours to minutes. I have yet to try the merge to see if it would perform any better.

Posted by Peter Schott on 17 August 2011

I also ran into Ken's issue, but it definitely depends on what will be looked up and how often you reuse the cache. We generated cache files and then looked up against those cache files, which helped quite a bit. We also ran into issues with MERGE against some of the larger fact tables we wanted to load: we were trying to mark a historical row inactive before inserting new rows, and the query optimizer chose a table scan. I had to force an index, which increased reads but reduced the run time from hours to minutes.  I want to try out the Pragmatic Works Upsert transform; I just haven't had a chance to really put it to the test yet.
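For reference, forcing an index on a MERGE target looks something like this (the object and index names here are invented; MERGE does accept an INDEX hint on the target table):

```sql
-- Hypothetical names; WITH (INDEX(...)) steers the optimizer to the
-- business-key index instead of a table scan of the fact table.
MERGE dbo.FactSales WITH (INDEX (IX_FactSales_BusinessKey)) AS tgt
USING dbo.StagingFactSales AS src
    ON tgt.BusinessKey = src.BusinessKey
WHEN MATCHED AND tgt.IsCurrent = 1 THEN
    UPDATE SET tgt.IsCurrent = 0;   -- mark the historical row inactive
```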
