Hi all - not sure if this is the right forum for SSIS questions, if it is not, please let me know which one I should be using.
I have a task that requires me to take a raw data set and process it into a set of tables holding different kinds of information - we'll call them tables A, B, and C.
One particular aspect of this task is that table B has a foreign key tying it to table A, and table C has foreign keys tying it to both tables A and B - though the AID on a table C record need not match the AID of the table B record it points to. For example:
Table B:
BID  AID  BRecord
1    1    jkl
2    1    mno
3    3    pqr

Table C:
CID  BID  AID  CRecord
1    1    1    stu
2    3    2    vwx
3    3    3    yz
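To make the relationship concrete, here is a minimal in-memory sketch (plain Python, with the example values from the tables above) showing that a table C row's AID can legitimately differ from the AID of the table B row it references:

```python
# Hypothetical in-memory versions of the example tables above.
table_b = {1: 1, 2: 1, 3: 3}  # BID -> AID

table_c = [  # (CID, BID, AID)
    (1, 1, 1),
    (2, 3, 2),  # AID is 2 here, but BID 3 carries AID 3 in table B
    (3, 3, 3),
]

for cid, bid, aid in table_c:
    status = "matches" if table_b[bid] == aid else "differs from"
    print(f"C row {cid}: AID {aid} {status} table B's AID {table_b[bid]}")
```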
The method I've taken to handle this task is to use SSIS to build table A from the raw data set first. Once that is done, table B is built from the raw data set, with table A used as a lookup to supply the AID. Finally, table C is built, with tables A and B used as lookups for the AID and BID fields respectively.
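Conceptually, a full-cache lookup behaves like a dictionary built from the entire reference table and held in memory. A rough stand-in for the table C data flow (Python sketch, all names and sample values hypothetical) shows why memory use scales with the size of tables A and B:

```python
# Illustrative stand-in for the SSIS data flow (all names hypothetical).
# With full cache, the whole reference table is loaded into memory up front.

raw_rows = [("key1", "bkey1"), ("key2", "bkey2")]  # stand-in raw data

# Full-cache lookups: entire reference tables in memory.
a_lookup = {"key1": 1, "key2": 2}      # natural key -> AID (~8M entries in practice)
b_lookup = {"bkey1": 10, "bkey2": 20}  # natural key -> BID (~10M entries in practice)

table_c = []
for a_key, b_key in raw_rows:
    table_c.append((b_lookup[b_key], a_lookup[a_key]))  # (BID, AID)

print(table_c)  # [(10, 1), (20, 2)]
```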
Now, this task works perfectly fine when the data sets are small enough. The machine running this process has 16 GB of memory, so it can handle fairly intense queries. However, the lookups use a ridiculous amount of memory - more than even this machine can handle.
One raw data set contains roughly 10 million records. When processed into its respective tables, it yields about 8 million table A records and about 10 million table B records. When the task to build table C is executed, the lookups for tables A and B take up about 7 GB of memory each, which causes the task to grind to a halt after processing around 2 million records.
We initially used full cache for the lookups; after some research I discovered that full cache requires enough memory to hold the entire reference table, and if that space becomes unavailable, the task fails. We tried partial cache next, but then the process took far too long to execute, since the row-by-row lookups against the database were far too slow.
My next plan is to split the table C build into four parts by dividing the raw data source into quarters. The lookups against tables A and B would then each process far fewer records, which should allow the entire process to complete successfully.
However, this approach is a bit cumbersome: it requires me to duplicate the task four times and change the respective queries, and every time I add to the raw data source I'll need to re-think the split to keep the tasks manageable.
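One way to avoid re-thinking the split every time the raw source grows would be to partition on a modulo of a stable numeric key rather than on fixed ranges, so each of the four tasks always gets roughly a quarter of the rows. A sketch of the idea (Python; the key column and partition count are assumptions):

```python
NUM_PARTITIONS = 4

def partition_of(record_id: int) -> int:
    """Deterministically assign a record to one of four buckets."""
    return record_id % NUM_PARTITIONS

# Each of the four table C build tasks would filter its source on the
# same predicate, e.g. WHERE record_id % 4 = 0, 1, 2, or 3.
buckets = {p: [] for p in range(NUM_PARTITIONS)}
for record_id in range(10):  # stand-in for the raw data set
    buckets[partition_of(record_id)].append(record_id)

print(buckets)
# {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6], 3: [3, 7]}
```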
Is there a better approach to handling this requirement, or is mine the optimal way to handle the task?