This is great timing; we were just discussing this in regard to using Hadoop or SQL Server for some very large, close-to-terabyte, imports and transformations.
I have a couple of questions, and I realize you didn't want to include code, but a little start on it would help if possible:
- How do you physically split up the large files and track them in each thread?
- How does the master process receive its messages from the child processes?
- In a SQL Server implementation, would you most likely use CLR code to break up the files and a bulk load to import them?
- Any suggestions on foreign keys and indexes?
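For context on the first two questions, here's roughly what I had in mind — a minimal Python sketch (my own guess at an approach, not anything from your post): split the file at line boundaries by byte offset so no record straddles two chunks, give each worker thread one chunk, and have the children report back to the master through a shared queue.

```python
import os
import queue
import threading

def chunk_offsets(path, n_chunks):
    """Compute (start, end) byte offsets that split the file into roughly
    equal chunks, snapping each boundary forward to the next newline so
    no line is split across two chunks."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(i * size // n_chunks)
            f.readline()              # advance to the next line boundary
            bounds.append(f.tell())
    bounds.append(size)
    # drop duplicate boundaries (can happen with tiny files / long lines)
    return [(bounds[i], bounds[i + 1])
            for i in range(len(bounds) - 1)
            if bounds[i] < bounds[i + 1]]

def worker(path, start, end, results):
    """Child: process one chunk, then report to the master via the queue."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            count += 1                # stand-in for the real transform/load
    results.put((start, end, count)) # message back to the master

def run(path, n_threads=4):
    """Master: fan out one thread per chunk, then drain the result queue."""
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(path, s, e, results))
               for s, e in chunk_offsets(path, n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total = 0
    while not results.empty():
        _start, _end, count = results.get()
        total += count
    return total
```

Is offset-based chunking like this basically what you do, or do you physically write the pieces out as separate files first?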
This is a great topic; it may give us a lead on how to proceed with our new data project.
Very much appreciated!
Skål - jh