SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Parallel Processing of Large Volume ETL Jobs


Parallel Processing of Large Volume ETL Jobs

Author
Message
RAGHAVENDRA NARAYANA-476029
RAGHAVENDRA NARAYANA-476029
Grasshopper
Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)

Group: General Forum Members
Points: 12 Visits: 36
Comments posted to this topic are about the item Parallel Processing of Large Volume ETL Jobs
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)

Group: General Forum Members
Points: 428411 Visits: 43436
The title of this article is "Parallel Processing of Large Volume ETL Jobs"... what do you consider to be a "Large Volume ETL Job"?

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
RAGHAVENDRA NARAYANA-476029
RAGHAVENDRA NARAYANA-476029
Grasshopper
Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)

Group: General Forum Members
Points: 12 Visits: 36
Large volume means number of rows and data size of source(s) that need to be processed into the target system by applying good amount of complex validation/filtration/transformation logics. The "Number of rows" and "good amount of" in the previous sentence depends on the hardware capacity and availability of ETL time window of individual systems. Before considering the idea discussed in article, study need to be done on the project for facts like current/future hardware capacity, how much data parallelism can be applied with current hardware, current/future available time window for etl processing, future complexities in validation logics, etc.
Usman Ghani
Usman Ghani
Forum Newbie
Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)

Group: General Forum Members
Points: 5 Visits: 4
I think the information you have attempted to communicate is valuable. However, the repeated use of slashes fragmented or run-on sentences made the article almost impossible to read.
Jim Jaggers
Jim Jaggers
SSC Rookie
SSC Rookie (33 reputation)SSC Rookie (33 reputation)SSC Rookie (33 reputation)SSC Rookie (33 reputation)SSC Rookie (33 reputation)SSC Rookie (33 reputation)SSC Rookie (33 reputation)SSC Rookie (33 reputation)

Group: General Forum Members
Points: 33 Visits: 36
Ohh, and you teased me. You mentioned SQL Set Theory twice in you introduction and then never addressed it in the body. Now I have to go google SQL Set Theory.
omhoge
omhoge
SSCrazy
SSCrazy (2.8K reputation)SSCrazy (2.8K reputation)SSCrazy (2.8K reputation)SSCrazy (2.8K reputation)SSCrazy (2.8K reputation)SSCrazy (2.8K reputation)SSCrazy (2.8K reputation)SSCrazy (2.8K reputation)

Group: General Forum Members
Points: 2784 Visits: 183
This is great timing we were just discussing this in regards to using HaDoop or SQL Server to do some very large, close to terabyte, imports and transformations.

I have a couple questions, and I realize you didn't want to include code but a little start on it would help if possible:

?How do you physically spit up the large files and track them in each thread?
?How is the master process receiving it's messages from the children processes?
?In a SQL SERVER implementation would you most likely use CRL code to break up the files and Bulk Load to import it?
?Any suggestions on foreign keys and indexes?

This is a great topic, it may give us a lead on how to proceed with our new data project.
Very much appreciated!

Skål - jh
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)

Group: General Forum Members
Points: 428411 Visits: 43436
RAGHAVENDRA NARAYANA (11/8/2007)
Large volume means number of rows and data size of source(s) that need to be processed into the target system by applying good amount of complex validation/filtration/transformation logics. The "Number of rows" and "good amount of" in the previous sentence depends on the hardware capacity and availability of ETL time window of individual systems. Before considering the idea discussed in article, study need to be done on the project for facts like current/future hardware capacity, how much data parallelism can be applied with current hardware, current/future available time window for etl processing, future complexities in validation logics, etc.


C'mon... How many rows and columns per row are you talking about?

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
RAGHAVENDRA NARAYANA-476029
RAGHAVENDRA NARAYANA-476029
Grasshopper
Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)

Group: General Forum Members
Points: 12 Visits: 36
We can think of an ETL system where we expect around 5 million transaction coming in with around 25 columns.

I would request if any of the Technical Architects who worked on database sizing and hardware configuration to share their thoughts on the article including parallelism, processors plus this question please.
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)SSC Guru (428K reputation)

Group: General Forum Members
Points: 428411 Visits: 43436
Thanks for the rowcount and line size... I've got an experiment or two to do... I'll be back...

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
RAGHAVENDRA NARAYANA-476029
RAGHAVENDRA NARAYANA-476029
Grasshopper
Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)Grasshopper (12 reputation)

Group: General Forum Members
Points: 12 Visits: 36
?How do you physically spit up the large files and track them in each thread?
?How is the master process receiving it's messages from the children processes?
FOR ABOVE TWO QUESTIONS, IN ARTICLE I HAVE GIVEN DETAILED EXPLANATION.

?In a SQL SERVER implementation would you most likely use CRL code to break up the files and Bulk Load to import it?
NOT SURE ABOUT THIS. I WOULD REQUEST EXPERTS IN THIS PARTICULAR AREA TO ADDRESS THIS.

?Any suggestions on foreign keys and indexes?
NOTHING MUCH THAT I CAN THINK OF NOW, FEW THINGS COULD BE, KEEP MAIN TABLE AND RELATED TABLE INSERTS/UPDATES IN ONE COMMIT/ROLLBACK BLOCK. TAKE CARE OF IDENTITY COLUMNS IN CASE OF ROLLBACK/FAILURE, WE'LL LOSE THE IDENTITY NUMBER IF ROLLBACK/FAILURE HAPPENS. INSERT/UPDATE THE RECORDS IN BULK LIKE THOUSANDS IN CHUNKS IN PREFERABLY TARGET TABLE PRIMARY INDEX ORDER. IN TARGET TABLES, BE STRICT IN INCLUDING INDEXES, TRY COMPOSITE INDEXES INSTEAD OF MULTIPLE INDIVIDUAL INDEXES.
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum







































































































































































SQLServerCentral


Search