SQLServerCentral is supported by Redgate
Conditional Set-Based Processing: Moving Towards a Best Practice for ETL


MelS-512196
SSC-Addicted
Group: General Forum Members
Points: 436 Visits: 448
Comments posted to this topic are about the item Conditional Set-Based Processing: Moving Towards a Best Practice for ETL
keith.fearnley
SSC Veteran
Group: General Forum Members
Points: 287 Visits: 224
Thanks for the approach, which makes a lot of sense.
Has anyone done any testing to see whether it would be worth taking an adaptive approach to this? What I mean is, after a failure in a block (of 25k say) rather than reverting to RBAR for the whole block, try something like a binary chop approach to locate the bad row(s). A variation on this would be to adapt the block size dynamically - start large, if an error, try half that size and repeat until a preset minimum size is reached which would be processed RBAR. If a block gets by successfully, increase the size for the next set.
We've done this kind of thing in serial comms before and I wonder if it would work here.
The key factors, I think, are the ratio of time to process a block versus the time to do it RBAR (for different sizes) and (the more variable factor that depends on your data) how often do you expect to get errors and what is their pathology - are they likely to be solitary, clumped in groups, etc. Some of this will only be learnt over time with your data. These factors will determine whether, having failed 20k rows, it is worth taking the time hit to try two lots of 10k or to drop straight into RBAR for the 20k.
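The adaptive fallback described above can be sketched as plain control flow. This is a minimal illustration only: in practice this logic would live in a T-SQL stored procedure with TRY...CATCH, and the function names, the stand-in validity rule, and the `MIN_BLOCK` threshold here are all hypothetical, not from the article.

```python
# Sketch of the adaptive "binary chop" fallback described above.
# insert_block() stands in for a hypothetical set-based insert that
# raises on any bad row; process_rbar() handles rows one at a time.

MIN_BLOCK = 100  # below this size, fall back to row-by-row (RBAR)

def insert_block(rows):
    """Set-based insert; raises ValueError if any row is bad."""
    if any(r < 0 for r in rows):  # stand-in validity rule
        raise ValueError("bad row in block")
    return len(rows)

def process_rbar(rows):
    """Row-by-row fallback; returns (good_count, bad_rows)."""
    good, bad = 0, []
    for r in rows:
        try:
            good += insert_block([r])
        except ValueError:
            bad.append(r)
    return good, bad

def process_adaptive(rows):
    """Try the whole block; on failure, halve and recurse."""
    try:
        return insert_block(rows), []
    except ValueError:
        if len(rows) <= MIN_BLOCK:
            return process_rbar(rows)
        mid = len(rows) // 2
        g1, b1 = process_adaptive(rows[:mid])
        g2, b2 = process_adaptive(rows[mid:])
        return g1 + g2, b1 + b2
```

With these assumed numbers, a single bad row in a 25k block costs roughly log2(25000/100) ≈ 8 retried set-based inserts plus one small RBAR pass, instead of 25k single-row operations — which is exactly the time-hit trade-off the post asks about.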
danl-971512
Grasshopper
Group: General Forum Members
Points: 17 Visits: 26
Hello,

While I agree with your approach - the basic concepts are sound - I disagree with one thing: processing the data on the way in to the data warehouse. I've been teaching, talking, and discussing the nature of data warehouses for over 15 years now, and today's data warehouse has become a system of record, due in part to compliance requirements.

This means that the good, the bad, and the ugly data all need to make it into the data warehouse, regardless of what they look like. It also means that "processing" the data according to business rules and functions is now moved downstream (on the way out to the data marts, the cubes, the star schemas, etc.).

This does a few things:
1) set-based processing is in use for all data across the entire warehouse
2) all load routines are parallel, and can be partitioned if necessary
3) load performance should be upwards of 100,000 to 250,000 rows per second - making it easy to load 1 billion+ rows in 45 minutes or less (depending, of course, on the hardware)
4) restartability is inherent, as long as set-based logic is in place

and so on... The bottom line is to move raw data into the warehouse; the other key point is that the architecture of the receiving tables in the warehouse is vitally important. This is where the Data Vault model and methodology come into play.

I'm frequently engaged to correct performance and tuning of large-scale systems, and since Microsoft is now there with SQL Server 2008 R2 (and Microsoft is interested in the Data Vault model), I would suggest you reexamine the way you load information (i.e., putting processing of the data upstream of the data warehouse).

You can see more about this high-speed parallel approach at: http://www.DataVaultInstitute.com (free to register), or my profile on http://www.LinkedIn.com/in/dlinstedt

Also, keep in mind this is not some fly-by-night idea. We've got a huge number of corporations following these principles with great success, including JP Morgan Chase, SNS Bank, World Bank, ABN AMRO, Daimler, Edmonton Police Services, Dept of Defense, US Navy, US Army, FAA, FDA, City of Charlotte NC, Tyson Foods, Nutreco, and many more....

Cheers,
Dan Linstedt
MelS-512196
SSC-Addicted
Group: General Forum Members
Points: 436 Visits: 448
Thanks for the thoughts on dynamic logic to change the size of the sets processed to handle errors. Interesting idea.

While I appreciate the comments on other ways to handle large data loads, not every project is the same scale or has the budget and/or time to implement specialized tools. What I think is important is the ability to be flexible and come up with the right approach for each project...
sknox
SSCrazy
Group: General Forum Members
Points: 2163 Visits: 2755
I actually think Dan Linstedt had a good point, even if it was lost in the marketing-speak of the rest of his post. To wit: why do you have constraints on the DB you're importing into?

Note that I'm not saying you shouldn't have these constraints -- it depends on what you're using this DB for. But your article doesn't explain why there are constraints that could cause the import to fail.

The entire system might be more efficient if you allowed the bad records to import and re-cast these constraints as reports which showed records which need correction in the OLTP system -- or you may have more reports that can't include bad records for any reason, so you need to catch the bad records before they get into the DB. We don't know.

In any case, this try-catch method is a good idea when you have a set-based process that might have to fall back to RBAR, and I too am interested in whether the binary "divide-and-conquer" approach might be worth the time it'd take to develop.
PeeEs
Forum Newbie
Group: General Forum Members
Points: 1 Visits: 7
I had to work on an ETL system that read from a bunch of sources. After the data was imported into the staging table, I ran a bunch of set-based validations, marked the bad records invalid, and kept track of the reasons why they were bad. After the validations were done, I imported only the good records and sent an email with the list of bad records and the reasons.
Some of my set-based validations involved calls to a CLR function, which made them slower, but the granularity of the error report made it well worth it, and the process was still 10-15 times faster than the ETL process it replaced.
Let me also add that this is not a run-every-15-minutes kind of application; it processes data whenever the clients upload it.
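The validate-then-report pattern described above can be sketched as follows. In the original this was T-SQL against a staging table, where each rule would typically be a single set-based UPDATE that stamps a reason column; the Python names, the rule list, and the row shape here are all illustrative assumptions.

```python
# Sketch of the staging-table pattern described above: run each
# validation rule over the whole set, tag bad records with the
# reason(s), import only the clean records, and report the rest.

# Each rule is (reason, predicate-that-flags-a-bad-row).
RULES = [
    ("missing customer id", lambda r: not r.get("customer_id")),
    ("negative amount",     lambda r: r.get("amount", 0) < 0),
]

def validate(staging_rows):
    """Split staging rows into importable rows and rejects with reasons."""
    good, bad = [], []
    for row in staging_rows:
        reasons = [reason for reason, is_bad in RULES if is_bad(row)]
        if reasons:
            # In SQL this would be: UPDATE staging SET errors = ... WHERE <rule>
            bad.append({**row, "errors": "; ".join(reasons)})
        else:
            good.append(row)
    return good, bad
```

The payoff the post describes is the granularity: every reject arrives in the error report with the specific rule(s) it broke, rather than a single opaque insert failure.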
joeharris76
Forum Newbie
Group: General Forum Members
Points: 1 Visits: 21
What he said, except for the Data Vault part… Wink

Seriously, in a data warehouse we always 'land' loads like this in a staging table with minimal constraints. You can then pick and choose what to do with it.

I generally want things to be so loose that it won't fail unless the load file is unusable.
michael.welcome
SSC Rookie
Group: General Forum Members
Points: 35 Visits: 180
Two things to add.

One is that if you built this import routine in a recursive fashion, you could have something like:

exec ImportRecords @StartRecord, @EndRecord, @BlockSize

The first level of recursion would just break the data into segments based on the block size and call back to the SP recursively in a loop. After the first level of recursion, @EndRecord - @StartRecord <= @BlockSize; in that case the code would insert the data within a try-catch block.

Then, if you made the block size a power of 10 (1,000 or 10,000), each level of error handling within the recursion would have a loop that would break the errored block into 10 pieces, set @BlockSize = @BlockSize / 10, and retry. If there was only one error in a block of 10,000 records, this would recurse down into a loop that processed 1,000 records at a time. The 9 good blocks would have no errors, and those 9,000 good records would process in the third layer of recursion.

Then the remaining errored block of 1,000 records would process in a fourth layer of recursion, 100 records at a time. This would continue until errors were caught with @BlockSize = 1; in that case you would just log the error rather than recurse down again.

Of course you would also need some kind of check in the loop to make sure that the recursive value of @EndRecord never exceeded the passed in value of @EndRecord.
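The recursion above can be sketched compactly. The `ImportRecords` signature is from the post; everything else here (the function names, the stand-in bad-row rule, logging by row position) is a hypothetical Python rendering of what would be a T-SQL stored procedure.

```python
# Sketch of the recursive power-of-10 scheme described above,
# mirroring "exec ImportRecords @StartRecord, @EndRecord, @BlockSize":
# insert each block inside try/catch; on error, split that block into
# pieces a tenth of the size and recurse, logging individual bad rows
# once the block size reaches 1.

def insert_rows(rows):
    """Stand-in for the set-based insert; raises on any bad row."""
    if any(r < 0 for r in rows):
        raise ValueError("bad row")

def import_records(rows, start, end, block_size, error_log):
    # Guard from the text: never run past the caller's end point.
    end = min(end, len(rows))
    for lo in range(start, end, block_size):
        hi = min(lo + block_size, end)
        try:
            insert_rows(rows[lo:hi])
        except ValueError:
            if block_size == 1:
                error_log.append(lo)  # log the single bad row, stop recursing
            else:
                import_records(rows, lo, hi, block_size // 10, error_log)
```

With one bad row in 10,000, this touches one block of 10,000, ten of 1,000, ten of 100, ten of 10, and ten single rows — a handful of set-based retries rather than a full RBAR pass.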

The second point is that many error conditions could be found using set-based operations to set an error status prior to attempting to insert the records.
MelS-512196
SSC-Addicted
Group: General Forum Members
Points: 436 Visits: 448
These posts all have good things to say about considerations when loading data. The example in this article was meant to show a particular way of using sets with try-catch and can be thought of as a basis for building more elaborate systems, not as a complete solution. I agree with trying to handle exceptions as early in the process as possible, in a staging or landing area. I also agree with keeping constraints to a minimum in a data mart. But I also generally go by the rule that everything can break at some time and everything should have exception handling.
admin-499013
SSC Rookie
Group: General Forum Members
Points: 25 Visits: 114
A really good point, well made. As Joe says, using a staging area is best practice. However staging will never fully insulate you from the real world.

In fact Oracle Warehouse Builder offers this as a standard feature (which does not make up for all the other issues with OWB/Oracle), but the approach outlined is clearly best practice - if a lot more work.

Thanks for putting in the effort to write this up.