Conditional Set-Based Processing: Moving Towards a Best Practice for ETL
Posted Sunday, November 22, 2009 5:51 PM
SSC-Addicted

Comments posted to this topic are about the item Conditional Set-Based Processing: Moving Towards a Best Practice for ETL
Post #823043
Posted Monday, November 23, 2009 1:58 AM
SSC Veteran

Thanks for the approach, which makes a lot of sense.
Has anyone done any testing to see whether it would be worth taking an adaptive approach to this? What I mean is: after a failure in a block (of, say, 25k rows), rather than reverting to RBAR for the whole block, try something like a binary-chop approach to locate the bad row(s). A variation on this would be to adapt the block size dynamically - start large; on an error, try half that size and repeat until a preset minimum size is reached, which would then be processed RBAR. If a block gets through successfully, increase the size for the next set.
We've done this kind of thing in serial comms before and I wonder whether it would work here.
The key factors, I think, are the ratio of the time to process a block to the time to do it RBAR (for different sizes) and - the more variable factor, which depends on your data - how often you expect to get errors and what their pathology is: are they likely to be solitary, clumped in groups, etc.? Some of this will only be learnt over time with your data. These factors will determine whether, having failed a block of 20k rows, it is worth taking the time hit to try two lots of 10k or to drop straight into RBAR for the 20k.
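A minimal sketch of the adaptive version being described, assuming a numbered staging table, a target table and an error log (dbo.Staging with a RowID column, dbo.Target and dbo.LoadErrors are all hypothetical names, as are the columns):

-- Adaptive block-size load: halve the block on an error, grow it again after a
-- clean block, and log (or drop to RBAR) once the minimum size is reached.
DECLARE @BlockSize int = 25000,       -- starting block size
        @MinBlock  int = 100,         -- below this, stop halving
        @MaxBlock  int = 25000,
        @FirstRow  bigint = 1,
        @LastRow   bigint;

SELECT @LastRow = MAX(RowID) FROM dbo.Staging;

WHILE @FirstRow <= @LastRow
BEGIN
    BEGIN TRY
        INSERT INTO dbo.Target (Col1, Col2)
        SELECT Col1, Col2
        FROM   dbo.Staging
        WHERE  RowID BETWEEN @FirstRow AND @FirstRow + @BlockSize - 1;

        -- Clean block: move on and grow the block size back towards the maximum.
        SET @FirstRow  = @FirstRow + @BlockSize;
        SET @BlockSize = CASE WHEN @BlockSize * 2 > @MaxBlock
                              THEN @MaxBlock ELSE @BlockSize * 2 END;
    END TRY
    BEGIN CATCH
        IF @BlockSize > @MinBlock
            SET @BlockSize = @BlockSize / 2;   -- binary chop: retry the same range at half the size
        ELSE
        BEGIN
            -- At the minimum size, log the range (or process just these rows RBAR)
            -- and step past it so the rest of the load can continue.
            INSERT INTO dbo.LoadErrors (FirstRow, LastRow, ErrorMessage)
            VALUES (@FirstRow, @FirstRow + @BlockSize - 1, ERROR_MESSAGE());
            SET @FirstRow = @FirstRow + @BlockSize;
        END
    END CATCH
END

Whether the halving pays off still comes down to the ratios described above: how expensive a failed set-based attempt is versus how rare and how clustered the bad rows are.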
Post #823133
Posted Monday, November 23, 2009 3:35 AM
Forum Newbie

Hello,

While I agree with your approach - the basic concepts are sound - I disagree with one thing: processing the data on the way into the data warehouse. I've been teaching, talking about, and discussing the nature of data warehouses for over 15 years now, and today's data warehouse has become a system of record, due in part to the need for compliance.

This means that the good, the bad, and the ugly data all need to make it into the data warehouse, regardless of what they look like. It also means that "processing" the data according to business rules and functions is now moved downstream (on the way out to the data marts, the cubes, the star schemas, etc.).

This does a few things:
1) Set-based processing is in use for all data across the entire warehouse.
2) All load routines are parallel, and can be partitioned if necessary.
3) Load performance should be upwards of 100,000 to 250,000 rows per second - making it easy to load 1 billion+ rows in 45 minutes or less (this depends on the hardware, of course).
4) Restartability is inherent, as long as set-based logic is in place.

And so on. The bottom line is to move raw data into the warehouse; the other bottom line is that the architecture of the receiving tables in the warehouse is vitally important. This is where the Data Vault modeling and methodology come into play.
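As a rough illustration of that raw-in, rules-downstream split (not of Data Vault modeling itself), the sketch below lands the feed in a constraint-free table and applies the business rules later, on the way out to the mart; every table and column name here is hypothetical:

-- Raw landing table: wide, nullable, no business-rule constraints, so the
-- inbound load is a straight set-based insert that does not fail on bad data.
CREATE TABLE dbo.Sales_Raw
(
    LoadID      int         NOT NULL,
    LoadDate    datetime    NOT NULL DEFAULT GETDATE(),
    CustomerRef varchar(50) NULL,
    SaleAmount  varchar(50) NULL,   -- kept as text; conversion happens downstream
    SaleDate    varchar(50) NULL
);

-- The rules move downstream, on the way out to the mart, where they can still
-- be expressed as one set-based statement.
-- (ISNUMERIC/ISDATE are crude filters; stricter checks may be needed in practice.)
INSERT INTO mart.FactSales (CustomerKey, SaleAmount, SaleDate)
SELECT c.CustomerKey,
       CONVERT(decimal(18,2), r.SaleAmount),
       CONVERT(date, r.SaleDate)
FROM   dbo.Sales_Raw r
JOIN   mart.DimCustomer c ON c.CustomerRef = r.CustomerRef
WHERE  ISNUMERIC(r.SaleAmount) = 1
  AND  ISDATE(r.SaleDate) = 1;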

I'm frequently engaged to correct the performance and tuning of large-scale systems, and since Microsoft is now there with SQL Server 2008 R2 (and Microsoft is interested in the Data Vault model), I would suggest you re-examine the way you load information (i.e. putting the processing of the data upstream of the data warehouse).

You can see more about this high-speed parallel approach at: http://www.DataVaultInstitute.com (free to register), or my profile on http://www.LinkedIn.com/in/dlinstedt

Also, keep in mind this is not some fly-by-night idea. We've got a huge number of corporations following these principles with great success, including: JP Morgan Chase, SNS Bank, World Bank, ABN-AMRO, Daimler, Edmonton Police Services, Dept of Defense, US Navy, US Army, FAA, FDA, City of Charlotte NC, Tyson Foods, Nutreco, and many, many more.

Cheers,
Dan Linstedt
Post #823162
Posted Monday, November 23, 2009 6:19 AM
SSC-Addicted

Thanks for the thoughts on dynamically changing the size of the sets processed in order to handle errors. Interesting idea.

While I appreciate the comments on other ways to handle large data loads, not every project is the same scale or has the budget and/or time to implement specialized tools. What I think is important is the ability to be flexible and come up with the right approach for each project...
Post #823215
Posted Monday, November 23, 2009 8:36 AM
Ten Centuries

I actually think Dan Linstedt had a good point, even if it was lost in the marketing-speak of the rest of his post. To wit: why do you have constraints on the DB you're importing into?

Note that I'm not saying you shouldn't have these constraints -- it depends on what you're using this DB for. But your article doesn't explain why there are constraints that could cause the import to fail.

The entire system might be more efficient if you allowed the bad records to import and re-cast those constraints as reports showing the records that need correction in the OLTP system - or you may have reports that can't include bad records for any reason, in which case you need to catch the bad records before they get into the DB. We don't know.
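As a hypothetical example of re-casting a constraint as a report, a NOT NULL or foreign-key rule can become a query that lists the offending rows for correction in the source system instead of failing the load (table and column names are invented for illustration):

-- The rows load anyway; this query reports the ones that would have violated the rules.
SELECT s.RowID,
       CASE
           WHEN s.CustomerRef IS NULL       THEN 'Missing customer reference'
           WHEN c.CustomerKey IS NULL       THEN 'Customer not found in DimCustomer'
           WHEN ISNUMERIC(s.SaleAmount) = 0 THEN 'Non-numeric sale amount'
       END AS Problem
FROM   dbo.Sales_Raw s
LEFT JOIN mart.DimCustomer c ON c.CustomerRef = s.CustomerRef
WHERE  s.CustomerRef IS NULL
   OR  c.CustomerKey IS NULL
   OR  ISNUMERIC(s.SaleAmount) = 0;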

In any case, this try-catch method is a good idea when you have a set-based process that might have to fall back to RBAR, and I too am interested in whether the binary "divide-and-conquer" approach might be worth the time it'd take to develop.
Post #823296
Posted Monday, November 23, 2009 8:52 AM
Forum Newbie

I had to work on an ETL system that read from a bunch of sources. After the data was imported into the staging table, I ran a bunch of set-based validations, marked the invalid records, and kept track of the reasons why they were bad. Once the validations were done, I imported only the good records and sent an email with the list of bad records and the reasons.
Some of my set-based validations involved calls to a CLR function, which made it slower, but the granularity of the error report made it well worth it, and it was still 10-15 times faster than the ETL process it replaced.
Let me also add that this is not a run-every-15-minutes kind of application; it processes data whenever the clients upload it.
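A minimal sketch of that pattern, assuming hypothetical IsValid (defaulting to 1) and FailReason columns on the staging table, and leaving the CLR checks out:

-- Set-based validation passes: mark the bad rows and record why they failed.
UPDATE dbo.Staging
SET    IsValid = 0, FailReason = 'Missing customer reference'
WHERE  CustomerRef IS NULL;

UPDATE s
SET    s.IsValid = 0, s.FailReason = 'Unknown customer reference'
FROM   dbo.Staging s
LEFT JOIN mart.DimCustomer c ON c.CustomerRef = s.CustomerRef
WHERE  s.IsValid = 1
  AND  c.CustomerKey IS NULL;

-- Import only the rows that passed every check...
INSERT INTO mart.FactSales (CustomerKey, SaleAmount, SaleDate)
SELECT c.CustomerKey, s.SaleAmount, s.SaleDate
FROM   dbo.Staging s
JOIN   mart.DimCustomer c ON c.CustomerRef = s.CustomerRef
WHERE  s.IsValid = 1;

-- ...and hand back the rest as the error report (e.g. the body of the e-mail).
SELECT RowID, FailReason
FROM   dbo.Staging
WHERE  IsValid = 0;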
Post #823313
Posted Monday, November 23, 2009 12:41 PM
Forum Newbie

What he said, except for the Data Vault part… ;)

Seriously, in a data warehouse we always 'land' loads like this in a staging table with minimal constraints. You can then pick and choose what to do with it.

I generally want things to be so loose that it won't fail unless the load file is unusable.
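In practice that can be as simple as an all-text landing table and a bulk load that tolerates the occasional bad line; the names, path, and thresholds below are made up for illustration:

-- 'Loose' landing: every column is plain text and nullable, so about the only
-- thing that can stop the load is an unreadable file.
CREATE TABLE stage.Sales_Landing
(
    CustomerRef varchar(max) NULL,
    SaleAmount  varchar(max) NULL,
    SaleDate    varchar(max) NULL
);

BULK INSERT stage.Sales_Landing
FROM 'C:\loads\sales.csv'                      -- hypothetical file path
WITH (FIELDTERMINATOR = ',',
      ROWTERMINATOR   = '\n',
      MAXERRORS       = 100,                   -- tolerate a few malformed lines
      ERRORFILE       = 'C:\loads\sales.err'); -- and keep them for review

What to do with the data then becomes a separate, downstream decision, as the earlier posts describe.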
Post #823460
Posted Monday, November 23, 2009 1:02 PM
Grasshopper

Two things to add.

One is that if you built this import routine in a recursive fashion, you could have something like:

exec ImportRecords @StartRecord, @EndRecord, @BlockSize.

The first level of recursion would just break the data into segments based on the block size and call back into the SP recursively in a loop. After the first level of recursion, @EndRecord - @StartRecord <= @BlockSize; when that is the case, the code would insert the data within a TRY/CATCH block.

Then, if you made the block size a power of 10 (1,000 or 10,000), each level of error handling within the recursion would have a loop that breaks the errored block into 10 pieces, sets @BlockSize = @BlockSize / 10, and retries. If there was only one error in a block of 10,000 records, this would recurse down into a loop that processes 1,000 records at a time. The 9 good blocks would have no errors, and those 9,000 good records would process in the third layer of recursion.

Then the errored block of 1,000 remaining records would process in a fourth layer of recursion, 100 records at a time. This would continue until an error was caught with @BlockSize = 1; in that case you would just log the error rather than recursing down again.

Of course, you would also need some kind of check in the loop to make sure that the recursive value of @EndRecord never exceeded the passed-in value of @EndRecord. A sketch of this shape follows below.
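Here is a minimal sketch of that recursive shape, reusing the hypothetical dbo.Staging, dbo.Target and dbo.LoadErrors names from earlier (note SQL Server caps procedure nesting at 32 levels, which a divide-by-ten scheme stays well inside):

-- Hypothetical recursive loader: split into @BlockSize chunks, insert each in a
-- TRY/CATCH, and on error recurse into the failed chunk with @BlockSize / 10.
CREATE PROCEDURE dbo.ImportRecords
    @StartRecord bigint,
    @EndRecord   bigint,
    @BlockSize   int
AS
BEGIN
    SET NOCOUNT ON;
    IF @BlockSize < 1 SET @BlockSize = 1;   -- guard for block sizes that are not powers of 10

    IF @EndRecord - @StartRecord + 1 > @BlockSize
    BEGIN
        -- Splitting level: walk the range one block at a time, never past @EndRecord.
        DECLARE @From bigint = @StartRecord, @To bigint;
        WHILE @From <= @EndRecord
        BEGIN
            SET @To = @From + @BlockSize - 1;
            IF @To > @EndRecord SET @To = @EndRecord;
            EXEC dbo.ImportRecords @From, @To, @BlockSize;
            SET @From = @To + 1;
        END
        RETURN;
    END

    -- Leaf level: the range fits in one block, so try it as a single set.
    BEGIN TRY
        INSERT INTO dbo.Target (Col1, Col2)
        SELECT Col1, Col2
        FROM   dbo.Staging
        WHERE  RowID BETWEEN @StartRecord AND @EndRecord;
    END TRY
    BEGIN CATCH
        IF @BlockSize = 1
            -- Down to a single row: log it rather than recursing again.
            INSERT INTO dbo.LoadErrors (FirstRow, LastRow, ErrorMessage)
            VALUES (@StartRecord, @EndRecord, ERROR_MESSAGE());
        ELSE
            -- Retry the failed range in blocks one tenth of the size.
            EXEC dbo.ImportRecords @StartRecord, @EndRecord, @BlockSize / 10;
    END CATCH
END

Something like "exec dbo.ImportRecords 1, 1000000, 10000" would then load a million rows in 10,000-row blocks, drilling into any failed block ten times more finely at each level.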

The second point is that many error conditions could be found by using set-based operations to set an error status prior to attempting to insert the records.
Post #823472
Posted Monday, November 23, 2009 1:19 PM
SSC-Addicted

These posts all have good things to say about considerations when loading data. The example in this article was meant to show a particular way of using sets with try-catch and can be thought of as a basis for building more elaborate systems, not as a complete solution. I agree with trying to handle exceptions as early in the process as possible, in a staging or landing area. I also agree with keeping constraints to a minimum in a data mart. But I also generally go by the rule that everything can break at some time and everything should have exception handling.
Post #823483
Posted Monday, November 23, 2009 1:25 PM
Grasshopper

A really good point, well made. As Joe says, using a staging area is best practice. However, staging will never fully insulate you from the real world.

In fact, Oracle Warehouse Builder offers this as a standard feature (which does not make up for all the other issues with OWB/Oracle), but the approach outlined here is clearly best practice - if a lot more work.

Thanks for putting in the effort to write this up.


Post #823488