Conditional Set-Based Processing: Moving Towards a Best Practice for ETL

  • Comments posted to this topic are about the item Conditional Set-Based Processing: Moving Towards a Best Practice for ETL

  • Thanks for the approach, which makes a lot of sense.

    Has anyone done any testing to see whether it would be worth taking an adaptive approach to this? What I mean is, after a failure in a block (of 25k, say), rather than reverting to RBAR for the whole block, try something like a binary-chop approach to locate the bad row(s). A variation on this would be to adapt the block size dynamically: start large; on an error, try half that size and repeat until a preset minimum size is reached, which would be processed RBAR. If a block gets through successfully, increase the size for the next set.

    We've done this kind of thing in serial comms before and I wonder if it would work here.

    The key factors, I think, are the ratio of the time to process a block versus the time to do it RBAR (for different sizes) and, as the more variable factor that depends on your data, how often you expect to get errors and what their pathology is: are they likely to be solitary, clumped in groups, etc.? Some of this will only be learnt over time with your data. These factors will determine whether, having had a block of 20k rows fail, it is worth taking the time hit to try two lots of 10k or to drop straight into RBAR for the 20k.
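
    Something along these lines might be a starting point (an untested sketch; the staging table, target table, and RBAR fallback procedure are made-up names):

    DECLARE @BlockSize int = 25000,   -- starting block size
            @MinBlock  int = 1000,    -- below this, fall back to RBAR
            @StartRow  int = 1,
            @MaxRow    int;

    SELECT @MaxRow = MAX(RowID) FROM dbo.StageRows;        -- hypothetical staging table

    WHILE @StartRow <= @MaxRow
    BEGIN
        BEGIN TRY
            INSERT INTO dbo.Target (Col1, Col2)             -- hypothetical target
            SELECT Col1, Col2
            FROM dbo.StageRows
            WHERE RowID BETWEEN @StartRow AND @StartRow + @BlockSize - 1;

            SET @StartRow  = @StartRow + @BlockSize;
            -- success: grow the block again, capped at the starting size
            SET @BlockSize = CASE WHEN @BlockSize * 2 > 25000 THEN 25000 ELSE @BlockSize * 2 END;
        END TRY
        BEGIN CATCH
            IF @BlockSize > @MinBlock
                SET @BlockSize = @BlockSize / 2;            -- halve and retry the same range
            ELSE
            BEGIN
                -- at the minimum size, process this block RBAR (procedure not shown)
                DECLARE @EndRow int = @StartRow + @BlockSize - 1;
                EXEC dbo.ImportBlockRbar @StartRow, @EndRow;
                SET @StartRow = @EndRow + 1;
            END
        END CATCH
    END;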

  • Hello,

    While I agree with your approach - the basic concepts are sound - I disagree with one thing: processing the data on the way into the data warehouse. I've been teaching, talking about, and discussing the nature of Data Warehouses for over 15 years now, and today's data warehouse has become a system of record, due in part to the need for compliance.

    This means that the good, the bad, and the ugly data need to make it into the data warehouse, regardless of what it looks like. It also means that "processing" the data according to business rules and functions is now moved downstream (on the way out to the data marts, the cubes, the star schemas, etc.).

    This does a few things:

    1) Set based processing is in use for all data across the entire warehouse

    2) All load routines are parallel and can be partitioned if necessary

    3) Load performance should be upwards of 100,000 to 250,000 rows per second - making it easy to load 1 billion+ rows in 45 minutes or less (of course this depends on the hardware)

    4) Restartability is inherent, as long as set based logic is in place

    and so on... The bottom line is to move raw data into the warehouse; the other bottom line is that the architecture of the receiving tables in the warehouse is vitally important. This is where the Data Vault Modeling and methodology come into play.

    I'm frequently engaged to correct the performance and tuning of large-scale systems, and since Microsoft is now there with SQL Server 2008 R2 (and is interested in the Data Vault Model), I would suggest you re-examine the way you load information (i.e., putting processing of the data upstream of the data warehouse).

    You can see more about this high-speed parallel approach at: http://www.DataVaultInstitute.com (free to register), or my profile on http://www.LinkedIn.com/in/dlinstedt

    Also, keep in mind this is not some fly-by-night idea. We've got a huge number of corporations following these principles with great success, including: JP Morgan Chase, SNS Bank, World Bank, ABN-AMRO, Daimler Auto, Edmonton Police Services, Dept of Defense, US Navy, US Army, FAA, FDA, City of Charlotte NC, Tyson Foods, Nutreco, and many, many more....

    Cheers,

    Dan Linstedt

  • Thanks for the thoughts on dynamic logic to change the size of the sets processed to handle errors. Interesting idea.

    While I appreciate the comments on other ways to handle large data loads, not every project is the same scale or has the budget and/or time to implement specialized tools. What I think is important is the ability to be flexible and come up with the right approach for each project...

  • I actually think Dan Linstedt had a good point, even if it was lost in the marketing-speak of the rest of his post. To wit: why do you have constraints on the DB you're importing into?

    Note that I'm not saying you shouldn't have these constraints -- it depends on what you're using this DB for. But your article doesn't explain why there are constraints that could cause the import to fail.

    The entire system might be more efficient if you allowed the bad records to import and re-cast these constraints as reports showing the records that need correction in the OLTP system -- or you may have other reports that can't include bad records for any reason, so you need to catch the bad records before they get into the DB. We don't know.
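
    As a hypothetical illustration (the table and column names are invented), a foreign-key constraint could be recast as a report of rows needing correction:

    -- Instead of a foreign key that fails the load, report the orphans
    SELECT s.RowID, s.CustomerID, 'Unknown CustomerID' AS Issue
    FROM dbo.SalesImport AS s
    LEFT JOIN dbo.Customer AS c
           ON c.CustomerID = s.CustomerID
    WHERE c.CustomerID IS NULL;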

    In any case, this try-catch method is a good idea when you have a set-based process that might have to fall back to RBAR, and I too am interested in whether the binary "divide-and-conquer" approach might be worth the time it'd take to develop.

  • I had to work on an ETL system that read from a bunch of sources. After the data was imported into the staging table, I ran a bunch of set-based validations, marked the invalid records, and kept track of the reasons why they were bad. After the validations were done, I imported only the good records and sent an email with the list of bad records and the reasons.

    Some of my set-based validations involved calls to a CLR function, which made it slower, but the granularity of the error report made it well worth it, and it was still 10-15 times faster than the ETL process it replaced.

    Let me also add that this is not a run-every-15-minutes kind of application; it processes data whenever the clients upload it.
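
    In rough T-SQL terms the pattern looks something like this (a simplified sketch; the tables, columns, and rules are illustrative only):

    -- Flag invalid rows and record why; one UPDATE per rule keeps it set based
    UPDATE s SET IsValid = 0, FailReason = 'Missing order date'
    FROM dbo.Staging AS s
    WHERE s.OrderDate IS NULL;

    UPDATE s SET IsValid = 0, FailReason = 'Unknown product code'
    FROM dbo.Staging AS s
    LEFT JOIN dbo.Product AS p ON p.ProductCode = s.ProductCode
    WHERE p.ProductCode IS NULL AND s.IsValid = 1;

    -- Import only the rows that passed every check
    INSERT INTO dbo.Orders (OrderDate, ProductCode, Quantity)
    SELECT OrderDate, ProductCode, Quantity
    FROM dbo.Staging
    WHERE IsValid = 1;

    -- Rows with IsValid = 0 and their FailReason feed the e-mailed error report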

  • What he said, except for the Data Vault part… πŸ˜‰

    Seriously, in a data warehouse we always 'land' loads like this in a staging table with minimal constraints. You can then pick and choose what to do with it.

    I generally want things to be so loose that the load won't fail unless the load file is unusable.
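
    A deliberately loose landing table might be little more than nullable character columns (a sketch with invented names):

    -- Everything lands as nullable varchar; no keys, no constraints.
    -- Typing and rule checks happen in later steps, on the way out.
    CREATE TABLE dbo.Load_Sales_Stage
    (
        SourceRow    int IDENTITY(1,1),
        OrderNumber  varchar(50)   NULL,
        OrderDate    varchar(50)   NULL,
        CustomerCode varchar(50)   NULL,
        Amount       varchar(50)   NULL,
        RawLine      varchar(4000) NULL   -- optionally keep the record as received
    );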

  • Two things to add.

    One is that if you built this import routine in a recursive fashion, you could have something like:

    exec ImportRecords @StartRecord, @EndRecord, @BlockSize;

    The first level of recursion would just break the data into segments based on the block size and call back to the SP recursively in a loop. After the first level of recursion, @EndRecord - @StartRecord <= @BlockSize; when that is the case, the code inserts the data within a TRY...CATCH block.

    Then, if you made the block size a power of 10 (1000 or 10000), each level of error handling within the recursion would have a loop that breaks the errored block into 10 pieces, sets @BlockSize = @BlockSize / 10, and retries. If there was only one error in a block of 10000 records, this would recurse down into a loop that processed 1000 records at a time. The 9 good blocks would have no errors, and the 9000 good records would process in the third layer of recursion.

    Then the block of 1000 remaining records in the errored block would process in a fourth layer of recursion, 100 records at a time. This would continue until errors were caught with @BlockSize = 1; at that point you would just log the error rather than recurse down again.

    Of course you would also need some kind of check in the loop to make sure that the recursive value of @EndRecord never exceeded the passed in value of @EndRecord.
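
    A skeleton of that recursion might look like the following (an untested sketch; the staging table, target table, and error log are placeholders, and it assumes @BlockSize starts as a power of 10 as described above):

    CREATE PROCEDURE dbo.ImportRecords
        @StartRecord int,
        @EndRecord   int,
        @BlockSize   int
    AS
    BEGIN
        IF @EndRecord - @StartRecord + 1 > @BlockSize
        BEGIN
            -- Break the range into blocks and recurse into each one
            DECLARE @BlockStart int = @StartRecord, @BlockEnd int;
            WHILE @BlockStart <= @EndRecord
            BEGIN
                -- never let the recursive end exceed the passed-in @EndRecord
                SET @BlockEnd = CASE WHEN @BlockStart + @BlockSize - 1 > @EndRecord
                                     THEN @EndRecord
                                     ELSE @BlockStart + @BlockSize - 1 END;
                EXEC dbo.ImportRecords @BlockStart, @BlockEnd, @BlockSize;
                SET @BlockStart = @BlockEnd + 1;
            END
            RETURN;
        END

        BEGIN TRY
            INSERT INTO dbo.Target (Col1, Col2)              -- placeholder target
            SELECT Col1, Col2
            FROM dbo.Staging                                 -- placeholder staging table
            WHERE RecordID BETWEEN @StartRecord AND @EndRecord;
        END TRY
        BEGIN CATCH
            IF @BlockSize > 1
            BEGIN
                -- retry the failed block in ten smaller pieces
                DECLARE @NewSize int =
                    CASE WHEN @BlockSize / 10 < 1 THEN 1 ELSE @BlockSize / 10 END;
                EXEC dbo.ImportRecords @StartRecord, @EndRecord, @NewSize;
            END
            ELSE
                -- down to a single record: log it instead of recursing again
                INSERT INTO dbo.ImportErrors (RecordID, ErrorMessage)
                VALUES (@StartRecord, ERROR_MESSAGE());
        END CATCH
    END;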

    The second point is that many error conditions could be found using set-based operations to set an error status prior to attempting to insert the records.

  • These posts all have good things to say about considerations when loading data. The example in this article was meant to show a particular way of using sets with try-catch and can be thought of as a basis for building more elaborate systems, not as a complete solution. I agree with trying to handle exceptions as early in the process as possible, in a staging or landing area. I also agree with keeping constraints to a minimum in a data mart. But I also generally go by the rule that everything can break at some time and everything should have exception handling.

  • A really good point, well made. As Joe says, using a staging area is best practice. However, staging will never fully insulate you from the real world.

    In fact, Obstacle Warehouse Builder offers this as a standard feature (which does not make up for all the other issues with OWB/Oracle), but the approach outlined is clearly best practice - if a lot more work.

    Thanks for putting in the effort to write this up.

    Pete

  • This is great and very useful info.

    Angela

  • Hi Folks,

    I understand what is desired. Please don't get me wrong: set based processing has been around for a long long time. What is important here is to note the following:

    1) The size of the source and target doesn't matter - whether big or small, set based processing is important.

    2) Using cursors and iterating over rows runs much slower than using "array" or block style commands, regardless of whether it's written in ETL (SSIS) or SQL stored procedures, or other ETL engine technologies.

    3) Speed is important when moving data to the warehouse from the staging area.

    4) Road-blocks to performance are often set up by the data modeling architecture (i.e., requiring business rules to be applied to the data when loading from stage to warehouse).

    5) Business rule processing can "break" the set based architectural approach

    6) Moving business rules downstream (hence putting RAW data in the data warehouse) allows the EDW to be compliant and auditable; it also leads to many opportunities to apply set based operations over block style commands.

    7) The set based processing commands for inserts (when dealing with RAW data) allow these pieces to be specific against index matches, which in turn allow for maximum parallelism and maximum partitioning.

    I've used many of these techniques on data warehouses ranging in size from 3 MB to 3 petabytes with great success. Again, you don't have to have a large data warehouse to see the benefits of set based processing; however, you DO need to move the "processing business rules" out of the way and allow the good, the bad, and the ugly data INTO the data warehouse - otherwise, you miss out on compliance and auditability.
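
    As a trivial illustration of point 2 above, the row-by-row version and the set-based version of the same load might look like this (hypothetical tables):

    -- Row-by-row (slow): a cursor inserting one row at a time
    DECLARE @Id int, @Amount money;
    DECLARE c CURSOR LOCAL FAST_FORWARD FOR
        SELECT Id, Amount FROM dbo.Staging;
    OPEN c;
    FETCH NEXT FROM c INTO @Id, @Amount;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        INSERT INTO dbo.FactSales (Id, Amount) VALUES (@Id, @Amount);
        FETCH NEXT FROM c INTO @Id, @Amount;
    END
    CLOSE c;
    DEALLOCATE c;

    -- Set based (fast): one statement moves the whole set
    INSERT INTO dbo.FactSales (Id, Amount)
    SELECT Id, Amount FROM dbo.Staging;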

    Hope this helps,

    Dan Linstedt

    DanL@DanLinstedt.com

  • Before I get started... the article is well written, clear, and easy to understand... I just disagree with the premise of having to resort to any form of RBAR for the simple task of importing and validating data in a high performance fashion.

    I'm happy to see that "RBAR" has become a household word. πŸ˜€ It's a shame that "BCP" has not. BCP does have some limitations but it's quite capable in most areas. For example, when you import into a staging table using BCP, if a particular "cell" of information doesn't adhere to the correct datatype or constraints (or whatever), it will cause the row to fail as expected. What most people don't know is that doesn't necessarily make the whole job fail. Nope... you can tell the job how many errors to allow.

    Now, get this... I tell it to allow ALL rows to fail. Why? Because with BCP (and, I believe, Bulk Insert as of 2k5) you can also tell it to log all error rows to a file for troubleshooting and possible repair. No CLR's are required, no Try Catch code is required, no shifting from bulk logic to RBAR, no fancy footwork is required. It's already built into BCP, will isolate only bad rows while continuing to import the good rows, and it's nasty fast. On a modest old server, I've seen it import, clean, and glean 5.1 million 20 column rows in 60 seconds flat. On a "modern" server, I'm sure it will be much faster. The only thing it won't do for you is the final merge from staging to production.
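
    For anyone who hasn't tried it, the BULK INSERT flavor of that idea looks roughly like this (the path, table name, and terminators are made up; bcp offers the equivalent -m and -e switches):

    BULK INSERT dbo.SalesStage
    FROM 'C:\Loads\sales.txt'
    WITH (
        FIELDTERMINATOR = '\t',
        ROWTERMINATOR   = '\n',
        FIRSTROW        = 2,                     -- start at row 2 to skip a header line
        MAXERRORS       = 100000,                -- high enough that bad rows never stop the load
        ERRORFILE       = 'C:\Loads\sales.err',  -- rejected rows land here for repair
        TABLOCK                                  -- take a table lock for bulk-load performance
    );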

    That brings us to the next subject, and several folks have already mentioned it. I never import directly to production tables. I always import to staging tables and, even if I've done as much error checking as possible with BCP, I always finish validating the data in the staging tables. Then I mark all new inserts, mark all rows that contain updates, and then do the inserts and updates as expected, which allows me to avoid all RBAR (and rollbacks due to data errors) even in the presence of the occasional data error.
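
    The final staging-to-production step might then be along these lines (a simplified sketch; the key and column names are invented):

    -- Mark what each staged row will become
    UPDATE s SET Action = CASE WHEN t.BusinessKey IS NULL THEN 'I' ELSE 'U' END
    FROM dbo.SalesStage AS s
    LEFT JOIN dbo.Sales AS t ON t.BusinessKey = s.BusinessKey;

    -- Apply the updates as one set
    UPDATE t SET Amount = s.Amount, SaleDate = s.SaleDate
    FROM dbo.Sales AS t
    JOIN dbo.SalesStage AS s ON s.BusinessKey = t.BusinessKey
    WHERE s.Action = 'U';

    -- Apply the inserts as one set
    INSERT INTO dbo.Sales (BusinessKey, Amount, SaleDate)
    SELECT BusinessKey, Amount, SaleDate
    FROM dbo.SalesStage
    WHERE Action = 'I';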

    Ah... yes... BCP doesn't have a GUI and, if you want it to run from T-SQL, you'll need to allow xp_CmdShell to run or to call it through some other vehicle. Heh... so what? Are you going to tell me that your ETL system is also public facing? If so (I'll reserve my personal opinion on that), then you're correct... use SSIS, CLR's, and Try Catch code that shifts to RBAR on failure... that is unless you happen to know how to use BULK INSERT to do the very same thing as BCP (and it will) in 2k5. πŸ˜›

    ETL just doesn't have to be complicated or slow.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.

  • I did the testing and it was easy.

    Jenny

  • Disagreements are good -- better ideas happen that way. One thing I think Jeff should take more seriously, though, is that some projects may have constraints that disallow certain techniques (xp_cmdshell, for one), whether we like it or not. Suppose the data load has to run at customer sites -- no DBA attending, no elevated privileges on customer machines... I have used bcp and bulk insert in a number of situations and do think they are valuable; they just may not fit every job.
