﻿<?xml version='1.0' encoding='UTF-8'?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>SQLServerCentral / Article Discussions / Article Discussions by Author / Discuss content posted by Mel Sansone  / Conditional Set-Based Processing: Moving Towards a Best Practice for ETL / Latest Posts</title><generator>InstantForum.NET v2.9.0</generator><description>SQLServerCentral</description><link>http://www.sqlservercentral.com/Forums/</link><webMaster>notifications@sqlservercentral.com</webMaster><lastBuildDate>Sat, 25 May 2013 01:24:08 GMT</lastBuildDate><ttl>20</ttl><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Just caught the question in the recent post.  If I can remember my own code correctly:-)  The reason the delete of the table variable must happen after the transaction commits is that the table variable data is used in the Catch clause if the transaction fails and rolls back.  Therefore it is deleted only after the transaction commits.</description><pubDate>Thu, 26 Aug 2010 08:13:06 GMT</pubDate><dc:creator>MelS-512196</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>[quote][b]wardster (8/21/2010)[/b][hr]Hi JeffThanks for the quick reply.The solution I'm in the middle of testing uses a permanent staging table rather than temp table or table variable.  Does this mean I should be deleting from it outside the insert transaction?  The staging table is brand new so I'm certain no other objects reference it apart from the stored proc with the delete in.[/quote]First, unless you're trying to preserve the value of an IDENTITY column, I wouldn't use DELETE... I'd use TRUNCATE.  It's a whole lot faster and easier on the LOG.Second, same thing applies.  You'd have to do the DELETE or TRUNCATE outside the transaction or it may undo the DELETE or TRUNCATE if you rollback.  Of course, that may be just what you want so the real answer is "It Depends".  :-D</description><pubDate>Sun, 22 Aug 2010 13:36:30 GMT</pubDate><dc:creator>Jeff Moden</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Hi JeffThanks for the quick reply.The solution I'm in the middle of testing uses a permanent staging table rather than temp table or table variable.  Does this mean I should be deleting from it outside the insert transaction?  The staging table is brand new so I'm certain no other objects reference it apart from the stored proc with the delete in.</description><pubDate>Sat, 21 Aug 2010 14:08:09 GMT</pubDate><dc:creator>wardster</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>[quote][b]wardster (8/21/2010)[/b][hr]In Mel's example in the original article, he deletes from his staging table after committing the transaction which inserts into the target table and updates the source table, and even explicitly comments the fact to stress it's importance:--Don't delete from temp table until transaction is committedDELETE @TempSourceDoes this matter?[/quote]With a table variable, no.  It's not affected by transactions.  But, if you look at the comment, it looks like the author may have used a Temp Table originally and changed it to a table variable later on.  It could matter if it's a Temp Table (although I've not studied the code well enough to know for sure in this case).</description><pubDate>Sat, 21 Aug 2010 11:32:17 GMT</pubDate><dc:creator>Jeff Moden</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Hi folksSorry for dragging up an old thread, but I stumbled upon the related article whilst looking for a bit of info on best practices for the use of try/catch within a transaction for a small ETL process.  A very informative discussion, so thanks to all; I like the binary/hyperexponential chop ideas but will try to explore using bcp where poss, have found it very useful for non-production stuff in the past.Anyway, here's my question and scenario:In Mel's example in the original article, he deletes from his staging table [u]after[/u] committing the transaction which inserts into the target table and updates the source table, and even explicitly comments the fact to stress it's importance:[i]--Don't delete from temp table until transaction is committed DELETE @TempSource[/i]Does this matter?  In similar code I wrote the other day before reading this, I assumed I wanted the delete from staging to be part of the transaction too, so the entire process gets committed or rolled back together?  Otherwise if you get a random error during the delete aren't you left with data that you don't want in the staging table?  Not a massive issue as my process truncates the staging table at the start, but I'm wondering if I'm missing something, or worrying over something inconsequential?  ThanksMy scenario is thus:Existing process uses SSIS to import data directly into production table.  The data actually needs to be transformed to be in the form our stored procs want it in, and at the moment this is done using a view, which performs pretty badly - columns we need to index on exist courtesy of expressions in the view.  So options are index the view, or change the import process.I opted for the latter, and am adding a staging table, and a stored proc which transforms the data from staging into a new target table, plus copies the raw data into the current production table, which in effect becomes our audit store.  Existing stored procs will be changed to point at the new target table rather than existing production raw data store.There's no real constraints on the target table so I shouldn't have to worry about errors too much, but still want to use an explicit transaction, something like:[code="sql"]BEGIN TRAN	-- Insert from staging into target table, transforming columns where desired	INSERT tbl_Target		SELECT 	col1		,	MySchema.fn_transform(col2)		,	ISNULL(col3)		FROM tbl_Staging;		-- Copy raw data from staging into store	INSERT tbl_Store		SELECT col1, col2, col3		FROM tbl_Staging;	-- Delete data from staging	DELETE FROM tbl_Staging;IF (@@Error = 0)	COMMIT TRAN;ELSE BEGIN	ROLLBACK TRAN;	-- RAISERROR to tell SSIS package there's been a problemEND[/code]This process runs every hour to import a few thousand rows, thankfully mostly during office hours not middle of the night!:-PNow, i've never used T-SQL try/catch before, as I worked mainly with SQL 2000 until this year, but feel I should use it, especially as it can be used to detect a deadlock and retry (in this case there shouldn't be other stuff querying the staging and store tables but you never know)But does it matter whether my delete from staging is inside the transaction?Thanks</description><pubDate>Sat, 21 Aug 2010 09:40:04 GMT</pubDate><dc:creator>wardster</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Sorry for the late reply.  Thanks for the explanation, Dan.  I really appreciate it.</description><pubDate>Fri, 05 Feb 2010 17:00:26 GMT</pubDate><dc:creator>Jeff Moden</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Hi Jeff,I'm sorry if I wasn't clear enough.  400k rps is where SQLServer 2008 R2 should be going - it's not where it is today.  400k rps during load is only achieved in Teradata with a Teradata hardware setup using FastLoad.  I've also achieved this inside of Teradata when transferring data from one table to another: select from insert into...  statements. I would love to see SQLServer 2008 R2 get there, but the fastest I've seen SQLServer 2008 R2 run is 120k rows per second using Informatica 8.6.1 on 32 bit Windows Server 2008, with SQLServer 32 bit 2008 R2 using internal Ultra-SCSI disk, raid 0+1 array config.  Informatica on the same machine as SQLServer, reading from internal disk, writing back to internal disk.  Dell PowerEdge 2650 with 4GB RAM.Hope this helps clear things up...  Again, the nature of competition and the drive for performance is really what this is all about - and from a business perspective, moving data into a data warehouse requires adaptability of IT (agility to respond), and performance of the hardware.  So at the end of the day, removing the business rules from the raw EDW becomes a necessity in volume situations.Hope this helps,Dan LinstedtPS: Sales Pitch: I'm available for consulting, I've been consulting on systems architecture &amp; design, performance analysis and tuning of large scale systems for years.</description><pubDate>Mon, 01 Feb 2010 03:33:36 GMT</pubDate><dc:creator>danl-971512</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Dan, what are you using to load files into SQL Server at 400k RPS with and how many parallel threads are you using to do that?</description><pubDate>Sun, 31 Jan 2010 17:31:21 GMT</pubDate><dc:creator>Jeff Moden</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Hi Folks,  Lots of good comments here,  but I want to make a couple things clear:  I work with systems ranging in size from 1 Terabyte to 3 Petabytes.  Things change at different sizes of systems.  Rules change, engines change, processing changes, and architecture changes.What "works" at 1 TB most likely won't work at 50 TB, what works at 50TB most likely needs to be altered to work at 150TB, and what works at 150TB needs to be altered again to work at 500TB. and so on.  The reason I bring this up is because of all this talk of constraints.Constraints in the database put a HUGE load on the processing power of the DB engine and CPU during load time.  Processors have been getting better, bigger, stronger, faster and cheaper - this is great!  But the engines like SQLServer need to take advantage of the CPU power by being re-engineered at the core level to be truly multi-threaded (and not single strung when it gets to something like the darn pagefile.sys).  Microsoft realizes this, and made some huge changes to the core OS in Windows 64 bit.  Then they purchased DatAllegro, gutted the product and placed the true parallelism and partitioning algorithms into the core of SQLServer (Well Done !).Now, can anyone tell me if they've changed the algorithms inside SQLServer to make the "constraint checks" lazy?  Queued on multiple background threads?  If not, then the impact to bulk loading (ESPECIALLY ON LARGE FILES) is severely negative.  Loading speeds for 1k row sizes will be severely hampered, meaning that most folks will jump for joy if they get DTS or BCP to perform at or above 20,000 rows per second on pure inserts WITH constraints on.Do the  math, at 20k rps for a 1k row width at 8k block size, + constraint block modifications + index modifications + header modifications, how long would it take to load 45 million rows?  The answer is not fast enough.The only way to continue to load rows in an ever growing capacity is to increase the load performance.  It's funny that no-one in this thread even addressed the performance numbers I put forward the first time, although I can understand why...  The need to have it run at 80k rows per second at 1k row width is imperative as a MINIMUM.  (Updates and deletes are even worse, especially if they are single row driven, and can't be processed in parallel).  The absolute need for performance with 45 million rows, or even 150 Million row loads is 120k rows per second, upwards to 300k rows per second.If Microsoft wants to compete with Teradata and DB2 UDB EEE, or Paraccel, or Netezza, or most other engines (which I'm sure they do), then they must have this kind of performance available.   At the end of the day what this means is: shutting down the constraints as a FIRST STEP.  The second step is shutting down all other indexes, the third step is increasing the block sizes, the fourth step is increasing parallelism of PageFile.sys, the fifth step is dramatically increasing the parallel insert capabilitieis, and the last step is adding bigger faster, better, cheaper hardware (RAM &amp; CPU).Now, if they want really good performance in single row batch updates, or single row batch deletes, they MUST follow similar steps - except: they have to leave the constraints and indexes in place... so this means putting constraints and indexes in a "lazy queued process" to execute in parallel but in the background.  It means more synchronization for any "error" that may occur during the load &amp; check process, but it means faster throughput.Finally, let's talk about business needs now.  the business NEEDS compliance, and they need a single enterprise store (consolidated) that contains associated business data.  The business also needs high-speed massive loads.  Let's take the example of CMS (center for medicaid and medicare services in the US).  They maintain many of their Medicare/Medicaid records in a DB2 AS400 system on an operational basis.  They have over 1.5 Billion subscriber records.  They have no idea which records are updated, which are inserted, and which are deleted (due to multiple systems, multiple applications, multiple source feeds from doctors and other government offices). The ONLY way they can assemble any kind of decent picture is to pure-copy the information from all the systems to a staging area first, then move it to a warehouse - a raw data warehouse (yes with constraints on, but only on the primary keys).The staging area assures high speed loading of raw data sources at 400k rows per second.  Part of their Data Warehouse is based on the Data Vault principles of auditable and raw data, yes it has constraints but not across the BUSINESS data.  Now add in the final rules here:  BUSINESS RULES DRIVE CONSTRAINTS IN OLTP - BUSINESS RULES DRIVE CHANGES TO THE ARCHITECTURE AND DESIGN OF OLTP, when dealing with this much data in this short time period it is VERY difficult to maintain performance of the system with business rules driving the EDW (embedded in the database).  IF the business rules are implemented as constraints, not only do they physically slow down the EDW, they also slow down IT's response time to business changes - it becomse impossible to maintain the "speed of business".  Which ultimately leads to a shut-down or restructuring of the EDW project.  It's these factors that are discussed in my Data Vault Business Book - so while this may seem like a marketing statement, I assure you that customers all over the world see it differently (as a necessity for forward motion and change to the EDW environment).Cheers,Dan Linstedt</description><pubDate>Sat, 30 Jan 2010 05:08:01 GMT</pubDate><dc:creator>danl-971512</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>I'll start off by declaring total agreement with Geoff.But some other posts were very intelligent so are worth discussing.I very much liked the first comment (from Keith) because it makes sense if you have to live with silly constraints that preclude BCP (anyone with a datacomms background would like it, it fits the real world so well) - but I would recommend using a hyperexponential backoff alogrithm (at least for the first few steps), rather than a simple power series.  This means that your batch size is 1/2 the original size at the first failure level, 1/8 at the second, 1/64 at the third,  1/1024 at the fourth, and so on [in practise, I have dropped back to exponential if I hit the "and so on" - after 1/1024 I have 1/16384, and so on]. This is often [sometimes?] better than compromises like using 10 as the division factor instead of 2 (often, but not always - we are into statistics and probability theory here, not simple arithmetic with definite answers).  I didn't much like Dan Linstedt's first comment, perhaps because I just naturally thought that the constraints on the original import would be minimal and not include most business rules (maybe if I read the article again I will see something I didn't notice on first reading that implies all the business rules are used in the initial import - I can't imagine anyone without pointy hair envisaging such a system, so I would have had to suspend my belief in the sanity of DBAs in order to notice it) - and anyway, as MelS-512196 said, anything/everything breaks and needs exception handling.  But I didn't think Dan's comment was mostly marketing speak as suggested by sknox.Geoff's comments really answer it all - those constraints that forbid use of BCP (or BULK INSERT if we are looking at a server later that SQL 2000) are just plain crazy, and BCP is the basis for a sensible and cost-effective solution.  [I remember when I first used DTS - that was back in 2000 (take that as the year or as the SQLS version, you'll be right either way) when, after 15 years of functional, object, and constraint[logic] oriented databases I came back to the real world of relational databases (and had to throw IBM's DB2 out of the shop and bring in SQLS so that rapid development and prototyping with incremental release was feasible) - but I have never used BULK INSERT, so I have to take Geoff's word for it that it will do the job. Anyone back then who suggested "requirements" that would preclude use of DTS where needed would have been looking for a new job pretty quickly!]</description><pubDate>Fri, 29 Jan 2010 20:32:09 GMT</pubDate><dc:creator>L' Eomot Inversé</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Here's a cheerful post back to Jeff -- I totally agree with you that people are overcomplicating many projects by not taking advantage of what the data layer and T-SQL can do.  :cool:</description><pubDate>Mon, 30 Nov 2009 07:34:45 GMT</pubDate><dc:creator>MelS-512196</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>I'm glad to see a kindred spirit out there, Noel. Thanks for the cheer-me-up.  Your post is a big breath of fresh air.  Thank you, Sir. :-)</description><pubDate>Sun, 29 Nov 2009 18:49:39 GMT</pubDate><dc:creator>Jeff Moden</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>@Jeff Moden - Can't do anything about you getting "knocked" but I can say Thanks for your always thoughtful and insightful posts that add a great deal to SSC.Regarding your "requirements" list, I just ran into the "no temp tables" rule recently with someone who thought their tempdb usage would go to zero if all temp tables were removed from stored procedures and other data access code (they didn't know SQL Server has many uses for tempdb beyond user temp tables). Have also encountered a DBA who heavily used sp_MsForEachDB yet threatened bodily harm if you used cursors. Someone once hired me to do "optimization" by going through code and replacing cursors with temp tables and WHILE loops. Another wouldn't allow dynamic SQL in stored procedures because that "would prevent plan reuse" yet they somehow thought their 2000 line stored procedures with 20 parameters were being effectively reused.</description><pubDate>Sun, 29 Nov 2009 15:48:30 GMT</pubDate><dc:creator>Noel McKinney</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>[quote][b]MelS-512196 (11/24/2009)[/b][hr]Disagreements are good -- better ideas happen that way.  One thing I think that Jeff should take more seriously, though, is that some projects may have constraints that disallow certain techniques (xp_cmdshell for one), whether we like it or not. Suppose the data load has to run at customer sites -- no DBA attending, no elevated privileges on customer machines... I have used bcp and bulk insert in a number of situations and do think they are valuable, just may not fit every job.[/quote]Heh... I do take other methods/requirements seriously .  That's why I'll recommend Bulk Insert over BCP most of the time or even recommend the use of SSIS.Let's flip this around a bit... I get harped on for advocating doing most everything having to do with data in the data layer and in T-SQL.  Forgive me if I have to harp on the people who want to do data layer stuff everywhere except in the data layer or T-SQL.  ;-)For example, I frequently see posts where people ask about how to loop through files to import, how to keep track of what they imported, how to clean and glean the data once it's imported, and how to do that all with some good bit of performance because they claim something crazy (not so crazy in a lot of cases) like having the requirement to import something like 5 billion rows across hundreds of files per day.  Can it be done using SSIS?  Absolutely... but then you see folks saying to write scripts in Perl, ActiveX, VBS, DLL's, etc, etc to actually be able to pull if off in SSIS (or DTS) and you'll still see them having problems with flow control, variables, transformations, performance, data integrity, etc, etc.  That's when I step in (sometimes well in advance) and suggest alternatives because none of that stuff is actually necessary.  GUI code like SSIS and DTS were supposed to make life simpler and faster.. they frequently don't.The other thing I don't see people doing is taking advantage of some of the power of SQL because of requirements like not using XP_CmdShell.  For security reasons, I can certainly see that.  BUT, there are also some pretty extreme privs required for ETL systems that use SSIS... you have to be able to read the files names, import the data, and maybe rename the file to identify it's complete or move the file to a "competed" directory.  Tell me that doesn't take elevated privs on the operating system.  Just like in SSIS, sometimes those privs have to be done through a proxy... same goes with xp_CmdShell... you don't have to give anyone privs to run xp_CmdShell... you can setup proxies and jobs and it's normally worth that small effort because... T-SQL was meant to handle data.  Use it.So far as some projects having requirements go, yep.. I agree.  Right up to the point when I ask "why" such a requirement is in place and the answer is "because they want it that way".  What that means is that that person hasn't taken the time to suggest and prove that there might be a better way with "better" meaning higher performance, ease of maintenance/modifiation, hands-free/self monitoring qualities, smaller footprint, etc, etc.  Even if they do, the reason for the requirement is based on some blanket myth that xp_CmdShell is bad without thinking about how to do it with a proxy or whatever.Here's some really "great" requirements I've seen due to arbitrary mythological beliefs of some DBA or manager that doesn't have a clue as to what they're talking about...Cannot use Temp Tables.Cannot use SELECT/INTO because it locks up system tables.Cannot use any form of dynamic SQL.Cannot use Bulk Insert because of the dynamic SQL requirement.Cannot use Cursors or While Loops to step through databases.  Use sp_MsForEachDB instead (which has the world's 2nd ugliest cursor built in).The code must be "portable".Cannot use UDF's.The list of such mad requirements goes on.  So, heh... pardon me if I try to offer alternatives even in the face of so called requirements.  People say to "think outside the box" and it usually comes from people that don't realize... they're in box. ;-)I frequently state that "On the fringe of chaos lives innovation."   Let me disagree without getting chastized for it and let me disagree when someone says they won't even try to change some of the more ridiculous requirements they've been given by people who don't actually know better. :hehe:Like you said, "disagreement is good"... let me disagree without getting knocked just because I disagree, eh? ;-)</description><pubDate>Wed, 25 Nov 2009 09:40:31 GMT</pubDate><dc:creator>Jeff Moden</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Business reqs are always going to differ.  I think the core algorithm you're throwing out there is very good.  Well done.</description><pubDate>Wed, 25 Nov 2009 07:47:30 GMT</pubDate><dc:creator>jaldridge</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Disagreements are good -- better ideas happen that way.  One thing I think that Jeff should take more seriously, though, is that some projects may have constraints that disallow certain techniques (xp_cmdshell for one), whether we like it or not. Suppose the data load has to run at customer sites -- no DBA attending, no elevated privileges on customer machines... I have used bcp and bulk insert in a number of situations and do think they are valuable, just may not fit every job.</description><pubDate>Tue, 24 Nov 2009 06:01:43 GMT</pubDate><dc:creator>MelS-512196</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>i took the testing and it was easy.</description><pubDate>Mon, 23 Nov 2009 20:34:25 GMT</pubDate><dc:creator>jennytorns88</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Before I get started... the article is well written, clear, and easy to understand... I just disagree with the premise of having to resort to any form of RBAR for the simple task of importing and validating data in a high performance fashion.I'm happy to see that "RBAR" has become a household word.  :-D  It's a shame that "BCP" has not.  BCP does have some limitations but it's quite capable in most areas.  For example, when you import into a staging table using BCP, if a particular "cell" of information doesn't adhere to the correct datatype or constraints (or whatever), it will cause the row to fail as expected.  What most people don't know is that doesn't necessarily make the whole job fail.  Nope... you can tell the job how many errors to allow.  Now, get this... I tell it to allow ALL rows to fail.  Why?  Because with BCP (and, I believe, Bulk Insert as of 2k5) you can also tell it to log all error rows to a file for troubleshooting and possible repair.  No CLR's are required, no Try Catch code is required, no shifting from bulk logic to RBAR, no fancy footwork is required.  It's already built into BCP, will isolate only bad rows while continuing to import the good rows, and it's nasty fast.  On a modest old server, I've seen it import, clean, and glean 5.1 million 20 column rows in 60 seconds flat.  On a "modern" server, I'm sure it will be much faster.  The only thing it won't do for you is the final merge from staging to production.That brings us to the next subject and several folks have already mentioned it.  I never import directly to production tables.  I always import to staging tables and, even if I've done as much error checking as possible with BCP, I always finish validating the data in the staging tables.  Then I mark all new inserts, mark all rows that contain updates, and then do the inserts and updates as expected which allows me to avoid all RBAR (and rollbacks due to data errors) even in the presence of the occasional data error.Ah... yes... BCP doesn't have a GUI and, if you want it to run from T-SQL, you'll need to allow xp_CmdShell to run or to call it through some other vehicle.  Heh... so what?  Are you going to tell me that your ETL system is also public facing?  If so (I'll reserve my personal opinion on that), then you're correct... use SSIS, CLR's, and Try Catch code that shifts to RBAR on failure... that is unless you happen to know how to use BULK INSERT to do the very same thing as BCP (and it will) in 2k5. :-PETL just doesn't have to be complicated nor slow.</description><pubDate>Mon, 23 Nov 2009 19:24:36 GMT</pubDate><dc:creator>Jeff Moden</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Hi Folks,I understand what is desired.  Please don't get me wrong: set based processing has been around for a long long time.  What is important here is to note the following:1) Size of the source and target don't matter - whether big or small, set based processing is important.2) Using cursors and iterating over rows runs much slower than using "array" or block style commands, regardless of whether it's written in ETL (SSIS) or SQL stored procedures, or other ETL engine technologies.3) Speed is important when moving data to the warehouse from the staging area.4) Road-blocks to performance are often setup because of the data modeling architecture (ie: requiring business rules to be applied to the data when loading from stage to warehouse).5) Business rule processing can "break" the set based architectural approach6) moving business rules downstream (hence putting RAW data in the data warehouse) allows the EDW to be compliant and auditable, it also leads to many opportunities to apply set based operations over block style commands.7) The set based processing commands for inserts (when dealing with RAW data) allow these pieces to be specific against index matches, which in turn allow for maximum parallelism and maximum partitioning.I've used many of these techniques on data warehouses ranging in size from 3MB to 3 Petabytes with great success, again - you don't have to have large data warehouses to see the benefits of set based processing, however you DO need to move the "processing business rules" out of the way, and allow the good, the bad, and the ugly data IN to the data warehouse - otherwise, you miss out on compliance and auditability.Hope this helps,Dan LinstedtDanL@DanLinstedt.com</description><pubDate>Mon, 23 Nov 2009 16:34:59 GMT</pubDate><dc:creator>danl-971512</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>this is great and very useful info.</description><pubDate>Mon, 23 Nov 2009 14:29:20 GMT</pubDate><dc:creator>angelasutton21</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>A really good point, well made. As Joe says, using a staging area is best practice. However staging will never fully insulate you from the real world. In fact Obstacle Warehouse Builder offers this as a standard feature (this does not make up for all the other issues with WHB/Oracle) but the approach outlined is clearly best practice - if a lot more work. Thanks for putting in the effort to write this up.</description><pubDate>Mon, 23 Nov 2009 13:25:27 GMT</pubDate><dc:creator>admin-499013</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>These posts all have good things to say about considerations when loading data.  The example in this article was meant to show a particular way of using sets with try-catch and can be thought of as a basis for building more elaborate systems, not as a complete solution.  I agree with trying to handle exceptions as early in the process as possible, in a staging or landing area.  I also agree with keeping constraints to a minimum in a data mart.  But I also generally go by the rule that everything can break at some time and everything should have exception handling.</description><pubDate>Mon, 23 Nov 2009 13:19:25 GMT</pubDate><dc:creator>MelS-512196</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Two things to add.One is that if you built this import routine in a recursive fashision you could have something like:exec ImportRecords @StartRecord, @EndRecord, @BlockSize.The first level of recursion would just break the data into segments based on the block size and call back to the SP recursively in a loop.  After the first level of recursion @EndRecord - @StartRecord &amp;lt;= @BlockSize.  If that was the case the code would insert the data within a try catch block.Then if you made the Block size a power of 10 (1000 or 10000) each level of error handling within the recursion would have a loop that would break the errored block into 10 peices, set @BlockSize = @BlockSize / 10 and retry.  If there was only one error in a block of 10000 records, this would recurse down in to a loop that processed 1000 records at a time.  The 9 good blocks would have no errors and the 9000 good records would process in the third layer of recursion.Then block of 1000 of the remaining records in errored block would process in a fourth layer of recursion 100 records at time.  This would continue until errors were caught and @BlockSize = 1.  In that case you would just log the error rather than recurse down again.Of course you would also need some kind of check in the loop to make sure that the recursive value of @EndRecord never exceeded the passed in value of @EndRecord.The second point is that many error conditions could be found using set based operations to set and error status prior to attempting to insert the records.</description><pubDate>Mon, 23 Nov 2009 13:02:20 GMT</pubDate><dc:creator>michael.welcome</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>What he said, except for the Data Vault part… ;)Seriously, in a data warehouse we always 'land' loads like this in a staging table with minimal constraints. You can then pick and choose what to do with it. I generally want things to be so loose that it won't fail unless the load file is unusable.</description><pubDate>Mon, 23 Nov 2009 12:41:45 GMT</pubDate><dc:creator>joeharris76</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>I had to work on ETL system that read from a bunch of sources. After the data was imported in the staging table, I ran a  bunch of set based validations and marked the records invalid and also kept track of  of reasons why they were bad. After the validations were done, i imported only the good records and send an email with list of bad records and the reasons.Some of my set based validations involved calls to CLR function which made it slower but the granularity of the error report made it well worth it and it was 10-15 faster than the ETL process it replaced. Let me also add that this is not a run every 15 minutes kind of application but process data whenever the clients upload data.</description><pubDate>Mon, 23 Nov 2009 08:52:28 GMT</pubDate><dc:creator>PeeEs</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>I actually think Dan Linstedt had a good point, even if it was lost in the marketing-speke of the rest of his post. To wit: why do you have constraints on the DB you're importing into?Note that I'm not saying you shouldn't have these constraints -- it depends on what you're using this DB for. But your article doesn't explain why there are constraints that could cause the import to fail.The entire system might be more efficient if you allowed the bad records to import and re-cast these constraints as reports which showed records which need correction in the OLTP system -- or you may have more reports that can't include bad records for any reason, so you need to catch the bad records before they get into the DB. We don't know.In any case, this try-catch method is a good idea when you have a set-based process that might have to fall back to RBAR, and I too am interested in the whether the binary "divide-and-conquer" approach might be worth the time it'd take to develop.</description><pubDate>Mon, 23 Nov 2009 08:36:12 GMT</pubDate><dc:creator>sknox</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Thanks for thoughts on dynamic logic to change the size of the sets processed to handle errors.  Interesting idea.While I appreciate the comments on other ways to handle large data loads, not every project is the same scale or has the budget and/or time to implement specialized tools.  What I think is important is the ability to be flexible and come up with the right approach for each project...</description><pubDate>Mon, 23 Nov 2009 06:19:55 GMT</pubDate><dc:creator>MelS-512196</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Hello,While I agree with your approach - the basic concepts are sound - I disagree with one thing: processing the data on the way in to the data warehouse.  I've been teaching, talking, and discussing the nature of Data Warehouses for over 15 years now, and today's data warehouse has become a system of record.  Due in part for the need of compliance.This means that the good, the bad, and the ugly data need to make it in to the data warehouse, regardless of what it looks like.  It also means that "processing" the data according to business rules and functions are now moved down stream (on the way out to the data marts, the cubes, the star schemas etc...).This does a few things:1) Set based processing is in use for all data across the entire warehouse2) all load  routines are parallel, and can be partitioned if necessary3) load performance should be upwards of 100,000 to 250,000 rows per second - making it easy to load 1 Billion + rows in 45 minutes or less (of course this depends on the hardware)4) restartability is inherited, as long as set based logic is in placeand so on...  The bottom line is moving raw data into the warehouse, the other bottom line is the architecture of the receiving tables in the warehouse is vitally important.  This is where the Data Vault Modeling and methodology come in to play.I'm frequently engaged to correct performance and tuning of large scale systems, and since Microsoft is now there with SQLServer 2008 R2, (And the fact that Microsoft is interested in the Data Vault Model), I would suggest you reexamine the way you load information (ie: putting processing of the data upstream of the data warehouse).You can see more about  this high-speed parallel approach at: http://www.DataVaultInstitute.com (free to register), or my profile on http://www.LinkedIn.com/in/dlinstedtAlso, keep in mind this is not some fly-by-night idea.  We've got a huge number of corporations following these principles with great success, including: JP Morgan Chase, SNS Bank, World Bank, ABN-AMRO, Diamler Auto, Edmonton Police Services, Dept of Defense, US Navy, US Army, FAA, FDA, City of Charlotte NC, Tyson Foods, Nutreco, and many many more....Cheers,Dan Linstedt</description><pubDate>Mon, 23 Nov 2009 03:35:25 GMT</pubDate><dc:creator>danl-971512</dc:creator></item><item><title>RE: Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Thanks for the approach, which makes a lot of sense.Has anyone done any testing to see whether it would be worth taking an adaptive approach to this? What I mean is, after a failure in a block (of 25k say) rather than reverting to RBAR for the whole block, try something like a binary chop approach to locate the bad row(s).  A variation on this would be to adapt the block size dynamically - start large, if an error, try half that size and repeat until a preset minimum size is reached which would be processed RBAR. If a block gets by successfully, increase the size for the next set.We've done this kind of thing in serial comms before and I wonder if it would work here.The key factors, I think, are the ratio of time to process a block versus the time to do it RBAR (for different sizes) and (the more variable factor that depends on your data) how often do you expect to get errors and what is their pathology - are they likely to be solitary, clumped in groups, etc.  Some of this will only be learnt over time with your data.  These factors will determine whether, having failed 20k rows, it is worth taking the time hit to try two lots of 10k or to drop straight into RBAR for the 20k.</description><pubDate>Mon, 23 Nov 2009 01:58:49 GMT</pubDate><dc:creator>keith.fearnley</dc:creator></item><item><title>Conditional Set-Based Processing: Moving Towards a Best Practice for ETL</title><link>http://www.sqlservercentral.com/Forums/Topic823043-1693-1.aspx</link><description>Comments posted to this topic are about the item [B]&lt;A HREF="/articles/Data+Mart/67606/"&gt;Conditional Set-Based Processing: Moving Towards a Best Practice for ETL&lt;/A&gt;[/B]</description><pubDate>Sun, 22 Nov 2009 17:51:47 GMT</pubDate><dc:creator>MelS-512196</dc:creator></item></channel></rss>