
Conditional Set-Based Processing: Moving Towards a Best Practice for ETL
Posted Monday, November 23, 2009 2:29 PM
This is great and very useful info.

Angela
Post #823524
Posted Monday, November 23, 2009 4:34 PM
Hi Folks,

I understand what is desired. Please don't get me wrong: set-based processing has been around for a long, long time. What is important here is to note the following:

1) The size of the source and target doesn't matter -- big or small, set-based processing is important.
2) Using cursors and iterating over rows runs much slower than using "array" or block-style commands, regardless of whether the logic is written in an ETL tool (SSIS), SQL stored procedures, or another ETL engine.
3) Speed is important when moving data from the staging area to the warehouse.
4) Road-blocks to performance are often set up by the data modeling architecture (i.e., requiring business rules to be applied to the data when loading from stage to warehouse).
5) Business rule processing can "break" the set-based architectural approach.
6) Moving business rules downstream (hence putting RAW data in the data warehouse) keeps the EDW compliant and auditable; it also opens up many opportunities to apply set-based, block-style operations instead of row-by-row processing.
7) Set-based insert commands (when dealing with RAW data) can be targeted at specific index matches, which in turn allows for maximum parallelism and maximum partitioning.

I've used many of these techniques on data warehouses ranging in size from 3 MB to 3 petabytes with great success. Again, you don't have to have a large data warehouse to see the benefits of set-based processing, but you DO need to move the "processing business rules" out of the way and allow the good, the bad, and the ugly data IN to the data warehouse -- otherwise, you miss out on compliance and auditability.
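
To make point 7 concrete, here is a minimal set-based sketch; the stage.Customer and dw.Customer names are hypothetical stand-ins, not from any particular system:

    -- Hypothetical staging and warehouse tables, for illustration only.
    -- One block-style INSERT...SELECT replaces a row-by-row cursor; the join
    -- on the indexed business key is what lets the optimizer parallelize.
    INSERT INTO dw.Customer (CustomerKey, CustomerName, LoadDate)
    SELECT s.CustomerKey, s.CustomerName, GETDATE()
    FROM stage.Customer AS s
    LEFT JOIN dw.Customer AS d
           ON d.CustomerKey = s.CustomerKey
    WHERE d.CustomerKey IS NULL;   -- insert only rows not already in the warehouse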

Hope this helps,
Dan Linstedt
DanL@DanLinstedt.com
Post #823569
Posted Monday, November 23, 2009 7:24 PM
Before I get started... the article is well written, clear, and easy to understand... I just disagree with the premise of having to resort to any form of RBAR for the simple task of importing and validating data in a high-performance fashion.

I'm happy to see that "RBAR" has become a household word. It's a shame that "BCP" has not. BCP does have some limitations, but it's quite capable in most areas. For example, when you import into a staging table using BCP, if a particular "cell" of information doesn't adhere to the correct datatype or constraints (or whatever), it will cause the row to fail as expected. What most people don't know is that this doesn't necessarily make the whole job fail. Nope... you can tell the job how many errors to allow.

Now, get this... I tell it to allow ALL rows to fail. Why? Because with BCP (and, I believe, Bulk Insert as of 2k5) you can also tell it to log all error rows to a file for troubleshooting and possible repair. No CLR's are required, no Try Catch code is required, no shifting from bulk logic to RBAR, no fancy footwork is required. It's already built into BCP, will isolate only bad rows while continuing to import the good rows, and it's nasty fast. On a modest old server, I've seen it import, clean, and glean 5.1 million 20 column rows in 60 seconds flat. On a "modern" server, I'm sure it will be much faster. The only thing it won't do for you is the final merge from staging to production.
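
A sketch of what that command line can look like (server, table, and file paths are placeholders):

    REM All names are placeholders. -c = character mode, -t = field terminator,
    REM -m = max errors tolerated (set it huge to "allow all rows to fail"),
    REM -e = file that captures rejected rows, -T = trusted connection.
    bcp StagingDB.dbo.Stage_Import in "C:\feeds\daily.txt" -c -t"," -m 999999999 -e "C:\feeds\daily.err" -S MyServer -T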

That brings us to the next subject, and several folks have already mentioned it. I never import directly to production tables. I always import to staging tables and, even if I've done as much error checking as possible with BCP, I always finish validating the data in the staging tables. Then I mark all new inserts, mark all rows that contain updates, and then do the inserts and updates as expected, which allows me to avoid all RBAR (and rollbacks due to data errors) even in the presence of the occasional data error.
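
The mark-then-apply pattern is only a few statements; here's a minimal sketch with made-up dbo.Stage_Orders and dbo.Orders tables:

    -- Hypothetical tables, for illustration only.
    -- 1) Mark each staged row as an insert ('I') or an update ('U').
    UPDATE s
    SET    s.RowAction = CASE WHEN p.BusinessKey IS NULL THEN 'I' ELSE 'U' END
    FROM   dbo.Stage_Orders AS s
    LEFT JOIN dbo.Orders AS p
           ON p.BusinessKey = s.BusinessKey;

    -- 2) Apply all updates as one set-based statement...
    UPDATE p
    SET    p.Amount = s.Amount
    FROM   dbo.Orders AS p
    JOIN   dbo.Stage_Orders AS s
           ON s.BusinessKey = p.BusinessKey
    WHERE  s.RowAction = 'U';

    -- 3) ...then all inserts. No row-by-row anything, no mid-load rollbacks.
    INSERT INTO dbo.Orders (BusinessKey, Amount)
    SELECT s.BusinessKey, s.Amount
    FROM   dbo.Stage_Orders AS s
    WHERE  s.RowAction = 'I';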

Ah... yes... BCP doesn't have a GUI and, if you want it to run from T-SQL, you'll need to allow xp_CmdShell to run or call it through some other vehicle. Heh... so what? Are you going to tell me that your ETL system is also public facing? If so (I'll reserve my personal opinion on that), then you're correct... use SSIS, CLRs, and Try/Catch code that shifts to RBAR on failure... that is, unless you happen to know how to use BULK INSERT to do the very same thing as BCP (and it will) in 2k5.
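
For reference, the BULK INSERT form of the same trick might look like this (table, paths, and terminators are placeholders):

    -- MAXERRORS and ERRORFILE are the BULK INSERT counterparts of
    -- bcp's -m and -e switches (ERRORFILE arrived in SQL Server 2005).
    BULK INSERT dbo.Stage_Import
    FROM 'C:\feeds\daily.txt'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        MAXERRORS       = 999999999,            -- keep loading past bad rows
        ERRORFILE       = 'C:\feeds\daily.err'  -- rejected rows land here
    );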

ETL just doesn't have to be complicated nor slow.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #823595
Posted Monday, November 23, 2009 8:34 PM
I took the test and it was easy.

Jenny
Post #823602
Posted Tuesday, November 24, 2009 6:01 AM
Disagreements are good -- better ideas happen that way. One thing I think Jeff should take more seriously, though, is that some projects may have constraints that disallow certain techniques (xp_cmdshell, for one), whether we like it or not. Suppose the data load has to run at customer sites -- no DBA attending, no elevated privileges on customer machines... I have used bcp and bulk insert in a number of situations and do think they are valuable; they just may not fit every job.
Post #823808
Posted Wednesday, November 25, 2009 7:47 AM
Business requirements are always going to differ. I think the core algorithm you're throwing out there is very good. Well done.
Post #824576
Posted Wednesday, November 25, 2009 9:40 AM
MelS-512196 (11/24/2009)
Disagreements are good -- better ideas happen that way. One thing I think Jeff should take more seriously, though, is that some projects may have constraints that disallow certain techniques (xp_cmdshell, for one), whether we like it or not. Suppose the data load has to run at customer sites -- no DBA attending, no elevated privileges on customer machines... I have used bcp and bulk insert in a number of situations and do think they are valuable; they just may not fit every job.


Heh... I do take other methods/requirements seriously. That's why I'll recommend Bulk Insert over BCP most of the time, or even recommend the use of SSIS.

Let's flip this around a bit... I get harped on for advocating doing most everything having to do with data in the data layer and in T-SQL. Forgive me if I have to harp on the people who want to do data-layer stuff everywhere except in the data layer or T-SQL.

For example, I frequently see posts where people ask how to loop through files to import, how to keep track of what they imported, how to clean and glean the data once it's imported, and how to do all that with some good bit of performance, because they claim something crazy (not so crazy in a lot of cases) like having the requirement to import something like 5 billion rows across hundreds of files per day. Can it be done using SSIS? Absolutely... but then you see folks saying to write scripts in Perl, ActiveX, VBS, DLLs, etc, etc to actually be able to pull it off in SSIS (or DTS), and you'll still see them having problems with flow control, variables, transformations, performance, data integrity, etc, etc. That's when I step in (sometimes well in advance) and suggest alternatives, because none of that stuff is actually necessary. GUI tools like SSIS and DTS were supposed to make life simpler and faster... they frequently don't.
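
To illustrate (directory, table, and file names below are made up, and it assumes xp_CmdShell is enabled), the whole "loop through the files and import them" job can live in plain T-SQL:

    -- Capture the directory listing into a table, then drive one set-based
    -- BULK INSERT per file via dynamic SQL. The loop is over files (control
    -- flow), not over rows of data, so the loads themselves stay set-based.
    CREATE TABLE #Files (FileName sysname NULL);

    INSERT INTO #Files (FileName)
    EXEC master.sys.xp_cmdshell 'dir /b "C:\feeds\*.txt"';

    DELETE FROM #Files WHERE FileName IS NULL;  -- xp_cmdshell pads output with NULLs

    DECLARE @FileName sysname, @sql nvarchar(max);

    WHILE EXISTS (SELECT 1 FROM #Files)
    BEGIN
        SELECT TOP (1) @FileName = FileName FROM #Files;

        SET @sql = N'BULK INSERT dbo.Stage_Import FROM ''C:\feeds\'
                  + @FileName + N''' WITH (FIELDTERMINATOR = '','');';
        EXEC (@sql);

        DELETE FROM #Files WHERE FileName = @FileName;
    END;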

The other thing I don't see people doing is taking advantage of some of the power of SQL because of requirements like not using xp_CmdShell. For security reasons, I can certainly see that. BUT, there are also some pretty extreme privs required for ETL systems that use SSIS... you have to be able to read the file names, import the data, and maybe rename the file to mark it as complete or move the file to a "completed" directory. Tell me that doesn't take elevated privs on the operating system. Just like in SSIS, sometimes those privs have to be granted through a proxy... same goes for xp_CmdShell... you don't have to give anyone privs to run xp_CmdShell... you can set up proxies and jobs, and it's normally worth that small effort because... T-SQL was meant to handle data. Use it.
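
The proxy setup itself is only a few lines; here's a sketch (the account, password, and role names are placeholders, and enabling xp_cmdshell is a security decision to make deliberately):

    -- Enable xp_cmdshell (sysadmin only; weigh the security tradeoffs first).
    EXEC sp_configure 'show advanced options', 1;  RECONFIGURE;
    EXEC sp_configure 'xp_cmdshell', 1;            RECONFIGURE;

    -- Non-sysadmin callers run under this low-privilege Windows proxy account.
    EXEC sp_xp_cmdshell_proxy_account 'DOMAIN\EtlProxy', 'StrongPasswordHere';

    -- Grant execute to a specific role rather than to everyone (in master).
    GRANT EXECUTE ON xp_cmdshell TO EtlOperators;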

So far as some projects having requirements goes, yep... I agree. Right up to the point when I ask "why" such a requirement is in place and the answer is "because they want it that way". What that means is that the person hasn't taken the time to suggest and prove that there might be a better way, with "better" meaning higher performance, ease of maintenance/modification, hands-free/self-monitoring qualities, smaller footprint, etc, etc. Even when they do, the reason for the requirement is based on some blanket myth that xp_CmdShell is bad, without thinking about how to do it with a proxy or whatever.

Here are some really "great" requirements I've seen, due to the arbitrary mythological beliefs of some DBA or manager who doesn't have a clue as to what they're talking about...

Cannot use Temp Tables.
Cannot use SELECT/INTO because it locks up system tables.
Cannot use any form of dynamic SQL.
Cannot use Bulk Insert because of the dynamic SQL requirement.
Cannot use Cursors or While Loops to step through databases. Use sp_MsForEachDB instead (which has the world's 2nd ugliest cursor built in).
The code must be "portable".
Cannot use UDFs.

The list of such mad requirements goes on. So, heh... pardon me if I try to offer alternatives even in the face of so-called requirements. People say to "think outside the box" and it usually comes from people who don't realize... they're in a box.

I frequently state that "On the fringe of chaos lives innovation." Let me disagree without getting chastised for it, and let me disagree when someone says they won't even try to change some of the more ridiculous requirements they've been given by people who don't actually know better.

Like you said, "disagreement is good"... let me disagree without getting knocked just because I disagree, eh?


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #824740
Posted Sunday, November 29, 2009 3:48 PM
@Jeff Moden - Can't do anything about you getting "knocked", but I can say thanks for your always thoughtful and insightful posts that add a great deal to SSC.

Regarding your "requirements" list, I just ran into the "no temp tables" rule recently with someone who thought their tempdb usage would go to zero if all temp tables were removed from stored procedures and other data access code (they didn't know SQL Server has many uses for tempdb beyond user temp tables). I have also encountered a DBA who heavily used sp_MsForEachDB yet threatened bodily harm if you used cursors. Someone once hired me to do "optimization" by going through code and replacing cursors with temp tables and WHILE loops. Another wouldn't allow dynamic SQL in stored procedures because it "would prevent plan reuse", yet they somehow thought their 2,000-line stored procedures with 20 parameters were being effectively reused.
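
For what it's worth, parameterized dynamic SQL gets plan reuse just fine; a minimal sketch, with a made-up dbo.Orders table:

    -- sp_executesql caches and reuses one plan across calls because the
    -- parameter stays a parameter instead of being pasted in as a literal.
    DECLARE @sql nvarchar(max) =
        N'SELECT OrderID, Amount FROM dbo.Orders WHERE CustomerID = @CustID;';

    EXEC sp_executesql @sql, N'@CustID int', @CustID = 42;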
Post #825939
Posted Sunday, November 29, 2009 6:49 PM
I'm glad to see a kindred spirit out there, Noel. Thanks for the cheer-me-up. Your post is a big breath of fresh air. Thank you, Sir.

--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #825966
Posted Monday, November 30, 2009 7:34 AM
Here's a cheerful post back to Jeff -- I totally agree with you that people are overcomplicating many projects by not taking advantage of what the data layer and T-SQL can do.
Post #826201