Conditional Set-Based Processing: Moving Towards a Best Practice for ETL

  • Business reqs are always going to differ. I think the core algorithm you're throwing out there is very good. Well done.

  • MelS-512196 (11/24/2009)


    Disagreements are good -- better ideas happen that way. One thing I think that Jeff should take more seriously, though, is that some projects may have constraints that disallow certain techniques (xp_cmdshell for one), whether we like it or not. Suppose the data load has to run at customer sites -- no DBA attending, no elevated privileges on customer machines... I have used bcp and bulk insert in a number of situations and do think they are valuable, just may not fit every job.

    Heh... I do take other methods/requirements seriously. That's why I'll recommend Bulk Insert over BCP most of the time, or even recommend the use of SSIS.

    Let's flip this around a bit... I get harped on for advocating doing most everything having to do with data in the data layer and in T-SQL. Forgive me if I have to harp on the people who want to do data layer stuff everywhere except in the data layer or T-SQL. 😉

    For example, I frequently see posts where people ask how to loop through files to import, how to keep track of what they imported, how to clean and glean the data once it's imported, and how to do all that with some good bit of performance because they claim something crazy (not so crazy in a lot of cases) like having the requirement to import something like 5 billion rows across hundreds of files per day. Can it be done using SSIS? Absolutely... but then you see folks saying to write scripts in Perl, ActiveX, VBS, DLL's, etc, etc to actually be able to pull it off in SSIS (or DTS), and you'll still see them having problems with flow control, variables, transformations, performance, data integrity, etc, etc. That's when I step in (sometimes well in advance) and suggest alternatives, because none of that stuff is actually necessary. GUI code like SSIS and DTS was supposed to make life simpler and faster... it frequently doesn't.
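
    Just to show the shape of what I'm talking about, here's a rough sketch (the directory, file spec, staging table, and ImportLog table are all made up, and real code would obviously need error handling and a bit more logging):

    -- Get the list of files to import (one row per file name)
    CREATE TABLE #ImportFiles (FileName sysname NULL);
    INSERT INTO #ImportFiles (FileName)
    EXEC master.sys.xp_cmdshell 'dir /b "C:\Import\*.csv"';
    DELETE FROM #ImportFiles WHERE FileName IS NULL; -- xp_CmdShell returns a trailing NULL row

    -- Loop through the files, BULK INSERT each one, and log it
    DECLARE @FileName sysname, @SQL nvarchar(max);
    WHILE EXISTS (SELECT 1 FROM #ImportFiles)
    BEGIN
        SELECT TOP (1) @FileName = FileName FROM #ImportFiles ORDER BY FileName;

        SET @SQL = N'BULK INSERT dbo.Staging FROM ''C:\Import\' + @FileName + N''' '
                 + N'WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'', FIRSTROW = 2, TABLOCK);';
        EXEC (@SQL);

        INSERT INTO dbo.ImportLog (FileName, ImportDate) VALUES (@FileName, GETDATE());
        DELETE FROM #ImportFiles WHERE FileName = @FileName;
    END;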

    The other thing I don't see people doing is taking advantage of some of the power of SQL because of requirements like not using xp_CmdShell. For security reasons, I can certainly see that. BUT, there are also some pretty extreme privs required for ETL systems that use SSIS... you have to be able to read the file names, import the data, and maybe rename the file to identify that it's complete or move the file to a "completed" directory. Tell me that doesn't take elevated privs on the operating system. Just like in SSIS, sometimes those privs have to be handled through a proxy... same goes for xp_CmdShell... you don't have to give anyone privs to run xp_CmdShell... you can set up proxies and jobs, and it's normally worth that small effort because... T-SQL was meant to handle data. Use it.
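
    For the folks who haven't seen it, the proxy setup is a one-time bit of work done by a sysadmin... something like this sketch (the account, role, and user names are made up):

    -- Turn on xp_CmdShell (sysadmins can always run it once it's enabled)
    EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
    EXEC sp_configure 'xp_cmdshell', 1; RECONFIGURE;

    -- Set the Windows proxy account that non-sysadmin callers will run under
    EXEC sp_xp_cmdshell_proxy_account 'DOMAIN\EtlProxyAcct', 'StrongPasswordHere';

    -- Grant execute through a role in master instead of to individual users
    USE master;
    CREATE ROLE CmdShellUsers;
    GRANT EXECUTE ON sys.xp_cmdshell TO CmdShellUsers;
    EXEC sp_addrolemember 'CmdShellUsers', 'EtlUser'; -- EtlUser must exist as a user in master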

    So far as some projects having requirements go, yep... I agree. Right up to the point where I ask "why" such a requirement is in place and the answer is "because they want it that way". What that means is that that person hasn't taken the time to suggest and prove that there might be a better way, with "better" meaning higher performance, ease of maintenance/modification, hands-free/self-monitoring qualities, smaller footprint, etc, etc. Even if they do, the reason for the requirement is based on some blanket myth that xp_CmdShell is bad, without thinking about how to do it with a proxy or whatever.

    Here are some really "great" requirements I've seen, born of the arbitrary mythological beliefs of some DBA or manager who doesn't have a clue what they're talking about...

    Cannot use Temp Tables.

    Cannot use SELECT/INTO because it locks up system tables.

    Cannot use any form of dynamic SQL.

    Cannot use Bulk Insert because of the dynamic SQL requirement.

    Cannot use Cursors or While Loops to step through databases. Use sp_MsForEachDB instead (which has the world's 2nd ugliest cursor built in).

    The code must be "portable".

    Cannot use UDF's.

    The list of such mad requirements goes on. So, heh... pardon me if I try to offer alternatives even in the face of so-called requirements. People say to "think outside the box" and it usually comes from people who don't realize... they're in a box. 😉

    I frequently state that "On the fringe of chaos lives innovation." Let me disagree without getting chastised for it, and let me disagree when someone says they won't even try to change some of the more ridiculous requirements they've been given by people who don't actually know better. :hehe:

    Like you said, "disagreement is good"... let me disagree without getting knocked just because I disagree, eh? 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @Jeff Moden - Can't do anything about you getting "knocked", but I can say thanks for your always thoughtful and insightful posts that add a great deal to SSC.

    Regarding your "requirements" list, I just ran into the "no temp tables" rule recently with someone who thought their tempdb usage would go to zero if all temp tables were removed from stored procedures and other data access code (they didn't know SQL Server has many uses for tempdb beyond user temp tables). I've also encountered a DBA who heavily used sp_MsForEachDB yet threatened bodily harm if you used cursors. Someone once hired me to do "optimization" by going through code and replacing cursors with temp tables and WHILE loops. Another wouldn't allow dynamic SQL in stored procedures because that "would prevent plan reuse", yet they somehow thought their 2000-line stored procedures with 20 parameters were being effectively reused.
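
    For what it's worth, parameterized dynamic SQL through sp_executesql gets cached and reused like any other parameterized batch... something along these lines (table and column names invented for the example):

    DECLARE @SQL nvarchar(max), @CustomerID int;
    SET @CustomerID = 42;

    SET @SQL = N'SELECT OrderID, OrderDate
                 FROM dbo.Orders
                 WHERE CustomerID = @CustID;';

    -- The parameterized text is what gets cached, so repeated calls with
    -- different customer IDs reuse the same plan
    EXEC sp_executesql @SQL, N'@CustID int', @CustID = @CustomerID;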

  • I'm glad to see a kindred spirit out there, Noel. Thanks for the cheer-me-up. Your post is a big breath of fresh air. Thank you, Sir. 🙂

    --Jeff Moden



  • Here's a cheerful post back to Jeff -- I totally agree with you that people are overcomplicating many projects by not taking advantage of what the data layer and T-SQL can do. 😎

  • I'll start off by declaring total agreement with Geoff.

    But some of the other posts were very intelligent, so they're worth discussing.

    I very much liked the first comment (from Keith) because it makes sense if you have to live with silly constraints that preclude BCP (anyone with a datacomms background would like it, it fits the real world so well) - but I would recommend using a hyperexponential backoff algorithm (at least for the first few steps), rather than a simple power series. This means that your batch size is 1/2 the original size at the first failure level, 1/8 at the second, 1/64 at the third, 1/1024 at the fourth, and so on [in practice, I have dropped back to exponential if I hit the "and so on" - after 1/1024 I have 1/16384, and so on]. This is often [sometimes?] better than compromises like using 10 as the division factor instead of 2 (often, but not always - we are into statistics and probability theory here, not simple arithmetic with definite answers). I didn't much like Dan Linstedt's first comment, perhaps because I just naturally thought that the constraints on the original import would be minimal and not include most business rules (maybe if I read the article again I will see something I didn't notice on first reading that implies all the business rules are used in the initial import - I can't imagine anyone without pointy hair envisaging such a system, so I would have had to suspend my belief in the sanity of DBAs in order to notice it) - and anyway, as MelS-512196 said, anything/everything breaks and needs exception handling. But I didn't think Dan's comment was mostly marketing speak, as sknox suggested.
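
    To make that batch-size sequence concrete, the retry skeleton would look something like this T-SQL sketch (the load itself and the table names are just placeholders, and real code has to keep track of which rows have already gone in):

    DECLARE @BatchSize int, @Divisor int, @Done bit;
    SELECT @BatchSize = 100000, -- starting batch size
           @Divisor   = 2,      -- divisor doubles after each failure: /2, /8, /64, /1024...
           @Done      = 0;

    WHILE @Done = 0
    BEGIN
        BEGIN TRY
            -- Placeholder for the real work: load the next @BatchSize rows, e.g.
            -- INSERT dbo.Target SELECT TOP (@BatchSize) ... FROM dbo.Staging WHERE NotYetLoaded = 1;
            SET @Done = 1; -- set this when the staging source is exhausted
        END TRY
        BEGIN CATCH
            -- (a real version would also roll back the failed batch here)
            -- Hyperexponential backoff: shrink the batch and double the divisor,
            -- dropping back to plain exponential (x16) once the divisor reaches 16
            SET @BatchSize = @BatchSize / @Divisor;
            IF @Divisor < 16 SET @Divisor = @Divisor * 2;
            IF @BatchSize < 1
            BEGIN
                RAISERROR('Batch size reduced to zero - giving up.', 16, 1);
                BREAK;
            END;
        END CATCH;
    END;

    Capping the divisor at 16 is what gives the 1/16384 step mentioned above once you get past 1/1024.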

    Geoff's comments really answer it all - those constraints that forbid use of BCP (or BULK INSERT if we are looking at a server later than SQL 2000) are just plain crazy, and BCP is the basis for a sensible and cost-effective solution. [After a spell with object-oriented databases I came back to the real world of relational databases (and had to throw IBM's DB2 out of the shop and bring in SQLS so that rapid development and prototyping with incremental release was feasible) - but I have never used BULK INSERT, so I have to take Geoff's word for it that it will do the job. Anyone back then who suggested "requirements" that would preclude use of DTS where needed would have been looking for a new job pretty quickly!]

    Tom

  • Hi Folks,

    Lots of good comments here, but I want to make a couple things clear: I work with systems ranging in size from 1 Terabyte to 3 Petabytes. Things change at different sizes of systems. Rules change, engines change, processing changes, and architecture changes.

    What "works" at 1 TB most likely won't work at 50 TB, what works at 50TB most likely needs to be altered to work at 150TB, and what works at 150TB needs to be altered again to work at 500TB. and so on. The reason I bring this up is because of all this talk of constraints.

    Constraints in the database put a HUGE load on the processing power of the DB engine and CPU during load time. Processors have been getting better, bigger, stronger, faster and cheaper - this is great! But the engines like SQLServer need to take advantage of the CPU power by being re-engineered at the core level to be truly multi-threaded (and not single strung when it gets to something like the darn pagefile.sys). Microsoft realizes this, and made some huge changes to the core OS in Windows 64 bit. Then they purchased DatAllegro, gutted the product and placed the true parallelism and partitioning algorithms into the core of SQLServer (Well Done !).

    Now, can anyone tell me if they've changed the algorithms inside SQLServer to make the "constraint checks" lazy? Queued on multiple background threads? If not, then the impact to bulk loading (ESPECIALLY ON LARGE FILES) is severely negative. Loading speeds for 1k row sizes will be severely hampered, meaning that most folks will jump for joy if they get DTS or BCP to perform at or above 20,000 rows per second on pure inserts WITH constraints on.

    Do the math: at 20k rps for a 1k row width at an 8k block size, plus constraint block modifications, index modifications, and header modifications, how long would it take to load 45 million rows? Even before that overhead, 45,000,000 rows at 20,000 rows per second is 2,250 seconds - roughly 37 minutes. The answer is: not fast enough.

    The only way to continue to load rows at an ever-growing capacity is to increase the load performance. It's funny that no-one in this thread even addressed the performance numbers I put forward the first time, although I can understand why... The need to have it run at 80k rows per second at a 1k row width is imperative as a MINIMUM. (Updates and deletes are even worse, especially if they are single-row driven and can't be processed in parallel.) The absolute performance need for 45 million row, or even 150 million row, loads is 120k rows per second, upwards of 300k rows per second.

    If Microsoft wants to compete with Teradata and DB2 UDB EEE, or Paraccel, or Netezza, or most other engines (which I'm sure they do), then they must have this kind of performance available. At the end of the day what this means is: shutting down the constraints as a FIRST STEP. The second step is shutting down all other indexes, the third step is increasing the block sizes, the fourth step is increasing parallelism of PageFile.sys, the fifth step is dramatically increasing the parallel insert capabilities, and the last step is adding bigger, faster, better, cheaper hardware (RAM & CPU).
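
    For the first two of those steps, the mechanics in SQLServer look roughly like this (a sketch only - the table and index names are invented):

    -- Step 1: shut off constraint checking on the target
    -- (NOCHECK covers foreign key and CHECK constraints; PK/UNIQUE are indexes and belong to step 2)
    ALTER TABLE dbo.BigTarget NOCHECK CONSTRAINT ALL;

    -- Step 2: disable the nonclustered indexes (leave the clustered index alone,
    -- disabling it makes the table inaccessible)
    ALTER INDEX IX_BigTarget_Col2 ON dbo.BigTarget DISABLE;

    -- ...run the bulk load here (BULK INSERT / bcp / SSIS fast load)...

    -- Afterwards: rebuild the indexes and re-validate the constraints
    ALTER INDEX ALL ON dbo.BigTarget REBUILD;
    ALTER TABLE dbo.BigTarget WITH CHECK CHECK CONSTRAINT ALL;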

    Now, if they want really good performance in single row batch updates, or single row batch deletes, they MUST follow similar steps - except: they have to leave the constraints and indexes in place... so this means putting constraints and indexes in a "lazy queued process" to execute in parallel but in the background. It means more synchronization for any "error" that may occur during the load & check process, but it means faster throughput.

    Finally, let's talk about business needs now. The business NEEDS compliance, and they need a single enterprise store (consolidated) that contains associated business data. The business also needs high-speed massive loads. Let's take the example of CMS (the Centers for Medicare & Medicaid Services in the US). They maintain many of their Medicare/Medicaid records in a DB2 system on an AS/400 on an operational basis. They have over 1.5 Billion subscriber records. They have no idea which records are updated, which are inserted, and which are deleted (due to multiple systems, multiple applications, multiple source feeds from doctors and other government offices). The ONLY way they can assemble any kind of decent picture is to pure-copy the information from all the systems to a staging area first, then move it to a warehouse - a raw data warehouse (yes, with constraints on, but only on the primary keys).

    The staging area assures high-speed loading of raw data sources at 400k rows per second. Part of their Data Warehouse is based on the Data Vault principles of auditable and raw data; yes, it has constraints, but not across the BUSINESS data. Now add in the final rules here: BUSINESS RULES DRIVE CONSTRAINTS IN OLTP - BUSINESS RULES DRIVE CHANGES TO THE ARCHITECTURE AND DESIGN OF OLTP. When dealing with this much data in this short a time period, it is VERY difficult to maintain performance of the system with business rules driving the EDW (embedded in the database). IF the business rules are implemented as constraints, not only do they physically slow down the EDW, they also slow down IT's response time to business changes - it becomes impossible to maintain the "speed of business", which ultimately leads to a shut-down or restructuring of the EDW project. It's these factors that are discussed in my Data Vault Business Book - so while this may seem like a marketing statement, I assure you that customers all over the world see it differently (as a necessity for forward motion and change to the EDW environment).

    Cheers,

    Dan Linstedt

  • Dan, what are you using to load files into SQL Server at 400k RPS with and how many parallel threads are you using to do that?

    --Jeff Moden



  • Hi Jeff,

    I'm sorry if I wasn't clear enough. 400k rps is where SQLServer 2008 R2 should be going - it's not where it is today. 400k rps during load is only achieved in Teradata with a Teradata hardware setup using FastLoad. I've also achieved this inside of Teradata when transferring data from one table to another with INSERT INTO ... SELECT FROM statements.

    I would love to see SQLServer 2008 R2 get there, but the fastest I've seen SQLServer 2008 R2 run is 120k rows per second using Informatica 8.6.1 on 32-bit Windows Server 2008, with 32-bit SQLServer 2008 R2 using internal Ultra-SCSI disk in a RAID 0+1 array config. Informatica on the same machine as SQLServer, reading from internal disk, writing back to internal disk. Dell PowerEdge 2650 with 4GB RAM.

    Hope this helps clear things up... Again, the nature of competition and the drive for performance is really what this is all about - and from a business perspective, moving data into a data warehouse requires adaptability of IT (agility to respond), and performance of the hardware. So at the end of the day, removing the business rules from the raw EDW becomes a necessity in volume situations.

    Hope this helps,

    Dan Linstedt

    PS: Sales Pitch: I'm available for consulting, I've been consulting on systems architecture & design, performance analysis and tuning of large scale systems for years.

  • Sorry for the late reply. Thanks for the explanation, Dan. I really appreciate it.

    --Jeff Moden



  • Hi folks

    Sorry for dragging up an old thread, but I stumbled upon the related article whilst looking for a bit of info on best practices for the use of try/catch within a transaction for a small ETL process. A very informative discussion, so thanks to all; I like the binary/hyperexponential chop ideas, but will try to explore using bcp where possible - I've found it very useful for non-production stuff in the past.

    Anyway, here's my question and scenario:

    In Mel's example in the original article, he deletes from his staging table after committing the transaction which inserts into the target table and updates the source table, and even explicitly comments the fact to stress its importance:

    --Don't delete from temp table until transaction is committed

    DELETE @TempSource

    Does this matter? In similar code I wrote the other day, before reading this, I assumed I wanted the delete from staging to be part of the transaction too, so that the entire process gets committed or rolled back together. Otherwise, if you get a random error during the delete, aren't you left with data you don't want in the staging table? It's not a massive issue, as my process truncates the staging table at the start, but I'm wondering if I'm missing something, or worrying over something inconsequential? Thanks

    My scenario is thus:

    Existing process uses SSIS to import data directly into production table. The data actually needs to be transformed to be in the form our stored procs want it in, and at the moment this is done using a view, which performs pretty badly - columns we need to index on exist courtesy of expressions in the view. So options are index the view, or change the import process.

    I opted for the latter, and am adding a staging table, and a stored proc which transforms the data from staging into a new target table, plus copies the raw data into the current production table, which in effect becomes our audit store. Existing stored procs will be changed to point at the new target table rather than existing production raw data store.

    There are no real constraints on the target table, so I shouldn't have to worry about errors too much, but I still want to use an explicit transaction, something like:

    BEGIN TRAN;

    -- Insert from staging into target table, transforming columns where desired
    INSERT tbl_Target
    SELECT col1
          ,MySchema.fn_transform(col2)
          ,ISNULL(col3, 0)  -- ISNULL needs a replacement value; 0 is just a placeholder here
    FROM tbl_Staging;

    -- Copy raw data from staging into store
    INSERT tbl_Store
    SELECT col1, col2, col3
    FROM tbl_Staging;

    -- Delete data from staging
    DELETE FROM tbl_Staging;

    -- Note: @@ERROR only reflects the most recent statement (the DELETE above)
    IF (@@ERROR = 0)
        COMMIT TRAN;
    ELSE
    BEGIN
        ROLLBACK TRAN;
        -- RAISERROR to tell SSIS package there's been a problem
    END

    This process runs every hour to import a few thousand rows, thankfully mostly during office hours rather than the middle of the night! :-P

    Now, I've never used T-SQL TRY/CATCH before, as I worked mainly with SQL 2000 until this year, but I feel I should use it, especially as it can be used to detect a deadlock and retry (in this case there shouldn't be other stuff querying the staging and store tables, but you never know).
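
    Something like this is roughly what I have in mind for the TRY/CATCH version (an untested sketch, with the deadlock-retry loop left out for brevity):

    BEGIN TRY
        BEGIN TRAN;

        INSERT tbl_Target
        SELECT col1, MySchema.fn_transform(col2), ISNULL(col3, 0)
        FROM tbl_Staging;

        INSERT tbl_Store
        SELECT col1, col2, col3
        FROM tbl_Staging;

        DELETE FROM tbl_Staging;

        COMMIT TRAN;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRAN;

        -- Re-raise so the SSIS package sees the failure
        DECLARE @ErrMsg nvarchar(2048);
        SET @ErrMsg = ERROR_MESSAGE();
        RAISERROR(@ErrMsg, 16, 1);
    END CATCH;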

    But does it matter whether my delete from staging is inside the transaction?

    Thanks

  • wardster (8/21/2010)


    In Mel's example in the original article, he deletes from his staging table after committing the transaction which inserts into the target table and updates the source table, and even explicitly comments the fact to stress it's importance:

    --Don't delete from temp table until transaction is committed

    DELETE @TempSource

    Does this matter?

    With a table variable, no. It's not affected by transactions. But, if you look at the comment, it looks like the author may have used a Temp Table originally and changed it to a table variable later on. It could matter if it's a Temp Table (although I've not studied the code well enough to know for sure in this case).
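
    A quick way to prove it to yourself (throwaway demo):

    DECLARE @T TABLE (SomeValue int);

    BEGIN TRAN;
        INSERT INTO @T (SomeValue) VALUES (1);
    ROLLBACK TRAN;

    SELECT * FROM @T; -- still returns the row... table variables ignore the rollback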

    --Jeff Moden



  • Hi Jeff

    Thanks for the quick reply.

    The solution I'm in the middle of testing uses a permanent staging table rather than temp table or table variable. Does this mean I should be deleting from it outside the insert transaction? The staging table is brand new so I'm certain no other objects reference it apart from the stored proc with the delete in.

  • wardster (8/21/2010)


    Hi Jeff

    Thanks for the quick reply.

    The solution I'm in the middle of testing uses a permanent staging table rather than temp table or table variable. Does this mean I should be deleting from it outside the insert transaction? The staging table is brand new so I'm certain no other objects reference it apart from the stored proc with the delete in.

    First, unless you're trying to preserve the value of an IDENTITY column, I wouldn't use DELETE... I'd use TRUNCATE. It's a whole lot faster and easier on the LOG.

    Second, the same thing applies. You'd have to do the DELETE or TRUNCATE outside the transaction, or a rollback will undo the DELETE or TRUNCATE. Of course, that may be just what you want, so the real answer is "It Depends". 😀
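
    If you want to see the rollback behavior for yourself, here's a 10-second test (throwaway table):

    CREATE TABLE dbo.RollbackTest (SomeValue int);
    INSERT INTO dbo.RollbackTest (SomeValue) VALUES (1);

    BEGIN TRAN;
        TRUNCATE TABLE dbo.RollbackTest; -- yes, TRUNCATE is fully transactional
    ROLLBACK TRAN;

    SELECT * FROM dbo.RollbackTest; -- the row is back
    DROP TABLE dbo.RollbackTest;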

    --Jeff Moden



  • Just caught the question in the recent post. If I can remember my own code correctly :-) The reason the delete of the table variable must happen after the transaction commits is that the table variable data is used in the CATCH block if the transaction fails and rolls back. Therefore it is deleted only after the transaction commits.

