Conditional Set-Based Processing: Moving Towards a Best Practice for ETL
Posted Friday, January 29, 2010 8:32 PM


SSCrazy Eights


Group: General Forum Members
Last Login: Today @ 10:00 AM
Points: 8,551, Visits: 9,043
I'll start off by declaring total agreement with Geoff.

But some of the other posts were very intelligent, so they are worth discussing.

I very much liked the first comment (from Keith) because it makes sense if you have to live with silly constraints that preclude BCP (anyone with a datacomms background would like it; it fits the real world so well), but I would recommend using a hyperexponential backoff algorithm (at least for the first few steps) rather than a simple power series. This means that your batch size is 1/2 the original size at the first failure level, 1/8 at the second, 1/64 at the third, 1/1024 at the fourth, and so on [in practice, I have dropped back to exponential if I hit the "and so on": after 1/1024 I have 1/16384, and so on]. This is often [sometimes?] better than compromises like using 10 as the division factor instead of 2 (often, but not always; we are into statistics and probability theory here, not simple arithmetic with definite answers).

I didn't much like Dan Linstedt's first comment, perhaps because I just naturally assumed that the constraints on the original import would be minimal and would not include most business rules (maybe if I read the article again I will see something I didn't notice on first reading that implies all the business rules are used in the initial import; I can't imagine anyone without pointy hair envisaging such a system, so I would have had to suspend my belief in the sanity of DBAs in order to notice it). In any case, as MelS-512196 said, anything/everything breaks and needs exception handling. But I didn't think Dan's comment was mostly marketing speak, as sknox suggested.
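A minimal T-SQL sketch of that shrinking-batch idea, purely for illustration - the loader procedure name and starting batch size are invented, not anything from the article:

```sql
-- Hypothetical sketch: retry a failing load with a hyperexponentially
-- shrinking batch size. The division factor doubles at each failure
-- level (2, 4, 8, 16, ...), giving 1/2, 1/8, 1/64, 1/1024 of the
-- original batch size, as described above.
DECLARE @BatchSize INT = 100000,  -- starting batch size (assumed)
        @Factor    INT = 2,       -- current division factor
        @Done      BIT = 0;

WHILE @Done = 0 AND @BatchSize >= 1
BEGIN
    BEGIN TRY
        EXEC dbo.LoadBatch @Rows = @BatchSize;  -- hypothetical loader proc
        SET @Done = 1;
    END TRY
    BEGIN CATCH
        -- On failure, shrink hyperexponentially and double the factor
        SET @BatchSize = @BatchSize / @Factor;
        SET @Factor    = @Factor * 2;
    END CATCH
END
```

With a 100,000-row start, the sequence of attempted sizes is 100,000; 50,000; 12,500; 1,562; 97; ... - matching the 1/2, 1/8, 1/64, 1/1024 progression.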

Geoff's comments really answer it all: those constraints that forbid the use of BCP (or BULK INSERT if we are looking at a server later than SQL 2000) are just plain crazy, and BCP is the basis for a sensible and cost-effective solution. [I remember when I first used DTS back in 2000 (take that as the year or as the SQLS version; you'll be right either way), when, after 15 years of functional, object, and constraint[logic] oriented databases, I came back to the real world of relational databases (and had to throw IBM's DB2 out of the shop and bring in SQLS so that rapid development and prototyping with incremental release was feasible). I have never used BULK INSERT, though, so I have to take Geoff's word for it that it will do the job. Anyone back then who suggested "requirements" that would preclude the use of DTS where needed would have been looking for a new job pretty quickly!]


Tom
Post #856545
Posted Saturday, January 30, 2010 5:08 AM
Forum Newbie


Group: General Forum Members
Last Login: Tuesday, October 22, 2013 6:36 AM
Points: 7, Visits: 26
Hi Folks,

Lots of good comments here, but I want to make a couple of things clear: I work with systems ranging in size from 1 terabyte to 3 petabytes. Things change at different system sizes. Rules change, engines change, processing changes, and architecture changes.

What "works" at 1 TB most likely won't work at 50 TB, what works at 50TB most likely needs to be altered to work at 150TB, and what works at 150TB needs to be altered again to work at 500TB. and so on. The reason I bring this up is because of all this talk of constraints.

Constraints in the database put a HUGE load on the processing power of the DB engine and CPU during load time. Processors have been getting better, bigger, stronger, faster, and cheaper - this is great! But engines like SQL Server need to take advantage of that CPU power by being re-engineered at the core level to be truly multi-threaded (and not single-threaded when it gets to something like the darn pagefile.sys). Microsoft realizes this, and made some huge changes to the core OS in 64-bit Windows. Then they purchased DATAllegro, gutted the product, and placed the true parallelism and partitioning algorithms into the core of SQL Server (well done!).

Now, can anyone tell me if they've changed the algorithms inside SQL Server to make the "constraint checks" lazy - queued on multiple background threads? If not, then the impact on bulk loading (ESPECIALLY ON LARGE FILES) is severely negative. Loading speeds for 1k row sizes will be severely hampered, meaning that most folks will jump for joy if they get DTS or BCP to perform at or above 20,000 rows per second on pure inserts WITH constraints on.

Do the math: at 20k rows per second for a 1k row width at 8k block size, plus constraint block modifications, index modifications, and header modifications, how long would it take to load 45 million rows? The raw inserts alone take 45,000,000 / 20,000 = 2,250 seconds - roughly 37 minutes - before any of that overhead. The answer is: not fast enough.

The only way to continue to load rows into an ever-growing system is to increase the load performance. It's funny that no one in this thread even addressed the performance numbers I put forward the first time, although I can understand why... The need to have it run at 80k rows per second at 1k row width is imperative as a MINIMUM. (Updates and deletes are even worse, especially if they are single-row driven and can't be processed in parallel.) The absolute need for performance with 45 million rows, or even 150 million row loads, is 120k rows per second, upwards of 300k rows per second.

If Microsoft wants to compete with Teradata, DB2 UDB EEE, ParAccel, Netezza, or most other engines (which I'm sure they do), then they must have this kind of performance available. At the end of the day, what this means is: shutting down the constraints as a FIRST STEP. The second step is shutting down all other indexes, the third step is increasing the block sizes, the fourth step is increasing the parallelism of pagefile.sys, the fifth step is dramatically increasing the parallel insert capabilities, and the last step is adding bigger, faster, better, cheaper hardware (RAM & CPU).

Now, if they want really good performance in single-row batch updates or single-row batch deletes, they MUST follow similar steps - except that they have to leave the constraints and indexes in place... so this means putting constraints and indexes in a "lazy queued process" that executes in parallel but in the background. It means more synchronization for any "error" that may occur during the load-and-check process, but it means faster throughput.

Finally, let's talk about business needs. The business NEEDS compliance, and it needs a single, consolidated enterprise store that contains associated business data. The business also needs high-speed massive loads. Let's take the example of CMS (the Centers for Medicare & Medicaid Services in the US). They maintain many of their Medicare/Medicaid records in a DB2 AS/400 system on an operational basis. They have over 1.5 billion subscriber records. They have no idea which records are updated, which are inserted, and which are deleted (due to multiple systems, multiple applications, and multiple source feeds from doctors and other government offices). The ONLY way they can assemble any kind of decent picture is to pure-copy the information from all the systems to a staging area first, then move it to a warehouse - a raw data warehouse (yes, with constraints on, but only on the primary keys).

The staging area assures high-speed loading of raw data sources at 400k rows per second. Part of their data warehouse is based on the Data Vault principles of auditable, raw data; yes, it has constraints, but not across the BUSINESS data. Now add in the final rules here: BUSINESS RULES DRIVE CONSTRAINTS IN OLTP, and BUSINESS RULES DRIVE CHANGES TO THE ARCHITECTURE AND DESIGN OF OLTP. When dealing with this much data in this short a time period, it is VERY difficult to maintain performance of the system with business rules driving the EDW (embedded in the database). IF the business rules are implemented as constraints, not only do they physically slow down the EDW, they also slow down IT's response time to business changes - it becomes impossible to maintain the "speed of business" - which ultimately leads to a shutdown or restructuring of the EDW project. It's these factors that are discussed in my Data Vault business book - so while this may seem like a marketing statement, I assure you that customers all over the world see it differently (as a necessity for forward motion and change in the EDW environment).

Cheers,
Dan Linstedt
Post #856629
Posted Sunday, January 31, 2010 5:31 PM


SSC-Dedicated


Group: General Forum Members
Last Login: Today @ 8:46 PM
Points: 36,719, Visits: 31,168
Dan, what are you using to load files into SQL Server at 400k RPS, and how many parallel threads are you using to do that?

--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #856889
Posted Monday, February 1, 2010 3:33 AM
Forum Newbie


Group: General Forum Members
Last Login: Tuesday, October 22, 2013 6:36 AM
Points: 7, Visits: 26
Hi Jeff,

I'm sorry if I wasn't clear enough. 400k rps is where SQL Server 2008 R2 should be going - it's not where it is today. 400k rps during load is only achieved in Teradata, with a Teradata hardware setup, using FastLoad. I've also achieved this inside of Teradata when transferring data from one table to another with INSERT INTO ... SELECT FROM statements.

I would love to see SQL Server 2008 R2 get there, but the fastest I've seen it run is 120k rows per second, using Informatica 8.6.1 on 32-bit Windows Server 2008 with 32-bit SQL Server 2008 R2 on internal Ultra-SCSI disk in a RAID 0+1 array. Informatica was on the same machine as SQL Server, reading from internal disk and writing back to internal disk: a Dell PowerEdge 2650 with 4GB RAM.

Hope this helps clear things up... Again, the nature of competition and the drive for performance is really what this is all about. From a business perspective, moving data into a data warehouse requires adaptability of IT (agility to respond) and performance from the hardware, so at the end of the day, removing the business rules from the raw EDW becomes a necessity in volume situations.

Hope this helps,
Dan Linstedt
PS: Sales Pitch: I'm available for consulting, I've been consulting on systems architecture & design, performance analysis and tuning of large scale systems for years.
Post #857022
Posted Friday, February 5, 2010 5:00 PM


SSC-Dedicated


Group: General Forum Members
Last Login: Today @ 8:46 PM
Points: 36,719, Visits: 31,168
Sorry for the late reply. Thanks for the explanation, Dan. I really appreciate it.

--Jeff Moden
Post #860933
Posted Saturday, August 21, 2010 9:40 AM
Forum Newbie


Group: General Forum Members
Last Login: Thursday, August 23, 2012 4:10 AM
Points: 2, Visits: 69
Hi folks

Sorry for dragging up an old thread, but I stumbled upon the related article whilst looking for a bit of info on best practices for the use of TRY/CATCH within a transaction for a small ETL process. A very informative discussion, so thanks to all; I like the binary/hyperexponential chop ideas, but will try to explore using bcp where possible, having found it very useful for non-production stuff in the past.

Anyway, here's my question and scenario:

In Mel's example in the original article, he deletes from his staging table after committing the transaction which inserts into the target table and updates the source table, and even explicitly comments the fact to stress its importance:

--Don't delete from temp table until transaction is committed
DELETE @TempSource


Does this matter? In similar code I wrote the other day, before reading this, I assumed I wanted the delete from staging to be part of the transaction too, so the entire process gets committed or rolled back together. Otherwise, if you get a random error during the delete, aren't you left with data that you don't want in the staging table? It's not a massive issue, as my process truncates the staging table at the start, but I'm wondering if I'm missing something or worrying over something inconsequential. Thanks

My scenario is thus:

The existing process uses SSIS to import data directly into a production table. The data actually needs to be transformed into the form our stored procs want it in, and at the moment this is done using a view, which performs pretty badly - columns we need to index on exist courtesy of expressions in the view. So the options are to index the view or change the import process.

I opted for the latter, and am adding a staging table plus a stored proc which transforms the data from staging into a new target table and copies the raw data into the current production table, which in effect becomes our audit store. Existing stored procs will be changed to point at the new target table rather than the existing production raw data store.

There are no real constraints on the target table, so I shouldn't have to worry about errors too much, but I still want to use an explicit transaction, something like:

BEGIN TRAN

-- Insert from staging into target table, transforming columns where desired
INSERT tbl_Target
SELECT col1
, MySchema.fn_transform(col2)
, ISNULL(col3, '')  -- ISNULL needs a replacement value; '' is a placeholder
FROM tbl_Staging;

-- Copy raw data from staging into store
INSERT tbl_Store
SELECT col1, col2, col3
FROM tbl_Staging;

-- Delete data from staging
DELETE FROM tbl_Staging;

-- Note: @@ERROR reflects only the most recent statement (the DELETE here)
IF (@@ERROR = 0)
COMMIT TRAN;
ELSE
BEGIN
ROLLBACK TRAN;
-- RAISERROR to tell the SSIS package there's been a problem
END

This process runs every hour to import a few thousand rows, thankfully mostly during office hours and not in the middle of the night!

Now, I've never used T-SQL TRY/CATCH before, as I worked mainly with SQL 2000 until this year, but I feel I should use it, especially as it can be used to detect a deadlock and retry (in this case there shouldn't be other stuff querying the staging and store tables, but you never know).
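As a sketch only of that TRY/CATCH-with-retry idea - the retry count, the one-second wait, the re-raise policy, and the ISNULL replacement value are all assumptions, not anything from the article:

```sql
-- Sketch: table and function names come from the post above; the
-- retry policy is an assumption. Error 1205 = chosen as deadlock victim.
DECLARE @Retries INT = 3;

WHILE @Retries > 0
BEGIN
    BEGIN TRY
        BEGIN TRAN;

        -- Insert from staging into target, transforming columns
        INSERT tbl_Target
        SELECT col1, MySchema.fn_transform(col2), ISNULL(col3, '')
        FROM tbl_Staging;

        -- Copy raw data from staging into the audit store
        INSERT tbl_Store
        SELECT col1, col2, col3
        FROM tbl_Staging;

        DELETE FROM tbl_Staging;

        COMMIT TRAN;
        BREAK;  -- success: leave the retry loop
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRAN;

        IF ERROR_NUMBER() = 1205 AND @Retries > 1
        BEGIN
            SET @Retries = @Retries - 1;
            WAITFOR DELAY '00:00:01';  -- brief pause before retrying
        END
        ELSE
        BEGIN
            -- Re-raise so the calling SSIS package sees the failure
            DECLARE @Msg NVARCHAR(2048) = ERROR_MESSAGE();
            RAISERROR(@Msg, 16, 1);
            BREAK;
        END
    END CATCH
END
```

Because the DELETE sits inside the transaction here, a rollback leaves the staging rows in place for the retry, which is the behaviour the question above is asking about.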

But does it matter whether my delete from staging is inside the transaction?

Thanks
Post #972958
Posted Saturday, August 21, 2010 11:32 AM


SSC-Dedicated


Group: General Forum Members
Last Login: Today @ 8:46 PM
Points: 36,719, Visits: 31,168
wardster (8/21/2010)
In Mel's example in the original article, he deletes from his staging table after committing the transaction which inserts into the target table and updates the source table, and even explicitly comments the fact to stress its importance:

--Don't delete from temp table until transaction is committed
DELETE @TempSource

Does this matter?


With a table variable, no. It's not affected by transactions. But, if you look at the comment, it looks like the author may have used a Temp Table originally and changed it to a table variable later on. It could matter if it's a Temp Table (although I've not studied the code well enough to know for sure in this case).
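A minimal demo of that point: a table variable keeps its rows across a ROLLBACK, while a temp table does not (table and column names invented for illustration):

```sql
-- Table variables are not affected by transaction rollback;
-- temp tables are.
DECLARE @t TABLE (id INT);
CREATE TABLE #t (id INT);

BEGIN TRAN;
    INSERT @t VALUES (1);
    INSERT #t VALUES (1);
ROLLBACK TRAN;

SELECT COUNT(*) FROM @t;  -- 1: the table variable kept its row
SELECT COUNT(*) FROM #t;  -- 0: the temp table insert was rolled back

DROP TABLE #t;
```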


--Jeff Moden
Post #972974
Posted Saturday, August 21, 2010 2:08 PM
Forum Newbie


Group: General Forum Members
Last Login: Thursday, August 23, 2012 4:10 AM
Points: 2, Visits: 69
Hi Jeff

Thanks for the quick reply.

The solution I'm in the middle of testing uses a permanent staging table rather than a temp table or table variable. Does this mean I should be deleting from it outside the insert transaction? The staging table is brand new, so I'm certain no other objects reference it apart from the stored proc with the delete in it.
Post #972997
Posted Sunday, August 22, 2010 1:36 PM


SSC-Dedicated


Group: General Forum Members
Last Login: Today @ 8:46 PM
Points: 36,719, Visits: 31,168
wardster (8/21/2010)
Hi Jeff

Thanks for the quick reply.

The solution I'm in the middle of testing uses a permanent staging table rather than temp table or table variable. Does this mean I should be deleting from it outside the insert transaction? The staging table is brand new so I'm certain no other objects reference it apart from the stored proc with the delete in.


First, unless you're trying to preserve the value of an IDENTITY column, I wouldn't use DELETE... I'd use TRUNCATE. It's a whole lot faster and easier on the LOG.

Second, the same thing applies. You'd have to do the DELETE or TRUNCATE outside the transaction, or a rollback will undo the DELETE or TRUNCATE. Of course, that may be just what you want, so the real answer is "It Depends".
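A quick demo of that "It Depends": TRUNCATE TABLE is logged and transactional in SQL Server, so a rollback restores the truncated rows (table name invented for illustration):

```sql
-- TRUNCATE inside a transaction is undone by ROLLBACK.
CREATE TABLE #stage (id INT);
INSERT #stage VALUES (1), (2), (3);

BEGIN TRAN;
    TRUNCATE TABLE #stage;
ROLLBACK TRAN;

SELECT COUNT(*) FROM #stage;  -- 3: the TRUNCATE was rolled back

DROP TABLE #stage;
```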


--Jeff Moden
Post #973096
Posted Thursday, August 26, 2010 8:13 AM
SSC-Addicted


Group: General Forum Members
Last Login: Friday, August 9, 2013 3:52 PM
Points: 430, Visits: 448
Just caught the question in the recent post. If I remember my own code correctly, the reason the delete of the table variable must happen after the transaction commits is that the table variable data is used in the CATCH clause if the transaction fails and rolls back. Therefore it is deleted only after the transaction commits.
Post #975672