• Hi Folks,

    Lots of good comments here, but I want to make a couple things clear: I work with systems ranging in size from 1 Terabyte to 3 Petabytes. Things change at different sizes of systems. Rules change, engines change, processing changes, and architecture changes.

    What "works" at 1 TB most likely won't work at 50 TB, what works at 50TB most likely needs to be altered to work at 150TB, and what works at 150TB needs to be altered again to work at 500TB. and so on. The reason I bring this up is because of all this talk of constraints.

    Constraints in the database put a HUGE load on the processing power of the DB engine and CPU during load time. Processors have been getting better, bigger, stronger, faster, and cheaper - this is great! But engines like SQL Server need to take advantage of that CPU power by being re-engineered at the core level to be truly multi-threaded (and not single-threaded when they hit something like the darn pagefile.sys). Microsoft realizes this and made some huge changes to the core OS in 64-bit Windows. Then they purchased DATAllegro, gutted the product, and placed the true parallelism and partitioning algorithms into the core of SQL Server (well done!).

    Now, can anyone tell me if they've changed the algorithms inside SQL Server to make the "constraint checks" lazy? Queued on multiple background threads? If not, then the impact on bulk loading (ESPECIALLY ON LARGE FILES) is severely negative. Loading speeds for 1 KB row sizes will be badly hampered, meaning that most folks will jump for joy if they get DTS or BCP to perform at or above 20,000 rows per second on pure inserts WITH constraints on.
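
    To see what those checks cost you, here is a minimal BULK INSERT sketch (the table and file names are made up purely for illustration); time the same file with and without the CHECK_CONSTRAINTS option and compare the rows per second you get:

        -- Hypothetical staging table and data file, shown only to make the overhead measurable.
        BULK INSERT dbo.StageClaims
        FROM 'D:\loads\claims.dat'
        WITH (
            TABLOCK,              -- table-level lock, needed for minimally logged loads
            BATCHSIZE = 100000,   -- commit every 100k rows
            CHECK_CONSTRAINTS     -- remove this line and CHECK/FK constraints are skipped during the load
        );

    The same toggle exists on the bcp command line as the -h "CHECK_CONSTRAINTS" hint.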

    Do the math: at 20k rows per second for a 1 KB row width on an 8 KB block size, plus constraint block modifications, index modifications, and header modifications, how long would it take to load 45 million rows? The answer: not fast enough.
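
    A quick back-of-the-envelope version of that math, counting the raw insert time only:

        -- 45 million rows at 20,000 rows per second.
        SELECT 45000000 / 20000 AS load_seconds;   -- 2,250 seconds, roughly 37.5 minutes

    And that is before a single constraint check, index update, or page-header write has been paid for.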

    The only way to continue loading rows at an ever-growing volume is to increase the load performance. It's funny that no one in this thread even addressed the performance numbers I put forward the first time, although I can understand why... The need to run at 80k rows per second at a 1 KB row width is imperative as a MINIMUM. (Updates and deletes are even worse, especially if they are single-row driven and can't be processed in parallel.) The absolute performance need for 45 million row, or even 150 million row loads is 120k rows per second, upwards of 300k rows per second.
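
    Run the same arithmetic against those targets and you can see why they are the floor, not the ceiling:

        -- Raw insert time in seconds for the 45M and 150M row loads at the target rates.
        SELECT 45000000  / 80000  AS sec_45m_at_80k,    -- 562 sec, about 9.4 minutes
               150000000 / 120000 AS sec_150m_at_120k,  -- 1,250 sec, about 21 minutes
               150000000 / 300000 AS sec_150m_at_300k;  -- 500 sec, about 8.3 minutes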

    If Microsoft wants to compete with Teradata and DB2 UDB EEE, or ParAccel, or Netezza, or most other engines (which I'm sure they do), then they must have this kind of performance available. At the end of the day what this means is: shutting down the constraints as a FIRST STEP. The second step is shutting down all other indexes, the third step is increasing the block sizes, the fourth step is increasing the parallelism of PageFile.sys, the fifth step is dramatically increasing the parallel insert capabilities, and the last step is adding bigger, faster, better, cheaper hardware (RAM & CPU).
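
    As a rough sketch of what the first two steps look like today (the table and index names below are hypothetical, purely to show the sequence):

        -- Step 1: stop checking FK and CHECK constraints for the load window.
        ALTER TABLE dbo.FactClaims NOCHECK CONSTRAINT ALL;

        -- Step 2: take the nonclustered indexes offline (the clustered index has to stay).
        ALTER INDEX IX_FactClaims_SubscriberId ON dbo.FactClaims DISABLE;

        -- ... run the bulk load here ...

        -- Afterwards: rebuild the index and re-validate so the constraints are trusted again.
        ALTER INDEX IX_FactClaims_SubscriberId ON dbo.FactClaims REBUILD;
        ALTER TABLE dbo.FactClaims WITH CHECK CHECK CONSTRAINT ALL;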

    Now, if they want really good performance in single-row batch updates or single-row batch deletes, they MUST follow similar steps - except they have to leave the constraints and indexes in place... so this means putting constraints and indexes into a "lazy queued process" that executes in parallel but in the background. It means more synchronization for any "error" that may occur during the load and check process, but it means faster throughput.
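
    SQL Server doesn't queue constraint checks on background threads today, but you can approximate the "lazy" idea by hand: load with the constraints disabled, then validate in one set-based pass afterwards. A rough sketch, using the same hypothetical tables and a hypothetical FK name:

        -- Re-validate the foreign key after the load instead of per row during it.
        ALTER TABLE dbo.FactClaims WITH CHECK CHECK CONSTRAINT FK_FactClaims_DimSubscriber;

        -- If the re-validation fails, synchronize the "errors": find the offending rows.
        SELECT f.ClaimId
        FROM   dbo.FactClaims f
        LEFT JOIN dbo.DimSubscriber s ON s.SubscriberId = f.SubscriberId
        WHERE  s.SubscriberId IS NULL;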

    Finally, let's talk about business needs. The business NEEDS compliance, and it needs a single enterprise store (consolidated) that contains associated business data. The business also needs high-speed massive loads. Let's take the example of CMS (the Centers for Medicare and Medicaid Services in the US). They maintain many of their Medicare/Medicaid records in a DB2 AS/400 system on an operational basis. They have over 1.5 billion subscriber records. They have no idea which records are updated, which are inserted, and which are deleted (due to multiple systems, multiple applications, and multiple source feeds from doctors and other government offices). The ONLY way they can assemble any kind of decent picture is to pure-copy the information from all the systems to a staging area first, then move it to a warehouse - a raw data warehouse (yes, with constraints on, but only on the primary keys).
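
    A raw staging table in that kind of architecture looks something like this (hypothetical names and columns, just to show the "primary key only" shape):

        -- Raw staging: a primary key and nothing else, so the copy from the
        -- source systems can run at full bulk-load speed.
        CREATE TABLE dbo.StageSubscriberRecords
        (
            RecordId      BIGINT        NOT NULL,
            SourceSystem  VARCHAR(20)   NOT NULL,
            LoadDate      DATETIME2     NOT NULL,
            RawPayload    VARCHAR(MAX)  NULL,
            CONSTRAINT PK_StageSubscriberRecords PRIMARY KEY (RecordId, SourceSystem)
        );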

    The staging area assures high-speed loading of raw data sources at 400k rows per second. Part of their data warehouse is based on the Data Vault principles of auditable and raw data; yes, it has constraints, but not across the BUSINESS data. Now add in the final rules here: BUSINESS RULES DRIVE CONSTRAINTS IN OLTP - BUSINESS RULES DRIVE CHANGES TO THE ARCHITECTURE AND DESIGN OF OLTP. When dealing with this much data in this short a time period, it is VERY difficult to maintain performance of the system with business rules driving the EDW (embedded in the database). IF the business rules are implemented as constraints, not only do they physically slow down the EDW, they also slow down IT's response time to business changes - it becomes impossible to maintain the "speed of business". Which ultimately leads to a shutdown or restructuring of the EDW project. It's these factors that are discussed in my Data Vault business book - so while this may seem like a marketing statement, I assure you that customers all over the world see it differently (as a necessity for forward motion and change to the EDW environment).

    Cheers,

    Dan Linstedt