• A good article - one that in several ways reminds me of a process at my company for building the cube - but NOT for populating our datamart ("normalized" and dimensional databases). The cube is completely rebuilt on a nightly basis, but the ever-growing amount of data in the underlying datamart makes that process take longer and longer, and we are in the process of redesigning it as a delta-load method (changes only).

    In my opinion your process has a few drawbacks that can turn this kind of ETL solution into a no-go!

    First of all - and this has been mentioned by others in this thread - there is SQL Server's ability to create good execution plans.

    If you always drop and recreate your tables, you stand to lose all your SQL Server statistics for those tables - and hence the data needed for creating good execution plans, correct indexing, parallelism and more.
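
    Just to illustrate what gets thrown away: you can list the statistics objects on a table (and when each was last updated) before the table is dropped. A minimal sketch - the table name dbo.FactSales is made up:

        -- Hypothetical table name; lists every statistics object that a
        -- DROP TABLE would discard, and when each was last updated.
        SELECT  s.name AS stats_name,
                STATS_DATE(s.object_id, s.stats_id) AS last_updated
        FROM    sys.stats AS s
        WHERE   s.object_id = OBJECT_ID('dbo.FactSales');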

    Unless you know exactly how the users and/or applications consume the data - i.e. you know all the queries and their variations and have tested them - you may gain good performance when loading the data, but most reads afterwards will, due to the lack of statistics and good execution plans, not perform well. Of course, if no users have direct access to the datamart but only to the cube, AND you have optimized all your processes using previous execution plans, AND the data has not changed structure or keys, then this is not a real problem.

    In my company we run quite a few "ad hoc" queries on the data when tracking problems, analyzing, modelling and more - so having no statistics would hurt us BADLY! We have recognized that we do NOT - and cannot - anticipate how all the users construct their queries and make "crazy" but repeatable selections. So we need our statistics too much to do "DROP TABLE - SELECT INTO".

    A much better process - and in theory an equally well-performing one - would be to use the BULK_LOGGED recovery model on the database as the logging mechanism, TRUNCATE TABLE, and BULK INSERT as the data-maintenance mechanism (and you should actually run an UPDATE STATISTICS after the load, as sketched below). Here you get the good performance in the loading process and don't lose your statistics, so you can build (REBUILD) the correct indexes and keep up-to-date execution plans, making overall access to the data faster.
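
    A minimal sketch of that load pattern - the database, table and file names are made up, and the exact BULK INSERT options depend on your source file format:

        -- Switch to bulk-logged recovery for the load window.
        ALTER DATABASE DataMart SET RECOVERY BULK_LOGGED;

        -- Empty the table without losing its statistics objects.
        TRUNCATE TABLE dbo.FactSales;

        -- TABLOCK helps the insert qualify for minimal logging.
        BULK INSERT dbo.FactSales
        FROM '\\etl-server\extracts\FactSales.dat'
        WITH (TABLOCK, BATCHSIZE = 100000);

        -- Refresh the statistics that survived the truncate.
        UPDATE STATISTICS dbo.FactSales WITH FULLSCAN;

        -- Back to normal logging after the load.
        ALTER DATABASE DataMart SET RECOVERY FULL;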

    Another drawback of "SELECT INTO" is that it locks your system tables for the duration of the insert, which can escalate into serious locking issues in your loading process. Parallel "SELECT INTO" processes can therefore result in deadlocks or timeouts - you can only load the tables sequentially to be "safe".

    Also - if you have a non-optimal I/O subsystem, then having to write all that data over and over each night will surely become a performance bottleneck at some point. Only delta loads (not full loads) of data can help you here - using a hash value (MD5 or the like) per row to check whether the data has changed on a row basis, together with MERGE processes to do the INSERT, UPDATE and DELETE and to build slowly changing dimensions, will work (see the sketch below).
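
    A minimal sketch of such a hash-based delta load, assuming SQL Server 2008 or later for MERGE and HASHBYTES and 2012 or later for CONCAT; the dimension, staging table and column names are made up, the RowHash column is assumed to be stored on the dimension, and this does a simple Type 1 update - a Type 2 slowly changing dimension needs extra logic:

        MERGE dbo.DimCustomer AS tgt
        USING (SELECT CustomerKey,
                      Name,
                      City,
                      -- One hash per row over the compared columns.
                      HASHBYTES('MD5', CONCAT(Name, '|', City)) AS RowHash
               FROM   staging.Customer) AS src
              ON tgt.CustomerKey = src.CustomerKey
        -- Only touch rows whose hash shows a real change.
        WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN
            UPDATE SET tgt.Name    = src.Name,
                       tgt.City    = src.City,
                       tgt.RowHash = src.RowHash
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (CustomerKey, Name, City, RowHash)
            VALUES (src.CustomerKey, src.Name, src.City, src.RowHash)
        WHEN NOT MATCHED BY SOURCE THEN
            DELETE;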

    The hash method will of course create some extra activity on your CPUs - and you have to read from disk to compare - but this should not be any serious source of performance problems. If it is, then any read process will be slow, and the disk subsystem should be changed or upgraded.

    Even though I like the simplicity of your solution, it has, in my opinion, too many unknown or unestimated performance issues connected to it. So I would never use this process on a whole database (I do use a similar process on a few distinct tables, due to being "lazy" in the design process).

    Regards,

    Claus T. Madsen