A faster way to prepare dimensional databases


Thomas LeBlanc
SSCarpal Tunnel
Group: General Forum Members
Points: 4062 Visits: 905
Thanks for the details in the article. I am glad I read through the whole article and the comments; they helped frame the discussion and the different paths to loading a Data Mart/Warehouse.

I am hopeful that those with other options will write articles based on their experience.

Steve will publish them if they are as well written as this one.

Thomas LeBlanc
TheSMilingDBA

Thomas LeBlanc, MCITP DBA 2005, 2008 & MCDBA 2000
http://thesmilingdba.blogspot.com/
amenjonathan
SSC Veteran
Group: General Forum Members
Points: 254 Visits: 434
Pretty good article. A few things to remember about building DWs, one of which it seems you've already learned:

1. Avoid updates where possible. Inserts are many times faster (which leads to point 2).
2. Build dimensions first, then fact tables (again, this avoids updates).
3. Building SCDs relieves the need to update records, and it gives a better historical view. I try to make all my dimensions SCDs.
4. MERGE works well for dimensions for which history is neither required nor desired. You can get some performance gains if you spend time tweaking your MERGE (see the sketch after this list).
5. Avoid deleting data (or truncating). There are some cases where this is unavoidable, but try an SCD or MERGE instead where possible (unless not deleting would cause the dimension to grow out of control - like millions of records).
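To illustrate point 4, here is a rough sketch of a Type 1 (no history) dimension load with MERGE; the table and column names are made up for illustration:

-- Hypothetical Type 1 dimension: overwrite changed attributes, insert new rows, keep no history.
MERGE dbo.DimProduct AS tgt
USING staging.Product AS src
    ON tgt.ProductID = src.ProductID               -- match on the business key
WHEN MATCHED AND (tgt.ProductName <> src.ProductName
               OR tgt.Category    <> src.Category) THEN
    UPDATE SET tgt.ProductName = src.ProductName,
               tgt.Category    = src.Category
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ProductID, ProductName, Category)
    VALUES (src.ProductID, src.ProductName, src.Category);

Filtering the WHEN MATCHED clause so it only updates rows that actually changed is where most of the tuning gain comes from.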

Also, correct me if I'm wrong, but if you use SELECT INTO doesn't that make it hard for the optimizer to generate a good plan, because it doesn't know what the table will look like beforehand? That's what I was told a while back by a performance guru. If this is true, it's probably best to keep all your permanent and staging tables around between ETL sessions. *EDIT* I think I remember him saying it was best to avoid SELECT INTO because the sproc must be recompiled every time it's run.
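In other words, keep a permanent staging table and reload it each run, roughly like this (the names are purely illustrative):

-- Reload a permanent staging table instead of SELECT INTO a brand-new table each run.
TRUNCATE TABLE staging.Client;       -- the table definition, indexes and statistics objects survive

INSERT INTO staging.Client (ClientID, ClientName, Region)
SELECT ClientID, ClientName, Region
FROM   source.Client;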

-------------------------------------------------------------------------------------------------
My SQL Server Blog
clausm73
Valued Member
Group: General Forum Members
Points: 74 Visits: 159
A good article - it reminds me in several ways of a process at my company for building the cube, but NOT for populating our data mart ("normalized" and dimensional databases). The cube is totally rebuilt on a nightly basis, but more and more data in the underlying data mart makes this process take longer and longer, and we are in the process of redesigning it for a delta-load method (changes only).

In my opinion your process has a few drawbacks that can turn out to make this kind of ETL solution a no-go.

First of all - and this has been mentioned by others in this thread - there is the question of the optimizer's ability to create good execution plans.

If you always drop and recreate your tables, you stand to lose all the SQL Server statistics for those tables - and hence the data needed for good execution plans, correct indexing, parallelism and more.

Unless you know exactly how the users and/or applications consume the data, i.e. you know all the queries and their variations and have tested them, you may gain good performance in loading the data, but most reads afterwards will not perform well, due to the lack of statistics and good execution plans. Of course, if no users have direct access to the data mart (only to the cube), AND you have optimized all your processes using previous execution plans, AND the data has not changed structure or keys, then this is not a real problem.

In my company we run quite a few "ad hoc" queries on the data when tracking problems, analyzing, modelling and more - so having no statistics would hurt us BADLY! We have recognized that we do not know - and cannot anticipate - how all the users construct their queries and make "crazy" but repeatable selections. We need our statistics too much to do "DROP TABLE - SELECT INTO".

A much better process - and in theory an equally well-performing one - would be to put the database in the BULK_LOGGED recovery model, TRUNCATE TABLE, and use BULK INSERT (or another minimally logged load) to maintain the data, followed by an UPDATE STATISTICS after the load. Here you get the good performance in the loading process and don't lose your statistics, so you can build (or REBUILD) the correct indexes and have up-to-date execution plans, making overall access of the data faster.
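As a rough sketch of that process (the database, table and column names are only placeholders, and an INSERT ... SELECT WITH (TABLOCK) stands in for BULK INSERT where the source is already a staging table):

ALTER DATABASE DataMart SET RECOVERY BULK_LOGGED;      -- allow minimally logged bulk operations

TRUNCATE TABLE dbo.DimClient;                          -- keeps the table, its indexes and statistics objects

INSERT INTO dbo.DimClient WITH (TABLOCK)               -- TABLOCK helps the insert qualify for minimal logging
       (ClientID, ClientName, Region)
SELECT ClientID, ClientName, Region
FROM   staging.Client;

UPDATE STATISTICS dbo.DimClient WITH FULLSCAN;         -- refresh statistics after the reload

ALTER DATABASE DataMart SET RECOVERY FULL;             -- back to the normal recovery model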

Another drawback of SELECT INTO is that it locks your system tables for the duration of the insert, so it can escalate into serious locking issues in your loading process. Parallel SELECT INTO processes can result in deadlocks or timeouts - you can only use a sequential load of the tables to be "safe".

Also - if you have a non-optimal I/O subsystem, then having to write all that data over and over each night will surely become a performance bottleneck at some point. Only delta loads (not full loads) of the data can help you here - using hash values (MD5 or the like) for each row to detect changes on a row basis, together with MERGE processes to do the INSERT, UPDATE and DELETE and to build slowly changing dimensions, will work.

The hash method will of course create some more activity on your CPUs - and you have to read from disk to compare - but this should not be any serious source of performance problems. If it is, then any read process will be slow and the disk subsystem should be changed or upgraded.
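A rough sketch of the hash approach I have in mind (the dimension, its columns and the RowHash column are all assumptions made for the example):

-- Compare a per-row hash of the attributes to find changed rows, then let MERGE do the rest.
MERGE dbo.DimClient AS tgt
USING (SELECT ClientID, ClientName, Region,
              HASHBYTES('MD5', ISNULL(ClientName, N'') + N'|' + ISNULL(Region, N'')) AS RowHash
       FROM   staging.Client) AS src
    ON tgt.ClientID = src.ClientID
WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN       -- only rows whose hash changed get updated
    UPDATE SET tgt.ClientName = src.ClientName,
               tgt.Region     = src.Region,
               tgt.RowHash    = src.RowHash
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ClientID, ClientName, Region, RowHash)
    VALUES (src.ClientID, src.ClientName, src.Region, src.RowHash)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;                                            -- or expire the row instead if you need history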

Even though I like the simplicity of your solution, it has, in my opinion, too many unknown or unquantified performance issues connected to it. I would never use this process on a whole database (I do use a similar process on a few individual tables, due to being "lazy" in the design process).

Regards,
Claus T. Madsen
UMG Developer
SSCrazy
Group: General Forum Members
Points: 2706 Visits: 2204
Nice article, thanks!

We have found the same thing in our process: it is faster to completely rebuild with an INSERT INTO a blank table than it is to try to do incremental updates. (UPDATE just takes too long with the number of changes we have, even if you batch it.)

The one thing that worries me in your example is that you assign the surrogate key based on the description, i.e. ClientName; what happens if you have two clients with the same name? Why don't you build the surrogate keys based on what I assume is the business key, i.e. ClientID? (It is less likely to be an issue with the Geography or Products dimensions, but removing the multi-column joins would probably speed things up as well.)
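What I mean is a fact load that looks up the surrogate key on the business key, roughly like this (the table and column names are only my guesses at the article's schema):

INSERT INTO dbo.FactSales (ClientKey, SaleDate, Amount)
SELECT d.ClientKey,                    -- surrogate key carried into the fact table
       s.SaleDate,
       s.Amount
FROM   staging.Sales  AS s
JOIN   dbo.DimClient  AS d
       ON d.ClientID = s.ClientID;     -- join on the business key, so duplicate names don't matter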
Revenant
SSCertifiable
Group: General Forum Members
Points: 7377 Visits: 4864
Interesting article, although I read it only after it was reposted.

For these 'upserts' we usually rely on the MERGE statement. As it is written in hand-tuned assembler, its speed is IMO far superior to any custom-written T-SQL code.
itdevnews
SSC Rookie
Group: General Forum Members
Points: 33 Visits: 59
Thank you for this great article.

I am migrating my skills from relational database development to dimensional modeling. The transition is not entirely straightforward, since I have been working with OLTP databases for years now. One book I have found really helpful and would like to share is "Pro SQL Server 2008 Analysis Services".

Does anyone have any recommendations as to:

1) Any IT certification exams worth sitting for. Is 70-448 worth the time and effort?
2) Any English-language/distance-learning taught MSc degrees, SQL Server based rather than Oracle.
3) Helpful resources for beginners/intermediate level.
dave-dj
Ten Centuries
Group: General Forum Members
Points: 1004 Visits: 1149
I did the MCITP, which was quite good, but I did find at the time that there was a bit of a gap between what the book/course teaches you and what was in the exam.

It's worth doing, but you'll need some practical experience as well.

_____________________________________________________________________________
MCITP: Business Intelligence Developer (2005)
JasonRowland
SSC-Enthusiastic
Group: General Forum Members
Points: 132 Visits: 1149
Nice article.
Thanks for taking the time to put it together and share it with us.
itdevnews
SSC Rookie
Group: General Forum Members
Points: 33 Visits: 59
Thanks Dave for your suggestions.