A faster way to prepare dimensional databases

  • Thanks for the details in the article. I am glad I read through the whole article and the comments. The comments added to the discussion topic and to the different paths for loading a Data Mart/Warehouse.

    I am hopeful that those with other options will write articles based on their experience.

    Steve will publish them if they are as well written as this one.

    Thomas LeBlanc

    TheSMilingDBA

    Thomas LeBlanc, MVP Data Platform Consultant

  • Pretty good article. A few things to remember about building DWs, one of which it seems you've already learned:

    1. Avoid updates where possible. Inserts are many times faster (which leads to point 2).

    2. Build dimensions first, then fact tables (this avoids updates).

    3. Building SCDs removes the need to update records, and it gives a better historical view. I try to make all my dimensions SCDs.

    4. MERGE works well for dimensions for which history is neither required nor desired. You can get some performance gains if you spend time tweaking your MERGE (there is a sketch just after this list).

    5. Avoid deleting info (or truncating). There are some cases where this is unavoidable, but try SCD or MERGE instead where possible (unless this is causing the dimension to grow out of control - like millions of records).
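
    For point 4, a minimal sketch of a MERGE into a history-free (Type 1) dimension - the DimCustomer/StagingCustomer names and columns below are placeholders, not the article's actual schema:

      -- Type 1 dimension load: overwrite changed attributes in place, no history kept.
      -- DimCustomer and StagingCustomer are hypothetical tables used for illustration.
      MERGE dbo.DimCustomer AS tgt
      USING dbo.StagingCustomer AS src
          ON tgt.CustomerID = src.CustomerID               -- match on the business key
      WHEN MATCHED AND (tgt.CustomerName <> src.CustomerName
                     OR tgt.Region       <> src.Region) THEN
          UPDATE SET tgt.CustomerName = src.CustomerName,  -- overwrite in place
                     tgt.Region       = src.Region
      WHEN NOT MATCHED BY TARGET THEN
          INSERT (CustomerID, CustomerName, Region)
          VALUES (src.CustomerID, src.CustomerName, src.Region);

    Restricting the WHEN MATCHED branch to rows that actually changed is one of the cheap "tweaks" that cuts down unnecessary writes.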

    Also, correct me if I'm wrong, but doesn't using SELECT INTO make it hard for the optimizer to generate a good plan, because it doesn't know what the table will look like beforehand? That's what I was told a while back by a performance guru. If this is true, it is probably best to keep all your permanent and staging tables around between ETL sessions. *EDIT* I think I remember him saying it is best to avoid SELECT INTO because the sproc must be recompiled every time it's run.

    -------------------------------------------------------------------------------------------------
    My SQL Server Blog

  • A good article - it reminds me in several ways of a process at my company for building the cube - but NOT for populating our datamart ("normalized" and dimensional databases). The cube is totally rebuilt on a nightly basis, but more and more data in the underlying datamart makes this process longer and longer, and we are in the process of re-designing it for a delta-load method (changes only).

    In my opinion your process has a few drawbacks that can turn out to make this kind of ETL solution a no-go!

    First of all - as mentioned by others in this thread - there is the impact on your ability to create good execution plans.

    If you always drop and recreate your tables you stand to lose all your SQL Server statistics for those tables - and hence the data needed for creating good execution plans, correct indexing, parallelism and more.

    Unless you know exactly how the users and/or applications consume the data, i.e. you know all queries and their variations and have tested them, you will maybe gain good performance in loading the data, but most reads on the data afterwards will - due to the lack of statistics and good execution plans - not perform well. Of course, if no users have direct access to the datamart (only to the cube), AND you have optimized all your processes using previous execution plans - and the data has not changed structure or keys - then this is not a real problem.

    In my company we run quite a few "ad hoc" queries on the data when tracking problems, analyzing, modelling and more - so having no statistics would hurt us badly! We have recognized that we do not - and cannot - anticipate how all the users construct their queries and make "crazy" but repeatable selections. So we need our statistics too much to do "DROP TABLE - SELECT INTO".

    A much better - but in theory equally well performing - process would be to use the BULK_LOGGED recovery model on the database as the logging mechanism, TRUNCATE TABLE, and bulk inserts as the data-maintenance mechanism (and you should actually run UPDATE STATISTICS after the load). Here you get the good performance in the loading process and you don't lose your statistics, so you can build (or REBUILD) the correct indexes and have up-to-date execution plans, making overall access to the data perform faster.
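
    A rough sketch of that pattern (the database, table and column names are placeholders only):

      -- Keep the permanent table; only the data is replaced, so the table definition,
      -- indexes and statistics objects survive the load and just need refreshing.
      ALTER DATABASE MyDataMart SET RECOVERY BULK_LOGGED;   -- minimal logging for the load window

      TRUNCATE TABLE dbo.DimGeography;

      INSERT INTO dbo.DimGeography WITH (TABLOCK)           -- TABLOCK lets the insert be minimally logged
             (GeographyID, City, Country)
      SELECT GeographyID, City, Country
      FROM   dbo.StagingGeography;

      UPDATE STATISTICS dbo.DimGeography;                   -- refresh statistics after the reload

      ALTER DATABASE MyDataMart SET RECOVERY FULL;          -- switch back once the load is done

    One caveat: while minimally logged operations sit in a log backup, point-in-time restore within that backup is not possible, so it is worth taking a log backup just before and just after the load window.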

    Another drawback of SELECT INTO is that it locks system tables for the duration of the insert, which can escalate into serious locking issues in your loading process. Parallel SELECT INTO processes can result in deadlocks or timeouts, so to be "safe" you can only load the tables sequentially.

    Also - if you have a non-optimal I/O subsystem, then having to write all that data over and over each night will surely at some point give you a performance bottleneck. Only delta-loads (not full loads) of data can help you here - using hash values (MD5 or the like) for each row to check whether data has changed on a row basis, together with MERGE processes to do the INSERT, UPDATE and DELETE and to build slowly changing dimensions, will work.
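
    A minimal sketch of that hash comparison, using HASHBYTES (all table and column names are placeholders, and DimProduct is assumed to carry a varbinary(16) RowHash column):

      -- Compute a per-row hash over the attribute columns and only touch rows
      -- whose hash differs; MERGE handles the INSERT/UPDATE/DELETE in one pass.
      WITH src AS
      (
          SELECT ProductID, ProductName, Category,
                 HASHBYTES('MD5',
                     ISNULL(ProductName, N'') + N'|' + ISNULL(Category, N'')) AS RowHash
          FROM   dbo.StagingProduct
      )
      MERGE dbo.DimProduct AS tgt
      USING src
          ON tgt.ProductID = src.ProductID
      WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN       -- only changed rows are updated
          UPDATE SET tgt.ProductName = src.ProductName,
                     tgt.Category    = src.Category,
                     tgt.RowHash     = src.RowHash
      WHEN NOT MATCHED BY TARGET THEN
          INSERT (ProductID, ProductName, Category, RowHash)
          VALUES (src.ProductID, src.ProductName, src.Category, src.RowHash)
      WHEN NOT MATCHED BY SOURCE THEN
          DELETE;                                            -- or flag inactive instead, for an SCD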

    The hash method will of course create some more activity on your CPUs - and you have to read from disk to compare - but this should not be any serious source of performance problems. If it is, then any read process will be slow and the disk subsystem should be changed or upgraded.

    Even though I like the simplicity of your solution, it has - in my opinion - too many unknown or un-estimated performance issues connected to it. So I would never use this process on a whole database (I do use a similar process on a few distinct tables, due to being "lazy" in the design process).

    Regards,

    Claus T. Madsen

  • Nice article, thanks!

    We have found the same thing in our process: it is faster to completely rebuild with an INSERT INTO a blank table than it is to try to do incremental updates. (UPDATE just takes too long with the number of changes we have, even if you batch it.)

    The one thing that worries me in your example is that you assign the surrogate key based on the description, i.e. ClientName; what happens if you have two clients with the same name? Why don't you build the surrogate keys based on what I assume is the business key, i.e. ClientID? (It is less likely to be an issue with the Geography or Products dimensions, but removing the multi-column joins would probably speed things up as well.)
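
    For illustration, a dimension keyed on the business key might look something like this (all names here are placeholders, not the article's actual schema):

      -- The surrogate key is an IDENTITY; the business key (ClientID) is enforced
      -- as unique, so two clients sharing a name can never collapse into one row.
      CREATE TABLE dbo.DimClient
      (
          ClientKey  int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
          ClientID   int NOT NULL UNIQUE,                     -- business key from the source
          ClientName nvarchar(100) NOT NULL                   -- descriptive attribute only
      );

      -- The fact load then resolves the surrogate key via the business key,
      -- not via the name, and the multi-column joins disappear.
      SELECT d.ClientKey, s.SaleAmount
      FROM   dbo.StagingSales AS s
      JOIN   dbo.DimClient    AS d ON d.ClientID = s.ClientID;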

  • Interesting article, although I read it only after it was reposted.

    For these 'upserts' we usually rely on the MERGE statement. As it is implemented in hand-tuned native code inside the engine, its speed is IMO far superior to any custom-written T-SQL code.

  • Thank you for this great article.

    I am migrating my skills from relational database development to dimensional modeling. The transition is not entirely straightforward, since I have been working with OLTP databases for years now. One book I have found really helpful and would like to share is "Pro SQL Server 2008 Analysis Services".

    Does anyone have any recommendations as to:

    1) Any IT certification exams worth sitting for? Is 70-448 worth the time and effort?

    2) Any English-language/distance-learning taught MSc degrees - SQL Server based rather than Oracle?

    3) Any helpful resources for the beginner/intermediate level?

  • I did the MCITP, which was quite good, but I did find at the time that there was a bit of a gap between what the book/course teaches you and what was in the exam.

    It's worth doing, but you'll need some practical experience as well.

    _____________________________________________________________________________
    MCITP: Business Intelligence Developer (2005)

  • Nice article.

    Thanks for taking the time to put it together and share with us.

  • Thanks Dave for your suggestions.
