A faster way to prepare dimensional databases

Question

A faster way to prepare dimensional databases

Adam Aspin

SSCarpal Tunnel

Points: 4890
More actions
February 23, 2010 at 10:05 pm

#220881

Comments posted to this topic are about the item A faster way to prepare dimensional databases

Viewing 15 posts - 1 through 15 (of 38 total)

You must be logged in to reply to this topic. Login to reply

Magnus Hagdahl SSC Enthusiast Points: 184 More actions · Answer 1

Although I'm glad to learn new ways of loading an multidimensional database structure, I do feel the solution to be a bit cumbersome.

This could of course be due to that you dont have control over the complete ETL-chain.

Our dw also has normalized database and multidimensional database. The norm is loaded daily from 20 different sources and we perform incremental updates of the two main fact tables, Sales and GL.

From textfile to updated transaction table, given about 3 million updated rows in each table every night, we have a processing time of about 30 mins.

The logic has the following bases

1. Norm dim tables create and hold surrogate keys for items, accounts etc. This means surrogate keys are fixed and not recreated every night

2. Norm fact tables have surrogate keys for; a: every dim member on the row, b: for the fact row itself

3. All norm tables have an "LatestUpdated"-flag which is set when a nightly load updates the row

4. Each load step is logged, saving the "latest succesful run data"-flag

The dw update then do this

1. Truncate and reload of all dimension tables. This works since norm keep all surrogate keys

2. Incremental update of the Fact table using a comparison of "LatestUpdated" and "LastSuccessfulRun". Minimum rows are touched and surrogate keys are already in place

3. Reports tables/cube loads follow...

I think this approach seem simpler and less error prone.

This of course require that you have control of how the norm database is loaded.

chenthilar SSC Enthusiast Points: 119 More actions · Answer 2

actually i was waiting for a loooooooong time to read this type of article to get basic ideas, really very good article, thanks a lot

Martin Vrieze SSCrazy Points: 2760 More actions · Answer 3

This is one of the best nuts and bolts articles I've come across in SSAS.

Question for the Group:

Is there a "test" dataset available for creating an SSAS Dimensional Database? All of the books, articles and commentary tell you to load the Adventure Works Cube and move on...this is a bit like putting the cart before the horse...it is already a completed solution.

The dataset ideally would have semi-dirty data that requires some cleanup before it can be loaded into a dimensional DB...something that is commonplace everywhere except in AdventureWorks.

Please correct me if I am off the mark on AdventureWorks DW.

Daniel Bowlin SSC-Dedicated Points: 34566 More actions · Answer 4

Very good article. I am working in a very similar situation. Eventually I will gain more control over the ETL process. But I took a couple of good ideas from this article that will definitely help me in the short term. Thanks.

casinc815 Old Hand Points: 303 More actions · Answer 5

Adam,

Great article! Very succinct and gets to the point about how important a DW and DM coexist together.

My question is about growth. How is the process working after say 30 days and 180 days? What has the increased volume done to performance?

Once again, great Job.

Jim

Joel Srebnick SSC Enthusiast Points: 116 More actions · Answer 6

"Insert Into" with a large amount of records will run a long time because SQL Server does the whole insert in one transaction. Use a data pump (Data Flow Task) and set the properties to write the transaction every 5000 records (set both "Rows per Batch" and "Maximum insert commit size" to 5000). You will save hours on your loads. You don't really care about backing out the whole transaction because if it fails, you will most likely go back to the truncate and start over again.

magarity kerns SSCrazy Points: 2763 More actions · Answer 7

The final refinement of the SK lookup table method comes when you add an MD5 checksum column and leave them permanently populated. Use these tables to quickly determine update/insert for only changed or new records. You can even make one of these for the fact table itself. Then fast incremental loads for the entire process become possible and your times will be cut dramatically. I've dealt with source systems that put out flat file dumps of 500+ million rows daily and used this method to process the deltas in barely an hour.

SQLBOT SSCrazy Eights Points: 8014 More actions · Answer 8

Very impressive article!

I'm just now being asked to transition to some DW work after eons of doing primarily OLTP support.

I'm bookmarking this article!

Thanks so much for your efforts.

Craig Outcalt

Tips for new DBAs: http://www.sqlservercentral.com/articles/Career/64632
My other articles: http://www.sqlservercentral.com/Authors/Articles/Craig_Outcalt/560258

centexbi SSC Enthusiast Points: 125 More actions · Answer 9

One technique that can be used is to keep all the surrogate keys in a separate fact extension table with the fact id. Hence you are either updating a very narrow fact table or inserting to it.

We always use views for dimensions and fact tables as input to an SSAS cube. That way, you can join the the two fact tables in a view to get the surrogate keys deployed as well.

Daniel Bowlin SSC-Dedicated Points: 34566 More actions · Answer 10

Daniel Bowlin

SSC-Dedicated

Points: 34566

February 24, 2010 at 4:29 pm

#1124670

Where can I find out more about the MD5 checksum?

Adam Aspin SSCarpal Tunnel Points: 4890 More actions · Answer 11

Hi Joel,

Thanks for this idea - but can you confirm that processing large data sets in batches like this is faster overall for the entire dataset? My (admittedly hazy) memory of this approach is that the overall time taken is around the same?

THanks for the input!

Adam

Adam Aspin SSCarpal Tunnel Points: 4890 More actions · Answer 12

Hi Magnus,

Of course, you are absolutely right that your approach is faster, but, as you point out, it reauires complete control of the entire process. Also (which I did not make clear enough in the article, so please excuse me), the appraoch that I am suggesting is very much a development technique. Our data warehouse is evolving constantly, and dimensions and attributes are being altered all the time (this is one of those projects where "Agile" means less design and planning than is healthy...) so the design is not stable enough to persist dimesnion attributes - yet.

Once we reach a more stable setup, then we will certainly adapt the process to persist unchanging dimension and fact data, and adopt the most appropriate techniques which suit the project.

Thanks for the very succinct and clear description of how to go about doing this!

Adam

Adam Aspin SSCarpal Tunnel Points: 4890 More actions · Answer 13

Hi There Casinc835,

Thanks for the input!

After 60 -odd days and a 10% increase in volumes (and admittedly a lot of design changes) we are not seeing any notable increases in processing times, so it seems pretty linear.

Regards,

Adam

Adam Aspin SSCarpal Tunnel Points: 4890 More actions · Answer 14

Hi Centexbi,

This sounds really interesting - I will give it a try as soon as I can.

Thanks for the idea!

Adam