The middle is a mystery

  • Hello-

    I've been reading up on MS Analysis Services and running some basic tests but I think I'm missing a big concept. I'm a competent Relational DB guy and I also see the power and potential of Cubes.

    However, the "Middle" is still a mystery to me. The part in between the Operational database and the final Analysis Cube(s).

    Do cubes replace a data warehouse?

    Do I need a data warehouse AND cubes?

    Do I load Operational data into the warehouse and then pull from the warehouse into the cube?

    There are so many possible scenarios but I've yet to see a simple concrete example of the "Middle".

    Assume I have a SMALL business selling books. I have a SQL Server Operational database to handle day-to-day orders and customers, and I also wish to have cube(s) available to analyze/aggregate historical data. What is the simplest "Middle" design to use?

    I have books, customers, and orders - nothing more - nothing less.

    Can someone provide a short and sweet "Middle" design and the events leading to the population of the cube(s)?

    ANY help or insight would be greatly appreciated.

    - B

  • A cube has to pull its data from some source. That source can be a relational DB, a data mart, or a data warehouse. In your example of books, customers, and orders, you could actually build a cube with the orders table as the 'fact' table and books and customers as dimensions. For a small data set this would work fine; for a large data set, however, it would lead to impractically long cube processing times, because some of the dimensions, say your time/date dimension, would be pulled from the orders table, forcing the Analysis Services engine to run distinct-count queries against an ever-growing data set every time new data was added and the cube was processed. That is one reason why a fact table and 'relatively' stable dimension tables are used.

    This is obviously an oversimplification of a very complex issue, but I hope it helps shed some light on your questions. I am sure others on the list will be able to provide more examples and perhaps better explanations. For a good start on Analysis Services, look into the book SQL Server Analysis Services by OLAP Train (http://www.amazon.com/exec/obidos/tg/detail/-/0735609047/qid=1049213931/sr=8-1/ref=sr_8_1/102-9868999-1243300?v=glance&s=books&n=507846 - watch for the wrap on that URL).
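    To make the fact/dimension split concrete, here is a minimal star schema sketch for your books/customers/orders example. All table and column names are hypothetical, just to illustrate the shape:

        -- Hypothetical star schema for the bookstore example (T-SQL)
        CREATE TABLE DimDate (
            DateKey     int          NOT NULL PRIMARY KEY,  -- e.g. 20030401
            FullDate    datetime     NOT NULL,
            [Year]      smallint     NOT NULL,
            [Month]     tinyint      NOT NULL
        )

        CREATE TABLE DimBook (
            BookKey     int          IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
            ISBN        varchar(20)  NOT NULL,                   -- business key
            Title       varchar(200) NOT NULL,
            Publisher   varchar(100) NULL
        )

        CREATE TABLE DimCustomer (
            CustomerKey  int          IDENTITY(1,1) PRIMARY KEY,
            CustomerID   int          NOT NULL,                  -- operational key
            CustomerName varchar(100) NOT NULL
        )

        CREATE TABLE FactOrders (
            DateKey     int    NOT NULL REFERENCES DimDate(DateKey),
            BookKey     int    NOT NULL REFERENCES DimBook(BookKey),
            CustomerKey int    NOT NULL REFERENCES DimCustomer(CustomerKey),
            Quantity    int    NOT NULL,
            SalesAmount money  NOT NULL
        )

    The fact table holds one row per measurable event (a sale), while the dimensions stay small and relatively stable - which is exactly what keeps cube processing fast.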

    hth,

    Michael

    Michael Weiss

  • Thanks -

    So are you saying I do not necessarily need a data warehouse, data marts, or some kind of temporary scrubbing tables? I can "Load My Cube" every day directly from the Operational database?

    Other than simply having "lots of data" in my Operational database, why do I need some other kind of "middle tables"? For scrubbing?

    I guess, in a nutshell: should one create one's own "Star Schema" tables, populate them, and then base the cube on these newly created star schema tables? And would one call the new "Star Schema" tables their "Data Warehouse" or "Data Mart"?

    I have several books already - will investigate your book suggestion as well.

    My books focus more on details and jump right into "creating a cube," etc., but they do not go from the Operational database through marts or star schemas into cubes - it's more the design that I need.

    Regardless - thanks... B

  • Yes, you could create your own star schema, and it would in fact be a data mart or data warehouse. As for scrubbing tables: if you are moving data from a relational DB to a data mart, you would most likely need a set of scrubbing or 'staging' tables in order to look up and populate the foreign key values in your fact table. Also, at that stage you could handle any new dimension members that might come through in the data. However, for quick cubes, AS can also connect directly to a relational DB source, an Excel spreadsheet, or several other sources.
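    As a rough illustration of that staging step (same hypothetical tables as the earlier sketch, with a StageOrders staging table assumed), the fact load looks up surrogate keys from the dimensions:

        -- Hypothetical staging-to-fact load: resolve surrogate keys via joins.
        -- Rows that fail a join are new dimension members; add those to the
        -- dimension tables first, then re-run this load for the stragglers.
        INSERT INTO FactOrders (DateKey, BookKey, CustomerKey, Quantity, SalesAmount)
        SELECT  CAST(CONVERT(char(8), s.OrderDate, 112) AS int),  -- yyyymmdd key
                b.BookKey,
                c.CustomerKey,
                s.Quantity,
                s.SalesAmount
        FROM    StageOrders s
        JOIN    DimBook     b ON b.ISBN       = s.ISBN
        JOIN    DimCustomer c ON c.CustomerID = s.CustomerID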

    hth,

    Michael

    Michael Weiss

  • In most cases, people want to have a data mart (or warehouse) because of the flexibility it offers them when looking to report on and transform their data. If I were to walk into your book store after you asked for some assistance in building analysis/reporting cubes, I would suggest putting the data into a mart. I would do this because, although they seem reasonably static, your products have changing attributes over time: ISBN probably stays the same, though maybe not for reprints and revisions; UPC [barcode] can and does change frequently for products, especially when manufactured offshore in many locations; publishers may change names [be bought out, etc.]; and the differing editions of a book may be important to track.

    If your stock/inventory system is like most I've seen, it will handle some of these changes, but most likely not overly well. Having a data mart would allow you to have a product dimension (i.e. table) that could accommodate these changes over time, giving you full reporting flexibility (e.g. design your product dim so that it has effective dates for products and allows for a parent product [say, 1st edition] and child products [later editions]). With this style of dimension you could have a cube that reports only on current books in print (and in store), and maybe another cube that lets you look at the sales of 'One Flew Over the Cuckoo's Nest' in all of its editions (assuming it has been re-published over time).
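    To sketch what such a product dimension might look like (hypothetical names; effective dates for versioning, and a self-reference linking child editions to a parent product):

        -- Hypothetical product dimension with effective dating and
        -- a parent/child link between editions of the same book.
        CREATE TABLE DimProduct (
            ProductKey       int          IDENTITY(1,1) PRIMARY KEY,  -- surrogate
            ISBN             varchar(20)  NOT NULL,                   -- business key
            Title            varchar(200) NOT NULL,
            Edition          smallint     NOT NULL,
            Publisher        varchar(100) NULL,
            ParentProductKey int          NULL REFERENCES DimProduct(ProductKey),
            EffectiveFrom    datetime     NOT NULL,
            EffectiveTo      datetime     NULL,    -- NULL = still current
            IsCurrent        bit          NOT NULL DEFAULT 1
        )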

    Scrubbing is also important. If there is *any* front-end, user-entered data, you'll probably want to scrub it.

    Depending on back-end systems, you may have transactions that carry a product ID that no longer exists in the transactional (inventory) system (i.e. the system allows deletes regardless of transactions having been made against the item). Having a data mart will cover this off.

    Lastly, a lot of transactional systems allow for archiving of data (with the view of never retrieving it), and therefore building a cube off your transactional system will not necessarily allow full analysis of data over time (i.e. if the system purges every 6-12 months, how can you do seasonality trending over years?). Having the data in a specific repository (i.e. the mart or WH) will cover this off.

    As far as design goes, they're nearly always all different, but for an easy intro into going from relational to star schema, even a book like the Microsoft Press 'Data Warehousing with SQL 7...' will give you a good start. Otherwise (depending on your camp) go for The Data Warehouse Toolkit (Kimball), The Data Warehouse Lifecycle Toolkit (Kimball), or something from Bill Inmon. These two guys basically provide a good basis for the two schools of thought regarding dimensional design.

    I'm happy to offer design suggestions if you shoot through some example data structures.

    HTH,

    Steve

    Absolutely lastly (I re-read your initial post): I would go for a simple star schema that updates with sales (and dimensional) data on a nightly or weekly basis (depending on business requirements). Use DTS for the upkeep (watch out for slowly changing dimensions with DTS, though; a bit of manual labour is required).
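    For that "manual labour" part, the nightly DTS run typically ends up executing something like the following Type 2 handling (hypothetical names again, reusing the DimProduct sketch above plus an assumed StageProducts table): expire the current dimension row when a tracked attribute changes, then insert the new version.

        -- Expire current rows whose tracked attributes changed in staging
        UPDATE  d
        SET     EffectiveTo = GETDATE(), IsCurrent = 0
        FROM    DimProduct d
        JOIN    StageProducts s ON s.ISBN = d.ISBN
        WHERE   d.IsCurrent = 1
          AND  (d.Publisher <> s.Publisher OR d.Title <> s.Title)

        -- Insert new versions (and brand-new products)
        INSERT INTO DimProduct (ISBN, Title, Edition, Publisher, EffectiveFrom, IsCurrent)
        SELECT  s.ISBN, s.Title, s.Edition, s.Publisher, GETDATE(), 1
        FROM    StageProducts s
        WHERE   NOT EXISTS (SELECT 1 FROM DimProduct d
                            WHERE d.ISBN = s.ISBN AND d.IsCurrent = 1)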

    And lastly, read the cube design hints (in the submitted docs section of this website), and don't try to do operational reporting out of an analytical tool.

    Steve.

  • Thanks Steve -

    I have both "ToolKit" books - have read the older one and am halfway through the newer.

    Unfortunately, I can be a conceptual blockhead at times and am still missing something.

    Is the set of "Star Schema" tables essentially the "Data Mart"?

    If so, I guess I would populate the Star Schema tables nightly from the Operational DB. Then, in the next step, I would populate the cube with only the new data in the Star Schema?

    Are the Star Schema tables only a temporary holding tank, or do I keep appending data to them day after day, or is the cube the only place that holds ALL historical data?

    These are really high-level, major-concept questions that I need help on. Before I get on this 50-lane highway, do I need to go East or West?

    I'll tinker more with cube population - especially the concept of "Incremental" updates.

    Regardless - thanks for the time and input. I feel that the cubes are powerful tools and should be learned and utilized.

    -B

  • Hey B,

    In order of your questions/statements...

    Yes, the set of tables (that can be arranged in a star schema) is your data mart (a philosophical debate of DM vs. DW isn't warranted here).

    The data mart would be updated daily/nightly or at whatever interval is acceptable to business analysis requirements.

    Personally, I don't use incremental updates (a hangover from my Cognos Transformer days, when this was a risky business; also, incremental builds raise issues regarding dimension changes that can be avoided by full processing of dimensions and cubes - just my opinion).

    The tables are indeed the place that will continue to hold new (and historical) data. If you *were* to do incremental builds, you *could* use the cube as your DW/DM, but personally this is too risky for me: what you are counting on is a cube that never corrupts or has issues, and to which you add data daily. Personally, I would use the DM tables to store all data long term, and build cube(s) from this data as required. Keep in mind that you can implement partitioning of cube data (i.e. put different years in different places, etc.), and also that if you are concerned that your daily reporting/analysis requirements would be hindered by too much historical data, you could always create virtual or physical cube(s) that hold only current information.

    So, my bottom-line take-away is that if I store all of my historical data in a DM and have done my design well, I can let my operational system purge to its heart's content, but I will still have my data. I can also query this DM using standard SQL (no need for cubing) with much better performance than from my operational system. Lastly, I can create any number of cubes based on this data and use these for analysis.
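    For example, a plain SQL question like "monthly sales by title" runs straight against the mart (using the hypothetical schema sketched earlier in the thread), no cube needed:

        -- Monthly sales by title, straight off the star schema
        SELECT   d.[Year], d.[Month], b.Title,
                 SUM(f.Quantity)    AS UnitsSold,
                 SUM(f.SalesAmount) AS Revenue
        FROM     FactOrders f
        JOIN     DimDate d ON d.DateKey = f.DateKey
        JOIN     DimBook b ON b.BookKey = f.BookKey
        GROUP BY d.[Year], d.[Month], b.Title
        ORDER BY d.[Year], d.[Month], Revenue DESC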

    The usual (for us) system architecture is below; a rough sketch of steps 2-5 follows the list:

    1. One or more operational systems containing data we want.

    2. Extract from these nightly into a staging area

    3. Clean, scrub, validate, look up, etc., the data in staging, either in place or on the way through to the DM/DW

    4. Load DM/DW dimensions

    5. Load DM/DW fact(s)

    6. Rebuild any AS shared dims, then rebuild cubes
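    As promised, here is a minimal sketch of steps 2 through 5 as a nightly T-SQL batch. All names (LoadControl, StageOrders, the Operational tables) are hypothetical; step 6 happens in Analysis Services itself:

        DECLARE @LastLoad datetime
        SELECT  @LastLoad = LastLoadDate FROM LoadControl  -- hypothetical control table

        -- 2. Extract from the operational system into staging
        TRUNCATE TABLE StageOrders
        INSERT INTO StageOrders (OrderDate, ISBN, Title, CustomerID, Quantity, SalesAmount)
        SELECT  o.OrderDate, o.ISBN, b.Title, o.CustomerID, o.Quantity, o.SalesAmount
        FROM    Operational.dbo.Orders o
        JOIN    Operational.dbo.Books  b ON b.ISBN = o.ISBN
        WHERE   o.OrderDate >= @LastLoad

        -- 3. Clean/scrub/validate in place
        UPDATE StageOrders SET ISBN = LTRIM(RTRIM(ISBN))
        DELETE FROM StageOrders WHERE Quantity <= 0

        -- 4. Load any new dimension members (books shown; customers analogous)
        INSERT INTO DimBook (ISBN, Title)
        SELECT DISTINCT s.ISBN, s.Title
        FROM   StageOrders s
        WHERE  NOT EXISTS (SELECT 1 FROM DimBook b WHERE b.ISBN = s.ISBN)

        -- 5. Load the fact table via the surrogate-key lookups shown earlier,
        --    then record the load so the next run extracts only new rows
        UPDATE LoadControl SET LastLoadDate = GETDATE()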

    Have fun and keep us up-to-date with your progress.

    Steve.
