Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/vRainardi/usingodsasamiddlelayer.asp
I'm not sure I entirely understand your argument for an ODS in this context. The only difference between that and the warehouse is that the ODS may not encompass the whole enterprise. Assuming that you would still want to have an enterprise-wide data warehouse, the ODS in this context seems to be redundant. I guess what I'm asking is how is it not a data warehouse
I'm also confused about what operational functions (updates etc) you would want to apply in the ODS. Surely any updates should be applied back at the source system.
Lastly if the data marts are being fed from the ODS (as illustrated in the diagram) then what is the data warehouse used for. Shouldn't the data warehouse be used as the one holy sacred supplier of all data for reporting?
Sorry I don't mean to be so negative, data warehouse and data mart design is always very open to interpretation. Thanks for the article it has given me some new angles to consider.
Hi Robert, thank you for your comments.
The ODS is not a data warehouse because ODS is in the third normal form, not in dimensional model. But yes you are right, ODS contains all the data in the enterprise, and data warehouse database also contains all the data in the enterprise. Because of the amount of data the ODS contains is the same as data warehouse, yes in this sense ODS is redundant. But the benefits I mentioned in the articles outweigh the cost of this redundancy.
About what operational functions (updates) that you would like to apply in the ODS, and why this updates are not applied back in the source system, please allow me to illustrate this case with an example. I hope it will clarify the situation. Referring to the diagram in the article, let's say for example you have 3 ERP systems: JDE, SAP and a propriatary system (custom developed in your company) called CGL. Say you have a business performance metric called strike rate, which is computed in the ODS, based on the information from all 3 ERP systems. This strike rate is by service centre, by service type and daily. So for example, for Birmingham, for Fiscal Year 2006 week 24, the strike rate for ServiceType1 are 93%, 92%, 96%, 95%, 91%, 94% (Monday to Saturday). And so on, we calculate the strike rate for every service centre, for every service type, for every day in the year. To calculate these percentages you need information not only from 1 ERP system, but from all 3.
Now, for this strike rate metric, your company also have strike rate target, which is by service centre, by service type. For example, for Birmingham, for ServiceType1, the strike rate is 95% whilst for Manchester, for ServiceType2, the strike rate is 92%. This information are not anywhere in the source system and we need to get it to the ODS. We could update the ODS directly, for example by building a web/windows .NET application. Or (this one is probably more preferable) by designing the new application to save the strike rate targets on a separate database, and we then ETL / import this information into staging, then into ODS.
About your last point, if the data marts are being fed from the ODS then what is the data warehouse used for. Yes we could feed the data marts from the data warehouse database (as the data warehouse database contains all the information on the ODS), but it is more efficient to feed them straight from ODS, especially if the data marts are fed daily.
Thank you for all your comments Roberts, they have been an eye opener to me too.
Thanks for you reply Vincent.
Hall of Fame
Really good article but I must disagree with your comment that the ODS is not a data warehouse because it is in 3rd normal form, not in dimensional model. Although storing data in a dimensional model is valid and may be the best model for supporting end-user reporting, it is not the end-all model for a data warehouse. The only reason your ODS is not the definative data warehouse is because you said it can be updated. This idea is not generally supported in the data warehouse... but as with anything there are exceptions to every rule i.e. Type-1 slowly changing dimensions.
I believe that different methods work for different environments and we must be flexible to real world environments yet frigid in staying with proven best practices... whether it's Kimball dimensional models, Inmons CIF and DW 2.0, or John Doe in Little Company USA's who made the data warehouse work within his real world limitations. Business practices change, methodologies change... and who knows, in a couple of years we may have a new method for building data warehouse system... the Rainardi Method
Anyway I digress, overall great article with a lot to think about and I'm looking forward to more.
Thanks David. Sometimes I wonder about that, why there are only 3 types of SCD, why don't we make a new type. Why only Kimball and Inmon method (and Barry Devlin), why don't we make a new method. Your question whether a data warehouse must be a dimensional has been in my mind for a long time now. I think dimensional is a better format to get the data out but normalised form is a better format for integration. I think real time ETL is really hot at the moment and it breaks the fundamental principles that a DW is a non-volatile store. Another trend is text analytics which developed very quickly in the last year or so which I think will be a hot potato in this 2007. What do you think?
With regards to the 3 types of SCD, I don't see another alternative to handling updates in a table. Either you 1) Replace the data, 2) Version the data in a new row, or 3) Version the columns that change in the same row. If there is another technique you were thinking of, I would love to hear it. I'm always open to hearing new ideas that may help out in the future.
For the warehouse here, we find that using the Kimball design for the warehouse as well as the 2 ODS's we have works just fine. Our reasoning for using an ODS vs putting the data in the warehouse was time based. We had limited time to get data into a storable format and add to that 90% of the data is junk in one particular column that is used for reference information. Using the ODS buys us time to analyze the junk so we can eventually create some rules in ETL to cleanse the data and land it in the DWH.
Thank you for sharing your experience about ODS with us Cliff.
With regards to the SCD beyond the 3 types Margy Ross and Ralph Kimball describe several alternatives and combination (hybrid) types in this article. The one I was thinking of last year is storing the historical dimensional data in another table(s), using the technique similar to the one used for 'auditing', i.e. the structure of the history table is the same as the dimension table, with some additional columns such as change timestamps, flag columns (such as change type and latest), version columns and user information columns. The alternative of this technique is storing only the necessary columns of the dimension table, rather than all columns. In the case of a 'snow-flaked' or 'outtrigger' dimension, we have the options of combining the historical dimensional data in 1 history table, or mimic-ing the structure of the 'snow-flaked' dimension tables.
I apologise I've just seen your post today.
Thanks for the enriching discussion about
whether To impart the ODS as middle layer.
I am pretty excited to know how we can
Implement an ODS in an environment where
both the source and target databases are
from Relational SQLSERVER 2005.
If you would provide me DO's and DONT'S
of implementation methodology with a
Suitable example especially dealing with the
"INCREMENTAL LOAD" will be greatly helpful to my task.
there has been a lot of things that I have learnt since I wrote that ODS article and last year I wrote a data warehousing book, http://www.amazon.co.uk/dp/1590599314/
My current view is that we don't have to have an ODS to build a data warehouse. So regarding the do and don'ts, the top of the list would probably be "only build the ODS if you need one". The dimensional data warehouse should contain the data at the lowest level/grain. But in practice this is not always the case. 95% of the case it is applicable, but there are certain cases where it's not practical and we had to source operational lower level reporting from ODS.
The second DO is probably the grain, i.e. must be the same as the source business system, i.e. should be the lowest possible level, don't aggregate.
3rd DO: should be in 3rd (or more) normal form. It shouldn't be in 2nd or 1st. And it definitely shouldn't be in dimensional model.
4th DO: have a internal ODS key/ID, rather than relying on the source system key.
5th DO: do have a DQ (Data Quality), i.e. create data firewall between the source system and ODS
DON'T: don't store history, only keep the current version. If we need to do snapshoting and SCD, it is better to do this on the dimensional Data Warehouse.
DON'T: don't copy the structure of the source system as is, but organize into proper 3rd NF.
For incremental load is we can use identity column or timestamp column. I wrote the incremental loading here: http://www.sqlservercentral.com/articles/Design/dataextractionmethodspart1/2356/
Please don't hesitate to come back with any question, I'll be glad to assist.
PS. You don't have to follow the above do and don't as it is, your situation could be different. So they are just for your consideration. Good luck with your project.
Thanks for the response,
Information you have provided would give
A better direction and soild focus in dealing with
Building of an ODS.
would n't you mind I require some more clarity
About those DO's and Dont's.
(1) 4 th DO:
Would you provide me information about how to define
ODS Key/ID with an "Example"?
(2) 5 th DO:
Would you provide me information about how to define
"Data Fire Wall" between Source system and ODS with
If would assist me with relavent information that
would helpful for my project.
Hi Vijay, the ODS key is a surrogate key, i.e. sequential integer beginning with 0. For example: 0, 1, 2, 3, ... etc.
Data firewall is a collection of data quality rules in the ETL system. This is industry specific. For example, in an insurance ODS "if the capital adequacy ratio is less than certain limit, raise a warning". Or in a CRM ODS (CDI, Customer Data Integration), "if the email address is invalid, then flag it so that no campaign is sent"
I read the discussion from the first posts, and i can tell you that your first article about ODS confused me a lot.
I read a lot of articles about ODS and Data warehouse. I also worked on several DW/BI projects and till now i never had to deal with ODS, we put every detail in the data warehouse, sometimes in 3 Normal form and others in Dimensional Form...
Concerning ODS, term which was coined by Inmon and Imhof, had to serve tactical decisions rather then strategic ones. That's good, but when i read kimball's article about ODS, i understood that it's only question of the periodicity of the refreshment of data... So since ETL technology are moving from batch mode to near realtime mode, i don't see any need to still have an ODS.
What do you think about this ?
By the way, i just had your book, i will read it... at first view it seems to be very interesting
You are right Abdel, with micro-batch or near real time ETL, data warehouses now have new functions. It’s now used not only for analysis, but also for operational reporting and supporting daily business operations, which reduces/eliminates the needs for an ODS.
And you are right, we should only create ODS when we need one. ODS has associated additional costs both in development effort and operational overhead: more ETL processes to execute, longer loading times. If until today you have not dealt with ODS, that’s good — chances are you don’t need it. Many companies don’t use ODS; they only have a dimensional data warehouse.
The reasons why some companies use ODS are: (these vary from company to company)
1) The dimensional data warehouse is not updated frequently enough, so they created ODS which is refreshed more frequently. For example, the dimensional DW is updated once a day whilst the ODS is updated 4 times during working hours.
2) The dimensional data warehouse does not store data at the lowest level grain. It stores historical data at aggregate level, for example daily snapshots of account balances for the last 5 years. Therefore they created an ODS which stores data at the lowest level, for example the current version of all transactions.
3) The dimensional data warehouse is in denormalized structure and they need to run some report from a normalized data structure. So they created ODS which is in normalized data structure to support these reports.
4) To support operational and regulatory reporting, they need to maintain group-wide data such as standing data and operational targets. These are not available in the source systems as the source systems are company-wide (the group consists of several companies with different business systems). The ODS is not only used on read-only basis for reporting but also for maintaining this data. An example of such data is reporting structure and data conversion elements.
5) Integration. The ODS tables are very similar to their business systems, but it contains data from multiple business systems, and it’s better designed (it’s newer, more structured, more complete, more normalized, more stringent naming conventions). Hence for some reports they prefer to report from ODS then from individual business system or from the dimensional data warehouse.
6) Performance. Although the same data is available on the dimensional data warehouse, because of its normalized structure some reports execute faster when they are pointed to the ODS. The fact that the ODS stores only the latest/current version (no history) also helps the report performance, compared to the dimensional data warehouse which stores all versions and years of snapshot data. Of course the other way around is also true: some reports execute faster when they are pointed to the dimensional data warehouse instead of ODS.
About the term "dimensional data warehouse" above: I prefer to call it “dimensional data store” or DDS. Where as the term "data warehouse system" encompasses the whole thing: stage, ETL, ODS, DDS, DQ, control, audit, cubes and the front-end applications, as I described on this article: Business Intelligence or Data Warehouse[/url] and in my book[/url].
On the other hand, the term "data warehouse database" refers to either a relational DDS (as I mentioned here in this post) or a multidimensional data store (OLAP cubes). The data warehouse databases are shown in blue in the above article.
I appreciate that you may have different views Abdel and I am interested to learn from you. So I'd be grateful if you would post your views.
Thank you for your answers.
One architecture that i usually use is described as follows :
- Extract the Transactional Data from Multiple Source Systems to a Staging Area. This Staging Area is not permanent. We extract Data with less transformations, and we store Data about the same subject from different systems in the same table (the structure of the table is a standardized to receive data from different systems, we also add some columns to differentiate source system...). The reason why we use a staging area is to disconnect as quickly as possible from the source systems in order to not disturb the operations.
- If it's possible We extract only delta Data from the source systems to the staging area.
- Transform the data and load it in the data warehouse (usually in 3NF), this Data warehouse allow us to track history over time, so we can provide a picture of our activity at any time in the past ( This is very useful in the insurance field). We try to never update the Data Warehouse tables, even if some times we can, in collaboration with business users, decide that some kind of changes in the source system are not very interesting so we can update some transactions...
- For performance reasons, we create Datamarts ( a set of aggregated fact tables and dimensions), to use for Analysis. The best design in this stage is the dimensional form.
- To explain analysis, we have to access to the detail, which is stored in the Data Warehouse database, so we create reports on Data warehouse Database and allow access to them from Analysis developed using Aggregated Datamarts by passing parameters ( This reduces the response time and by using some techniques, we can force the users to drill down to the correct level to retrieve fewer data from the data warehouse database).
- And so on.
Why i never had to deal with ODS :
If there is a need to deal with operational reporting :
- At first we reduce the refreshment periodicity of the Data Warehouse ( We can go as far as the ETL can do)
- At the first load of the Data warehouse database, We also load separately all useful tables for operational reporting. That's what i call the current version of the transactions.
- for the following extractions, we UPSERT these tables from the staging area in parallel with the load of the Data Warehouse Database ( this Database store the current version and all the past versions)
- If for any reason, there is a need to reconstruct our Datamarts or our Current version, We can do it from the Data Warehouse Database.
Hope this gives you an idea on how i do Data warehousing... i don't know if it's good or not, but till now the only problem i encountered is Data explosion :-).
As you can imagine, almost 70% of the effort to build a good DW/BI Solution is consumed by the ETL...
Viewing 15 posts - 1 through 15 (of 16 total)