If you have been involved with any data warehousing projects then the chances are that you have heard of at least one of Dr Ralph Kimball's publications
- The Data Warehousing Toolkit
- The Data Webhouse Toolkit
- The Data Warehouse Lifecycle Toolkit
If Bill Inmon is regarded as the father of the data warehouse then Dr Ralph Kimball is widely regarded as the father of business intelligence.
I have just finished reading "The Data Warehousing Toolkit" for the first time and have placed it firmly in a category of "books I wish I had read years ago".
I say I have finished reading it but in all honestly if you are involved in data warehousing projects it is going to be a long time before
you stop reading this book. I can see myself reading sections of this over and over again.
What is the book about?
The book is not really a SQL Server book at all. In fact, apart form a couple of sentences mentioning ORACLE the book is pretty much database agnostic.
I would say the book addresses the following
- 80% of it is how to design a data warehouse from a dimensional modelling perspective
- 15% focuses on the business aspects of data warehousing
- The remaining 5% is relevant but miscellaneous information including mistakes, pitfalls and common myths about data warehousing
So what do I think is so great about this book?
Firstly, I found it an easy read. One of Dr Kimball's points about data warehousing from an end user perspective is that it should be easy to use and he has applied this principle to his book.
Each chapter builds on the preceding chapters and each one is based on an industry or business process. I found that this kept my interest as well as providing a good illustration of the points being raised.
It does make reference to the 1st edition and mention how some practises have been made obsolete by technology advances. For example, when the 1st edition was written disk space and processor capacity was significantly less than they are today so aggregating data was a necessary evil. Today technology allows us to avoid unnecessary aggregation.
Chapter ten asks you to critique a design on the basis of what you should have learnt so far.
Chapter fifteen pulls absolutely everything together for one big case study based on the insurance industry.
The 15% that focuses on business issues is exceptionally important as it covers some absolute fundamentals that are very easy to overlook. In my experience there are some things that are absolutely obvious but only in hindsight. This book states a number of these up-front.
What didn't I like?
Really the points below are mere nit picking.
Towards the end of the book I felt that some additional schema examples would have been beneficial.
In general I tend to dip into textbooks as I have specific needs at the time of reading. I rarely read them end-to-end in one pass. This is not a book that lets you dip into odd chapters. You could not decide you were interested in the CRM topic without having read the preceding chapters.
That said there was a lot to absorb in this book and I think an attempt to produce stand-a-lone chapters would have made the book unwieldy and repetitive.
Key Business Ideas
There are some ideas in the book that are of monumental importance and are reiterated throughout the book.
Model a business process
Perhaps the main point is that the data warehouse should model a business process and not seen as a means to generate a predefined report or set of reports.
The data warehouse should be seen (and accepted) as a repository of truth. The end users may not like what the truth reveals but they should have no issue with it being the truth. All reports built on top of the data warehouse are reports built on that truth.
Some of the strategies to produce this single repository of truth are described below
- Coming up with a universally accepted definition of business facts and the dimensional measures that will be used within the data warehouse. These are known as conformed facts and dimensions.
- Establishing a published definition of terms and data dictionary. This must be freely accessible.
- Establishing a central authority for ensuring that dimensional measures and facts remain common across the enterprise.
From experience I know that this is very easy to say but in real life this is like nailing jelly to the ceiling.
It has to be done though. If you don't succeed with this then when the data warehouse reveals an unpalatable truth people will argue about the figures themselves rather than what the figures are telling them.
Who is the data warehouse for?
Another key point is that the data warehouse is to drive good decision-making and profitability. If it fails to meet these criteria then it has failed at the most basic level.
The data warehouse has to be easy for business users to understand and use. The book stresses that business intelligence should
not rely on a "priesthood" of technical specialists in order to deliver information to the business users.
If the business users wants information from the data warehouse then they should be able to get it for themselves.
Training is vital and the book goes as far as to recommend a "no education no access" policy with the full backing of the business sponsor. There is no point blowing the data warehousing budget on technology and processes if the business users cannot use the system.
How many of you have been on training courses well ahead of you actually having access to the thing you are being trained on? The book explicitly states that the end users need access to the data warehouse and tools as soon as they have been trained.
I have run into this problem many times in my career. Trying to introduce fact based decision making into organisations that run off gut feel and assumptions. I found myself nodding when the book raised this point.
I once performed a highly detailed analysis for a company whom the cheapness of acquiring customers was deemed the single most important factor. My analysis showed that
- Cheap customers have a high drop-off rate. They don't tend to come back and if they do they buy cheap products and default on payment.
- Customers recruited on products slightly below the average product offering (as opposed to the cheapest possible product) tended to become repeat customers spending more on 2nd, 3rd and subsequent purchases.
This was so contrary to the corporate culture and belief system that my analysis went straight into the round file.
A friend had a similar experience profiling customers for a shampoo manufacturer. If the adverts were to be believed a young 20 something blonde beauty running through a meadow was the core market for their gentle shampoo.
His analysis showed that post menopausal women were the single largest segment because the hormonal changes meant their hair was drier and more brittle and the shampoo was ideal for their requirements.
He was lucky to leave the premises in one piece!
Key technology ideas
For me there were three ideas that surprised me.
- Don't aggregate the data
- Don't normalise the data in the presentation layer
- Don't use natural keys
Don't aggregate the data
One of the key decisions in the data warehouse is the granularity of the facts. You may decide that you want to look at sales at the order level. That is fine, but if you decide later that you want to go down to the line-item level and you have aggregated all your source data up to the order level then you are stuck.
This isn't to say that you can't provide aggregates to aid performance but you shouldn't eliminate the source data
Don't normalise the data
From the beginning the book emphasises that the schema for an operation system such as a sales order system is going to be radically different to the schema for the data warehouse.
The schema should be as near to a star schema as it can possibly be. That is a fact table has direct links to the dimension tables. The aim is to make the schema easy to understand to a non-technical user. There are obvious exceptions that are dealt with thoroughly in the book.
The author stresses that any space saving in normalising out the dimensions is relatively trivial compared to sheer size of the fact tables. What you gain in space you lose in ease of understanding and ease of understanding is a key factor in getting the data warehouse accepted.
If you are not going to normalise the data in the dimensions then you might as well make sure that the descriptive text within the dimensions is meaningful and not abbreviated.
Don't use natural keys
Dr Kimball favours a single integer surrogate key for joining fact and dimension tables.
Apart from simplicity the fact of the matter is that natural keys can and do change and having to update a several billion-row fact table as opposed to a few thousand (or fewer) row dimension table is somewhat unpalatable.
If you are an experienced DBA none of the following points are going to be obvious but it is nice to see them stated up front in the book. From the outset the following features need to be designed in.
- Data warehouse security. Who should have access to your data warehouse.
- The index and partitioning strategy
- Purge and archiving strategy, particularly if the data warehouse holds personal information.
- As a data warehouse is going to be very large you don't want to be planning this in retrospectively.
- Perhaps one of the most important design considerations is one that straddles the line between those that I have called business ideas and technology ideas. That is the separation of the data warehouse into 3 core parts.
This is the only part of the data warehouse to which business users have access.
This will comprise of the star schemas of the fact tables and dimension tables.
This is an intermediate area where data in the offline data store is processed so that it may be uploaded into the Presentation Area.
No business user is allowed into this area.
Offline Data store
This is ware the raw operational data is received and collated. Again no business users is allowed anywhere near this area.
One off the cuff comment in the book was that if the schemas in the Presentation Area look boringly similar then a major goal has been achieved. The aim of making the data warehouse easy to use and understand.
What else is covered?
Most of the major data warehouse terms are covered. At a risk of regenerating the table of contents I have listed the main ones below
- The four stage DW design process
- Choose the process
- Specify the granularity of the facts
- Identify the facts (and conform them to the enterprise dictionary)
- Identify the dimensions ( and conform them to the enterprise dictionary)
- Additive facts - those that can be aggregated across all dimensions in the model
- Semi-additive facts - those that can only be aggregated for some dimensions.
- Fact less fact tables
- Those used to indicate the absence of data
- Those where there are no numeric facts to aggregate. They can only provide counts.
- Degenerate dimensions - those that retain specific meaning and remain within a fact table
- Junk dimensions - Those that have no natural group to which they can belong but have a limited number of permutations and combinations such as bit flags in the data
- Accumulating snapshots - facts that cannot be measured on a transaction by transaction basis such as warehouse cycle counts/stock takes.
- Strategies for dealing with slow and fast changing dimensions.
- Outrigger dimensions. An example of when snow-flaking is allowed.
- How to deal with heterogeneous dimensions such as products with a core set of common facts but product line specific facts.
- The recommendation of data warehouse roles. A role refers to a view used to present a common table such as the date dimension as being relevant to a specific fact.
- Fact dimensions - Entity Attribute Designs.
- Techniques for dealing with hierarchal data
- Techniques for dealing with many to many relationships
- Techniques to build a profit and loss account warehouse.
This book was recommended to me by a DBA for whom I have the utmost respect and now I am recommending it to you.
If you have read other data warehousing books you will notice that some draw a distinction between a data warehouse and a data mart.
I tend to think of data marts as the presentation layer of a data warehouse, that is, where you actually get your data from. The data
warehouse precedes the presentation layer just as a supermarket is the sales point for goods but is backed up by a warehouse.
There is an awful lot to absorb in this book particularly in the latter chapters and as I said earlier it is a book that bares
a lot of re-reading.
Beyond the design aspects of the book it was refreshing to see the business side of a data warehouse project was not ignored.
All to often books focus on the technological aspects of an IT project playing only lip service to the business aspects. Not only was it not ignored it was given significant emphasis.
Improving Data Warehouse and Business Information Quality : Methods for Reducing Costs and Increasing Profits by Larry P English
Corporate Information Factory by Bill Inmon and Claudia Imhoff.