SQLServerCentral Article

More Problems with Data Warehousing

,

I was interested in the sheer number of people who had read Janet Wong's recent article, Problems In Building a Data Warehouse. Reading into a couple of the replies, it seems that one thing which people were interested in were the "gotchas" from implementing a data warehousing system, and having the most technophobic user base in the universe I thought I could highlight a few.

The most important thing to bear in mind is clean data. You have to remember that Data Warehousing is like a pyramid, your base data makes up the base of the pyramid, and you need to fine hone this to get to the point.

In getting to the point I found a few problems.

Make Sure You Have a Universal Primary Key Across All Data Sources

Obvious isn't it? But in my case I was a couple of weeks in before I realized one of the systems I was integrating didn't have one (it looked like it did, but actually there were subtle differences). Luckily there was a text field that I could use with the SSIS fuzzy logic task to match for my pilot proof. In the end it took 2 months for the users to go back and change their systems to fix this for good.

Make Sure You Know How Reliable your Base Information is

This can't be stressed enough. No system is 100% accurate, and bear in mind that a 97% accurate report combined with a 97% accurate report will probably not make a 97% accurate report; most likely it will be 94% accurate.

If you find information in two places, work out which is the most reliable and try and get rid of the other one. It's kind of like having two cigarette lighters, front door keys or pairs of glasses. Because you have a spare you don't expend as much energy keeping track of either of them as you do when you only have one; it's just human nature. Also, try and work out if one of the systems has checks and balances which help to keep things up to date.

As an example we have an ink system which records the cost of the ink we buy. We also have a finance system where we raise purchase orders to buy ink. It makes far more sense to get the ink cost from the finance system because if we get the price wrong in the finance system the invoices from the suppliers don't match the purchase order and it soon gets sorted out. If the price is wrong in the ink system nothing bad happens and nobody cares.

Until your data warehousing project fails. Then it's your fault.

Some systems are just not right; the data is bad and they are poorly looked after. Of course, a re-implementation project can help to get these straightened out, but you are likely to be quite a way into your warehousing project before you realise just how not right they are. Also, re-implementation helps going forward, but what about the historical data? Do you just throw it away?

The key here is to look for the person with the spreadsheets who is responsible for getting meaningful data about the part of the business this system controls. Chances are he drags a report from the system into Excel then uses a combination of algorithms to come up with his figures. For example if this value is 0 this record is clearly rubbish; if this is greater than 10 then this is clearly rubbish. If you're a glorified hacker like me you can use a series of very basic statistical techniques and where clauses to match this guys logic and get to a pretty good average. If you're a stats genius then you'll probably do a much better job.

Compare Like For Like

Even in something as simple as a CRM system the report "Show me all customers that bought X in the last 3 months" can be inaccurate when combined with other data are you talking about the invoice date, invoice payment date, delivery date or order date?

Picking one out of thin air is not the way forward. This is a difficult topic and one you have to work out for yourself. Suffice it to say that your sales team will usually want you to go with purchase order date, but your finance team will probably only be interested in invoices raised or even paid invoices.

If you start reconciling a sales report with a finance report you'll see anomalies in the data, sales will consider a product as being sold in January whilst finance will consider it as sold in February.

Errors are Obvious

In my opinion this is the single most frustrating thing you can experience as an IT professional. You are rolling out information that has never been rolled out before, people are going to look at it. There will be errors.

You need to know how many. A guy far more experienced than me always told me you should never sell a data warehouse on 100% accuracy, because you will never get there.

The problem you have is that even if management have had this information in the past through manual reporting, they have probably never had this level of visibility, and they will almost certainly never have been able to cross check it.

The nature of computerised reports is such that generally the details are 100% accurate, but there may be some data which falls outside your logic, even when your logic is perfect you'll usually lose something because of system or entry errors, and therefore there will be whole rows of data missing.

The nature of manual reports is that generally all the rows are there, but the details in the rows are incorrect (typing errors rather than omissions).

As a rule nobody checks manual systems. Apart from me. I've spent weeks doing it, and I can tell you that cumulative errors from complex manual systems seem to manage a maximum of 90% accuracy. Omission errors from computerised reports seem to average about 95% accuracy (once you've ironed out the basic problems).

The problem being that omission errors are easy to spot, and they can see your report is "only" 95% accurate

This is a tricky one, and of course there will be a combination of bad data and compromises you have made to get your data onto a level playing field, which may affect accuracy. But 95% ain't bad (excepting for simple CRM type apps where you can probably do better)

I've found almost without exception that the department who produce these manual systems are inordinately proud of it and boast of their 100% ISO compliance and traceability blah blah blah, they usually have a pretty well established marketing machine and everyone parrots this as gospel.

So you need to check their manual systems yourself, they're not going to do it, and prove your 95% is better than their 90%. Then you have to convince everyone else that you've checked it, 'cos they ain't going to check your checks.

This is a hard sell.

Don't be Nasty About Spreadsheets

At least try. I know it's hard sometimes, but when your DW will be replacing the pinnacle of somebody's technical achievement they can get pretty emotional about it. Try and engage the user, tell them they've got a lot of very useful information in their spreadsheet and ask them to explain what they're doing. This will tell you a lot about what you're trying to achieve (remember they may have algorithms which weed out bad data which you can use, but they almost certainly won't know that they are algorithms).

Remember everyone knows something you don't. Their spreadsheet may well be a complete waste of time, energy and duplicated effort, but don't be negative about it; just deliver a proven (ultra-efficient) replacement for it.

Manage your Stored Procedures

You'll find that some data cannot be compared directly straight away, two bits of data will need to be manipulated to get it compared on a like for like basis. This means you will end up with a lot of stored procedures that transform your data so it is on a level playing field with all your other data. If your warehouse is anything like mine you'll end up taking data from various places and performing several transforms on it to get it into your warehouse.

It is absolutely imperative that you name your stored procedures properly. Use numbers to denote sequence. What you will find is that you have made some mistakes in your transforms along the way, and you have to keep going back over the transforms to get it just right.

Try and name the transform path sensibly, and I would suggest numbering it.

For example, I have several data paths for the data in our printing firm, I have an ink recipe which I need to combine with our paper usage, here are my transforms :

Ink Recipe

  1. DW_Sp_IR_Get Ink Recipe by Job 1
  2. DW_Sp_IR_Get Ink Cost From Finance per Gramme 2
  3. DW_Sp_PJ_Calculate Job Size 3
  4. DW_Sp_IR_Calculate total Ink Cost for job 4

Printing Job

  1. DW_Sp_PJ_Get Paper use per ink job 1
  2. DW_Sp_PJ_Get Paper cost from Finance 2
  3. DW_Sp_PJ_Calculate Job Size 3
  4. DW_Sp_PJ_Calculate total paper cost for job 4

Combine

  1. DW_Sp_Nexus1_Combine Printing Job with Recipe

You might be able to see from the above that there is more than one relationship between those two sequences. Draw a flow diagram and work out which procedures you run where. If you go at it from a trial and error point of view you're going to get stuck when something isn't quite right, you'll go back and change it. There is no real mechanism in SQL Server to document the interdependency of these transforms (Unless you're using SSIS in SQL 2005, which is pretty good for this sort of thing, incidentally).

Don't try and document it straight away in any detail, you won't know half of what you need to do, try and do it when you understand each particular "Module" of your project.

Conclusion

OK, so there are a lot of barriers to the completion of a Data warehousing project, most of them are not technical. I think the most important thing is educating the business (and yourself) about the accuracy of their current figures, as the quickest way to derail a data warehousing project is to set the business expectations too high.

Remember the point of a DW project is to get the best information you can to the top level as quickly and painlessly as possible. You are probably fighting against a series of manual reports which have been produced possibly for many years, probably with completely spurious but completely accepted statements of accuracy. The people who provide this information may well see you as a threat to their job and try and derail you themselves unless you can engage them in the process, never forget they know stuff you don't and they might be quite proud of their nonsense reports, or they might not care, or they might know or suspect they're nonsense and worry that you'll prove it.

Even if your Warehouse is attempting to deliver completely new information it is still going to be based on a backbone of accepted business reporting, so there should be some way to reconcile your information against something you know to be true (but remember that "know" is sometimes subjective, so try not to take anyone's word for it)

Bear in mind that the accuracy of data is related to the number of checks made against it, for example a Sales Order system is pretty accurate, because a customer will almost always question an incorrect invoice and a salesman will usually take care with an order, whereas a production reporting system may reconcile material flow against material purchased, but there will be an acceptable variance for error.

You are going to have to go back and do it again, you'll get some stuff wrong on the first attempt. You may have to go right back to your initial query and change the whole flow from beginning to end. Make sure you document the data flow and name your procedures sensibly or you'll get lost.

Rate

5 (1)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (1)

You rated this post out of 5. Change rating