Data Reconciliation Reports

  • Here is a simple case study for which we are looking either to design a recon report or to leverage existing tools:

    We have multiple source systems (the sources include Access and SQL Server databases) that are ETL'ed into a reporting data mart used primarily for SSRS reporting. More often than not we have issues with data being extracted or loaded, resulting in either duplicate records in the destination or significantly fewer records than expected, which we only discover after a user shouts out.

    We were looking to create a reconciliation report on the staging server that could be used to identify any discrepancy immediately.

    Any suggestions on how this could be approached, or any existing tools that already do this and could be considered?

  • I've created a few simple recon reports over the years. I don't know of a tool that does this for you, but there could be one out there somewhere. Usually I just write simple queries to give me a sanity check after the data load completes. For instance, if I know from history that one element of my ETL should always push around 100 new records into a given table, I might query that table after the data load, count the number of new records (based on DateLoaded), and then test the result against my expectations. You can create a table with a record for each part of your data mart holding your expectations and then use it in your tests. You need to give yourself some wiggle room: generally, if you get within 15-20% of your expectation on any given day, you're OK. You could then set up some sort of reporting that shows whether you are close to the expected value, within 50%, or more than 50% over/under the expected value.
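
    For example, the expectations table and the post-load check might look something like the sketch below; all table and column names (etl_Expectation, FactSales, DateLoaded) are just placeholders for the example:

        CREATE TABLE dbo.etl_Expectation
        (
            TableName    sysname      NOT NULL PRIMARY KEY,
            ExpectedRows int          NOT NULL,  -- typical number of new rows per load
            TolerancePct decimal(5,2) NOT NULL   -- allowed wiggle room, e.g. 20.00
        );

        -- After the load, count today's new rows and test them against the expectation.
        DECLARE @Actual int, @Expected int, @Tolerance decimal(5,2);

        SELECT @Actual = COUNT(*)
        FROM   dbo.FactSales                     -- placeholder destination table
        WHERE  DateLoaded >= CAST(GETDATE() AS date);

        SELECT @Expected = ExpectedRows, @Tolerance = TolerancePct
        FROM   dbo.etl_Expectation
        WHERE  TableName = N'dbo.FactSales';

        SELECT CASE
                 WHEN ABS(@Actual - @Expected) <= @Expected * @Tolerance / 100.0
                      THEN 'OK'
                 WHEN ABS(@Actual - @Expected) <= @Expected * 0.50
                      THEN 'Check - within 50% of expected'
                 ELSE 'Investigate - more than 50% over/under expected'
               END        AS LoadStatus,
               @Actual    AS ActualRows,
               @Expected  AS ExpectedRows;

    Results from checks like this can be collected into a single recon report or used to fire off an alert.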

    You would build your table of assumptions based on history. Just query to get a sense of the actual number of records loaded / changed / deleted, that sort of thing. I did this for a call center once, and one of my tests was to make sure that all agents scheduled on a given day had records loaded into my data mart for that day. I compared a schedule table with the load table, and it told me if someone was on the clock but I had no records for them. That was a real life saver more than once.
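
    That schedule check can be a simple anti-join, something like the following (the table and column names are made up for the example):

        -- Agents who were scheduled today but have no activity rows for today.
        SELECT s.AgentID, s.AgentName
        FROM   dbo.AgentSchedule AS s                      -- placeholder schedule table
        WHERE  s.ScheduleDate = CAST(GETDATE() AS date)
          AND NOT EXISTS (SELECT 1
                          FROM  dbo.FactAgentActivity AS f -- placeholder load table
                          WHERE f.AgentID      = s.AgentID
                            AND f.ActivityDate = s.ScheduleDate);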

    You probably get the idea. It's not rocket science. It will take some time to set up, but in the end you'll be glad it's there. Let me know if you want to talk it through in more detail.

  • Tim - this is absolutely useful and is exactly what we had in mind. Our only complication is that there are about 500 million records in the database, and we were not sure how this could be achieved easily. Do you have any thoughts on how we could do that without giving up a lot of performance?

  • It's hard to say without being there, but I would think that the kinds of validation queries you will need will be OK even with that volume of data. Just like any other query, make sure you write it in a way that makes sense and that you have good indexes, and you should be fine. You'll be writing mostly aggregate queries within a defined scope of time, maybe with a few queries joined to one or two other tables to validate specific things. Unless you're actually loading 500M records in every data load, it shouldn't be a problem.
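
    For example, keeping the validation scoped to the current load date and backing it with a narrow index keeps the cost small even on a 500M-row table; the names below are placeholders:

        -- A narrow index on the load date keeps the daily checks cheap even on a very
        -- large fact table (ONLINE = ON needs Enterprise Edition; drop it otherwise).
        CREATE NONCLUSTERED INDEX IX_FactSales_DateLoaded
            ON dbo.FactSales (DateLoaded)
            INCLUDE (SourceKey)
            WITH (ONLINE = ON);

        -- Aggregate validation scoped to today's load only; a gap between the two
        -- counts points at duplicate source keys in the destination.
        SELECT COUNT(*)                  AS RowsLoadedToday,
               COUNT(DISTINCT SourceKey) AS DistinctSourceKeys
        FROM   dbo.FactSales
        WHERE  DateLoaded >= CAST(GETDATE() AS date);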

    I would start small. Pick the top 10 pain points that you have to deal with again and again. Work out ways to validate those things after every data load. The lessons learned building the first set of validations will guide you as you work through the rest.

  • My approach was similar to the one described by Tim; here is a summary:

    A control table tracking monthly record counts for the most critical tables in the environment. This table has the table name in its first column and one column per month tracking the record count. It helps to track growth and hiccups when, for whatever reason, the data fails to arrive, and it has also pinpointed major actions taking place at the source system. It is updated by a stored procedure that can be run on demand or on the second day of each month.

    Another very similar table tracks monthly measures; it has the table name in its first column, the measure being tracked in its second column, and as many month columns as required. The same procedure that updates the 'monthly record counts' table (or a separate one) takes care of updating this one. A rough sketch of both tables follows below.
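
    For what it's worth, here is one way those two control tables and the capture procedure could look. Rock describes a wide layout with one column per month; the sketch below keeps one row per table per month instead (it avoids adding columns over time and can always be PIVOTed for presentation), and all object names and the table list are placeholders:

        -- One row per table per month; PIVOT for the one-column-per-month presentation.
        CREATE TABLE dbo.ctl_MonthlyRowCount
        (
            TableName   sysname NOT NULL,
            MonthStart  date    NOT NULL,
            [RowCount]  bigint  NOT NULL,
            CONSTRAINT PK_ctl_MonthlyRowCount PRIMARY KEY (TableName, MonthStart)
        );

        -- Companion table for monthly measure totals (e.g. SUM of an amount column).
        CREATE TABLE dbo.ctl_MonthlyMeasure
        (
            TableName    sysname       NOT NULL,
            MeasureName  varchar(128)  NOT NULL,
            MonthStart   date          NOT NULL,
            MeasureValue decimal(19,4) NULL,
            CONSTRAINT PK_ctl_MonthlyMeasure PRIMARY KEY (TableName, MeasureName, MonthStart)
        );
        GO

        CREATE PROCEDURE dbo.ctl_CaptureMonthlyRowCounts
        AS
        BEGIN
            SET NOCOUNT ON;

            -- Run on the second day of each month; the snapshot is attributed
            -- to the month that has just ended.
            DECLARE @MonthStart date = DATEADD(month, DATEDIFF(month, 0, GETDATE()) - 1, 0);

            INSERT INTO dbo.ctl_MonthlyRowCount (TableName, MonthStart, [RowCount])
            SELECT QUOTENAME(s.name) + N'.' + QUOTENAME(t.name),
                   @MonthStart,
                   SUM(p.rows)
            FROM   sys.tables     AS t
            JOIN   sys.schemas    AS s ON s.schema_id = t.schema_id
            JOIN   sys.partitions AS p ON p.object_id = t.object_id AND p.index_id IN (0, 1)
            WHERE  t.name IN (N'FactSales', N'DimCustomer')  -- placeholder list of critical tables
            GROUP  BY s.name, t.name;
        END;

    A similar procedure (or additional statements in the same one) would populate the measures table with SUMs of the tracked measure columns.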

    These two tables help track data fluctuations at the source applications up to the previous month. More often than not, their data highlight issues with the legacy application, and the business people get busy explaining any fluctuation.

    I also have views on the reporting tables tracking daily record counts and daily measure totals. These views return daily summaries for the last two months (a maximum of 8 weeks). Their granularity depends on the data being tracked; they may include geographical summaries, e.g. per state or per line of business, and sometimes they track figure fluctuations introduced by transactions being assigned to the wrong category.
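
    A minimal sketch of one such daily-summary view, assuming a transaction-date column and a state column on the reporting table (all names here are placeholders):

        CREATE VIEW dbo.vw_DailySalesSummary
        AS
        -- Daily record counts and measure totals per state for the last 8 weeks.
        SELECT CAST(f.TransactionDate AS date) AS SummaryDate,
               f.StateCode,                        -- placeholder geography column
               COUNT(*)                            AS RecordCount,
               SUM(f.SaleAmount)                   AS TotalSaleAmount
        FROM   dbo.FactSales AS f                  -- placeholder reporting table
        WHERE  f.TransactionDate >= DATEADD(week, -8, CAST(GETDATE() AS date))
        GROUP  BY CAST(f.TransactionDate AS date), f.StateCode;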

    Another object I implemented tracks 'data availability'; sometimes the business runs its reports before 100% of the data has arrived.
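
    A simple availability check might compare the latest business date loaded with the date the reports expect; again, the table and column names below are only placeholders:

        -- Flags the table as incomplete if yesterday's business date has not arrived
        -- yet (an index on TransactionDate keeps the MAX cheap).
        SELECT CASE
                 WHEN MAX(f.TransactionDate) >= DATEADD(day, -1, CAST(GETDATE() AS date))
                      THEN 'Data available through yesterday'
                 ELSE 'Incomplete - latest date loaded is '
                      + CONVERT(varchar(10), MAX(f.TransactionDate), 120)
               END AS AvailabilityStatus
        FROM   dbo.FactSales AS f;                 -- placeholder reporting table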

    Something else to track is data delays; sometimes the legacy application is late completing its end-of-day (or end-of-month) processes and the ETL takes whatever was ready.

    There are too many factors to list; you should build a suite of control tables and views addressing these factors as they appear.

    Hope this helps,
    Rock from VbCity

  • Another thought might be to implement an SSIS framework. I built mine from the instructions in SSIS: Problem Design Solution, or you can buy BI Express from Pragmatic Works. BI Express makes it much easier to set up, but it's not free.

    If you use one of these frameworks, you get a database you can query for run times for the entire package and for each element in your package (plus a whole boatload of other useful metadata). This can be a starting point for determining whether you have a problem (i.e. the run finished much too quickly or took much too long).
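
    Just to illustrate the idea (the table and column names below are placeholders, not the framework's or BI Express's actual logging schema), a query over an execution-log table could flag runs that finished suspiciously fast or ran far longer than their history:

        WITH RunTimes AS
        (
            SELECT PackageName,
                   DATEDIFF(second, StartTime, EndTime) AS DurationSec,
                   ROW_NUMBER() OVER (PARTITION BY PackageName
                                      ORDER BY StartTime DESC) AS RunRank
            FROM   dbo.PackageExecutionLog             -- placeholder log table
            WHERE  EndTime IS NOT NULL
        )
        SELECT latest.PackageName,
               latest.DurationSec             AS LastRunSec,
               AVG(1.0 * history.DurationSec) AS AvgHistoricalSec
        FROM   RunTimes AS latest
        JOIN   RunTimes AS history
               ON  history.PackageName = latest.PackageName
               AND history.RunRank     > 1
        WHERE  latest.RunRank = 1
        GROUP  BY latest.PackageName, latest.DurationSec
        HAVING latest.DurationSec < 0.5 * AVG(1.0 * history.DurationSec)  -- suspiciously fast
            OR latest.DurationSec > 2.0 * AVG(1.0 * history.DurationSec); -- far too long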

    I would consider a framework a 'nice to have' tool in your monitoring toolbox. If you have to pick between this and the solutions Rock and I have described, I would go with the latter.

    Good luck!

  • SSIS: Problem Design Solution is a great book. I set up its framework as a Visual Studio template, and it is a great idea to have one.

    Based on the issues described by MaverickMan, their problems happen in existing ETL processes; it would be a lot of work to retrofit them with a framework.

    Hope this helps,
    Rock from VbCity

  • Good point. Applying the template after the fact is nothing but pain and heartache. BI Express is easier if your pockets are deep enough.
