Validations at end of ETL indicate missing data
At the end of each ETL job, I always run some sort of validation process to give me an indication that the data loading and processing are as expected or things have gone terribly wrong. The latter happened last Thursday when I received the validation result in my inbox.
Last Thursday also coincided with a “disk utilization” issue in the upstream server. So I waited until last Friday when the “disk utilization” issue was fixed in the upstream server to have the data reloaded. To my surprise, the data volume that was loaded on our side was still way below normal.
Steps to investigate missing data
It’s time to investigate. Blog writers at this point will usually “make the long story short”, I’ll do the same here, but with a few bullet points to highlight the steps I took to investigate.
- Pick one day for one employee: this is the most obvious data point to validate for my data. Our data contains detail data down to the granularity of per employee, per Interaction Resource ID, per connection ID, per 15 minutes interval per row. Picking one day for one employee will give me not too little and not too much data to check.
- Validate on the upstream source server: the table in question is a fact table and has a unique interaction resource fact ID. The unique list of the interaction resource fact ID is an obvious candidate as my data points for checking.
- Compare the interaction resource fact ID: between the upstream source server and our own server. Because the data volume is small enough, a simple query revealed that about 20% of the interaction resource fact IDs are not loaded into our own table.
- An ETL process design with an atomic operation in mind helps to pinpoint exactly where the problem might be: our fact tables usually are long and also wide, so it’s not very easy to visually see why these 20% of the data were not loaded. So it’s time to go to the step where the data was processed before the loading. Fortunately my ETL processes are designed with the atomic operation in mind, and I know exactly which step to look for the culprit.
- A WHERE clause was added to the processing stored procedure: a log entry in the procedure clearly says that a WHERE clause was added last Thursday. A quick running of the query inside the procedure shows that this WHERE clause filter out those 20% data.
ecda1.[CUSTOM_DATA_8] <> ‘null’
A quick check of the raw data shows that out of 1.3+ million rows, there is only one row that has the value ‘null’ in ecda1.[CUSTOM_DATA_8]. Then why 20% of the data were also filtered out along with the one ‘bad’ row?
Why a seemingly innocent WHERE clause can cause large amount of data missing
This post is getting longer than I wanted. I’ll stop now. In the next post, I’ll explain:
- Why the above WHERE clause not only filtered out one ‘bad’ row, but also took 20% of other rows along with it.
- Why in ETL, adding WHERE clause during data processing in production is almost never a good idea.
Validation of data loading and processing
One of the most often used validation method at the end of each ETL run is to run a cross reference checking on a couple of metrics, which entails finding two independent sources of the same metric.
Atomic Operation in ETL
Atomic operation, atomic programming, atomic database design, atomic transaction, etc., etc.. There are many explanations to these concepts. I am probably not qualified to give it a formal definition in the ETL design, but it’s a strategy that every ETL designer/developer should learn to practice. As an ETL designer/developer, our professional life depends on how well we understand the strategy and how well we apply it to every single task we design.