More and more companies are becoming aware that they suffer losses, problems, and costs due to bad data. Poor data quality creates problems for both IT and the business. There is a difference, however, between data quality and data cleansing. Data cleansing is one step of the ETL process and covers just one area of data quality. Data quality technology spans many areas, including data profiling, standardization, validation, enrichment, and others.
Measure Data Quality
Measuring data quality is the first step to understanding a dataset's strengths and weaknesses. It allows business users and IT to understand the implications of data usage on data-driven projects, and it helps organizations implement data cleansing initiatives and improve the data collection process. To measure data quality, you need to design data quality rules, then fine-tune those rules and build a data quality scorecard. Designing data quality rules involves analyzing data relationships, gathering expert knowledge, and data profiling.
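A data quality scorecard can be as simple as the fraction of records passing each rule. Here is a minimal sketch; the records and rules are hypothetical examples, since real rules come from the business requirements:

```python
# Hypothetical sample records; in practice these come from the database.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "not-an-email",  "age": 200},
    {"id": 3, "email": "c@example.com", "age": None},
]

# Each rule is a name plus a predicate that returns True when a record passes.
rules = {
    "email_has_at_sign": lambda r: r["email"] is not None and "@" in r["email"],
    "age_present":       lambda r: r["age"] is not None,
    "age_plausible":     lambda r: r["age"] is not None and 0 <= r["age"] <= 120,
}

def scorecard(records, rules):
    """Return the fraction of records passing each rule."""
    return {name: sum(1 for r in records if check(r)) / len(records)
            for name, check in rules.items()}

for name, score in scorecard(records, rules).items():
    print(f"{name}: {score:.0%}")
```

Tracking these pass rates over time shows whether cleansing initiatives are actually improving the data.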
Analyzing data relationships means understanding how the different data elements relate to each other. It examines each field's data type, format, valid values, and missing values, and determines whether the field is required or optional. It looks for hidden relationships between columns or tables, and for fields and tables that no one ever uses.
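One concrete way to surface a hidden relationship between columns is to test whether one column functionally determines another. The sketch below (with hypothetical column names and rows) checks whether every value of one column maps to exactly one value of another; violations often point to dirty data:

```python
from collections import defaultdict

# Hypothetical rows: zip code should determine city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chcago"},   # a typo breaks the dependency
]

def determines(rows, a, b):
    """True if every value of column a maps to exactly one value of column b."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[a]].add(row[b])
    return all(len(values) == 1 for values in seen.values())

print(determines(rows, "zip", "city"))  # the typo makes this False
```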
Gathering expert knowledge means talking to the business users and understanding what the data means to them. Both business and technical teams gather data quality and data validation requirements, and each requirement must be understandable, testable, and achievable. Each data source, procedure, application, and data validation process is documented.
Data profiling is one of the best tools for measuring data quality. It is the process of analyzing a database with respect to data quality, in order to identify and prioritize data quality issues. It is very useful in managing and planning a data clean-up project. It extracts data from the current system and creates a knowledge base of accurate information about the data, usually covering about 5-10% of the current data depending on how much data is in your system. It allows data analysts to examine data inaccuracies and determine whether the data actually matches the business rules.
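A basic column profile typically reports the null rate, the distinct-value count, and the most common values. A minimal sketch, with hypothetical rows:

```python
from collections import Counter

# Hypothetical extracted rows; note the typo and the missing values.
rows = [
    {"name": "David", "pay_rate": "H"},
    {"name": "Bavid", "pay_rate": "S"},
    {"name": None,    "pay_rate": "H"},
    {"name": "Anna",  "pay_rate": None},
]

def profile_column(rows, col):
    """Summarize one column: null rate, distinct values, top values."""
    values = [r[col] for r in rows]
    present = [v for v in values if v is not None]
    return {
        "null_rate": (len(values) - len(present)) / len(values),
        "distinct": len(set(present)),
        "top_values": Counter(present).most_common(2),
    }

for col in ("name", "pay_rate"):
    print(col, profile_column(rows, col))
```

Profiles like this quickly flag suspicious columns, such as an unexpectedly high null rate or more distinct values than the business rules allow.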
Much data profiling is done manually, by writing queries that document the data's structure. There are also dedicated data profiling tools; for example, DataFlux, Innovative Systems, and Informatica have built their own. Most of these tools help parse and standardize the data, monitor business rules, and define what constitutes sufficient data quality for the organization.
To correct data errors, we have to find out how dirty data gets into the system. Errors can come from data entry, data conversion from a legacy system, inconsistent values in the same field, or missing or unexpected values from external data. A data entry person intends to enter 'David' but enters 'Bavid' instead. In many cases, systems (especially legacy systems) lack detailed documentation; both the technical and business people simply do not know what the data should be. As an example, a legacy system used 'H' and 'S' for hourly and salaried employees, while the new system uses 'Exempt' and 'Non-Exempt' to differentiate them. The IT people also found that in the legacy system the pay rate field was sometimes empty or held an unexpected value. There was no documentation defining the pay rate field, so no one knew what an empty or unexpected value meant. In this case we needed to establish a business rule to convert the pay rate: if the pay rate was 'H', convert it to 'Exempt'; if 'S', to 'Non-Exempt'; and if it was empty or neither 'H' nor 'S', use the employee's level to determine the pay type.
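The conversion rule above can be sketched as a small function. The level-based fallback mapping is an assumption for illustration; the real fallback would come from the business rule:

```python
def convert_pay_type(pay_rate, employee_level):
    """Map legacy pay_rate codes to the new system's pay types."""
    if pay_rate == "H":
        return "Exempt"
    if pay_rate == "S":
        return "Non-Exempt"
    # Empty or unexpected value: fall back on the employee's level.
    # Assumed rule for illustration: levels 1-3 map to Exempt.
    return "Exempt" if employee_level <= 3 else "Non-Exempt"

print(convert_pay_type("H", 5))   # Exempt
print(convert_pay_type("S", 2))   # Non-Exempt
print(convert_pay_type("", 7))    # Non-Exempt (fallback on level)
```

Writing the rule down as code (or as a documented mapping table) is itself a form of the missing documentation: the next team to touch this field will know exactly what each value means.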
Inconsistent definitions of common terms across departments and applications also contribute to data quality problems. For example, year-to-date is confusing to many users: some departments interpret it as calendar year to date, others as fiscal year to date. If it is fiscal year to date, a business rule should establish which month the fiscal year starts. There should also be a business rule defining whether 'year-to-date' means calendar year or fiscal year, with a different name for the other meaning; perhaps 'year-to-date' means calendar year to date, while 'fy-to-date' means fiscal year to date. Standardization is very important. If the data is standardized, it moves smoothly between departments, and everyone understands what each term and each report means. But to achieve that, the business and IT departments have to work hard together.
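The two year-to-date definitions can be pinned down precisely once the fiscal year start is fixed. A minimal sketch, assuming a July fiscal year start (that month is a configurable business rule, not something from the article):

```python
from datetime import date

FISCAL_YEAR_START_MONTH = 7  # assumed for illustration; set by business rule

def year_to_date_start(as_of):
    """Calendar year-to-date always starts on January 1."""
    return date(as_of.year, 1, 1)

def fy_to_date_start(as_of):
    """Fiscal year-to-date starts at the most recent fiscal year start."""
    if as_of.month >= FISCAL_YEAR_START_MONTH:
        return date(as_of.year, FISCAL_YEAR_START_MONTH, 1)
    return date(as_of.year - 1, FISCAL_YEAR_START_MONTH, 1)

as_of = date(2024, 3, 15)
print(year_to_date_start(as_of))  # 2024-01-01
print(fy_to_date_start(as_of))    # 2023-07-01
```

Giving each definition its own name, as here, is exactly the kind of standardization that keeps reports consistent across departments.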
The process of cleaning up the data does not happen in one day. In fact, achieving quality data across the whole company starts with one department, then moves to the next, and so on. The technical analysts need to work with the business group that understands the data and the business objective. Then both IT and the business group determine how to build a good, quality database to store their data.
The benefits of high data quality are concrete: customer satisfaction increases, and IT spends less time trying to reconcile the data. It reduces costs and increases revenues. These activities need to become part of the normal flow of the data warehouse process. Monitoring data quality and making corrections to improve it should not be considered a nuisance, but a regular part of system operation. This means the company needs a dedicated group of people focused on improving data accuracy. Data quality is not a glamorous topic, but it definitely affects the day-to-day business process.
I would like to thank Clarold Britton for helping me edit this article.