Why Getting Data Right Matters

An InfoWorld article from 2017 suggests that 80 percent of a data scientist’s job is cleaning and transforming data. I believe this is probably true only for organizations that put at least average effort into designing and implementing their data storage. These people, trained to analyze data using complex math, could be spending their time producing important insights into how to make their company more profitable. Instead, assuming a 40-hour week with no other meetings or administrative tasks, they spend on average 32 hours each week mucking through data, trying to figure out which data is useful and what it means, and then reorganizing it for analysis. That leaves only eight hours a week to provide insight into the data, assuming they were right about what they thought the data meant.

We can do better. Organizations can reduce the time spent deciphering data by paying more attention to getting database structures designed as right as reasonably possible the first time. When you start a project that involves data, whether you are building a completely new system or altering an existing one, it is crucial to understand why a company stores data. The first reason is the obvious one: to manage operations. Take money in, ship products out.

What comes after operations is generally where the power of data comes in. Why did we get that money? How quickly did we ship the product? Were people who received their product quickly more likely to purchase more? Did the offer that was included on the receipt help to bring in more sales? How did we not know that people who bought peanut butter on a Tuesday ordered more milk on Friday?

Structure data properly for whatever data platform you are employing. Name attributes the same from version to version of a system, so you know what ProductStatus means in v1 and in v10, even though the structures and even the platform may have changed. Perhaps even more importantly, make sure that when you needed to store the ProductStatus, you didn't use the LastName attribute in v3-3.2 because it was "easier" than adding a new column.

Getting the structure right is just the beginning. Too many databases are like basic buckets: they will allow in any data that the customer wants. This leads to situations like a product order that appears to have shipped years before it was even ordered, phone calls that appear to take less than 0 seconds, and invoices that were paid in 3020. I could go on listing issues I have seen, but the point is that when computer systems allow bad data, bad data creeps in. Once more than a smattering of the data is of poor quality, the results of any analysis are affected.
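As a minimal sketch of the difference between a bucket and a database that defends itself, the following uses SQLite through Python's standard sqlite3 module. The table and column names (ProductOrder, PhoneCall, DurationSeconds) and the try_insert helper are hypothetical, chosen only to mirror the kinds of bad data described above; a real system would express the same rules in its own platform's constraint syntax.

```python
import sqlite3

# Hypothetical schema for illustration only; names are assumptions.
SCHEMA = """
CREATE TABLE ProductOrder (
    ProductOrderId INTEGER PRIMARY KEY,
    OrderDate      TEXT NOT NULL,   -- ISO 8601 dates compare correctly as text
    ShipDate       TEXT,
    CHECK (ShipDate IS NULL OR ShipDate >= OrderDate)
);
CREATE TABLE PhoneCall (
    PhoneCallId     INTEGER PRIMARY KEY,
    DurationSeconds INTEGER NOT NULL CHECK (DurationSeconds >= 0)
);
"""


def try_insert(conn, sql, params):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

order_sql = "INSERT INTO ProductOrder (OrderDate, ShipDate) VALUES (?, ?)"
call_sql = "INSERT INTO PhoneCall (DurationSeconds) VALUES (?)"

results = {
    "valid_order": try_insert(conn, order_sql, ("2020-03-01", "2020-03-04")),
    # Shipped years before it was ordered: rejected by the table CHECK.
    "shipped_before_ordered": try_insert(conn, order_sql, ("2020-03-01", "2017-03-04")),
    "valid_call": try_insert(conn, call_sql, (95,)),
    # A call lasting less than 0 seconds: rejected by the column CHECK.
    "negative_call": try_insert(conn, call_sql, (-5,)),
}
print(results)
```

The design choice here is to put the rules in the table definitions rather than in application code, so every program that writes to the database is held to the same standard.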

Discovering insights by analyzing data is what’s truly important in the long run. Data is useful operationally for minutes, perhaps days, but it is useful for analysis for years to come. It all starts with the boring, somewhat time-consuming basics of following proper design patterns.

 
