This editorial was originally published on Jan 28, 2015. It is being republished as Steve is out of the office.
I thought this was an interesting post: Your data is not normal. It talks about the fact that we often assume some normal distribution when working with data. We seem to think we can extrapolate, often in a linear fashion, based on past experience. We often use functions and write queries that tend to think of normality as having some standard distribution.
However that's not often the case. It's not necessarily even the best way to begin looking at data for patterns. We might need to apply a number of different types of algorithms to a set of data to see what types of analysis might be appropriate. Not sure what I mean? Let me give you a few examples.
As a speaker, I often get feedback from my sessions as a series of forms where people have rated the session on a scale and possibly included comments. Many speakers receive the data in the same format, but I've often seen speakers get one form that rates them poorly, or has a complaint. The speaker will then view that one item as somehow representative of the session. This despite the fact that they might have ten forms with high ratings. They're somehow assigning more weight to the negative rating then the all the others. There's perhaps some value here in doing so, since there might be a legitimate complaint, but often we need to discard this one form as an outlier.
As another example, I worked with an import firm that was trying to estimate future purchases based on past ones for ordering purposes. Initially they looked at an average of the last three months (rolling) to place future orders. However I pointed out that we had some seasonal fluctuations in our business. There were a few times a year that the rolling average would dramatically under or over estimate the product we needed. As we dug deeper into our historical data, we learned that a better estimate required some level of complexity in our formulas, but we also realized we needed to allow for human overrides when other information was available. That was a fun project, but one that was quite complex when we analyzed the data.
There are numerous other examples I could give, but the point is this. Learn about your data. Dig into patterns, and don't assume that normal for one set of data looks anything like normal for other sets.
