What is normal? Finding outliers with R

• Comments posted to this topic are about the item What is normal? Finding outliers with R

• I'm surprised you glossed over the more formal statistical guidelines for a "normal distribution" as your example was quite a good demonstration. Normal distributions are inherent to the accuracy of many statistical functions.
See the 7 point description here:
http://onlinestatbook.com/2/normal_distribution/intro.html

daz

• In reply to dazlari - Monday, June 12, 2017 11:41 PM:

Hi Dazlari,

Thanks for your comment and for the link. I definitely agree that an appreciation of common distributions and their theory is important. This article was strongly geared towards a proof of concept, so I deliberately avoided theory. One important point, though: the distribution under consideration here is Poisson, not normal.

For anyone else keen to see some of the theory, I highly recommend looking up the Poisson distribution, which is the appropriate distribution for count data. Wikipedia actually does a really good job of describing the Poisson (and the normal) distribution. It's a good starting point.
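As a rough illustration of why the Poisson distribution matters here, the sketch below scores counts by their upper-tail probability under a Poisson model. The counts are made up for illustration, the median is only one possible estimate of the rate, and the 0.001 cut-off is an arbitrary assumption, so treat this as a sketch rather than the article's method.

```r
# Hypothetical daily counts (not the article's data)
counts <- c(4, 6, 5, 3, 7, 5, 4, 21, 6, 5)

# Assumption: use the median as the Poisson rate, since it is
# less affected by an extreme value than the mean
lambda <- median(counts)

# Upper-tail probability P(X >= k) for each observed count k.
# ppois(q, lambda, lower.tail = FALSE) returns P(X > q), hence k - 1.
p_upper <- ppois(counts - 1, lambda, lower.tail = FALSE)

# Flag counts that would be very unlikely under Poisson(lambda);
# the 0.001 threshold is an arbitrary illustrative choice
flagged <- counts[p_upper < 0.001]
print(flagged)
```

A small tail probability only says that a count is unusual under the assumed model; it does not say why.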

• Interesting stuff. I won't get into the clearly missed portion: why you chose that data, how you validated it, how you gathered it and so on, which is really the core of why you chose the model you chose and what questions are truly driving the need to find the probability of the occurrence. But why did you take the approach of only using code to flag the outliers, rather than explaining why they are happening to begin with?

For example, you can see the outliers as soon as you plot the data, but what methods are common to identify them, other than simply accepting them like you did? You really didn't explain why they are there. Maybe they come from human error? Maybe from a data extract error? Just accepting them and shading them red does not really teach us how to handle them.

I say this because I have data scientists come to me with outliers that could have been my fault, because of how they asked for the data. There is a methodology / best practice at play here around the science of explaining the facts of what they are observing, not accepting everything and going to the boardroom with it.

A good retail example of an outlier is a drop shipper. Normal users may be buying items priced between \$1 and \$20. Then you have an outlier buying in bulk at \$1000. Understanding why that happens can lead to finding out that drop shippers are not being correctly identified, when they should be removed from the analysis entirely because they do not qualify as active users. You can also look at this as discovery. Through your analysis, you discovered a new species of fish that you can now start classifying and tracking, which is a win and a good reason for investigating your results.
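On the question of common methods for identifying outliers, one widely used first pass is Tukey's IQR fence. The sketch below applies it to made-up order amounts loosely modelled on the drop-shipper example. The data frame, column names and the 1.5 multiplier are illustrative assumptions, and this is not the approach used in the article.

```r
# Hypothetical orders (illustrative only)
orders <- data.frame(
  customer = c("a", "b", "c", "d", "e", "f", "g", "h"),
  amount   = c(5, 12, 8, 19, 3, 15, 1000, 9),
  stringsAsFactors = FALSE
)

# Tukey's upper fence: third quartile plus 1.5 times the interquartile range
q3    <- unname(quantile(orders$amount, 0.75))
upper <- q3 + 1.5 * IQR(orders$amount)

# Flag orders above the fence as candidates for investigation
orders$flag <- orders$amount > upper
print(orders[orders$flag, ])
```

The flag is only a starting point: the flagged rows still need to be traced back to a customer, an extract or a data-entry step to learn why they look different.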

• Hi, thanks for posting. Please help out a relative newbie with R: what is the meaning of the "Score :=" in your formula? I am not familiar with the apparent assignment operator :=.

• In reply to rhendrickson - Tuesday, June 13, 2017 7:54 AM:

I just found the answer to my question: it is syntax used in the data.table package.
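For anyone else who hits the same question, here is a minimal sketch of `:=` in data.table. The table, its columns and the scoring expression are hypothetical and are not the article's formula; the point is only that `:=` adds or updates a column by reference, without copying the table.

```r
library(data.table)

# Hypothetical table (not the article's data)
dt <- data.table(Day = 1:5, Count = c(4, 6, 5, 21, 5))

# `:=` creates the Score column in place; no `dt <- ...` reassignment is needed
dt[, Score := Count - median(Count)]

print(dt)
```

In base R the rough equivalent would be `df$Score <- df$Count - median(df$Count)`, which assigns into a data frame rather than modifying it by reference.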

• In reply to xsevensinzx - Tuesday, June 13, 2017 6:16 AM:

Hi there,

You've cut right to the main point of this article, which was to identify the most interesting points in the data and then, from there, dive into understanding them, perhaps in a similar way to what you've described.

I'm sure that you have probably seen heaps of dashboards with dozens, or even hundreds, of charts which all show some view of essentially the same data. This frustrates me. Whatever the industry, I'd like to see a simple representation of "normal" and then more detailed charts / analyses / exploration of the things that deviate from normal, i.e. the interesting things. I want to be really clear here that I never "accepted" the outliers; I simply highlighted them as points of interest.

• Interesting article 🙂

I'm new to R, but this looks pretty useful for finding data points that deviate from those within the bell curve.

Maybe it's time to spin up my own R box to see what it can do.
