What is normal? Finding outliers with R


nick.dale.burns

SSCrazy

Points: 2226
June 12, 2017 at 11:32 pm

#322047

Comments posted to this topic are about the item What is normal? Finding outliers with R


dazlari SSC Enthusiast Points: 144 · Answer 1

I'm surprised you glossed over the more formal statistical guidelines for a "normal distribution", as your example was quite a good demonstration. Many statistical methods assume normally distributed data, and their accuracy depends on that assumption holding.
See the 7-point description here:
http://onlinestatbook.com/2/normal_distribution/intro.html

daz

nick.dale.burns SSCrazy Points: 2226 · Answer 2

dazlari - Monday, June 12, 2017 11:41 PM

Hi Dazlari,

Thanks for your comments and for the link. I definitely agree that an appreciation of common distributions and their theory is important. This article was strongly geared towards the proof of concept, so I deliberately avoided theory. One important point, though: the distribution under consideration here is Poisson, not normal.

For anyone else keen to see some of the theory here, I highly recommend looking up the Poisson distribution, which is the appropriate distribution for count data. Wikipedia actually does a really good job of describing the Poisson (and the Normal) distribution. It's a good starting point.
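
As a rough illustration of how the Poisson distribution can be used to flag unusual counts, here is a minimal sketch in base R. The counts, the median-based rate estimate, and the 1% cutoff are all assumptions made for this example; it is not the code from the article.

```r
# Hypothetical daily event counts; the values are made up for illustration.
counts <- c(12, 9, 11, 14, 10, 8, 13, 11, 30, 10, 12, 2)

# Estimate the Poisson rate. The median is used as a rough, outlier-resistant
# stand-in for the mean (an assumption of this sketch).
lambda <- median(counts)

# Tail probabilities of each observed count under Poisson(lambda).
p_upper <- ppois(counts - 1, lambda, lower.tail = FALSE)  # P(X >= x)
p_lower <- ppois(counts, lambda)                          # P(X <= x)

# Flag counts that are improbable in either tail at a hypothetical 1% cutoff.
alpha <- 0.01
is_outlier <- pmin(p_upper, p_lower) < alpha

print(data.frame(count = counts, p_upper, p_lower, is_outlier))
```

The flagged rows are only candidates for a closer look; the cutoff controls how many points get highlighted, not whether they are errors.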

xsevensinzx One Orange Chip Points: 25560 · Answer 3

Interesting stuff. I won't get into what the article clearly skipped: why you chose that data, how you gathered and validated it, and so on. That is really the core of why you chose this model and of what questions are driving the search for the probability of each occurrence. But why did you take the approach of only using code to flag the outliers, rather than explaining why they are happening to begin with?

For example, you can see the outliers as soon as you plot the data, but what methods are commonly used to investigate them, other than simply accepting them like you did? You didn't really explain why they are there. Maybe they come from human error? Maybe from a data extract error? Just accepting them and shading them red does not really teach us how to handle them.

I say this because I have had data scientists come to me with outliers that could have been my fault, because of how they asked for the data. There is a methodology / best practice at play here around the science of explaining the facts of what you are observing, rather than accepting everything and going to the board room with it.

A good example of a retail outlier is a drop shipper. Normal users may be buying items priced between $1 and $20. Then you have an outlier buying in bulk at $1,000. Understanding why that happens can lead to finding out that drop shippers are not being correctly identified and removed from the analysis entirely, because they did not qualify as active users. You can also look at this as discovery. Through your analysis, you discovered a new species of fish that you can now start classifying and tracking, which is a good reason to investigate your results.

rhendrickson Valued Member Points: 64 · Answer 4

Hi, thanks for posting. Please help out a relative newbie with R: what is the meaning of the "Score :=" in your formula? I am not familiar with the apparent assignment operator :=.

rhendrickson Valued Member Points: 64 · Answer 5

rhendrickson - Tuesday, June 13, 2017 7:54 AM

I just found the answer to my question: it is syntax used in the data.table package.
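
For anyone else who has not met it, here is a minimal sketch of what := does in data.table: it adds or updates a column in place. The table, the column names, and the Score formula below are placeholders made up for illustration, not the ones from the article.

```r
library(data.table)

# Hypothetical table; column names and values are made up for illustration.
dt <- data.table(Day = 1:5, Events = c(10, 12, 9, 30, 11))

# := adds (or updates) a column by reference, without copying the table.
# The Score formula is a placeholder: an upper-tail Poisson probability.
dt[, Score := ppois(Events - 1, lambda = median(Events), lower.tail = FALSE)]

# Several columns can be assigned at once with the functional form of :=.
dt[, `:=`(IsOutlier = Score < 0.01, LogScore = log(Score))]

print(dt)
```

Because the assignment happens by reference, there is no need to write dt <- dt[...]; the original table is modified directly.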

nick.dale.burns SSCrazy Points: 2226 · Answer 6

xsevensinzx - Tuesday, June 13, 2017 6:16 AM

Hi there,

You've cut right to the main point of this article, which was to identify the points that are most interesting in the data and then, from there, dive into understanding them, perhaps in a similar way to what you've described.

I'm sure that you have seen heaps of dashboards with dozens, or even hundreds, of charts which all show some view of essentially the same data. This frustrates me. Whatever the industry, I'd like to see a simple representation of "normal" and then more detailed charts / analyses / exploration of the things that deviate from normal - i.e. the interesting things. I want to be really clear here: I never "accepted" the outliers, I simply highlighted them as points of interest.

Theo Ekelmans SSCarpal Tunnel Points: 4592 · Answer 7

Interesting article 🙂

I'm new to R, but this looks pretty useful: finding data points that deviate from those within the bell curve.

Maybe it's time to spin up my own R box to see what it can do.