July 4, 2017 at 11:58 pm
Comments posted to this topic are about the item Machine Learning for Outlier Detection in R
July 5, 2017 at 12:08 am
Hi Nick,
I would be careful about what type of data is used in PCA, as this algorithm is sensitive to the type of data (nominal, ordinal, categorical, interval). This matters especially in this part, because it is distance based:
distance_matrix <- as.matrix(dist(scale(mtcars)))
pca <- prcomp(distance_matrix)
What puzzled me is a simple content question: what is the outlier in your case? When looking for outliers, one should have a clear goal for how an outlier is defined (in the content sense), so that it is clearer what one is looking for and what the threshold values for such outliers are.
Best, Tomaž
Tomaž Kaštrun | twitter: @tomaz_tsql | blog: https://tomaztsql.wordpress.com/
July 5, 2017 at 12:47 am
Hi Tomaž,
Completely agree with you about being aware of your data and whether a Euclidean distance metric is appropriate. In this case, all the features are continuous or ordinal, so a Euclidean distance measure is both sensible and appropriate. With regard to PCA itself, this is a technique that partitions variation and exploits correlation amongst features. So it can definitely be applied to categorical features. Take, for example, genomics, which routinely uses PCA to explain genetic variability based on categorical allele counts (0, 1 or 2). Should you apply PCA (or any technique) blindly, without understanding the assumptions behind it and its appropriateness to your data? Heck no.
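For anyone following along, here is a minimal sketch of how one might check the feature types before running the distance and PCA steps quoted above. It uses only base R and the built-in mtcars data; the choice to look at the first three components is just for illustration, not something from the article.

```r
# mtcars ships with base R (datasets package)
str(mtcars)                                   # inspect the column types first

# Confirm every feature is numeric before computing Euclidean distances
all_numeric <- all(sapply(mtcars, is.numeric))

scaled <- scale(mtcars)                       # centre and scale each feature
distance_matrix <- as.matrix(dist(scaled))    # dist() defaults to Euclidean
pca <- prcomp(distance_matrix)

# Proportion of variance explained by the first three components
summary(pca)$importance["Proportion of Variance", 1:3]
```

If any column turned out to be a factor or character vector, dist() on the raw data would not be meaningful, and a different dissimilarity measure would be worth considering.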
As for your question, I guess I can't answer it. I completely agree that the way you interpret results is very important and, as you say, that interpretation depends very much on your own problem and your understanding of it. So a pinch of salt will always go a long way.
July 5, 2017 at 8:29 am
Nick,
Loved the article. Can you point me to information on the installation/setup of R to be able to run your examples? I have installed R, but need some guidance on installing packages/libraries.
Thanks,
CBS
July 5, 2017 at 3:09 pm
Hi There,
Welcome to R! First of all, google RStudio and install this as your IDE. By far my favourite exploratory IDE for R. From RStudio, you can install most packages using the install.packages() function:
# package names are case-sensitive; the CRAN package is lower-case "dbscan"
install.packages("dbscan")
library(dbscan)
Good luck.
July 6, 2017 at 2:54 am
Thank you Nick,
Agree with your points. Thank you for pointing them out, and thumbs up on the article. 🙂
Tomaž Kaštrun | twitter: @tomaz_tsql | blog: https://tomaztsql.wordpress.com/
July 6, 2017 at 7:05 am
Nick,
Thank you! RStudio makes it much easier...
CBS
July 22, 2017 at 9:32 am
Thanks for the article. So how do you define an outlier in an automated process? Is it those observations that fall in the first cluster?
July 23, 2017 at 1:44 am
Hi Jonathan,
How do you set your thresholds for an automated process? With a lot of care! Have a quick read of the documentation for R's dbscan package. Any points not assigned to a cluster (noise points) are labelled with a zero in the returned cluster vector.
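As a rough sketch of what that looks like in practice, assuming the dbscan package from CRAN is installed: the eps and minPts values below are illustrative guesses on the scaled mtcars data, not tuned or recommended settings.

```r
library(dbscan)

x <- scale(mtcars)                        # same scaling as in the article

# eps and minPts here are placeholder values chosen for illustration only
fit <- dbscan(x, eps = 2, minPts = 4)

# Points labelled 0 were not assigned to any cluster (noise)
is_noise <- fit$cluster == 0
outliers <- rownames(mtcars)[is_noise]

table(fit$cluster)                        # cluster sizes, including the 0 group
outliers
```

So it is the zero group, not the first cluster, that you would treat as candidate outliers.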
The question, then, is how do you pick the hyper-parameters (the neighbourhood radius and min points in the case of DBSCAN)? I would be very nervous about leaving this to a truly automated process. Like any science, you will need to collect data, run proofs of concept, evaluate these and develop your solution on historical data. It's the only way to do anything and retain confidence in your future results.
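One way to start that kind of evaluation is sketched below, again assuming the dbscan package. It looks at sorted k-nearest-neighbour distances (a common heuristic for eyeballing a radius) and counts noise points over a small grid of radii. The min_pts value and the eps_grid candidates are hypothetical and would need to come from your own data and problem.

```r
library(dbscan)

x <- scale(mtcars)
min_pts <- 4                              # hypothetical choice for illustration

# Sorted distance to the (min_pts - 1)-th nearest neighbour; a sharp bend in
# this curve is often used as a starting guess for eps
knn_d <- sort(kNNdist(x, k = min_pts - 1))
plot(knn_d, type = "l", xlab = "points (sorted)", ylab = "kNN distance")

# How sensitive is the number of flagged points to the radius?
eps_grid <- c(1.5, 2, 2.5, 3)             # hypothetical candidates
noise_counts <- sapply(eps_grid, function(e) {
  sum(dbscan(x, eps = e, minPts = min_pts)$cluster == 0)
})
data.frame(eps = eps_grid, noise_points = noise_counts)
```

If the count of flagged points swings wildly between neighbouring radii, that is a sign the threshold needs more evidence from historical data before you automate anything.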
Cheers,
Nick