Machine Learning for Outlier Detection in R

  • nick.dale.burns

    SSCrazy

    Points: 2226

    Comments posted to this topic are about the item Machine Learning for Outlier Detection in R

  • tomaz.kastrun

    SSCrazy

    Points: 2085

    Hi Nick,
    I would be careful about what type of data is used in PCA, as this algorithm is sensitive to the type of data (nominal, ordinal, categorical, interval). That applies especially to this part, because it is a distance-based approach:

    distance_matrix <- as.matrix(dist(scale(mtcars)))
    pca <- prcomp(distance_matrix)

    What puzzled me is a simple content question: what is the outlier in your case? When looking for outliers, one should have a clear goal for how to define an outlier (in the content sense), so that both "what one is looking for" and the threshold values for such outliers become easier to pin down.

    Best, Tomaž

    Tomaž Kaštrun | twitter: @tomaz_tsql | blog:  https://tomaztsql.wordpress.com/

  • nick.dale.burns

    SSCrazy

    Points: 2226

    tomaz.kastrun - Wednesday, July 5, 2017 12:08 AM

    Hi Tomaž,

    Completely agree with you about being aware of your data and whether a Euclidean distance metric is appropriate. In this case, all the features are continuous or ordinal, so a Euclidean distance measure is both sensible and appropriate. With regard to PCA itself, this is a technique that partitions variation and exploits correlation amongst features. So it can definitely be applied to categorical features: take, for example, genomics, which routinely uses PCA to explain genetic variability based on categorical allele counts (0, 1 or 2). Should you apply PCA (or any technique) blindly, without understanding the assumptions behind it and its appropriateness to your data? Heck no.
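
    As a rough illustration of "knowing your data" before computing distances, the sketch below checks the column types of mtcars, rebuilds the distance matrix from the snippet in Tomaž's post, and looks at how much variance each principal component carries. The variable names `scaled` and `variance_explained` are my own, and this is only one way to sanity-check the input, not necessarily what the article does.

```r
# All mtcars columns are stored as numeric, so Euclidean distance can be computed
stopifnot(all(sapply(mtcars, is.numeric)))

# Standardise each feature so no single unit of measurement dominates the distance
scaled <- scale(mtcars)

# Pairwise Euclidean distances between cars, as in the snippet under discussion
distance_matrix <- as.matrix(dist(scaled, method = "euclidean"))

# PCA on the distance matrix
pca <- prcomp(distance_matrix)

# Proportion of total variance carried by each principal component
variance_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(head(variance_explained, 3), 3))
```

    Whether numeric storage means "continuous or ordinal" in a meaningful sense is still a judgment call that depends on the data; the type check only rules out factors and strings.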

    As for your question, I guess I can't answer it. I completely agree with you that the way you interpret results is very important. As you state, that interpretation will rely very much on your own problem and your understanding of it. So a pinch of salt will always go a long way.

  • cstater

    Old Hand

    Points: 370

    Nick,

    Loved the article.  Can you point me to information on the installation/setup of R to be able to run your examples?  I have installed R, but need some guidance on installing packages/libraries.

    Thanks,
    CBS

  • nick.dale.burns

    SSCrazy

    Points: 2226

    cstater - Wednesday, July 5, 2017 8:29 AM

    Hi There,

    Welcome to R! First of all, google RStudio and install this as your IDE. By far my favourite exploratory IDE for R. From RStudio, you can install most packages using the install.packages() function:


    install.packages("dbscan")   # package names are case-sensitive; the CRAN package is lowercase
    library(dbscan)
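
    If you would rather have a script install a package only when it is missing, something like the following sketch may help. `ensure_package` is a hypothetical helper name of my own, not a function that ships with R; it only wraps `requireNamespace()` and `install.packages()`. Note that package names are case-sensitive and the CRAN package is the lowercase `dbscan`.

```r
# Hypothetical helper: install a package from CRAN only if it is not already available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # TRUE if the package can be loaded after the (possible) install
  requireNamespace(pkg, quietly = TRUE)
}

# 'stats' ships with R, so this returns TRUE without installing anything
stats_available <- ensure_package("stats")

# For the clustering examples you would call (needs an internet connection):
# ensure_package("dbscan")
# library(dbscan)
```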

    Good luck.

  • tomaz.kastrun

    SSCrazy

    Points: 2085

    Thank you Nick,

    Agree with your points. Thank you for pointing them out, and thumbs up on the article. 🙂


  • cstater

    Old Hand

    Points: 370

    nick.dale.burns - Wednesday, July 5, 2017 3:09 PM

    Nick,
    Thank you!  RStudio makes it much easier...

    CBS

  • Jonathan Mallia

    SSCertifiable

    Points: 5192

    Thanks for the article. So how do you define an outlier in an automated process? Is it those observations that fall in the first cluster?

  • nick.dale.burns

    SSCrazy

    Points: 2226

    Jonathan Mallia - Saturday, July 22, 2017 9:32 AM

    Hi Jonathan,

    How do you set your thresholds for an automated process? With a lot of care! Have a quick read of R's documentation for DBSCAN. Any points not assigned to a cluster are labelled with a zero.
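
    To make the zero label concrete, here is a minimal sketch using the `dbscan` package on the scaled mtcars data. The `eps` and `minPts` values are placeholders chosen purely for illustration, and clustering the scaled features directly is my assumption; which cars end up flagged will change with those choices.

```r
library(dbscan)

# Standardise the features before measuring neighbourhood distances
scaled <- scale(mtcars)

# eps and minPts are illustrative assumptions, not recommended values
clusters <- dbscan(scaled, eps = 2, minPts = 4)

# Cluster label 0 marks noise points, i.e. the candidate outliers
outliers <- rownames(mtcars)[clusters$cluster == 0]

print(table(clusters$cluster))
print(outliers)
```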

    The question, then, is how do you pick the hyper-parameters (the neighbourhood radius and minimum number of points in the case of DBSCAN)? I would be very nervous about leaving this to a truly automated process. Like any science, you will need to collect data, run proofs of concept, evaluate these and develop your solution on historical data. It's the only way to do anything and retain confidence in your future results.
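
    One common starting point for the neighbourhood radius is to look at each point's distance to its k-th nearest neighbour and see where the sorted distances bend sharply. The sketch below does this in base R; `k = 4` is an assumed candidate for the minimum number of points, not a recommendation, and the result is something to inspect by eye against historical data rather than a threshold to automate.

```r
# Standardise the features and compute all pairwise Euclidean distances
scaled <- scale(mtcars)
d <- as.matrix(dist(scaled))

k <- 4  # assumed candidate for the minimum number of points

# Distance from each point to its k-th nearest neighbour.
# After sorting a row, position 1 is the point itself (distance 0), so take k + 1.
kth_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted k-NN distances: a sharp bend ("knee") hints at a candidate radius
knn_curve <- sort(kth_dist)
print(round(quantile(knn_curve), 2))

# To inspect the curve visually:
# plot(knn_curve, type = "l", ylab = paste0(k, "-NN distance"))
```

    The `dbscan` package also provides `kNNdistplot()` for the same kind of plot.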

    Cheers,
    Nick
