
Machine Learning for Outlier Detection in R


When we think about outliers, we typically think in one dimension, for example, people who are exceptionally tall. Basic statistical tests for outliers (e.g. those in R's outliers package) are designed specifically for such univariate settings. But in reality there is usually a wealth of supporting information that changes the picture: basketball players, for example, tend to be exceptionally tall compared to the rest of the population, yet a player who looks tall in the general population might still be short when compared amongst other basketball players.

We have the same situation in business datasets: understanding customer behaviour, for example, relies on supporting information such as age, gender, length of history, etc. In these settings, we need methods which can detect outliers across all attributes of a dataset; statistically, we need multivariate tests for outliers. In this post, we will look at three methods for multivariate outlier detection: the Mahalanobis distance (a multivariate extension of standard univariate tests) and two clustering techniques, DBSCAN and expectation maximisation (EM).

The Data

We're going to use one of R's sample datasets, mtcars. The mtcars dataset records details such as miles per gallon (mpg), number of cylinders (cyl) and horsepower (hp) for 32 different cars. You can read the (brief) documentation on this dataset with ?mtcars in R. The head of the data is shown below:
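If you want to follow along, mtcars ships with base R, so the rows below can be printed with a single call:

head(mtcars)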

              mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
...
Valiant       18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

A quick look at some selected columns doesn't reveal any startlingly obvious outliers:
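A plot along these lines can be produced with base R's pairs() function; the exact columns below are an illustrative choice, not necessarily the ones used for the figure:

pairs(mtcars[, c("mpg", "hp", "qsec", "wt", "disp")], col = "steelblue")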

[Figure: pairs plot of selected mtcars columns]

If we squint, there might be some outliers in the hp and qsec columns, but there isn't anything truly compelling in these plots. A full pairs plot is also awkward to keep coming back to, so for the rest of this article we will use principal components analysis to create a simple two-dimensional mapping for visualisation. The code and plot are shown below:

library(ggplot2)
library(data.table)

# Scale the data, compute pairwise distances between cars and project to two dimensions
distance_matrix <- as.matrix(dist(scale(mtcars)))
pca <- prcomp(distance_matrix)

# Keep the first two principal components and attach the car names
embedding <- data.table(pca$x[, 1:2])
embedding[, CarNames := rownames(mtcars)]

ggplot(embedding, aes(x = PC1, y = PC2)) +
    geom_point(size = 10, colour = "steelblue", alpha = 0.3) +
    geom_text(aes(label = CarNames), check_overlap = TRUE) +
    theme_minimal()

[Figure: two-dimensional PCA embedding of the mtcars data, labelled by car name]

This two-dimensional visualisation is definitely easier to view and already points towards some possible outliers (Ferrari Dino, Ford Pantera L, Maserati Bora). Let's explore some outlier detection methods.
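As an aside, note that we ran prcomp() on a distance matrix rather than on the scaled data itself. A more conventional way to turn the same distances into a two-dimensional map is classical multidimensional scaling, available in base R as cmdscale(); a minimal sketch (not used in the rest of the article) would be:

# Classical MDS on the same scaled distances, keeping two coordinates per car
mds <- cmdscale(dist(scale(mtcars)), k = 2)
head(mds)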

Mahalanobis distance

We're going to begin with the Mahalanobis distance. This is a statistical measure rather than a machine learning method, strictly speaking, but it is an extension of the simpler univariate tests that we have previously looked at. Recall that typical univariate outlier measures are variance-adjusted distances; for example, the z-score is calculated as:

z_i = \frac{x_i - \mu_x}{\sigma_x}

Where x_i is an observation (row), \mu_x is the mean of all the rows and \sigma_x is the standard deviation. To put this in simple terms, the z-score measures how far from "normal" an observation is, adjusted by the normal amount of variation that we observe (i.e. measures with large spread require an observation to be further from the mean to count as an outlier than measures with small spread). The Mahalanobis distance is an extension of the univariate z-score which also accounts for the correlation structure between all the features (columns):

D_M(x_i) = \sqrt{(x_i - \mu)^T \, \Sigma^{-1} \, (x_i - \mu)}

Where \Sigma^{-1} is the inverse of the covariance matrix of the dataset and \mu is the vector of column means. Enough formulas; let's see the R code and the results:

library(MVN)

# mvOutlier flags multivariate outliers using Mahalanobis distances and a chi-square quantile
# (in newer releases of the MVN package this functionality may live in the mvn() function instead)
results <- mvOutlier(scale(mtcars), qqplot = TRUE, method = "quan")
results <- data.table(CarNames = rownames(results$outlier), MahalanobisOutlier = results$outlier$Outlier)

# Join the outlier flag onto the embedding without reordering its rows
# (a sorted merge would scramble the later cluster assignments, which assume mtcars row order)
embedding[results, MahalanobisOutlier := i.MahalanobisOutlier, on = "CarNames"]

ggplot(embedding, aes(x = PC1, y = PC2)) +
    geom_point(aes(colour = MahalanobisOutlier == TRUE), size = 10, alpha = 0.3) +
    geom_text(aes(label = CarNames), check_overlap = TRUE) +
    scale_colour_manual(values = c("steelblue", "red")) +
    theme_minimal()

[Figure: PCA embedding with Mahalanobis outliers highlighted in red]

This is really interesting. The Mahalanobis distance confirms our suspicions about the Ferrari Dino, Ford Pantera L and Maserati Bora. It also highlights some other cars (some labels are missing due to overlap): the Merc 280, Merc 280C, Toyota Corona and the Lotus Europa. These cars are an interesting mix of supercars (e.g. Ferrari, Maserati, Lotus) and older cars (e.g. Toyota Corona, the Mercs) that clearly bear little resemblance to the other cars in the dataset. Very interesting.
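As a rough cross-check on the MVN results, the same distances can be computed with base R's mahalanobis() function. The chi-squared cutoff below is a common convention, not necessarily the exact threshold mvOutlier uses:

# Squared Mahalanobis distance of each car from the centre of the scaled data
scaled <- scale(mtcars)
d2 <- mahalanobis(scaled, center = colMeans(scaled), cov = cov(scaled))

# Flag cars beyond the 97.5% chi-squared quantile (one degree of freedom per column)
cutoff <- qchisq(0.975, df = ncol(scaled))
rownames(mtcars)[d2 > cutoff]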

Right, we promised to look at machine learning methods, specifically clustering methods, for outlier detection. So let's do that, but we'll keep the results of the Mahalanobis test as our gold standard (because it is based on well-established theory rather than heuristic algorithms).

DBSCAN

Let's begin with DBSCAN. We don't have the space to fully explain these algorithms here; I encourage you to google DBSCAN and expectation maximisation for more details. To explain DBSCAN simply, though, it is a density-based clustering algorithm. We need to define two parameters for DBSCAN:

  1. the neighbourhood (eps): the radius to search around a point (a quick heuristic for choosing this value is sketched just after this list)
  2. minPts: the minimum number of neighbouring points necessary to say that an observation is a "core point" and not an outlier (slight simplification, but it will do)
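A common heuristic for choosing eps is to plot the distance from each point to its k-th nearest neighbour and look for the "knee" in the curve; the dbscan package provides kNNdistplot() for exactly this. The k and reference line below are illustrative assumptions rather than tuned settings from the article:

library(dbscan)

# Sorted distance from each car to its 3rd nearest neighbour; the knee suggests a sensible eps
kNNdistplot(scale(mtcars), k = 3)
abline(h = 2, lty = 2)  # eps = 2, the value used below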

The R code and results are below:

library(dbscan)

# Cluster the scaled data; cars that belong to no cluster are labelled 0 (noise/outliers)
embedding[, DClusters := dbscan(scale(mtcars), eps = 2, minPts = 3)$cluster]

ggplot(embedding, aes(x = PC1, y = PC2)) +
    geom_point(aes(colour = factor(DClusters)), size = 10, alpha = 0.3) +
    geom_text(aes(label = CarNames), check_overlap = TRUE) +
    theme_minimal()

[Figure: PCA embedding coloured by DBSCAN cluster, with cluster 0 as outliers]

Let's begin by looking at the outliers, which form cluster 0 (orange). There is a clear "band" of outliers through the centre, with good overlap with the results from the Mahalanobis distances. There are some clear differences, however; for example, DBSCAN also isolates the Mazdas, the Hornet and the Valiant. Remember that this is a density-based algorithm, so given the neighbourhood and density that we defined, there are not enough similar cars in this dataset to say that these cars are "representative" or "normal". This makes sense if you think of density as a kind of "voting" or "popularity" measure. It's really important to note that the differing assumptions behind the Mahalanobis distance and the DBSCAN algorithm are what drive the disagreement between these results.
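Because the cluster labels now live in the embedding table, it only takes a one-liner (using the data.table syntax from earlier) to list exactly which cars DBSCAN left in the noise cluster:

# Cars assigned to cluster 0, i.e. not dense enough to join any cluster
embedding[DClusters == 0, CarNames]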
Perhaps more interesting are the clusters that DBSCAN found. There is clearly a "small car" cluster (1), a Mercedes cluster (3) and two muscle car clusters (2 & 4). 

Expectation Maximisation

As with DBSCAN, we won't explain the expectation maximisation (EM) algorithm here. In fact, it is quite complicated and we couldn't begin to do it justice. Let's simply say that it is an unsupervised clustering algorithm that tries to find "similar subspaces" based on their orientation and variance. Sigh, even that wasn't simple. Never mind, let's see it in action:

library(mclust)

# Fit a Gaussian mixture model with four components to the scaled data
cars_em <- Mclust(scale(mtcars), G = 4)

# Attach each car's cluster assignment to the 2-D embedding
embedding[, EMClusters := cars_em$classification]

ggplot(embedding, aes(x = PC1, y = PC2)) +
    geom_point(aes(colour = factor(EMClusters)), size = 10, alpha = 0.3) +
    geom_text(aes(label = CarNames), check_overlap = TRUE) +
    theme_minimal()

[Figure: PCA embedding coloured by EM cluster assignment]

Cluster 1 is the "outlier cluster". Again, the Ferrari, Ford Pantera and Maserati have been found to be outliers, and like DBSCAN, the two Mazda RX4s are outliers. The Toyota Corona and Merc 280s aren't outliers by either DBSCAN or EM. If we compare the clusters identified by EM with those from DBSCAN, we can see there is very good agreement. In fact, I might prefer EM here, as it has grouped together all of the American / muscle cars into cluster 4.
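Note that we fixed the number of clusters at G = 4. If that choice feels arbitrary, mclust can pick the number of components itself by BIC; a minimal sketch (using mclust's default search range, an assumption rather than something tuned for this article):

library(mclust)

# With no G supplied, Mclust tries a range of cluster counts and keeps the best BIC
cars_em_auto <- Mclust(scale(mtcars))
summary(cars_em_auto)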

Wrap up

Overall, there is strong agreement between all three methods that the Ferrari, Ford Pantera and Maserati are outliers in this dataset. Depending on which method you choose, there are other cars which might also be outliers. This is a typical problem with machine learning (or perhaps statistics in general): your results depend on your methods, so you need to understand the assumptions that underpin each method and be able to weigh them against the data you have. In this example, we don't actually have much data (32 cars), so making a strong case for any one algorithm or result set is difficult. In fact, I would be skeptical of the Mahalanobis distance and the EM results simply because there is not a lot of data from which to estimate variance. If I were being conservative, I would lean towards the results from DBSCAN, which relies on fewer assumptions.
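To make that comparison concrete, the flags from all three methods can be lined up side by side using the columns we built along the way (a sketch; the outlier labels for DBSCAN and EM are cluster 0 and cluster 1 respectively, as discussed above):

# One row per car, TRUE where the method considers it an outlier
embedding[, .(CarNames,
              Mahalanobis = MahalanobisOutlier == TRUE,
              DBSCAN      = DClusters == 0,
              EM          = EMClusters == 1)]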
