• Jonathan Mallia - Saturday, July 22, 2017 9:32 AM

    Thanks for the article. So how do you define an outlier in an automated process? Is it the observations that fall in the first cluster?

    Hi Jonathan,

    How do you set your thresholds for an automated process? With a lot of care! Have a quick read of R's documentation for DBSCAN. Any points not assigned to a cluster are labelled with a zero, and those noise points are your outliers, rather than the members of any one cluster.

    The question, then, is how do you pick the hyper-parameters (the neighbourhood radius and the minimum number of points, in the case of DBSCAN)? I would be very nervous about leaving this to a truly automated process. As in any science, you will need to collect data, run proofs of concept, evaluate them and develop your solution on historical data. It's the only way to retain confidence in your future results.
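
    To make that concrete, here is a rough sketch of the kind of proof of concept I mean. It is a minimal, textbook-style DBSCAN written in plain Python (not R's implementation), and it borrows R's convention of labelling noise with a zero. The names `dbscan`, `outlier_indices` and `sweep`, the nine `points` and the grid of values are all made up for illustration; you would substitute your own historical data and a range that makes sense for it.

```python
import math
from itertools import product

NOISE = 0  # assumption: mirror R's convention, 0 = not assigned to any cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0 for noise, 1..k for clusters."""
    labels = [None] * len(points)  # None = not visited yet
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is a core point, keep growing
                queue.extend(j_neighbours)
    return labels


def outlier_indices(labels):
    """Positions of the points that DBSCAN left unassigned."""
    return [i for i, label in enumerate(labels) if label == NOISE]


def sweep(points, eps_values, min_pts_values):
    """Count outliers for every (eps, min_pts) pair in the grid."""
    return {
        (eps, m): len(outlier_indices(dbscan(points, eps, m)))
        for eps, m in product(eps_values, min_pts_values)
    }


# Hypothetical "historical" data: two tight groups and one isolated point.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
    (9.0, 1.0),
]

labels = dbscan(points, eps=0.5, min_pts=3)
outliers = outlier_indices(labels)
counts = sweep(points, eps_values=(0.1, 0.5, 10.0), min_pts_values=(3, 5))

for (eps, m), n in sorted(counts.items()):
    print(f"eps={eps:<5} min_pts={m}  outliers={n}")
```

    On this toy data a radius of 0.5 with a minimum of 3 points flags only the isolated point, a radius of 0.1 flags everything, and a radius of 10 flags nothing. That sensitivity is exactly why I would want a person to look at a sweep like this on real historical data before fixing the thresholds.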

    Cheers,
    Nick