• BuckWoody - Monday, May 29, 2017 6:10 AM

    Thanks for posting this! I love to see applications of R in SQL. A couple of notes: You might want to think about adding a step for Principal Component Analysis (PCA) in front of the K-Means. The data have different scales, and K-Means is less useful on data with different scales, because the features with the largest ranges dominate the distance calculation. Also, you might want to talk a bit about the cluster assignments themselves, and how you chose the numbers you did. Great stuff! (More on K-Means and homogenized datasets: http://www.statmethods.net/advstats/cluster.html; more on PCA: http://setosa.io/ev/principal-component-analysis/)

    Hi BuckWoody, thanks for your pointers here - I completely agree.

    I wonder if the issue of scaling might explain why the cluster centroids are so strongly dominated by one feature (with the exception of cluster 3), but I can't quite remember what this plot was. The code uses different variable names (behaviours, b2), and I can no longer remember what they refer to. A bad case of not being consistent and not leaving notes for my future self! This article could use some editing now, after 3 years 🙂
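
The scaling concern raised in this thread can be illustrated with a small sketch. It is not the article's code: the rows below are made-up values (visits per month, annual spend), and `standardize` and `nearest_to_first` are hypothetical helper names. It only shows how the Euclidean distance that K-Means relies on is dominated by the larger-scale column until each column is standardized.

```python
import math
import statistics

# Hypothetical data, not from the article: (visits per month, annual spend).
rows = [
    (2.0, 51000.0),
    (30.0, 50000.0),
    (3.0, 54000.0),
]


def standardize(data):
    """Rescale each column to mean 0 and population standard deviation 1."""
    columns = list(zip(*data))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in data
    ]


def nearest_to_first(data):
    """Index of the row closest (Euclidean distance) to row 0, excluding row 0."""
    return min(range(1, len(data)), key=lambda i: math.dist(data[0], data[i]))


# On the raw values, the spend column dominates the distance.
raw_nearest = nearest_to_first(rows)

# After standardizing, both columns contribute on a comparable scale.
scaled_nearest = nearest_to_first(standardize(rows))

print("nearest to row 0 (raw):", raw_nearest)
print("nearest to row 0 (standardized):", scaled_nearest)
```

With these made-up values, the raw distances pair row 0 with row 1 (similar spend, very different visit counts), while the standardized distances pair it with row 2 (similar visit counts). In R, the base `scale()` function performs a comparable column-wise standardization, using the sample standard deviation, and could be applied before `kmeans()`.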