Completely agree with you, Grant. Understanding the various tools and their applications definitely involves a learning curve!
I definitely second the earlier recommendation for Andrew Ng's Machine Learning course on Coursera, though I think it focuses heavily on understanding the algorithms rather than on applying and interpreting the analysis. I wonder if a course such as this one (https://onlinecourses.science.psu.edu/statprogram/stat505) would provide greater intuition on the practical aspects of analysing our kind of data sets? It looks very similar to one of the more useful papers we had at uni.
Well done on a good introductory article.
From my point of view, what's really important is that R and RStudio now offer a comprehensive suite of statistical tools that can be tied to SQL data without having to fork out the kind of budgets normally associated with something like SAS. As you rightly point out, this makes proper statistical analysis feasible in areas where it was previously financially prohibitive. That includes analysing server stats, but let's not forget that whole tranches of user applications are now becoming worthwhile too.
Obviously, if you want to get the most out of this kind of functionality, employing someone with a good solid statistical understanding is sensible, and that doesn't necessarily mean someone who works in IT. Nonetheless, predefined data mining algorithms are still viable for known standard processes such as the one this article has outlined. What a quantum leap from Perfmon, a bit of googling and some guesses based on instinct.
Semper in excretia, suus solum profundum variat
Thanks for posting this! I love to see applications of R in SQL. A couple of notes: You might want to think about adding in a step for Principal Component Analysis (PCA) in front of the K-Means. The data have different scales, and K-Means is less useful over data with different scales, because the feature with the largest numeric range dominates the distance calculation. Also, you might want to talk a bit about the cluster assignments themselves, and how you chose the numbers you did. Great stuff! (More on K-Means and homogenized datasets: http://www.statmethods.net/advstats/cluster.html; more on PCA: http://setosa.io/ev/principal-component-analysis/)
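To make the scaling point concrete, here is a minimal sketch in plain Python (not the article's R code) of why different scales matter for a distance-based method like K-Means. It z-scores each column and then checks which row is the nearest neighbour of the first row, before and after scaling. The sample values and the column meanings (cpu_percent, page_reads) are made up for illustration, and the helper names `standardize` and `nearest` are hypothetical. Note that PCA on its own does not remove scale differences unless it is run on standardized data, so the scaling step is what is shown here.

```python
import math


def standardize(rows):
    """Z-score every column: subtract its mean, divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for a constant column.
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]


def nearest(rows, index):
    """Index of the row closest (Euclidean distance) to rows[index]."""
    others = [i for i in range(len(rows)) if i != index]
    return min(others, key=lambda i: math.dist(rows[index], rows[i]))


# Hypothetical samples: (cpu_percent, page_reads). Illustrative numbers only.
samples = [
    (10.0, 5000.0),  # quiet CPU
    (12.0, 5600.0),  # quiet CPU, somewhat more reads
    (90.0, 5100.0),  # busy CPU, similar reads to the first row
    (88.0, 4300.0),  # busy CPU, fewer reads
]

raw_neighbour = nearest(samples, 0)
scaled_neighbour = nearest(standardize(samples), 0)

print("nearest to row 0 on raw data:   ", raw_neighbour)
print("nearest to row 0 on scaled data:", scaled_neighbour)
```

On these made-up numbers, the raw distances pair the first row with a busy-CPU row simply because page_reads is measured in thousands, while the scaled distances pair it with the other quiet-CPU row. In R, the base function scale() does the equivalent centring and scaling (it divides by the sample standard deviation rather than the population one used in this sketch).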
MCDBA, MCSE, Novell and Sun Certified
If medium-sized organizations would simply collect the data, they might both be more interested in analyzing it statistically and have the information to address many of their problems without using advanced statistics.
BuckWoody - Monday, May 29, 2017 6:10 AM
Hi BuckWoody, thanks for your pointers here - completely agree.
I wonder if the issue of scaling might explain why the cluster centroids are so strongly dominated by one feature (with the exception of cluster 3), but then I can't quite remember what this plot was: the code uses different variable names (behaviours, b2) whose meaning I can no longer remember. A bad case of not being consistent and not making notes for my future self! This article could use some editing now, after 3 years 🙂