## K-Means Clustering with Python

Author: xsevensinzx

Hello again!

Figured I would drop some more knowledge, this time on the actual topic of machine learning. This bit covers an unsupervised learning technique, K-Means clustering, in Python. Unsupervised, in case you didn't know, just means we are not going to label the data or tell the algorithm what is good or bad to steer it toward the right results. It just takes some data, learns from it on its own, and outputs a final result.

Some people actually dislike this practice because you're not adding that human bias to the output. You're essentially letting the data speak for itself. This can lead you to believe that the data itself, all that hard work and money you spent on it, may not be as valuable as you thought.

K-Means is a popular clustering approach that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. The trick is finding the right value for K. In the real world, you won't know the "right" value of K to start with, and you'll need to converge on it yourself by analyzing the data.

The following function and outline come from a course I took on Python and machine learning. The function fabricates income and age data, and it also tries to simulate some natural clusters to help us understand how this works.
But you will see that even with the natural clusters, it's not always going to churn out the results we expect from a human eyeball perspective.

```python
from numpy import random, array

# Create fake income/age clusters for N people in k clusters
def createClusteredData(N, k):
    random.seed(10)
    pointsPerCluster = float(N) / k
    X = []
    for i in range(k):
        # Each cluster gets its own random income/age center
        incomeCentroid = random.uniform(20000.0, 200000.0)
        ageCentroid = random.uniform(20.0, 70.0)
        for j in range(int(pointsPerCluster)):
            X.append([random.normal(incomeCentroid, 10000.0),
                      random.normal(ageCentroid, 2.0)])
    X = array(X)
    return X
```

Then we will use K-Means to discover whether there are any clusters in the data using this unsupervised learning:

```python
# Jupyter/IPython magic; remove this line when running as a plain script
%matplotlib inline

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale

data = createClusteredData(100, 5)

model = KMeans(n_clusters=5)

# Note I'm scaling the data to normalize it! Important for good results.
model = model.fit(scale(data))

# We can look at the clusters each data point was assigned to
print(model.labels_)

# And we'll visualize it:
plt.figure(figsize=(8, 6))
plt.scatter(data[:, 0], data[:, 1], c=model.labels_.astype(float))
plt.show()
```

For this run we selected a K value of 5. From the results, we can see that maybe 5 was not the best choice here: from a human eyeball perspective, there are clearly 4 separate clusters of data, not really 5. But that's likely not what we would see in a perfect world either.

Here are the high-level steps of K-Means and what it's doing:

1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move.
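To make those four steps concrete, here is a rough sketch of them in plain NumPy. It is not how scikit-learn's `KMeans` is implemented; the function name `kmeans_sketch`, its parameters, and the choice to start from K randomly picked data points are my own assumptions for illustration.

```python
import numpy as np

def nearest_centroid(X, centroids):
    # Distance from every point to every centroid, then pick the closest one.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Illustrative K-Means loop (hypothetical helper, not sklearn's code)."""
    rng = np.random.default_rng(seed)
    # Step 1: use k distinct data points as the initial centroids (an assumption;
    # other initialization schemes exist).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the group with the closest centroid.
        labels = nearest_centroid(X, centroids)
        # Step 3: recalculate each centroid as the mean of its assigned points
        # (a cluster that ends up empty keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return nearest_centroid(X, centroids), centroids

# Made-up demo data: two blobs, one near (0, 0) and one near (10, 10).
demo_rng = np.random.default_rng(42)
points = np.vstack([
    demo_rng.normal(0.0, 0.5, size=(20, 2)),
    demo_rng.normal(10.0, 0.5, size=(20, 2)),
])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, a different `seed` can give different cluster numbering and, on messier data, a different final grouping. That is part of why the result does not always match what your eyes see.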
Running these steps to convergence produces a separation of the objects into groups, from which the metric to be minimized (for K-Means, the within-cluster sum of squared distances to each centroid) can be calculated.

### In Closing

This is just to show you how easy it is to use Python for machine learning. I'm no master data scientist or whatever, but from a technology or IT perspective, I can use something like this to identify new classifications of the data I have in SQL Server. This may help me create new features from my existing data to help my team understand which data is doing what and why. For example, in marketing, maybe I can identify clusters of users who spend more but click less, or click more and spend less, and group them into classifications for further breakouts of those users in the final reports. That could help the team easily identify them, call them out, and maybe even target them in future responses to what the results show.
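As a rough sketch of that marketing idea, the snippet below clusters some made-up spend and click numbers and uses the cluster id as a new "segment" feature. The data, column meanings, and group sizes are all hypothetical; in practice the rows would come from your own SQL Server query rather than a random generator.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical per-user data, invented for illustration:
# 50 users who spend a lot but click little, 50 who click a lot but spend little.
spend = np.concatenate([rng.normal(500.0, 20.0, 50), rng.normal(50.0, 10.0, 50)])
clicks = np.concatenate([rng.normal(5.0, 1.0, 50), rng.normal(40.0, 4.0, 50)])
users = np.column_stack([spend, clicks])

# Scale both columns so spend (hundreds) does not swamp clicks (tens).
scaled = StandardScaler().fit_transform(users)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
segment = model.labels_  # one cluster id per user: the new feature

# Summarize each segment in the original units, e.g. for a report breakout.
for s in np.unique(segment):
    members = users[segment == s]
    print(f"segment {s}: {len(members)} users, "
          f"mean spend {members[:, 0].mean():.0f}, "
          f"mean clicks {members[:, 1].mean():.0f}")
```

The `segment` array lines up row for row with `users`, so it can be written back next to the source rows as an extra column.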


Copyright © 2002-2019 Redgate. All Rights Reserved.