RE: Using SQL Server and R Services for analyzing Sales data (Part 3)

January 21, 2017 at 8:35 am

#1924220

Jonathan Mallia - Saturday, January 21, 2017 6:58 AM
Hi,
I was wondering why you did not use the CustomerKey when you created the cluster in:
dist(Sales[,c(1,3,5)])
Wouldn't it have been more effective to cluster by customer rather than by productgroup only?
Thanks in advance for the explanation!

Hi,

CustomerKey is just a running ID for each customer in the database. In this case, clustering is done on the attributes of the customers (the observations), and CustomerKey is not an attribute that describes or reveals any information about the customer. If it were included, it could only add noise relative to the real, natural attributes.
Attributes for a customer can be business information (number of transactions created, value of invoices, basket values, business type) or demographic information (area, city, country, age, etc.). All of these attributes describe the customer. CustomerKey, on the other hand, does not describe the customer, nor is it related to the customer in any meaningful way. It is just a database identifier.
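
To illustrate, here is a minimal R sketch on made-up data. The `customers` data frame, its column names and its values are hypothetical and not taken from the article. It assumes raw, unscaled columns passed to dist() and hclust() with the default complete linkage, once without CustomerKey and once with it.

```r
# Hypothetical toy data: customers 1 and 3 behave identically, as do 2 and 4,
# but their surrogate keys are far apart.
customers <- data.frame(
  CustomerKey  = c(11000, 11001, 25000, 25001),
  Transactions = c(2, 40, 2, 40),
  BasketValue  = c(15, 300, 15, 300)
)

# Cluster on the descriptive attributes only.
d_attrs <- dist(customers[, c("Transactions", "BasketValue")])
groups_attrs <- cutree(hclust(d_attrs), k = 2)

# Cluster again with the database identifier included.
d_key <- dist(customers[, c("CustomerKey", "Transactions", "BasketValue")])
groups_key <- cutree(hclust(d_key), k = 2)

# Compare the two cluster assignments side by side.
print(data.frame(customers, groups_attrs, groups_key))
```

Without the key, customers with the same behaviour end up in the same cluster. With the key, the distance between customers 1 and 3 jumps from 0 to 14000, so the grouping follows the ID ranges instead of the behaviour. Scaling the columns would reduce the effect, but customers with identical attributes would still get a non-zero distance.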

ProductGroup can be added, because it describes the products the customer is buying or selling. But if all the customers are buying all the products, it is worth rethinking whether, and how, you want to include such an attribute.
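
A quick way to check for that situation, sketched in R on made-up data (the `purchases` data frame and its values are hypothetical): build a customer-by-product-group indicator matrix and look at the variance of each column. A column with zero variance adds nothing to the distances between customers.

```r
# Hypothetical purchases: every customer buys from every product group.
purchases <- data.frame(
  CustomerKey  = c(1, 1, 2, 2, 3, 3),
  ProductGroup = c("Bikes", "Clothing", "Bikes", "Clothing", "Bikes", "Clothing")
)

# Customer-by-group matrix: 1 if the customer bought from the group, else 0.
counts <- unclass(table(purchases$CustomerKey, purchases$ProductGroup))
bought <- (counts > 0) * 1

# Variance of each product-group column across customers.
col_var <- apply(bought, 2, var)
print(col_var)

# All pairwise distances are zero, so this attribute cannot separate customers.
all_zero <- all(dist(bought) == 0)
print(all_zero)
```

In a case like this, something such as quantity or value per product group may separate customers better than a plain bought / not bought flag.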

I hope that makes it clearer.
Best, Tomaž

Tomaž Kaštrun | twitter: @tomaz_tsql | Github: https://github.com/tomaztk | blog: https://tomaztsql.wordpress.com/