RE: Help with Data-Mining Clustering question

jsminer2011

SSC Enthusiast

Points: 102

March 21, 2012 at 3:25 am

#1462058

Carlos Bossy (3/21/2012)
Yes it seems you want to use regression. Clustering is a good algorithm for understanding your data, but if you want to predict a value like Purchase Amount based on other variables then regression is the way to go. You said you wanted to 'point at a specific column (Purchase Amount) and ask DM to tell me what groups of fields lead to different purchase amounts ranges'. I didn't completely understand it at first because I might have stated it differently - 'I have several independent variables that I know about customers (such as gender, income, past purchase history) and I want to predict a dependent variable called purchase amount'. Based on this I would recommend you start with a decision tree as your algorithm.
To do this in Excel, load the data into a worksheet with column headers. Go to the data mining menu item (assuming you've installed the data mining add-in) and select the dataset. Click the Classify icon. It will walk you through the process, just make sure you select Purchase Amount as the 'Column to Analyze'. Continue through the wizard, it will generate the mining model and show you the decision tree. You can take a visual look at the tree and you should get a good idea as to its structure.
After the model is built you have to test it to see if it is a good predictor of purchase amount. Use the icons under the Accuracy and Validation section for this. Then go through the cycle again and again until you get good results.
By the way, funny you said doing this with Excel seems too simple. Actually this a very powerful way to do this work, since Excel is invoking Analysis Services in the background to build and test the mining models. You get the power of Excel and Analysis Services together in one package.

Hi, again - thanks for the very informative and step-by-step answer.

You mentioned I should use 'Decision Trees'.

But won't decision tree always force me to use some independent variables ?

What I am trying to say is this:

I believe that the results of 'clusters' would look something like this :

=====================================================

All the people who bought in the price range of 100$-200$ are :

* Males under the age of 60

* Females between ages 20 and 25

* People who live 20 KM from the store

* People who purchase more than 10 items and come to the store in the morning

=====================================================

All the people who bought in the price range of 50$-100$ are :

* Males who come at night

* People who buy less than 5 items

=====================================================

All the people who bought in the price range of 0$-50$ are :

* Females who come at morning

* People who buy only 1 item

=====================================================

That is what I would like to get as a result at the end.

The different 'clusters' that form the different price ranges.

But I think that 'Decision Trees' might not fit here,

because, as I understand it (and I may be wrong 🙂 )

the way decision trees work is that it starts from one variable,

and then splits to the next variable, and the splits to the next variable,

and so on ...

and at the end - the clusters always start from the same variable.

Meaning - if the first variable is 'Gender' - than I will always have 'Male' or 'Female'

in the resulted cluster.

But as I have shown above - some clusters might not have to include this variable.

For example : "People who buy only 1 item".

So is 'Decision Trees' what I am looking for ?

I hope my question is clear enough ...