• Carlos Bossy (3/21/2012)


    John,

    I'm not sure you can get exactly what you want so I'm trying to suggest the closest you can come to it. This is where you can use more than one approach in data mining to get results, and often more than one approach is appropriate. To address your points/questions:

    - Clustering does not 'start from the same variable' as you put it, since it uses every input variable to determine the cluster a customer belongs to. The clusters generated would contain people with similar characteristics, but they won't be 100% similar.

    - You'll never get a result as clean as the example you provided. With clustering you might have the same Purchase Amount in more than one cluster, and you won't be able to get all the people who have a specific purchase amount together in one cluster (unless your data is very simple, which it doesn't seem to be). As an example, you'll get results like this:

    Cluster 1

    92% of the people in this cluster bought in the price range of 100$-200$

    86% of the Males are under the age of 60

    74% of the Females are between ages 20 and 25

    Cluster 2

    100% of the people in this cluster bought in the price range of 100$-200$

    68% of the Males are between the ages of 48 and 54

    95% of the Females are under age 45

    Cluster 3

    61% of the people in this cluster bought in the price range of 100$-200$ (so 28% bought 0-50$, and 11% bought 50-100$, for example)

    77% of the Males are under the age of 34

    81% of the Females are under age 39

    - What you can do that is similar to the example you presented is to create the cluster model and then look at the Cluster Profiles tab when you browse the mining model. This screen is a good way to see the clusters that were generated and the type of people each one contains. You'll be able to see what type of people are in the clusters with purchase amount 0-50, 50-100, etc.

    Regardless, clustering is very useful and might be what you want, but there is no concept of a starting variable.
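
    To make the "same purchase amount in more than one cluster" point concrete, here is a rough sketch in plain Python (not the mining tool itself). The cluster assignments and the `cluster_profiles` helper are made up for illustration; the idea is only to show how a per-cluster profile is a set of percentages, not a clean one-to-one mapping.

```python
from collections import Counter, defaultdict

# Hypothetical (cluster, purchase range) assignments, invented for illustration.
assignments = [
    ("Cluster 1", "100-200"), ("Cluster 1", "100-200"),
    ("Cluster 1", "100-200"), ("Cluster 1", "50-100"),
    ("Cluster 2", "100-200"), ("Cluster 2", "100-200"),
    ("Cluster 3", "100-200"), ("Cluster 3", "0-50"),
    ("Cluster 3", "0-50"), ("Cluster 3", "50-100"),
]

def cluster_profiles(rows):
    """Return {cluster: {purchase_range: percentage of that cluster}}."""
    counts = defaultdict(Counter)
    for cluster, purchase_range in rows:
        counts[cluster][purchase_range] += 1
    profiles = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        profiles[cluster] = {
            rng: round(100 * n / total) for rng, n in counter.items()
        }
    return profiles

profiles = cluster_profiles(assignments)
for cluster, profile in sorted(profiles.items()):
    print(cluster, profile)
```

    In this toy data the 100-200 range shows up in all three clusters, just with different weights, which is the kind of picture the cluster profiles give you.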

    Back to decision trees. The starting variable you refer to could be interpreted as the predictable variable, the one the model is trying to predict. Generating a decision tree will split the data, as you know, but it doesn't use every column at every split either. Some columns may not be relevant to the ultimate decision made by the model, or the algorithm can prune the tree so that a column doesn't factor into the result. A specific purchase amount might be arrived at in the tree via multiple paths, so it won't be as clean as you want, but you will be able to explain how it got there.
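
    As a rough illustration of "multiple paths to the same purchase amount", here is a small hand-built tree in plain Python. The splits, the column names, and the `paths_to` helper are all hypothetical; a real model would choose its own splits from your data.

```python
# Hypothetical tree: internal nodes test one column, leaves hold the
# predicted purchase range. Invented for illustration only.
tree = {
    "test": "Gender = Male",
    "yes": {
        "test": "Age < 40",
        "yes": {"leaf": "100-200"},
        "no": {"leaf": "50-100"},
    },
    "no": {
        "test": "Age < 30",
        "yes": {"leaf": "0-50"},
        "no": {"leaf": "100-200"},
    },
}

def paths_to(node, target, conditions=()):
    """Yield each root-to-leaf list of conditions that ends in `target`."""
    if "leaf" in node:
        if node["leaf"] == target:
            yield list(conditions)
        return
    test = node["test"]
    yield from paths_to(node["yes"], target, conditions + (test,))
    yield from paths_to(node["no"], target, conditions + ("NOT (" + test + ")",))

paths = list(paths_to(tree, "100-200"))
for path in paths:
    print(" AND ".join(path))
```

    Here two different paths end at 100-200. Any single customer follows exactly one path from the root to a leaf, so the paths describe separate groups of people even when they test the same columns.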

    You can see that neither of these algorithms will get you exactly the results you want, but they are both very valuable for what you want to do. I would approach this by creating a mining structure that contains two mining models, one using clustering and the other a decision tree. As you work with each you'll get a feel for which one works better, and you may find that you'll rely on clustering to satisfy some requirements and on decision trees to help with others.

    Sorry for the long-winded responses, but I hope this helps. If we were doing this in person, I would have filled 5 whiteboards already 🙂

    Thanks again (for the 3rd time).

    I am happy with long posts. It is a lot of good information, and that is always good.

    You said a couple of times that I won't be happy with this: "A specific purchase amount might be arrived at in the tree via multiple paths, so it won't be as clean as you want".

    Why won't I be happy with that?

    That is EXACTLY what I am looking for.

    I know that there are different paths that lead to the purchase amount 100$-200$, and I just need to find them.

    My question is: will these different paths have some variable values in common, or is each path in the decision tree mutually exclusive from the other paths?