Help with Data-Mining \ Clustering question

  • I am totally new to this data-mining business.

    I would like for someone to point me in the right direction, plese. In my business - I will have a database with several tables (let's assume MS SQL Server). For the sake of example, I have a 'person' table and an 'purchases' table (each person can make several different purchases).

    Here is an example of the 'person' table:

    * ID

    * First Name

    * Last Name

    * Address

    *Date of Birth

    Here is an example of the 'purchase' table:

    * Date of Purchase

    * Time of Purchase

    * Store Location

    * Number of Items Purchased

    * Total Purchase Amount

    What I would like to do is:

    I would like to perform 'data-mining' in order to find out the "categories", or "clusters" of characteristics, that define each 'purchase amount'.

    Meaning - my target 'column' is the 'purchase amount' column.

    Now I want to know all the different clusters of fields that produce each 'amount' in this column.

    For example - I would like to know what is common for all the people who bought between 2,000$ and 2,500$ ?

    Maybe they all came from the same neighbourhood ?

    Maybe all of them were young & came from the same neighbourhood?

    Maybe there are 2 distinct different groups that fit the '2000$-2500$' spending group ?

    I am almost sure there are different categories (or distinct clusters) that match each price group...

    one might be old people that come to buy at night,

    another might be middle-aged people who buy more than 10 items,

    and another might be middle-aged people who buy one Sundays.

    From what I understand, this falls under the 'data-mining' region (clustering maybe?)

    Is this true ? and if so, what are the tools I must use in order to find out these set of parameters that define the different groups for each pricing category?

    As you can see, I have no idea which tools \ applications I should use.

    Someone recommended installing an add-in for Excel that could do the trick, but this seems rather too simple.

    Someone else mentioned BI (don't know what exactly to do there)...

    I know there is also the 'Microsoft Analysis Services'. Does that help me in any way ?

    Thank you very much !

    John

  • John, to answer one of your questions, yes, what you are doing is definitely data mining. With SQL Server you can do what you want with Analysis Services either through Excel or by developing directly through Visual Studio, although there are a number of 3rd party data mining tools out there.

    If you are not a developer and don't want to use VS, install the Excel add-in and use that. You can find info on it here with instructions and demos: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/data-mining.aspx

    I'm sure I've seen a good tutorial somewhere out there on doing clustering with the Excel add-in, but I don't have it at hand, so do a search and see if you can find one.

    If you would rather use VS then start a new Analysis Services project and create a mining structure and mining models. Based on your data it does seem that a clustering algorithm would be a good one to start with. You can discretize the spending amount into whatever number of groups you want (0-2000, 2000-2500, etc.), then build the mining model and you will start to see what customer characterstics fall into each cluster. This is a process you will probably go through more than once as you understand your data better.

    I'm obviously skipping a lot since I can't really go through a step-by-step process, but this is generally the idea. I hope it helps you get started.

    LinkedIn - http://www.linkedin.com/in/carlosbossy
    Blog - http://www.carlosbossy.com
    Follow me - @carlosbossy

  • Carlos Bossy (3/19/2012)


    John, to answer one of your questions, yes, what you are doing is definitely data mining. With SQL Server you can do what you want with Analysis Services either through Excel or by developing directly through Visual Studio, although there are a number of 3rd party data mining tools out there.

    If you are not a developer and don't want to use VS, install the Excel add-in and use that. You can find info on it here with instructions and demos: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/data-mining.aspx

    I'm sure I've seen a good tutorial somewhere out there on doing clustering with the Excel add-in, but I don't have it at hand, so do a search and see if you can find one.

    If you would rather use VS then start a new Analysis Services project and create a mining structure and mining models. Based on your data it does seem that a clustering algorithm would be a good one to start with. You can discretize the spending amount into whatever number of groups you want (0-2000, 2000-2500, etc.), then build the mining model and you will start to see what customer characterstics fall into each cluster. This is a process you will probably go through more than once as you understand your data better.

    I'm obviously skipping a lot since I can't really go through a step-by-step process, but this is generally the idea. I hope it helps you get started.

    Thanks for the informative reply !

    Actually, I am a developer (using mostly VS 2010 with C#).

    but from what I understand - Excel might be good enough for me.

    It just seems too 'simple' to use Excel as a data-mining tool.

    After reading some more - I am unsure if what I am looking for is 'Clustering' or 'Regression'.

    I have read both definitions, and I think that maybe 'Regression' is a more suitable thing that I need to explore.

    Do you agree ?

    Again - what I need is to point at a specific column (Purchase Amount) and ask DM to tell me what groups of fields lead to different purchase amounts ranges ...

  • Yes it seems you want to use regression. Clustering is a good algorithm for understanding your data, but if you want to predict a value like Purchase Amount based on other variables then regression is the way to go. You said you wanted to 'point at a specific column (Purchase Amount) and ask DM to tell me what groups of fields lead to different purchase amounts ranges'. I didn't completely understand it at first because I might have stated it differently - 'I have several independent variables that I know about customers (such as gender, income, past purchase history) and I want to predict a dependent variable called purchase amount'. Based on this I would recommend you start with a decision tree as your algorithm.

    To do this in Excel, load the data into a worksheet with column headers. Go to the data mining menu item (assuming you've installed the data mining add-in) and select the dataset. Click the Classify icon. It will walk you through the process, just make sure you select Purchase Amount as the 'Column to Analyze'. Continue through the wizard, it will generate the mining model and show you the decision tree. You can take a visual look at the tree and you should get a good idea as to its structure.

    After the model is built you have to test it to see if it is a good predictor of purchase amount. Use the icons under the Accuracy and Validation section for this. Then go through the cycle again and again until you get good results.

    By the way, funny you said doing this with Excel seems too simple. Actually this a very powerful way to do this work, since Excel is invoking Analysis Services in the background to build and test the mining models. You get the power of Excel and Analysis Services together in one package.

    LinkedIn - http://www.linkedin.com/in/carlosbossy
    Blog - http://www.carlosbossy.com
    Follow me - @carlosbossy

  • Carlos Bossy (3/21/2012)


    Yes it seems you want to use regression. Clustering is a good algorithm for understanding your data, but if you want to predict a value like Purchase Amount based on other variables then regression is the way to go. You said you wanted to 'point at a specific column (Purchase Amount) and ask DM to tell me what groups of fields lead to different purchase amounts ranges'. I didn't completely understand it at first because I might have stated it differently - 'I have several independent variables that I know about customers (such as gender, income, past purchase history) and I want to predict a dependent variable called purchase amount'. Based on this I would recommend you start with a decision tree as your algorithm.

    To do this in Excel, load the data into a worksheet with column headers. Go to the data mining menu item (assuming you've installed the data mining add-in) and select the dataset. Click the Classify icon. It will walk you through the process, just make sure you select Purchase Amount as the 'Column to Analyze'. Continue through the wizard, it will generate the mining model and show you the decision tree. You can take a visual look at the tree and you should get a good idea as to its structure.

    After the model is built you have to test it to see if it is a good predictor of purchase amount. Use the icons under the Accuracy and Validation section for this. Then go through the cycle again and again until you get good results.

    By the way, funny you said doing this with Excel seems too simple. Actually this a very powerful way to do this work, since Excel is invoking Analysis Services in the background to build and test the mining models. You get the power of Excel and Analysis Services together in one package.

    Hi, again - thanks for the very informative and step-by-step answer.

    You mentioned I should use 'Decision Trees'.

    But won't decision tree always force me to use some independent variables ?

    What I am trying to say is this:

    I believe that the results of 'clusters' would look something like this :

    =====================================================

    All the people who bought in the price range of 100$-200$ are :

    * Males under the age of 60

    * Females between ages 20 and 25

    * People who live 20 KM from the store

    * People who purchase more than 10 items and come to the store in the morning

    =====================================================

    All the people who bought in the price range of 50$-100$ are :

    * Males who come at night

    * People who buy less than 5 items

    =====================================================

    All the people who bought in the price range of 0$-50$ are :

    * Females who come at morning

    * People who buy only 1 item

    =====================================================

    That is what I would like to get as a result at the end.

    The different 'clusters' that form the different price ranges.

    But I think that 'Decision Trees' might not fit here,

    because, as I understand it (and I may be wrong 🙂 )

    the way decision trees work is that it starts from one variable,

    and then splits to the next variable, and the splits to the next variable,

    and so on ...

    and at the end - the clusters always start from the same variable.

    Meaning - if the first variable is 'Gender' - than I will always have 'Male' or 'Female'

    in the resulted cluster.

    But as I have shown above - some clusters might not have to include this variable.

    For example : "People who buy only 1 item".

    So is 'Decision Trees' what I am looking for ?

    I hope my question is clear enough ...

  • This is a pretty good book with all the basics.

    http://www.marktab.net/datamining/index.php/2010/11/05/data-mining-for-business-intelligence-book-review/

    Also, for SQL specific Data Mining for SQL Server 2005 (although I believe that the 2008 edition is out). It covers SSAS as well as the specific algprithms available in SSAS.

  • sbelche-760308 (3/21/2012)


    This is a pretty good book with all the basics.

    http://www.marktab.net/datamining/index.php/2010/11/05/data-mining-for-business-intelligence-book-review/

    Also, for SQL specific Data Mining for SQL Server 2005 (although I believe that the 2008 edition is out). It covers SSAS as well as the specific algprithms available in SSAS.

    Thanks for the info.

    No answer to my specific question ? :ermm:

  • John,

    I'm not sure you can get exactly what you want so I'm trying to suggest the closest you can come to it. This is where you can use more than one approach in data mining to get results, and often more than one approach is appropriate. To address your points/questions:

    - Clustering does not 'start from the same variable' as you put it, since it uses every input variables to determine the cluster a customer belongs to. The clusters generated would have people with similar characteristics, but they won't be 100% similar.

    - You'll never get a result as clean as the example you provided. With clustering you might have the same Purchase Amount in more than one cluster, and you won't be able to get all people that have a specific purchase amount together in one cluster (unless your data is very simple which it doesn't seem to be). As an example you'll get results like this:

    Cluster 1

    92% of the people in this cluster bought in the price range of 100$-200$

    86% are Males under the age of 60

    74% are Females between ages 20 and 25

    Cluster 2

    100% of the people in this cluster bought in the price range of 100$-200$

    68% are Males between the age of 48 and 54

    95% are Females under age 45

    Cluster 3

    61% of the people in this cluster bought in the price range of 100$-200$ (so 28% bought 0-50$, and 11% bought 50-100$, for example)

    77% are Males under age of 34

    81% are Females under age 39

    - What you can do that is similar to the example you presented is to create the cluster model and then look at the cluster profiles tab when you browse the mining model. This screen is a good way for you to see the clusters generated and the type of people who are contained in that cluster. You'll be able to see what type of people are contained in the clusters with purchase amount 0-50, 50-100, etc.

    Regardless, clustering is very useful and might be what you want, but there is no concept of a starting variable.

    Back to decision trees. The starting variable you refer to could be interpreted as the predictive variable. But generating a decision tree will split the data as you know, although it doesn't use every column each time either. Some columns may not be relevant to the ultimate decision made by the model, or a tree can be pruned by the algorithm so that a column doesn't factor in to the result. A specific purchase amount might be arrived at in the tree via multiple paths, so it won't be as clean as you want, but you will be able to explain how it got there.

    You can see that neither of these algorithms will get you exactly the results you want, but they are both very valuable for what you want to do. I would approach this by creating a mining structure that contains two mining models, one that is clustered and the other a decision tree. As you work with each you'll get a feel for which one works better, and you may find that you'll rely on clustering to satisfy some requirements, and decision trees to hep with other requirements.

    Sorry for the long-winded responses, but I hope this helps. If we were doing this in person, I would have filled 5 whiteboards already 🙂

    LinkedIn - http://www.linkedin.com/in/carlosbossy
    Blog - http://www.carlosbossy.com
    Follow me - @carlosbossy

  • Carlos Bossy (3/21/2012)


    John,

    I'm not sure you can get exactly what you want so I'm trying to suggest the closest you can come to it. This is where you can use more than one approach in data mining to get results, and often more than one approach is appropriate. To address your points/questions:

    - Clustering does not 'start from the same variable' as you put it, since it uses every input variables to determine the cluster a customer belongs to. The clusters generated would have people with similar characteristics, but they won't be 100% similar.

    - You'll never get a result as clean as the example you provided. With clustering you might have the same Purchase Amount in more than one cluster, and you won't be able to get all people that have a specific purchase amount together in one cluster (unless your data is very simple which it doesn't seem to be). As an example you'll get results like this:

    Cluster 1

    92% of the people in this cluster bought in the price range of 100$-200$

    86% are Males under the age of 60

    74% are Females between ages 20 and 25

    Cluster 2

    100% of the people in this cluster bought in the price range of 100$-200$

    68% are Males between the age of 48 and 54

    95% are Females under age 45

    Cluster 3

    61% of the people in this cluster bought in the price range of 100$-200$ (so 28% bought 0-50$, and 11% bought 50-100$, for example)

    77% are Males under age of 34

    81% are Females under age 39

    - What you can do that is similar to the example you presented is to create the cluster model and then look at the cluster profiles tab when you browse the mining model. This screen is a good way for you to see the clusters generated and the type of people who are contained in that cluster. You'll be able to see what type of people are contained in the clusters with purchase amount 0-50, 50-100, etc.

    Regardless, clustering is very useful and might be what you want, but there is no concept of a starting variable.

    Back to decision trees. The starting variable you refer to could be interpreted as the predictive variable. But generating a decision tree will split the data as you know, although it doesn't use every column each time either. Some columns may not be relevant to the ultimate decision made by the model, or a tree can be pruned by the algorithm so that a column doesn't factor in to the result. A specific purchase amount might be arrived at in the tree via multiple paths, so it won't be as clean as you want, but you will be able to explain how it got there.

    You can see that neither of these algorithms will get you exactly the results you want, but they are both very valuable for what you want to do. I would approach this by creating a mining structure that contains two mining models, one that is clustered and the other a decision tree. As you work with each you'll get a feel for which one works better, and you may find that you'll rely on clustering to satisfy some requirements, and decision trees to hep with other requirements.

    Sorry for the long-winded responses, but I hope this helps. If we were doing this in person, I would have filled 5 whiteboards already 🙂

    Thanks again (for the 3rd time).

    I am happy with long posts, it is a lot of good information, and that is always good.

    You said a couple of time that I won't be happy with "A specific purchase amount might be arrived at in the tree via multiple paths, so it won't be as clean as you want".

    Why won't I be happy with that ?

    That is EXACTLY what I am looking for.

    I know that there are different paths that lead to the purchase amount 100$-200$, and I just need to find them.

    My question is - will these different paths have some variable values in common, or is each path in the decision tree mutually exclusive from other paths ?

  • I'm glad I could help so it's good to hear this has been useful.

    With decision trees arriving at the same answer using multiple paths, the paths are not necessarily mutually exclusive. They might end up that way, but that would just be a random occurrence. For a tree of any complexity you will most likely have some overlap of variable values. In one path the node might ask 'Is age between 25-44' and in another it might say 'Is age > 39'.

    I guess that's why I said you might not be happy. I got the impression you wanted a cleaner tree, with variable values occurring only in a single path. What you'll get instead for the multiple paths that lead to 100$-200$, path 1 might use age between 25-44 to make a decision, and path 2 might use age > 39 to make a decision. In fact, later in path 2 after satisfying age > 39, it might then ask is age between 50-65. If you are comfortable with this then you are good to go.

    LinkedIn - http://www.linkedin.com/in/carlosbossy
    Blog - http://www.carlosbossy.com
    Follow me - @carlosbossy

  • Carlos Bossy (3/21/2012)


    I'm glad I could help so it's good to hear this has been useful.

    With decision trees arriving at the same answer using multiple paths, the paths are not necessarily mutually exclusive. They might end up that way, but that would just be a random occurrence. For a tree of any complexity you will most likely have some overlap of variable values. In one path the node might ask 'Is age between 25-44' and in another it might say 'Is age > 39'.

    I guess that's why I said you might not be happy. I got the impression you wanted a cleaner tree, with variable values occurring only in a single path. What you'll get instead for the multiple paths that lead to 100$-200$, path 1 might use age between 25-44 to make a decision, and path 2 might use age > 39 to make a decision. In fact, later in path 2 after satisfying age > 39, it might then ask is age between 50-65. If you are comfortable with this then you are good to go.

    Thank you. It is deeply appreciated.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply