RE: Simple (I think!) Clustering Question (Please help!)

  • I am assuming you are including only the TEMT_CLUSTER_SMALL fact table in the analysis. That's probably the right way to proceed, but I wonder how useful clustering analysis is going to be when you have 20,000 values of ID_NUM and 500 values of TECK_ID. On that raw data I don't think it will give you a sensible result: you'd effectively end up with either 20,000 clusters or 500 clusters, depending on which column you chose.

    You might want to consider binning the data into ranges.

    I suggest doing a frequency histogram to see how many records you have for each ID_NUM value, and then the same for TECK_ID.

    For example, you could bin them into about 10-15 groups.

    One way of binning is fixed-width ranges of ID_NUM, such as 1-2000, 2001-4000, and so on. Ideally, though, you would make ranges that each hold roughly the same number of records, which is why you should do some data exploration first.

    Without knowing what the teck a TECK is, I can't really give good guidance on the route to take (domain knowledge is half the battle in analytics!), but another type of data mining model that could be a good initial investigation is "decision trees". The model will automatically bin the values for you in order to create a relatively small tree. With clustering analysis, once you get beyond about 3 clusters it becomes very difficult to visualize in a useful way.
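
    To make the frequency count and equal-population binning ideas above a bit more concrete, here is a rough Python sketch using only the standard library. The function names and the made-up ID_NUM values are purely illustrative (I don't know your real data or which tool you are using), so treat it as a starting point rather than the way to do it. Note that when one value has many records, they all land in the same bin, so the bins come out only roughly equal.

```python
import bisect
from collections import Counter


def frequency_table(values):
    """Count records per distinct value, most frequent first."""
    return Counter(values).most_common()


def equal_population_edges(values, n_bins):
    """Return n_bins - 1 cut points chosen so that each bin holds
    roughly the same number of records (equal-frequency binning)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def assign_bin(value, edges):
    """Return the 0-based bin index of value; a value equal to a
    cut point goes into the bin that starts at that cut point."""
    return bisect.bisect_right(edges, value)


if __name__ == "__main__":
    # Hypothetical data: ID_NUM value i appears on i records (i = 1..20),
    # so high IDs have many more records than low ones.
    id_nums = [i for i in range(1, 21) for _ in range(i)]

    # Step 1: look at the frequencies before deciding on ranges.
    print("Top 3 ID_NUM values by record count:", frequency_table(id_nums)[:3])

    # Step 2: pick cut points that balance the record counts per bin.
    edges = equal_population_edges(id_nums, n_bins=5)
    print("Cut points:", edges)

    # Step 3: check how many records actually fall into each bin.
    per_bin = Counter(assign_bin(v, edges) for v in id_nums)
    for b in sorted(per_bin):
        print(f"bin {b}: {per_bin[b]} records")
```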