Simple (I think!) Clustering Question (Please help!)

  • Hello All,

    Long time reader, first time writer!!

    I am having some trouble with SSAS and data mining - specifically the Microsoft Clustering package.

    I intend to ultimately do my work in AMO and MDX, but for now, just happy to understand how it works in the BIDS via Visual Studio. One step at a time!

    The whole problem is around clustering both "vertically" and "horizontally" (separately) from a table that is organized vertically. My main source data table in my OLTP database looks like =>

    ID_NUM

    {numbers 1 - 20,000}

    TECK_ID

    {numbers 1-500, {for each ID_NUM}}

    (though just grabbed a few of these for playing around with the data in the screencaps)

    TECK_VALUE

    {a double, the 'fact' bit}

    So- 10 million rows, of two ints and a double.

    Which looks like this- http://i.imgur.com/KG1LhaJ.jpg

    So I create a new Analysis Services project in Visual Studio, set up a Data Source, and bring in the above table, as well as two "dimension tables" (identity of what id_num is, names of what each teck_id is) into a Data Source View and link it up, matching up the appropriate keys.

    Which looks like this- http://i.imgur.com/Q0vgwIc.jpg

    Next I want to manipulate how my data is represented, so I go to set up a cube from this Data Source View. I create dimensions based on my two "dimension" tables (the above "id_num" primary key one, and the "teck_id" primary key one), and create a single measure (as sum) of the teck_value column from my main table. This all seems to compile successfully.

    Which looks like this- http://i.imgur.com/y5pUSjh.jpg

    The reason I think everything has worked well is I can arrange my data how I want by browsing the cube. I am able to define my "rows" as both the id_num, or as the "teck_id", with the other one filling up the columns. The measure "Teck_value" always makes up the dataset of the table. This is exactly how I want it, the flexibility to arrange my data both ways.

    Which looks like this- http://i.imgur.com/ugLUkgg.jpg

    And this- http://i.imgur.com/RwQgj58.jpg

    Beautiful! Now I wish to do some mining on this basis!

    I wish to, quite simply, using Microsoft Clustering to (separately) -

    - Assign each TECK_ID a cluster number based on how it varies on each ID_NUM

    - Assign each ID_NUM a cluster based on how it varies on each TECK_ID

    Seemingly a simple requirement - just changing what is represented as "rows" and what as "columns" - which I already appear to be able to do through the cube browser. This seems to be one of the main points of OLAP rather than OLTP from my uneducated perspective!

    Yet when I try to set this up I fail utterly!

    The Clustering Wizard leaves me confounded and I come up with nonsense results. I am given the option of selecting a key (for which I can choose either of the above), but no option to parse by the other dimension. Indeed, the only thing I can choose to mine on is TECK_VALUE, which isn't any good as that doesn't separate out the different fields!

    My wizard looks like this- http://i.imgur.com/lHfasv0.jpg

    So, I am left in a pickle. I really don't want to go back and line up my OLTP databases horizontally because 1) this would mean having 20k columns when I try to categorize my TECK_IDs. and 2) I was hoping SSAS and OLAP can give me the flexibility I need to mine the fields that I want - isn't that part of the reason you set up a cube "chop up the data how you like" ?

    SQL Server Central - Please help an OLAP noob make his first steps into this dangerous new world!!

  • I think the issue is that you've set ID_NUM as a key. They aren't unique so the data mining tool is getting confused.

    You don't need to change the data to "horizontal", how you have it is fine.

    You will need a surrogate key on the table, so that each row of ID_NUM, TECK_ID and TECK_VALUE is a unique element. Just insert a column and populate with an IDENTITY or row number.

    You can then get the model to data mine on the TECK VALUE based on either TECK ID or ID_NUM or a combination of both.

  • hey there davos,

    Thanks for the help!

    I added a primary key to my main fact table (called it TRUE_ID) - dsv - but I still can't set up a "measure" which segregates by a dimension. That is to say, I set up my clustering mining model like this (to try and cluster ID_NUM based on TECK_VALUE changing over TECK_ID), but when I process and have a look at the output the clusters are defined only in terms of TECK_VALUE, not in terms of TECK_ID - here.

    I think the solution may be for me to set up a new measure rather than just one solely based on TECK_VALUE, be good if I could set up measures based on how TECK_VALUE changes with each TECK_ID.

    At the moment from a pivot perspective I have =>

    m:TECK_VALUE

    ID_NUM TECK_ID=1, TECK_ID=2, TECK_ID=3

    1

    2

    3

    Which I can't seem to mine properly because the only measure is TECK_VALUE, and I can't get it to include TECK_ID.

    I think what I need is -

    ID_NUM m:TECK_VALUE & TECK_ID=1, m:TECK_VALUE & TECK_ID = 2 etc etc

    1

    2

    3

    Perhaps?

    Still confounded!

  • I am assuming you are just including the TEMT_CLUSTER_SMALL fact table in the analysis, that's probably the right way to proceed, but I am wondering how useful clustering analysis is going to be when you have 20,000 values of ID_NUM and 500 values of TECK_ID. I think based on that raw data it's not really going to give you a sensible result. You'd have either 20,000 clusters or 5000 clusters depending on which you chose.

    You might want to consider binning the data into ranges.

    I suggest doing a frequency histogram to take a look at how many individual records you have for the ID_NUM, and then the same for the TECK_ID.

    For example, you could bin them into about 10 -15 groups

    So if you have ID_NUM from 1-2000 and then 2001-4000 etc, that would be one way of binning them, but ideally you make ranges with equal populations, hence why you should do some data exploration first.

    Without knowing what the teck a TECK is I can't really give good guidance on the route to take (domain knowledge is half the battle in analytics!) , but another type of data mining model that could be a good initial investigation is "decision trees". The model will automatically bin them for you in order to create a relatively small tree. With clustering analysis, once you get beyond about 3 clusters it becomes very difficult to visualize in a useful way.

  • G'day davos,

    I think I am okay from a fundamental clustering perspective - clustering 500 "cases" into 10 groups (clusters) based on 20,000 observations of each case. For smaller sets but the same sort of 'shape' I can perform this calculation in memory with no problems.

    My key problem is that my single "fact" table, becomes 1 measure of TECK_VALUE.

    I really need a measure of (TECK_VALUE when TECK_ID = 1), (TECK_VALUE when TECK_ID = 2)... one of these measures exists for each case. Whenever I run clustering now it defines each cluster in terms of TECK_VALUE --- this is no good, because I need to to consider how TECK_VALUE changes as TECK_ID changes... if that makes sense.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply