Data mining on a single table

  • Hello everyone,

    I am wondering, is it possible to have data mining on a single table with a large amount of data?

    The given table is about sales and contains all the needed info (SellerType, ShopLocation, ItemCategory, ItemCondition, Sold, Price, ShippingCost).

    I tried to create a data cube but I can only get one dimension.

    What I want to exam is the affect shipping cost has on sales based on category, or the price based on condition and category, etc.

    Any thoughts?

    Thank you in advance.

  • Check out the link.



    For better, quicker answers on T-SQL questions, read Jeff Moden's suggestions.[/url]

    "Million-to-one chances crop up nine times out of ten." ― Terry Pratchett, Mort

  • I'm sorry but I don't understand your post.

    I didn't add code or table data, I'm just asking if data mining can be done successfully on a single table.

    Since English is not my native language I tried to describe the table, by giving the name of the fields, in case someone was wondering why the dataset consists only one table.

    Thank you for your time though.

  • I found the solution and I'm posting it in case someone else have the same problem.

    The answer is to create relational tables as following, instead of having one table with all the data in it.

    Table Sales (SalesID, SellerTypeID, ShopLocationID, ItemCategoryID,ItemConditionID, Sold, Price, ShippingCost)

    Table SellerType (SellerTypeID,SellerTypeDescription)

    Table ShopLocation (ShopLocationID,ShopLocationDescription)

    Table ItemCondition (ItemConditionID, ItemConditionDescription)

    Table ItemCategory (ItemCategoryID, ItemCategoryDescription)

    Regards 🙂

  • Chriz L. (12/1/2012)


    I found the solution and I'm posting it in case someone else have the same problem.

    The answer is to create relational tables as following, instead of having one table with all the data in it.

    Table Sales (SalesID, SellerTypeID, ShopLocationID, ItemCategoryID,ItemConditionID, Sold, Price, ShippingCost)

    Table SellerType (SellerTypeID,SellerTypeDescription)

    Table ShopLocation (ShopLocationID,ShopLocationDescription)

    Table ItemCondition (ItemConditionID, ItemConditionDescription)

    Table ItemCategory (ItemCategoryID, ItemCategoryDescription)

    Regards 🙂

    I have to say that "It Depends". What you have above will work fine until a shop changes location (and that does happen) and then you'll have "lost" historical data. Descriptions can also change at the drop of a hat. The only way you could preserve the historical nature of the information is if you instantiate something like "Type 2 Slowly Changing Dimensions" in the "finger tables" of this "Star Schema".

    You're also missing some key information in your "hub" table (even without the "finger" tables). You're missing the date of the sale which is why the other people asked you to post the CREATE TABLE statement for your table if you had one. They wanted to check on all the stuff you might be missing.

    To wit, this hasn't answered your question of "is it possible to have data mining on a single table with a large amount of data", at all, because you still have a large, single table as a hub and you'll need to join it to several other tables and, to keep the historical information intact, you'll also need to instantiate the Type 2 SCD's I previously spoke of.

    Sales tables, like this one, are historical archives. There's very few reasonable explanations as to why any normalization, like you're done above, should even be attempted. All it does is severely complicate things and, without the Type 2 SCD's (one of the few reasons to try such normalization on such a static reporting table), it won't allow you to mine for things like "did sales go up or down when we changed the name of an item".

    To answer your question, "It Depends". Do you want to be able to mine for things like if description changes have affected sales? If so, this is going to get a whole lot more complicated. If not, then go back to your flat historical table and save yourself the trouble. Don't forget to add a date. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thank you for the post, it's very helpful.

    I'm just getting started with data mining so I can use all the help I can get.

    🙂

  • To add to what I previously stated. If you're using an analysis tool, such as SSAS, you may have to "normalize" to a star or snowflake to take advantage of the things that "pivots" and the like can do. Just be aware that since it is for sales and sales are, by their very nature, "historical", how you handle changes to locations and product names will play a key role in what you can actually report. Type 1 or Type 2 SCDs or other historical reporting methods can play a very important role in that.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply