Data Mining Introduction Part 7: Microsoft Association

  • Comments posted to this topic are about the item Data Mining Introduction Part 7: Microsoft Association

  • Awesome article. I love the association algorithm. However, I've run into an issue now that I've moved away from the typical textbook setup, in which the same nested table serves as input, key, and predict. It seems that in DMX queries PredictAssociation returns everything and only then does the join to filter. Adding a 50 to PredictAssociation() returned only the top fifty results and sped it up, but the query still took 20 seconds (down from about 30). So it's almost as if it has to return the entire nested part before it applies the input joins. I have models with more data that use the textbook style of nested input/predict/key, and their queries run in milliseconds.

    This particular model uses different input than the output. For instance, rather than a simple market basket analysis, where purchasing one product predicts purchasing another in the same session, you would use categories browsed to predict products purchased. Or products browsed can predict the search terms you might use. After pulling my hair out, I tried other ways of structuring my models, but they would always run out of memory or the queries wouldn't work correctly.

    Example model structure:

    Columns:
      user_session => key
      product_browsed => input
      vw_data_for_nested => PredictOnly
        Columns:
          search_term_used => key

    Input Data:

    user_session | search_term | product_browsed
    90234        | white       | 23AX0DZ
    90234        | white       | 039POOZ
    34333        | light       | 23AX0DZ

    Sample query (the data is great!):

    SELECT FLATTENED
      (SELECT
         [search term]
        ,$Probability AS [Probability]
        ,$AdjustedProbability AS [AdjustedProbability]
        ,$Support AS [Support]
       FROM PREDICT([vw For Product Predicts Search], 50, INCLUDE_NODE_ID, INCLUDE_STATISTICS)
       WHERE $NodeId <> '')
    FROM [mdlProductPredictsSearch]
    PREDICTION JOIN
      (SELECT '23AX0DZ' AS [product browsed]) AS t
    ON [mdlProductPredictsSearch].[product browsed] = t.[product browsed]
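
For readers who want to sanity-check what such a query should return, here is a minimal Python sketch that ranks search terms for one browsed product using the three sample rows from the post. It is only a plain co-occurrence count under simple assumptions: the function name `search_terms_for_product` is hypothetical, and the numbers are not the values the Microsoft Association algorithm reports as $Probability or $AdjustedProbability. Note that it filters on the input product first and only then ranks, which is the order the post expected the query to follow.

```python
from collections import defaultdict

# Sample rows from the post: (user_session, search_term, product_browsed)
rows = [
    (90234, "white", "23AX0DZ"),
    (90234, "white", "039POOZ"),
    (34333, "light", "23AX0DZ"),
]


def search_terms_for_product(rows, product, top_n=50):
    """Rank search terms by how many sessions that browsed `product` used them.

    Returns (term, support, share) tuples, where support is the number of
    matching sessions and share is support divided by all sessions that
    browsed the product. Hypothetical helper, not part of any Microsoft API.
    """
    # Filter first: only sessions that browsed the given product matter.
    sessions_with_product = {s for s, _, p in rows if p == product}

    # Collect, per search term, the matching sessions that used it.
    term_sessions = defaultdict(set)
    for session, term, _ in rows:
        if session in sessions_with_product:
            term_sessions[term].add(session)

    total = len(sessions_with_product)
    ranked = [
        (term, len(sessions), len(sessions) / total)
        for term, sessions in term_sessions.items()
    ]
    # Highest support first; ties broken alphabetically for stable output.
    ranked.sort(key=lambda r: (-r[1], r[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    for term, support, share in search_terms_for_product(rows, "23AX0DZ"):
        print(term, support, share)
```

The `top_n` argument plays the same role as the 50 passed to the prediction function in the query above: it caps how many ranked terms come back.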

  • Yeah, the Microsoft documentation does mention that this algorithm can have performance problems.

    I am copying the TechNet documentation here:

    Performance

    The process of creating itemsets and counting correlations can be time-consuming. Although the Microsoft Association Rules algorithm uses optimization techniques to save space and make processing faster, you should know that performance issues might occur under conditions such as the following:

    Data set is large with many individual items.

    Minimum itemset size is set too low.

    To minimize processing time and reduce the complexity of the itemsets, you might try grouping related items by categories before you analyze the data.
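
The grouping step suggested above could be sketched as follows in Python, applied to the source rows before they reach the mining model. This is only an illustration under assumptions: the `CATEGORY_OF` lookup and its category name are invented, and `group_by_category` is a hypothetical helper, not something the documentation provides.

```python
# Hypothetical product-to-category lookup; the category name is invented.
CATEGORY_OF = {"23AX0DZ": "lighting", "039POOZ": "lighting"}

# Sample rows from the thread: (user_session, search_term, product_browsed)
rows = [
    (90234, "white", "23AX0DZ"),
    (90234, "white", "039POOZ"),
    (34333, "light", "23AX0DZ"),
]


def group_by_category(rows, category_of, default="other"):
    """Replace each product ID with its category and drop duplicate rows."""
    seen = set()
    grouped = []
    for session, term, product in rows:
        row = (session, term, category_of.get(product, default))
        if row not in seen:  # the same session/term/category counts once
            seen.add(row)
            grouped.append(row)
    return grouped


if __name__ == "__main__":
    for row in group_by_category(rows, CATEGORY_OF):
        print(row)
```

With this lookup the two distinct product IDs collapse into one category, so the model would see fewer distinct items and fewer candidate itemsets, at the cost of product-level detail.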

  • Actually, the processing isn't the problem; it's the queries. I haven't seen any other documentation on query latency, but there just don't seem to be many people using DMX in the wild to provide feedback.

  • Good article, it helped me a lot.
