How to find products commonly purchased together

  • Hi all,

    I feel certain this should be a common query, just maybe not using the right search.
    Here  goes: I have an ordered items table. How can i determine items that are commonly sold together?
    In the following example, i want to see that items 3 and 4 are commonly sold together, and 2, 3, and 4 are also sometimes ordered together.
    There can be 2-n items in an order.

    Any help will be appreciated!

    Thanks,

    Paul

    CREATE TABLE #tmp_data
        (OrderNo    INTEGER,
        OrderItem    VARCHAR(8));

    INSERT    #tmp_data
    SELECT    1, 'Item1'
    UNION ALL SELECT 1, 'Item2'
    UNION ALL SELECT 1, 'Item3'
    UNION ALL SELECT 1, 'Item4'
    UNION ALL SELECT 2, 'Item2'
    UNION ALL SELECT 2, 'Item3'
    UNION ALL SELECT 2, 'Item4'
    UNION ALL SELECT 3, 'ItemA'
    UNION ALL SELECT 3, 'Item3'
    UNION ALL SELECT 3, 'Item4'
    UNION ALL SELECT 4, 'ItemB'
    UNION ALL SELECT 4, 'Item3'
    UNION ALL SELECT 4, 'Item4';

  • You're looking for something called Market Basket Analysis. 
    This article might be helpful

  • Yup, sounds like what I want to do. Except in SQL Server 2008. I don't have access to DW.

  • A data warehouse is just another database with a star schema instead of a standard 3NF schema.

  • Ah sorry, I thought thew WITH(DISTRIBUTION = xxx) was a required part of the tools for the analysis.

    I'll have a go at this, thanks!

  • This is an fun and interesting problem. If my solution is what you are looking for (or close enough) then I would suggest this index:
    CREATE CLUSTERED INDEX cl_nu__temp_data__OrderItem ON #tmp_data(OrderItem);

    Solution:
    SELECT getTop1.searchItem, getTop1.orderItem, getTop1.total, getTop1.countRank
    FROM
    (
    SELECT
      searchItem = tx.OrderItem,
      orderItem = itvf_itempairs.OrderItem,
      total  = itvf_itempairs.total,
      countRank = DENSE_RANK() OVER
          (PARTITION BY tx.OrderItem ORDER BY -itvf_itempairs.total)
    FROM
    (
      SELECT t.OrderItem
      FROM #tmp_data AS t
      GROUP    BY t.OrderItem
    ) AS tx
    CROSS APPLY
    (
      SELECT t.OrderItem, total = COUNT(*)
      FROM #tmp_data AS t
      WHERE t.OrderNo IN
      (
      SELECT t.OrderNo
      FROM #tmp_data AS t
      WHERE EXISTS
      (
       SELECT *
       FROM #tmp_data AS t2
          WHERE t2.OrderItem = tx.orderItem
       AND  t2.OrderItem = t.orderItem
      )
      )
      AND t.OrderItem <> tx.orderItem
      GROUP BY t.OrderItem
    ) AS itvf_itempairs
    ) AS getTop1
    WHERE getTop1.countRank = 1;

    Returns:

    searchItem orderItem total       countRank
    ---------- --------- ----------- --------------------
    Item1     Item2     1         1
    Item1     Item3     1         1
    Item1     Item4     1         1
    Item2     Item3     2         1
    Item2     Item4     2         1
    Item3     Item4     4         1
    Item4     Item3     4         1
    ItemA     Item3     1         1
    ItemA     Item4     1         1
    ItemB     Item3     1         1
    ItemB     Item4     1          1

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • That's pretty impressive Alan, and applied to my real-world data, the results look something like expected.

    Thank you very much.

Viewing 7 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply