T-SQL Puzzler - identify sets that have the same items (where set ID and items in same table)

  • I am struggling to come up with a set-based solution for this problem (i.e. that doesn't involve loops/cursors) .... any help gratefully appreciated!

    A table contains items (identified by an ItemCode) and the set they belong to (identified by a SetId). Here is some sample data:

    SetIdItemCode

    1A

    1B

    24

    28

    26

    310

    312

    410

    418

    58

    612

    716

    816

    916

    10A

    10B

    114

    116

    118

    You can see that there are some sets that have the same members:

    - 1 and 10

    - 2 and 11

    - 7, 8 & 9

    What I want to do is identify the sets that have the same members, by giving them the same ID in another column called UniqueSetId.

    Any ideas?

  • This is a possible option.

    CREATE TABLE SetsItems(

    SetId int,

    ItemCode varchar(10))

    INSERT INTO SetsItems

    SELECT 1,'A' UNION ALL

    SELECT 1,'B' UNION ALL

    SELECT 2,'4' UNION ALL

    SELECT 2,'8' UNION ALL

    SELECT 2,'6' UNION ALL

    SELECT 3,'10' UNION ALL

    SELECT 3,'12' UNION ALL

    SELECT 4,'10' UNION ALL

    SELECT 4,'18' UNION ALL

    SELECT 5,'8' UNION ALL

    SELECT 6,'12' UNION ALL

    SELECT 7,'16' UNION ALL

    SELECT 8,'16' UNION ALL

    SELECT 9,'16' UNION ALL

    SELECT 10,'A' UNION ALL

    SELECT 10,'B' UNION ALL

    SELECT 11,'4' UNION ALL

    SELECT 11,'6' UNION ALL

    SELECT 11,'8';

    WITH cteSets AS(

    SELECT DISTINCT SetID,

    CAST( (SELECT ',' + ItemCode

    FROM SetsItems i

    WHERE s.SetId = i.SetId

    ORDER BY ItemCode

    FOR XML PATH('')) AS varchar(max)) Items

    FROM SetsItems s

    )

    SELECT *,

    RANK() OVER( ORDER BY Items) Option1,

    DENSE_RANK() OVER( ORDER BY Items) Option2

    FROM cteSets

    ORDER BY Option1

    GO

    DROP TABLE SetsItems

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Here's another (using Luis's setup)

    WITH Results(SetId1,SetId2) AS (

    SELECT a.SetId,b.SetId

    FROM SetsItems a

    INNER JOIN SetsItems b ON b.SetId > a.SetId

    AND b.ItemCode = a.ItemCode

    GROUP BY a.SetId,b.SetId

    HAVING COUNT(*) = (SELECT COUNT(*) FROM SetsItems c WHERE c.SetId = a.SetId)

    AND COUNT(*) = (SELECT COUNT(*) FROM SetsItems d WHERE d.SetId = b.SetId))

    SELECT DISTINCT ca.SetId,

    DENSE_RANK() OVER(ORDER BY a.SetId1) AS UniqueSetId

    FROM Results a

    CROSS APPLY(VALUES(a.SetId1),(a.SetId2)) ca(SetId)

    WHERE NOT EXISTS(SELECT * FROM Results b WHERE b.SetId2 = a.SetId1)

    ORDER BY UniqueSetId,SetId;

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Both solutions worked, thanks very much. For some reason Luis's is much faster on my large data set (millions of rows).

  • I usually do something like this:

    SELECTSetId, CHECKSUM_AGG(CHECKSUM(ItemCode)) ch

    FROMSetsItems

    GROUP BY SetId

    ORDER BY ch

    If you're **really** unlucky you might get a false positive because CHECKSUM isn't perfect, but i doubt it. You can do a HASHBYTES on the ItemCode to pretty much avoid any collisions.

  • This is a simple case of Relational Division, of which I am particular fond of.

    The result isSetIDSetID

    110

    211

    78

    79

    89By using this piece of codeSELECTkc.SetID,

    nc.SetID

    FROM(

    SELECTSetID,

    COUNT(*) AS cnt

    FROMdbo.SetsItems

    GROUP BYSetID

    ) AS kc

    INNER JOIN(

    SELECTSetID,

    COUNT(*) AS cnt

    FROMdbo.SetsItems

    GROUP BYSetID

    ) AS nc ON nc.cnt = kc.cnt

    AND nc.SetID > kc.SetID

    INNER JOINdbo.SetsItems AS t ON t.SetID = kc.SetID

    INNER JOINdbo.SetsItems AS n ON n.SetID = nc.SetID

    WHEREt.ItemCode = n.ItemCode

    GROUP BYkc.SetID,

    nc.SetID

    HAVINGCOUNT(*) = MIN(nc.cnt);


    N 56°04'39.16"
    E 12°55'05.25"

  • -- SwePeso - Preaggregation/filtering step

    SELECTSetID,

    COUNT(*) AS cnt,

    MIN(ItemCode) AS mn,

    MAX(ItemCode) AS mx,

    CHECKSUM_AGG(CHECKSUM(ItemCode)) AS chk

    INTO#Temp

    FROMdbo.SetsItems WITH (NOLOCK)

    GROUP BYSetID;

    -- SwePeso - projection step

    SELECTROW_NUMBER() OVER (ORDER BY w.chk) AS UniqueSetID,

    STUFF(f.Data, 1, 2, N'') AS SetIDs

    FROM(

    SELECTcnt,

    mn,

    mx,

    chk

    FROM#Temp

    GROUP BYcnt,

    mn,

    mx,

    chk

    HAVINGCOUNT(*) >= 2

    ) AS w

    CROSS APPLY(

    SELECT', ' + CAST(x.SetID AS NVARCHAR(12))

    FROM#Temp AS x

    WHEREx.cnt = w.cnt

    AND x.mn = w.mn

    AND x.mx = w.mx

    AND x.chk = w.chk

    FOR XMLPATH('')

    ) AS f(Data);


    N 56°04'39.16"
    E 12°55'05.25"

  • Thanks for the extra solutions. Got me reading about relational division: https://www.simple-talk.com/sql/learn-sql-server/high-performance-relational-division-in-sql-server

  • Laurence Neville (3/1/2015)


    Thanks for the extra solutions. Got me reading about relational division: https://www.simple-talk.com/sql/learn-sql-server/high-performance-relational-division-in-sql-server

    :blush:


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply