finding groups of related data

  • I would please like an answer to a problem I could not figure out a T-SQL query solution, although it seems trivial.

    I have a database of related pairs of items of 2 same datatype columns:

    Item 1 Item2

    A D

    A E

    B F

    C A

    C G

    C Q

    I would need a table of group name and all items in related pairs from left and right columns similar to this

    Group Id Item

    1 A

    1 D

    1 E

    1 C

    1 G

    1 Q

    2 B

    2 F

    Thank you for a hint or a solution to that problem?

    Branimir Cucek

  • I have got so far to this:

    I have got the missing pairs from C->A dependency pair

    using self joins

    SELECT rawlist_1.Item2, rawlist_2.Item1

    FROM rawlist INNER JOIN

    rawlist rawlist_1 ON rawlist.Item2 = rawlist_1.Item2 INNER JOIN

    rawlist rawlist_2 ON rawlist_1.Item1 = rawlist_2.Item2

    WHERE (rawlist_1.Item2 = 'A')

    then UNION with obvious first pairs

    SELECT Item1, Item2

    FROM rawlist

    WHERE (Item1 = 'A')

    plus UNION vice-versa:

    SELECT Item2, Item1

    FROM rawlist

    WHERE (Item2 = 'A')

    and then extract group members.

    But it seems I should iterate in a recursive way for all other subdependencies

    and for every item.

    That is how it got complicated.

    Any clues or comments on that?

  • Hmm, interesting problem, did not want to spend much time on it, so here is a solution using a temporary table:

    GO

    WITH cte

    AS ( SELECT i.Item1

    , i.Item2

    FROM items AS i

    WHERE Item1 <= Item2

    UNION

    SELECT i.Item2 AS Item1

    , i.Item1 AS Item2

    FROM items AS i

    WHERE i.Item2 <= i.Item1

    UNION ALL

    SELECT i1.Item1

    , i2.Item2

    FROM items AS i1

    JOIN items AS i2 ON i1.Item2 = i2.Item1

    WHERE i1.Item1 <= i2.Item2

    )

    SELECT DISTINCT

    c1.Item1

    , dense_rank() OVER ( ORDER BY c1.Item1 ) AS treeNr

    INTO #rootItems

    FROM cte AS c1

    WHERE NOT EXISTS ( SELECT c2.Item1

    FROM cte AS c2

    WHERE c2.Item2 = c1.Item1 )

    GO

    WITH cte

    AS ( SELECT i.Item1

    , i.treeNr

    FROM #rootItems AS i

    UNION ALL

    SELECT i2.Item2

    , i1.treeNr AS treeNr

    FROM cte AS i1

    JOIN items AS i2 ON i1.Item1 = i2.Item1

    AND i2.Item1 <= i2.Item2

    UNION ALL

    SELECT i2.Item1

    , i1.treeNr AS treeNr

    FROM cte AS i1

    JOIN items AS i2 ON i1.Item1 = i2.Item2

    AND i2.Item1 > i2.Item2

    )

    SELECT *

    FROM cte

    ORDER BY treeNr

    , Item1

    To set up the whole thing:

    create table items (Item1 char, Item2 char)

    insert into items values('A','D')

    insert into items values('A','E')

    insert into items values('B','F')

    insert into items values('C','A')

    insert into items values('C','G')

    insert into items values('C','Q')

    Regards,

    Andras


    Andras Belokosztolszki, MCPD, PhD
    GoldenGate Software

  • Ok, and here is a solution without a temptable :), just a CTE expression. Although I'm sure there is a more efficient way to do this 🙂

    WITH cte

    AS ( SELECT i.Item1

    , i.Item2

    FROM items AS i

    WHERE Item1 <= Item2

    UNION

    SELECT i.Item2 AS Item1

    , i.Item1 AS Item2

    FROM items AS i

    WHERE i.Item2 <= i.Item1

    UNION ALL

    SELECT i1.Item1

    , i2.Item2

    FROM items AS i1

    JOIN items AS i2 ON i1.Item2 = i2.Item1

    WHERE i1.Item1 <= i2.Item2

    ),

    rootnodes_cte as

    (SELECT DISTINCT

    c1.Item1

    , dense_rank() OVER ( ORDER BY c1.Item1 ) AS treeNr

    FROM cte AS c1

    WHERE NOT EXISTS ( SELECT c2.Item1

    FROM cte AS c2

    WHERE c2.Item2 = c1.Item1 )

    ),

    cte2

    AS ( SELECT i.Item1

    , i.treeNr

    FROM rootnodes_cte AS i

    UNION ALL

    SELECT i2.Item2

    , i1.treeNr AS treeNr

    FROM cte2 AS i1

    JOIN items AS i2 ON i1.Item1 = i2.Item1

    AND i2.Item1 <= i2.Item2

    UNION ALL

    SELECT i2.Item1

    , i1.treeNr AS treeNr

    FROM cte2 AS i1

    JOIN items AS i2 ON i1.Item1 = i2.Item2

    AND i2.Item1 > i2.Item2

    )

    SELECT *

    FROM cte2

    ORDER BY treeNr, Item1

    The result is:

    Item1 treeNr

    ----- --------------------

    A 1

    C 1

    D 1

    E 1

    G 1

    Q 1

    B 2

    F 2

    Regards,

    Andras


    Andras Belokosztolszki, MCPD, PhD
    GoldenGate Software

  • Mine is simpler, but it ends up reversing the group order on that sample of data:

    create table #T (

    Item1 char(1) not null,

    Item2 char(1) not null,

    constraint PK_T primary key (item1, item2))

    insert into #t (item1, item2)

    select 'A','D' union all

    select 'A','E' union all

    select 'B','F' union all

    select 'C','A' union all

    select 'C','G' union all

    select 'C','Q'

    select distinct dense_rank() over (order by item1), item1

    from #t

    ;with

    CTE (GroupID, Item) as

    (select distinct dense_rank() over (order by item1), item1

    from #t

    where item1 not in

    (select item2

    from #t)

    union all

    select groupid, item2

    from #t

    inner join cte

    on #t.item1 = cte.item)

    select *

    from cte

    order by groupid, item

    GroupIDItem

    1B

    1F

    2A

    2C

    2D

    2E

    2G

    2Q

    If the sequence of the groups doesn't matter, then this is definitely simpler code.

    To get the groups to sequence as desired, I'd have to add two more CTEs to it. One to assign the group IDs to the first entry in each group, one to join the data together and replace the original group ID from the first CTE. Easy enough to do, but is it unnecessary complexity or does it actually matter?

    This query will work with up to 100 levels of grouping. If, for example, D appeared in Item1 with Z in Item2 in another row of the table, it would add that to group 2. Setting maxrecursion would allow that limit to be adjusted up or down.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Is this a correct interpretation of your original sample data?

    A - D

    / E C - Q

    |

    G

    B

    F

    And if I understand you corrent, you want all interconnected subgraphs ordered?


    N 56°04'39.16"
    E 12°55'05.25"

  • What will then happen when you add the sample data

    SELECT'D', 'Q' UNION ALL

    and remove the sample data

    SELECT'A', 'D' UNION ALL

    The graph then looks like

    A D

    / \ E C - Q

    |

    G

    B

    F

    and the result should be similar to

    Item1treeNr

    A1

    C1

    D1

    E1

    G1

    Q1

    B2

    F2

    and NOT look like this?

    Item1treeNr

    A1

    C1

    E1

    G1

    Q1

    B2

    F2

    D3

    Q3


    N 56°04'39.16"
    E 12°55'05.25"

  • Not that goodlooking but it will do the job.

    DECLARE@Items TABLE (RowID INT IDENTITY(1, 1), Item1 CHAR, Item2 CHAR)

    INSERT@Items

    SELECT'D', 'Q' UNION ALL

    SELECT'A', 'E' UNION ALL

    SELECT'B', 'F' UNION ALL

    SELECT'C', 'A' UNION ALL

    SELECT'C', 'G' UNION ALL

    SELECT'C', 'Q'

    /*

    A D

    / \ E C - Q

    |

    G

    B

    F

    */

    DECLARE@Yak TABLE

    (

    Item CHAR PRIMARY KEY CLUSTERED,

    Grp INT

    )

    INSERT@Yak

    (

    Item,

    Grp

    )

    SELECTItem1,

    MIN(RowID)

    FROM(

    SELECTItem1,

    RowID

    FROM@Items

    UNION ALL

    SELECTItem2,

    RowID

    FROM@Items

    ) AS d

    GROUP BYItem1

    DECLARE@RowID INT

    SELECT@RowID = MAX(RowID)

    FROM@Items

    WHILE @RowID IS NOT NULL

    BEGIN

    UPDATEy

    SETy.Grp = x.Seq

    FROM@Yak AS y

    INNER JOIN(

    SELECTy.Item,

    y.Grp,

    MIN(y.Grp) OVER () AS Seq

    FROM@Items AS i

    INNER JOIN@Yak AS y ON y.Item IN (i.Item1, i.Item2)

    WHEREi.RowID = @RowID

    ) AS x ON x.Item = y.Item

    OR x.Grp = y.Grp

    SELECT@RowID = MAX(RowID)

    FROM@Items

    WHERERowID < @RowID

    END

    SELECTItem,

    Grp

    FROM@Yak

    ORDER BYGrp,

    Item


    N 56°04'39.16"
    E 12°55'05.25"

  • I think that this is the "connected subgraphs" problem from my Discrete Algorithms class in college. Of course it was in Pascal, not SQL, and it was 30 years ago...

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • This example seems very efficient:

    DECLARE @Items TABLE (ID INT IDENTITY(1, 1), Item1 CHAR, Item2 CHAR, Grp bigint)

    INSERT @Items (Item1, Item2)

    SELECT 'D', 'Q' UNION ALL

    SELECT 'A', 'E' UNION ALL

    SELECT 'B', 'F' UNION ALL

    SELECT 'C', 'A' UNION ALL

    SELECT 'C', 'G' UNION ALL

    SELECT 'C', 'Q' UNION ALL

    SELECT 'H', 'I' UNION ALL

    SELECT 'I', 'J' UNION ALL

    SELECT 'J', 'K' UNION ALL

    SELECT 'K', 'H' UNION ALL

    SELECT 'L', 'M' UNION ALL

    SELECT 'M', 'N' UNION ALL

    SELECT 'N', 'O' UNION ALL

    SELECT 'O', 'P' UNION ALL

    SELECT 'P', 'R' UNION ALL

    SELECT 'R', 'S' UNION ALL

    SELECT 'S', 'T' UNION ALL

    SELECT 'T', 'U'

    /*

    A D

    / \ E C - Q

    |

    G

    B

    F

    I

    / H J

    \ /

    K

    L-M-N-O-P-R-S-T-U

    */

    --======Make the Connected table

    Create Table #Connected (Node char(1) NOT NULL, ConnectedTo char(1) NOT NULL)

    ALTER TABLE #Connected ADD CONSTRAINT [PK_#Nodes] PRIMARY KEY CLUSTERED (Node ASC, ConnectedTo ASC)

    --start with the first-order connections:

    INSERT into #Connected

    Select Item1, Item2 from @Items

    UNION

    Select Item2, Item1 from @Items

    UNION

    Select Item1, Item1 from @Items

    Declare @rows int

    Set @rows = 1

    --====== Loop until done

    While @rows > 0

    BEGIN

    --======Now Extend it one Order:

    Insert into #Connected

    Select distinct C1.Node, C2.ConnectedTo

    From #Connected C1

    Join #Connected C2

    ON C1.ConnectedTo = C2.Node--transitive property

    Where NOT EXISTS (Select * From #Connected C11

    Where C11.Node = C1.Node

    And C11.ConnectedTo = C2.ConnectedTo

    )

    Set @rows = @@RowCount

    END

    --======Report Results

    Select Node, Ascii(Min(Upper(ConnectedTo))) - 47 as [Group]

    From #Connected

    Group By Node

    drop table #Connected

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Thank you all for giving me an efficient answer to the real problem.

    However, there are some other questions.

    It seems that the group naming after the ASCII value of the lowest named item is inconvenient.

    The letters in the rows in real database represent different items(car parts) as 6-12 digit varchars. There are just

    represented as letters here for clearer picture of the problem.

    Anyway, taking from what we have, it is easier if you just take a MIN(ConnectedTo) element as a unique group name.

    There is no constraint on the name of the group - it just need to be unique.

    I have adopted this and begun to think about maintenance of the living cross-reference table.

    Indeed, if you want to join two groups into one you would just need to insert a new pair

    with items from different groups and rerun the query and the cross reference table would rearange itself.

    To add or remove an item from group is trivial, but the big problem is when you want to remove the item which is

    actually a group name in the Group column ( MIN(ConnectedTo) element from the original query). You would need

    some dummy element pair which will now qualify as group name ( MIN(ConnectedTo) element ) in the query rerun and then

    remove the pair "dummy element"-"last group column". But after a while i think the cross reference table would be a mess.

    Or Maybe if there is a really some unique name of the group indepedent from item names generated in the first place it would be no problem but how can this be accomplished and maintained?

    Which is a better way to pursue?

    Branimir Cucek

  • Is there a physical relationship between the items? Like, PartA is a component of PartB? (Classic bill of materials.) Or is the connection between items arbitrary? ("Buy one shock absorber, get two spark plugs free!" kind of stuff.)

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • No, it is not classical Bill of Materials. It is more like you sell shock absorbers from many manufacturers for various types of cars , and every manufacturer/brand has it own catalog part number for a particular car. Also, the parts from the same brands can also have more than one number (old/new number with different prices, quality I or II etc.).

    The aim is that using only one number you obtain from one catalog or the brand number customer gives you, you have to know what other shock absorbers of different brands you have on stock for the particular car that fits so that the customer/seller can choose without searching different catalogs for various brands.

    Hope it helps.

    Branimir Cucek

  • So, when this returns multiple groups, what is it actually giving you?

    I'm envisioning a proc that has a part number as an input parameter, and which gives you all the equivalent parts. Is that what you're looking for? Or is this more of a report of everything in the system all at once?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • That is it,

    ... a procedure that has a part number as an input parameter, and which gives you all the equivalent parts.

    What, in your opinion, is a best strategy for maintenance of such a table

    Branimir Cucek

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply