finding groups of related data

Question

Post reply

finding groups of related data

branko-496001

SSC Journeyman

Points: 76
More actions
June 4, 2008 at 1:33 am

#208007

I would please like an answer to a problem I could not figure out a T-SQL query solution, although it seems trivial.
I have a database of related pairs of items of 2 same datatype columns:
Item 1 Item2
A D
A E
B F
C A
C G
C Q
I would need a table of group name and all items in related pairs from left and right columns similar to this
Group Id Item
1 A
1 D
1 E
1 C
1 G
1 Q
2 B
2 F
Thank you for a hint or a solution to that problem?
Branimir Cucek

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply

branko-496001 SSC Journeyman Points: 76 More actions · Answer 1

I have got so far to this:

I have got the missing pairs from C->A dependency pair

using self joins

SELECT rawlist_1.Item2, rawlist_2.Item1

FROM rawlist INNER JOIN

rawlist rawlist_1 ON rawlist.Item2 = rawlist_1.Item2 INNER JOIN

rawlist rawlist_2 ON rawlist_1.Item1 = rawlist_2.Item2

WHERE (rawlist_1.Item2 = 'A')

then UNION with obvious first pairs

SELECT Item1, Item2

FROM rawlist

WHERE (Item1 = 'A')

plus UNION vice-versa:

SELECT Item2, Item1

FROM rawlist

WHERE (Item2 = 'A')

and then extract group members.

But it seems I should iterate in a recursive way for all other subdependencies

and for every item.

That is how it got complicated.

Any clues or comments on that?

Andras Belokosztolszki SSC-Insane Points: 22031 More actions · Answer 2

Hmm, interesting problem, did not want to spend much time on it, so here is a solution using a temporary table:

GO

WITH cte

AS ( SELECT i.Item1

, i.Item2

FROM items AS i

WHERE Item1 <= Item2

UNION

SELECT i.Item2 AS Item1

, i.Item1 AS Item2

FROM items AS i

WHERE i.Item2 <= i.Item1

UNION ALL

SELECT i1.Item1

, i2.Item2

FROM items AS i1

JOIN items AS i2 ON i1.Item2 = i2.Item1

WHERE i1.Item1 <= i2.Item2

)

SELECT DISTINCT

c1.Item1

, dense_rank() OVER ( ORDER BY c1.Item1 ) AS treeNr

INTO #rootItems

FROM cte AS c1

WHERE NOT EXISTS ( SELECT c2.Item1

FROM cte AS c2

WHERE c2.Item2 = c1.Item1 )

GO

WITH cte

AS ( SELECT i.Item1

, i.treeNr

FROM #rootItems AS i

UNION ALL

SELECT i2.Item2

, i1.treeNr AS treeNr

FROM cte AS i1

JOIN items AS i2 ON i1.Item1 = i2.Item1

AND i2.Item1 <= i2.Item2

UNION ALL

SELECT i2.Item1

, i1.treeNr AS treeNr

FROM cte AS i1

JOIN items AS i2 ON i1.Item1 = i2.Item2

AND i2.Item1 > i2.Item2

)

SELECT *

FROM cte

ORDER BY treeNr

, Item1

To set up the whole thing:

create table items (Item1 char, Item2 char)

insert into items values('A','D')

insert into items values('A','E')

insert into items values('B','F')

insert into items values('C','A')

insert into items values('C','G')

insert into items values('C','Q')

Regards,

Andras

Andras Belokosztolszki, MCPD, PhD
GoldenGate Software

Andras Belokosztolszki SSC-Insane Points: 22031 More actions · Answer 3

Ok, and here is a solution without a temptable :), just a CTE expression. Although I'm sure there is a more efficient way to do this 🙂

WITH cte

AS ( SELECT i.Item1

, i.Item2

FROM items AS i

WHERE Item1 <= Item2

UNION

SELECT i.Item2 AS Item1

, i.Item1 AS Item2

FROM items AS i

WHERE i.Item2 <= i.Item1

UNION ALL

SELECT i1.Item1

, i2.Item2

FROM items AS i1

JOIN items AS i2 ON i1.Item2 = i2.Item1

WHERE i1.Item1 <= i2.Item2

),

rootnodes_cte as

(SELECT DISTINCT

c1.Item1

, dense_rank() OVER ( ORDER BY c1.Item1 ) AS treeNr

FROM cte AS c1

WHERE NOT EXISTS ( SELECT c2.Item1

FROM cte AS c2

WHERE c2.Item2 = c1.Item1 )

),

cte2

AS ( SELECT i.Item1

, i.treeNr

FROM rootnodes_cte AS i

UNION ALL

SELECT i2.Item2

, i1.treeNr AS treeNr

FROM cte2 AS i1

JOIN items AS i2 ON i1.Item1 = i2.Item1

AND i2.Item1 <= i2.Item2

UNION ALL

SELECT i2.Item1

, i1.treeNr AS treeNr

FROM cte2 AS i1

JOIN items AS i2 ON i1.Item1 = i2.Item2

AND i2.Item1 > i2.Item2

)

SELECT *

FROM cte2

ORDER BY treeNr, Item1

The result is:

Item1 treeNr

----- --------------------

A 1

C 1

D 1

E 1

G 1

Q 1

B 2

F 2

Regards,

Andras

Andras Belokosztolszki, MCPD, PhD
GoldenGate Software

GSquared SSC Guru Points: 260824 More actions · Answer 4

Mine is simpler, but it ends up reversing the group order on that sample of data:

create table #T (

Item1 char(1) not null,

Item2 char(1) not null,

constraint PK_T primary key (item1, item2))

insert into #t (item1, item2)

select 'A','D' union all

select 'A','E' union all

select 'B','F' union all

select 'C','A' union all

select 'C','G' union all

select 'C','Q'

select distinct dense_rank() over (order by item1), item1

from #t

;with

CTE (GroupID, Item) as

(select distinct dense_rank() over (order by item1), item1

from #t

where item1 not in

(select item2

from #t)

union all

select groupid, item2

from #t

inner join cte

on #t.item1 = cte.item)

select *

from cte

order by groupid, item

GroupIDItem

1B

1F

2A

2C

2D

2E

2G

2Q

If the sequence of the groups doesn't matter, then this is definitely simpler code.

To get the groups to sequence as desired, I'd have to add two more CTEs to it. One to assign the group IDs to the first entry in each group, one to join the data together and replace the original group ID from the first CTE. Easy enough to do, but is it unnecessary complexity or does it actually matter?

This query will work with up to 100 levels of grouping. If, for example, D appeared in Item1 with Z in Item2 in another row of the table, it would add that to group 2. Setting maxrecursion would allow that limit to be adjusted up or down.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

SwePeso SSC-Dedicated Points: 39758 More actions · Answer 5

Is this a correct interpretation of your original sample data?

A - D

/ E C - Q

|

G

B

F

And if I understand you corrent, you want all interconnected subgraphs ordered?

N 56°04'39.16"
E 12°55'05.25"

SwePeso SSC-Dedicated Points: 39758 More actions · Answer 6

What will then happen when you add the sample data

SELECT'D', 'Q' UNION ALL

and remove the sample data

SELECT'A', 'D' UNION ALL

The graph then looks like

A D

/ \ E C - Q

|

G

B

F

and the result should be similar to

Item1treeNr

A1

C1

D1

E1

G1

Q1

B2

F2

and NOT look like this?

Item1treeNr

A1

C1

E1

G1

Q1

B2

F2

D3

Q3

N 56°04'39.16"
E 12°55'05.25"

SwePeso SSC-Dedicated Points: 39758 More actions · Answer 7

Not that goodlooking but it will do the job.

DECLARE@Items TABLE (RowID INT IDENTITY(1, 1), Item1 CHAR, Item2 CHAR)

INSERT@Items

SELECT'D', 'Q' UNION ALL

SELECT'A', 'E' UNION ALL

SELECT'B', 'F' UNION ALL

SELECT'C', 'A' UNION ALL

SELECT'C', 'G' UNION ALL

SELECT'C', 'Q'

/*

A D

/ \ E C - Q

|

G

B

F

*/

DECLARE@Yak TABLE

(

Item CHAR PRIMARY KEY CLUSTERED,

Grp INT

)

INSERT@Yak

(

Item,

Grp

)

SELECTItem1,

MIN(RowID)

FROM(

SELECTItem1,

RowID

FROM@Items

UNION ALL

SELECTItem2,

RowID

FROM@Items

) AS d

GROUP BYItem1

DECLARE@RowID INT

SELECT@RowID = MAX(RowID)

FROM@Items

WHILE @RowID IS NOT NULL

BEGIN

UPDATEy

SETy.Grp = x.Seq

FROM@Yak AS y

INNER JOIN(

SELECTy.Item,

y.Grp,

MIN(y.Grp) OVER () AS Seq

FROM@Items AS i

INNER JOIN@Yak AS y ON y.Item IN (i.Item1, i.Item2)

WHEREi.RowID = @RowID

) AS x ON x.Item = y.Item

OR x.Grp = y.Grp

SELECT@RowID = MAX(RowID)

FROM@Items

WHERERowID < @RowID

END

SELECTItem,

Grp

FROM@Yak

ORDER BYGrp,

Item

N 56°04'39.16"
E 12°55'05.25"

RBarryYoung SSC Guru Points: 143329 More actions · Answer 8

I think that this is the "connected subgraphs" problem from my Discrete Algorithms class in college. Of course it was in Pascal, not SQL, and it was 30 years ago...

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

RBarryYoung SSC Guru Points: 143329 More actions · Answer 9

This example seems very efficient:

DECLARE @Items TABLE (ID INT IDENTITY(1, 1), Item1 CHAR, Item2 CHAR, Grp bigint)

INSERT @Items (Item1, Item2)

SELECT 'D', 'Q' UNION ALL

SELECT 'A', 'E' UNION ALL

SELECT 'B', 'F' UNION ALL

SELECT 'C', 'A' UNION ALL

SELECT 'C', 'G' UNION ALL

SELECT 'C', 'Q' UNION ALL

SELECT 'H', 'I' UNION ALL

SELECT 'I', 'J' UNION ALL

SELECT 'J', 'K' UNION ALL

SELECT 'K', 'H' UNION ALL

SELECT 'L', 'M' UNION ALL

SELECT 'M', 'N' UNION ALL

SELECT 'N', 'O' UNION ALL

SELECT 'O', 'P' UNION ALL

SELECT 'P', 'R' UNION ALL

SELECT 'R', 'S' UNION ALL

SELECT 'S', 'T' UNION ALL

SELECT 'T', 'U'

/*

A D

/ \ E C - Q

|

G

B

F

I

/ H J

\ /

K

L-M-N-O-P-R-S-T-U

*/

--======Make the Connected table

Create Table #Connected (Node char(1) NOT NULL, ConnectedTo char(1) NOT NULL)

ALTER TABLE #Connected ADD CONSTRAINT [PK_#Nodes] PRIMARY KEY CLUSTERED (Node ASC, ConnectedTo ASC)

--start with the first-order connections:

INSERT into #Connected

Select Item1, Item2 from @Items

UNION

Select Item2, Item1 from @Items

UNION

Select Item1, Item1 from @Items

Declare @rows int

Set @rows = 1

--====== Loop until done

While @rows > 0

BEGIN

--======Now Extend it one Order:

Insert into #Connected

Select distinct C1.Node, C2.ConnectedTo

From #Connected C1

Join #Connected C2

ON C1.ConnectedTo = C2.Node--transitive property

Where NOT EXISTS (Select * From #Connected C11

Where C11.Node = C1.Node

And C11.ConnectedTo = C2.ConnectedTo

)

Set @rows = @@RowCount

END

--======Report Results

Select Node, Ascii(Min(Upper(ConnectedTo))) - 47 as [Group]

From #Connected

Group By Node

drop table #Connected

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

branko-496001 SSC Journeyman Points: 76 More actions · Answer 10

Thank you all for giving me an efficient answer to the real problem.

However, there are some other questions.

It seems that the group naming after the ASCII value of the lowest named item is inconvenient.

The letters in the rows in real database represent different items(car parts) as 6-12 digit varchars. There are just

represented as letters here for clearer picture of the problem.

Anyway, taking from what we have, it is easier if you just take a MIN(ConnectedTo) element as a unique group name.

There is no constraint on the name of the group - it just need to be unique.

I have adopted this and begun to think about maintenance of the living cross-reference table.

Indeed, if you want to join two groups into one you would just need to insert a new pair

with items from different groups and rerun the query and the cross reference table would rearange itself.

To add or remove an item from group is trivial, but the big problem is when you want to remove the item which is

actually a group name in the Group column ( MIN(ConnectedTo) element from the original query). You would need

some dummy element pair which will now qualify as group name ( MIN(ConnectedTo) element ) in the query rerun and then

remove the pair "dummy element"-"last group column". But after a while i think the cross reference table would be a mess.

Or Maybe if there is a really some unique name of the group indepedent from item names generated in the first place it would be no problem but how can this be accomplished and maintained?

Which is a better way to pursue?

Branimir Cucek

GSquared SSC Guru Points: 260824 More actions · Answer 11

Is there a physical relationship between the items? Like, PartA is a component of PartB? (Classic bill of materials.) Or is the connection between items arbitrary? ("Buy one shock absorber, get two spark plugs free!" kind of stuff.)

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

branko-496001 SSC Journeyman Points: 76 More actions · Answer 12

No, it is not classical Bill of Materials. It is more like you sell shock absorbers from many manufacturers for various types of cars , and every manufacturer/brand has it own catalog part number for a particular car. Also, the parts from the same brands can also have more than one number (old/new number with different prices, quality I or II etc.).

The aim is that using only one number you obtain from one catalog or the brand number customer gives you, you have to know what other shock absorbers of different brands you have on stock for the particular car that fits so that the customer/seller can choose without searching different catalogs for various brands.

Hope it helps.

Branimir Cucek

GSquared SSC Guru Points: 260824 More actions · Answer 13

So, when this returns multiple groups, what is it actually giving you?

I'm envisioning a proc that has a part number as an input parameter, and which gives you all the equivalent parts. Is that what you're looking for? Or is this more of a report of everything in the system all at once?

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

branko-496001 SSC Journeyman Points: 76 More actions · Answer 14

That is it,

... a procedure that has a part number as an input parameter, and which gives you all the equivalent parts.

What, in your opinion, is a best strategy for maintenance of such a table

Branimir Cucek