RE: Another grouping problem – SQLServerCentral

SSC Guru

Points: 186127

May 16, 2013 at 1:23 am

Cadavre (5/15/2013)
ChrisM@Work (5/15/2013)
dwain.c (5/15/2013)
Sorry to disappoint you guys but I wasn't able to come up with anything.
My gut was telling me an rCTE wouldn't do it and the set-based loop I tried wouldn't resolve it out properly either.
Of course, I could be wrong as I'm a bit out of practice writing rCTEs! 🙂
I'm not surprised, Dwain. It's quite tricky. The more obvious methods like MAX() OVER() won't work when you expect them to, in the recursive part.
I haven't tried, as I said I'm pretty busy, but I thought a combination of a MAX() OVER() and a recursive CTE was going to do the business. Ah well.
Thanks for taking a look Dwain, be nice to see where your efforts took you.

There are at least two ways of approaching this problem, number chains and sets. I think the methods reviewed so far have employed number chains, so I had a play with sets.

Allocate each number pair a set identity, so the first row (ordered by whatever) has set identity = 1. Wherever either of the two members of set 1 appear elsewhere in the table, change the set identity for the pair to 1. Run again: wherever any of the (now many) members of set 1 appear elsewhere in the table in a different set, change the set id to 1. Rinse and repeat.

Here's the first attempt:

--DROP TABLE #Temp;

-- preprocess the sample data: there isn't a dupe and a master,

-- they're simply members of a set of two.

WITH SequencedData AS (

SELECT *, SetNo = ROW_NUMBER() OVER (ORDER BY Duplicate_ID, Master_ID)

FROM SCC

)

SELECT SetNo, Duplicate_ID

INTO #Temp

FROM SequencedData

UNION ALL

SELECT SetNo, Master_ID

FROM SequencedData;

-- (102 row(s) affected)

---------------------------------------------------------------------------

-- three different ways of performing the subquery for the update.

-- three updates have to be executed with the sample data to complete.

---------------------------------------------------------------------------

UPDATE t SET SetNo = c.newset

FROM #Temp t

INNER JOIN (

SELECT d.SetNo, newset = MIN(newset)

FROM (

SELECT Duplicate_ID, SetNo, newset = MIN(SetNo) OVER (PARTITION BY Duplicate_ID)

FROM #Temp

) d

GROUP BY d.SetNo

HAVING MIN(newset) <> d.SetNo

) c ON c.SetNo = t.SetNo;

-- (96 row(s) affected)

UPDATE t SET SetNo = c.newset

FROM #Temp t

INNER JOIN (

SELECT q.SetNo, q.newset

FROM (

SELECT

d.SetNo, d.newset, rn = ROW_NUMBER() OVER(PARTITION BY d.SetNo ORDER BY d.newset)

FROM (

SELECT SetNo, newset = MIN(SetNo) OVER (PARTITION BY Duplicate_ID)

FROM #Temp

) d

) q

WHERE rn = 1

AND q.SetNo <> q.newset

) c ON c.SetNo = t.SetNo;

-- (66 row(s) affected)

UPDATE t SET SetNo = c.newset

FROM #Temp t

INNER JOIN (

SELECT newset = MIN(t2.SetNo), t1.SetNo

FROM #Temp t1

INNER JOIN #Temp t2

ON t2.Duplicate_ID = t1.Duplicate_ID

AND t2.SetNo < t1.SetNo

GROUP BY t1.SetNo

) c ON c.SetNo = t.SetNo;

-- (2 row(s) affected)

---------------------------------------------------------------------------

-- Final result set

---------------------------------------------------------------------------

SELECT t.Duplicate, x.Duplicate_ID

FROM (

SELECT SetNo, Duplicate = MIN(Duplicate_ID)

FROM #Temp

GROUP BY SetNo

) t

CROSS APPLY (

SELECT DISTINCT *

FROM #Temp ti

WHERE ti.SetNo = t.SetNo

AND ti.Duplicate_ID <> t.Duplicate

) x;

Note that the subquery code for the three updates shown is logically the same and generates the same results with the sample data - I was playing with the subqueries in an attempt to find something which might be rCTE-compatible.

Three updates have to be performed, a fourth does no work. The results are the same as Abu Dina's and it's pretty darned quick.

I haven't yet discovered a rCTE to do the same process because you can't perform aggregations on the recursive part of a rCTE, which this process requires. You can chain ordinary CTE's though, and get the whole caboodle working in a single query like this:

;WITH DataSets AS (

SELECT

SetID = ROW_NUMBER() OVER(ORDER BY Master_ID, Duplicate_ID),

Master_ID, Duplicate_ID

FROM scc

),

FirstPass AS (

SELECT SetID = ISNULL(x.NewSetID, d.SetID), d.Master_ID, d.Duplicate_ID,

RowAffected = 0 + CASE WHEN x.NewSetID IS NOT NULL THEN 1 ELSE 0 END --

FROM DataSets d

OUTER APPLY ( -- find the lowest SetID that either number belongs to

SELECT NewSetID = MIN(di.SetID)

FROM DataSets di

WHERE di.SetID < d.SetID

AND (di.Master_ID IN (d.Master_ID, d.Duplicate_ID) OR di.Duplicate_ID IN (d.Master_ID, d.Duplicate_ID))

) x

),

SecondPass AS (

SELECT SetID = ISNULL(x.NewSetID, d.SetID), d.Master_ID, d.Duplicate_ID,

RowAffected = RowAffected + CASE WHEN x.NewSetID IS NOT NULL THEN 1 ELSE 0 END

FROM FirstPass d

OUTER APPLY (

SELECT NewSetID = MIN(di.SetID)

FROM FirstPass di

WHERE di.SetID < d.SetID

AND (di.Master_ID IN (d.Master_ID, d.Duplicate_ID) OR di.Duplicate_ID IN (d.Master_ID, d.Duplicate_ID))

) x

),

ThirdPass AS (

SELECT SetID = ISNULL(x.NewSetID, d.SetID), d.Master_ID, d.Duplicate_ID,

RowAffected = RowAffected + CASE WHEN x.NewSetID IS NOT NULL THEN 1 ELSE 0 END

FROM SecondPass d

OUTER APPLY (

SELECT NewSetID = MIN(di.SetID)

FROM SecondPass di

WHERE di.SetID < d.SetID

AND (di.Master_ID IN (d.Master_ID, d.Duplicate_ID) OR di.Duplicate_ID IN (d.Master_ID, d.Duplicate_ID))

) x

),

TweakedResults AS (

SELECT

SetID, Duplicate_ID, rn = ROW_NUMBER() OVER (PARTITION BY SetID ORDER BY Duplicate_ID)

FROM (

SELECT SetID, Duplicate_ID

FROM ThirdPass

UNION -- eliminate dupes

SELECT SetID, Master_ID

FROM ThirdPass

) d

)

SELECT

retained_id = t.Duplicate_ID,

dropped_id = x.Duplicate_ID

FROM TweakedResults t

CROSS APPLY (SELECT Duplicate_ID FROM TweakedResults ti WHERE ti.SetID = t.SetID AND ti.rn>1) x

WHERE t.rn = 1

- but the performance doesn't exactly shine. Have a look at the plan and you see why - the source table is read something like 8 times 😉

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden