Cadavre (5/15/2013)
ChrisM@Work (5/15/2013)
dwain.c (5/15/2013)
Sorry to disappoint you guys but I wasn't able to come up with anything.My gut was telling me an rCTE wouldn't do it and the set-based loop I tried wouldn't resolve it out properly either.
Of course, I could be wrong as I'm a bit out of practice writing rCTEs! π
I'm not surprised, Dwain. It's quite tricky. The more obvious methods like MAX() OVER() won't work when you expect them to, in the recursive part.
I haven't tried, as I said I'm pretty busy, but I thought a combination of a MAX() OVER() and a recursive CTE was going to do the business. Ah well.
Thanks for taking a look Dwain, be nice to see where your efforts took you.
There are at least two ways of approaching this problem, number chains and sets. I think the methods reviewed so far have employed number chains, so I had a play with sets.
Allocate each number pair a set identity, so the first row (ordered by whatever) has set identity = 1. Wherever either of the two members of set 1 appear elsewhere in the table, change the set identity for the pair to 1. Run again: wherever any of the (now many) members of set 1 appear elsewhere in the table in a different set, change the set id to 1. Rinse and repeat.
Here's the first attempt:
--DROP TABLE #Temp;
-- preprocess the sample data: there isn't a dupe and a master,
-- they're simply members of a set of two.
WITH SequencedData AS (
SELECT *, SetNo = ROW_NUMBER() OVER (ORDER BY Duplicate_ID, Master_ID)
FROM SCC
)
SELECT SetNo, Duplicate_ID
INTO #Temp
FROM SequencedData
UNION ALL
SELECT SetNo, Master_ID
FROM SequencedData;
-- (102 row(s) affected)
---------------------------------------------------------------------------
-- three different ways of performing the subquery for the update.
-- three updates have to be executed with the sample data to complete.
---------------------------------------------------------------------------
UPDATE t SET SetNo = c.newset
FROM #Temp t
INNER JOIN (
SELECT d.SetNo, newset = MIN(newset)
FROM (
SELECT Duplicate_ID, SetNo, newset = MIN(SetNo) OVER (PARTITION BY Duplicate_ID)
FROM #Temp
) d
GROUP BY d.SetNo
HAVING MIN(newset) <> d.SetNo
) c ON c.SetNo = t.SetNo;
-- (96 row(s) affected)
UPDATE t SET SetNo = c.newset
FROM #Temp t
INNER JOIN (
SELECT q.SetNo, q.newset
FROM (
SELECT
d.SetNo, d.newset, rn = ROW_NUMBER() OVER(PARTITION BY d.SetNo ORDER BY d.newset)
FROM (
SELECT SetNo, newset = MIN(SetNo) OVER (PARTITION BY Duplicate_ID)
FROM #Temp
) d
) q
WHERE rn = 1
AND q.SetNo <> q.newset
) c ON c.SetNo = t.SetNo;
-- (66 row(s) affected)
UPDATE t SET SetNo = c.newset
FROM #Temp t
INNER JOIN (
SELECT newset = MIN(t2.SetNo), t1.SetNo
FROM #Temp t1
INNER JOIN #Temp t2
ON t2.Duplicate_ID = t1.Duplicate_ID
AND t2.SetNo < t1.SetNo
GROUP BY t1.SetNo
) c ON c.SetNo = t.SetNo;
-- (2 row(s) affected)
---------------------------------------------------------------------------
-- Final result set
---------------------------------------------------------------------------
SELECT t.Duplicate, x.Duplicate_ID
FROM (
SELECT SetNo, Duplicate = MIN(Duplicate_ID)
FROM #Temp
GROUP BY SetNo
) t
CROSS APPLY (
SELECT DISTINCT *
FROM #Temp ti
WHERE ti.SetNo = t.SetNo
AND ti.Duplicate_ID <> t.Duplicate
) x;
Note that the subquery code for the three updates shown is logically the same and generates the same results with the sample data - I was playing with the subqueries in an attempt to find something which might be rCTE-compatible.
Three updates have to be performed, a fourth does no work. The results are the same as Abu Dina's and it's pretty darned quick.
I haven't yet discovered a rCTE to do the same process because you can't perform aggregations on the recursive part of a rCTE, which this process requires. You can chain ordinary CTE's though, and get the whole caboodle working in a single query like this:
;WITH DataSets AS (
SELECT
SetID = ROW_NUMBER() OVER(ORDER BY Master_ID, Duplicate_ID),
Master_ID, Duplicate_ID
FROM scc
),
FirstPass AS (
SELECT SetID = ISNULL(x.NewSetID, d.SetID), d.Master_ID, d.Duplicate_ID,
RowAffected = 0 + CASE WHEN x.NewSetID IS NOT NULL THEN 1 ELSE 0 END --
FROM DataSets d
OUTER APPLY ( -- find the lowest SetID that either number belongs to
SELECT NewSetID = MIN(di.SetID)
FROM DataSets di
WHERE di.SetID < d.SetID
AND (di.Master_ID IN (d.Master_ID, d.Duplicate_ID) OR di.Duplicate_ID IN (d.Master_ID, d.Duplicate_ID))
) x
),
SecondPass AS (
SELECT SetID = ISNULL(x.NewSetID, d.SetID), d.Master_ID, d.Duplicate_ID,
RowAffected = RowAffected + CASE WHEN x.NewSetID IS NOT NULL THEN 1 ELSE 0 END
FROM FirstPass d
OUTER APPLY (
SELECT NewSetID = MIN(di.SetID)
FROM FirstPass di
WHERE di.SetID < d.SetID
AND (di.Master_ID IN (d.Master_ID, d.Duplicate_ID) OR di.Duplicate_ID IN (d.Master_ID, d.Duplicate_ID))
) x
),
ThirdPass AS (
SELECT SetID = ISNULL(x.NewSetID, d.SetID), d.Master_ID, d.Duplicate_ID,
RowAffected = RowAffected + CASE WHEN x.NewSetID IS NOT NULL THEN 1 ELSE 0 END
FROM SecondPass d
OUTER APPLY (
SELECT NewSetID = MIN(di.SetID)
FROM SecondPass di
WHERE di.SetID < d.SetID
AND (di.Master_ID IN (d.Master_ID, d.Duplicate_ID) OR di.Duplicate_ID IN (d.Master_ID, d.Duplicate_ID))
) x
),
TweakedResults AS (
SELECT
SetID, Duplicate_ID, rn = ROW_NUMBER() OVER (PARTITION BY SetID ORDER BY Duplicate_ID)
FROM (
SELECT SetID, Duplicate_ID
FROM ThirdPass
UNION -- eliminate dupes
SELECT SetID, Master_ID
FROM ThirdPass
) d
)
SELECT
retained_id = t.Duplicate_ID,
dropped_id = x.Duplicate_ID
FROM TweakedResults t
CROSS APPLY (SELECT Duplicate_ID FROM TweakedResults ti WHERE ti.SetID = t.SetID AND ti.rn>1) x
WHERE t.rn = 1
- but the performance doesn't exactly shine. Have a look at the plan and you see why - the source table is read something like 8 times π
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden