• Cadavre (5/15/2013)


    ChrisM@Work (5/15/2013)


    dwain.c (5/15/2013)


    Sorry to disappoint you guys but I wasn't able to come up with anything.

    My gut was telling me an rCTE wouldn't do it and the set-based loop I tried wouldn't resolve it out properly either.

    Of course, I could be wrong as I'm a bit out of practice writing rCTEs! πŸ™‚

    I'm not surprised, Dwain. It's quite tricky. The more obvious methods like MAX() OVER() won't work when you expect them to, in the recursive part.

    I haven't tried, as I said I'm pretty busy, but I thought a combination of a MAX() OVER() and a recursive CTE was going to do the business. Ah well.

    Thanks for taking a look Dwain, be nice to see where your efforts took you.

    There are at least two ways of approaching this problem, number chains and sets. I think the methods reviewed so far have employed number chains, so I had a play with sets.

    Allocate each number pair a set identity, so the first row (ordered by whatever) has set identity = 1. Wherever either of the two members of set 1 appear elsewhere in the table, change the set identity for the pair to 1. Run again: wherever any of the (now many) members of set 1 appear elsewhere in the table in a different set, change the set id to 1. Rinse and repeat.

    Here's the first attempt:

    --DROP TABLE #Temp;

    -- preprocess the sample data: there isn't a dupe and a master,

    -- they're simply members of a set of two.

    WITH SequencedData AS (

    SELECT *, SetNo = ROW_NUMBER() OVER (ORDER BY Duplicate_ID, Master_ID)

    FROM SCC

    )

    SELECT SetNo, Duplicate_ID

    INTO #Temp

    FROM SequencedData

    UNION ALL

    SELECT SetNo, Master_ID

    FROM SequencedData;

    -- (102 row(s) affected)

    ---------------------------------------------------------------------------

    -- three different ways of performing the subquery for the update.

    -- three updates have to be executed with the sample data to complete.

    ---------------------------------------------------------------------------

    UPDATE t SET SetNo = c.newset

    FROM #Temp t

    INNER JOIN (

    SELECT d.SetNo, newset = MIN(newset)

    FROM (

    SELECT Duplicate_ID, SetNo, newset = MIN(SetNo) OVER (PARTITION BY Duplicate_ID)

    FROM #Temp

    ) d

    GROUP BY d.SetNo

    HAVING MIN(newset) <> d.SetNo

    ) c ON c.SetNo = t.SetNo;

    -- (96 row(s) affected)

    UPDATE t SET SetNo = c.newset

    FROM #Temp t

    INNER JOIN (

    SELECT q.SetNo, q.newset

    FROM (

    SELECT

    d.SetNo, d.newset, rn = ROW_NUMBER() OVER(PARTITION BY d.SetNo ORDER BY d.newset)

    FROM (

    SELECT SetNo, newset = MIN(SetNo) OVER (PARTITION BY Duplicate_ID)

    FROM #Temp

    ) d

    ) q

    WHERE rn = 1

    AND q.SetNo <> q.newset

    ) c ON c.SetNo = t.SetNo;

    -- (66 row(s) affected)

    UPDATE t SET SetNo = c.newset

    FROM #Temp t

    INNER JOIN (

    SELECT newset = MIN(t2.SetNo), t1.SetNo

    FROM #Temp t1

    INNER JOIN #Temp t2

    ON t2.Duplicate_ID = t1.Duplicate_ID

    AND t2.SetNo < t1.SetNo

    GROUP BY t1.SetNo

    ) c ON c.SetNo = t.SetNo;

    -- (2 row(s) affected)

    ---------------------------------------------------------------------------

    -- Final result set

    ---------------------------------------------------------------------------

    SELECT t.Duplicate, x.Duplicate_ID

    FROM (

    SELECT SetNo, Duplicate = MIN(Duplicate_ID)

    FROM #Temp

    GROUP BY SetNo

    ) t

    CROSS APPLY (

    SELECT DISTINCT *

    FROM #Temp ti

    WHERE ti.SetNo = t.SetNo

    AND ti.Duplicate_ID <> t.Duplicate

    ) x;

    Note that the subquery code for the three updates shown is logically the same and generates the same results with the sample data - I was playing with the subqueries in an attempt to find something which might be rCTE-compatible.

    Three updates have to be performed, a fourth does no work. The results are the same as Abu Dina's and it's pretty darned quick.

    I haven't yet discovered a rCTE to do the same process because you can't perform aggregations on the recursive part of a rCTE, which this process requires. You can chain ordinary CTE's though, and get the whole caboodle working in a single query like this:

    ;WITH DataSets AS (

    SELECT

    SetID = ROW_NUMBER() OVER(ORDER BY Master_ID, Duplicate_ID),

    Master_ID, Duplicate_ID

    FROM scc

    ),

    FirstPass AS (

    SELECT SetID = ISNULL(x.NewSetID, d.SetID), d.Master_ID, d.Duplicate_ID,

    RowAffected = 0 + CASE WHEN x.NewSetID IS NOT NULL THEN 1 ELSE 0 END --

    FROM DataSets d

    OUTER APPLY ( -- find the lowest SetID that either number belongs to

    SELECT NewSetID = MIN(di.SetID)

    FROM DataSets di

    WHERE di.SetID < d.SetID

    AND (di.Master_ID IN (d.Master_ID, d.Duplicate_ID) OR di.Duplicate_ID IN (d.Master_ID, d.Duplicate_ID))

    ) x

    ),

    SecondPass AS (

    SELECT SetID = ISNULL(x.NewSetID, d.SetID), d.Master_ID, d.Duplicate_ID,

    RowAffected = RowAffected + CASE WHEN x.NewSetID IS NOT NULL THEN 1 ELSE 0 END

    FROM FirstPass d

    OUTER APPLY (

    SELECT NewSetID = MIN(di.SetID)

    FROM FirstPass di

    WHERE di.SetID < d.SetID

    AND (di.Master_ID IN (d.Master_ID, d.Duplicate_ID) OR di.Duplicate_ID IN (d.Master_ID, d.Duplicate_ID))

    ) x

    ),

    ThirdPass AS (

    SELECT SetID = ISNULL(x.NewSetID, d.SetID), d.Master_ID, d.Duplicate_ID,

    RowAffected = RowAffected + CASE WHEN x.NewSetID IS NOT NULL THEN 1 ELSE 0 END

    FROM SecondPass d

    OUTER APPLY (

    SELECT NewSetID = MIN(di.SetID)

    FROM SecondPass di

    WHERE di.SetID < d.SetID

    AND (di.Master_ID IN (d.Master_ID, d.Duplicate_ID) OR di.Duplicate_ID IN (d.Master_ID, d.Duplicate_ID))

    ) x

    ),

    TweakedResults AS (

    SELECT

    SetID, Duplicate_ID, rn = ROW_NUMBER() OVER (PARTITION BY SetID ORDER BY Duplicate_ID)

    FROM (

    SELECT SetID, Duplicate_ID

    FROM ThirdPass

    UNION -- eliminate dupes

    SELECT SetID, Master_ID

    FROM ThirdPass

    ) d

    )

    SELECT

    retained_id = t.Duplicate_ID,

    dropped_id = x.Duplicate_ID

    FROM TweakedResults t

    CROSS APPLY (SELECT Duplicate_ID FROM TweakedResults ti WHERE ti.SetID = t.SetID AND ti.rn>1) x

    WHERE t.rn = 1

    - but the performance doesn't exactly shine. Have a look at the plan and you see why - the source table is read something like 8 times πŸ˜‰

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden