Another grouping problem

  • I have found a bug in my duplicate macthing application and I believe it's with the way I group my records at the end of the matching process.

    Consider the following example (details are not real!)

    52941 matches 8388 and 15799 because mobile number and name is the same

    80833 matches 8388 and 15799 because name and address match

    so this is same as saying 80833 matches 52941, 8388 AND 15799

    So based on the second result set, I would like to perform a grouping of the outcome to look like this:

    NOTE, I picked the highest value ID as my Base or Group ID.

    Any suggestions?

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • You've been around long enough to know to provide consumable sample data.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Here you go:

    create table SCC (Master_ID INT, Duplicate_ID INT)

    INSERT INTO SCC

    SELECT 52941, 8388 UNION ALL

    SELECT 52941, 15799 UNION ALL

    SELECT 80833, 8388 UNION ALL

    SELECT 80833, 15799

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • My solution so far:

    select DISTINCt c.baseID, CASE WHEN Master_ID = c.BaseID THEN a.Duplicate_ID else Master_ID END

    from SCC as a

    cross apply (select MAX(master_ID) AS BaseID

    from SCC as b

    where b.duplicate_id = a.duplicate_id) as c

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Unfortunately my solution isn't working as expected. Anyone else with suggestions?

    Here is some sample data:

    with SCC_CTE (Retained_ID, Dropped_ID)

    AS (Select 15799, 8388 union all

    select 52941,8388 union all

    select 52941, 15799 union all

    select 80833, 8388 union all

    select 80833, 15799)

    select * from SCC_CTE

    And here is the expected out output:

    ;with Result_CTE (Retained_ID, Dropped_ID)

    AS (Select 80833, 8388 union all

    select 80833,15799 union all

    select 80833, 52941)

    select * from Result_CTE

    Thanks in advance.

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Sorry, I'd have replied sooner but things have been going on at work that take priority.

    What you need is a recursive check through the list. So, I've knocked up a really quick demonstration, but I'd suggest looking into it yourself as I'm convinced that this is not the optimal query. Unfortunately, I'm a bit busy to dig in to it any further.

    IF object_id('tempdb..#SCC') IS NOT NULL

    BEGIN

    DROP TABLE #SCC;

    END;

    --SAMPLE DATA

    SELECT Retained_ID, Dropped_ID

    INTO #SCC

    FROM (VALUES(15799, 8388),(52941, 8388),(52941, 15799),(80833, 8388),(80833, 15799)

    )a(Retained_ID, Dropped_ID);

    -- EXPECTED RESULTS

    SELECT Retained_ID, Dropped_ID

    FROM (VALUES(80833, 8388),(80833, 15799),(80833, 52941))a(Retained_ID, Dropped_ID);

    -- RECURSIVE QUERY

    WITH CTE AS (

    SELECT Retained_ID, Dropped_ID

    FROM #SCC

    UNION ALL

    SELECT a.Retained_ID, CASE WHEN a.Retained_ID > b.Retained_ID THEN b.Retained_ID ELSE a.Dropped_ID END

    FROM #SCC a

    INNER JOIN CTE b ON a.Dropped_ID = b.Dropped_ID AND a.Retained_ID > b.Retained_ID)

    SELECT DISTINCT ca.Retained_ID, ca.Dropped_ID

    FROM CTE q

    CROSS APPLY (SELECT TOP 1 a.Retained_ID, a.Dropped_ID

    FROM CTE a

    WHERE a.Dropped_ID = q.Dropped_ID

    ORDER BY a.Retained_ID DESC

    ) ca;


    --EDIT--


    Almost immediately after posting, I realised that the recursive query could just as easily be written like this: -

    -- RECURSIVE QUERY

    WITH CTE AS (

    SELECT Retained_ID, Dropped_ID

    FROM #SCC

    UNION ALL

    SELECT a.Retained_ID, b.Retained_ID

    FROM #SCC a

    INNER JOIN CTE b ON a.Dropped_ID = b.Dropped_ID AND a.Retained_ID > b.Retained_ID)

    SELECT MAX(Retained_ID) AS Retained_ID, Dropped_ID

    FROM CTE

    GROUP BY Dropped_ID;


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Craig - very promising, but try adding these to the sample data set:

    UNION ALL

    SELECT 2426, 2341 UNION ALL

    SELECT 2341, 1200

    Abu Dina - are these two additional rows valid data, and if so, can you extend the data set please? I think it will help folks if there are more groups and perhaps a few rows where there are no associations at all.

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thanks for all your help so far.

    This is one way I've tried and it works I think but I'm sure you clever folks can come up with a better way :

    declare @pairs table (dropped_id int, retained_id int)

    insert into @pairs

    SELECT 34336, 14553 UNION ALL

    SELECT 39352, 14553 UNION ALL

    SELECT 39352, 34336 UNION ALL

    SELECT 39776, 34123 UNION ALL

    SELECT 47833, 14553 UNION ALL

    SELECT 47833, 34336 UNION ALL

    SELECT 47833, 39352 UNION ALL

    SELECT 47969, 14553 UNION ALL

    SELECT 47969, 34336 UNION ALL

    SELECT 47969, 39352 UNION ALL

    SELECT 47969, 47833 UNION ALL

    SELECT 48460, 14553 UNION ALL

    SELECT 48460, 34336 UNION ALL

    SELECT 48460, 39352 UNION ALL

    SELECT 48460, 47833 UNION ALL

    SELECT 48460, 47969 UNION ALL

    SELECT 48463, 14553 UNION ALL

    SELECT 48463, 34336 UNION ALL

    SELECT 48463, 39352 UNION ALL

    SELECT 48463, 47833 UNION ALL

    SELECT 48463, 47969 UNION ALL

    SELECT 48463, 48460 UNION ALL

    SELECT 52149, 14553 UNION ALL

    SELECT 52149, 34336 UNION ALL

    SELECT 52149, 39352 UNION ALL

    SELECT 52149, 39776 UNION ALL

    SELECT 52149, 47833 UNION ALL

    SELECT 52149, 47969 UNION ALL

    SELECT 52149, 48460 UNION ALL

    SELECT 52149, 48463 UNION ALL

    SELECT 54497, 14553 UNION ALL

    SELECT 54497, 34336 UNION ALL

    SELECT 54497, 39352 UNION ALL

    SELECT 54497, 39776 UNION ALL

    SELECT 54497, 47833 UNION ALL

    SELECT 54497, 47969 UNION ALL

    SELECT 54497, 48460 UNION ALL

    SELECT 54497, 48463 UNION ALL

    SELECT 54497, 52149 UNION ALL

    SELECT 66316, 14553 UNION ALL

    SELECT 66316, 34336 UNION ALL

    SELECT 66316, 39352 UNION ALL

    SELECT 66316, 47833 UNION ALL

    SELECT 66316, 47969 UNION ALL

    SELECT 66316, 48460 UNION ALL

    SELECT 66316, 48463 UNION ALL

    SELECT 66316, 48819 UNION ALL

    SELECT 66316, 52149 UNION ALL

    SELECT 66316, 54497 UNION ALL

    SELECT 74054, 66316 UNION ALL

    SELECT 77722, 53333

    declare @sortedpairs table (retained_id int, dropped_id int)

    insert into @sortedpairs (retained_id,dropped_id )

    select retained_id ,dropped_id from @pairs where retained_id <dropped_id

    insert into @sortedpairs (retained_id,dropped_id )

    select dropped_id ,retained_id from @pairs where retained_id >dropped_id

    insert into @sortedpairs (retained_ID,dropped_ID)

    select sp1.retained_ID,sp2.retained_ID from @sortedpairs sp1

    inner join @sortedpairs sp2

    on sp1.dropped_ID=sp2.dropped_ID

    and sp1.retained_ID<>sp2.retained_ID

    and sp1.retained_ID<sp2.retained_ID

    insert into @sortedpairs (dropped_ID,retained_ID)

    select sp1.retained_ID,sp2.retained_ID from @sortedpairs sp1

    inner join @sortedpairs sp2

    on sp1.dropped_ID=sp2.dropped_ID

    and sp1.retained_ID<>sp2.retained_ID

    and sp1.retained_ID>sp2.retained_ID

    while (@@rowcount<>0)

    begin

    update tab1 set tab1.retained_ID=tab2.retained_ID

    from @sortedpairs tab1

    inner join @sortedpairs tab2

    on tab2.dropped_ID=tab1.retained_ID

    end

    select * from @sortedpairs

    group by retained_ID,dropped_ID

    order by 1

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • I really don't have time to dig into this, work is a bet hectic. When I've got some spare time, I'll try to remember to revisit it. I'm fairly certain that this is a recursive CTE problem, so Dwain may well be your man (he loves recursive CTEs).

    Just wondering, how many rows do you end up with in your working table? Taking a quick glance at your code, I'd imagine that as the number of rows increases in your source table, the number of rows in your working table must go up exponentially.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Join the club! I also have a number of deadlines looming so don't have the luxury to spend too much time on this.

    Record linkage is a pain in the backside especially when you have changing spec from week to week!

    Yes, the number of rows does increase considerably so although my multi-step method works, I know it's going to suffer performance issues as I try it with larger record sets.

    For now though, I've tested it on a sample of 80k rows and my entire process from start to finish takes roughly 2 minute (this includes a dozen CLR TVFs) so I'm quite happy with it.

    and yes, I can almost visualise the recursive CTE solution but as I said, I just don't have the time to mess around with it at the moment.

    Thanks for your help so far, I do appreciate it.

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Cadavre (5/13/2013)


    I really don't have time to dig into this, work is a bet hectic. When I've got some spare time, I'll try to remember to revisit it. I'm fairly certain that this is a recursive CTE problem, so Dwain may well be your man (he loves recursive CTEs).

    Dwain who?


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Sorry to disappoint you guys but I wasn't able to come up with anything.

    My gut was telling me an rCTE wouldn't do it and the set-based loop I tried wouldn't resolve it out properly either.

    Of course, I could be wrong as I'm a bit out of practice writing rCTEs! πŸ™‚


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (5/15/2013)


    Sorry to disappoint you guys but I wasn't able to come up with anything.

    My gut was telling me an rCTE wouldn't do it and the set-based loop I tried wouldn't resolve it out properly either.

    Of course, I could be wrong as I'm a bit out of practice writing rCTEs! πŸ™‚

    I'm not surprised, Dwain. It's quite tricky. The more obvious methods like MAX() OVER() won't work when you expect them to, in the recursive part.

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (5/15/2013)


    dwain.c (5/15/2013)


    Sorry to disappoint you guys but I wasn't able to come up with anything.

    My gut was telling me an rCTE wouldn't do it and the set-based loop I tried wouldn't resolve it out properly either.

    Of course, I could be wrong as I'm a bit out of practice writing rCTEs! πŸ™‚

    I'm not surprised, Dwain. It's quite tricky. The more obvious methods like MAX() OVER() won't work when you expect them to, in the recursive part.

    I haven't tried, as I said I'm pretty busy, but I thought a combination of a MAX() OVER() and a recursive CTE was going to do the business. Ah well.

    Thanks for taking a look Dwain, be nice to see where your efforts took you.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Cadavre (5/15/2013)


    Thanks for taking a look Dwain, be nice to see where your efforts took you.

    Very well. For what it's worth:

    DECLARE @pairs TABLE (dropped_id INT, retained_id INT)

    INSERT INTO @pairs

    --select 15799, 8388 union all

    --select 52941,8388 union all

    --select 52941, 15799 union all

    --select 80833, 8388 union all

    --select 80833, 15799

    SELECT 34336, 14553 UNION ALL

    SELECT 39352, 14553 UNION ALL

    SELECT 39352, 34336 UNION ALL

    SELECT 39776, 34123 UNION ALL

    SELECT 47833, 14553 UNION ALL

    SELECT 47833, 34336 UNION ALL

    SELECT 47833, 39352 UNION ALL

    SELECT 47969, 14553 UNION ALL

    SELECT 47969, 34336 UNION ALL

    SELECT 47969, 39352 UNION ALL

    SELECT 47969, 47833 UNION ALL

    SELECT 48460, 14553 UNION ALL

    SELECT 48460, 34336 UNION ALL

    SELECT 48460, 39352 UNION ALL

    SELECT 48460, 47833 UNION ALL

    SELECT 48460, 47969 UNION ALL

    SELECT 48463, 14553 UNION ALL

    SELECT 48463, 34336 UNION ALL

    SELECT 48463, 39352 UNION ALL

    SELECT 48463, 47833 UNION ALL

    SELECT 48463, 47969 UNION ALL

    SELECT 48463, 48460 UNION ALL

    SELECT 52149, 14553 UNION ALL

    SELECT 52149, 34336 UNION ALL

    SELECT 52149, 39352 UNION ALL

    SELECT 52149, 39776 UNION ALL

    SELECT 52149, 47833 UNION ALL

    SELECT 52149, 47969 UNION ALL

    SELECT 52149, 48460 UNION ALL

    SELECT 52149, 48463 UNION ALL

    SELECT 54497, 14553 UNION ALL

    SELECT 54497, 34336 UNION ALL

    SELECT 54497, 39352 UNION ALL

    SELECT 54497, 39776 UNION ALL

    SELECT 54497, 47833 UNION ALL

    SELECT 54497, 47969 UNION ALL

    SELECT 54497, 48460 UNION ALL

    SELECT 54497, 48463 UNION ALL

    SELECT 54497, 52149 UNION ALL

    SELECT 66316, 14553 UNION ALL

    SELECT 66316, 34336 UNION ALL

    SELECT 66316, 39352 UNION ALL

    SELECT 66316, 47833 UNION ALL

    SELECT 66316, 47969 UNION ALL

    SELECT 66316, 48460 UNION ALL

    SELECT 66316, 48463 UNION ALL

    SELECT 66316, 48819 UNION ALL

    SELECT 66316, 52149 UNION ALL

    SELECT 66316, 54497 UNION ALL

    SELECT 74054, 66316 UNION ALL

    SELECT 77722, 53333

    DECLARE @pairs2 TABLE (dropped_id INT, retained_id INT)

    DECLARE @rows INT

    INSERT INTO @pairs2

    SELECT dropped_id=CASE WHEN dropped_id < retained_id THEN dropped_id ELSE retained_id END

    ,retained_id=CASE WHEN dropped_id < retained_id THEN retained_id ELSE dropped_id END

    FROM @pairs

    SELECT @rows = @@ROWCOUNT

    DELETE FROM @pairs

    INSERT INTO @pairs

    SELECT dropped_id, retained_id

    FROM @pairs2

    WHERE dropped_id NOT IN (SELECT retained_id FROM @pairs2)

    SELECT @rows = @@ROWCOUNT

    WHILE @rows <> 0

    BEGIN

    INSERT INTO @pairs

    --OUTPUT INSERTED.dropped_id, INSERTED.retained_id

    SELECT a.dropped_id, b.retained_id

    FROM @pairs a

    INNER JOIN @pairs2 b ON a.retained_id = b.dropped_id

    EXCEPT

    SELECT dropped_id, retained_id

    FROM @pairs

    SELECT @rows = @@ROWCOUNT

    END

    INSERT INTO @pairs

    SELECT a=MIN(dropped_id), b=MAX(dropped_id)

    FROM @pairs

    GROUP BY retained_id

    HAVING COUNT(*) > 1

    ;WITH Pairs AS (

    SELECT dropped_id, a.retained_id

    ,n=ROW_NUMBER() OVER (PARTITION BY a.retained_id ORDER BY dropped_id)

    ,b

    FROM @pairs a

    INNER JOIN (

    SELECT retained_id, a=MAX(dropped_id), b=MIN(dropped_id)

    FROM @pairs

    GROUP BY retained_id

    HAVING COUNT(*) > 1) b ON a.dropped_id = b.a OR a.retained_id = b.retained_id)

    UPDATE a

    SET dropped_id=b

    FROM Pairs a

    WHERE n > 1

    SELECT DISTINCT *

    FROM @pairs

    ORDER BY dropped_id, retained_id

    I tweaked it this morning and I think it may now be producing the correct rows for this case but no guarantee it always will.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply