Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 123»»»

Another grouping problem Expand / Collapse
Author
Message
Posted Friday, May 10, 2013 5:34 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Friday, October 31, 2014 10:27 AM
Points: 708, Visits: 3,293
I have found a bug in my duplicate macthing application and I believe it's with the way I group my records at the end of the matching process.

Consider the following example (details are not real!)



52941 matches 8388 and 15799 because mobile number and name is the same

80833 matches 8388 and 15799 because name and address match

so this is same as saying 80833 matches 52941, 8388 AND 15799

So based on the second result set, I would like to perform a grouping of the outcome to look like this:



NOTE, I picked the highest value ID as my Base or Group ID.

Any suggestions?


---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1451530
Posted Friday, May 10, 2013 5:39 AM


SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: 2 days ago @ 1:30 AM
Points: 2,386, Visits: 7,622
You've been around long enough to know to provide consumable sample data.


Not a DBA, just trying to learn

For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better, quicker answers on SQL Server performance related questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/



If you litter your database queries with nolock query hints, are you aware of the side effects?
Try reading a few of these links...

(*) Missing rows with nolock
(*) Allocation order scans with nolock
(*) Consistency issues with nolock
(*) Transient Corruption Errors in SQL Server error log caused by nolock
(*) Dirty reads, read errors, reading rows twice and missing rows with nolock


LinkedIn | Blog coming soon (for sufficiently large values of "soon" )!
Post #1451534
Posted Friday, May 10, 2013 5:45 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Friday, October 31, 2014 10:27 AM
Points: 708, Visits: 3,293
Here you go:

create table SCC (Master_ID INT, Duplicate_ID INT)
INSERT INTO SCC
SELECT 52941, 8388 UNION ALL
SELECT 52941, 15799 UNION ALL
SELECT 80833, 8388 UNION ALL
SELECT 80833, 15799



---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1451535
Posted Friday, May 10, 2013 6:26 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Friday, October 31, 2014 10:27 AM
Points: 708, Visits: 3,293
My solution so far:

select DISTINCt c.baseID, CASE WHEN Master_ID = c.BaseID THEN a.Duplicate_ID else Master_ID END
from SCC as a
cross apply (select MAX(master_ID) AS BaseID
from SCC as b
where b.duplicate_id = a.duplicate_id) as c



---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1451549
Posted Friday, May 10, 2013 9:56 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Friday, October 31, 2014 10:27 AM
Points: 708, Visits: 3,293
Unfortunately my solution isn't working as expected. Anyone else with suggestions?

Here is some sample data:

with SCC_CTE  (Retained_ID, Dropped_ID)
AS (Select 15799 , 8388 union all
select 52941, 8388 union all
select 52941, 15799 union all
select 80833, 8388 union all
select 80833 , 15799)
select * from SCC_CTE

And here is the expected out output:

;with Result_CTE  (Retained_ID, Dropped_ID)
AS (Select 80833 , 8388 union all
select 80833, 15799 union all
select 80833, 52941)
select * from Result_CTE

Thanks in advance.


---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1451666
Posted Monday, May 13, 2013 3:19 AM


SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: 2 days ago @ 1:30 AM
Points: 2,386, Visits: 7,622
Sorry, I'd have replied sooner but things have been going on at work that take priority.

What you need is a recursive check through the list. So, I've knocked up a really quick demonstration, but I'd suggest looking into it yourself as I'm convinced that this is not the optimal query. Unfortunately, I'm a bit busy to dig in to it any further.

IF object_id('tempdb..#SCC') IS NOT NULL
BEGIN
DROP TABLE #SCC;
END;

--SAMPLE DATA
SELECT Retained_ID, Dropped_ID
INTO #SCC
FROM (VALUES(15799, 8388),(52941, 8388),(52941, 15799),(80833, 8388),(80833, 15799)
)a(Retained_ID, Dropped_ID);

-- EXPECTED RESULTS
SELECT Retained_ID, Dropped_ID
FROM (VALUES(80833, 8388),(80833, 15799),(80833, 52941))a(Retained_ID, Dropped_ID);

-- RECURSIVE QUERY
WITH CTE AS (
SELECT Retained_ID, Dropped_ID
FROM #SCC
UNION ALL
SELECT a.Retained_ID, CASE WHEN a.Retained_ID > b.Retained_ID THEN b.Retained_ID ELSE a.Dropped_ID END
FROM #SCC a
INNER JOIN CTE b ON a.Dropped_ID = b.Dropped_ID AND a.Retained_ID > b.Retained_ID)
SELECT DISTINCT ca.Retained_ID, ca.Dropped_ID
FROM CTE q
CROSS APPLY (SELECT TOP 1 a.Retained_ID, a.Dropped_ID
FROM CTE a
WHERE a.Dropped_ID = q.Dropped_ID
ORDER BY a.Retained_ID DESC
) ca;


--EDIT--

Almost immediately after posting, I realised that the recursive query could just as easily be written like this: -
-- RECURSIVE QUERY
WITH CTE AS (
SELECT Retained_ID, Dropped_ID
FROM #SCC
UNION ALL
SELECT a.Retained_ID, b.Retained_ID
FROM #SCC a
INNER JOIN CTE b ON a.Dropped_ID = b.Dropped_ID AND a.Retained_ID > b.Retained_ID)
SELECT MAX(Retained_ID) AS Retained_ID, Dropped_ID
FROM CTE
GROUP BY Dropped_ID;




Not a DBA, just trying to learn

For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better, quicker answers on SQL Server performance related questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/



If you litter your database queries with nolock query hints, are you aware of the side effects?
Try reading a few of these links...

(*) Missing rows with nolock
(*) Allocation order scans with nolock
(*) Consistency issues with nolock
(*) Transient Corruption Errors in SQL Server error log caused by nolock
(*) Dirty reads, read errors, reading rows twice and missing rows with nolock


LinkedIn | Blog coming soon (for sufficiently large values of "soon" )!
Post #1452017
Posted Monday, May 13, 2013 6:03 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Yesterday @ 1:43 AM
Points: 6,890, Visits: 14,254
Craig - very promising, but try adding these to the sample data set:

UNION ALL
SELECT 2426, 2341 UNION ALL
SELECT 2341, 1200


Abu Dina - are these two additional rows valid data, and if so, can you extend the data set please? I think it will help folks if there are more groups and perhaps a few rows where there are no associations at all.


“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Post #1452064
Posted Monday, May 13, 2013 6:30 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Friday, October 31, 2014 10:27 AM
Points: 708, Visits: 3,293
Thanks for all your help so far.

This is one way I've tried and it works I think but I'm sure you clever folks can come up with a better way :

declare @pairs table (dropped_id int, retained_id  int)

insert into @pairs
SELECT 34336, 14553 UNION ALL
SELECT 39352, 14553 UNION ALL
SELECT 39352, 34336 UNION ALL
SELECT 39776, 34123 UNION ALL
SELECT 47833, 14553 UNION ALL
SELECT 47833, 34336 UNION ALL
SELECT 47833, 39352 UNION ALL
SELECT 47969, 14553 UNION ALL
SELECT 47969, 34336 UNION ALL
SELECT 47969, 39352 UNION ALL
SELECT 47969, 47833 UNION ALL
SELECT 48460, 14553 UNION ALL
SELECT 48460, 34336 UNION ALL
SELECT 48460, 39352 UNION ALL
SELECT 48460, 47833 UNION ALL
SELECT 48460, 47969 UNION ALL
SELECT 48463, 14553 UNION ALL
SELECT 48463, 34336 UNION ALL
SELECT 48463, 39352 UNION ALL
SELECT 48463, 47833 UNION ALL
SELECT 48463, 47969 UNION ALL
SELECT 48463, 48460 UNION ALL
SELECT 52149, 14553 UNION ALL
SELECT 52149, 34336 UNION ALL
SELECT 52149, 39352 UNION ALL
SELECT 52149, 39776 UNION ALL
SELECT 52149, 47833 UNION ALL
SELECT 52149, 47969 UNION ALL
SELECT 52149, 48460 UNION ALL
SELECT 52149, 48463 UNION ALL
SELECT 54497, 14553 UNION ALL
SELECT 54497, 34336 UNION ALL
SELECT 54497, 39352 UNION ALL
SELECT 54497, 39776 UNION ALL
SELECT 54497, 47833 UNION ALL
SELECT 54497, 47969 UNION ALL
SELECT 54497, 48460 UNION ALL
SELECT 54497, 48463 UNION ALL
SELECT 54497, 52149 UNION ALL
SELECT 66316, 14553 UNION ALL
SELECT 66316, 34336 UNION ALL
SELECT 66316, 39352 UNION ALL
SELECT 66316, 47833 UNION ALL
SELECT 66316, 47969 UNION ALL
SELECT 66316, 48460 UNION ALL
SELECT 66316, 48463 UNION ALL
SELECT 66316, 48819 UNION ALL
SELECT 66316, 52149 UNION ALL
SELECT 66316, 54497 UNION ALL
SELECT 74054, 66316 UNION ALL
SELECT 77722, 53333

declare @sortedpairs table (retained_id int, dropped_id int)

insert into @sortedpairs (retained_id,dropped_id )
select retained_id ,dropped_id from @pairs where retained_id <dropped_id

insert into @sortedpairs (retained_id,dropped_id )
select dropped_id ,retained_id from @pairs where retained_id >dropped_id


insert into @sortedpairs (retained_ID,dropped_ID)
select sp1.retained_ID,sp2.retained_ID from @sortedpairs sp1
inner join @sortedpairs sp2
on sp1.dropped_ID=sp2.dropped_ID
and sp1.retained_ID<>sp2.retained_ID
and sp1.retained_ID<sp2.retained_ID

insert into @sortedpairs (dropped_ID,retained_ID)
select sp1.retained_ID,sp2.retained_ID from @sortedpairs sp1
inner join @sortedpairs sp2
on sp1.dropped_ID=sp2.dropped_ID
and sp1.retained_ID<>sp2.retained_ID
and sp1.retained_ID>sp2.retained_ID


while (@@rowcount<>0)
begin

update tab1 set tab1.retained_ID=tab2.retained_ID
from @sortedpairs tab1
inner join @sortedpairs tab2
on tab2.dropped_ID=tab1.retained_ID
end


select * from @sortedpairs
group by retained_ID,dropped_ID
order by 1



---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1452071
Posted Monday, May 13, 2013 10:36 AM


SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: 2 days ago @ 1:30 AM
Points: 2,386, Visits: 7,622
I really don't have time to dig into this, work is a bet hectic. When I've got some spare time, I'll try to remember to revisit it. I'm fairly certain that this is a recursive CTE problem, so Dwain may well be your man (he loves recursive CTEs).

Just wondering, how many rows do you end up with in your working table? Taking a quick glance at your code, I'd imagine that as the number of rows increases in your source table, the number of rows in your working table must go up exponentially.



Not a DBA, just trying to learn

For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better, quicker answers on SQL Server performance related questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/



If you litter your database queries with nolock query hints, are you aware of the side effects?
Try reading a few of these links...

(*) Missing rows with nolock
(*) Allocation order scans with nolock
(*) Consistency issues with nolock
(*) Transient Corruption Errors in SQL Server error log caused by nolock
(*) Dirty reads, read errors, reading rows twice and missing rows with nolock


LinkedIn | Blog coming soon (for sufficiently large values of "soon" )!
Post #1452215
Posted Monday, May 13, 2013 10:53 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Friday, October 31, 2014 10:27 AM
Points: 708, Visits: 3,293
Join the club! I also have a number of deadlines looming so don't have the luxury to spend too much time on this.

Record linkage is a pain in the backside especially when you have changing spec from week to week!

Yes, the number of rows does increase considerably so although my multi-step method works, I know it's going to suffer performance issues as I try it with larger record sets.

For now though, I've tested it on a sample of 80k rows and my entire process from start to finish takes roughly 2 minute (this includes a dozen CLR TVFs) so I'm quite happy with it.

and yes, I can almost visualise the recursive CTE solution but as I said, I just don't have the time to mess around with it at the moment.

Thanks for your help so far, I do appreciate it.


---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1452223
« Prev Topic | Next Topic »

Add to briefcase 123»»»

Permissions Expand / Collapse