Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««123»»

filter duplicate students via T-SQL Expand / Collapse
Author
Message
Posted Tuesday, April 23, 2013 12:51 AM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Tuesday, April 15, 2014 4:40 PM
Points: 159, Visits: 454
Hi, perhaps I did not explain myself clearly, so I am trying to fix that.
I need to search a table with user records, find only those users that occur more than once.
I could just use the username (Student) to find the duplicates, but unfortunately half of their IDNo's are NULL, so I it's you cannot say they are duplicates. The only other way to check if they are duplicates is to check Tel3 and Tel1 (to try and check if they are duplicates).

Since are few forum users are asking for ddl, below I have provided it. It works to a certain extent.
And NO, this is not for an exam, interview questionnaire, study topic, etc.


CREATE TABLE #Student (
[Student] [varchar](30) NULL,
[IDNo] [varchar](13) NULL,
[Tel3] [varchar](12) NULL,
[Tel1] [varchar](12) NULL,
[Sort] [varchar](12) NULL
)

INSERT INTO #Student
SELECT 'Alison Stew' as Student, '405088' as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, NULL
INSERT INTO #Student
SELECT 'Alison Stew' as Student, NULL as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, NULL
INSERT INTO #Student
SELECT 'Amber Egno' as Student, NULL as IDNo, '+7677170700' as Tel3, '+76772476573' as Tel1, NULL
INSERT INTO #Student
SELECT 'Amber Egno' as Student, NULL as IDNo, '+76772476573' as Tel3, '+76772476573' as Tel1, NULL
INSERT INTO #Student
SELECT 'Carmen Schw' as Student, '083085' as IDNo, '+77711498898' as Tel3, '+77117642110' as Tel1, NULL
INSERT INTO #Student
SELECT 'Carmen Schw' as Student, NULL as IDNo, '+77117642110' as Tel3, '+77117642110' as Tel1, NULL

-------------------------------------------------------------------------------------------------

SELECT
[Student]
,IDNo
,Tel1
,Tel3

INTO #FilterDupes
FROM #Student

-------------------------------------------------------------------------------------------------

SELECT
[Student]
--,IDNo
--,Tel1
--,Tel3
,count([Student]) as Dupes

FROM #FilterDupes

GROUP BY
[Student]
--,IDNo
--,Tel1
--,Tel3

HAVING COUNT([Student]) > 1

DROP TABLE #Student
DROP TABLE #FilterDupes
Post #1445290
Posted Tuesday, April 23, 2013 4:19 AM
SSCarpal Tunnel

SSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal Tunnel

Group: General Forum Members
Last Login: Sunday, July 20, 2014 5:23 PM
Points: 4,576, Visits: 8,341
Now, what is wrong with the approach I posted?

Where does it fail to provide the result you need?
Post #1445342
Posted Tuesday, April 23, 2013 4:08 PM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Tuesday, April 15, 2014 4:40 PM
Points: 159, Visits: 454
Hi Sergiy, firstly thanks to you and everyone else who responded to my post.

I think I should rather have named the title of my post:
How to find duplicates based on 3 fields ? (being IDNo, Tel1, Tel3 - for Student)

Your script/sql query worked ok in some instances, but in some cases excluded some students, due to Tel3 / Tel1 / IDNo. What makes writing a query for such an issue difficult is that sometimes all 3 (IDNo, Tel1, Tel3) have values, in other cases only 2 of the 3 have values, in other only 1, in other none have a value. To add more pain to the issue, first instance of IDNo / Tel1 / Tel3 for student Jack might be identical to 2nd instance of Jack, the 3rd instance they might all differ, 4th instance some might differ.

Perhaps I should just use a cte -
Student, Tel1 for 1st query.
Then Student, Tel3 for 2nd query.
Then Student, IDNo for 3rd query.

Then join all 3 together, and hope it does the job.
Post #1445709
Posted Tuesday, April 23, 2013 6:46 PM
SSCarpal Tunnel

SSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal Tunnel

Group: General Forum Members
Last Login: Sunday, July 20, 2014 5:23 PM
Points: 4,576, Visits: 8,341
hmm, it did not quite explain the problem.
At least to me.

Can you please provide some data samples where the script fails and explain what outcome you'd expect from it.

Then I (or others) could try to figure out a solution.
Post #1445739
Posted Tuesday, April 23, 2013 9:10 PM
Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Tuesday, July 15, 2014 2:20 PM
Points: 364, Visits: 384
The factor preventing us from providing an accurate answer is because we need a set of expected results in a tabular format that the query should generate from the sample data. This is the only robust way to describe the required query logic.

Having said, that I'm including a possible answer using the provided DDL:

-- Set up test data
declare @Student table (
[Student] [varchar](30) NULL,
[IDNo] [varchar](13) NULL,
[Tel3] [varchar](12) NULL,
[Tel1] [varchar](12) NULL,
[Sort] [varchar](12) NULL
);

INSERT INTO @Student
SELECT 'Alison Stew' as Student, '405088' as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, null;
INSERT INTO @Student
SELECT 'Alison Stew' as Student, NULL as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, null;
INSERT INTO @Student
SELECT 'Amber Egno' as Student, NULL as IDNo, '+7677170700' as Tel3, '+76772476573' as Tel1, null;
INSERT INTO @Student
SELECT 'Amber Egno' as Student, NULL as IDNo, '+76772476573' as Tel3, '+76772476573' as Tel1, null;
INSERT INTO @Student
SELECT 'Carmen Schw' as Student, '083085' as IDNo, '+77711498898' as Tel3, '+77117642110' as Tel1, null;
INSERT INTO @Student
SELECT 'Carmen Schw' as Student, NULL as IDNo, '+77117642110' as Tel3, '+77117642110' as Tel1, null;


-- Query
with rules_per_student as (
select Student,
sum(case when IDNo is null then 1 else 0 end) as HasNullRows,
case when min(Tel3) = max(Tel3) then 1 else 0 end as IdenticalTel3Rows,
case when min(Tel1) = max(Tel1) then 1 else 0 end as IdenticalTel1Rows,
sum(case when Tel1 = Tel3 then 1 else 0 end) as HasTel1EqualTel3InRow
from @Student
group by Student
)
select s.Student,
s.IDNo,
s.Tel3,
s.Tel1,
case when r.HasNullRows > 0 then
case when IdenticalTel3Rows = 1 then '1111'
when IdenticalTel3Rows = 0 and IdenticalTel1Rows = 1 then '8888'
when IdenticalTel3Rows = 0 and IdenticalTel1Rows = 0 and HasTel1EqualTel3InRow > 0 then '7777'
end
else null end as Sort
from @Student s
join rules_per_student r on r.student = s.student

Post #1445751
Posted Tuesday, April 23, 2013 9:34 PM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Friday, April 25, 2014 4:20 PM
Points: 152, Visits: 448
After reading through the thread, I see that none of the solutions offered thus far have recognized the necessity of treating the rows with a known IDNo differently than the rows with a null IDNo. Before any "filtering" of duplicates can be accomplished, these two sets of rows must be separated from one another and then related back to each other through a join on the Student values and either the Tel3 or Tel1 values, whichever matches (if either do match).

The solution below uses three CTEs to set the stage: MyData represents your original set of raw rows of data; ID_Populated represents the subset of MyData in which the IDNo has a known value; ID_Unpopulated represents the subset of rows in which the IDNo is unknown (null). After that the main query relates ID_Populated to ID_Unpopulated on the Student column and either the Tel3 or Tel1 column whichever matches. Because the join option is a FULL JOIN, if there are no matches between the two subsets, those rows are still retained. If there are still duplicates in the results of this query, it is because there are more than one duplicate entry of the same student. Additional grouping or filtering would be required to address that situation.

with
MyData as
(SELECT 'Alison' as Student, NULL as IDNo, '1111' as Tel3, '4444' as Tel1 UNION ALL
SELECT 'Alison' as Student, '80405088' as IDNo, '1111' as Tel3, '4444' as Tel1 UNION ALL
SELECT 'Amber' as Student, NULL as IDNo, '5555' as Tel3, '6666' as Tel1 UNION ALL
SELECT 'Amber' as Student, NULL as IDNo, '7777' as Tel3, '7777' as Tel1 UNION ALL
SELECT 'Carmen' as Student, NULL as IDNo, '8888' as Tel3, '8888' as Tel1 UNION ALL
SELECT 'Carmen' as Student, '20083085' as IDNo, '9999' as Tel3, '8888' as Tel1),

ID_Populated as (select Student, IDNo, Tel3, Tel1 from MyData where IDNo is not null),

ID_Unpopulated as (select Student, IDNo, Tel3, Tel1 from MyData where IDNo is null)

select
Student = isnull(ID_Populated.Student, ID_Unpopulated.Student),
IDNo = ID_Populated.IDNo,
Tel3 = isnull(ID_Populated.Tel3, ID_Unpopulated.Tel3),
Tel1 = isnull(ID_Populated.Tel1, ID_Unpopulated.Tel1),
Sort = case when ID_Populated.Tel3 = ID_Unpopulated.Tel3 then ID_Populated.Tel3
when ID_Populated.Tel1 = ID_Unpopulated.Tel1 then ID_Populated.Tel1
when isnull(ID_Populated.Tel1, ID_Unpopulated.Tel1) = isnull(ID_Populated.Tel3, ID_Unpopulated.Tel3) then isnull(ID_Populated.Tel1, ID_Unpopulated.Tel1) end
from
ID_Populated full join
ID_Unpopulated on ID_Populated.Student = ID_Unpopulated.Student and (ID_Populated.Tel3 = ID_Unpopulated.Tel3 or ID_Populated.Tel1 = ID_Unpopulated.Tel1)

Post #1445755
Posted Wednesday, April 24, 2013 12:41 AM
SSCarpal Tunnel

SSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal Tunnel

Group: General Forum Members
Last Login: Sunday, July 20, 2014 5:23 PM
Points: 4,576, Visits: 8,341
geoff5 (4/23/2013)
After reading through the thread, I see that none of the solutions offered thus far have recognized the necessity of treating the rows with a known IDNo differently than the rows with a null IDNo.


Not quite true.
In my script you can see that all the "duplicate candidates" (T2) must have IDNo IS NULL.
There no condition on T1.IDNo, as "primary" entry may have it NULL as well, according to the OP description.
Post #1445782
Posted Wednesday, April 24, 2013 7:18 AM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Friday, April 25, 2014 4:20 PM
Points: 152, Visits: 448
Sergiy,

I apologize for overlooking that detail of your script. Your script does identify the names with a duplicate entry by isolating the "known" ID values from the "unknown" ID values; however, I think the main intent of the original request was to filter out the duplicates, not to identify them, except perhaps by the "sort" column to indicate that a matching duplicate was identified by means of the T1 or T3 column.

The intent of the query solution I supplied is to yield a list of names with no identifiable duplicates in the list. If there are duplicate entries of the same student, it is only because neither the T1 nor the T3 columns matched up to supply a definite identification of the duplicate entry.

If I misconstrued the intent of the original post, and the goal is to identify duplicates rather than to suppress them, then your solution is likely a better one for this thread.

Come to think of it, there does seem to be a mixing up of "filtering" duplicates and "identifying" them at one and the same time. If so, these tasks are at cross purposes because by definition eliminating duplicates means removing from the results the very rows that are needed in order to identify the duplicates and ultimately to eliminate them from the underlying data.
Post #1445949
Posted Wednesday, April 24, 2013 10:06 AM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Friday, April 25, 2014 4:20 PM
Points: 152, Visits: 448
I've continued to scratch my head over this one, and I think I have a better solution now. What was troubling me is the fact that there is no good way to identify the "original" row from a "duplicate" row, especially with no row ID value, such as a primary key, for the set of data. In the end I decided the only way to approach this was to compare every row with every other row for the same Student name to see if any of the pertinent columns match, and if there is a match on any of them, flag it as a possible duplicate row. This still requires some way to identify the rows uniquely, so I added a new CTE to supply a unique row number for each row, and then in the main query I use a series of subqueries and a FOR XML trick I learned awhile back to check for likely duplicate rows and list their row numbers.

I feel pretty confident that this will yield meaningful data, though it might take awhile to run against "real world" data because of all the subqueries that are essentially partial cross-joins for the complete set of data.

In the example below I added a few rows for the student "Jack" to represent most of the variations in populated vs. null data points.

 with
MyData as
(SELECT 'Alison' as Student, NULL as IDNo, '1111' as Tel3, '4444' as Tel1 UNION ALL
SELECT 'Alison' as Student, '80405088' as IDNo, '1111' as Tel3, '4444' as Tel1 UNION ALL
SELECT 'Amber' as Student, NULL as IDNo, '5555' as Tel3, '6666' as Tel1 UNION ALL
SELECT 'Amber' as Student, NULL as IDNo, '7777' as Tel3, '7777' as Tel1 UNION ALL
SELECT 'Carmen' as Student, NULL as IDNo, '8888' as Tel3, '8888' as Tel1 UNION ALL
SELECT 'Carmen' as Student, '20083085' as IDNo, '9999' as Tel3, '8888' as Tel1
union all select 'Jack', '20001111', '1212', '3232'
union all select 'Jack', '20001111', '1212', '3232'
union all select 'Jack', null, '1212', '3232'
union all select 'Jack', null, null, '3232'
union all select 'Jack', null, null, null
union all select 'Jack', '20001111', null, null
union all select 'Jack', null, '3232', '1212'),

MyDataWithRowID as
(select RowID = row_number() over (order by (select null)), Student, IDNo, Tel3, Tel1 from MyData)

select
RowID,
Student,
IDNo,
Tel3,
Tel1,
IDNo_MatchingRows = stuff((select ', ' + cast(RowID as varchar) from MyDataWithRowID m1 where m.Student = m1.Student and m.RowID <> m1.RowID and m.IDNo = m1.IDNo order by m1.RowID for xml path(''), type).value('.', 'varchar(max)'), 1, 2, ''),
Tel3_MatchingRows = stuff((select ', ' + cast(RowID as varchar) from MyDataWithRowID m1 where m.Student = m1.Student and m.RowID <> m1.RowID and m.Tel3 = m1.Tel3 order by m1.RowID for xml path(''), type).value('.', 'varchar(max)'), 1, 2, ''),
Tel1_MatchingRows = stuff((select ', ' + cast(RowID as varchar) from MyDataWithRowID m1 where m.Student = m1.Student and m.RowID <> m1.RowID and m.Tel1 = m1.Tel1 order by m1.RowID for xml path(''), type).value('.', 'varchar(max)'), 1, 2, ''),
Tel1ToTel3_MatchingRows = stuff((select ', ' + cast(RowID as varchar) from MyDataWithRowID m1 where m.Student = m1.Student and m.RowID <> m1.RowID and (m.Tel1 = m1.Tel3 or m.Tel3 = m1.Tel1) order by m1.RowID for xml path(''), type).value('.', 'varchar(max)'), 1, 2, ''),
All_Identified_Duplicate_Rows = stuff((select ', ' + cast(RowID as varchar) from MyDataWithRowID m1 where m.Student = m1.Student and m.RowID <> m1.RowID and (m.IDNo = m1.IDNo or m.Tel3 = m1.Tel3 or m.Tel1 = m1.Tel1 or m.Tel1 = m1.Tel3 or m.Tel3 = m1.Tel1) order by m1.RowID for xml path(''), type).value('.', 'varchar(max)'), 1, 2, ''),
Tel1_Equals_Tel3 = case when m.Tel1 = m.Tel3 then 'Yes' else 'No' end
from
MyDataWithRowID m

Post #1446096
Posted Wednesday, April 24, 2013 7:15 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, August 19, 2013 8:08 PM
Points: 43, Visits: 566
kevin_nikolai (4/23/2013)
Hi Sergiy, firstly thanks to you and everyone else who responded to my post.

I think I should rather have named the title of my post:
How to find duplicates based on 3 fields ? (being IDNo, Tel1, Tel3 - for Student)

Your script/sql query worked ok in some instances, but in some cases excluded some students, due to Tel3 / Tel1 / IDNo. What makes writing a query for such an issue difficult is that sometimes all 3 (IDNo, Tel1, Tel3) have values, in other cases only 2 of the 3 have values, in other only 1, in other none have a value. To add more pain to the issue, first instance of IDNo / Tel1 / Tel3 for student Jack might be identical to 2nd instance of Jack, the 3rd instance they might all differ, 4th instance some might differ.

Perhaps I should just use a cte -
Student, Tel1 for 1st query.
Then Student, Tel3 for 2nd query.
Then Student, IDNo for 3rd query.

Then join all 3 together, and hope it does the job.


u know, u are querying a single table for duplicates... i mean how more simple than that could you get?!

personally, i dont like bottle feeding people... but i think my recommendations would work...
1. joining each field (as what one poster seem to have done)
2. using CTE
3. Using the having count clause...

if none of those work then you to take a look at one of the posters signature....
dont think about what u want to do with a row but what u need to do with a column.
Post #1446268
« Prev Topic | Next Topic »

Add to briefcase ««123»»

Permissions Expand / Collapse