RE: filter duplicate students via T-SQL

Hall of Fame

Points: 3560

April 24, 2013 at 10:06 am

I've continued to scratch my head over this one, and I think I have a better solution now. What was troubling me is the fact that there is no good way to identify the "original" row from a "duplicate" row, especially with no row ID value, such as a primary key, for the set of data. In the end I decided the only way to approach this was to compare every row with every other row for the same Student name to see if any of the pertinent columns match, and if there is a match on any of them, flag it as a possible duplicate row. This still requires some way to identify the rows uniquely, so I added a new CTE to supply a unique row number for each row, and then in the main query I use a series of subqueries and a FOR XML trick I learned awhile back to check for likely duplicate rows and list their row numbers.

I feel pretty confident that this will yield meaningful data, though it might take awhile to run against "real world" data because of all the subqueries that are essentially partial cross-joins for the complete set of data.

In the example below I added a few rows for the student "Jack" to represent most of the variations in populated vs. null data points.

with

MyData as

(SELECT 'Alison' as Student, NULL as IDNo, '1111' as Tel3, '4444' as Tel1 UNION ALL

SELECT 'Alison' as Student, '80405088' as IDNo, '1111' as Tel3, '4444' as Tel1 UNION ALL

SELECT 'Amber' as Student, NULL as IDNo, '5555' as Tel3, '6666' as Tel1 UNION ALL

SELECT 'Amber' as Student, NULL as IDNo, '7777' as Tel3, '7777' as Tel1 UNION ALL

SELECT 'Carmen' as Student, NULL as IDNo, '8888' as Tel3, '8888' as Tel1 UNION ALL

SELECT 'Carmen' as Student, '20083085' as IDNo, '9999' as Tel3, '8888' as Tel1

union all select 'Jack', '20001111', '1212', '3232'

union all select 'Jack', null, '1212', '3232'

union all select 'Jack', null, null, '3232'

union all select 'Jack', null, null, null

union all select 'Jack', '20001111', null, null

union all select 'Jack', null, '3232', '1212'),

MyDataWithRowID as

(select RowID = row_number() over (order by (select null)), Student, IDNo, Tel3, Tel1 from MyData)

select

RowID,

Student,

IDNo,

Tel3,

Tel1,

IDNo_MatchingRows = stuff((select ', ' + cast(RowID as varchar) from MyDataWithRowID m1 where m.Student = m1.Student and m.RowID <> m1.RowID and m.IDNo = m1.IDNo order by m1.RowID for xml path(''), type).value('.', 'varchar(max)'), 1, 2, ''),

Tel3_MatchingRows = stuff((select ', ' + cast(RowID as varchar) from MyDataWithRowID m1 where m.Student = m1.Student and m.RowID <> m1.RowID and m.Tel3 = m1.Tel3 order by m1.RowID for xml path(''), type).value('.', 'varchar(max)'), 1, 2, ''),

Tel1_MatchingRows = stuff((select ', ' + cast(RowID as varchar) from MyDataWithRowID m1 where m.Student = m1.Student and m.RowID <> m1.RowID and m.Tel1 = m1.Tel1 order by m1.RowID for xml path(''), type).value('.', 'varchar(max)'), 1, 2, ''),

Tel1ToTel3_MatchingRows = stuff((select ', ' + cast(RowID as varchar) from MyDataWithRowID m1 where m.Student = m1.Student and m.RowID <> m1.RowID and (m.Tel1 = m1.Tel3 or m.Tel3 = m1.Tel1) order by m1.RowID for xml path(''), type).value('.', 'varchar(max)'), 1, 2, ''),

All_Identified_Duplicate_Rows = stuff((select ', ' + cast(RowID as varchar) from MyDataWithRowID m1 where m.Student = m1.Student and m.RowID <> m1.RowID and (m.IDNo = m1.IDNo or m.Tel3 = m1.Tel3 or m.Tel1 = m1.Tel1 or m.Tel1 = m1.Tel3 or m.Tel3 = m1.Tel1) order by m1.RowID for xml path(''), type).value('.', 'varchar(max)'), 1, 2, ''),

Tel1_Equals_Tel3 = case when m.Tel1 = m.Tel3 then 'Yes' else 'No' end

from

MyDataWithRowID m