filter duplicate students via T-SQL

  • After reading through the thread, I see that none of the solutions offered thus far have recognized the necessity of treating the rows with a known IDNo differently than the rows with a null IDNo. Before any "filtering" of duplicates can be accomplished, these two sets of rows must be separated from one another and then related back to each other through a join on the Student values and either the Tel3 or Tel1 values, whichever matches (if either do match).

    The solution below uses three CTEs to set the stage: MyData represents your original set of raw rows of data; ID_Populated represents the subset of MyData in which the IDNo has a known value; ID_Unpopulated represents the subset of rows in which the IDNo is unknown (null). After that the main query relates ID_Populated to ID_Unpopulated on the Student column and either the Tel3 or Tel1 column whichever matches. Because the join option is a FULL JOIN, if there are no matches between the two subsets, those rows are still retained. If there are still duplicates in the results of this query, it is because there are more than one duplicate entry of the same student. Additional grouping or filtering would be required to address that situation.

    with

    MyData as

    (SELECT 'Alison' as Student, NULL as IDNo, '1111' as Tel3, '4444' as Tel1 UNION ALL

    SELECT 'Alison' as Student, '80405088' as IDNo, '1111' as Tel3, '4444' as Tel1 UNION ALL

    SELECT 'Amber' as Student, NULL as IDNo, '5555' as Tel3, '6666' as Tel1 UNION ALL

    SELECT 'Amber' as Student, NULL as IDNo, '7777' as Tel3, '7777' as Tel1 UNION ALL

    SELECT 'Carmen' as Student, NULL as IDNo, '8888' as Tel3, '8888' as Tel1 UNION ALL

    SELECT 'Carmen' as Student, '20083085' as IDNo, '9999' as Tel3, '8888' as Tel1),

    ID_Populated as (select Student, IDNo, Tel3, Tel1 from MyData where IDNo is not null),

    ID_Unpopulated as (select Student, IDNo, Tel3, Tel1 from MyData where IDNo is null)

    select

    Student = isnull(ID_Populated.Student, ID_Unpopulated.Student),

    IDNo = ID_Populated.IDNo,

    Tel3 = isnull(ID_Populated.Tel3, ID_Unpopulated.Tel3),

    Tel1 = isnull(ID_Populated.Tel1, ID_Unpopulated.Tel1),

    Sort = case when ID_Populated.Tel3 = ID_Unpopulated.Tel3 then ID_Populated.Tel3

    when ID_Populated.Tel1 = ID_Unpopulated.Tel1 then ID_Populated.Tel1

    when isnull(ID_Populated.Tel1, ID_Unpopulated.Tel1) = isnull(ID_Populated.Tel3, ID_Unpopulated.Tel3) then isnull(ID_Populated.Tel1, ID_Unpopulated.Tel1) end

    from

    ID_Populated full join

    ID_Unpopulated on ID_Populated.Student = ID_Unpopulated.Student and (ID_Populated.Tel3 = ID_Unpopulated.Tel3 or ID_Populated.Tel1 = ID_Unpopulated.Tel1)

  • geoff5 (4/23/2013)


    After reading through the thread, I see that none of the solutions offered thus far have recognized the necessity of treating the rows with a known IDNo differently than the rows with a null IDNo.

    Not quite true.

    In my script you can see that all the "duplicate candidates" (T2) must have IDNo IS NULL.

    There no condition on T1.IDNo, as "primary" entry may have it NULL as well, according to the OP description.

    _____________
    Code for TallyGenerator

  • Sergiy,

    I apologize for overlooking that detail of your script. Your script does identify the names with a duplicate entry by isolating the "known" ID values from the "unknown" ID values; however, I think the main intent of the original request was to filter out the duplicates, not to identify them, except perhaps by the "sort" column to indicate that a matching duplicate was identified by means of the T1 or T3 column.

    The intent of the query solution I supplied is to yield a list of names with no identifiable duplicates in the list. If there are duplicate entries of the same student, it is only because neither the T1 nor the T3 columns matched up to supply a definite identification of the duplicate entry.

    If I misconstrued the intent of the original post, and the goal is to identify duplicates rather than to suppress them, then your solution is likely a better one for this thread.

    Come to think of it, there does seem to be a mixing up of "filtering" duplicates and "identifying" them at one and the same time. If so, these tasks are at cross purposes because by definition eliminating duplicates means removing from the results the very rows that are needed in order to identify the duplicates and ultimately to eliminate them from the underlying data.

  • I've continued to scratch my head over this one, and I think I have a better solution now. What was troubling me is the fact that there is no good way to identify the "original" row from a "duplicate" row, especially with no row ID value, such as a primary key, for the set of data. In the end I decided the only way to approach this was to compare every row with every other row for the same Student name to see if any of the pertinent columns match, and if there is a match on any of them, flag it as a possible duplicate row. This still requires some way to identify the rows uniquely, so I added a new CTE to supply a unique row number for each row, and then in the main query I use a series of subqueries and a FOR XML trick I learned awhile back to check for likely duplicate rows and list their row numbers.

    I feel pretty confident that this will yield meaningful data, though it might take awhile to run against "real world" data because of all the subqueries that are essentially partial cross-joins for the complete set of data.

    In the example below I added a few rows for the student "Jack" to represent most of the variations in populated vs. null data points.

    with

    MyData as

    (SELECT 'Alison' as Student, NULL as IDNo, '1111' as Tel3, '4444' as Tel1 UNION ALL

    SELECT 'Alison' as Student, '80405088' as IDNo, '1111' as Tel3, '4444' as Tel1 UNION ALL

    SELECT 'Amber' as Student, NULL as IDNo, '5555' as Tel3, '6666' as Tel1 UNION ALL

    SELECT 'Amber' as Student, NULL as IDNo, '7777' as Tel3, '7777' as Tel1 UNION ALL

    SELECT 'Carmen' as Student, NULL as IDNo, '8888' as Tel3, '8888' as Tel1 UNION ALL

    SELECT 'Carmen' as Student, '20083085' as IDNo, '9999' as Tel3, '8888' as Tel1

    union all select 'Jack', '20001111', '1212', '3232'

    union all select 'Jack', '20001111', '1212', '3232'

    union all select 'Jack', null, '1212', '3232'

    union all select 'Jack', null, null, '3232'

    union all select 'Jack', null, null, null

    union all select 'Jack', '20001111', null, null

    union all select 'Jack', null, '3232', '1212'),

    MyDataWithRowID as

    (select RowID = row_number() over (order by (select null)), Student, IDNo, Tel3, Tel1 from MyData)

    select

    RowID,

    Student,

    IDNo,

    Tel3,

    Tel1,

    IDNo_MatchingRows = stuff((select ', ' + cast(RowID as varchar) from MyDataWithRowID m1 where m.Student = m1.Student and m.RowID <> m1.RowID and m.IDNo = m1.IDNo order by m1.RowID for xml path(''), type).value('.', 'varchar(max)'), 1, 2, ''),

    Tel3_MatchingRows = stuff((select ', ' + cast(RowID as varchar) from MyDataWithRowID m1 where m.Student = m1.Student and m.RowID <> m1.RowID and m.Tel3 = m1.Tel3 order by m1.RowID for xml path(''), type).value('.', 'varchar(max)'), 1, 2, ''),

    Tel1_MatchingRows = stuff((select ', ' + cast(RowID as varchar) from MyDataWithRowID m1 where m.Student = m1.Student and m.RowID <> m1.RowID and m.Tel1 = m1.Tel1 order by m1.RowID for xml path(''), type).value('.', 'varchar(max)'), 1, 2, ''),

    Tel1ToTel3_MatchingRows = stuff((select ', ' + cast(RowID as varchar) from MyDataWithRowID m1 where m.Student = m1.Student and m.RowID <> m1.RowID and (m.Tel1 = m1.Tel3 or m.Tel3 = m1.Tel1) order by m1.RowID for xml path(''), type).value('.', 'varchar(max)'), 1, 2, ''),

    All_Identified_Duplicate_Rows = stuff((select ', ' + cast(RowID as varchar) from MyDataWithRowID m1 where m.Student = m1.Student and m.RowID <> m1.RowID and (m.IDNo = m1.IDNo or m.Tel3 = m1.Tel3 or m.Tel1 = m1.Tel1 or m.Tel1 = m1.Tel3 or m.Tel3 = m1.Tel1) order by m1.RowID for xml path(''), type).value('.', 'varchar(max)'), 1, 2, ''),

    Tel1_Equals_Tel3 = case when m.Tel1 = m.Tel3 then 'Yes' else 'No' end

    from

    MyDataWithRowID m

  • kevin_nikolai (4/23/2013)


    Hi Sergiy, firstly thanks to you and everyone else who responded to my post.

    I think I should rather have named the title of my post:

    How to find duplicates based on 3 fields ? (being IDNo, Tel1, Tel3 - for Student)

    Your script/sql query worked ok in some instances, but in some cases excluded some students, due to Tel3 / Tel1 / IDNo. What makes writing a query for such an issue difficult is that sometimes all 3 (IDNo, Tel1, Tel3) have values, in other cases only 2 of the 3 have values, in other only 1, in other none have a value. To add more pain to the issue, first instance of IDNo / Tel1 / Tel3 for student Jack might be identical to 2nd instance of Jack, the 3rd instance they might all differ, 4th instance some might differ.

    Perhaps I should just use a cte -

    Student, Tel1 for 1st query.

    Then Student, Tel3 for 2nd query.

    Then Student, IDNo for 3rd query.

    Then join all 3 together, and hope it does the job.

    u know, u are querying a single table for duplicates... i mean how more simple than that could you get?!

    personally, i dont like bottle feeding people... but i think my recommendations would work...

    1. joining each field (as what one poster seem to have done)

    2. using CTE

    3. Using the having count clause...

    if none of those work then you to take a look at one of the posters signature....

    dont think about what u want to do with a row but what u need to do with a column.

  • first of all naming a column and the table "Students is no good!

    , this gave me a duplicate and counted the number of duplicates

    SELECT Student, IDNo, Tel3,Tel1, COUNT(*) AS dupes

    FROM dbo.Student

    GROUP BY Student, IDNo, Tel3,Tel1

    HAVING (COUNT(*) > 1)

    now if u need to find the duplicates and delete one of them:

    /* Delete Duplicate records */

    WITH CTE (Student, IDNo, Tel3,Tel1, DuplicateCount)

    AS

    (

    SELECT [Student], IDNo, Tel3,Tel1,

    ROW_NUMBER() OVER(PARTITION BY Student, IDNo, Tel3,Tel1 ORDER BY Student) AS DuplicateCount

    FROM dbo.Student

    )

    DELETE

    FROM CTE

    WHERE DuplicateCount > 1

    problem solved...

    i may be going on a wild thing here ... and cant seem to understand your problem well... but i dont know..

  • i have a problem with using numbers like 1111, 8888, 7777 etc in that sort field... whats the point of limiting a "sort" field to these numbers? i dont get it.

  • sdhanpaul,

    The problem with the "group by" approach you propose is that it defines a duplicate row as that shares values in all the columns Student AND IDNo AND Tel3 AND Tel1. This does not meet the definition of a duplicate row stated in the original post, which is that a duplicate row is one that shares values in Student AND (IDNo OR Tel3 OR Tel1). To handle the latter definition of "duplicate" is much more complex than a basic GROUP BY clause can address.

    The situation is further complicated by the fact that the value in Tel1 in one row (which I presume to be a telephone number in the actual real-world data) could be stored as Tel3 in another row and vice versa. So the test of "duplicate" requires checking not just the values within each of these columns but also between these two columns on different rows.

    I think the solution I posted earlier addresses these concerns. I welcome your feedback, if you test it and find that it does not.

  • geoff5 (4/24/2013)


    I think the solution I posted earlier addresses these concerns. I welcome your feedback, if you test it and find that it does not.

    That would be for Kevin to answer...

    But he's gone and we're not getting to a solution of this riddle.

    Such a pity...

    :crying:

    _____________
    Code for TallyGenerator

Viewing 9 posts - 16 through 23 (of 23 total)

You must be logged in to reply to this topic. Login to reply