filter duplicate students via T-SQL

  • /*

    I need to filter for duplicate students:

    ---------------------------------------

    For same student, if one or both instance/s of IDNo is null then do the following:

    1.

    if both instances of Tel3 are identical for same Student (Alison), if they are

    use it to populate field Sort with the value, 1111.

    2.

    if both instances of Tel3 are not identical for same Student (Carmen), check if both instances of Tel1 are identical for same Student, if they are use it to populate field Sort with the value, 8888.

    3.

    if both instances of Tel3 and Tel1 are not identical for same Student (Amber), check if an instance of Tel3 & Tel1 are identical for same Student, if they are use it to populate field Sort with the value, 7777.

    Since there is no match between 1st row and 2nd row (except for the name) we can't say it's the

    same user for sure, so don't populate the field Sort with any value where row = ( 'Amber' as Student, NULL as IDNo, '5555' as Tel3, '6666' as Tel1 ).

    */

    SELECT 'Alison' as Student, NULL as IDNo, '1111' as Tel3, '4444' as Tel1, NULL as Sort UNION ALL

    SELECT 'Alison' as Student, '80405088' as IDNo, '1111' as Tel3, '4444' as Tel1, NULL as Sort UNION ALL

    SELECT 'Amber' as Student, NULL as IDNo, '5555' as Tel3, '6666' as Tel1, NULL as Sort UNION ALL

    SELECT 'Amber' as Student, NULL as IDNo, '7777' as Tel3, '7777' as Tel1, NULL as Sort UNION ALL

    SELECT 'Carmen' as Student, NULL as IDNo, '8888' as Tel3, '8888' as Tel1, NULL as Sort UNION ALL

    SELECT 'Carmen' as Student, '20083085' as IDNo, '9999' as Tel3, '8888' as Tel1, NULL as Sort

    ***

    Apologies for duplicate posting, when trying to post initially I got DNS errors from Internet Explorer, so re-posted, again DNS errors, etc. After the 3rd attempt my post was successful, only to notice more entries posted.

    ***

  • This would be a lot easier with ddl (create table statements) and sample data (insert statements) along with the desired output based on the sample data. Please take a few minutes and read the first article in my signature for best practices when posting questions.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • SELECT EmpPin, trxdate, trxtime, COUNT(*) AS dupes

    FROM dbo.rawtrx

    GROUP BY EmpPin, trxdate, trxtime

    HAVING (COUNT(*) > 1)

    play around with this query

  • okay, after reading further your questions and scenarios... i doubt that my last post would work...

    how many times do you have the same student recurring? do you ever have a situation like this?

    kyle null 1111 1111

    kyle null 1111 1112

    kyle 111 1111 1111

  • sdhanpaul,

    Prior to removing duplicate instance of user, I first want to get list of duplicate users, to do that I need to filter them and make sure they really are duplicates.

    I can filter them by IDNo - unfortunately more than half of the users with same name (firstname + surname) don't have IDNo, so I am forced to check TelephoneNo3 and TelephoneNo1 fields to try and identify if they really are the same user/s. As in my example above, the first row for user Alison is not identical to the second row for user Alison, same applied to the other users. So using COUNT(*) and HAVING (COUNT(*) > 1) clause won't do the job.

  • how about if you join the fields?

    Try this:

    WITH CTE (emppin,trxdate, trxtime, DuplicateCount)

    AS

    (

    SELECT emppin,trxdate, trxtime,

    ROW_NUMBER() OVER(PARTITION BY emppin,trxdate, trxtime ORDER BY emppin) AS DuplicateCount

    FROM dbo.rawtrx

    )

    DELETE

    FROM CTE

    WHERE DuplicateCount > 1

    you need to replace the fields etc

    this however deletes the duplicates... im really not much of a bottle feeder so you need to do the modifications.

  • I would like to follow this, but I don't understand the sentence "if they are use it to populate field Sort with the value". I don't see the Sort column populated at all for any of the examples.

    Did the replies here already help you?

  • Notice how people just keep taking random attempts at helping you. If you would take a few minutes and post ddl, sample data and desired output you would have a tested answer quickly. This really isn't too difficult but without something to work with it is just guessing.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Apart from what Sean said (i.e. the missing DDL, which is basically table create statements plus some INSERT statements of sample data to create your scenario), I assume you also have a Tel2 column in that table? Not a good idea, since it makes your database design incredibly inflexible. Read up on Normalization (start with this Wikipedia article or read through Tom Thomson's really great series on Normalization on this site. By the way, is this an exam question?

    --------------------------------------------------------------------------
    A little knowledge is a dangerous thing (Alexander Pope)
    In order for us to help you as efficiently as possible, please read this before posting (courtesy of Jeff Moden)[/url]

  • First, you need to have some kind of ID on the records in the table, otherwise each record will be always matched to itself.

    Assuming, you have such ID here is a test table (which should be really provided by you in the initial post):

    --DROP TABLE #Student

    SELECT IDENTITY(int, 1,1) RowID, *

    INTO #Student

    FROM (

    SELECT 'Alison' as Student, NULL as IDNo, '1111' as Tel3, '4444' as Tel1, NULL as Sort UNION ALL

    SELECT 'Alison' as Student, '80405088' as IDNo, '1111' as Tel3, '4444' as Tel1, NULL as Sort UNION ALL

    SELECT 'Amber' as Student, NULL as IDNo, '5555' as Tel3, '6666' as Tel1, NULL as Sort UNION ALL

    SELECT 'Amber' as Student, NULL as IDNo, '7777' as Tel3, '7777' as Tel1, NULL as Sort UNION ALL

    SELECT 'Carmen' as Student, NULL as IDNo, '8888' as Tel3, '8888' as Tel1, NULL as Sort UNION ALL

    SELECT 'Carmen' as Student, '20083085' as IDNo, '9999' as Tel3, '8888' as Tel1, NULL as Sort

    ) DT (Student, IDNo, Tel3, Tel1, Sort)

    And here is the query returning matching pairs with appropriate Sort value:

    SELECT T1.Student, T1.RowID, T1.IDNo, T1.Tel3, T1.Tel1,

    CASE

    WHEN T2.Tel3 = T1.Tel3 THEN 1111

    WHEN T2.Tel1 = T1.Tel1 THEN 8888

    WHEN T2.Tel1 = T1.Tel3 OR T2.Tel3 = T1.Tel1 THEN 7777

    END Sort

    , T2.RowID,T2.IDNo, T2.Tel3, T2.Tel1

    FROM #Student T1

    INNER JOIN #Student T2 ON T2.Student = T1.Student AND T2.RowID <> T1.RowID AND T2.IDNo IS NULL AND (

    T2.Tel3 = T1.Tel3

    OR T2.Tel1 = T1.Tel1

    OR T2.Tel1 = T1.Tel3 OR T2.Tel3 = T1.Tel1

    )

    Assuming that it's the lines with IDNo which must be marked as duplicates, I apply the Sort status to T2 entries:

    UPDATE T2

    SET Sort = CASE

    WHEN T2.Tel3 = T1.Tel3 THEN 1111

    WHEN T2.Tel1 = T1.Tel1 THEN 8888

    WHEN T2.Tel1 = T1.Tel3 OR T2.Tel3 = T1.Tel1 THEN 7777

    END

    FROM ..

    Is it close to what you need?

    _____________
    Code for TallyGenerator

  • Hi, perhaps I did not explain myself clearly, so I am trying to fix that.

    I need to search a table with user records, find only those users that occur more than once.

    I could just use the username (Student) to find the duplicates, but unfortunately half of their IDNo's are NULL, so I it's you cannot say they are duplicates. The only other way to check if they are duplicates is to check Tel3 and Tel1 (to try and check if they are duplicates).

    Since are few forum users are asking for ddl, below I have provided it. It works to a certain extent.

    And NO, this is not for an exam, interview questionnaire, study topic, etc.

    CREATE TABLE #Student (

    [Student] [varchar](30) NULL,

    [IDNo] [varchar](13) NULL,

    [Tel3] [varchar](12) NULL,

    [Tel1] [varchar](12) NULL,

    [Sort] [varchar](12) NULL

    )

    INSERT INTO #Student

    SELECT 'Alison Stew' as Student, '405088' as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, NULL

    INSERT INTO #Student

    SELECT 'Alison Stew' as Student, NULL as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, NULL

    INSERT INTO #Student

    SELECT 'Amber Egno' as Student, NULL as IDNo, '+7677170700' as Tel3, '+76772476573' as Tel1, NULL

    INSERT INTO #Student

    SELECT 'Amber Egno' as Student, NULL as IDNo, '+76772476573' as Tel3, '+76772476573' as Tel1, NULL

    INSERT INTO #Student

    SELECT 'Carmen Schw' as Student, '083085' as IDNo, '+77711498898' as Tel3, '+77117642110' as Tel1, NULL

    INSERT INTO #Student

    SELECT 'Carmen Schw' as Student, NULL as IDNo, '+77117642110' as Tel3, '+77117642110' as Tel1, NULL

    -------------------------------------------------------------------------------------------------

    SELECT

    [Student]

    ,IDNo

    ,Tel1

    ,Tel3

    INTO #FilterDupes

    FROM #Student

    -------------------------------------------------------------------------------------------------

    SELECT

    [Student]

    --,IDNo

    --,Tel1

    --,Tel3

    ,count([Student]) as Dupes

    FROM #FilterDupes

    GROUP BY

    [Student]

    --,IDNo

    --,Tel1

    --,Tel3

    HAVING COUNT([Student]) > 1

    DROP TABLE #Student

    DROP TABLE #FilterDupes

  • Now, what is wrong with the approach I posted?

    Where does it fail to provide the result you need?

    _____________
    Code for TallyGenerator

  • Hi Sergiy, firstly thanks to you and everyone else who responded to my post.

    I think I should rather have named the title of my post:

    How to find duplicates based on 3 fields ? (being IDNo, Tel1, Tel3 - for Student)

    Your script/sql query worked ok in some instances, but in some cases excluded some students, due to Tel3 / Tel1 / IDNo. What makes writing a query for such an issue difficult is that sometimes all 3 (IDNo, Tel1, Tel3) have values, in other cases only 2 of the 3 have values, in other only 1, in other none have a value. To add more pain to the issue, first instance of IDNo / Tel1 / Tel3 for student Jack might be identical to 2nd instance of Jack, the 3rd instance they might all differ, 4th instance some might differ.

    Perhaps I should just use a cte -

    Student, Tel1 for 1st query.

    Then Student, Tel3 for 2nd query.

    Then Student, IDNo for 3rd query.

    Then join all 3 together, and hope it does the job.

  • hmm, it did not quite explain the problem.

    At least to me.;-)

    Can you please provide some data samples where the script fails and explain what outcome you'd expect from it.

    Then I (or others) could try to figure out a solution.

    _____________
    Code for TallyGenerator

  • The factor preventing us from providing an accurate answer is because we need a set of expected results in a tabular format that the query should generate from the sample data. This is the only robust way to describe the required query logic.

    Having said, that I'm including a possible answer using the provided DDL:

    -- Set up test data

    declare @Student table (

    [Student] [varchar](30) NULL,

    [IDNo] [varchar](13) NULL,

    [Tel3] [varchar](12) NULL,

    [Tel1] [varchar](12) NULL,

    [Sort] [varchar](12) NULL

    );

    INSERT INTO @Student

    SELECT 'Alison Stew' as Student, '405088' as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, null;

    INSERT INTO @Student

    SELECT 'Alison Stew' as Student, NULL as IDNo, '+77721377200' as Tel3, '+77721377200' as Tel1, null;

    INSERT INTO @Student

    SELECT 'Amber Egno' as Student, NULL as IDNo, '+7677170700' as Tel3, '+76772476573' as Tel1, null;

    INSERT INTO @Student

    SELECT 'Amber Egno' as Student, NULL as IDNo, '+76772476573' as Tel3, '+76772476573' as Tel1, null;

    INSERT INTO @Student

    SELECT 'Carmen Schw' as Student, '083085' as IDNo, '+77711498898' as Tel3, '+77117642110' as Tel1, null;

    INSERT INTO @Student

    SELECT 'Carmen Schw' as Student, NULL as IDNo, '+77117642110' as Tel3, '+77117642110' as Tel1, null;

    -- Query

    with rules_per_student as (

    selectStudent,

    sum(case when IDNo is null then 1 else 0 end) as HasNullRows,

    case when min(Tel3) = max(Tel3) then 1 else 0 end as IdenticalTel3Rows,

    case when min(Tel1) = max(Tel1) then 1 else 0 end as IdenticalTel1Rows,

    sum(case when Tel1 = Tel3 then 1 else 0 end) as HasTel1EqualTel3InRow

    from @Student

    group by Student

    )

    selects.Student,

    s.IDNo,

    s.Tel3,

    s.Tel1,

    case when r.HasNullRows > 0 then

    case when IdenticalTel3Rows = 1 then '1111'

    when IdenticalTel3Rows = 0 and IdenticalTel1Rows = 1 then '8888'

    when IdenticalTel3Rows = 0 and IdenticalTel1Rows = 0 and HasTel1EqualTel3InRow > 0 then '7777'

    end

    else null end as Sort

    from @Student s

    join rules_per_student r on r.student = s.student

Viewing 15 posts - 1 through 15 (of 23 total)

You must be logged in to reply to this topic. Login to reply