Identifying Duplicates (Not the usual)

  • Hi,

    I currently use some software which identifies duplicate records for me based on Initial, Surname, Property and Postcode so that I don't have to worry about doing it. The way that it identifies duplicate records is by entering a value in a field called dup_code. The problem is that it's not that clever when you have for example the following records all at the same address:

    S Andrews

    Sarah Andrews

    Sharon Andrews

    The software will mark all these records as duplicates because the first letter of the "Forename" field is "S". The software identifies the records as duplicates in the following way:

    Forename Surname Dup_Code

    S Andrews 1

    Sarah Andrews 1

    Sharon Andrews 1

    Paul Rowling 2

    P Rowling 2

    Ozzie Osborne 0

    In this example any record that does not have a duplicate record is always given a dup_code of 0. All other records are given a unique dup_code which will match with an associated duplicate record.

    Because there is no way to identify if S Andrews is a duplicate with Sarah or Sharon, I want to update their dup_code to 0, but leave all others intact.

    You may ask why I want to do this? Well, the DB that I have stores customer info in a customer table and transactions (such as what the customer bought, when they bought it and how much) in a transactions table. Therefore I want to be able to identify duplicate customer records, transfer the transactions of the duplicate person I want to delete to the person I want to keep and then delete the duplicate person. In the instance of the "Andrews" example above, I don't know which person to transfer the transactions to and so because I cannot do this accurately I will just take a hit and leave them as they are.

    The actual field definitions are as follows:

    Customers.forename

    Customers.surname

    Customers.dup_code

    Any help would be greatly appreciated.

    Cheers

    Paul

  • I think you'll have to use more than just the customers first and last names to cull duplications out.

    I would think that it would be very easy to have two distinct customers with the same name, but living in two different places.

    Can you give us the structure of the customer table, and if it links to an address table, that too?

    I'm bet that with the customer name, zipcode, and email all combined, you can find the real duplicates.

    Lowell

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Hi Lowell,

    The customer table actually has all the address fields in there. The software that I use finds duplicates based on the customers initial, surname, property and postcode. The reason I need to id duplicates via inital is because the data I receive from shops has numerous duplicates whereby a customer record has beeen entered twice such as P Rowling, and Paul Rowling. These can be assumed to be the same person if they are at the same address, which my software does for me. The problem I have is that the software identifies the following example as duplicates:

    P Rowling

    Paul Rowling

    Peter Rowling

    The above cannot be assumed because P Rowling could be the same person as Paul or Peter. In this case I want to say that none of the records are duplicates.

    The fields of interest in the customer table are:

    forename

    surname

    dup_code

    property

    street

    locality

    town

    county

    postcode

    Thanks for your help

    Paul

  • If you had a CustomerNo as the primary key in your table and accept user input via the CustomerNo, you wouldn't have this problem.

    What logic would you use to know if Sarah Andrews or Sharon Andrews is a duplicate of S Andrews? You need a manual intervention to do that. I don't envy your job!


    Joseph

  • This would give you the customers that are in this situation :

    
    
    SELECT cust.* FROM Customers cust
    WHERE cust.dup_code = 1
    AND EXISTS
    (SELECT 1 FROM Customers c2
    WHERE c2.surname = cust.surname
    AND LEFT(c2.forename,1) = left(c2.forename,1)
    GROUP BY c2.surname, left(c2.forename,1)
    HAVING COUNT(*) > 2)
  • Just for information, we receive the data from a shop till system whereby the primary key is a concatenation of the customers forename, surname and date of birth. This primary key is pretty useless as the DOB is rarely recorded properly, hence you can see why so many duplicates can arise.

    With regards to Sarah Andrews, Sharon Andrews and S Andrews this exactly my point. You cannot identify if S Andrews is a duplicate with Sarah or Sharon, I agree this would require manual intervention which is not feasible therefore I want to flag them as non duplicates.

    Cheers

    Paul

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply