Duplicates

  • Hello

    I have 4 tables:

    Customer address, name, telephone number and email address:

    SELECT A.Contact_id, A.first_name, A.last_name,

    B.Contact_id, B.Address_1, B.Address_2, B.Address_3, B.zipcode

    C.Contact_id, C.Telephone,

    D.Contact_Id_D.Email,

    FROM Contact AS A,

    Addresses AS B,

    Telephones AS C,

    Emails as D,

    INNER JOIN A.contact_id = B.contact_id

    INNER JOIN c.contact_id = a.contact_id;

    INNER JOIN d.contact_id = c.contact_id

    I want to find out if I have duplications in name (surname) and address. However I also want to show details like the customer's telephone number in my report.

    So I have basically linked the 4 tables together with inner joins. Next step is trying to find where surname and address 1 and zip code are duplicates.

    Do you have any suggestions on how I can achieve this?

    Many thanks.

  • faulknerwilliam2 (3/26/2015)


    Hello

    I have 4 tables:

    Customer address, name, telephone number and email address:

    SELECT A.Contact_id, A.first_name, A.last_name,

    B.Contact_id, B.Address_1, B.Address_2, B.Address_3, B.zipcode

    C.Contact_id, C.Telephone,

    D.Contact_Id_D.Email,

    FROM Contact AS A,

    Addresses AS B,

    Telephones AS C,

    Emails as D,

    INNER JOIN A.contact_id = B.contact_id

    INNER JOIN c.contact_id = a.contact_id;

    INNER JOIN d.contact_id = c.contact_id

    I want to find out if I have duplications in name (surname) and address. However I also want to show details like the customer's telephone number in my report.

    So I have basically linked the 4 tables together with inner joins. Next step is trying to find where surname and address 1 and zip code are duplicates.

    Do you have any suggestions on how I can achieve this?

    Many thanks.

    Your question is pretty confusing. Can you post ddl, sample data and desired output?

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Hello Sean,

    Yes, it was a hopeless entry. I reposted with this, which I hope is clearer:

    Hello, I posted earlier with a poor script. Below is a better one.

    It pulls through all customers, plus their addresses, email and phone numbers.

    However some have duplicate surnames and addresses.

    Please can you advise an amendment to the script that would show which records were duplicates? Ideally I'd like the output to show where the surname, address 1 and the zipcode were identical - and restrict the output to show duplicate rows only.

    Many thanks.

    SELECT

    Reference.ReferenceNumber AS [Reference.ReferenceNumber)

    Reference.LastName

    Reference.FirstName

    Address.ReferenceNumber AS [Address.ReferenceNumber]

    Address.Address1

    Address.Address2

    Address.Address3

    Address.Address4

    Address.ZipCode

    Telephone.ReferenceNumber AS [Telephone.ReferenceNumber]

    Telephone.TelephoneNumber

    Email.ReferenceNumber AS [Email.ReferenceNumber]

    Email.EmailAddress

    FROM

    Reference

    INNER JOIN Address

    ON Reference.ReferenceNumber = Reference.ContactNumber

    LEFT OUTER JOIN Telephone

    ON Reference.ReferenceNumber = Telephone.ReferenceNumber

    LEFT OUTER Join Email

    ON Reference.ReferenceNumber = Email.ReferenceNumber

  • faulknerwilliam2 (3/26/2015)


    Hello Sean,

    Yes, it was a hopeless entry. I reposted with this, which I hope is clearer:

    Hello, I posted earlier with a poor script. Below is a better one.

    It pulls through all customers, plus their addresses, email and phone numbers.

    However some have duplicate surnames and addresses.

    Please can you advise an amendment to the script that would show which records were duplicates? Ideally I'd like the output to show where the surname, address 1 and the zipcode were identical - and restrict the output to show duplicate rows only.

    Many thanks.

    SELECT

    Reference.ReferenceNumber AS [Reference.ReferenceNumber)

    Reference.LastName

    Reference.FirstName

    Address.ReferenceNumber AS [Address.ReferenceNumber]

    Address.Address1

    Address.Address2

    Address.Address3

    Address.Address4

    Address.ZipCode

    Telephone.ReferenceNumber AS [Telephone.ReferenceNumber]

    Telephone.TelephoneNumber

    Email.ReferenceNumber AS [Email.ReferenceNumber]

    Email.EmailAddress

    FROM

    Reference

    INNER JOIN Address

    ON Reference.ReferenceNumber = Reference.ContactNumber

    LEFT OUTER JOIN Telephone

    ON Reference.ReferenceNumber = Telephone.ReferenceNumber

    LEFT OUTER Join Email

    ON Reference.ReferenceNumber = Email.ReferenceNumber

    Starting with a query that actually works would be a big improvement.

    Why do you want to return the reference number over and over. They will either be NULL or the same value because that is your join criteria.

    And since you still didn't post ddl, sample data or desired output this is completely untested.

    with Duplicates as

    (

    select r.LastName

    , a.Address1

    , a.ZipCode

    , COUNT(*) as DuplicateCount

    FROM Reference r

    INNER JOIN Address a ON a.ReferenceNumber = r.ReferenceNumber --?? not sure what is right here

    LEFT OUTER JOIN Telephone t ON r.ReferenceNumber = t.ReferenceNumber

    LEFT OUTER Join Email e ON r.ReferenceNumber = e.ReferenceNumber

    group by r.LastName

    , a.Address1

    , a.ZipCode

    having COUNT(*) > 1

    )

    SELECT

    r.ReferenceNumber

    , r.LastName

    , r.FirstName

    --Address.ReferenceNumber AS [Address.ReferenceNumber] why bother? This is the same value

    , a.Address1

    , a.Address2

    , a.Address3

    , a.Address4

    , a.ZipCode

    --, t.ReferenceNumber AS [Telephone.ReferenceNumber]

    , t.TelephoneNumber

    --, e.ReferenceNumber AS [Email.ReferenceNumber]

    , e.EmailAddress

    , d.DuplicateCount

    FROM Reference r

    INNER JOIN Address a ON a.ReferenceNumber = r.ReferenceNumber --?? not sure what is right here

    LEFT OUTER JOIN Telephone t ON r.ReferenceNumber = t.ReferenceNumber

    LEFT OUTER Join Email e ON r.ReferenceNumber = e.ReferenceNumber

    join Duplicates d on d.LastName = r.LastName

    AND d.Address1 = a.Address1

    AND d.ZipCode = a.ZipCode

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Thanks for taking the time and your patience; I'm relatively new to actual scripting, as is probably obvious.

    The query worked fine, in that it pulled through all our customers: their names, addresses, phone numbers and email addresses. The first join that you question, the INNER, is to link the address table to the reference table. Should I be doing that in some other way?

    My ultimate aim is to return duplicate values only.

    Hope this is clear(er)?

  • faulknerwilliam2 (3/26/2015)Should I be doing that in some other way?

    Not necessarily but your join condition did not reference the address table.

    INNER JOIN Address

    ON Reference.ReferenceNumber = Reference.ContactNumber

    If there is something else not working in your query you need to provide ddl, sample data and desired output because I have no idea what that might be. Please take a few minutes and read the first link in my signature.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Dear Sean

    The script you suggested above worked, but (due to my poor explanation of what I actually wanted) the data that's come out isn't quite what I want.

    Basically our contact centre sometimes logs the same person without checking if they have contacted us before and so the customer ends up with two (sometimes more!) reference numbers. We want to cleanse this, so:

    I would like to output instances where the customer's surname, address1 and zipcode are duplicated but only if the customer has different reference numbers.

    This is the type of data that I'd like to see output:

    Ref Number Surname Address1 Zip

    1875 Faulkner 10 Smith Street 08540

    1876 Faulkner 10 Smith Street 08540

    If we had input John Smith, 11 Smith Street, 08976 twice, but using the same reference number (as should be happening) then I wouldn't want to see him being output in the report.

    So basically, your script suggestion returns all duplicates, not those that duplicate except for the reference number.

    Do you have any advice on how I can achieve this?

    Many thanks.

  • Could you please post the scripts for the two tables in the problem and enough inserts statements to recreate the problem you are having? If you ask a question and include code to create an populate the tables in your problem, you are more likely to get tested solutions instead of haphazard guesses. Go ahead and obfuscate your data if you need to, but if we have nothing to work with, it's really hard to come up with a solution that will actually work.

  • Will do. Apologies for my many egregious breaches of forum etiquette: I will do better next time.

  • As per the request above, please see:

    create table Reference(referencenumber int primary key identity(1,1),lastname varchar(20),firstname varchar(20))

    create table [Address](addressid int primary key identity(1,1),referencenumber int references reference(referencenumber),address1 varchar(20),

    city varchar(20),state varchar(2),country char(2),zipcode varchar(20))

    go

    insert into reference values('lincoln','sam'),('wright','mike'),('lincoln','will'),('lincoln','sam'),('lincoln','sam')

    go

    insert into [Address] values(1,'east avenue','atlanta','GA','US',12345),

    (2,'south avenue','atlanta','GA','US',12345),

    (3,'west avenue','atlanta','GA','US',12345),

    (4,'east avenue','atlanta','GA','US',12345),

    (5,'east avenue','atlanta','GA','US',12345)

    So basically in my query I would only want to see reference numbers 4 and 5. Basically surname, address 1, zip code should = but the reference number should be <> , not =.

    I hope this is clearer and that the test data is workable.

  • I'd be tempted to work on each table in turn, logging the duplicated reference numbers. There are loads of ways of detecting dupes. Here's one of them which works nicely on your sample set:

    SELECT a.*

    FROM #Address a

    WHERE EXISTS (

    SELECT 1

    FROM #Address b

    WHERE b.address1 = a.address1

    AND b.city = a.city

    AND b.state = a.state

    AND b.country = a.country

    AND b.zipcode = a.zipcode

    AND b.referencenumber <> a.referencenumber

    )

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Why not just add this to the join predicates to the Duplicates cte in the script I posted yesterday.

    AND d.ReferenceNumber <> r.ReferenceNumber

    I am more than a little concerned that you don't understand the code very well because that should be a pretty minor tweak.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Why not just add this to the join predicates to the Duplicates cte in the script I posted yesterday.

    AND d.ReferenceNumber <> r.ReferenceNumber

    I did try that Sean, but the output is currently doing two things. It IS showing me where the fields are identical apart from the reference number but confusingly it is still bringing through records where all the fields are matching, including the reference number.

    I am slightly confused about this as I thought using <> would stop records coming through where the reference number was the same.

    Yours (or anyone else's) thoughts about this would be appreciated.

    Re your second point, which I accept completely:

    I am more than a little concerned that you don't understand the code very well because that should be a pretty minor tweak.

    Up till now I have written fairly basic scripts then used filters, IIF statements etc in SSRS to get the data I want. The above is an attempt to get the data in a relatively presentable state prior to using SSRS. However, as I soon realised, I was probably trying to run before I could walk.

    Thanks.

  • To ChrisM@Work

    Thanks - your solution worked well and WHERE EXISTS was not something I had encountered before.

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply