Query Duplicates

  • Hello, I just had an interview where I was asked to write a query that only returned records when a member id occured 3 or more times.

    Example: A table contains records of customer complaints with name, email, comments etc.

    Return results where the customer name has appeared in the table 3 or more times.

    I tried:

    SELECT name, email, comments,

    COUNT(name) AS NumOccurrences

    FROM custComplaint

    GROUP BY name, email, comments

    HAVING ( COUNT(name) > 2 )

    but NumOccurrances always equals 1 so no results.

    If I change to HAVING ( COUNT(name) = 1 )

    it returns all records.

    Am I taking the wrong approach? Or is there something missing in my query?

  • This approach seems correct. the NumOfOccurences always showing 1 means there are no duplicate names in your table. try and insert 2 duplicate values(in dev environment) and then re-run the query. it will return rows that have more than 3 occurences.

    try below:

    declare @t table (name varchar(255), email varchar(255), comments varchar(255))

    insert into @t

    select 'abc' as name, 'abc.abc' as email, 'Duplicate' as comments

    insert into @t

    select 'abc' as name, 'abc.abc' as email, 'Duplicate' as comments

    insert into @t

    select 'abc' as name, 'abc.abc' as email, 'Duplicate' as comments

    insert into @t

    select 'cba' as name, 'cba.cba' as email, 'Single' as comments

    SELECT name, email, comments,

    COUNT(name) AS NumOccurrences

    FROM @t

    GROUP BY name, email, comments

    HAVING (COUNT(name) > 2)

  • try to use ROWNUMBER() function , it will give you more flexibilty

    -------Bhuvnesh----------
    I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

  • Thanks for the reply,

    When I paste your query it works perfectly, but when I change it to apply to one of my existing tables NumOccurrences always = 1 even if there are obviuos duplicates.

    The table I am using is :

    tblContacts

    With Columns:

    CONID int

    Type nvarchar(10)

    LID int

    CID int

    ContactCode int

    Notes text

    Date datetime

    EmpID int

    Query:

    SELECT CONID, Type, LID, CID, ContactCode,

    COUNT(CID) AS NumOccurrences

    FROM tblContacts

    GROUP BY CONID, Type,LID, CID, ContactCode, Date, EmpID

    HAVING (COUNT(CID) > 2 )

    Returns an empty table.

    When I change > 2 to = 1 I see why.

    The results I am getting:

    CONID TYPE LID CID ContactCode NumOccurrences

    30772 Group 12 1 0 1

    31249 Group 12 1 3 1

    31840 Group 15 3 0 1

    31511 Group 10 7 0 1

    31515 Group 10 7 3 1

    31251 Group 13 8 3 1

    31061 Group 13 8 0 1

    31374 Group 13 8 3 1

    As you can see there are many instances where CID duplicates occur and yet the NumOccurrences = 1

  • The problem here is that you are grouping by "CONID, Type,LID, CID, ContactCode, Date, EmpID", so only repetitions of same data for all those rows would "add" to the count you are doing.

    You probably want to aggregate the data first and then display those and the other columns you want to display.

    _______________________________________________________________________
    For better assistance in answering your questions, click here[/url]

  • gspeedtech

    To assist you in understanding the problem (Or at least I hope this help)

    Execute this command

    SET SHOWPLAN_ALL ON

    GO

    1. Execute your query

    2. Examine the execution plan. Note the Query Optimizer has elected to use a "Stream Aggregate Operator".

    3. Using Books On Line subject "Stream Aggregate Showplan Operator " execute the example given.

    4. Review what this implies;(Emphasis added by the poster)

    the query optimizer uses the Stream Aggregate operator to group the rows in the SalesPerson table by the TerritoryID column and then calculate the values for the AVG(Bonus) and SUM(SalesYTD) columns based on that grouping.

    I hope this assists you in gaining an understanding of what is happening "under the covers" - so to speak.

    If you are further interested in learning what value an execution plan can be to understanding what is happening vist:

    http://www.sqlservercentral.com/articles/books/65831/

    The above link will allow you to download in PDF format an excellent book by Grant Fritchey.

    If the above does not assist you in understand what is happening post again and some one more familiar with T-SQL might be able to assist you further.

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

  • Eureka! I Found it.

    I created a temp table and Inner Joined it to produce the desired results.

    SELECT CID,

    COUNT(CID) AS NumOccurrences

    Into tblCID FROM tblContacts

    GROUP BY CID

    HAVING (COUNT(CID) > 2)

    order by CID

    Select CONID, Type, LID, tblContacts.CID, ContactCode

    from tblcontacts

    inner Join tblCID

    on tblContacts.CID = tblCID.CID

    order by tblContacts.CID

    Drop Table tblCID

    Please let me know if you can see why my approach would be ill advised or offer inconsistant results.

    Thanks to all for the help.

  • bitbucket-25253

    I will definitely look into this to get a better understanding.

    Thanks for the reply!

  • Review the effect of your "SELECT INTO"

    From BOL:

    Note that your last posting of the T-SQL that works:

    The SELECT INTO statement creates a new table and populates it with the result set of the SELECT statement

    IF, and that is a big assumption on my part.... looking at your last posted T_SQL the table

    SELECT CID,

    COUNT(CID) AS NumOccurrences

    Into tblCID FROM tblContacts

    table "tblCID" does not exist until your T-SQL executes and since the query optimizer is attempting to process the data in the most efficient manner - BEFORE the query is actually executed it of course can not find the nonexistent table. You might want to use a CREATE TABLE statement and then insert the data into it... Check CREATE TABLE in BOL and by the way read up on a true temporary table defined as #tablename .. or #tblCid - which must be created and then populated.

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply