Selecting all records that occur more than once

  • I want to select the id and name of all companies in [prospects] table that have the same name and occur more than once.

    So if Microsoft and Google occur more than once I want as a result:

    [prospects]

    id companyname

    2 Microsoft

    5 Microsoft

    7 Microsoft

    10 Google

    14 Google

    I now have the following sql, but that only returns the companyname once:

    SELECT id,companyname,COUNT(companyname) AS NumOccurrences FROM prospects P

    GROUP BY companyname,id

    HAVING ( COUNT(companyname) > 1 )

  • @petervdkerk,

    First of all, Joe is absolutely correct... you'll get better answers more quickly if you invest just a minute or two in how you post your data. Please see the first link in my signature line at the bottom of this post for how to best do that. It's also the polite thing to do so people providing answers can actually test their code to make sure it does what you want it to. 😉

    Joe is also correct about the "ID" column naming. It's pretty much non-descriptive and can become terribly confusing when you're joining multiple tables.

    That, notwithstanding, I left your column names as you provided them in the following test code setup. The only thing I changed was I used a Temp Table to test with so we don't accidently drop a real table in the process...

    Here's how we'd like to see code posted in the future... It instantly answers nearly any question that anyone could have about the data and table(s) involved with your request.

    --========================================================

    -- Create the test data. Nothing in this section has

    -- anything to do with the solution to the problem.

    -- We're just creating test data here.

    -- THIS IS HOW YOU SHOULD POST DATA IN THE FUTURE!!!!

    --========================================================

    --===== Conditionally drop the test table to make reruns

    -- easier in SSMS.

    IF OBJECT_ID('tempdb..#Prospects','U') IS NOT NULL

    DROP TABLE #Prospects

    ;

    --===== Create the table.

    CREATE TABLE #Prospects

    (

    ID INT NOT NULL,

    CompanyName VARCHAR(20) NOT NULL

    )

    ;

    --===== Populate the table with test data for

    -- both positive AND negative results.

    INSERT INTO #Prospects

    (ID, CompanyName)

    SELECT 2,'Microsoft' UNION ALL

    SELECT 5,'Microsoft' UNION ALL

    SELECT 7,'Microsoft' UNION ALL

    SELECT 10,'Google' UNION ALL

    SELECT 14,'Google' UNION ALL

    SELECT 22,'SomeCompany1' UNION ALL

    SELECT 24,'SomeCompany2'

    ;

    --===== Build the PK and indexes

    ALTER TABLE #Prospects

    ADD PRIMARY KEY (ID)

    ;

    -- This is a nice little covering index

    CREATE INDEX IX_#Prospect_CompanyName

    ON #Prospects (CompanyName, ID)

    ;

    I tried using a couple of the classic methods for solving this problem and threw Joe's code (after changing his code to your column names) into the mix, as well. Here's that test code...

    PRINT '--===== Solve the problem using just one of the half dozen ways to do this. ===';

    SET STATISTICS IO, TIME ON;

    SELECT ID, CompanyName

    FROM #Prospects

    WHERE CompanyName IN

    ( --=== Find companies with multiple occurrences

    SELECT CompanyName

    FROM #Prospects p2

    GROUP BY CompanyName

    HAVING COUNT(*) > 1

    )

    ORDER BY CompanyName, ID;

    SET STATISTICS IO, TIME OFF;

    ------------------------------------------------------------------------------------------

    PRINT '--===== Solve the problem using another of the half dozen ways to do this. ===';

    SET STATISTICS IO, TIME ON;

    SELECT ID, CompanyName

    FROM #Prospects p1

    WHERE EXISTS

    ( --=== Find companies with multiple occurrences

    SELECT 1

    FROM #Prospects p2

    WHERE p1.CompanyName = p2.CompanyName

    GROUP BY CompanyName

    HAVING COUNT(ID) > 1

    )

    ORDER BY CompanyName, ID;

    SET STATISTICS IO, TIME OFF;

    ------------------------------------------------------------------------------------------

    PRINT '--===== Joe''s method =========================================================';

    SET STATISTICS IO, TIME ON;

    SELECT x.ID, x.CompanyName

    FROM (

    SELECT ID,

    CompanyName,

    COUNT(*) OVER (PARTITION BY CompanyName) AS CompanyOccurrences

    FROM #Prospects

    ) AS x

    WHERE CompanyOccurrences > 1;

    SET STATISTICS IO, TIME OFF;

    Here's the output from the messages tab. See the "Worktable" entry and all of the READs associated with it? 31 page reads to cover only 7 rows of data? That smacks of a "Table Spool" and what I call "Hidden RBAR". "Hidden RBAR" is a very bad thing because it will crush performance if a larger number of rows is involved. While it doesn't look bad below, it can get a lot worse in a hurry.

    --===== Solve the problem using just one of the half dozen ways to do this. ===

    (5 row(s) affected)

    Table '#Prospects__________________________________________________________________________________________________________00000000003D'.

    Scan count 3, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    --===== Solve the problem using another of the half dozen ways to do this. ===

    (5 row(s) affected)

    Table '#Prospects__________________________________________________________________________________________________________00000000003D'.

    Scan count 3, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    --===== Joe's method =========================================================

    (5 row(s) affected)

    Table 'Worktable'. Scan count 3, logical reads 31, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#Prospects__________________________________________________________________________________________________________00000000003D'.

    Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    Ok... let's see how bad the "Hidden RBAR" can get. I know you probably don't have a million rows of data but it'll show just how bad the "Hidden RBAR" in the code actually is. Here's the code to build a million row test table for the problem at hand...

    --========================================================

    -- Create the test data. Nothing in this section has

    -- anything to do with the solution to the problem.

    -- We're just creating test data here.

    --========================================================

    --===== Conditionally drop the test table to make reruns

    -- easier in SSMS.

    IF OBJECT_ID('tempdb..#Prospects','U') IS NOT NULL

    DROP TABLE #Prospects

    ;

    --===== Create and build a large test table on the fly

    SELECT TOP 1000000

    ID = IDENTITY(INT,1,1), --Creates NOT NULL column

    CompanyName = ISNULL(LEFT(NEWID(),7),'') --Creates NOT NULL column

    INTO #Prospects

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    ;

    --===== Build the PK and indexes

    ALTER TABLE #Prospects

    ADD PRIMARY KEY (ID)

    ;

    CREATE INDEX IX_#Prospect_CompanyName

    ON #Prospects (CompanyName, ID)

    ;

    When we run the test code against THAT test data, we find out that the "Hidden RBAR" method is actually more than 10 times slower than the two classic methods and is thousands of times more aggressive for reads than the two classic methods. Here's the output that shows that.

    --===== Solve the problem using just one of the half dozen ways to do this. ===

    (3759 row(s) affected)

    Table '#Prospects__________________________________________________________________________________________________________00000000003C'.

    Scan count 2, logical reads 4474, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 2062 ms, elapsed time = 2404 ms.

    --===== Solve the problem using another of the half dozen ways to do this. ===

    (3759 row(s) affected)

    Table '#Prospects__________________________________________________________________________________________________________00000000003C'.

    Scan count 2, logical reads 4474, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 2016 ms, elapsed time = 2319 ms.

    --===== Joe's method =========================================================

    (3759 row(s) affected)

    Table 'Worktable'. Scan count 3, logical reads 5992481, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#Prospects__________________________________________________________________________________________________________00000000003C'.

    Scan count 1, logical reads 2237, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 26203 ms, elapsed time = 27219 ms.

    Please let us know if you have any additional questions on this problem or any of the solutions provided so far.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • CELKO (9/25/2011)


    >> I want to select the prospect_nbr and name of all companies in [prospects] table that have the same name and occur more than once. <<

    Please be polite enough to post DDL for us. And some sample data is nice, too. Since there is no such thing as a magical, universal, generic “id”, can I assume that the prospects are on a call list of some kinds with a ticket number?

    That's actually a pretty nice post, Joe. Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply