Selecting all records that occur more than once

Question

Selecting all records that occur more than once

petervdkerk

SSCrazy

Points: 2155
More actions
September 25, 2011 at 10:29 am

#245406

I want to select the id and name of all companies in [prospects] table that have the same name and occur more than once.
So if Microsoft and Google occur more than once I want as a result:
[prospects]
id companyname
2 Microsoft
5 Microsoft
7 Microsoft
10 Google
14 Google
I now have the following sql, but that only returns the companyname once:
SELECT id,companyname,COUNT(companyname) AS NumOccurrences FROM prospects P
GROUP BY companyname,id
HAVING ( COUNT(companyname) > 1 )

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1003927 More actions · Answer 1

First of all, Joe is absolutely correct... you'll get better answers more quickly if you invest just a minute or two in how you post your data. Please see the first link in my signature line at the bottom of this post for how to best do that. It's also the polite thing to do so people providing answers can actually test their code to make sure it does what you want it to. 😉

Joe is also correct about the "ID" column naming. It's pretty much non-descriptive and can become terribly confusing when you're joining multiple tables.

That, notwithstanding, I left your column names as you provided them in the following test code setup. The only thing I changed was I used a Temp Table to test with so we don't accidently drop a real table in the process...

Here's how we'd like to see code posted in the future... It instantly answers nearly any question that anyone could have about the data and table(s) involved with your request.

--========================================================

-- Create the test data. Nothing in this section has

-- anything to do with the solution to the problem.

-- We're just creating test data here.

-- THIS IS HOW YOU SHOULD POST DATA IN THE FUTURE!!!!

--========================================================

--===== Conditionally drop the test table to make reruns

-- easier in SSMS.

IF OBJECT_ID('tempdb..#Prospects','U') IS NOT NULL

DROP TABLE #Prospects

;

--===== Create the table.

CREATE TABLE #Prospects

(

ID INT NOT NULL,

CompanyName VARCHAR(20) NOT NULL

)

;

--===== Populate the table with test data for

-- both positive AND negative results.

INSERT INTO #Prospects

(ID, CompanyName)

SELECT 2,'Microsoft' UNION ALL

SELECT 5,'Microsoft' UNION ALL

SELECT 7,'Microsoft' UNION ALL

SELECT 10,'Google' UNION ALL

SELECT 14,'Google' UNION ALL

SELECT 22,'SomeCompany1' UNION ALL

SELECT 24,'SomeCompany2'

;

--===== Build the PK and indexes

ALTER TABLE #Prospects

ADD PRIMARY KEY (ID)

;

-- This is a nice little covering index

CREATE INDEX IX_#Prospect_CompanyName

ON #Prospects (CompanyName, ID)

;

I tried using a couple of the classic methods for solving this problem and threw Joe's code (after changing his code to your column names) into the mix, as well. Here's that test code...

PRINT '--===== Solve the problem using just one of the half dozen ways to do this. ===';

SET STATISTICS IO, TIME ON;

SELECT ID, CompanyName

FROM #Prospects

WHERE CompanyName IN

( --=== Find companies with multiple occurrences

SELECT CompanyName

FROM #Prospects p2

GROUP BY CompanyName

HAVING COUNT(*) > 1

)

ORDER BY CompanyName, ID;

SET STATISTICS IO, TIME OFF;

------------------------------------------------------------------------------------------

PRINT '--===== Solve the problem using another of the half dozen ways to do this. ===';

SET STATISTICS IO, TIME ON;

SELECT ID, CompanyName

FROM #Prospects p1

WHERE EXISTS

( --=== Find companies with multiple occurrences

SELECT 1

FROM #Prospects p2

WHERE p1.CompanyName = p2.CompanyName

GROUP BY CompanyName

HAVING COUNT(ID) > 1

)

ORDER BY CompanyName, ID;

SET STATISTICS IO, TIME OFF;

------------------------------------------------------------------------------------------

PRINT '--===== Joe''s method =========================================================';

SET STATISTICS IO, TIME ON;

SELECT x.ID, x.CompanyName

FROM (

SELECT ID,

CompanyName,

COUNT(*) OVER (PARTITION BY CompanyName) AS CompanyOccurrences

FROM #Prospects

) AS x

WHERE CompanyOccurrences > 1;

SET STATISTICS IO, TIME OFF;

Here's the output from the messages tab. See the "Worktable" entry and all of the READs associated with it? 31 page reads to cover only 7 rows of data? That smacks of a "Table Spool" and what I call "Hidden RBAR". "Hidden RBAR" is a very bad thing because it will crush performance if a larger number of rows is involved. While it doesn't look bad below, it can get a lot worse in a hurry.

--===== Solve the problem using just one of the half dozen ways to do this. ===

(5 row(s) affected)

Table '#Prospects__________________________________________________________________________________________________________00000000003D'.

Scan count 3, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 1 ms.

--===== Solve the problem using another of the half dozen ways to do this. ===

(5 row(s) affected)

Table '#Prospects__________________________________________________________________________________________________________00000000003D'.

Scan count 3, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 0 ms.

--===== Joe's method =========================================================

(5 row(s) affected)

Table 'Worktable'. Scan count 3, logical reads 31, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#Prospects__________________________________________________________________________________________________________00000000003D'.

Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 1 ms.

Ok... let's see how bad the "Hidden RBAR" can get. I know you probably don't have a million rows of data but it'll show just how bad the "Hidden RBAR" in the code actually is. Here's the code to build a million row test table for the problem at hand...

--========================================================

-- Create the test data. Nothing in this section has

-- anything to do with the solution to the problem.

-- We're just creating test data here.

--========================================================

--===== Conditionally drop the test table to make reruns

-- easier in SSMS.

IF OBJECT_ID('tempdb..#Prospects','U') IS NOT NULL

DROP TABLE #Prospects

;

--===== Create and build a large test table on the fly

SELECT TOP 1000000

ID = IDENTITY(INT,1,1), --Creates NOT NULL column

CompanyName = ISNULL(LEFT(NEWID(),7),'') --Creates NOT NULL column

INTO #Prospects

FROM sys.all_columns ac1

CROSS JOIN sys.all_columns ac2

;

--===== Build the PK and indexes

ALTER TABLE #Prospects

ADD PRIMARY KEY (ID)

;

CREATE INDEX IX_#Prospect_CompanyName

ON #Prospects (CompanyName, ID)

;

When we run the test code against THAT test data, we find out that the "Hidden RBAR" method is actually more than 10 times slower than the two classic methods and is thousands of times more aggressive for reads than the two classic methods. Here's the output that shows that.

--===== Solve the problem using just one of the half dozen ways to do this. ===

(3759 row(s) affected)

Table '#Prospects__________________________________________________________________________________________________________00000000003C'.

Scan count 2, logical reads 4474, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 2062 ms, elapsed time = 2404 ms.

--===== Solve the problem using another of the half dozen ways to do this. ===

(3759 row(s) affected)

Table '#Prospects__________________________________________________________________________________________________________00000000003C'.

Scan count 2, logical reads 4474, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 2016 ms, elapsed time = 2319 ms.

--===== Joe's method =========================================================

(3759 row(s) affected)

Table 'Worktable'. Scan count 3, logical reads 5992481, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#Prospects__________________________________________________________________________________________________________00000000003C'.

Scan count 1, logical reads 2237, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 26203 ms, elapsed time = 27219 ms.

Please let us know if you have any additional questions on this problem or any of the solutions provided so far.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1003927 More actions · Answer 2

CELKO (9/25/2011)
>> I want to select the prospect_nbr and name of all companies in [prospects] table that have the same name and occur more than once. <<
Please be polite enough to post DDL for us. And some sample data is nice, too. Since there is no such thing as a magical, universal, generic “id”, can I assume that the prospects are on a call list of some kinds with a ticket number?

That's actually a pretty nice post, Joe. Thanks.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)