Selecting Duplicate Rows

Question

Post reply

Selecting Duplicate Rows

Cadavre

SSC-Forever

Points: 41690
More actions
March 16, 2010 at 9:55 am

#219995

Hey guys.
I'm trying to select all rows of data that have the same ID. Basically I have a table where the same unique person could've arrived at the park at different times. I'm looking at eventually displaying a flag to show that someone has re-entered.
For now I'm simply having trouble displaying the records I want. I currently have ~650,000 to sort through, so my statement is very inefficient.
Below I'll create some random data similar to what I'm using: -
IF EXISTS (SELECT *
FROM sys.tables t
JOIN sys.schemas s
ON t.schema_id = s.schema_id
WHERE t.[Name] = 'Test_Data'
AND s.[Name] = 'dbo')
DROP TABLE dbo.test_data
CREATE TABLE dbo.test_data (
[ID] VARCHAR(128) NOT NULL,
[address] VARCHAR(250) NOT NULL,
[StartDate] DATETIME NOT NULL)
ON [PRIMARY]
DECLARE @maxRandomValue TINYINT,
@minRandomValue TINYINT,
@cnt INT,
@string VARCHAR(256),
@stringlen INT,
@index INT,
@fakeLength INT,
@fakeaddress VARCHAR(8000)
SET @maxRandomValue = 100
SET @minRandomValue = 1
/* CAREFUL! CURRENTLY CREATING 650,000 ROWS OF FAKE DATA */
SET @cnt = 650000
SELECT @string = 'abcdefghijklmnopqrstuvwxyz'
SELECT @stringlen = Len(@string)
WHILE @cnt > 0
BEGIN
SET @cnt = @cnt - 1
SET @fakeLength = Cast(((@maxRandomValue + 20) - (@minRandomValue + 5)) * Rand() + (@minRandomValue + 5) AS TINYINT)
SELECT @fakeaddress = ''
WHILE (@fakeLength > 0)
BEGIN
SELECT @index = (Abs(Checksum(Newid()))%@stringlen) + 1
SELECT @fakeaddress = @fakeaddress + Substring(@string,@index,1)
SELECT @fakeLength = @fakeLength - 1
END
INSERT INTO dbo.test_data
([ID],
[address],
[StartDate])
SELECT Cast(((@maxRandomValue + 1) - @minRandomValue) * Rand() + @minRandomValue AS TINYINT),
Cast(Cast(((@maxRandomValue + 91) - @minRandomValue) * Rand() + @minRandomValue AS TINYINT) AS VARCHAR) + ' ' + @fakeaddress,
Getdate() - ((18 * 365) + Rand() * (47 * 365))
END
GO
/* The above just creates some random data. */
Now, what I'm using to select what I need from this data worked fine on small ammounts of data but on large amounts such as the above table, it takes forever.
SELECT *
FROM [dbo].[test_data] AS a,
[dbo].[test_data] AS b
WHERE a.[ID] = b.[ID]
AND a.[StartDate] <> b.[StartDate];
Would love a pointer or two in the correct direction for this!
Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

Cadavre SSC-Forever Points: 41690 More actions · Answer 1

My apologies, I meant to give you some sample output.

I'd want something like: -

------------------------------------------------------------------------------------

------------------------------------------------------------------------------------

14 - | 15 iojwepoh | - 1976-02-26 18:35:20.500- | 14 - | 15 iojwepoh -| 1977-02-26 18:35:20.500

14 - | 15 iojwepoh | - 1977-02-26 18:35:20.500 -| 14 - | 15 iojwepoh-| 1979-02-26 18:35:20.500

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Cory E. SSCertifiable Points: 7076 More actions · Answer 2

I would consider creating an index - Something with ID, and startdate.

Also, reworked the last select.

SELECT *

FROM [dbo].[test_data] AS a

INNER JOIN [dbo].[test_data] AS b

ON a.[ID] = b.[ID]

WHERE a.[StartDate] <> b.[StartDate];

-- Cory

Steve Cullen SSCertifiable Points: 5598 More actions · Answer 3

I would add an index as Cory suggests.

WITH DupCheck AS (

SELECT ID, COUNT(*) [COUNT]

FROM dbo.test_data

GROUP BY ID

HAVING COUNT(*) > 1

)

SELECT DupCheck.ID, Address, StartDate FROM DupCheck

LEFT JOIN dbo.test_data

ON DupCheck.ID = test_data.ID

ORDER BY DupCheck.ID

Converting oxygen into carbon dioxide, since 1955.

Cadavre SSC-Forever Points: 41690 More actions · Answer 4

Thanks for that. I've added an ORDER BY StartDate so I can keep things orderly, then added a quick script to test the time taken for the script to run. Suffice to say, a massive improvement. 7 second execution time, in comparison to the previous several minutes.

DECLARE @time DATETIME

SET @time = Getdate();

WITH [DupCheck]

AS (SELECT [ID],

Count(* ) [COUNT]

FROM dbo.test_data

GROUP BY [ID]

HAVING Count(* ) > 1)

SELECT [DupCheck].[ID],

[Address],

[StartDate]

FROM [DupCheck]

LEFT JOIN dbo.test_data

ON [DupCheck].[ID] = [test_data].[ID]

ORDER BY [DupCheck].[ID], [StartDate]

SELECT Datediff(ms,@time,Getdate()) AS "time taken for grabbing data"

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Selecting Duplicate Rows

Cookies on SQLServerCentral