Selecting Duplicate Rows

  • Hey guys.

    I'm trying to select all rows of data that have the same ID. Basically I have a table where the same unique person could've arrived at the park at different times. I'm looking at eventually displaying a flag to show that someone has re-entered.

    For now I'm simply having trouble displaying the records I want. I currently have ~650,000 to sort through, so my statement is very inefficient.

    Below I'll create some random data similar to what I'm using: -

    IF EXISTS (SELECT *

    FROM sys.tables t

    JOIN sys.schemas s

    ON t.schema_id = s.schema_id

    WHERE t.[Name] = 'Test_Data'

    AND s.[Name] = 'dbo')

    DROP TABLE dbo.test_data

    CREATE TABLE dbo.test_data (

    [ID] VARCHAR(128) NOT NULL,

    [address] VARCHAR(250) NOT NULL,

    [StartDate] DATETIME NOT NULL)

    ON [PRIMARY]

    DECLARE @maxRandomValue TINYINT,

    @minRandomValue TINYINT,

    @cnt INT,

    @string VARCHAR(256),

    @stringlen INT,

    @index INT,

    @fakeLength INT,

    @fakeaddress VARCHAR(8000)

    SET @maxRandomValue = 100

    SET @minRandomValue = 1

    /* CAREFUL! CURRENTLY CREATING 650,000 ROWS OF FAKE DATA */

    SET @cnt = 650000

    SELECT @string = 'abcdefghijklmnopqrstuvwxyz'

    SELECT @stringlen = Len(@string)

    WHILE @cnt > 0

    BEGIN

    SET @cnt = @cnt - 1

    SET @fakeLength = Cast(((@maxRandomValue + 20) - (@minRandomValue + 5)) * Rand() + (@minRandomValue + 5) AS TINYINT)

    SELECT @fakeaddress = ''

    WHILE (@fakeLength > 0)

    BEGIN

    SELECT @index = (Abs(Checksum(Newid()))%@stringlen) + 1

    SELECT @fakeaddress = @fakeaddress + Substring(@string,@index,1)

    SELECT @fakeLength = @fakeLength - 1

    END

    INSERT INTO dbo.test_data

    ([ID],

    [address],

    [StartDate])

    SELECT Cast(((@maxRandomValue + 1) - @minRandomValue) * Rand() + @minRandomValue AS TINYINT),

    Cast(Cast(((@maxRandomValue + 91) - @minRandomValue) * Rand() + @minRandomValue AS TINYINT) AS VARCHAR) + ' ' + @fakeaddress,

    Getdate() - ((18 * 365) + Rand() * (47 * 365))

    END

    GO

    /* The above just creates some random data. */

    Now, what I'm using to select what I need from this data worked fine on small ammounts of data but on large amounts such as the above table, it takes forever.

    SELECT *

    FROM [dbo].[test_data] AS a,

    [dbo].[test_data] AS b

    WHERE a.[ID] = b.[ID]

    AND a.[StartDate] <> b.[StartDate];

    Would love a pointer or two in the correct direction for this! 🙂


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • My apologies, I meant to give you some sample output.

    I'd want something like: -

    ------------------------------------------------------------------------------------

    ID - | -- Address - | - Start Date --------------- | ID - | -- Address - | - Start Date

    ------------------------------------------------------------------------------------

    14 - | 15 iojwepoh | - 1976-02-26 18:35:20.500- | 14 - | 15 iojwepoh -| 1977-02-26 18:35:20.500

    14 - | 15 iojwepoh | - 1977-02-26 18:35:20.500 -| 14 - | 15 iojwepoh-| 1979-02-26 18:35:20.500


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • I would consider creating an index - Something with ID, and startdate.

    Also, reworked the last select.

    SELECT *

    FROM [dbo].[test_data] AS a

    INNER JOIN [dbo].[test_data] AS b

    ON a.[ID] = b.[ID]

    WHERE a.[StartDate] <> b.[StartDate];

    -- Cory

  • I would add an index as Cory suggests.

    WITH DupCheck AS (

    SELECT ID, COUNT(*) [COUNT]

    FROM dbo.test_data

    GROUP BY ID

    HAVING COUNT(*) > 1

    )

    SELECT DupCheck.ID, Address, StartDate FROM DupCheck

    LEFT JOIN dbo.test_data

    ON DupCheck.ID = test_data.ID

    ORDER BY DupCheck.ID

    Converting oxygen into carbon dioxide, since 1955.
  • Thanks for that. I've added an ORDER BY StartDate so I can keep things orderly, then added a quick script to test the time taken for the script to run. Suffice to say, a massive improvement. 7 second execution time, in comparison to the previous several minutes.

    DECLARE @time DATETIME

    SET @time = Getdate();

    WITH [DupCheck]

    AS (SELECT [ID],

    Count(* ) [COUNT]

    FROM dbo.test_data

    GROUP BY [ID]

    HAVING Count(* ) > 1)

    SELECT [DupCheck].[ID],

    [Address],

    [StartDate]

    FROM [DupCheck]

    LEFT JOIN dbo.test_data

    ON [DupCheck].[ID] = [test_data].[ID]

    ORDER BY [DupCheck].[ID], [StartDate]

    SELECT Datediff(ms,@time,Getdate()) AS "time taken for grabbing data"


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Viewing 5 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply