Home Forums SQL Server 2008 T-SQL (SS2K8) which is the easy way to Eliminate the duplicates? RE: which is the easy way to Eliminate the duplicates?

  • Cadavre (10/17/2012)


    dwain.c (10/16/2012)


    I recommend that you take heed of Jeff's question above. An improper JOIN often generates tons of duplicates and developers can be too lazy to understand why and eliminate them the proper way.

    To return the id number with the code, you can do this:

    ;WITH Results AS (

    SELECT id, code

    ,rn=ROW_NUMBER() OVER (PARTITION BY code ORDER BY id)

    FROM @T)

    SELECT id, code

    FROM Results

    WHERE rn=1

    Why bother with the row_number?

    SELECT MIN(id) AS id, code

    FROM @T

    GROUP BY code;

    Performance test: -

    IF object_id('tempdb..#T') IS NOT NULL

    BEGIN

    DROP TABLE #T;

    END;

    SELECT TOP 1000000 IDENTITY(INT,1,1) AS id,

    CAST((ABS(CHECKSUM(NEWID())) % 100) + 1 AS VARCHAR(3)) AS code

    INTO #T

    FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3;

    DECLARE @HOLDER INT, @HOLDER2 VARCHAR(3);

    PRINT 'ROWNUMBER';

    SET STATISTICS IO, TIME ON;

    ;WITH Results AS (

    SELECT id, code

    ,rn=ROW_NUMBER() OVER (PARTITION BY code ORDER BY id)

    FROM #T)

    SELECT @HOLDER = id, @HOLDER2 = code

    FROM Results

    WHERE rn=1;

    SET STATISTICS IO, TIME OFF;

    PRINT 'MIN';

    SET STATISTICS IO, TIME ON;

    SELECT @HOLDER = MIN(id), @HOLDER2 = code

    FROM #T

    GROUP BY code;

    SET STATISTICS IO, TIME OFF;

    Results: -

    ROWNUMBER

    Table '#T'. Scan count 5, logical reads 2341, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 4797 ms, elapsed time = 2547 ms.

    MIN

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#T'. Scan count 5, logical reads 2341, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 626 ms, elapsed time = 164 ms.

    Again, but with indexes: -

    IF object_id('tempdb..#T') IS NOT NULL

    BEGIN

    DROP TABLE #T;

    END;

    SELECT TOP 1000000 IDENTITY(INT,1,1) AS id,

    CAST((ABS(CHECKSUM(NEWID())) % 100) + 1 AS VARCHAR(3)) AS code

    INTO #T

    FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3;

    CREATE CLUSTERED INDEX cl_t_id ON #T(id);

    CREATE NONCLUSTERED INDEX nc_t_code ON #T(code);

    DECLARE @HOLDER INT, @HOLDER2 VARCHAR(3);

    PRINT 'ROWNUMBER';

    SET STATISTICS IO, TIME ON;

    ;WITH Results AS (

    SELECT id, code

    ,rn=ROW_NUMBER() OVER (PARTITION BY code ORDER BY id)

    FROM #T)

    SELECT @HOLDER = id, @HOLDER2 = code

    FROM Results

    WHERE rn=1;

    SET STATISTICS IO, TIME OFF;

    PRINT 'MIN';

    SET STATISTICS IO, TIME ON;

    SELECT @HOLDER = MIN(id), @HOLDER2 = code

    FROM #T

    GROUP BY code;

    SET STATISTICS IO, TIME OFF;

    Results: -

    ROWNUMBER

    Table '#T'. Scan count 1, logical reads 1979, physical reads 0, read-ahead reads 15, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 422 ms, elapsed time = 409 ms.

    MIN

    Table '#T'. Scan count 1, logical reads 1979, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 391 ms, elapsed time = 402 ms.

    So roughly equivalent once indexes are in place, but wildly different before they are.

    Nice effort Cadavre! Interesting results too (with the indexing). The former doesn't surprise me at all - wish I'd thought of it.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St