RE: which is the easy way to Eliminate the duplicates?

SSC-Forever

Points: 41690

October 17, 2012 at 1:47 am

dwain.c (10/16/2012)
I recommend that you take heed of Jeff's question above. An improper JOIN often generates tons of duplicates and developers can be too lazy to understand why and eliminate them the proper way.
To return the id number with the code, you can do this:
;WITH Results AS (
SELECT id, code
,rn=ROW_NUMBER() OVER (PARTITION BY code ORDER BY id)
FROM @T)
SELECT id, code
FROM Results
WHERE rn=1

Why bother with the row_number?

SELECT MIN(id) AS id, code

FROM @T

GROUP BY code;

Performance test: -

IF object_id('tempdb..#T') IS NOT NULL

BEGIN

DROP TABLE #T;

END;

SELECT TOP 1000000 IDENTITY(INT,1,1) AS id,

CAST((ABS(CHECKSUM(NEWID())) % 100) + 1 AS VARCHAR(3)) AS code

INTO #T

FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3;

DECLARE @HOLDER INT, @HOLDER2 VARCHAR(3);

PRINT 'ROWNUMBER';

SET STATISTICS IO, TIME ON;

;WITH Results AS (

SELECT id, code

,rn=ROW_NUMBER() OVER (PARTITION BY code ORDER BY id)

FROM #T)

SELECT @HOLDER = id, @HOLDER2 = code

FROM Results

WHERE rn=1;

SET STATISTICS IO, TIME OFF;

PRINT 'MIN';

SET STATISTICS IO, TIME ON;

SELECT @HOLDER = MIN(id), @HOLDER2 = code

FROM #T

GROUP BY code;

SET STATISTICS IO, TIME OFF;

Results: -

ROWNUMBER

Table '#T'. Scan count 5, logical reads 2341, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 4797 ms, elapsed time = 2547 ms.

MIN

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#T'. Scan count 5, logical reads 2341, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 626 ms, elapsed time = 164 ms.

Again, but with indexes: -

IF object_id('tempdb..#T') IS NOT NULL

BEGIN

DROP TABLE #T;

END;

SELECT TOP 1000000 IDENTITY(INT,1,1) AS id,

CAST((ABS(CHECKSUM(NEWID())) % 100) + 1 AS VARCHAR(3)) AS code

INTO #T

FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3;

CREATE CLUSTERED INDEX cl_t_id ON #T(id);

CREATE NONCLUSTERED INDEX nc_t_code ON #T(code);

DECLARE @HOLDER INT, @HOLDER2 VARCHAR(3);

PRINT 'ROWNUMBER';

SET STATISTICS IO, TIME ON;

;WITH Results AS (

SELECT id, code

,rn=ROW_NUMBER() OVER (PARTITION BY code ORDER BY id)

FROM #T)

SELECT @HOLDER = id, @HOLDER2 = code

FROM Results

WHERE rn=1;

SET STATISTICS IO, TIME OFF;

PRINT 'MIN';

SET STATISTICS IO, TIME ON;

SELECT @HOLDER = MIN(id), @HOLDER2 = code

FROM #T

GROUP BY code;

SET STATISTICS IO, TIME OFF;

Results: -

ROWNUMBER

Table '#T'. Scan count 1, logical reads 1979, physical reads 0, read-ahead reads 15, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 422 ms, elapsed time = 409 ms.

MIN

Table '#T'. Scan count 1, logical reads 1979, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 391 ms, elapsed time = 402 ms.

So roughly equivalent once indexes are in place, but wildly different before they are.

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/