RE: Can I pull DISTINCT customers with SQL based on most recent row record only?

SSCoach

Points: 18409

July 24, 2018 at 9:00 am

#1998805

drew.allen - Tuesday, July 24, 2018 8:48 AM
You only tested on four rows (hence the scan count 4). As the number of rows increases, the cost of the CROSS APPLY will increase roughly at O(n log n), whereas the cost of the ROW_NUMBER() method will increase at roughly O(n). So your method may well "win" at small row counts, it will lose big time on row counts typical of a normal database, in the same way that a human can win a race against a horse if the distance is short enough, but will lose when the distance is longer.
Drew

I think my method will win when there are a lot of rows with the same Id, it doesn't matter how many distinct Ids there are. When there are less than about 4 rows per Id then the row_count method might win. My query only does one table scan to get the distinct Ids. The query it does to get the TOP(1) uses a seek (so is fast). The only way to tell is to test it on different data.

Here's a script where you can change the total number of rows and the number of rows per Id. (you'll need a tally table with at least 1 million rows on http://www.sqlservercentral.com/articles/T-SQL/62867/)
set statistics io,time off

DROP TABLE latestrecords ; GO CREATE TABLE latestrecords ( Id int, Email varchar(200), Customer_FirstName varchar(200), Status varchar(200), RecordTimestamp DATETIME ) GO DECLARE @RowsToInsert as int = 1000000 DECLARE @RowsPerId as int = 100 DECLARE @DistinctIds as int =@RowsToInsert /@RowsPerId INSERT INTO latestrecords SELECT t.N%@DistinctIds ,CONCAT('test', t.N%@DistinctIds,'@test.com'),CONCAT('Name',t.N%@DistinctIds),'New',DATEADD(minute,t.n,'July 15, 2018 10:08 PM') from tally t where t.N <= @RowsToInsert CREATE INDEX IX_latestrecords_1 ON latestrecords(Id,RecordTimestamp) INCLUDE (Email,Customer_FirstName,Status) ;

SELECT COUNT(*) Count FROM latestrecords SELECT COUNT(distinct id) [distinct id] FROM latestrecords

IF OBJECT_ID('tempdb..#t1') IS NOT NULL DROP TABLE #t1 IF OBJECT_ID('tempdb..#t2') IS NOT NULL DROP TABLE #t2

set statistics io,time on PRINT '************* row_number *************' select Id,Email,Customer_FirstName,Status,RecordTimestamp into #t1 from ( select *, row_number()over(partition by Id,Email,Customer_FirstName order by RecordTimestamp desc) as rnk from latestrecords )ranked_records where rnk=1 PRINT '************* CTE *************' ;WITH CTE AS (select distinct Id from latestrecords) SELECT T.* into #t2 FROM CTE CROSS APPLY(SELECT TOP(1) * FROM latestrecords X WHERE X.Id =CTE.Id ORDER BY X.RecordTimestamp desc) as T

set statistics io,time off
With the values I've put in 1 million rows total with 10,000 different Ids I get the following results on my machine:

************* row_number *************

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 511 ms, elapsed time = 511 ms.
Table 'latestrecords'. Scan count 5, logical reads 7017, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 1607 ms, elapsed time = 750 ms.

(10000 rows affected)
************* CTE *************

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 1 ms.
Table 'latestrecords'. Scan count 10005, logical reads 39809, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 187 ms, elapsed time = 35 ms.

(10000 rows affected)