Can I pull DISTINCT customers with SQL based on most recent row record only?

  • Jonathan AC Roberts - Tuesday, July 24, 2018 11:02 AM

    ScottPletcher - Tuesday, July 24, 2018 10:50 AM

    The imperfect index definition is a drag on the first query, requiring a sort of all rows.  Correcting the index definition:

    DROP INDEX IX_latestrecords_1 ON latestrecords;
    CREATE UNIQUE INDEX IX_latestrecords_1 ON latestrecords (Id, RecordTimestamp DESC) INCLUDE (Email, Customer_FirstName, Status);

    eliminates the sort, making the original queries much closer in CPU time/usage, although it does seem that the Segment operator takes some additional CPU.  Still, I'm happy to trade that for the drastically reduced I/O of the first query, since typically I/O is the critical performance issue, not CPU.  [For example, on my main prod instances, I see I/O latency on many assorted files, with CPU typically between 8% and 16% and rarely exceeding 25% (except for backups, which seem to push it to around 50%).]

    I don't think creating an index with that column DESC would help performance, at least not by much. The main reason you would want to put DESC on an index column is if rows are being inserted in that order; that way the index gets less fragmented. As the dates would be inserted in ascending order, I would still recommend having the index with RecordTimestamp ASC.
    What actually does improve performance is to leave off the INCLUDE columns, as it makes the index smaller.
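
    As a rough sketch, the variant described here (ascending key, no INCLUDE columns) would look something like the following; the table and index names are simply reused from the post above:

    DROP INDEX IX_latestrecords_1 ON latestrecords;
    CREATE UNIQUE INDEX IX_latestrecords_1 ON latestrecords (Id, RecordTimestamp);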

    Creating the index with a descending index column changed nothing, neither did dropping the included columns.  This is based on the code provided above.
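
    For reference, the "first query" being benchmarked is not repeated on this page of the thread; a minimal sketch of the ROW_NUMBER() pattern under discussion, assuming the column names from the index definition above and the original partitioning quoted later in the thread, would be:

    WITH ranked AS
    (
        SELECT Id, Email, Customer_FirstName, Status, RecordTimestamp,
               ROW_NUMBER() OVER (PARTITION BY Id, Email, Customer_FirstName
                                  ORDER BY RecordTimestamp DESC) AS rn
        FROM dbo.latestrecords
    )
    SELECT Id, Email, Customer_FirstName, Status, RecordTimestamp
    FROM ranked
    WHERE rn = 1;   -- keep only the most recent row in each partition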

  • Lynn Pettis - Tuesday, July 24, 2018 11:26 AM

    Creating the index with a descending index column changed nothing, neither did dropping the included columns.  This is based on the code provided above.

    Oh, right.  That's true if you leave the partitioning/windowing as originally specified:
    (partition by Id,Email,Customer_FirstName order by RecordTimestamp desc)

    I reduced the PARTITION BY to just Id, which I think is the correct business logic.  Why can't the customer change their Email?  Would I really want to see separate rows for the customer if they did?  I don't think so.  Ditto for Customer_FirstName (which seems to be some type of unique value here as well, but presumably could also change?).

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher - Tuesday, July 24, 2018 11:32 AM

    Oh, right.  That's true if you leave the partitioning/windowing as originally specified:
    (partition by Id,Email,Customer_FirstName order by RecordTimestamp desc)

    I reduced the PARTITION BY to just Id, which I think is the correct business logic.  Why can't the customer change their Email?  Would I really want to see separate rows for the customer if they did?  I don't think so.  Ditto for Customer_FirstName (which seems to be some type of unique value here as well, but presumably could also change?).

    Perhaps you should post your alternative solution and test harness.  Then we will see the differences.

  • Lynn Pettis - Tuesday, July 24, 2018 11:38 AM

    Perhaps you should post your alternative solution and test harness.  Then we will see the differences.

    Change the code to:
    (PARTITION BY Id ORDER BY RecordTimestamp DESC)

    Is there any rational business reason to include Email, etc., in the PARTITION BY?
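
    Applied to the ROW_NUMBER() sketch earlier in the thread, that change would look something like this (column names are still assumed, not taken from the original test harness):

    WITH ranked AS
    (
        SELECT Id, Email, Customer_FirstName, Status, RecordTimestamp,
               ROW_NUMBER() OVER (PARTITION BY Id ORDER BY RecordTimestamp DESC) AS rn
        FROM dbo.latestrecords
    )
    SELECT Id, Email, Customer_FirstName, Status, RecordTimestamp
    FROM ranked
    WHERE rn = 1;   -- exactly one row per Id: the most recent record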

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Lynn Pettis - Tuesday, July 24, 2018 11:38 AM

    Perhaps you should post your alternative solution and test harness.  Then we will see the differences.

    I think this is a great idea.  People like me who come here to learn would benefit if somebody provided a concrete solution with proof.
    We are getting many conversations on this particular question, and for a person with only average proficiency in SQL it is confusing to know whose idea to follow.

    Saravanan

  • ScottPletcher - Tuesday, July 24, 2018 11:44 AM

    Change the code to:
    (PARTITION BY Id ORDER BY RecordTimestamp DESC)

    Is there any rational business reason to include Email, etc., in the PARTITION BY?

    Speeds up the ROW_NUMBER code but the CTE is still about twice as fast.

  • Lynn Pettis - Tuesday, July 24, 2018 11:56 AM

    Speeds up the ROW_NUMBER code but the CTE is still about twice as fast.

    In clock time, yes.  But with far more I/Os.  So it's a trade off.  For me, I'll reduce the I/O and accept the extra CPU time, since I have more spare capacity of the latter than the former.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher - Tuesday, July 24, 2018 12:15 PM

    In clock time, yes.  But with far more I/Os.  So it's a trade off.  For me, I'll reduce the I/O and accept the extra CPU time, since I have more spare capacity of the latter than the former.

    The elapsed time is also something to consider.

  • saravanatn - Tuesday, July 24, 2018 11:50 AM

    I think this is a great idea.  People like me who come here to learn would benefit if somebody provided a concrete solution with proof.
    We are getting many conversations on this particular question, and for a person with only average proficiency in SQL it is confusing to know whose idea to follow.

    Excellent point. Those frustrate me as well.

    Sue

  • Jonathan AC Roberts - Tuesday, July 24, 2018 12:24 PM

    The elapsed time is also something to consider.

    Yes, it is.  That's why I called it a "trade off".  I think more realistically you'd have vastly fewer rows per id than 10K.  Let's try, say, 4.  Then the exec stats on my server are:

    ************* row_number *************
    Table 'latestrecords'. Scan count 5, logical reads 8253, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
     CPU time = 1843 ms, elapsed time = 945 ms.
    (250000 rows affected)

    ************* CTE *************
    Table 'latestrecords'. Scan count 250005, logical reads 808423, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
     CPU time = 2641 ms, elapsed time = 742 ms.
    (250000 rows affected)

    So you end up with both more overall CPU and more I/O just to save 0.2 seconds.  And even to get that you have to assume a fully covering index, which is very unlikely and which in itself adds significant overhead to every insert into the underlying table.

    So let's do a more realistic real-world test and remove the custom-made index [you could add a clustered index, but that's typically on datetime first for a logging table.  If you do it by id first -- to match this query -- you will have lots of fragmentation in the table itself, very bad for a logging table].
    DROP INDEX IX_latestrecords_1 ON dbo.latestrecords;

    Run the queries now (for the record, I also had "Include Actual Query Plan" on, but that should affect each query roughly equally), and you get:
    ************* row_number *************
    Table 'latestrecords'. Scan count 5, logical reads 7455, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (250000 rows affected)
    SQL Server Execution Times:
     CPU time = 3248 ms, elapsed time = 848 ms.
    ************* CTE *************
    Table 'latestrecords'. Scan count 6, logical reads 14910, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 250000, logical reads 3821000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (250000 rows affected)
    SQL Server Execution Times:
     CPU time = 12206 ms, elapsed time = 8871 ms.

    Whoa, ten times the elapsed time and tens of thousands of times more I/O.  Adding a nonclustered index on (Id, RecordTimestamp) really helps this query, but, again, I then have the added overhead of maintaining that fragmented index on every INSERT to the table.
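
    A minimal form of the nonclustered index described above would be something like this (the index name is illustrative):

    CREATE INDEX IX_latestrecords_Id_RecordTimestamp
        ON dbo.latestrecords (Id, RecordTimestamp);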

    ************* CTE *************
    Table 'latestrecords'. Scan count 250005, logical reads 1003297, physical reads 0, read-ahead reads 12, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    (250000 rows affected)
    SQL Server Execution Times:
     CPU time = 3614 ms, elapsed time = 996 ms.

    Not for me.  At all.  I'll stick with ROW_NUMBER() here.
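
    For anyone reproducing these figures, the output above is the standard SET STATISTICS report from SSMS; a wrapper along these lines (with the two query bodies elided) produces it:

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    PRINT '************* row_number *************';
    -- ROW_NUMBER() version of the query goes here

    PRINT '************* CTE *************';
    -- CTE / CROSS APPLY version of the query goes here

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;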

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • I don't have time to work on this right now, but I think that the answer will come down to density of the duplicates.  That is, it depends on the average number of duplicate records for each ID.  In other words, having 4000 unique IDs with an average of 10 duplicates is going to perform very differently from having 40 unique IDs having an average of 1000 duplicates even though the total number of records is the same.  The ROW_NUMBER method works better when the duplicates are sparse, and the CROSS APPLY method works better when the duplicates are dense.  And also there is the tradeoff between I/O and CPU. When I have time (yeah, right), I'll try to come up with a sample to test this.
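
    The "CTE"/CROSS APPLY alternative itself is not repeated on this page; a typical shape for that approach, offered here as a sketch rather than the exact code posted earlier in the thread, is:

    SELECT ca.Id, ca.Email, ca.Customer_FirstName, ca.Status, ca.RecordTimestamp
    FROM (SELECT DISTINCT Id FROM dbo.latestrecords) AS d
    CROSS APPLY (SELECT TOP (1) lr.Id, lr.Email, lr.Customer_FirstName, lr.Status, lr.RecordTimestamp
                 FROM dbo.latestrecords AS lr
                 WHERE lr.Id = d.Id
                 ORDER BY lr.RecordTimestamp DESC) AS ca;
    -- One seek per distinct Id: works well when duplicates per Id are dense,
    -- but the seek count (and logical reads) grows with the number of distinct Ids.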

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen - Tuesday, July 24, 2018 12:48 PM

    In the real world, I don't think it does, for the reason shown in my previous post:
    the "CTE method" only works well with a fully covering index,
    and that causes enough performance issues of its own that it's just not worth it.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher - Tuesday, July 24, 2018 12:42 PM

    Whoa, ten times the elapsed time and tens of thousands of times more I/O.

    Not for me.  I'll stick with ROW_NUMBER() here.

    Well, to get the actual distribution of the data (how many rows per Id) I think we'd have to ask the questioner, who could run a query. As far as I know there could be one row per Id with the odd duplicate, or there could be 1,000; I don't know that. I also don't know the total row count of the table. I did say in my original posting that an index would be needed for the query I submitted to perform well. The other thing to take into account is that the I/O you are pointing at is logical, not physical: the rows are already in the cache from the DISTINCT query, so it won't affect physical I/O, just add a few cycles to the CPU.
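
    If the original poster does run such a query, a simple sketch (table and column names assumed from earlier in the thread) would be:

    -- Overall density of duplicates
    SELECT COUNT(*)                            AS total_rows,
           COUNT(DISTINCT Id)                  AS distinct_ids,
           COUNT(*) * 1.0 / COUNT(DISTINCT Id) AS avg_rows_per_id
    FROM dbo.latestrecords;

    -- Full distribution, worst offenders first
    SELECT Id, COUNT(*) AS rows_per_id
    FROM dbo.latestrecords
    GROUP BY Id
    ORDER BY rows_per_id DESC;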

  • drew.allen - Tuesday, July 24, 2018 12:48 PM

    Yes, it's exactly that.
    And here's a link to some sample data to test the code: https://www.sqlservercentral.com/Forums/FindPost1978300.aspx

  • ScottPletcher - Tuesday, July 24, 2018 12:42 PM

    Not for me.  At all.  I'll stick with ROW_NUMBER() here.

    Again, post your code and test harness.  That way I have the same code you do to test with, including any minor changes you made but didn't mention.
