Which will execute first DISTINCT or TOP, if used in one query.

  • Hi All,

    I am using DISTINCT and TOP in same query as :

    SELECT DISTINCT TOP 100 ValidationDate FROM ValidatorDetails.

    ValidatorDetails table contains 2000 records.

    While looking at the execution plan, I found that DISTINCT is executed first but it returned only 100 records to the TOP execution block and then it show that TOP is executed, which returns 100 records to SELECT.

    Kindly help, Thanks in advance.

    Regards

    Varun

  • What would you like help with? The DISTINCT has to be processed first as that's what you're selecting the top 100 of.

  • varun.dubey29 (4/13/2011)


    Hi All,

    I am using DISTINCT and TOP in same query as :

    SELECT DISTINCT TOP 100 ValidationDate FROM ValidatorDetails.

    ValidatorDetails table contains 2000 records.

    While looking at the execution plan, I found that DISTINCT is executed first but it returned only 100 records to the TOP execution block and then it show that TOP is executed, which returns 100 records to SELECT.

    Kindly help, Thanks in advance.

    Regards

    Varun

    Help with what?

    There's an alternative if you prefer:

    SELECT TOP 100 ValidationDate

    FROM ValidatorDetails

    GROUP BY ValidationDate

    ORDER BY ValidationDate


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • That is what the expected behavior I am expecting but in the execution plan it show that Distinct block returns fewer records than the one in the table(it is returning the number of records that is mentioned with TOP) to the TOP Block.

    I have recieved different record set from this query.

    That occured only once, but in production.

  • varun.dubey29 (4/13/2011)


    That is what the expected behavior I am expecting but in the execution plan it show that Distinct block returns fewer records than the one in the table(it is returning the number of records that is mentioned with TOP) to the TOP Block.

    I have recieved different record set from this query.

    That occured only once, but in production.

    So if your table had 20 million rows (and a suitable index on ValidationDate), you would expect it to return the distinct values from within all 20 million before selecting the top 100 from them?

    The execution plan has to show TOP and DISTINCT as two separate operations but that doesn't mean they are performed separately.


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • ChrisM@home (4/13/2011)


    varun.dubey29 (4/13/2011)


    That is what the expected behavior I am expecting but in the execution plan it show that Distinct block returns fewer records than the one in the table(it is returning the number of records that is mentioned with TOP) to the TOP Block.

    I have recieved different record set from this query.

    That occured only once, but in production.

    So if your table had 20 million rows (and a suitable index on ValidationDate), you would expect it to return the distinct values from within all 20 million before selecting the top 100 from them?

    The execution plan has to show TOP and DISTINCT as two separate operations but that doesn't mean they are performed separately.

    Actually, yeah, they will be processed as two steps because they're logically separated.

    It really depends on how the query is structured. If it's a straight SELECT TOP 100 DISTINCT..., then the DISTINCT must be processed, then the TOP 100, and yes, because of the DISTINCT operation, you may see less than 100 rows in the result set if there are fewer than 100 distinct values. You can start looking at derived tables where you can select the TOP 100 and then get the DISTINCT data set of that, but again, it may result in fewer than 100 rows.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks to all for the reply.

    I got the solution to the existing solution.

    by Adding the Order by Clause to the existing query.

    I am curious to know about if TOP gets executed before DISTINCT when we use them in same query.

    If any one know about this or any link that explaains about the execution of the query, please do let me know .

    Thanks in Advance. 🙂

  • The distinct operation must occur first because that's logically what you're asking from the query.

    Reading between the lines of what you've said above, are you're saying that you sometimes receive a different set of 100 rows than at other times?

    TOP doesn't discriminate against which 100 rows it has to return if you do not have an ORDER BY...

  • Yeah I had recieved different record set then the usual but only once in prodution.

    This scenario is not reproducible there after 🙁

  • Ok, the important thing to know is that the order of a select statement is never guaranteed unless you have an order by clause. If you want it to return in the order of the clustered index/primary key for example, you always need to specify this in an ORDER BY.

  • Grant Fritchey (4/13/2011)


    ChrisM@home (4/13/2011)


    varun.dubey29 (4/13/2011)


    That is what the expected behavior I am expecting but in the execution plan it show that Distinct block returns fewer records than the one in the table(it is returning the number of records that is mentioned with TOP) to the TOP Block.

    I have recieved different record set from this query.

    That occured only once, but in production.

    So if your table had 20 million rows (and a suitable index on ValidationDate), you would expect it to return the distinct values from within all 20 million before selecting the top 100 from them?

    The execution plan has to show TOP and DISTINCT as two separate operations but that doesn't mean they are performed separately.

    Actually, yeah, they will be processed as two steps because they're logically separated.

    It really depends on how the query is structured. If it's a straight SELECT TOP 100 DISTINCT..., then the DISTINCT must be processed, then the TOP 100, and yes, because of the DISTINCT operation, you may see less than 100 rows in the result set if there are fewer than 100 distinct values. You can start looking at derived tables where you can select the TOP 100 and then get the DISTINCT data set of that, but again, it may result in fewer than 100 rows.

    <<Kneels, lays head on block>>

    I have a sample table which has 100,000 rows, of which 1 column [Sentence] has about 45k distinct values.

    If I run

    SELECT DISTINCT TOP 100000 Sentence

    FROM TransactionDetail

    ORDER BY Sentence -- 44,582 / 100,000

    - the whole table is scanned and the Stream Aggregate operation outputs 44,582 rows to the TOP operator.

    If I run

    SELECT DISTINCT TOP 100 Sentence

    FROM TransactionDetail

    ORDER BY Sentence -- 100 / 1,605

    - sufficient rows are output from the index scan operation (there's an index on [Sentence]) to output 100 rows from the Stream Aggregate operation.

    To restrict the working set in this manner, surely the Stream Aggregate operation (i.e. the DISTINCT) works in parallel with the TOP operator? Rather than in serial?

    Here's the sproc which generates the sample data - credit goes to Jeff Moden and others for the bulk of it:

    IF OBJECT_ID('TempDB.dbo.TransactionDetail') IS NOT NULL

    DROP TABLE TempDB.dbo.TransactionDetail

    --===== Create the test table (TransactionDetail) with a clustered PK

    CREATE TABLE TempDB.dbo.TransactionDetail (

    TransactionDetailID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,

    [Date] DATETIME,

    AccountID INT,

    Amount MONEY,

    SentenceVARCHAR(200),

    AccountRunningTotal MONEY, --Running total across each account

    AccountRunningCount INT, --Like "Rank" across each account

    NCID INT)

    INSERT INTO TempDB.dbo.TransactionDetail (Date, AccountID, Amount, Sentence)

    SELECT TOP 100000

    --10 years worth of dates with times from 1/1/2000 to 12/31/2009

    CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME) AS Date,

    --100 different account numbers

    ABS(CHECKSUM(NEWID()))%100+1,

    --Dollar amounts from -99.99 to + 99.99

    CAST(CHECKSUM(NEWID())%10000 /100.0 AS MONEY),

    --Randomised number of words

    iTVF.Sentence

    FROM Master.dbo.SysColumns sc1

    CROSS JOIN Master.dbo.SysColumns sc2

    CROSS APPLY (

    SELECT Word AS 'data()'

    FROM (

    SELECT TOP ((sc1.colorder*sc2.colorder)%9+1) word

    FROM (

    SELECT word = CAST('the' AS VARCHAR(200)) UNION ALL

    SELECT 'quick' UNION ALL

    SELECT 'brown' UNION ALL

    SELECT 'fox' UNION ALL

    SELECT 'jumped' UNION ALL

    SELECT 'over' UNION ALL

    SELECT 'the' UNION ALL

    SELECT 'lazy' UNION ALL

    SELECT 'dog'

    ) Words ORDER BY NEWID()) u2

    FOR XML PATH('')

    ) iTVF(Sentence)


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • ChrisM@home (4/13/2011)


    Grant Fritchey (4/13/2011)


    ChrisM@home (4/13/2011)


    varun.dubey29 (4/13/2011)


    That is what the expected behavior I am expecting but in the execution plan it show that Distinct block returns fewer records than the one in the table(it is returning the number of records that is mentioned with TOP) to the TOP Block.

    I have recieved different record set from this query.

    That occured only once, but in production.

    So if your table had 20 million rows (and a suitable index on ValidationDate), you would expect it to return the distinct values from within all 20 million before selecting the top 100 from them?

    The execution plan has to show TOP and DISTINCT as two separate operations but that doesn't mean they are performed separately.

    Actually, yeah, they will be processed as two steps because they're logically separated.

    It really depends on how the query is structured. If it's a straight SELECT TOP 100 DISTINCT..., then the DISTINCT must be processed, then the TOP 100, and yes, because of the DISTINCT operation, you may see less than 100 rows in the result set if there are fewer than 100 distinct values. You can start looking at derived tables where you can select the TOP 100 and then get the DISTINCT data set of that, but again, it may result in fewer than 100 rows.

    <<Kneels, lays head on block>>

    I have a sample table which has 100,000 rows, of which 1 column [Sentence] has about 45k distinct values.

    If I run

    SELECT DISTINCT TOP 100000 Sentence

    FROM TransactionDetail

    ORDER BY Sentence -- 44,582 / 100,000

    - the whole table is scanned and the Stream Aggregate operation outputs 44,582 rows to the TOP operator.

    If I run

    SELECT DISTINCT TOP 100 Sentence

    FROM TransactionDetail

    ORDER BY Sentence -- 100 / 1,605

    - sufficient rows are output from the index scan operation (there's an index on [Sentence]) to output 100 rows from the Stream Aggregate operation.

    To restrict the working set in this manner, surely the Stream Aggregate operation (i.e. the DISTINCT) works in parallel with the TOP operator? Rather than in serial?

    Here's the sproc which generates the sample data - credit goes to Jeff Moden and others for the bulk of it:

    IF OBJECT_ID('TempDB.dbo.TransactionDetail') IS NOT NULL

    DROP TABLE TempDB.dbo.TransactionDetail

    --===== Create the test table (TransactionDetail) with a clustered PK

    CREATE TABLE TempDB.dbo.TransactionDetail (

    TransactionDetailID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,

    [Date] DATETIME,

    AccountID INT,

    Amount MONEY,

    SentenceVARCHAR(200),

    AccountRunningTotal MONEY, --Running total across each account

    AccountRunningCount INT, --Like "Rank" across each account

    NCID INT)

    INSERT INTO TempDB.dbo.TransactionDetail (Date, AccountID, Amount, Sentence)

    SELECT TOP 100000

    --10 years worth of dates with times from 1/1/2000 to 12/31/2009

    CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME) AS Date,

    --100 different account numbers

    ABS(CHECKSUM(NEWID()))%100+1,

    --Dollar amounts from -99.99 to + 99.99

    CAST(CHECKSUM(NEWID())%10000 /100.0 AS MONEY),

    --Randomised number of words

    iTVF.Sentence

    FROM Master.dbo.SysColumns sc1

    CROSS JOIN Master.dbo.SysColumns sc2

    CROSS APPLY (

    SELECT Word AS 'data()'

    FROM (

    SELECT TOP ((sc1.colorder*sc2.colorder)%9+1) word

    FROM (

    SELECT word = CAST('the' AS VARCHAR(200)) UNION ALL

    SELECT 'quick' UNION ALL

    SELECT 'brown' UNION ALL

    SELECT 'fox' UNION ALL

    SELECT 'jumped' UNION ALL

    SELECT 'over' UNION ALL

    SELECT 'the' UNION ALL

    SELECT 'lazy' UNION ALL

    SELECT 'dog'

    ) Words ORDER BY NEWID()) u2

    FOR XML PATH('')

    ) iTVF(Sentence)

    You can keep your head. I'm not the guy to take anyone's.

    You're still seeing two operations, right? If it's showing two operations, at least based on what I've learned, it's performing two operations. I could certainly be wrong about that (wouldn't be the first time in the last 10 minutes).

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (4/13/2011)


    ...

    You're still seeing two operations, right? If it's showing two operations, at least based on what I've learned, it's performing two operations...

    I'll live with that. Thanks Grant.


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

Viewing 13 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply