April 13, 2011 at 2:44 am
Hi All,
I am using DISTINCT and TOP in same query as :
SELECT DISTINCT TOP 100 ValidationDate FROM ValidatorDetails.
ValidatorDetails table contains 2000 records.
While looking at the execution plan, I found that DISTINCT is executed first but it returned only 100 records to the TOP execution block and then it show that TOP is executed, which returns 100 records to SELECT.
Kindly help, Thanks in advance.
Regards
Varun
April 13, 2011 at 2:48 am
What would you like help with? The DISTINCT has to be processed first as that's what you're selecting the top 100 of.
April 13, 2011 at 3:42 am
varun.dubey29 (4/13/2011)
Hi All,I am using DISTINCT and TOP in same query as :
SELECT DISTINCT TOP 100 ValidationDate FROM ValidatorDetails.
ValidatorDetails table contains 2000 records.
While looking at the execution plan, I found that DISTINCT is executed first but it returned only 100 records to the TOP execution block and then it show that TOP is executed, which returns 100 records to SELECT.
Kindly help, Thanks in advance.
Regards
Varun
Help with what?
There's an alternative if you prefer:
SELECT TOP 100 ValidationDate
FROM ValidatorDetails
GROUP BY ValidationDate
ORDER BY ValidationDate
For better assistance in answering your questions, please read this[/url].
Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]
April 13, 2011 at 3:54 am
That is what the expected behavior I am expecting but in the execution plan it show that Distinct block returns fewer records than the one in the table(it is returning the number of records that is mentioned with TOP) to the TOP Block.
I have recieved different record set from this query.
That occured only once, but in production.
April 13, 2011 at 4:17 am
varun.dubey29 (4/13/2011)
That is what the expected behavior I am expecting but in the execution plan it show that Distinct block returns fewer records than the one in the table(it is returning the number of records that is mentioned with TOP) to the TOP Block.I have recieved different record set from this query.
That occured only once, but in production.
So if your table had 20 million rows (and a suitable index on ValidationDate), you would expect it to return the distinct values from within all 20 million before selecting the top 100 from them?
The execution plan has to show TOP and DISTINCT as two separate operations but that doesn't mean they are performed separately.
For better assistance in answering your questions, please read this[/url].
Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]
April 13, 2011 at 6:12 am
ChrisM@home (4/13/2011)
varun.dubey29 (4/13/2011)
That is what the expected behavior I am expecting but in the execution plan it show that Distinct block returns fewer records than the one in the table(it is returning the number of records that is mentioned with TOP) to the TOP Block.I have recieved different record set from this query.
That occured only once, but in production.
So if your table had 20 million rows (and a suitable index on ValidationDate), you would expect it to return the distinct values from within all 20 million before selecting the top 100 from them?
The execution plan has to show TOP and DISTINCT as two separate operations but that doesn't mean they are performed separately.
Actually, yeah, they will be processed as two steps because they're logically separated.
It really depends on how the query is structured. If it's a straight SELECT TOP 100 DISTINCT..., then the DISTINCT must be processed, then the TOP 100, and yes, because of the DISTINCT operation, you may see less than 100 rows in the result set if there are fewer than 100 distinct values. You can start looking at derived tables where you can select the TOP 100 and then get the DISTINCT data set of that, but again, it may result in fewer than 100 rows.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
April 13, 2011 at 6:28 am
Thanks to all for the reply.
I got the solution to the existing solution.
by Adding the Order by Clause to the existing query.
I am curious to know about if TOP gets executed before DISTINCT when we use them in same query.
If any one know about this or any link that explaains about the execution of the query, please do let me know .
Thanks in Advance. 🙂
April 13, 2011 at 6:37 am
The distinct operation must occur first because that's logically what you're asking from the query.
Reading between the lines of what you've said above, are you're saying that you sometimes receive a different set of 100 rows than at other times?
TOP doesn't discriminate against which 100 rows it has to return if you do not have an ORDER BY...
April 13, 2011 at 6:45 am
Yeah I had recieved different record set then the usual but only once in prodution.
This scenario is not reproducible there after 🙁
April 13, 2011 at 6:50 am
Ok, the important thing to know is that the order of a select statement is never guaranteed unless you have an order by clause. If you want it to return in the order of the clustered index/primary key for example, you always need to specify this in an ORDER BY.
April 13, 2011 at 11:59 am
Grant Fritchey (4/13/2011)
ChrisM@home (4/13/2011)
varun.dubey29 (4/13/2011)
That is what the expected behavior I am expecting but in the execution plan it show that Distinct block returns fewer records than the one in the table(it is returning the number of records that is mentioned with TOP) to the TOP Block.I have recieved different record set from this query.
That occured only once, but in production.
So if your table had 20 million rows (and a suitable index on ValidationDate), you would expect it to return the distinct values from within all 20 million before selecting the top 100 from them?
The execution plan has to show TOP and DISTINCT as two separate operations but that doesn't mean they are performed separately.
Actually, yeah, they will be processed as two steps because they're logically separated.
It really depends on how the query is structured. If it's a straight SELECT TOP 100 DISTINCT..., then the DISTINCT must be processed, then the TOP 100, and yes, because of the DISTINCT operation, you may see less than 100 rows in the result set if there are fewer than 100 distinct values. You can start looking at derived tables where you can select the TOP 100 and then get the DISTINCT data set of that, but again, it may result in fewer than 100 rows.
<<Kneels, lays head on block>>
I have a sample table which has 100,000 rows, of which 1 column [Sentence] has about 45k distinct values.
If I run
SELECT DISTINCT TOP 100000 Sentence
FROM TransactionDetail
ORDER BY Sentence -- 44,582 / 100,000
- the whole table is scanned and the Stream Aggregate operation outputs 44,582 rows to the TOP operator.
If I run
SELECT DISTINCT TOP 100 Sentence
FROM TransactionDetail
ORDER BY Sentence -- 100 / 1,605
- sufficient rows are output from the index scan operation (there's an index on [Sentence]) to output 100 rows from the Stream Aggregate operation.
To restrict the working set in this manner, surely the Stream Aggregate operation (i.e. the DISTINCT) works in parallel with the TOP operator? Rather than in serial?
Here's the sproc which generates the sample data - credit goes to Jeff Moden and others for the bulk of it:
IF OBJECT_ID('TempDB.dbo.TransactionDetail') IS NOT NULL
DROP TABLE TempDB.dbo.TransactionDetail
--===== Create the test table (TransactionDetail) with a clustered PK
CREATE TABLE TempDB.dbo.TransactionDetail (
TransactionDetailID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
[Date] DATETIME,
AccountID INT,
Amount MONEY,
SentenceVARCHAR(200),
AccountRunningTotal MONEY, --Running total across each account
AccountRunningCount INT, --Like "Rank" across each account
NCID INT)
INSERT INTO TempDB.dbo.TransactionDetail (Date, AccountID, Amount, Sentence)
SELECT TOP 100000
--10 years worth of dates with times from 1/1/2000 to 12/31/2009
CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME) AS Date,
--100 different account numbers
ABS(CHECKSUM(NEWID()))%100+1,
--Dollar amounts from -99.99 to + 99.99
CAST(CHECKSUM(NEWID())%10000 /100.0 AS MONEY),
--Randomised number of words
iTVF.Sentence
FROM Master.dbo.SysColumns sc1
CROSS JOIN Master.dbo.SysColumns sc2
CROSS APPLY (
SELECT Word AS 'data()'
FROM (
SELECT TOP ((sc1.colorder*sc2.colorder)%9+1) word
FROM (
SELECT word = CAST('the' AS VARCHAR(200)) UNION ALL
SELECT 'quick' UNION ALL
SELECT 'brown' UNION ALL
SELECT 'fox' UNION ALL
SELECT 'jumped' UNION ALL
SELECT 'over' UNION ALL
SELECT 'the' UNION ALL
SELECT 'lazy' UNION ALL
SELECT 'dog'
) Words ORDER BY NEWID()) u2
FOR XML PATH('')
) iTVF(Sentence)
For better assistance in answering your questions, please read this[/url].
Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]
April 13, 2011 at 12:14 pm
ChrisM@home (4/13/2011)
Grant Fritchey (4/13/2011)
ChrisM@home (4/13/2011)
varun.dubey29 (4/13/2011)
That is what the expected behavior I am expecting but in the execution plan it show that Distinct block returns fewer records than the one in the table(it is returning the number of records that is mentioned with TOP) to the TOP Block.I have recieved different record set from this query.
That occured only once, but in production.
So if your table had 20 million rows (and a suitable index on ValidationDate), you would expect it to return the distinct values from within all 20 million before selecting the top 100 from them?
The execution plan has to show TOP and DISTINCT as two separate operations but that doesn't mean they are performed separately.
Actually, yeah, they will be processed as two steps because they're logically separated.
It really depends on how the query is structured. If it's a straight SELECT TOP 100 DISTINCT..., then the DISTINCT must be processed, then the TOP 100, and yes, because of the DISTINCT operation, you may see less than 100 rows in the result set if there are fewer than 100 distinct values. You can start looking at derived tables where you can select the TOP 100 and then get the DISTINCT data set of that, but again, it may result in fewer than 100 rows.
<<Kneels, lays head on block>>
I have a sample table which has 100,000 rows, of which 1 column [Sentence] has about 45k distinct values.
If I run
SELECT DISTINCT TOP 100000 Sentence
FROM TransactionDetail
ORDER BY Sentence -- 44,582 / 100,000
- the whole table is scanned and the Stream Aggregate operation outputs 44,582 rows to the TOP operator.
If I run
SELECT DISTINCT TOP 100 Sentence
FROM TransactionDetail
ORDER BY Sentence -- 100 / 1,605
- sufficient rows are output from the index scan operation (there's an index on [Sentence]) to output 100 rows from the Stream Aggregate operation.
To restrict the working set in this manner, surely the Stream Aggregate operation (i.e. the DISTINCT) works in parallel with the TOP operator? Rather than in serial?
Here's the sproc which generates the sample data - credit goes to Jeff Moden and others for the bulk of it:
IF OBJECT_ID('TempDB.dbo.TransactionDetail') IS NOT NULL
DROP TABLE TempDB.dbo.TransactionDetail
--===== Create the test table (TransactionDetail) with a clustered PK
CREATE TABLE TempDB.dbo.TransactionDetail (
TransactionDetailID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
[Date] DATETIME,
AccountID INT,
Amount MONEY,
SentenceVARCHAR(200),
AccountRunningTotal MONEY, --Running total across each account
AccountRunningCount INT, --Like "Rank" across each account
NCID INT)
INSERT INTO TempDB.dbo.TransactionDetail (Date, AccountID, Amount, Sentence)
SELECT TOP 100000
--10 years worth of dates with times from 1/1/2000 to 12/31/2009
CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME) AS Date,
--100 different account numbers
ABS(CHECKSUM(NEWID()))%100+1,
--Dollar amounts from -99.99 to + 99.99
CAST(CHECKSUM(NEWID())%10000 /100.0 AS MONEY),
--Randomised number of words
iTVF.Sentence
FROM Master.dbo.SysColumns sc1
CROSS JOIN Master.dbo.SysColumns sc2
CROSS APPLY (
SELECT Word AS 'data()'
FROM (
SELECT TOP ((sc1.colorder*sc2.colorder)%9+1) word
FROM (
SELECT word = CAST('the' AS VARCHAR(200)) UNION ALL
SELECT 'quick' UNION ALL
SELECT 'brown' UNION ALL
SELECT 'fox' UNION ALL
SELECT 'jumped' UNION ALL
SELECT 'over' UNION ALL
SELECT 'the' UNION ALL
SELECT 'lazy' UNION ALL
SELECT 'dog'
) Words ORDER BY NEWID()) u2
FOR XML PATH('')
) iTVF(Sentence)
You can keep your head. I'm not the guy to take anyone's.
You're still seeing two operations, right? If it's showing two operations, at least based on what I've learned, it's performing two operations. I could certainly be wrong about that (wouldn't be the first time in the last 10 minutes).
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
April 13, 2011 at 2:02 pm
Grant Fritchey (4/13/2011)
...You're still seeing two operations, right? If it's showing two operations, at least based on what I've learned, it's performing two operations...
I'll live with that. Thanks Grant.
For better assistance in answering your questions, please read this[/url].
Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]
Viewing 13 posts - 1 through 13 (of 13 total)
You must be logged in to reply to this topic. Login to reply