Execution of DISTINCT Clause

Question

Execution of DISTINCT Clause

mohd.nizamuddin

Ten Centuries

Points: 1328
More actions
January 7, 2010 at 12:14 am

#156560

Hi,
We all know that DISTINCT clause will provide unique resultset (means removes the duplicate rows). I have a problem while using DISTINCT clause. Can anybody help me that how DISTINCT has been treated and executed by SQL Engine internally.
Lets see an example below:
DECLARE @temp TABLE(
id INT,
Name VARCHAR(50)
)
INSERT INTO @temp
SELECT 1, 'Nizam' UNION ALL
SELECT 2, 'Sandeep' UNION ALL
SELECT 3, 'Ravi' UNION ALL
SELECT 4, 'Sandeep'
In the above table, I have inserted data, where I have two rows having the same name (id = 2 and 4).
SELECT DISTINCT name FROM @temp
Once I execute the above SQL, will give me three distinct names, in this case.
Moreover I want these names as a comma delimited. So I have written the standard logic to add comma along with names like below.
Declare @x VARCHAR(50)
SELECT DISTINCT @x = COALESCE(@x + ',', '') + name
FROM @temp
SELECT @x
OUTPUT
--------
Sandeep
But problem is that I am getting only one (id = 4) name in the variable @x. If I remove DISTINCT clause from the above SQL, @x variable get populated by names but with duplicates.
OUTPUT
--------
Nizam,Sandeep,Ravi,Sandeep
Moreover I have a work around of the problem. Which gives me appropriate result using below SQL.
SET @x = NULL
SELECT @x = COALESCE(@x + ',', '') + x.name
FROM (SELECT DISTINCT name FROM @temp) x
SELECT @x
I want to know that when I was using DISTINCT clause along with COALESCE, I was getting single name. Though I have three unique names in the table.
Please explain that what is happening behind the scene and why I am getting only one name.
Thanks in Advance,
Nizam

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply

Gail Shaw SSC Guru Points: 1004514 More actions · Answer 1

*edit removed*

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

toniupstny Hall of Fame Points: 3986 More actions · Answer 2

You could get the results without the overhead of distinct by checking for occurence of name in the string built thus far through a Case statement:

Declare @x VARCHAR(100)
SELECT @x = case
when @x is null then name
when patindex('%'+name+'%', @x) = 0 then COALESCE(@x + ',', '') + name
else @x
end
FROM @temp
SELECT 'Result ==> @x is: ''' + Coalesce(@x, 'Null value')+ ''''

Toni

toniupstny Hall of Fame Points: 3986 More actions · Answer 3

Actually the Coalesce does not add anything since the first when clause handles a null '@x' variable already.

Toni

Mike01 SSChampion Points: 11420 More actions · Answer 4

This works

DECLARE @temp TABLE(

id INT,

Name VARCHAR(50)

)

INSERT INTO @temp

SELECT 1, 'Nizam' UNION ALL

SELECT 2, 'Sandeep' UNION ALL

SELECT 3, 'Ravi' UNION ALL

SELECT 4, 'Sandeep'

Declare @x VARCHAR(50)

set @x = ''

SELECT @x = @x + name + ','

FROM @temp

group by name

SELECT @x

For better, quicker answers, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

José.Cruz Right there with Babe Points: 722 More actions · Answer 5

Hi,

That way an extra comma will be placed in the end.

DECLARE @temp TABLE(

id INT,

Name VARCHAR(50)

)

INSERT INTO @temp

SELECT 1, 'Nizam' UNION ALL

SELECT 2, 'Sandeep' UNION ALL

SELECT 3, 'Ravi' UNION ALL

SELECT 4, 'Sandeep'

Declare @x VARCHAR(50)

set @x = null

SELECT @x = COALESCE(@x + ',', '') + name

FROM @temp

group by name

SELECT @x

José Cruz

mohd.nizamuddin Ten Centuries Points: 1328 More actions · Answer 6

Thanks for your responses.

But my problem is not the getting data (output) efficiently, rather I want to know that when I was using DISTINCT clause along with COALESCE, I was getting single name. Though I have three unique names in the table.

Please explain that what is happening behind the scene and why I am getting only one name.

Something internal of the SQL engine (HOW).

Nizam.

mohd.nizamuddin Ten Centuries Points: 1328 More actions · Answer 7

Thanks for your responses.

But my problem is not the getting data (output) efficiently, rather I want to know that when I was using DISTINCT clause along with COALESCE, I was getting single name. Though I have three unique names in the table.

Please explain that what is happening behind the scene and why I am getting only one name.

Something internal of the SQL engine (HOW).

Nizam.

ChiragNS One Orange Chip Points: 26137 More actions · Answer 8

If i remeber correctly COALESCE returns the first not null value.

"Keep Trying"

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 9

mohd.nizamuddin (1/7/2010)
Thanks for your responses.
But my problem is not the getting data (output) efficiently, rather I want to know that when I was using DISTINCT clause along with COALESCE, I was getting single name. Though I have three unique names in the table.
Please explain that what is happening behind the scene and why I am getting only one name.
Something internal of the SQL engine (HOW).
Nizam.

Actually, how it works or not behind the scenes is neither the problem nor the question here. You've simply told it take take a "Distinct" on the wrong thing and it returned the wrong thing. Your code attempts to take a DISTINCT on an answer that does not yet exist. Instead, you must take the DISTINCT on the source of the information used to create the answer... thusly...

DECLARE @temp TABLE

(

ID INT,

Name VARCHAR(50)

)

INSERT INTO @temp

SELECT 1, 'Robert' UNION ALL

SELECT 2, 'Sandy' UNION ALL

SELECT 3, 'Roger' UNION ALL

SELECT 4, 'Sandy'

DECLARE @x VARCHAR(50)

SELECT @x = COALESCE(@x + ',', '') + d.Name

FROM (SELECT DISTINCT Name FROM @temp) d

SELECT @x

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

nadabadan SSCommitted Points: 1523 More actions · Answer 10

The reason you get Sandeep as a result is because DISTINCT is applied to "COALESCE( @x + ',', '') + name" first - which generates an internal virtual table. Then the variable @x is set the the rows of the virtual table.

Basically SQL runs the following queries internally for your query

INSERT __InternalVirtualTable1( __col1 )

SELECT DISTINCT COALESCE(@x + ',', '') + name

FROM @temp

So __InternalVirtualTable1 will have the the values below

Nizam

Ravi

Sandeep

Then SQL runs

SELECT @x = __col1

FROM __InternalVirtualTable1

and @x will contain 'Sandeep' because the last row in __InternalVirtualTable1 contains 'Sandeep'.

Look into virtual tables if you want to know how sql processes a query internally.

mohd.nizamuddin Ten Centuries Points: 1328 More actions · Answer 11

nadabadan (1/8/2010)
The reason you get Sandeep as a result is because DISTINCT
.
.
.
Look into virtual tables if you want to know how sql processes a query internally.

Thanks a lot. I will look further to know more about. Really, it will help me to start with to know more inside SQL. 🙂

mohd.nizamuddin Ten Centuries Points: 1328 More actions · Answer 12

Jeff Moden (1/8/2010)
Actually, how it works or not behind the scenes is neither the problem nor the question here. You've simply told it take take a "Distinct" on the wrong thing and it returned the wrong thing. Your code attempts to take a DISTINCT on an answer that does not yet exist. Instead, you must take the DISTINCT on the source of the information used to create the answer... thusly...
DECLARE @temp TABLE
(
ID INT,
Name VARCHAR(50)
)
INSERT INTO @temp
SELECT 1, 'Robert' UNION ALL
SELECT 2, 'Sandy' UNION ALL
SELECT 3, 'Roger' UNION ALL
SELECT 4, 'Sandy'
DECLARE @x VARCHAR(50)
SELECT @x = COALESCE(@x + ',', '') + d.Name
FROM (SELECT DISTINCT Name FROM @temp) d
SELECT @x

Thanks Jeff.

Paul White SSC Guru Points: 150468 More actions · Answer 13

About the 'internals'...

The first thing to realize is that the trick of generating a comma separated string using multiple assignments takes advantage of an undocumented quirk in SQL Server. It is a quirk that has been around for many years, and has pretty much always worked...but if you're going to use it, you probably need to be aware of its internals.

First though, I'm going to mention a better (though still not perfect) alternative:

SELECT @x =

STUFF(

(

SELECT DISTINCT ',' + name

FROM @temp

ORDER BY name

FOR XML PATH (''), TYPE

).value('.[1]', 'varchar(max)')

,1,1, ''); -- Removes the leading comma

(This method is documented, faster, allows for explicit ordering, and does not rely on @x having a NULL value at the start).

Anyway, back to the multiple-assignment trick: It relies on the variable being updated for every row, with the updated value being available for the next row to use.

So what causes the unusual behaviour noted in your first post? SQL Server creates a plan to evaluate the SELECT that is designed to assign @x from itself once per row. The operator within which implements this operation is the Compute Scalar.

The Compute Scalar is responsible for evaluating the expression which is assigned to @x, though there is no guarantee of exactly when the evaluation or the assignment will take place - or even if they are to occur at the same point in the plan.

In the simple case, without the DISTINCT, there is only one Compute Scalar in the plan and it handles the assignment and the evaluation, and chooses to do the assignment to @x on a per-row basis and so all is well. There is still no absolute guarantee that @x will be assigned to once per row - it just happens to work that way in practice.

When we add a DISTINCT to the query, an extra operator appears in the plan (either a Sort or Hash operator to implement the grouping) as we might expect; however we also find an extra Compute Scalar which appears before the grouping!

This operator takes care of the evaluation of the expression involving @x. It evaluates this once per rowbut crucially does not assign the result to @x between rows. In fact, it can't do that yet because that would be logically wrong - we need to do the DISTINCT before assigning to @x. Instead, the per-row result of the expression is stored as an Defined Value result with a name like [Expr1003].

Since we don't assign to @x at this stage, this Compute Scalar finds that @x always has the value it had before the statement started - in this case NULL. So, every row that is processed finds that @x is NULL so [Expr1003] is set to the contents of the name column. No concatenation or adding of commas occurs for any row. This step results in four rows with the [Expr1003] set to Nizam, Sandeep, Ravi, and Sandeep - though not necessarily in any particular order.

Then we do the DISTINCT on these [Expr1003] values - which results in Nizam, Sandeep, and Ravi - again in no guaranteed order. Finally, the second Compute Scalar assigns to @x from [Expr1003]. The result of this is non-deterministic since it depends on the order of rows coming from the prior operator. It may, for example, set @x to Nizam, then set it to Ravi, and then set it to Sandeep. It might do it in some other order, but whatever happens, @x will be set to just one of the values at the end.

In practice, you may find that the order of rows is predictable for small sets so you always get one of the candidate names. This is just another quirk of the code - there are optimizations to deal with small numbers of rows for example which give the impression of a guaranteed order, but that's all it is - an impression of a guarantee.

For example, with a larger number of rows, SQL Server might choose to implement the DISTINCT using a Hash Group operator, or it might choose to use parallelism, either of which would change the order of the rows in unpredictable ways.

There is one last thing about Compute Scalar operators: the execution engine will generally defer their evaluation until the result is actually required by some other operator - so the evaluation may happen later than the operator's position in the plan would suggest.

So, why does SQL Server break the operation into two parts when we introduce the DISTINCT? It is the result of the application of a general rule - the optimizer tries to place operators like Compute Scalar and Filter as early in the plan as it can. This is an optimization that makes sense if you think about it: evaluating early will usually be good for later operators and may avoid extra work. The optimizer applies these sorts of optimizations wherever it can, so long as it results in a provably correct plan.

You may think that the plan isn't correct since it produces unexpected output; but SQL Server does not guarantee that these sorts of multiple assignments should work in any particular way...

Jeff's solution, by the way, works by changing the query so that it is not safe for the optimizer to split the evaluation and assignment operations into two Compute Scalars. I have yet to come across a case where I was unable to rewrite the query to enable this technique to work - but I'll leave it up to the interested reader to decide for themselves which way to go with this 😉

Paul

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

mohd.nizamuddin Ten Centuries Points: 1328 More actions · Answer 14

Thank you Paul, for your detailed explanation. Really this help me a lot to know something extra about the SQL.