Execution of DISTINCT Clause

  • Hi,

    We all know that DISTINCT clause will provide unique resultset (means removes the duplicate rows). I have a problem while using DISTINCT clause. Can anybody help me that how DISTINCT has been treated and executed by SQL Engine internally.

    Lets see an example below:

    DECLARE @temp TABLE(

    id INT,

    Name VARCHAR(50)

    )

    INSERT INTO @temp

    SELECT 1, 'Nizam' UNION ALL

    SELECT 2, 'Sandeep' UNION ALL

    SELECT 3, 'Ravi' UNION ALL

    SELECT 4, 'Sandeep'

    In the above table, I have inserted data, where I have two rows having the same name (id = 2 and 4).

    SELECT DISTINCT name FROM @temp

    Once I execute the above SQL, will give me three distinct names, in this case.

    Moreover I want these names as a comma delimited. So I have written the standard logic to add comma along with names like below.

    Declare @x VARCHAR(50)

    SELECT DISTINCT @x = COALESCE(@x + ',', '') + name

    FROM @temp

    SELECT @x

    OUTPUT

    --------

    Sandeep

    But problem is that I am getting only one (id = 4) name in the variable @x. If I remove DISTINCT clause from the above SQL, @x variable get populated by names but with duplicates.

    OUTPUT

    --------

    Nizam,Sandeep,Ravi,Sandeep

    Moreover I have a work around of the problem. Which gives me appropriate result using below SQL.

    SET @x = NULL

    SELECT @x = COALESCE(@x + ',', '') + x.name

    FROM (SELECT DISTINCT name FROM @temp) x

    SELECT @x

    I want to know that when I was using DISTINCT clause along with COALESCE, I was getting single name. Though I have three unique names in the table.

    Please explain that what is happening behind the scene and why I am getting only one name.

    Thanks in Advance,

    Nizam

  • *edit removed*

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • You could get the results without the overhead of distinct by checking for occurence of name in the string built thus far through a Case statement:

    Declare @x VARCHAR(100)

    SELECT @x = case

    when @x is null then name

    when patindex('%'+name+'%', @x) = 0 then COALESCE(@x + ',', '') + name

    else @x

    end

    FROM @temp

    SELECT 'Result ==> @x is: ''' + Coalesce(@x, 'Null value')+ ''''

    Toni

  • Actually the Coalesce does not add anything since the first when clause handles a null '@x' variable already.

    Toni

  • This works

    DECLARE @temp TABLE(

    id INT,

    Name VARCHAR(50)

    )

    INSERT INTO @temp

    SELECT 1, 'Nizam' UNION ALL

    SELECT 2, 'Sandeep' UNION ALL

    SELECT 3, 'Ravi' UNION ALL

    SELECT 4, 'Sandeep'

    Declare @x VARCHAR(50)

    set @x = ''

    SELECT @x = @x + name + ','

    FROM @temp

    group by name

    SELECT @x

    For better, quicker answers, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Hi,

    That way an extra comma will be placed in the end.

    DECLARE @temp TABLE(

    id INT,

    Name VARCHAR(50)

    )

    INSERT INTO @temp

    SELECT 1, 'Nizam' UNION ALL

    SELECT 2, 'Sandeep' UNION ALL

    SELECT 3, 'Ravi' UNION ALL

    SELECT 4, 'Sandeep'

    Declare @x VARCHAR(50)

    set @x = null

    SELECT @x = COALESCE(@x + ',', '') + name

    FROM @temp

    group by name

    SELECT @x

    José Cruz

  • Thanks for your responses.

    But my problem is not the getting data (output) efficiently, rather I want to know that when I was using DISTINCT clause along with COALESCE, I was getting single name. Though I have three unique names in the table.

    Please explain that what is happening behind the scene and why I am getting only one name.

    Something internal of the SQL engine (HOW).

    Nizam.

  • Thanks for your responses.

    But my problem is not the getting data (output) efficiently, rather I want to know that when I was using DISTINCT clause along with COALESCE, I was getting single name. Though I have three unique names in the table.

    Please explain that what is happening behind the scene and why I am getting only one name.

    Something internal of the SQL engine (HOW).

    Nizam.

  • If i remeber correctly COALESCE returns the first not null value.

    "Keep Trying"

  • mohd.nizamuddin (1/7/2010)


    Thanks for your responses.

    But my problem is not the getting data (output) efficiently, rather I want to know that when I was using DISTINCT clause along with COALESCE, I was getting single name. Though I have three unique names in the table.

    Please explain that what is happening behind the scene and why I am getting only one name.

    Something internal of the SQL engine (HOW).

    Nizam.

    Actually, how it works or not behind the scenes is neither the problem nor the question here. You've simply told it take take a "Distinct" on the wrong thing and it returned the wrong thing. Your code attempts to take a DISTINCT on an answer that does not yet exist. Instead, you must take the DISTINCT on the source of the information used to create the answer... thusly...

    DECLARE @temp TABLE

    (

    ID INT,

    Name VARCHAR(50)

    )

    INSERT INTO @temp

    SELECT 1, 'Robert' UNION ALL

    SELECT 2, 'Sandy' UNION ALL

    SELECT 3, 'Roger' UNION ALL

    SELECT 4, 'Sandy'

    DECLARE @x VARCHAR(50)

    SELECT @x = COALESCE(@x + ',', '') + d.Name

    FROM (SELECT DISTINCT Name FROM @temp) d

    SELECT @x

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The reason you get Sandeep as a result is because DISTINCT is applied to "COALESCE( @x + ',', '') + name" first - which generates an internal virtual table. Then the variable @x is set the the rows of the virtual table.

    Basically SQL runs the following queries internally for your query

    INSERT __InternalVirtualTable1( __col1 )

    SELECT DISTINCT COALESCE(@x + ',', '') + name

    FROM @temp

    So __InternalVirtualTable1 will have the the values below

    Nizam

    Ravi

    Sandeep

    Then SQL runs

    SELECT @x = __col1

    FROM __InternalVirtualTable1

    and @x will contain 'Sandeep' because the last row in __InternalVirtualTable1 contains 'Sandeep'.

    Look into virtual tables if you want to know how sql processes a query internally.

  • nadabadan (1/8/2010)


    The reason you get Sandeep as a result is because DISTINCT

    .

    .

    .

    Look into virtual tables if you want to know how sql processes a query internally.

    Thanks a lot. I will look further to know more about. Really, it will help me to start with to know more inside SQL. 🙂

  • Jeff Moden (1/8/2010)


    Actually, how it works or not behind the scenes is neither the problem nor the question here. You've simply told it take take a "Distinct" on the wrong thing and it returned the wrong thing. Your code attempts to take a DISTINCT on an answer that does not yet exist. Instead, you must take the DISTINCT on the source of the information used to create the answer... thusly...

    DECLARE @temp TABLE

    (

    ID INT,

    Name VARCHAR(50)

    )

    INSERT INTO @temp

    SELECT 1, 'Robert' UNION ALL

    SELECT 2, 'Sandy' UNION ALL

    SELECT 3, 'Roger' UNION ALL

    SELECT 4, 'Sandy'

    DECLARE @x VARCHAR(50)

    SELECT @x = COALESCE(@x + ',', '') + d.Name

    FROM (SELECT DISTINCT Name FROM @temp) d

    SELECT @x

    Thanks Jeff.

  • About the 'internals'...

    The first thing to realize is that the trick of generating a comma separated string using multiple assignments takes advantage of an undocumented quirk in SQL Server. It is a quirk that has been around for many years, and has pretty much always worked...but if you're going to use it, you probably need to be aware of its internals.

    First though, I'm going to mention a better (though still not perfect) alternative:

    SELECT @x =

    STUFF(

    (

    SELECT DISTINCT ',' + name

    FROM @temp

    ORDER BY name

    FOR XML PATH (''), TYPE

    ).value('.[1]', 'varchar(max)')

    ,1,1, ''); -- Removes the leading comma

    (This method is documented, faster, allows for explicit ordering, and does not rely on @x having a NULL value at the start).

    Anyway, back to the multiple-assignment trick: It relies on the variable being updated for every row, with the updated value being available for the next row to use.

    So what causes the unusual behaviour noted in your first post? SQL Server creates a plan to evaluate the SELECT that is designed to assign @x from itself once per row. The operator within which implements this operation is the Compute Scalar.

    The Compute Scalar is responsible for evaluating the expression which is assigned to @x, though there is no guarantee of exactly when the evaluation or the assignment will take place - or even if they are to occur at the same point in the plan.

    In the simple case, without the DISTINCT, there is only one Compute Scalar in the plan and it handles the assignment and the evaluation, and chooses to do the assignment to @x on a per-row basis and so all is well. There is still no absolute guarantee that @x will be assigned to once per row - it just happens to work that way in practice.

    When we add a DISTINCT to the query, an extra operator appears in the plan (either a Sort or Hash operator to implement the grouping) as we might expect; however we also find an extra Compute Scalar which appears before the grouping!

    This operator takes care of the evaluation of the expression involving @x. It evaluates this once per rowbut crucially does not assign the result to @x between rows. In fact, it can't do that yet because that would be logically wrong - we need to do the DISTINCT before assigning to @x. Instead, the per-row result of the expression is stored as an Defined Value result with a name like [Expr1003].

    Since we don't assign to @x at this stage, this Compute Scalar finds that @x always has the value it had before the statement started - in this case NULL. So, every row that is processed finds that @x is NULL so [Expr1003] is set to the contents of the name column. No concatenation or adding of commas occurs for any row. This step results in four rows with the [Expr1003] set to Nizam, Sandeep, Ravi, and Sandeep - though not necessarily in any particular order.

    Then we do the DISTINCT on these [Expr1003] values - which results in Nizam, Sandeep, and Ravi - again in no guaranteed order. Finally, the second Compute Scalar assigns to @x from [Expr1003]. The result of this is non-deterministic since it depends on the order of rows coming from the prior operator. It may, for example, set @x to Nizam, then set it to Ravi, and then set it to Sandeep. It might do it in some other order, but whatever happens, @x will be set to just one of the values at the end.

    In practice, you may find that the order of rows is predictable for small sets so you always get one of the candidate names. This is just another quirk of the code - there are optimizations to deal with small numbers of rows for example which give the impression of a guaranteed order, but that's all it is - an impression of a guarantee.

    For example, with a larger number of rows, SQL Server might choose to implement the DISTINCT using a Hash Group operator, or it might choose to use parallelism, either of which would change the order of the rows in unpredictable ways.

    There is one last thing about Compute Scalar operators: the execution engine will generally defer their evaluation until the result is actually required by some other operator - so the evaluation may happen later than the operator's position in the plan would suggest.

    So, why does SQL Server break the operation into two parts when we introduce the DISTINCT? It is the result of the application of a general rule - the optimizer tries to place operators like Compute Scalar and Filter as early in the plan as it can. This is an optimization that makes sense if you think about it: evaluating early will usually be good for later operators and may avoid extra work. The optimizer applies these sorts of optimizations wherever it can, so long as it results in a provably correct plan.

    You may think that the plan isn't correct since it produces unexpected output; but SQL Server does not guarantee that these sorts of multiple assignments should work in any particular way...

    Jeff's solution, by the way, works by changing the query so that it is not safe for the optimizer to split the evaluation and assignment operations into two Compute Scalars. I have yet to come across a case where I was unable to rewrite the query to enable this technique to work - but I'll leave it up to the interested reader to decide for themselves which way to go with this 😉

    Paul

  • Thank you Paul, for your detailed explanation. Really this help me a lot to know something extra about the SQL.

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply