SQL DISTINCT on Multiple Columns

  • This was a great article on distinct.

    Did I learn anything about Distinct? No.

    But that doesn't mean that somebody else can't come along and learn something valuable from this article. Keep writing them. SSC has quickly eclipsed MSDN as a go to source to learn how to code SQL and it's the primary place I send developers to learn it.



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link[/url]
    For tips on how to post your problems[/url]

  • This is the second article in a row that have had errors in them. I hate to be the person who uses the comments section to complain about typos, but in code it's pretty important. I like the articles, just a little more care in proofing would be great.

    Amy

  • as a (relative) beginner, it helped illustrate the concept for me much better than BOL does. it also inspired me to do some side by side runs on our data to see if there was any perf difference between distinct and group by.

    i created two queries in different SPIDS. one was select DISTINCT column, one was SELECT column GROUP BY column.

    The column that i chose was indexed.

    The statistics io output was exactly the same for this simple query. I added a couple of more columns to the queries, and im still getting the same stats.

    I'm going to try some more examples, but thanks for the inspiration to test this!

  • Perhaps I missed something, but one of the query samples doesn't behave as expected. The article says, "Let's look at another example where you can use DISTINCT on multiple columns to find duplicate addreses. I've taken this example from the post. Please refer to this post for more detail." Followed by this query:

    SELECT PostCode, COUNT(PostCode)

    FROM

    (

    SELECT DISTINCT Address1, Address2, City, PostCode

    FROM DuplicateTest

    ) AS Sub

    GROUP BY PostCode

    HAVING COUNT(PostCode) > 1

    However, this query does not return a duplicate address. Here's an example of not returning a duplicate address:

    CREATE TABLE DuplicateTest(

    Firstname nvarchar (30) NOT NULL,

    Lastname nvarchar(30) NOT NULL,

    Address1 nvarchar(50) NOT NULL,

    Address2 nvarchar(50) NULL,

    PostCode nvarchar(15) NOT NULL,

    City nvarchar(30) NOT NULL

    )

    insert into DuplicateTest

    (Firstname,Lastname, Address1, PostCode,City)

    values

    ('Param', 'Singh', '12 Church Street', 'B283SP', 'Birmingham'),

    ('Steve', 'White', '1 May Lane', 'EC224HQ', 'London'),

    ('Mark', 'Smith', '42 May Lane', 'L324JK', 'Liverpool'),

    ('Claire', 'whitehood', '789 High Road', 'M236DM', 'Manchester'),

    ('P', 'Singh', '12 Church Street', 'B283SP', 'Birmingham')

    SELECT PostCode, COUNT(PostCode)

    FROM

    (

    SELECT DISTINCT Address1, Address2, City, PostCode

    FROM DuplicateTest

    ) AS Sub

    GROUP BY PostCode

    HAVING COUNT(PostCode) > 1

    You can see that records 1 and 5 have the same address for Adress1, PostCode and City. The firstnames are different - which is typical when identifying duplicate records in a mailing list. The query provided returns no records.

    The queries included in an article should perform what the article states. Rated - 1 star.


    P Goldy

  • Goldy it wouldnt as you've already done a distinct Select on the Address which is the same, thus by the time you get to the outer query you only have 1 record returned to do the count on.

    If you change the Second instance of the address to house number 14, you will get a count of two for that postal code.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • Jason. Right - I understood that. Just wanted to point out to Sarvesh that he should make his article explanation match the query and produce expected results. Maybe I'm being to picky...


    P Goldy

  • Enjoyed the article especially the comparison between DISTINCT and GROUP BY.

  • pgoldy (2/8/2011)


    Jason. Right - I understood that. Just wanted to point out to Sarvesh that he should make his article explanation match the query and produce expected results. Maybe I'm being to picky...

    Apologies, I miss read your post (end of a long day)

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • The input data has been fixed.

    If this article was something you already knew, then it wasn't for you, and I'm not sure it makes sense for you to rate it lower. We don't want to put out all articles that examine performance or detailed dissection of the workings of a function. We do try to make sure that the people that started learning T-SQL this last month have some content as well.

  • I did not find this to be a good introductory article on DISTINCT. I gave this article 1 star and I have "the nads to say why".

    I often see SELECT DISTINCT when doing code reviews. To me it's a red flag: Either a junior developer didn't fully understand why duplicates were being returned and did a band-aid fix by slapping on a "SELECT DISTINCT" or, they blindly copied code from some query generator that automatically adds DISTINCT whether it's needed or not.

    Also, as another poster pointed out, in the section for finding duplicate addresses (where the mysterious "AddressTable" first appears), the two "equivalent" queries are NOT the same at all! I'm not sure what the author is trying to show here, but I'm surprised the example code was published without any testing to confirm it accurately illustrates the authors point (whatever that is). I'll bet "many a newbie" will be thoroughly confused or mislead by it, though. I just hope this mistake was caused by an editing error and not by a lack of understanding.

    I'm also surprised that, although this article attempts to point out similarities between DISTINCT and GROUP BY, it doesn't really go into trade-offs between them. The most obvious to me is that with GROUP BY, it's very easy to retrieve additional information by adding aggregates (such as a count). I've found that often the first question asked after "what are the unique values?" is "how many rows are there for each of those unique values?". When using GROUP BY, all you have to do is add a count(*) to the select list. The next request is likely for an additional level of detail (eg. break-down by calendar year) which can easily be accomodated by adding another column to the select list and group by clause.

    Finally, the example using aggregates seems to be somewhat contrived. In my experience, requests to sum distinct values are very rare, but I have had to re-write queries that incorrectly summed on results from a SELECT DISTINCT.

    DISTINCTS in aggregates = BAD IDEA!

    I have much to learn, so somebody please tell me What makes SELECT DISTINCT such a 'fine bit of "SQL Spackle"'.

    Please sell me on any reason beside "clarity of code" or "easy solution" (read band-aid fix) where using SELECT DISTINCT would be preferable to using GROUP BY.

  • This was a good introductory article about DISTINCT, and using GROUP BY.

    Please see posting by srivatsa.hg to correct the typo. My understanding was that the PostalCode were meant to be different for 'Sarvesh' and 'Param'.

    'select * from DistinctTutorial' was meant to be 'select * from DuplicateTest'. Please verify and correct as needed.

  • I too enjoyed the article and thought it would a great time to pimp one of my own blog posts being that it's closely related to this topic.

    The post covers counting based off DISTINCT clauses and the pitfalls one needs to be aware of: COUNTing with the DISTINCT Clause. I hope some of your educated audience can get something out of it to add to their knowledge base.

    Thanks...Troy

  • Thanks for the article.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Thanks for the article. I try not to use distinct, especially in multi joined select statements, but your article delivered on what you set out to deliver as far as I am concerned. Besides, look at the great discussion you generated!

  • Good effort Sarvesh.

    See also QOTD:

    http://www.sqlservercentral.com/questions/T-SQL/70386/

    🙂

Viewing 15 posts - 16 through 30 (of 37 total)

You must be logged in to reply to this topic. Login to reply