How to return a LOT of other columns with group by

  • I have a MyNames table which contains over 100 columns. The Name column may have the same name in many rows. The data looks like this.

    Name..........Factor1..........Factor2...

    John..........X................Y

    John..........X................X

    Mary..........T................Y

    Mary..........V................Y

    I need a result table containing all 100 columns from MyNames but only the unique names in the name column. I don’t care which name row is used.

    Name..........Factor1..........Factor2...

    John..........X................Y

    Mary..........V................Y

    I have tried DISTINCT and GROUP BY.

    DISTINCT will unique on every column listed so I will end up with the name repetred in the Name column.

    SELECT DISIINCT Name, Factor1, Factor2 => I do not want the Factor columns involved in the distinct. This will cause the same name to appear in more than one row.

    Group by will requires that I place the remaininf 99 columns in a function in order to have the columns listed.

    SELECT Name, MAX(Address)... => adding 99 fields.

    Is there a way get only the unique Names with the 99 other columns with out having to manually put the other 99 columns in functions?

  • Several ways to do this, here's one

    WITH CTE AS(

    SELECT Name, Factor1, Factor2,

    ROW_NUMBER() OVER(PARTITION BY Name ORDER BY Factor1) as rn

    FROM MyNames)

    SELECT Name, Factor1, Factor2

    FROM CTE

    WHERE rn=1

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • I do not know CTS. I will have to do some study.

  • I've seen requests like this before and they always puzzle me... it would appear that the name should be the primary key in this case. Yet, you have more than one row with the same name? And, why don't you care which name row is returned? You would think that the information would be important enough to at least select the most recent row.

    Last, but not least, why are there 99 columns associated with a name? Seems to be pretty much denormalized.

    Not trying to be a smart guy, here... just concerned about your data...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • These are all reasonable questions Jeff. Here is what is going on in my case and it may be similar conditions for other people that have made a similar request.

    1)... it would appear that the name should be the primary key in this case. Yet, you have more than one row with the same name?

    Basically the table was corrupted. The name should have been unique but due to some processing some where I have been given a table where the names aren’t unique.

    2)And, why don't you care which name row is returned?

    Each row has columns that are null and some that are populated. What I want to do is pick an arbitrary name. Then from the other rows with the same name populate the null fields in the selected row

    3)…why are there 99 columns associated with a name? Seems to be pretty much demoralized.

    This table is absolutely demoralized! I never would have designed something like this. But the people I got this file from are basically spread sheet users and are superimposing what they know on to data tables. So every thing ends up being a huge flat file. In this case the flat file got corrupted and I’m left with trying to clean up the mess.

    I am trying to get them to understand the concept of keys and related tables but it is slow going.

    I very much appreciate youe help. At least I will be learning something new out of all this.

  • Heh... "Flat file cleanup on aisle 5"!

    Are you all set, Kirk? Did Mark's post do what you needed (looks like it but thougt I'd double check)?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Yes that worked. Thanks to all of you. It took me some study to understand what he did but i got it.

  • A really nice one from Mark showing the use of CTE (Common Table Expressions) and ROW_NUMBER() two of the many new features of SQL Server 2005.

    Check the below link for more info.

    Common Table Expressions (CTE) in SQL Server 2005

    -----------------------------------------------------------[font=Arial Black]Time Is Money[/font][font=Arial Narrow]Calculating the Number of Business Hours Passed since a Point of Time[/url][/font][font=Arial Narrow]Calculating the Number of Business Hours Passed Between Two Points of Time[/font]

  • Hi Mark,

    The same result can be achieved by:

    SELECT Name, Factor1, Factor2 FROM

    (SELECT Name, Factor1, Factor2, ROW_NUMBER() OVER(PARTITION BY Name ORDER BY Factor1) as rn

    FROM MyNames) A

    WHERE rn=1

    Do you see any benefit by using CTE here.

    -----------------------------------------------------------[font=Arial Black]Time Is Money[/font][font=Arial Narrow]Calculating the Number of Business Hours Passed since a Point of Time[/url][/font][font=Arial Narrow]Calculating the Number of Business Hours Passed Between Two Points of Time[/font]

  • The CTE and the subquery versions are essentially identical in this case.

    I am worried that you have multiple rows for some names that may have different subsets of columns populated. You may want to use a GROUP BY query to get as many column values as possible.

    SELECT Name, MAX(Factor1), MAX(Factor2), ...

    FROM MyNames

    GROUP BY Name

    This will get the most data values, but each value could be from a different row which may be the wrong thing to do.

    If you want to get all factors from the same row you should use the ROW_NUMBER() approach, but if you also want to pick the row with the most non-NULL values you could work with the ORDER BY clause:

    ROW_NUMBER() OVER (PARTITION BY Name ORDER BY

    CASE WHEN Factor1 IS NULL THEN 0 ELSE 1 END

    + CASE WHEN Factor2 IS NULL THEN 0 ELSE 1 END

    + CASE WHEN Factor3 IS NULL THEN 0 ELSE 1 END

    ...

    DESC) as rn

  • Hi Scott,

    Your inputs are very useful for the requirements like you mentioned.

    But in the present case

    I need a result table containing all 100 columns from MyNames but only the unique names in the name column. I don’t care which name row is used.

    ROW_NUMBER() OVER(PARTITION BY Name

    is enough to pick the unique names because we are using

    WHERE rn=1

    I tested this also.

    I have actually ran into such requirements from users when they demand "pick the first row for a customer" from a data set if he has multiple rows for him.

    Sincerely,

    Maz

    -----------------------------------------------------------[font=Arial Black]Time Is Money[/font][font=Arial Narrow]Calculating the Number of Business Hours Passed since a Point of Time[/url][/font][font=Arial Narrow]Calculating the Number of Business Hours Passed Between Two Points of Time[/font]

  • Picking unique names is not the hard part, there are several ways to do it. It sounds like there are duplicate rows with the same name, and they may have been added or updated at different times. Some of the rows may have different subsets of NULL columns. The problem becomes how to pick the "best" row among each group of duplicates.

    The aggregate query guarantees that all columns that had a value in any duplicate row will have a value in the unique result, but there is no good way to choose the "best" value for each column. The ROW_NUMBER() query ensures that all fields come from the same original row, so if there are subsets of columns that were likely edited as a group they should have consistent values.

    Possibly the best approach would be to use a ROW_NUMBER() query to get an initial set of results, and join those to an aggregate query to fill in any NULL values.

  • That is representative of one of the conditions. I am not sure how I am going to populate the null columns from one of the duplicate rows. I am thinking I will have to writhe a loop of some kind to try and step through the rows that have the same name. I am sure there must be a better way of doing this.

  • Aggregate functions (like what Scott was proposing) would handle the NULL values. In other words - Max(column) will return the row with an actual value if every other row had nulls in it.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Yes I see where that will work. I'll give that a shot. thanks

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply