Combining Multiple Records Into One

  • I have an order table with an order number column and a country column. Some orders have multiple countries. I need to combine these records into one. Here is the table:

    --===== If the test table already exists, drop it
    IF OBJECT_ID('TempDB..#JobCountry','U') IS NOT NULL DROP TABLE #JobCountry
    GO
    --===== Create the test table
    CREATE TABLE #JobCountry
    (OrderNum NVARCHAR(15),
    OrdCountry NVARCHAR(20))

    --===== Insert the test data into the test table
    INSERT INTO #JobCountry
    (OrderNum, OrdCountry)
    SELECT 'AF0F','Canada' UNION ALL
    SELECT 'AF0F','Brazil' UNION ALL
    SELECT 'AF0F','Mexico' UNION ALL
    SELECT 'BD37','UnitedArabEmirates' UNION ALL
    SELECT 'BD37','UnitedKingdom'

    --===== SELECT the records
    SELECT OrderNum, OrdCountry FROM #JobCountry

    Here is the expected result:

    OrderNum   OrdCountry
    ---------- -----------------------------------
    AF0F       Canada, Brazil, Mexico
    BD37       UnitedArabEmirates, UnitedKingdom

    Is this possible, and if so, what is the best way to do it? The commas aren't necessary. Let me know if you need more info.

    Thanks.

    Steve

  • You can use SQLCLR aggregate GROUP_CONCAT to do this (get it here http://groupconcat.codeplex.com):

    SELECT jc.OrderNum,
           dbo.GROUP_CONCAT(DISTINCT OrdCountry) AS OrdCountries
    FROM #JobCountry jc
    GROUP BY jc.OrderNum

    Or you can use XML:

    SELECT jc.OrderNum,
           STUFF(
               (SELECT N',' + OrdCountry
                FROM (SELECT DISTINCT OrdCountry
                      FROM #JobCountry
                      WHERE OrderNum = jc.OrderNum) r
                FOR XML PATH(''), TYPE
               ).value('.[1]', 'NVARCHAR(MAX)'), 1, 1, '') AS OrdCountries
    FROM #JobCountry jc
    GROUP BY jc.OrderNum
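
    Note that neither query guarantees the left-to-right order of the countries in the concatenated string. If a deterministic order matters, an ORDER BY can be added to the correlated subquery (allowed there because FOR XML is present). A minimal sketch of the same XML approach, sorting the countries alphabetically purely as an example:

    SELECT jc.OrderNum,
           STUFF(
               (SELECT N',' + OrdCountry
                FROM (SELECT DISTINCT OrdCountry
                      FROM #JobCountry
                      WHERE OrderNum = jc.OrderNum) r
                ORDER BY OrdCountry -- any deterministic ordering works here
                FOR XML PATH(''), TYPE
               ).value('.[1]', 'NVARCHAR(MAX)'), 1, 1, '') AS OrdCountries
    FROM #JobCountry jc
    GROUP BY jc.OrderNum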

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • That worked. Thanks much. You have saved me a bunch of time.

  • You're welcome πŸ™‚

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (6/27/2011)


    You can use SQLCLR aggregate GROUP_CONCAT to do this (get it here http://groupconcat.codeplex.com):

    SELECT jc.OrderNum,
           dbo.GROUP_CONCAT(DISTINCT OrdCountry) AS OrdCountries
    FROM #JobCountry jc
    GROUP BY jc.OrderNum

    Or you can use XML:

    SELECT jc.OrderNum,
           STUFF(
               (SELECT N',' + OrdCountry
                FROM (SELECT DISTINCT OrdCountry
                      FROM #JobCountry
                      WHERE OrderNum = jc.OrderNum) r
                FOR XML PATH(''), TYPE
               ).value('.[1]', 'NVARCHAR(MAX)'), 1, 1, '') AS OrdCountries
    FROM #JobCountry jc
    GROUP BY jc.OrderNum

    From the link you cited...

    Overall performance is comparable between the UDAs and XML methods.

    My question, then, would be... why would you use externally managed code if there's no advantage?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (6/27/2011)


    opc.three (6/27/2011)


    You can use SQLCLR aggregate GROUP_CONCAT to do this (get it here http://groupconcat.codeplex.com):

    SELECT jc.OrderNum,
           dbo.GROUP_CONCAT(DISTINCT OrdCountry) AS OrdCountries
    FROM #JobCountry jc
    GROUP BY jc.OrderNum

    Or you can use XML:

    SELECT jc.OrderNum,
           STUFF(
               (SELECT N',' + OrdCountry
                FROM (SELECT DISTINCT OrdCountry
                      FROM #JobCountry
                      WHERE OrderNum = jc.OrderNum) r
                FOR XML PATH(''), TYPE
               ).value('.[1]', 'NVARCHAR(MAX)'), 1, 1, '') AS OrdCountries
    FROM #JobCountry jc
    GROUP BY jc.OrderNum

    From the link you cited...

    Overall performance is comparable between the UDAs and XML methods.

    My question, then, would be... why would you use externally managed code if there's no advantage?

    Readability and performance. Depending on the use case, the UDA is the better choice in terms of performance. The UDA slightly outperforms the "XML PATH, TYPE" method in 4 out of 6 use cases. If you include the "XML PATH" method (which suffers from character escapes) then the UDA only outperforms XML in 3 out of 6 use cases. That's why I went with the word "comparable".

    However, if I were being sent to a desert island and had to choose one method, either the UDA or XML...I would have to say it would depend on the island as to which one I would choose πŸ™‚

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Readability? Absolutely. You'll get no argument from me there.

    Performance? I gotta go with a "maybe" (or at least IMHO :-P). On another link, you said it was only about a 10% gain in performance (which isn't bad but isn't great, either) and you've confirmed on that link and this link that the performance increase occurs only between 1/2 and 2/3rds of the time depending on the Use Case it's applied to. That also means that 1/3 to 1/2 the time, it's slower.

    I'm actually a bit disappointed in this CLR. We've all heard and experienced the fact that SQL Server sucks at string handling and that we should use CLR's instead. With CLR's being touted so highly for string handling, I would have expected at least a 100% (ie: twice as fast) gain in performance in the CLR.

    If SQLCLR were already enabled on a server, I might use such string handling CLR's (for readability and the small chance of a performance increase) but based on the relatively small gains to be had by string handling CLR's that I've seen to date (the exception being a well written splitter), I'd have a really tough time convincing a Systems DBA to turn SQLCLR on for "comparable performance" unless there were something else that provided significant (and I'm not talking 10%) gains.

    Agreed on the island. In either case, I'd hope they have beer. πŸ˜€

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Performance? I gotta go with a "maybe" (or at least IMHO :-P). On another link, you said it was only about a 10% gain in performance (which isn't bad but isn't great, either) and you've confirmed on that link and this link that the performance increase occurs only between 1/2 and 2/3rds of the time depending on the Use Case it's applied to. That also means that 1/3 to 1/2 the time, it's slower.

    Glass half full, glass half empty πŸ™‚ ~10% is pretty significant in the large.

    In order to squeeze every last drop of performance from the server you would have to use a mix of the "XML PATH" (data permitting), "XML PATH, TYPE" and the SQLCLR UDA. If you're looking to settle on a standard way to solve this problem in a given shop then I would submit a solid "it depends", mostly on the politics of the shop and the tendencies of the DBA In Charge, because I think the author of the UDA accomplished what they set out to do which was to create a viable drop-in alternative to the XML methods.
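
    To illustrate the "data permitting" caveat: the plain "XML PATH" form (without TYPE and .value()) returns characters such as &, < and > XML-escaped, while the "XML PATH, TYPE" form returns them as-is. A minimal sketch, using a hypothetical value containing an ampersand:

    -- Hypothetical data with a character that gets XML-escaped
    DECLARE @t TABLE (Col NVARCHAR(30));
    INSERT INTO @t VALUES (N'Trinidad & Tobago'), (N'Canada');

    -- "XML PATH" only: the & comes back escaped as &amp;
    SELECT STUFF((SELECT N',' + Col FROM @t FOR XML PATH('')), 1, 1, '');

    -- "XML PATH, TYPE" with .value(): the & comes back unescaped
    SELECT STUFF((SELECT N',' + Col FROM @t FOR XML PATH(''), TYPE
                 ).value('.', 'NVARCHAR(MAX)'), 1, 1, '');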

    I'm actually a bit disappointed in this CLR. We've all heard and experienced the fact that SQL Server sucks at string handling and that we should use CLR's instead. With CLR's being touted so highly for string handling, I would have expected at least a 100% (ie: twice as fast) gain in performance in the CLR.

    If this were a scalar or table-valued string manipulation function I would agree; however, this is very different. When you talk about aggregates you're talking about maintaining an internal data structure that will be present throughout the life of a query where the aggregate function is used. The aggregate function has to accumulate the data passed into it for all rows in the query and cannot return a result until it sees all the rows of data. I would love for someone like Paul White, Adam Machanic, Erland or Bob B to have a look at the source code for these UDAs to see if there was room for improvement. I would bet on their experience and skill level over the author's; the worst the author could do would be a push πŸ˜‰

    If SQLCLR were already enabled on a server, I might use such string handling CLR's (for readability and the small chance of a performance increase) but based on the relatively small gains to be had by string handling CLR's that I've seen to date (the exception being a well written splitter), I'd have a really tough time convincing a Systems DBA to turn SQLCLR on for "comparable performance" unless there were something else that provided significant (and I'm not talking 10%) gains.

    That's the rub...the level of prejudice towards SQLCLR is right up there with the bazooka...and we both know it's mostly FUD.

    Agreed on the island. In either case, I'd hope they have beer. πŸ˜€

    Cheers to the island!

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • An interesting debate.

    I've put together a function in pure T-SQL which would do the same job. I doubt this would perform as well as opc's code, but hey it's another option.

    Caveat one is that temp tables cannot be accessed by UDFs, so I've amended the original code slightly to use a permanent table instead of a temporary table.

    Caveat two is that the final SELECT statement needs a DISTINCT; otherwise the same combined string is returned once for each row sharing the order number.

    Here's the function:

    -- Returns a single comma-separated string of all countries
    -- for the specified order number
    CREATE FUNCTION dbo.fnMergeCountriesForOrderNumber
    (
        @OrderNum NVARCHAR(15)
    )
    RETURNS NVARCHAR(2000)
    AS
    BEGIN
        DECLARE @CountriesReturnString NVARCHAR(2000) = '';
        DECLARE @CountryTable TABLE (OrdCountry NVARCHAR(20));
        DECLARE @NoRecords INT = 0;

        -- Grab the countries for the order number
        INSERT INTO @CountryTable (OrdCountry)
        SELECT OrdCountry FROM JobCountry
        WHERE OrderNum = @OrderNum;

        SELECT @NoRecords = COUNT(*) FROM @CountryTable;

        -- Build up a string, one country at a time
        WHILE (@NoRecords > 0)
        BEGIN
            SET @CountriesReturnString +=
                (SELECT TOP 1 OrdCountry FROM @CountryTable) + ', ';

            DELETE TOP (1) FROM @CountryTable;
            SELECT @NoRecords = COUNT(*) FROM @CountryTable;
        END

        -- Remove the trailing comma and space (LEN ignores the trailing space)
        IF (LEN(@CountriesReturnString) > 0)
        BEGIN
            SET @CountriesReturnString = LEFT(@CountriesReturnString, LEN(@CountriesReturnString) - 1);
        END

        RETURN @CountriesReturnString;
    END

    And here's the original code, albeit amended to use a permanent table and the function:

    --===== If the test table already exists, drop it
    IF OBJECT_ID('JobCountry','U') IS NOT NULL DROP TABLE JobCountry
    GO
    --===== Create the test table
    CREATE TABLE JobCountry
    (OrderNum NVARCHAR(15),
    OrdCountry NVARCHAR(20))

    --===== Insert the test data into the test table
    INSERT INTO JobCountry
    (OrderNum, OrdCountry)
    SELECT 'AF0F', 'Canada' UNION ALL
    SELECT 'AF0F', 'Brazil' UNION ALL
    SELECT 'AF0F', 'Mexico' UNION ALL
    SELECT 'BD37', 'UnitedArabEmirates' UNION ALL
    SELECT 'BD37', 'UnitedKingdom'

    --===== SELECT the records
    SELECT DISTINCT OrderNum, dbo.fnMergeCountriesForOrderNumber(OrderNum) FROM JobCountry

    Feel free to amend/shoot down/etc as you see fit!

    Mike.

  • No "shooting down" activities out of me. What you have provided will certainly get the job done Mike. All I'll say is that you'll be leaving money (in the form of server resources) on the table if you go with the UDF, but the performance is only part of finding the "best" solution which is why this is such an interesting debate.

    That said, if you're going to stay in T-SQL, I would recommend one of the XML methods, along the lines of the sketch below. They'll mean a little less maintenance for you because you won't need to build a new UDF for every use case, and they will perform much better than the UDF.
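
    For instance, here is a minimal sketch of the earlier "XML PATH, TYPE" query pointed at the permanent JobCountry table, producing one row per order number without a UDF:

    SELECT jc.OrderNum,
           STUFF(
               (SELECT N',' + OrdCountry
                FROM (SELECT DISTINCT OrdCountry
                      FROM JobCountry
                      WHERE OrderNum = jc.OrderNum) r
                FOR XML PATH(''), TYPE
               ).value('.[1]', 'NVARCHAR(MAX)'), 1, 1, '') AS OrdCountries
    FROM JobCountry jc
    GROUP BY jc.OrderNum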

    If you're interested in learning more about the topic/debate here is a good article that provides lots of different ways to solve this problem:

    http://www.simple-talk.com/sql/t-sql-programming/concatenating-row-values-in-transact-sql/

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Thanks for the link opc, it's an interesting read.

    I tend to agree that XML is probably the way to go, just wanted to throw in an alternative.

    As usual, the optimal solution depends on what you are trying to do. T'was ever thus...

    Mike.

  • opc.three (6/28/2011)


    That's the rub...the level of prejudice towards SQLCLR is right up there with the bazooka...and we both know it's mostly FUD.

    I realize I'm getting into semantics, but it's actually not a prejudice towards SQLCLR... it's a prejudice against the possibility of some really, really, stupid, damaging code. πŸ™‚ Please don't warp that statement into thinking that I'm against the use of SQLCLR. I'm not against it and I actually celebrated when it was first announced. I'm merely trying to explain the source of the well-deserved FUD that some DBA's have towards SQLCLR (and other forms of external code including but not limited to PERL, VBS, Powershell, and DOS CMD), especially in smaller shops where code might be promoted directly from Dev to Prod or doesn't have much of a QA process, particularly if the DBA can't tell what good C# (or whatever language) is (like me πŸ˜‰ ).

    Although I relish the idea of using SQLCLR (like I said, I celebrated when they first announced it) for doing intelligent things to augment the many shortcomings found in T-SQL, I've been astounded by the sheer stupidity and overwhelming ignorance of some of the implementations. For example, I know a "developer" who wrote a MODULUS function because he couldn't figure out how to do a modulus in T-SQL. I know another who wrote a MERGE equivalent because he couldn't figure out how to do a merge in T-SQL 2000 (and it absolutely sucked for performance). I've personally been burned by a bit of SQLCLR that had a huge memory leak in it, a BCP-like bit of code that skipped the first row after the header and skipped the last row of the file, and an SQLCLR for a running total that always doubled the first row of each "grouping". Then I (and other DBA's I'm trying to explain for) see dozens and dozens of articles on people writing huge amounts of managed code to do stupid-simple things like create a random integer (frequently claiming that SQL Server has no such capability) only to have someone point out that they wrote the code incorrectly and it will return a biased result, a result missing one of the endpoints, or the same random integer if used within the same millisecond (for example).

    Allow me to give you another example of what happened to me recently. I know a (very well known and trusted) fellow MVP that I think is absolutely one of the best in my book of intelligent, knowledgeable, and skilled people. Because of that implicit trust, I asked him to write an SQLCLR to accomplish a certain task that I could use for testing and comparison purposes in a project I was working on. He happily obliged. I had to send the code back twice because it didn't work correctly (the 3rd rev finally worked). Think about it... a Microsoft MVP who has earned my trust (a rare thing for anyone to do) and he didn't test his own code well enough to know that it wasn't working correctly!!??

    ... and we accuse DBA's of having FUD about SQLCLR's??? :Whistling: Such "FUD" is no more unreasonable than a DBA having FUD about the use of xp_CmdShell even though it's been proven that it can be done safely. :hehe:

    So, for many of us, it's not "Fear" of the unknown... it's the "Fear" of what we've already experienced at the hands of "developers" who don't know their own trade, never mind the art of databases. It's not "Uncertainty" either... based on experience, unless there's some rather stringent testing involved, we're absolutely certain that some dummy is eventually going to bring our server to its knees or damage some data with crap code that we can't personally fix (many of us are not C# programmers and will probably never add that art to our repertoire). And there's no "Doubt" about any of that. We've seen it happen over and over again.

    With that thought in mind, that's why I said that a ~10% gain in performance of concatenation code isn't going to convince a DBA that (s)he should enable SQLCLR. It opens Pandora's box to the world of possible crap code. It's not FUD... It's certain knowledge gained by experience that, until a more stringent bit of testing is also enabled, there will be SQLCLR written that is either stupid or dangerous and it will eventually cause a problem in the server or the data. To prevent that from happening, we simply refuse to enable SQLCLR just like some of you refuse to enable xp_CmdShell (even though you haven’t tried the secure methods). πŸ˜‰
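
    For reference, the server-level switch being debated here is the 'clr enabled' configuration option. A minimal sketch of how it is toggled (an instance-wide change requiring ALTER SETTINGS/sysadmin permission, which is exactly why it is a policy decision rather than a technical hurdle):

    -- Enable SQLCLR at the instance level
    EXEC sp_configure 'clr enabled', 1;
    RECONFIGURE;

    -- ...and the corresponding switch to turn it back off
    EXEC sp_configure 'clr enabled', 0;
    RECONFIGURE;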

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    Sorry, I'm a little late to respond; I've been out the past couple of days. All of you have provided a wealth of information and options, plus more reading to be done. Thank you all so much for your input. To get the job done quickly, and since I am not worried about performance right now, I am going to go with pure T-SQL. However, this data is going to grow quickly, so I am going to weigh all of the other options and put one into place soon. That way, by the time there is a performance issue I will have already addressed it.

    Thanks again. Talk to you soon.

    Steve

  • sdownen05 (7/5/2011)


    To get the job done quickly, and since I am not worried about performance right now, I am going to go with pure T-SQL.

    Heh... famous last words. No one goes back to it until it becomes a performance problem because they simply forget to go back until the performance problem reminds them.

    Do it right the first time. It just won't take that long to do so. πŸ˜‰

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
