A Case For Concatenation (Building Delimited Strings) in T-SQL

  • I have heard some criticism of doing concatenation (building delimited strings) in T-SQL. Two reasons given for this criticism are (1) to protect the resources of the SQL Server by farming the load of string manipulation out to the UI, and (2) to "do things where they make sense." Basically, other languages are better at string manipulation and similar tasks, so why bother doing it in SQL? Just return the rows and let the UI handle it.

    This makes obvious sense when building display strings. SQL passes FirstName, LastName, and OverdueAmt as separate columns to a UI that can then build whatever string it wants, whether that string is

    John Smith

    Smith, John

    Dear John

    or

    Dear Mr. Smith, we currently show your account as having an overdue balance of $250.

    In the last instance especially, it is foolishly wasteful to add all the constant information at the server and then pass it over the network to the application.

    However, a common requirement is to build a delimited string from a single column value repeated for multiple rows. This is something easily accomplished in T-SQL with very few lines of code, so I consider it to be a draw with respect to string manipulation capability. That leaves the issue of workload on the server.
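
    For anyone who hasn't seen the pattern, here is a minimal sketch (the #Orders table and its CustomerID/OrderID columns are made up purely for illustration; the full test code is at the bottom):

    -- Minimal sketch: collapse one column into a comma-delimited string per group.
    -- #Orders, CustomerID, and OrderID are hypothetical names.
    SELECT o1.CustomerID,
           STUFF((SELECT ',' + CAST(o2.OrderID AS VARCHAR(10))
                  FROM #Orders o2
                  WHERE o2.CustomerID = o1.CustomerID
                  ORDER BY o2.OrderID
                  FOR XML PATH('')       -- the empty path emits the values as one string
                 ),1,1,'') AS OrderList  -- STUFF strips the leading comma
    FROM #Orders o1
    GROUP BY o1.CustomerID;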

    The code at the bottom generates a table of unique first names, last names, and department numbers. The schemas for the tables containing the first names and last names are provided. I populated them by simply typing in 50 or so names off the top of my head, in some places duplicating last names such as Smith or Jones. But you could just as easily populate them with random strings. Department number is a random number between 1 and 20.

    The resulting table (#temp2) will be populated with approximately 22,000 unique name/department combinations. The rest of the code compares SQL concatenation of this table with simply returning the values. On my laptop, the concatenation actually runs faster than the simple display of the data. Furthermore, the result set of the concatenation is approximately 1,200 rows instead of over 22,000. This is, in effect, a compression of the data, because the FirstName/LastName columns aren't being repeated an additional 20 times. When output was directed to text files, the concatenated version was around 110K in size and the simple display was over a meg.

    That is nine times the data that has to be buffered and passed to the UI code. If temporary or permanent holding files are the target of this output, there would also be additional logging involved. To me it seems that concatenation at the SQL Server may ultimately reduce the load on the server. What am I missing here?

    ------------------------------
    -- concatenation test
    ------------------------------
    SET NOCOUNT ON;

    -- pseudo-random number generation for numbers between 1 and @X
    SELECT TOP 100000
           IDENTITY(INT,1,1) AS RowID,
           ABS(CHECKSUM(NEWID())) % 50 + 1 AS FKey,
           ABS(CHECKSUM(NEWID())) % 50 + 1 AS LKey,
           ABS(CHECKSUM(NEWID())) % 20 + 1 AS Dept
    INTO #temp
    FROM Tally;

    SELECT DISTINCT firstName, lastName, dept
    INTO #temp2
    FROM #temp
    JOIN Lastnames ON lastNameID = LKey
    JOIN FirstNames ON firstNameID = FKey;

    SELECT COUNT(*) AS [#temp2 rows] FROM #temp2;

    DECLARE @timer DATETIME;
    SET @timer = GETDATE();

    PRINT 'Display Only';
    SET STATISTICS TIME ON;

    SELECT * FROM #temp2
    ORDER BY lastName, firstName, dept;

    SET STATISTICS TIME OFF;
    PRINT DATEDIFF(ms, @timer, GETDATE());

    SET @timer = GETDATE();
    PRINT 'Concatenation';
    SET STATISTICS TIME ON;

    SELECT firstName, lastName,
           STUFF((SELECT ',' + CAST(dept AS VARCHAR(2))
                  FROM #temp2 t2
                  WHERE t2.lastName = t1.lastName
                    AND t2.firstName = t1.firstName -- must match GROUP BY below
                  ORDER BY dept
                  FOR XML PATH('')
                 ),1,1,'') AS [Departments]
    FROM #temp2 t1
    GROUP BY lastName, firstName -- without GROUP BY, multiple rows are returned
    ORDER BY lastName, firstName;

    SET STATISTICS TIME OFF;
    PRINT DATEDIFF(ms, @timer, GETDATE());

    DROP TABLE #temp;
    DROP TABLE #temp2;

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • As with many things, "it depends". In the situation you outlined, manipulating the values in SQL makes sense.

    What you might try, if you can, is duplicating the functionality in a CLR proc and seeing if it's faster than the T-SQL version.
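
    If you do experiment with that, a user-defined CLR aggregate is the natural fit, and the T-SQL side of wiring one up would look roughly like this sketch. The assembly name, file path, and aggregate name are all hypothetical, and the C# body of the aggregate is not shown:

    -- Sketch only: registering and using a hypothetical CLR concatenation aggregate.
    -- Assumes the assembly has been compiled to the path shown and that
    -- CLR integration is enabled on the instance.
    EXEC sp_configure 'clr enabled', 1;
    RECONFIGURE;

    CREATE ASSEMBLY ConcatAgg FROM 'C:\clr\ConcatAgg.dll'
    WITH PERMISSION_SET = SAFE;

    CREATE AGGREGATE dbo.Concat(@value NVARCHAR(4000))
    RETURNS NVARCHAR(MAX)
    EXTERNAL NAME ConcatAgg.Concat;

    -- Usage against the #temp2 test table from the post above:
    SELECT firstName, lastName,
           dbo.Concat(CAST(dept AS VARCHAR(2))) AS Departments
    FROM #temp2
    GROUP BY firstName, lastName;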

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I'm afraid I can't, but I will listen respectfully to anyone who can, if they'd care to report their results. However, I see that as a different issue from doing it at the database level instead of passing it to a UI.

    I can't argue with "It depends." The difference between concatenating and simply passing the raw data is a function of the average number of rows that need to be concatenated, the average size of the data within the column being concatenated, and the combined size of the columns that would otherwise be duplicated. With only two or three rows being concatenated on average, the percentage of "compression" would obviously be less. The same would hold true if the values being concatenated were longer strings.

    Please understand, I'm not arguing that concatenation should always be done at the server, I'm just saying there seem to be instances where it would be the better solution, and more instances where it really doesn't matter.

    Thanks for the feedback, G.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • I'm a fan of letting the UI or business layer handle formatting the data and letting SQL Server just return the data. I know I can do a lot of formatting in T-SQL, and I have at times, but that's not what it is best at, so I try not to do it there. I'm not a hard-liner on either side; I'm more concerned with making it work so that anyone can read it and debug it. In the example you present, I would consider the SQL doing the concatenation to be advanced SQL that most people would not understand, while doing the concatenation in VB or C# would be a simple loop that any novice programmer could read and debug.

    Here are a couple of blog posts. The first argues for doing it in the presentation layer; the second argues that it sometimes belongs in SQL.

    http://sqlblog.com/blogs/jonathan_kehayias/archive/2009/02/19/sql-tip-keep-your-presentation-logic-in-your-presentation-layer.aspx

    http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/02/20/should-i-always-keep-my-presentation-logic-in-my-presentation-layer.aspx

  • Jack, I can see that it would be considered advanced. I never heard of using FOR XML this way until the last couple of months. But having seen it once, the technique is elegantly simple. It is much more intuitive than using a tally table to parse strings (see the sketch below).
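
    For contrast, here is a minimal sketch of a tally-table split (the reverse operation). It assumes a dbo.Tally table with an integer column N starting at 1; the @csv value is just an illustration:

    -- Sketch: splitting a delimited string with a tally table.
    DECLARE @csv VARCHAR(100);
    SET @csv = 'Part01,Part02,Part03';

    SELECT SUBSTRING(@csv, N, CHARINDEX(',', @csv + ',', N) - N) AS Item
    FROM dbo.Tally
    WHERE N <= LEN(@csv)
      AND SUBSTRING(',' + @csv, N, 1) = ','; -- keeps only the N values marking the start of each element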

    I deliberately sidestepped the issue of other people's skills as a consideration in "doing it where it makes sense." Skill levels vary from organization to organization, and for some people it would take hours to get it right in the UI. You write code on both sides, so you can do it wherever it is easiest for you. So again the answer becomes "it depends."

    Again, this is just me trying to understand what other considerations should influence the choice. When people I respect say not to do something, it's wise to listen; but it's also important to understand the "why" behind the general rule. The first post contains a number of examples where indisputably the UI should handle string manipulation. But in the case of building delimited strings out of column values for multiple rows, SQL still looks like the better choice.

    Thanks for the input.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bob,

    Always willing to share my opinion, even when it probably stinks.

    See, I think that it is necessary to consider others' abilities whenever coding something. I want the person who comes behind me, assuming I will eventually move on, to be able to understand my code. I've been the guy coming behind too many times. Certainly if you accurately document your code then it isn't as much of an issue, but I still think it needs to be considered when choosing a solution.

  • Ah... but you are denying them the opportunity to broaden their horizons and expand their minds by saying "What the...? How did he DO that???" 😀

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bob Hovious (2/24/2009)


    Ah... but you are denying them the opportunity to broaden their horizons and expand their minds by saying "What the...? How did he DO that???" 😀

    Another it depends... what if it takes 5 hours for him to figure it out... while it would take you 2 minutes to document it or code it "better".

    That's a big value for your employer too...

  • (Sigh... nobody has a sense of humor anymore.)

    Guys, I've learned to document code out of self-preservation... because I may be the one who has to revisit the code months or years later. When someone tells me "You ought to be able to look at the code and see what it's doing," my response is always "I can see what it's doing; what I need to know is what it is TRYING to do."

    My comment wasn't aimed at the comment about documentation, but at the suggestion that we should code to the lowest common denominator. When better solutions come around, we should pass them along, not avoid them because the next guy might not be familiar with them.

    Thanks again for engaging here. I'm very happy this discussion isn't corrupting the Endless Thread.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The problem with even some small scale concatenation is... people abuse it and don't know what it actually does to the server. Sure, "It Depends"... but few take the time to figure out "What Depends"... 😉

    Test data...

    DROP TABLE JBMTest
    GO
    --===== Create and populate a 1,000,000 row test table.
         -- Column "RowNum" has a range of 1 to 1,000,000 unique numbers
         -- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
         -- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
         -- Column "SomeMoney" has a range of 0.0000 to 99.9999 non-unique numbers
         -- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
         -- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
         -- for all rows.
         -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
         -- Jeff Moden
    SELECT TOP 1000000
           RowNum       = IDENTITY(INT,1,1),
           SomeInt      = ABS(CHECKSUM(NEWID()))%50000+1,
           SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                        + CHAR(ABS(CHECKSUM(NEWID()))%26+65),
           SomeCSV      = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
           SomeMoney    = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
           SomeDate     = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),
           SomeHex12    = RIGHT(NEWID(),12)
      INTO dbo.JBMTest
      FROM Master.dbo.SysColumns t1
     CROSS JOIN Master.dbo.SysColumns t2

    --===== A table is not properly formed unless a Primary Key has been assigned
     ALTER TABLE dbo.JBMTest
       ADD PRIMARY KEY CLUSTERED (RowNum)

    Test code...

    SET STATISTICS TIME ON

    SELECT t1.SomeInt,
           STUFF((SELECT ',' + t2.SomeLetters2
                  FROM dbo.JBMTest t2
                  WHERE t1.SomeInt = t2.SomeInt
                  FOR XML PATH('')),1,1,'') AS Nodes
    FROM dbo.JBMTest t1
    GROUP BY t1.SomeInt

    SELECT SomeInt, SomeLetters2
    FROM dbo.JBMTest

    SET STATISTICS TIME OFF

    Test results...

    (50000 row(s) affected)

    SQL Server Execution Times:
       CPU time = 19281 ms, elapsed time = 29566 ms.

    (1000000 row(s) affected)

    SQL Server Execution Times:
       CPU time = 703 ms, elapsed time = 25840 ms.

    The concatenation used 27.42 times more CPU and, if you consider the display as the I/O, the concatenation was longer in duration as well.

    Even though the concatenation sent about half the number of bytes (because the concatenated operands were so short), which do you suppose is tougher on the server?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks Jeff:

    I realize that I've got a double thread going now, and I just had a light bulb go on when I looked at elapsed time for writing to disk, as opposed to elapsed time for displaying data.

    Essentially, when displaying data I was seeing elapsed times for simple display being double that of concatenation, and I was attributing the bottleneck to SQL Server. That made me want to stick with concatenation, because it was taking 11 seconds to run as opposed to 24. But when I wrote the results to disk (on my laptop), I saw elapsed times similar to yours. So I hit the reset button in my brain.

    Bob

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • I'm still curious as to why the elapsed times for you (displaying the data) are so dramatically different from the times I get running your code. Any thoughts?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bob Hovious (2/25/2009)


    I'm still curious as to why the elapsed times for you (displaying the data) are so dramatically different from the times I get running your code. Any thoughts?

    I have a 7-year-old desktop box with server-quality cache (whatever that means... salesmen!) but with a single P4 1.8GHz CPU, a single gig of RAM, and twin IDE hard drives. When I boot it, I've got TempDB set up for 2 gig on the MDF and a half gig on the LDF. Dunno if that's it or not. I've found that some laptops do the damndest things, especially when it comes to display and execution times.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Gracias.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • One of the recent (i.e. in the last ten years) changes in data warehousing has been a move away from specialised ETL tools like Informatica, Data Stage, etc., and towards letting the RDBMS do the work. This approach has been called ELT (extract, load, transform).

    For smaller data warehouses, this makes sense. There's no need to support two distinct platforms for throwing data around. It's all done in the RDBMS.

    Oracle have a product that does that (sorry, can't remember the name), and we use a product at work named Wherescape RED that does the same over the top of SQL Server 2005.

    So the database does a lot of work in chopping, changing and reformatting data during the staging process. There's no real presentation layer, as the data just gets posted into another database.

    That's a very valid reason to be looking at string operations within an RDBMS, from my perspective.
