Be Careful with String Concatenations in SQL Server with Big Strings.

Now, we are going to get slightly esoteric here. I'd noticed that the time taken by string concatenation doesn't increase linearly with the size of the string. For a short string it is blindingly fast, but when you scale up, the performance gets disappointing. When you're handling big strings, it is time to change the algorithm. I felt it was time to do a rough test, so the poor old development machine was wheeled out to run a test against a table containing every word of Moby Dick.

So we'll read in the entire text of Moby Dick, split it into words, and then see how long it takes to join them all together again. It takes under a minute to chop it all up (169,683 words, if you're curious).

Considering how simple the operation of splicing Moby Dick back together again is, I was expecting it to be quicker. It wasn't. I was surprised that the regression wasn't linear; it was polynomial (the equation is shown in the graph). What this means is that the operation becomes relatively more expensive as the size of the string increases.

Graph of time taken against the number of string concatenations

Hmm. Time to put it to the test and create a graph. We try joining more and more of Moby Dick together to see if the operation increases linearly with the number of words we join. We start with just a hundred or so words and increase until we have the whole book, and draw a graph of the time taken (in milliseconds) against the number of words concatenated together.

So we construct a simple test to try it out. You've seen some of this code before! I did this with SQL Server 2005, on a very modest development system that I always use for writing time-critical code. It may be different for SQL Server 2008 with lots of memory. I've attached the text file of Moby Dick in case you'd like to experiment.


DECLARE @LotsOfText VARCHAR(MAX)
SELECT @LotsOfText = BulkColumn
FROM   OPENROWSET(BULK 'C:\workbench\moby-dick.txt', SINGLE_BLOB) AS x

/* read each word into a table (we make it a temporary table as we don't want to keep it for ever!) */
CREATE TABLE ##word
  (
   Word_ID INT IDENTITY(1, 1) PRIMARY KEY,
   item VARCHAR(80)
  )

INSERT INTO ##word (item)
       SELECT item
       FROM   wordchop(@LotsOfText)
---48 seconds (214243 words)

CREATE TABLE ##log
  (
   Log_ID INT IDENTITY(1, 1),
   Event VARCHAR(30),
   param INT,
   InsertionDate DATETIME DEFAULT GETDATE()
  )

DECLARE @command VARCHAR(255),
        @ActualCommand VARCHAR(255)
SELECT @command = '
DECLARE @String VARCHAR(MAX)
insert into ##log (event,param) Select ''concatenate || rows'',||
SET ROWCOUNT ||
SELECT @String = COALESCE(@String,'''') + item from ##word'

DECLARE @ii INT
SELECT @ii = 1000
WHILE @ii < 200000
  BEGIN
     SELECT @ActualCommand = REPLACE(@command, '||', CONVERT(VARCHAR(8), @ii))
     EXECUTE (@ActualCommand)
     SELECT @ii = @ii + 1000
  END

SELECT event,
       [rows] = param,
       [Time (Ms)] = DATEDIFF(ms, [InsertionDate],
                              (SELECT [InsertionDate]
                               FROM   ##log f
                               WHERE  f.Log_ID = g.Log_ID + 1))
FROM   ##log g
WHERE  Log_ID < (SELECT MAX(Log_ID) FROM ##log)


Stop press! Already, Barry is onto something very useful. I haven't worked out a way of putting a graph into a comment, so I'll add this to the original post. I've now re-run the tests, comparing a simple concatenation with Barry's 'rollover 2^k' technique. Barry's trick wins hands-down, and shows significant gains even in modest concatenation operations.

Second Test Run

 


Cor. This is fun. I just sneaked off and ran Peter's version, and his is the clear winner. Even the titanic act of joining together the entire text of Moby Dick took only half a second (406 ms, actually). Here is the combined chart, which shows the huge improvement possible in a single SQL statement!

 The Third Test

 

Comments

Posted by Simon Sabin on 16 February 2009

Have you tried with the XML string concatenation trick?

Posted by peter on 18 February 2009

Generic repetitive string concatenation always shows this kind of behavior. This is regardless of the environment it happens in, be it C/C++, C#, ColdFusion, JavaScript, VBScript, Visual Basic, etc.

It really doesn't matter which one, as at the core of it lies the fact that a string requires a contiguous, non-fragmented piece of memory to be kept in. As the string grows, so does the amount of memory required to store it. And because the memory needs to be unfragmented, it cannot always grow 'in place', as other data may be stored in the memory just after that of the growing string.

So what happens when we grow our string and it cannot grow in place? Right: a new block of memory big enough to store the result must be found, and then both the original and the addition are copied to that new place.

The previously allocated memory is freed up and reused for other data, but, and here it comes... not for our string, as the memory gap it leaves behind is always too small to suit the ever-growing string.

The result is not only a lot of copying that does not contribute to the final outcome, but also a staggering increase in memory usage. Often this memory sticks around long after the final concatenation is done, the result has been flushed to disk or elsewhere, and the memory for the final string has been freed.
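To make the pattern concrete, here is a minimal T-SQL sketch of that naive repeated-append approach (the loop count and text are made up purely for illustration):

-- naive repeated append: each pass may have to copy everything built so far
DECLARE @s VARCHAR(MAX), @i INT
SELECT @s = '', @i = 1
WHILE @i <= 10000
  BEGIN
     SELECT @s = @s + 'word '   -- the string built so far may be copied on every append
     SELECT @i = @i + 1
  END
SELECT DATALENGTH(@s)   -- 50000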

In lower-level languages like C++, where strings are dealt with using classes that manage the memory buffer for you, you can often specify by how much these buffers grow automatically, or even grow them to a specific size BEFORE you start the task of repetitive concatenation.

In languages lacking this control (scripting languages), there is a trick that gets close:

1. Store all string parts in an array.

2. Use a built-in function to concatenate the array in one step to a new string.

The first effect of this is that, at first, there are loads of small strings in memory, so even existing gaps in memory can be reused by these strings. The second effect is that the built-in functions are generally smart enough to first examine the length of all the string parts, then reserve memory for the cumulative size, and finally copy all of them into the allocated memory. This cuts out a tremendous amount of unneeded copying and memory allocation.

Then there are languages like Java, where strings are immutable and the old version cannot, by this very definition, grow. These languages offer specialised buffer classes that act like the solution found in lower-level languages. The scripting-language technique is always an option here as well.

Where does SQL fit in?

IF strings ARE NOT immutable, then building a string of the required size in one step and then using STUFF to copy the data into it in place has the potential to be the fastest method. It depends heavily on a smart implementation of the STUFF function, in that it should do no reallocation if the replaced part equals the inserted part in length. It would then require allocating memory only once, and the copying work is no more than the cumulative length of all the parts.
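A rough sketch of that idea against the ##word table from the test harness above (illustration only: it relies on left-to-right assignment within the SELECT, which is observed rather than documented behaviour, and T-SQL's STUFF returns a new string rather than writing in place, so this is not actually fast):

-- pre-allocate a buffer of spaces, then STUFF each word in at a running offset
DECLARE @buffer VARCHAR(MAX), @offset INT
SELECT @buffer = REPLICATE(CAST(' ' AS VARCHAR(MAX)), 2000000), @offset = 1
SELECT @buffer = STUFF(@buffer, @offset, LEN(item), item),
       @offset = @offset + LEN(item)
FROM   ##word
ORDER BY Word_ID
SELECT LEFT(@buffer, @offset - 1)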

Watch out with a procedural approach to implementing this, as the *stupendous* statement overhead of each line of code will kill most of what you can possibly gain.

IF, under the hood, the strings in SQL ARE immutable, then XML concatenation using for xml path('') is probably the only method, as it is optimized for large strings and works very much like my description of the Java StringBuffer.

Also, execution behavior might differ a lot between SQL 2005 and SQL 2008, given the new += operator. To the untrained eye this might seem like sugar coating, but it does in fact signify more.

With the += operator it is clear that an existing variable is being manipulated to grow. With the more traditional + operator, there is an independent result first, which is then assigned to a variable.

This can mean a world of difference as the + operator requires extra *external* smartness to create the conditions for optimizations that the += operator gets by default.
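For reference, the two forms side by side (the += form needs SQL Server 2008 or later; whether the engine actually optimizes it differently is exactly what would need testing):

DECLARE @String VARCHAR(MAX)
SELECT @String = 'start'
-- traditional form: an independent result is computed, then assigned back
SET @String = @String + ', and some more'
-- SQL Server 2008 compound assignment: the variable itself is the target of the append
SET @String += ', and a bit more'
SELECT @String   -- start, and some more, and a bit more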

I will look into it at some point; suffice to say it is a very well understood problem. However, people have traditionally turned a blind eye to it and seem not to know or realize it.

Posted by Phil Factor on 18 February 2009

Peter. That's an exciting thought. I'll re-run the tests with the += operator and see if there is a difference. Now that the test harness is in place, we can try all sorts of things. If we can make the behaviour more linear, it will help make a lot of interesting operations practical.

I was wondering about reviving the age-old technique of creating a large block of space characters in one go and then using STUFF to write the characters in incrementally.

At this stage, I'm just trying to prove that the problem exists.

Posted by RBarryYoung on 18 February 2009

It has long been known that naive linear string concatenation is an O(n^2) operation (more specifically, it's triangular).  That simple fact is one of the big reasons that .NET has the StringBuilder class (which uses the strings->array->mass_concatenate trick that peter alludes to).

As it happens, I spent quite a lot of time last month investigating this problem and potential solutions and let me just say:  it's tough in SQL.  Here are the problems:

1) SQL Server apparently (based on my tests) already does the "buffer-extension" trick available to mutable strings. Unfortunately, the extension trick does NOT solve the O(n^2) problem; it just partially alleviates it at the lower end, because some percentage (k) of appends can be extensions instead of creating a new string. Effectively it changes the n*(n+1)/2 cost of the naive implementation to n + n*(n-1)/(2*k), where k is that percentage. It's better, but it's still O(n^2).

2) The "array & mass-concatenate" trick is not available to T-SQL, not because SQL doesn't have arrays (tables serve the same purpose), but because AFAIK, there is no function that can take a variable "collection" of strings (table, array, whatever) and produce an output string.

3) The "pre-allocate and Stuff" trick popular with mutable strings is not workable in T-SQL because the STUFF()  function in T_SQL is NOT like the function of the same name in some general purpose languages: the T-SQL STUFF() is an RHS (right-hand side) function and NOT an LHS (left-hand side) function.  AFAIK, there is no function in SQL that can (physically) write into a pre-existing string.

That all said, I did eventually find a way to do it.  Not in the O(n) time (linear) achievable in most general purpose languages, but I could get it down to O(n*Log(n)) time which is still a huge improvement.  Unfortunately, I do not have it together in a presentable form and it would take me some time to do so (a day or so), but I can get it to you if you want.

Posted by Anonymous on 18 February 2009

Just a quick note on string concatenation: Concatenating (appending) a table of many small strings together into one big string is easy in SQL Server, but slow. Fixing it is hard f ...

Posted by RBarryYoung on 18 February 2009

Phil: Where is the Moby Dick file attachment?  I cannot seem to find it here.

Posted by Phil Factor on 18 February 2009

Barry,

Moby Dick is  an attachment to

www.sqlservercentral.com/.../Never-say-never-to-the-while-loop.aspx

along with the word-splitting function.

I'm sorry, I thought I'd attached it to this blog but I failed to push the right button.

I'd be thrilled to see a solution to this problem

Posted by RBarryYoung on 18 February 2009

I cannot seem to download it from the "Moby-Dyck.zip" link there either. If I try to save the link, it saves a small file (210 bytes), then when I try to open it, it says that it is not a valid archive.

Posted by RBarryYoung on 18 February 2009

OK, since I couldn't get a copy of Moby Dick, I made my own test data.  Here is your test rig Phil, with my test data and my technique, followed by your original approach.  Needless to say, my method is nearly linear, but the simple approach grows geometrically:

Drop Table ##word
go
Drop Table ##log
go

/* read each word into a table (we make it a temporary table as we don't want to keep it for ever!) */
CREATE TABLE ##word
  (
   Word_ID INT IDENTITY(1, 1) PRIMARY KEY,
   item VARCHAR(80)
  )

INSERT INTO ##word (item)
       SELECT TOP (200000) c1.name
       FROM   sys.system_columns c1, sys.system_columns c2
---7 seconds (200000 words)

CREATE TABLE ##log
  (
   Log_ID INT IDENTITY(1, 1),
   Event VARCHAR(30),
   param INT,
   InsertionDate DATETIME DEFAULT GETDATE()
  )

--My version
DECLARE @command VARCHAR(MAX),
        @ActualCommand VARCHAR(MAX)
SELECT @command = '
Declare @i int
Declare @String Varchar(MAX)
Declare @S02 Varchar(MAX)
Declare @S04 Varchar(MAX)
Declare @S06 Varchar(MAX)
Declare @S08 Varchar(MAX)
Declare @S10 Varchar(MAX)
Declare @S12 Varchar(MAX)
Declare @S14 Varchar(MAX)
Declare @S16 Varchar(MAX)
Declare @S18 Varchar(MAX)
Select @S02='''', @S04='''', @S06='''', @S08='''', @S10='''',
       @S12='''', @S14='''', @S16='''', @S18='''', @i = 0
insert into ##log (event,param) Select ''rollover 2^k || rows'',||
Select Top (||) @i = @i+1
, @S18 = CASE When @i%263072=0 Then @S18+@S16+@S14+@S12+@S10+@S08+@S06+@S04+@S02+@String+item Else @S18 End
, @S16 = CASE When @i%263072=0 Then '''' When @i%65768=0 Then @S16+@S14+@S12+@S10+@S08+@S06+@S04+@S02+@String+item Else @S16 End
, @S14 = CASE When @i%65768=0  Then '''' When @i%16384=0 Then @S14+@S12+@S10+@S08+@S06+@S04+@S02+@String+item Else @S14 End
, @S12 = CASE When @i%16384=0  Then '''' When @i%4096=0  Then @S12+@S10+@S08+@S06+@S04+@S02+@String+item Else @S12 End
, @S10 = CASE When @i%4096=0   Then '''' When @i%1024=0  Then @S10+@S08+@S06+@S04+@S02+@String+item Else @S10 End
, @S08 = CASE When @i%1024=0   Then '''' When @i%256=0   Then @S08+@S06+@S04+@S02+@String+item Else @S08 End
, @S06 = CASE When @i%256=0    Then '''' When @i%64=0    Then @S06+@S04+@S02+@String+item Else @S06 End
, @S04 = CASE When @i%64=0     Then '''' When @i%16=0    Then @S04+@S02+@String+item Else @S04 End
, @S02 = CASE When @i%16=0     Then '''' When @i%4=0     Then @S02+@String+item Else @S02 End
, @String = CASE When @i%4=0 Then '''' Else @String + item End
From ##word
Order By Word_id
Select @String = @S18+@S16+@S14+@S12+@S10+@S08+@S06+@S04+@S02+@String'

DECLARE @ii INT
SELECT @ii = 1000
WHILE @ii < 200000
  BEGIN
     SELECT @ActualCommand = REPLACE(@command, '||', CONVERT(VARCHAR(8), @ii))
     --print @ActualCommand
     EXECUTE (@ActualCommand)
     SELECT @ii = @ii + 1000
  END

--Phil's version
SELECT @command = '
DECLARE @String VARCHAR(MAX)
insert into ##log (event,param) Select ''concatenate || rows'',||
SET ROWCOUNT ||
SELECT @String = COALESCE(@String,'''')+item from ##word Order by Word_id'

SELECT @ii = 1000
WHILE @ii < 200000
  BEGIN
     SELECT @ActualCommand = REPLACE(@command, '||', CONVERT(VARCHAR(8), @ii))
     EXECUTE (@ActualCommand)
     SELECT @ii = @ii + 1000
  END

SELECT event,
       [rows] = param,
       [Time (Ms)] = DATEDIFF(ms, [InsertionDate],
                              (SELECT [InsertionDate]
                               FROM   ##log f
                               WHERE  f.Log_ID = g.Log_ID + 1))
FROM   ##log g
WHERE  Log_ID < (SELECT MAX(Log_ID) FROM ##log)

Posted by peter on 19 February 2009

Try this, as Simon Sabin suggested (I assume a populated words table).

declare @total varchar(max)
select @total = stuff( ( select cast( ',' as varchar(max) ) + item from ##word order by word_id for xml path( '' ) ), 1, 1, '' )
select len( @total ), @total

As you can see, the XML logic in SQL Server is very much optimized to work with large strings and concatenation. On the face of it, it performs like the equivalent of a large buffer that is reallocated at most a few times, growing by a certain percentage of its current size (which is fine when adding a lot of small strings).

The original code and RBarryYoung's optimized version both use a pseudocursor for the actual addition. This is horribly slow when you compare it with the for xml path('') trick.

What to remember from this is that in SQL Server you should not use pseudocursors or solutions that depend on looping procedurally when you can avoid it. Each statement in SQL has a great deal of overhead, which, when working with sets, is amortised over the whole operation. But when working with iterative solutions, like loops with some operation inside them, this overhead is absolutely killing. This is the main reason the SET movement in SQL is so strong and nearly always proven right within its own domain!

Take this code:

declare @x int
set @x = 1
while @x <= 10000000 begin
   set @x = @x + 1
end

This increments a counter 10 million times. On my server this pretty much empty loop takes anywhere from 10 to 15 seconds to execute! Do the same in compiled C++ and you are likely not even to notice it executing, even without optimizations turned on!

In SQL some of the rules of programming change, and our time is better spent learning the rules and tricks well in order to get fast code in this environment. Every complex iterative solution, however smart, will be handicapped by the statement overhead native to the SQL Server implementation.

Posted by Phil Factor on 19 February 2009

OK: I've added the test run I did with Barry's code. The difference is pretty startling. Barry's technique wins hands-down. I'll try Peter's technique as soon as I get a moment. The graph has been put in the body of the original blog post.

Posted by peter on 19 February 2009

The code I posted adds comma separators; without them it is simpler, like this:

declare @total varchar(max)
select @total = ( select cast( item as varchar(max) ) from ##word order by word_id for xml path( '' ) )
select len( @total ), @total

Posted by RBarryYoung on 19 February 2009

Thanks Phil, let me know if you want an explanation for it (I think that it's fathomable by most folks with a CompSci background, but I've been wrong about that before :) ).

Simon and Peter are right: I had forgotten about the XML trick (a weak area for me anyway). It is basically approach (2), "array & mass-concatenate", made even better because the "array" can be the original table. If SQL XML is implemented well internally, it should be dead linear in its performance profile.

One other thing: technically you need to include the ORDER BY on these, as Peter and I did. Unlike UPDATE pseudocursors, SELECT pseudocursors CAN ensure their order, because they can use the only mechanism in T-SQL that is documented and supported to guarantee order: an ORDER BY on the outermost SELECT of a query.

Posted by RBarryYoung on 19 February 2009

I found an error in my code: the @STRING variable is not initialized to '', which causes the results to be NULL.  Surprisingly, fixing this changes the results very little.

Posted by Phil Factor on 19 February 2009

I've just added Simon and Peter's version into the test harness, and it is the clear winner. Even when you join the entire text of Moby Dick together from a table, it takes only 406 milliseconds, as compared with 2,046 ms for Barry's version and 40,173 ms for the simple concatenation technique. Many thanks to all of you for your wonderful contributions. Any advance on 406 ms for the test? I've added the graph in the body of the article as before.

Posted by RBarryYoung on 19 February 2009

Thanks Phil, nice test-rig, by the way.

However, I feel compelled to correct the numerous mistaken claims made by a previous poster with respect to pseudocursors and FOR XML. First of all, pseudocursors are not "horribly slow" when compared to FOR XML queries. Rather, a pseudocursor is slower for this particular task because it is poorly suited to it and has to implement a relatively inefficient algorithm, as I previously noted, while the FOR XML query is doing something that it is particularly well suited to. Take out the eight case statements and a pseudocursor has no trouble keeping up.

Secondly, the poster seems to be confusing pseudocursors with actual explicit cursors and loops. Unlike explicit cursors, pseudocursors do not "rely on looping procedurally", because they are not procedural. Procedures are multiple statements executed in a sequence. Pseudocursors are a single set-based statement that is order-dependent. Order-dependent processing is not automatically procedural, and certainly not in this case. (Since sets with order are formally called "Lists", it would be more proper to call this an example of List-based processing.)

Finally, since pseudocursors do not execute statements in a loop any more than FOR XML does, it is incorrect to say that their statement overhead is "absolutely killing". If you take the explicit loop example posted and adapt it to both pseudocursors and FOR XML, you get completely different results:

select convert(varchar(25), getdate(), 121)
--=======
declare @x int
set @x = 1
while @x <= 10000000 begin
   set @x = @x + 1
end
select @x as Procedural, convert(varchar(25), getdate(), 121)
--======
set @x = 1
Select TOP 10000000 @x = @x + 1
From   master..syscolumns c1, master..syscolumns c2
select @x as Pseudocursor, convert(varchar(25), getdate(), 121)
--======
DECLARE @STRING VARCHAR(MAX)
select @STRING = (Select TOP 10000000 1
                  From master..syscolumns c1, master..syscolumns c2
                  for xml path(''))
select len(@STRING) as FOR_XML, convert(varchar(25), getdate(), 121)
--======

Of course, now it's the FOR XML that has to use an inefficient algorithm, making it horribly slow.

Posted by peter on 20 February 2009

RBarryYoung,

You misinterpreted my notions that pseudocursors and iterative processing are slow; they are separate issues, whereas you think I mean that pseudocursors are procedural.

What I wrote was:

"you should not use pseudocursors **OR** solutions that depend on looping procedurally when you can avoid it. Each statement in SQL has a great deal of overhead, which, when working with sets, is amortised over the whole operation. But when working with iterative solutions, like loops with some operation inside them, this overhead is absolutely killing."

I followed this with demo code to show just how slow it actually is. I felt it was worth noting, as there was use of loops in the test code; even though the innermost code being benchmarked did not use a loop itself, the loop was still in the benchmark timings.

As for your reasoning and demo code on pseudocursors, I have never seen them used outside the field of concatenating strings before. And this is exactly where your demo code does not make sense to me: the pseudocursor solution increments a counter, while the [for xml path('')] solution has to process strings without any actual counting. You simulate the counting by building a large string and then doing a LEN() on it. That is just insane for the purpose of being insane and has nothing to do with the algorithms used by SQL Server.

The counting with a pseudocursor is also quite questionable, as my @x=@x+1 was merely meant to be a classic iterative loop construct where the purpose is not the counting itself, as it is in your example. Why this is important becomes clear in light of your statement:

"Take out the eight case statements and a pseudocursor has no trouble keeping up."

This seems to imply that a pseudocursor construct that does not deal with CASE **expressions** has no real problem keeping up with for xml path('').

But then it is not doing the same job any more, is it? If you strip out the CASE expressions, you end up with the initial concatenation code in this article, which has been shown not to be competitive at all.

One of the issues with pseudocursors is that you need variables for them to work, which implies multi-statement constructs. This limits their use, as they cannot be part of inline table-valued functions, for example. Thus, even if pseudocursors themselves are fast, using them forces you into an environment where encapsulating logic makes everything go slower, which in turn makes them ill-suited for real development; especially in the light of string concatenation (their main use, as far as I have seen), which has a faster alternative without these drawbacks.
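As a rough sketch of that point (the function name is made up, and it uses a permanent dbo.word table shaped like ##word, since functions cannot reference temporary tables), the FOR XML approach can be wrapped in an inline table-valued function, which a variable-based pseudocursor cannot:

-- illustration only: FOR XML concatenation wrapped in an inline table-valued function
CREATE FUNCTION dbo.AllWords ()
RETURNS TABLE
AS
RETURN
(
    SELECT AllText = (SELECT item + ' '
                      FROM   dbo.word
                      ORDER BY Word_ID
                      FOR XML PATH(''))
)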

Posted by RBarryYoung on 20 February 2009

Either you were talking about pseudocursors or you were not, you can't have it both ways.  If you were not talking about pseudocursors then your entire argument was irrelevant because Phil & I were not using (or even suggesting) anything that you were talking about.  At best this is a straw man.  And if you have never seen SELECT pseudocursors used for anything other than string-concatenation, then I don't think that you should be making the sweeping generalization that they should not be used.

As for the rest of your argument, it has the same misunderstanding as your original post: it seems to be unable to distinguish between general techniques and specific implementations that may have to deal with the limitations of those techniques in certain applications.  The point is, it is not the general technique that makes the difference in these two cases (pseudocursors vs. FOR XML) but the specific implementations that are available to them that causes the performance difference.

Clearly, FOR XML is well suited to large string construction, but when you take a completely different example (such as your own), it performs poorly, simply because it has to use an inefficient algorithm.

As for the final statement: I agree, where possible and equivalent, queries that can be used in Views and inline Table-Valued functions are preferable.  Nonetheless I note that everyone that I know of does eventually use stored procedures and variables in SQL Server.

Posted by peter on 20 February 2009

RBarryYoung,

I did prepare most of a reply that I was going to submit, but decided not to post it. At some point I felt it became too personal, with me pointing out errors and parading statements where you imply things about me. It would become a meta-discussion from here on and not something worthy of this blog. Suffice to say, we don't understand each other well enough and disagree on how to view the playing field, and sometimes even on how to describe things.

We are sure to meet in another discussion sometime and hopefully not get stuck in this sort of thing.

Posted by bnordberg on 29 March 2009

I just finished concatenating ~1.3 billion rows of text data. Each row contained text up to 255 characters long. I had to do a REPLACE on all the Unicode characters inside the rows to get the FOR XML to work. But even with a nest of 12 REPLACEs, the query was able to concatenate ~1.3 billion rows into 45 million newly concatenated rows (held in a varchar(max) field) in 2 hours 15 minutes. So I would agree the FOR XML PATH method is blazing fast compared to other methods.
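For what it's worth, the escaping of characters such as & and < (which FOR XML PATH entitizes) is commonly handled by adding the TYPE directive and extracting the value, rather than nesting REPLACEs; a minimal sketch against the ##word table from the harness (it does not help with characters that are invalid in XML, which may still need a REPLACE):

DECLARE @total VARCHAR(MAX)
SELECT @total = (SELECT item + ' '
                 FROM   ##word
                 ORDER BY Word_ID
                 FOR XML PATH(''), TYPE).value('.', 'varchar(max)')
SELECT LEN(@total)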

Posted by karthikeyan on 4 July 2009

- Can you give me a real-world example of when you would want to concatenate masses of strings in a database/table/row/column?

- SQL is a Structured Query Language, for querying databases; it is not a programming language.

Posted by Phil Factor on 4 July 2009

I use string concatenation for a number of database tasks. I warned at the beginning of the article that this was getting quite esoteric. However, this sort of task is occasionally required. If you've never had to do this sort of thing then there is no need to worry about it. I'm using it at this precise moment because I am processing and interpreting a data feed of JSON strings into relational data, but I wouldn't imagine you'd ever want to do that. The reason I became particularly interested was that I was revising my prettifier, which you will absolutely hate if you think it is wrong to use SQL as a programming language even if you need to for a particular purpose.
