SQLServerCentral Article

Performance Tuning: Concatenation Functions and Some Tuning Myths


Introduction:

For one reason or another, you need a "Concatenation Function" that builds, say, a CSV column ("Comedy Separated Values", in this case... no quotes on string values, just commas). Yeah, I know... denormalization and a thousand other bad words come to mind when I say that. Still, someone requires you to build one (perhaps for some type of file output).

You find or design a really good one and run it. Its performance is absolutely terrible. You throw indexes at the problem. You rewrite the function several different ways. Still, the performance is absolutely terrible.

You check forum after forum and they all say the same thing: "Concatenation Functions are performance hogs" and "The use of Concatenation Functions on large record sets is a form of 'Death by SQL'."

The real problem is not the function... it's how you used it.

Example Problem Definition:

Here's an example... you have a table that has multiple columns. One of the columns is some ID and another is some 2-letter code. Your job is to create a script or proc that will find and concatenate (with comma separators) all of the 2-letter codes for each ID, in the order they appear according to a RowNum (third) column in the same table. A single row of output would look something like this...

SomeID CSVString
------ -----------------------------------------------------------------
1034   AA,AB,ZC,DE,FJ,TT,TA,MZ

You also find out that each ID may have around 400 of the 2-letter codes and that there are a million rows in the table.

We'll start off with only 10,000 rows, just to see what happens...

The Test Data:

As always, in order to do performance testing of this nature, we need some data to test on... lots of it. The following code builds the test data we want and contains only the 3 columns mentioned in the "Example Problem Definition" above. We can also control how many rows there are and the "distribution of data" by tweaking a couple of literals (plainly marked in the code).

--===== Create and populate test table.
     -- Column "RowNum" contains unique row numbers
     -- Column "SomeID" contains non-unique numbers 
     --   (should be 1/400th of the row count to allow about 400 concatenated items per SomeID).
     -- Column "SomeCode" has a range of "AA" to "ZZ" non-unique 2 character strings
 SELECT TOP 10000 --<<<LOOK! CHANGE THIS NUMBER TO CHANGE THE NUMBER OF ROWS!
        RowNum   = IDENTITY(INT,1,1),
        SomeID   = ABS(CHECKSUM(NEWID()))%25+1, --<<<LOOK! CHANGE THIS NUMBER TO 1/400th THE ROW COUNT
        SomeCode = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                 + CHAR(ABS(CHECKSUM(NEWID()))%26+65)
   INTO dbo.TestData
   FROM Master.dbo.SysColumns t1,
        Master.dbo.SysColumns t2 
--===== A table is not properly formed unless a Primary Key has been assigned
  ALTER TABLE dbo.TestData
        ADD PRIMARY KEY CLUSTERED (RowNum)

Run that code to create the table called "TestData". It'll create a 10,000 row table with all the data we need for the first test.
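
Before moving on, a quick sanity check never hurts. This little query isn't part of the original test... it just confirms the row count and the rough number of codes per SomeID so you know the data landed the way we intended.

--===== Optional sanity check: row count and average codes per SomeID
 SELECT COUNT(*)                        AS TotalRows,
        COUNT(DISTINCT SomeID)          AS DistinctIDs,
        COUNT(*)/COUNT(DISTINCT SomeID) AS AvgCodesPerID
   FROM dbo.TestData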

The "Concatenation Function":

Just to make sure we're all on the same page for this test, here's the Concatenation Function I built to do the required concatenation...

 CREATE FUNCTION dbo.fnConcatTest (@SomeID INT)
RETURNS VARCHAR(8000) AS 
  BEGIN
        DECLARE @Return VARCHAR(8000)
         SELECT @Return = ISNULL(@Return+',','')+SomeCode
           FROM dbo.TestData
          WHERE SomeID = @SomeID
          ORDER BY RowNum
         RETURN @Return
    END
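
If you want a quick smoke test before running anything big, just call the function for a single ID (SomeID 1 is guaranteed to exist by the way the test data was generated)... you should get back one comma separated string of roughly 400 two-letter codes.

--===== Optional smoke test: build the CSV string for a single SomeID
 SELECT dbo.fnConcatTest(1) AS CSVString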

The "Hog" Test:

Ok... we've got everything we need... let's build some code and just return the output to the GRID mode of Query Analyzer (nope, I don't have 2k5, yet). Let's do it the same way almost everyone else does...

 SELECT DISTINCT SomeID, dbo.fnConcatTest(SomeID) AS CSVString
   FROM dbo.TestData

On my poor old 1.8GHz, 5-year-old test box, that takes about 2 minutes and 50 seconds. During that entire time, my single CPU was slammed into the wall at 100%. Even the performance grid on Windows Task Manager wouldn't update!

Ok... put your "performance tuning hat" on... what's the matter here? The first thing most people would do is add an index and run it again... try it... add the following index and run the code above again.

CREATE INDEX IX_TestData_SomeInt ON dbo.TestData(SomeID)

That did absolutely nothing for performance.

The next thing folks might try is a full "covering" index...

CREATE INDEX IXC_TestData_Cover1 ON dbo.TestData(SomeID,SomeCode,RowNum)

Yee-haa!!! That worked! It reduced the run time from 2:50 to only 37 seconds! What a super time to return... what, 25 rows? For a million rows of a similar nature, where we'd need to return 2,500 such rows, that'll take 100 times the duration, or 3,700 seconds. There are only 3,600 seconds in an hour... we're dead, huh?

Guess they were right about "Concatenation Functions" being real hogs, huh? Ok, we're done here... let's all go home. Can't do anything else...

Well, except maybe do a little figuring...

Figure It Out!

A lot of people simply miss what's going on in the code. Think about it. Can we do anything to the function to make it faster? NO! Can we add some more indexes to make it faster? NO! Are we dead? NO!

Look at the code. What's it doing? It simply calculates the CSV using the function for each ID, right?

WRONG! It's executing the function for all 10,000 rows... not just for the 25 distinct IDs that get returned. Do the math... the function has to "touch" about 400 rows for every execution (about 400 codes for each ID). It has to do it for EVERY row because that's the way we wrote it. The code processes all 10,000 rows and then does a DISTINCT to filter out the dupes. That means the code "touches" 10,000 * 400, or 4 MILLION rows, to return just 25.

Even the estimated and actual execution plans lie! Both say that we're only processing (touching) 10,000 rows... neither one shows the work going on inside the scalar function.
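
If you want to see part of that deception for yourself, turn on the time and I/O statistics and rerun the query. STATISTICS IO reports reads for the outer query only... the reads done inside the scalar function don't show up, so the I/O numbers look innocent. The CPU and elapsed times from STATISTICS TIME, at least, tell the truth.

--===== Optional: run the "hog" version with statistics turned on.
     -- Note that the reads done inside the function are NOT included
     -- in the STATISTICS IO output... only the outer query is reported.
    SET STATISTICS IO ON
    SET STATISTICS TIME ON

 SELECT DISTINCT SomeID, dbo.fnConcatTest(SomeID) AS CSVString
   FROM dbo.TestData

    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF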

It's Our Fault!

Ok... how do we fix the problem? Simple... we obey Rule #1 of performance tuning. Heh... no, no... I'm not talking about buying a faster server! Let's FIX the problem that so many miss. Let's fix the code! Rule #1 of performance tuning (or, at least in the Jeff Moden "Book of Boy Howdy!" it is) says: touch each row that's needed only once, and touch only the rows that are needed.

We want to return a row for each SomeID... there are about 400 occurrences of each of the 25 SomeID's in the table. Let's isolate those, first! Only THEN can we use the function. Here's the full code written without all the "hidden RBAR" the earlier code used...

 SELECT d.SomeID, dbo.fnConcatTest(d.SomeID) AS CSVString
   FROM (--==== Derived table "d" finds the unique SomeID's first
         SELECT DISTINCT SomeID
           FROM dbo.TestData
        )d

Drop the test table and rebuild it, add the single-column index on SomeID again, and then run the code above. You'll see it takes remarkably little time to execute.
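
Just so there's no guesswork, the reset boils down to this (assuming nothing else depends on the test table)... drop it, rerun the 10,000 row population script from the "Test Data" section, and put the single-column index back on.

--===== Reset for the next run (assumes dbo.TestData is only used for these tests)
   DROP TABLE dbo.TestData
GO
--===== ...now rerun the 10,000 row population script from the "Test Data"
     -- section above, then re-create the single-column index...
 CREATE INDEX IX_TestData_SomeInt ON dbo.TestData(SomeID)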

Let's go for the "big one". Drop the test table, change the population code to generate a million rows, and re-create the index. Here's the code...

--===== Create and populate test table.
     -- Column "RowNum" contains unique row numbers
     -- Column "SomeID" contains non-unique numbers 
     --   (should be 1/400th of the row count to allow about 400 concatenated items per SomeID).
     -- Column "SomeCode" has a range of "AA" to "ZZ" non-unique 2 character strings
 SELECT TOP 1000000 --<<<LOOK! CHANGE THIS NUMBER TO CHANGE THE NUMBER OF ROWS!
        RowNum   = IDENTITY(INT,1,1),
        SomeID   = ABS(CHECKSUM(NEWID()))%2500+1, --<<<LOOK! CHANGE THIS NUMBER TO 1/400th THE ROW COUNT
        SomeCode = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                 + CHAR(ABS(CHECKSUM(NEWID()))%26+65)
   INTO dbo.TestData
   FROM Master.dbo.SysColumns t1,
        Master.dbo.SysColumns t2 
--===== A table is not properly formed unless a Primary Key has been assigned
  ALTER TABLE dbo.TestData
        ADD PRIMARY KEY CLUSTERED (RowNum)
--===== Add the necessary index
CREATE INDEX IX_TestData_SomeInt ON dbo.TestData(SomeID)

Now, run the modified code that we built just a minute ago...

 SELECT d.SomeID, dbo.fnConcatTest(d.SomeID) AS CSVString
   FROM (--==== Derived table "d" finds the unique SomeID's first
         SELECT DISTINCT SomeID
           FROM dbo.TestData
        )d

I don't know how fast it runs on your machine, but on mine it took only 25 seconds in grid mode, and you can bet some of that time was chewed up by displaying the 2,500 rows it produced. And, by the way, without the index on the SomeID column, it takes almost 7 minutes to execute. With the full "covering" index, it takes only about 12 seconds.
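
In case you want to try that last timing yourself, the "covering" index is the same one used in the 10,000 row test... just add it to the million row table (or drop IX_TestData_SomeInt first if you only want one index in play).

--===== Optional: the same "covering" index used earlier, now on a million rows
CREATE INDEX IXC_TestData_Cover1 ON dbo.TestData(SomeID,SomeCode,RowNum)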

Another way to do the same thing is...

 SELECT SomeID, dbo.fnConcatTest(SomeID) AS CSVString
   FROM dbo.TestData
  GROUP BY SomeID

Finally, SQL Server 2005 has a few extra tricks available for concatenation... the following takes only about 17 seconds for the million-row example. I normally warn against using correlated sub-queries, but this one just absolutely flies (no UDF overhead)...

 SELECT t1.SomeID,
        STUFF((SELECT ','+t2.SomeCode 
                 FROM dbo.TestData t2 
                WHERE t1.SomeID = t2.SomeID 
                ORDER BY t2.RowNum
                  FOR XML PATH('')),1,1,'') AS CSVString
   FROM dbo.TestData t1
  GROUP BY t1.SomeID
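
If you're the "trust but verify" type, it's easy to confirm that the XML method builds the same strings the UDF does. This optional cross-check returns any SomeID where the two methods disagree... it should come back empty.

--===== Optional cross-check: any SomeID where the two methods disagree
     -- (should return zero rows)
 SELECT d.SomeID
   FROM (SELECT DISTINCT SomeID FROM dbo.TestData) d
  WHERE dbo.fnConcatTest(d.SomeID)
        <> STUFF((SELECT ','+t2.SomeCode
                    FROM dbo.TestData t2
                   WHERE t2.SomeID = d.SomeID
                   ORDER BY t2.RowNum
                     FOR XML PATH('')),1,1,'')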

Conclusion:

There are several lessons to be learned here. The first is that "Concatenation Functions" don't deserve the awful reputation they've earned. More importantly, we've learned that "performance tuning" isn't about buying faster hardware (I have an old and slow machine). It certainly isn't about giving up on the code (or rewriting it as a CLR routine, especially if you only have SQL Server 2000, like I do). It isn't just about which indexes to throw at a problem. And forget about the execution plan... it did nothing for us.

No... it's all about the code, the same thing I've said many times on SQLServerCentral.com. It's about knowing what the code will do, how many rows it will process internally, and getting good at recognizing a code problem. Many huge and terrible performance problems can quickly be made to disappear by rewriting a section or two of set-based code. If you've written it using full RBAR methods, then you're just going to have to bite down hard and rewrite the code. Then you can do a little tuning using indexes.

If you don't think it's worth it, consider what we've just done... we took code and made it run on 100 times more data than we started with in about a sixth of the time (a twelfth with the covering index)... and that was an improvement over what already looked like set-based code!

Pay attention to the code... that's where the real performance gains are.

--Jeff Moden


"Your lack of planning DOES constitute an emergency on my part... SO PLAN BETTER! "

"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row"
