• Pavel Pawlowski (9/26/2010)


    But the situation differs with an increased number of elements. I tested it on 1333 elements (nearly 8k). I stopped the Tally table version after 19 minutes of execution time.

    Do you see the two errors we've both made that caused the Tally Table to look bad? 😉

    The Tally Table works by joining to the data at the character level. So the first performance error comes from what happens when you join two columns of different datatypes: NVARCHAR has higher datatype precedence than VARCHAR, so the VARCHAR side is implicitly converted on every comparison. The data here is NVARCHAR and the Tally Table splitter is based on VARCHAR.
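
    To see the precedence rule in action, here's a minimal sketch (the variable names are mine, not part of any test code):

    --===== Datatype precedence sketch: a mixed VARCHAR/NVARCHAR expression
         -- resolves to NVARCHAR, so the VARCHAR side is the one converted.
    DECLARE @V VARCHAR(10)  = 'abc',
            @N NVARCHAR(10) = N'abc';
     SELECT SQL_VARIANT_PROPERTY(CAST(@V + @N AS SQL_VARIANT), 'BaseType') AS ResultType; --Returns 'nvarchar'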

    The second "error" is caused by the myth than an iTVF always works as well at high row-counts as it does at low row-counts. Because of the join at the character level, you actually end up with a much larger cross join between the Tally Table and the data than required for the larger column counts because the optimizer can "see" the splitter code instead of not being able to see it as if it were a scalar function. In other words, the optimizer makes a bad mistake here. There are some fixes you can add but let’s not do that for the moment. I will, however, include mlTVF code to do a Tally Table split below.

    So... if we do what we would do when tuning any query for performance, we need to make the datatypes match for the join and we need to change to a NON-iTVF to get the performance the Tally Table is famous for, even at larger numbers of CSV elements.
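
    The datatype fix is dirt simple at the call site. A quick sketch (the table and column names are placeholders for whatever your data generator builds): cast the NVARCHAR data to VARCHAR before it reaches the splitter, and call the mlTVF included below instead of the iTVF:

     SELECT csv.RowNum, split.ItemNumber, split.ItemValue
       FROM dbo.CsvTest csv
      CROSS APPLY dbo.Split8KTallyM(CAST(csv.CsvParameter AS VARCHAR(8000)), ',') split;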

    In my tests, the only change I made to the data generator was to switch between creating an NVARCHAR column and a VARCHAR column, and then I ran the tests at 10, 100, and 1333 elements for each datatype. I also threw in the NON-iTVF function to show that the problem really isn't with the Tally Table... the problem is what the optimizer chose to do with it all, and we had to do a little tuning just like we would with any query...
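
    For clarity, the change amounts to one line of DDL in the generator (the table and column names here are placeholders, not the actual test harness):

    CREATE TABLE dbo.CsvTest
            (
            RowNum       INT IDENTITY(1,1),
            CsvParameter VARCHAR(8000) --Swap to NVARCHAR(4000) for the Unicode test runs
            );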

    Here are the results from the tests. As you can see, "It Depends" and "Tuning Prevails". They clearly demonstrate that the iTVF is very good (in this case) at very low element counts but loses its mind at higher element counts, even compared to XML splitters, especially when the datatypes don't match. They also clearly demonstrate that the mlTVF for the Tally Table is slow at very low element counts and blows the doors off both the iTVF and the XML method at the larger element count you did your testing with.

    So, it’s not the Tally Table split that’s the problem… it’s how it’s used that’s the problem and, just like any other code in SQL, there are some best practices to follow, including the matching of datatypes in joins.

    Oh yeah... almost forgot... here's the mlTVF I used...

    --===== Tally Table (Split8KTallyM mlTVF) ===============================================================================
    IF OBJECT_ID('dbo.Split8KTallyM') IS NOT NULL
            DROP FUNCTION dbo.Split8KTallyM;
    GO
    CREATE FUNCTION dbo.Split8KTallyM
            (@Parameter VARCHAR(8000), @Delimiter VARCHAR(1))
    RETURNS @Result TABLE (ItemNumber INT, ItemValue INT) AS
    BEGIN
            INSERT INTO @Result
                    (ItemNumber, ItemValue)
             SELECT CAST(ROW_NUMBER() OVER (ORDER BY N) AS INT) AS ItemNumber,
                    SUBSTRING(@Parameter, N, CHARINDEX(@Delimiter, @Parameter + @Delimiter, N) - N) AS ItemValue --SUBSTRING returns VARCHAR; it's implicitly converted to INT on the INSERT because the test CSVs contain integers
               FROM dbo.Tally
              WHERE N BETWEEN 1 AND LEN(@Parameter) + 1
                AND SUBSTRING(@Delimiter + @Parameter, N, 1) = @Delimiter; --Notice how we find the comma: prepending the delimiter makes each matching N the start of an element in @Parameter
            RETURN;
    END;
    GO
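
    Usage is the same as for any other splitter; here's a quick sketch against a single variable (CROSS APPLY against a table of CSVs works the same way):

    DECLARE @SomeCsv VARCHAR(8000) = '1,5,7,13';
     SELECT ItemNumber, ItemValue
       FROM dbo.Split8KTallyM(@SomeCsv, ',');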

    And, no... I didn't take the time to test an NVARCHAR version of the Tally Table code here. I'm saving that for the article. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)