Convert CSV values in three columns to rows

Jeff Moden
Pavel Pawlowski (9/26/2010)
But the situation differs with an increased number of elements. I tested it on 1333 elements (nearly 8K). I stopped the Tally table version after 19 minutes of execution time.



Do you see the two errors we've both made that caused the Tally Table to look bad? ;-)

The Tally Table works by joining to the data at the character level. So the first performance error comes from what happens when you join two columns of different datatypes: the data is NVARCHAR while the Tally Table splitter is based on VARCHAR, so every comparison pays for an implicit conversion.
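To make that fix concrete, here's a minimal sketch (it assumes the CsvTest table and Split8KTallyM function that appear later in this thread): cast the NVARCHAR column once per row so the splitter's internal character-level join compares VARCHAR to VARCHAR instead of converting implicitly.

--===== Hedged sketch: match the splitter's VARCHAR parameter up front
--      (CsvTest and Split8KTallyM are defined later in this thread).
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT @RowNum     = D.RowNum,
       @ItemNumber = V.ItemNumber,
       @ItemValue  = V.ItemValue
  FROM dbo.CsvTest D
 CROSS APPLY dbo.Split8KTallyM(CAST(D.CsvParameter AS VARCHAR(8000)), ',') V;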

The second "error" is caused by the myth that an iTVF always works as well at high element counts as it does at low ones. Because the join happens at the character level, and because the optimizer can "see" into an iTVF (unlike a scalar or multi-statement function), it builds a much larger cross join between the Tally Table and the data than the larger element counts actually require. In other words, the optimizer makes a bad mistake here. There are some fixes you can add, but let's not do that for the moment. I will, however, include mlTVF code to do a Tally Table split below.
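For reference, the iTVF form of the same splitter is just the body of the mlTVF below written as a single inline SELECT. This reconstruction is illustrative (the name Split8KTallyI is mine), not necessarily the exact function used in the earlier tests:

--===== Illustrative iTVF form (hypothetical name Split8KTallyI) =============
--      Being inline, its SELECT is expanded into the calling query, which is
--      exactly what lets the optimizer mis-plan the character-level join.
CREATE FUNCTION dbo.Split8KTallyI
        (@Parameter VARCHAR(8000), @Delimiter VARCHAR(1))
RETURNS TABLE AS
RETURN
 SELECT CAST(ROW_NUMBER() OVER (ORDER BY N) AS INT) AS ItemNumber,
        SUBSTRING(@Parameter, N, CHARINDEX(@Delimiter, @Parameter + @Delimiter, N) - N) AS ItemValue
   FROM dbo.Tally
  WHERE N BETWEEN 1 AND LEN(@Parameter) + 1
    AND SUBSTRING(@Delimiter + @Parameter, N, 1) = @Delimiter;
GO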

So... if we do what we would do when tuning any query for performance, we need to make the datatypes match for the join, and we need to change to a NON-iTVF to get the performance the Tally Table is famous for, even at larger numbers of CSV elements.

In my tests, the only change I made to the data generator was to switch between creating an NVARCHAR column and a VARCHAR column, and then I ran the tests at 10, 100, and 1333 elements for each datatype. I also threw in the NON-iTVF function to show that the problem really isn't the Tally Table... the problem is what the optimizer chose to do with it all, and we had to do a little tuning just like we would with any query...
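(For concreteness, the generator change is just the datatype in the final CAST of the assembled CSV string; a hedged illustration with a toy value:)

--===== Hedged illustration: the only change between test runs is the datatype
--      of the final CAST. NVARCHAR tops out at 4000 characters, VARCHAR at 8000.
DECLARE @Csv VARCHAR(8000), @NCsv NVARCHAR(4000);
SELECT @Csv  = CAST(STUFF(',1,2,3', 1, 1, '') AS VARCHAR(8000));    -- VARCHAR run
SELECT @NCsv = CAST(STUFF(N',1,2,3', 1, 1, N'') AS NVARCHAR(4000)); -- NVARCHAR run
SELECT @Csv AS VarcharCsv, @NCsv AS NvarcharCsv;                    -- both: 1,2,3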

Here are the results from the tests. As you can see, "It Depends" and "Tuning Prevails". It clearly demonstrates that the iTVF is very good (in this case) at very low element counts but loses its mind, even to XML splitters, at higher element counts, especially when the datatypes don't match. It also clearly demonstrates that the mlTVF for the Tally Table is slow at very low element counts and blows the doors off both the iTVF and the XML method at the larger element count you did your testing with.

So, it's not the Tally Table split that's the problem... it's how it's used that's the problem, and, just like any other code in SQL, there are some best practices to follow, including the matching of datatypes in joins.


Oh yeah... almost forgot... here's the mlTVF I used...
--===== Tally Table (Split8KTallyM mlTVF) ===============================================================================
IF OBJECT_ID('dbo.Split8KTallyM') IS NOT NULL
    DROP FUNCTION dbo.Split8KTallyM;
GO
CREATE FUNCTION dbo.Split8KTallyM
        (@Parameter VARCHAR(8000), @Delimiter VARCHAR(1))
RETURNS @Result TABLE (ItemNumber INT, ItemValue INT) AS
BEGIN
 INSERT INTO @Result
        (ItemNumber, ItemValue)
 SELECT CAST(ROW_NUMBER() OVER (ORDER BY N) AS INT) AS ItemNumber,
        SUBSTRING(@Parameter, N, CHARINDEX(@Delimiter, @Parameter + @Delimiter, N) - N) AS ItemValue
   FROM dbo.Tally
  WHERE N BETWEEN 1 AND LEN(@Parameter) + 1
    AND SUBSTRING(@Delimiter + @Parameter, N, 1) = @Delimiter; --Notice how we find the comma
 RETURN;
END;
GO
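A quick smoke test (this assumes the dbo.Tally table used throughout this thread already exists):

--===== Sanity check: split a literal list (requires dbo.Tally)
SELECT ItemNumber, ItemValue
  FROM dbo.Split8KTallyM('10,20,30', ',');
-- Expected rows: (1, 10), (2, 20), (3, 30)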




And, no... I didn't take the time to test an NVARCHAR version of the Tally Table code here. I'm saving that for the article. :-D

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think it's expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

When you put the right degree of spin on it, the number 318 is also a glyph that describes the nature of a DBA's job. ;-)

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Attachments
Tally Split Race.gif (595 views, 58.00 KB)
Pavel Pawlowski
Jeff,

you are right about the data types, so I've modified the NVARCHAR to VARCHAR in the test data generator and made a comparison of Split8KTallyM vs. CLR RegEx. Also, I've made a small enhancement to the CLR RegEx by changing the input and output datatypes to match the actual data more closely, and I simplified the regular expression for this concrete test.

As you said, it's always "It Depends".

But even Split8KTallyM is 8 times slower compared to the CLR RegEx at the high element counts (1333).
Maybe I'm doing something wrong or forgot something, but if so, I'm not aware of it.

Here are the results from the test on my workstation (both run against the test data with a VARCHAR column instead of NVARCHAR).



Here is the definition of the CLR RegEx function (the rest is the same as in the previous post):

// Namespaces assumed (RegExRow and FillRegExRow are unchanged from the previous post):
//   using System.Collections;
//   using System.Data.SqlTypes;
//   using System.Text.RegularExpressions;
//   using Microsoft.SqlServer.Server;
[SqlFunction(FillRowMethodName = "FillRegExRow")]
public static IEnumerable RegExMatches2(SqlString sourceString, SqlString pattern)
{
    Regex r = new Regex(pattern.Value, RegexOptions.Compiled);
    int rowId = 0;
    int matchId = 0;
    foreach (Match m in r.Matches(sourceString.Value))
    {
        matchId++;
        for (int i = 0; i < m.Groups.Count; i++)
        {
            // Emit one row per capture group of each match
            yield return new RegExRow(++rowId, matchId, i, m.Groups[i].Value);
        }
    }
}



CREATE FUNCTION [dbo].[fn_RegExMatches2] (@sourceString [nvarchar](max), @pattern [nvarchar](10))
RETURNS TABLE (
    [rowId]   [int] NULL,
    [matchId] [int] NULL,
    [groupId] [int] NULL,
    [value]   [nvarchar](10) NULL
) WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [SQLRegEx].[SQLRegEx].[RegExMatches2]
GO



And the test query:

--=== CLR-RegEx Version ===
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum     = D.RowNum,
    @ItemNumber = V.matchId,
    @ItemValue  = CAST(V.value AS INT)
FROM dbo.CsvTest4 D
CROSS APPLY dbo.fn_RegExMatches2(D.CsvParameter, '\d+') V
GO
--=== Split8KTallyM ===
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum     = D.RowNum,
    @ItemNumber = V.ItemNumber,
    @ItemValue  = V.ItemValue
FROM dbo.CsvTest4 D
CROSS APPLY dbo.Split8KTallyM(D.CsvParameter, ',') V
GO


Attachments
CLRRegExTally.png (507 views, 8.00 KB)
steve-893342
Aha, the 1,333 test. :-) Curiously enough, I had already carried out the analysis using your CsvTest setup, but on a 10,000-row x 1,333-element table. Results available on request.
steve-893342
mishaluba (6/27/2010)
Hello,

I have a table with three columns containing CSV values. I need to normalize them. The order of the CSV values is meaningful (explained more below). Here is the sample DDL:


CREATE TABLE #t (id int, kpi1 varchar(16), kpi2 varchar(16), kpi3 varchar(16))

insert into #t (id, kpi1, kpi2, kpi3)
values(1, '0.1,0.11,0.111', '0.2,0.22,0.222', '0.3,0.33,0.333')


insert into #t (id, kpi1, kpi2, kpi3)
values(2, '0.4,0.44,0.444', '0.5,0.55,0.555', '0.6,0.66,0.666')

select *
from #t



My desired result would look something like this:

id  kpi1            kpi2            kpi3            items1  items2  items3
--  --------------  --------------  --------------  ------  ------  ------
1   0.1,0.11,0.111  0.2,0.22,0.222  0.3,0.33,0.333  0.1     0.2     0.3
1   0.1,0.11,0.111  0.2,0.22,0.222  0.3,0.33,0.333  0.11    0.22    0.33
1   0.1,0.11,0.111  0.2,0.22,0.222  0.3,0.33,0.333  0.111   0.222   0.333
2   0.4,0.44,0.444  0.5,0.55,0.555  0.6,0.66,0.666  0.4     0.5     0.6
2   0.4,0.44,0.444  0.5,0.55,0.555  0.6,0.66,0.666  0.44    0.55    0.66
2   0.4,0.44,0.444  0.5,0.55,0.555  0.6,0.66,0.666  0.444   0.555   0.666


I hope this is clear. Basically, I want to split the CSV values in each of the columns without multiplying the number of records every time I do a split; the items should just stay attached to the same id. Once again, the order of the CSV values is meaningful for the subsequent manipulations. For example, 0.1, 0.2, 0.3 (the first values from kpi1, kpi2, kpi3) form a meaningful set of values, which is why I want them in the first row. The next row contains the second values from each of the kpis: 0.11, 0.22, 0.33. Since I do have an id in my source table and the function returns this id, I thought I could do a JOIN instead of CROSS APPLY, but that gives me a syntax error.

OK then folks, returning to the original post. :-) I have another solution.
This approach uses CROSS APPLY and some string manipulation to accomplish the splitting:

SELECT id, kpi1, kpi2, kpi3, items1, items2, items3
FROM #t
CROSS APPLY
(
    -- First item: everything before the first comma
    SELECT LEFT(kpi1, CHARINDEX(',', kpi1) - 1),
           LEFT(kpi2, CHARINDEX(',', kpi2) - 1),
           LEFT(kpi3, CHARINDEX(',', kpi3) - 1)
    -- Second item: between the first and second commas
    UNION ALL
    SELECT SUBSTRING(kpi1, CHARINDEX(',', kpi1) + 1, CHARINDEX(',', kpi1, CHARINDEX(',', kpi1) + 1) - CHARINDEX(',', kpi1) - 1),
           SUBSTRING(kpi2, CHARINDEX(',', kpi2) + 1, CHARINDEX(',', kpi2, CHARINDEX(',', kpi2) + 1) - CHARINDEX(',', kpi2) - 1),
           SUBSTRING(kpi3, CHARINDEX(',', kpi3) + 1, CHARINDEX(',', kpi3, CHARINDEX(',', kpi3) + 1) - CHARINDEX(',', kpi3) - 1)
    -- Third item: everything after the last comma
    UNION ALL
    SELECT RIGHT(kpi1, CHARINDEX(',', REVERSE(kpi1)) - 1),
           RIGHT(kpi2, CHARINDEX(',', REVERSE(kpi2)) - 1),
           RIGHT(kpi3, CHARINDEX(',', REVERSE(kpi3)) - 1)
) AS Z (items1, items2, items3)



Jeff Moden
steve-893342 (10/2/2010)
Aha, the 1,333 test. :-) Curiously enough, I had already carried out the analysis using your CsvTest setup, but on a 10,000-row x 1,333-element table. Results available on request.


Heh... I'm not quite sure why you make people request the results. They'd be useful to the forum even if I had absolutely no interest. ;-)

--Jeff Moden
Jeff Moden
Pavel Pawlowski (10/2/2010)
But even Split8KTallyM is 8 times slower compared to the CLR RegEx at the high element counts (1333).
Maybe I'm doing something wrong or forgot something, but if so, I'm not aware of it.


You're correct. I'm not sure what, or if, you're doing anything wrong, because the Tally Table versions I used obviously took a lot less time than yours did. I'll take a look at your code and see if I can figure it out.

--Jeff Moden
Jeff Moden
@Pavel,

What is the default collation on your machine, please? And thanks ahead of time for helping me figure this out.
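(For anyone following along, both the server and database defaults can be checked like this:)

SELECT SERVERPROPERTY('Collation')                AS ServerCollation,
       DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation;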

--Jeff Moden
Pavel Pawlowski
Jeff Moden (10/3/2010)
@Pavel,

What is the default collation on your machine, please? And thanks ahead of time for helping me figure this out.


Jeff,

my default collation is Czech_CI_AS, but I've also tested this on a DB with the Latin1_General_CI_AS collation and the results are the same. I tested it on a clustered table and also on a heap.

Maybe the difference is in the query plan. I don't know your query plan, but on my machine SQL Server chooses a parallel plan for the RegEx solution, and maybe that benefits it.
Here are the plans. The first, parallel one is for the RegEx CLR.
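One way to rule parallelism in or out (a hedged suggestion, not something tested here) is to pin the RegEx run to a serial plan and compare again:

--===== Hedged suggestion: force a serial plan to isolate parallelism
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT @RowNum     = D.RowNum,
       @ItemNumber = V.matchId,
       @ItemValue  = CAST(V.value AS INT)
  FROM dbo.CsvTest4 D
 CROSS APPLY dbo.fn_RegExMatches2(D.CsvParameter, '\d+') V
OPTION (MAXDOP 1);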




Also, in terms of optimization and speed of CSV splitting, I've created a little CLR function which is nearly 5 times quicker than the RegEx for CSV splitting and more than 30 times faster than the Tally on my box. It can be used for simple CSVs with a one-character delimiter. (The maxLen argument is the length of the buffer, i.e. the maximum length of an item in the CSV.)

Here is the source code, with the results from Profiler below.

// Namespaces required:
//   using System;
//   using System.Collections;
//   using System.Data.SqlTypes;
//   using Microsoft.SqlServer.Server;
public class StringSplit
{
    private struct StrRow
    {
        public StrRow(int rowId, SqlChars value)
        {
            RowId = rowId;
            Value = value;
        }

        public int RowId;
        public SqlChars Value;
    }

    [SqlFunction(FillRowMethodName = "FillSplitString3")]
    public static IEnumerable SplitString3(SqlString sourceString, string delimiter, int maxLen)
    {
        char[] buffer = new char[maxLen]; // scratch buffer for the current item
        char delim = delimiter[0];
        int rowNumber = 0;
        int chars = 0;
        char[] finalString;

        foreach (char chr in sourceString.Value)
        {
            if (chr == delim)
            {
                // End of an item: copy the buffered characters out and emit a row
                finalString = new char[chars];
                Array.Copy(buffer, finalString, chars);
                yield return new StrRow(++rowNumber, new SqlChars(finalString));
                chars = 0;
            }
            else
            {
                buffer[chars++] = chr;
            }
        }
        // Emit the trailing item (there is no delimiter after the last one)
        if (chars > 0)
        {
            finalString = new char[chars];
            Array.Copy(buffer, finalString, chars);
            yield return new StrRow(++rowNumber, new SqlChars(finalString));
        }
    }

    public static void FillSplitString3(object obj, out int rowId, out SqlChars value)
    {
        StrRow r = (StrRow)obj;
        rowId = r.RowId;
        value = r.Value;
    }
}



CREATE FUNCTION dbo.fn_SplitString3 (
    @sourceString nvarchar(max),
    @delimiter nchar(1),
    @maxLen int
)
RETURNS TABLE (
    RowID int,
    Value nvarchar(10)
)
AS
EXTERNAL NAME [CQLCLR].StringSplit.SplitString3
GO
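A quick smoke test of the wrapper (hedged; @maxLen just has to cover the longest item):

--===== Sanity check (assumes the assembly is deployed as above)
SELECT RowID, Value
  FROM dbo.fn_SplitString3(N'10,20,30', N',', 10);
-- Expected rows: (1, N'10'), (2, N'20'), (3, N'30')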




Profiler Output:

Attachments
CLRRegExTally2.png (458 views, 41.00 KB)
CLRRegExTally3.png (468 views, 11.00 KB)
Jeff Moden
I'm not sure why, but the Tally Table method appears (from this angle... can't tell really) to be cross joining with the data, which, of course, is going to make it terribly slow. The question is... why? Again... can't tell from here, and it's not happening that way in any of the (apparently) same tests I've run on 2005 or 2008. At this point, I'm not sure what the two of us are doing differently.

I've also lost track a bit of what we've done. The difference may be in the Tally Table itself (post the CREATE TABLE statement and all the indexes you have on it just so I can have a look, please?). This type of stuff is difficult to troubleshoot remotely.

If you want (don't worry about the CLR side... we know that's good), post all of the code YOU are using to set up the test, the function, and the Tally Table, and I'll try that on my machine to see if I can duplicate the problem you're having. I have both a 2k5 and a 2k8 machine at home to test on now.

And, yes... I realize that much of the code may be what's already been posted and much of it may be mine. Since I'm trying to duplicate your problem, I wanted you to post the actual code you used so I can try and find a difference.

Thanks, Pavel.

--Jeff Moden
Pavel Pawlowski
Jeff Moden (10/5/2010)

If you want (don't worry about the CLR side... we know that's good), post all of the code YOU are using to set up the test, the function, and the Tally Table, and I'll try that on my machine to see if I can duplicate the problem you're having. I have both a 2k5 and a 2k8 machine at home to test on now.

And, yes... I realize that much of the code may be what's already been posted and much of it may be mine. Since I'm trying to duplicate your problem, I wanted you to post the actual code you used so I can try and find a difference.

Thanks, Pavel.


Jeff,

here is the complete script and also the Profiler output:

--Create TestDB
CREATE DATABASE [TestDB]
COLLATE Latin1_General_CI_AS
GO

--Use TestDB
USE [TestDB]
GO


--Create and populate tally table
SELECT TOP 11000
IDENTITY(INT, 1, 1) AS N
INTO dbo.Tally
FROM sys.all_objects o1, sys.all_objects
GO

--Add Clustered Index on Tally table
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
GO

--Create and populate the CsvTest table (it doesn't matter whether the table has a clustered index or is simply a heap)
SELECT TOP (10000) --Controls the number of rows in the test table
        ISNULL(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 0) AS RowNum,
        (
         SELECT CAST(STUFF( --=== STUFF gets rid of the leading comma
                ( --=== This builds the CSV row with a leading comma
                 SELECT TOP (1333) --Controls the number of CSV elements in each row
                        ',' + CAST(ABS(CHECKSUM(NEWID())) % 100000 AS VARCHAR(10))
                   FROM dbo.Tally t3        --Classic cross join pseudo-cursor:
                  CROSS JOIN dbo.Tally t4   --can produce row sets up to 121 million.
                  WHERE t1.N <> t3.N        --Without this line, all rows would be the same
                    FOR XML PATH('')
                )
                , 1, 1, '') AS VARCHAR(8000))
        ) AS CsvParameter
   INTO CsvTest
   FROM dbo.Tally t1                        --Classic cross join pseudo-cursor:
  CROSS JOIN dbo.Tally t2;                  --can produce row sets up to 121 million.
GO

--Create the Split Tally Function
CREATE FUNCTION dbo.Split8KTallyM
        (@Parameter VARCHAR(8000), @Delimiter VARCHAR(1))
RETURNS @Result TABLE (ItemNumber INT, ItemValue INT) AS
BEGIN
 INSERT INTO @Result
        (ItemNumber, ItemValue)
 SELECT CAST(ROW_NUMBER() OVER (ORDER BY N) AS INT) AS ItemNumber,
        SUBSTRING(@Parameter, N, CHARINDEX(@Delimiter, @Parameter + @Delimiter, N) - N) AS ItemValue
   FROM dbo.Tally
  WHERE N BETWEEN 1 AND LEN(@Parameter) + 1
    AND SUBSTRING(@Delimiter + @Parameter, N, 1) = @Delimiter; --Notice how we find the comma
 RETURN;
END;
GO

--Tally Test
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
@RowNum = D.RowNum,
@ItemNumber = V.ItemNumber,
@ItemValue = V.ItemValue
FROM dbo.CsvTest D
CROSS APPLY dbo.Split8KTallyM(D.CsvParameter, ',') V
GO




Attachments
TallySplitProfiler.png (501 views, 17.00 KB)