Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • peter-757102 (9/11/2012)


    Having to specify collations everywhere would make the function cumbersome to use. It might even throw off the optimizer in not using indexes due to collation differences.

    An example scenario could be this:

    The outcome of the splitter is used to match the substrings against a table of specific words (indexed), and for each word we count the number of occurrences in the original delimited input string. Assuming both the words in that table and the original string are in the same collation, the code should be completely transparent in terms of collations; otherwise it loses much of its appeal.

    Why would there be a need to specify collations everywhere? For the sake of clearing up the ambiguity, this is the part of the code where the comparison is done using a binary collation (only for the delimiter, so it can be used in most cases):

    SUBSTRING(@pString, (N+1),

    ISNULL(NULLIF(CHARINDEX(@pDelimiter ,

    @pString COLLATE Latin1_General_BIN, (N+1)), 0)

    - (N+1), 8000) )

    And this is the part where the original string is split into items

    SUBSTRING(@pString, (N+1),

    ISNULL(NULLIF(CHARINDEX(@pDelimiter ,

    @pString COLLATE Latin1_General_BIN, (N+1)), 0)

    - (N+1), 8000) )

    So I do not think we are losing the default collation.
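
    To illustrate the point (a sketch of mine, not part of the original function): the COLLATE clause is applied to @pString only inside CHARINDEX, so it affects the delimiter search but not the collation of the value that SUBSTRING returns. SQL_VARIANT_PROPERTY can be used to confirm this:

    DECLARE @pString VARCHAR(8000) = 'abc,def';
    -- The binary collation only governs where the comma is found;
    -- the extracted item itself keeps the default collation of the variable.
    SELECT SQL_VARIANT_PROPERTY(
               SUBSTRING(@pString, 1,
                   CHARINDEX(',', @pString COLLATE Latin1_General_BIN) - 1),
               'Collation') AS ItemCollation;
    -- Expected: the current database's default collation, not Latin1_General_BIN.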

  • Usman Butt (9/11/2012)


    peter-757102 (9/11/2012)


    Having to specify collations everywhere would make the function cumbersome to use. It might even throw off the optimizer in not using indexes due to collation differences.

    An example scenario could be this:

    The outcome of the splitter is used to match the substrings against a table of specific words (indexed), and for each word we count the number of occurrences in the original delimited input string. Assuming both the words in that table and the original string are in the same collation, the code should be completely transparent in terms of collations; otherwise it loses much of its appeal.

    Why would there be a need to specify collations everywhere? For the sake of clearing up the ambiguity, this is the part of the code where the comparison is done using a binary collation (only for the delimiter, so it can be used in most cases):

    SUBSTRING(@pString, (N+1),

    ISNULL(NULLIF(CHARINDEX(@pDelimiter ,

    @pString COLLATE Latin1_General_BIN, (N+1)), 0)

    - (N+1), 8000) )

    And this is the part where the original string is split into items

    SUBSTRING(@pString, (N+1),

    ISNULL(NULLIF(CHARINDEX(@pDelimiter ,

    @pString COLLATE Latin1_General_BIN, (N+1)), 0)

    - (N+1), 8000) )

    So I do not think we are losing the default collation.

    My concern was just a general remark for such routines, not based on any specific implementation. I explicitly agreed that a binary comparison for the delimiter made sense. My concern is only about the final output, which I would like to see in the original collation, or in the database default when the input is a constant or variable.

    This is just to make further processing as clean as possible and able to use indexes without collation specifiers. If your implementation takes care of all that, all the better!

    It is easy to test for any function. Just output the result to a table using the select ... into TableX ... from ... construct and then check the collation on the columns using sp_help TableX.
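
    For example, a minimal version of that check (assuming the splitter is the article's dbo.DelimitedSplit8K; any splitter returning ItemNumber and Item columns would do):

    -- Capture the splitter output into a new table, then inspect the column collations.
    SELECT ItemNumber, Item
    INTO   TableX
    FROM   dbo.DelimitedSplit8K('abc,def,ghi', ',');

    EXEC sp_help 'TableX'; -- the Item column should show the database default collation
    DROP TABLE TableX;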

  • Usman Butt (9/11/2012)


    Another thing I just noticed is the return data type for the item is nvarchar(4000). Should not it be nvarchar(max)?

    Not for me, no 🙂

    The source code is there for anyone that does want to split strings where elements may be > 4000 characters.

  • m.t.cleary (9/11/2012)


    Can you avoid the conversion to Unicode by casting to a byte array?

    Yes, but it all gets very inconvenient from there on in - for example working out whether the byte stream was originally Unicode or not.

  • SQL Kiwi (9/11/2012)


    Usman Butt (9/11/2012)


    Another thing I just noticed is the return data type for the item is nvarchar(4000). Should not it be nvarchar(max)?

    Not for me, no 🙂

    The source code is there for anyone that does want to split strings where elements may be > 4000 characters.

    hahahaha..:-D Yes, one can change it since the source code is available. But since it is to be taken as the generic splitter :cool:, it should be handling strings with any length of elements :hehe: Should say we are comparing oranges with oranges when it comes to varchar(max) versions 😀

    But on a serious note, is there any performance degradation if nvarchar(max) is used? I do not remember the link, but I read somewhere that nvarchar(max) can sometimes speed up execution compared to nvarchar(N), though that may depend on the logic of the code. BTW, in my initial testing, I did see some improved performance when I changed the return type to, say, nvarchar(20).

  • Usman Butt (9/12/2012)


    hahahaha..:-D Yes, one can change it since the source code is available. But since it is to be taken as the generic splitter :cool:, it should be handling strings with any length of elements :hehe: Should say we are comparing oranges with oranges when it comes to varchar(max) versions 😀

    I know - and it has that annoying limitation of only accepting a maximum of 2GB of string as input too! 😀

    But on a serious note, is there any performance degradation if nvarchar(max) is used? I do not remember the link, but I read somewhere that nvarchar(max) can sometimes speed up execution compared to nvarchar(N), though that may depend on the logic of the code. BTW, in my initial testing, I did see some improved performance when I changed the return type to, say, nvarchar(20).

    I have never tried it. There quite probably would be, since handling LOB types always incurs overhead - a little in the CLR and a great deal more in the SQL Server engine. Improving performance by shortening the maximum length of the output from nvarchar(4000) to nvarchar(20) surprises me; I can't think of a reason that would affect anything. On the general topic of increasing the speed of the CLR splitter - no, I haven't tried much on that at all. It has always been the fastest method for me, and I have never seen it be the performance bottleneck for a real problem I have had. I also value keeping the implementation nice and simple.
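
    For anyone who wants to measure it, a rough timing sketch (the two function names below are placeholders for a CLR splitter deployed once with an nvarchar(4000) return column and once with nvarchar(max); the Item column name follows the article's convention):

    SET STATISTICS TIME ON;

    -- Assign to a variable so display time does not pollute the measurement.
    DECLARE @Bitbucket NVARCHAR(4000);

    SELECT @Bitbucket = split.Item
    FROM   dbo.SplitString_4000(REPLICATE(N'x,', 2000), N',') split; -- nvarchar(4000) version

    SELECT @Bitbucket = split.Item
    FROM   dbo.SplitString_Max(REPLICATE(N'x,', 2000), N',') split;  -- nvarchar(max) version

    SET STATISTICS TIME OFF;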

  • I was thinking that there would be a separate version for ASCII which would avoid the conversions to and from UNICODE and probably make it even faster. The convenience of having a single function counts for a lot given the speed is already good.

    SQL Kiwi (9/11/2012)


    m.t.cleary (9/11/2012)


    Can you avoid the conversion to Unicode by casting to a byte array?

    Yes, but it all gets very inconvenient from there on in - for example working out whether the byte stream was originally Unicode or not.

    I found a similar failure to proceed on a physical server - a desktop running Vista 32-bit on an aging AMD processor. When I changed the script to select into local variables instead of inserting into temp tables, it completed in good time. There is probably a problem with tempdb allocation or expansion.

    This is OK for micro-benchmarks, but if you want a realistic test you should use the splitter output rather than discarding it. If you can't fix tempdb, I'd change the script to delete the temporary tables as you go, or just run one splitter variant at a time.
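
    The variable-assignment approach mentioned above looks roughly like this (a sketch only - the test table dbo.CsvTest and its CsvString column are placeholder names, and the splitter is assumed to be the article's dbo.DelimitedSplit8K):

    -- Consume the splitter output into local variables instead of a temp table,
    -- so tempdb allocation does not dominate the timing.
    DECLARE @ItemNumber BIGINT, @Item VARCHAR(8000);

    SELECT @ItemNumber = split.ItemNumber,
           @Item       = split.Item
    FROM   dbo.CsvTest src
    CROSS APPLY dbo.DelimitedSplit8K(src.CsvString, ',') split;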

    -mark

    ErikEckhardt (3/19/2012)


    Second try: failed. 3 days and 20-some-odd hours later, exact same results as before, only with wait time 334,605,998. 🙂

    I think I will not be running this again on my VM SQL instance. That's okay, there are plenty other places I can try it...

    The entire SQL Server could not be used for any other purpose for 2 days. I have something logging to a table every minute and it could not operate.

  • I had an online csv splitter made that can be used on any platform including mac...

    meh won't link...

    http://www.promotional-products.org/automatic-online-csv-file-splitter.html

  • david 97103 (10/5/2012)


    I had an online csv splitter made that can be used on any platform including mac...

    meh won't link...

    http://www.promotional-products.org/automatic-online-csv-file-splitter.html

    That's nice, and thanks for the freebie, but it's not exactly what we're talking about here. That splits the files into multiple files. It doesn't actually split the rows into elements and insert them into a database table.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks, Jeff!

    I was inspired by this to make a CSV splitter that also respects double-quotes, such as 10,"20,30",40, so it doesn't split on the commas inside the quotes.

    Style note - the commas are placed before the CTE names (instead of after) so there is room to insert a testing statement for seeing the results of an intermediate CTE, by uncommenting and running just that section.

    IF object_id('dbo.parseCSV') IS NOT NULL DROP FUNCTION dbo.parseCSV

    GO

    --parseCSV.sql Derived from: http://www.sqlservercentral.com/articles/Tally+Table/72993/

    /*

    We assume input data is comma-separated values with no quotes EXCEPT for fields completely surrounded by quotes.

    That is, the opening quote is either column 1 or just after a comma, and the closing quote is either end-of-string or just before a comma.

    Usage Example:

    SELECT * FROM dbo.parseCSV('Fat,"10,20,30",and,Lazy')

    --*/

    CREATE FUNCTION dbo.parseCSV (@pString VARCHAR(8000))

    --WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...

    -- enough to cover VARCHAR(8000)

    --You can test by executing from this Declare to the end or an uncommented SELECT after a cte definition.

    --declare @pString varchar(1000)='1,"4,6",9'; set @pString='a1,",xb,",c1';--set @pString='Fat,"10,20,30",and,Lazy';--set @pString='a,b,"c,d"';--set @pString='"1,2","3,4"';

    WITH E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ) --10E+1 or 10 rows

    ,E2(N) AS (SELECT 1 FROM E1 a, E1 b) --10E+2 or 100 rows

    ,E4(N) AS (SELECT 1 FROM E2 a, E2 b) --10E+4 or 10,000 rows max

    ,cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    ,cteQtCom AS (--find all quotes and commas

    SELECT t.N

    , CASE WHEN SUBSTRING(@pString,t.N,1)=',' THEN 1 ELSE 0 END [isComma]

    , CASE WHEN SUBSTRING(@pString,t.N,1)='"' THEN 1 ELSE 0 END [isQuote]

    FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) IN (',','"')

    )--cteQtCom

    --SELECT * FROM cteQtCom

    ,cteNumQuotesBefore AS (

    SELECT t.N, t.isComma, t.isQuote

    , ISNULL((SELECT SUM(t2.isQuote) FROM cteQtCom t2 WHERE t2.N<t.N),0) [qb4]

    FROM cteQtCom t

    )--cteNumQuotesBefore

    --SELECT * FROM cteNumQuotesBefore

    ,cteCommasNotWithinQuotes AS (--if the number of quotes before is EVEN then this comma is not inside quotes.

    SELECT t.N

    FROM cteNumQuotesBefore t

    WHERE t.isComma=1 AND (qb4%2)=0

    )--cteCommasNotWithinQuotes

    --SELECT *, SUBSTRING(@pString,1,N)[substr] FROM cteCommasNotWithinQuotes

    ,cteStart(N1,this_startquote,prev_endquote) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT 1[N1], CASE WHEN SUBSTRING(@pString,1,1)='"' THEN 1 ELSE 0 END[this_startquote], 0[prev_endquote]

    UNION ALL

    SELECT t.N+1[N1]

    , CASE WHEN SUBSTRING(@pString,t.N+1,1)='"' THEN 1 ELSE 0 END[this_startquote]--this field starts with quote

    , CASE WHEN SUBSTRING(@pString,t.N-1,1)='"' THEN 1 ELSE 0 END[prev_endquote]--previous field ended with quote

    --both could be true for contiguous quoted fields: "1","2"

    FROM cteCommasNotWithinQuotes t

    )--cteStart

    --SELECT * FROM cteStart

    ,cteLen AS(--==== Return start and length (for use in substring)

    SELECT s.N1, s.this_startquote

    ,CASE WHEN ending_comma.N1 IS NOT NULL THEN ending_comma.N1-s.N1-1

    ELSE LEN(@pString)-s.N1+1 END [L1]

    ,CASE WHEN ending_comma.prev_endquote IS NOT NULL THEN ending_comma.prev_endquote

    ELSE CASE WHEN RIGHT(@pString,1)='"' THEN 1 ELSE 0 END END[this_endquote]

    FROM cteStart s

    OUTER APPLY (

    SELECT TOP 1 e.N1,e.prev_endquote

    FROM cteStart e

    WHERE e.N1>s.N1

    ORDER BY N1

    ) ending_comma

    )--cteLen

    --SELECT *,SUBSTRING(@pString,N1,L1)[substr] FROM cteLen

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY LL.N1),

    Item = SUBSTRING(@pString, LL.N1+LL.this_startquote, LL.L1-this_startquote-this_endquote)

    FROM cteLen LL

    ;

    GO
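
    For reference, tracing the logic above, the usage example in the header comment should return four items (quotes stripped, inner commas preserved):

    --SELECT * FROM dbo.parseCSV('Fat,"10,20,30",and,Lazy')
    -- ItemNumber  Item
    -- ----------  ---------
    -- 1           Fat
    -- 2           10,20,30
    -- 3           and
    -- 4           Lazy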

    And here is a little tester code:

    --Create a few hundred test cases by all combinations of A,B,C where each can be bare or "surrounded by quotes"

    DECLARE @a TABLE (x VARCHAR(200))

    INSERT INTO @a VALUES (''),('a1'),('qa1,qa2'),(',xa,')

    DECLARE @q TABLE (iq TINYINT, q VARCHAR(1))

    INSERT INTO @q VALUES (0,''),(1,'"')

    IF object_id('tempdb..#T') IS NOT NULL DROP TABLE #T

    CREATE TABLE #T (id INT IDENTITY(1,1),s VARCHAR(200))

    INSERT INTO #T

    SELECT QA.q+A.x+QA.q

    +','+QB.q+B.x+QB.q

    +','+QC.q+C.x+QC.q

    FROM @a A CROSS JOIN @q QA

    CROSS JOIN @a B CROSS JOIN @q QB

    CROSS JOIN @a C CROSS JOIN @q QC

    --Run the test -- 512 input rows, 2112 output rows -- up to you to eyeball the results

    IF 1=0

    SELECT id,s, r.*

    FROM #T

    OUTER APPLY dbo.parseCSV(s) r

    --show fields in separate columns so more compact:

    SELECT s, [1][one],[2][two],[3][three],[4][four],[5][five],[6][six],[7][seven]

    FROM (

    SELECT t.id, t.s, r.ItemNumber, r.Item

    FROM #T t

    OUTER APPLY dbo.parseCSV(s) r

    ) AS SRC

    PIVOT (

    MAX(Item)

    FOR ItemNumber IN ([1],[2],[3],[4],[5],[6],[7])

    ) AS pvt

    IF 1=0 BEGIN--testa --prints as we go, so we can find where a problem happened:

    DECLARE @n INT=0

    WHILE EXISTS(SELECT * FROM #T WHERE id>@n) BEGIN--n

    SELECT @n=MIN(id) FROM #T WHERE id>@n

    DECLARE @s1 VARCHAR(200)

    SELECT @s1=s FROM #T WHERE id=@n

    PRINT CONVERT(VARCHAR,@n)+': '+@s1

    SELECT @n[n],* FROM dbo.parseCSV(@s1) v

    END--n

    END--testa

  • Hi Jeff,

    Can you please test your new functions' performance against the following function:

    CREATE FUNCTION [dbo].[ParseString]

    (

    @String VarChar(8000),

    @Delimiter VarChar(1)

    ) RETURNS TABLE

    AS

    RETURN

    (

    WITH Tokens(LinePos, StartPos, EndPos) AS

    (

    SELECT

    1,

    1,

    CharIndex(@Delimiter, @String)

    UNION ALL

    SELECT

    LinePos + 1,

    EndPos + 1,

    CharIndex(@Delimiter, @String, EndPos + 1)

    FROM

    Tokens

    WHERE

    EndPos > 0

    )

    SELECT

    CONVERT(Int, LinePos) AS RowNumber,

    SubString

    (

    @String,

    StartPos,

    CASE

    WHEN EndPos > 0 THEN (EndPos - StartPos)

    ELSE 8000

    END

    )

    AS StringValue

    from Tokens

    )

    and let us know which performs best?

    Regards,

    Dirk van der Watt

  • Dirk vd Watt (12/28/2012)


    Hi Jeff,

    Can you please test your new functions' performance against the following function:

    CREATE FUNCTION [dbo].[ParseString]

    (

    @String VarChar(8000),

    @Delimiter VarChar(1)

    ) RETURNS TABLE

    AS

    RETURN

    (

    WITH Tokens(LinePos, StartPos, EndPos) AS

    (

    SELECT

    1,

    1,

    CharIndex(@Delimiter, @String)

    UNION ALL

    SELECT

    LinePos + 1,

    EndPos + 1,

    CharIndex(@Delimiter, @String, EndPos + 1)

    FROM

    Tokens

    WHERE

    EndPos > 0

    )

    SELECT

    CONVERT(Int, LinePos) AS RowNumber,

    SubString

    (

    @String,

    StartPos,

    CASE

    WHEN EndPos > 0 THEN (EndPos - StartPos)

    ELSE 8000

    END

    )

    AS StringValue

    from Tokens

    )

    and let us know which performs best?

    Regards,

    Dirk van der Watt

    That's a recursive CTE that counts. Please see the following article.

    http://www.sqlservercentral.com/articles/T-SQL/74118/

    Considering the findings in that article and the fact that I provided a complete test harness so that anyone could do a full test on their own functions, please understand when I say "Can you please test {your own} new functions' performance ... and let us know which performs best?" 🙂
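
    Also worth remembering when testing the recursive version: a recursive CTE is capped by the default MAXRECURSION limit of 100, so an inline splitter like ParseString above will fail on strings with more than roughly 100 delimiters unless the calling query supplies a hint, and that hint cannot be embedded in the inline function itself. A sketch of the workaround:

    -- The hint has to go on the outer query; it cannot live inside the inline TVF.
    SELECT RowNumber, StringValue
    FROM   dbo.ParseString(REPLICATE('x,', 500), ',')
    OPTION (MAXRECURSION 0); -- 0 = no limit; without it, more than 100 elements aborts the statement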

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • what about xml?

    declare @table table (id int identity(1,1), textdata varchar(4000))

    insert @table

    exec xp_cmdshell 'wmic volume get capacity, drivetype, freespace, caption /format:csv'

    ; with cte as (

    select Data.value('(/root/r/text())[1]','varchar(1000)') Node

    , convert(bigint,Data.value('(/root/r/text())[2]','varchar(1000)')) / (1024 * 1024 * 1024) Capacity

    , Data.value('(/root/r/text())[3]','varchar(1000)') Caption

    , Data.value('(/root/r/text())[4]','varchar(1000)') DriveType

    , convert(bigint,replace(Data.value('(/root/r/text())[5]','varchar(1000)'),char(10),'')) / (1024 * 1024 * 1024) FreeSpace

    from @table

    cross apply (select convert(xml,'<root><r>' + replace(textdata,',','</r><r>') + '</r></root>') as Data) textdataXML

    where Data.value('(/root/r/text())[4]','varchar(1000)') = '3'

    )

    select Caption DriveName

    , Capacity

    , FreeSpace

    , convert(money,FreeSpace) / convert(money,Capacity) PercentFree

    from cte

    where Capacity > 0

    order by 3 desc

  • telcogod (12/28/2012)


    what about xml?

    Did you read the whole article and look at the performance comparisons? This style of xml splitter was on the list and it was not as fast.

    As Jeff said above, there is a full test harness posted. Try out your xml splitter next to the DelimitedSplit8K. You might be surprised.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
