SQLServerCentral is supported by Red Gate Software Ltd.
 
Tally OH! An Improved SQL 8K “CSV Splitter” Function
Posted Wednesday, September 12, 2012 3:31 PM


SSChampion


Group: General Forum Members
Last Login: Today @ 11:03 AM
Points: 11,194, Visits: 11,167
Usman Butt (9/12/2012)
hahahaha.. Yes, one can change it since the source code is available. But since it is to be taken as the generic splitter, it should handle strings with elements of any length. I should say we are comparing oranges with oranges when it comes to varchar(max) versions.

I know - and it has that annoying limitation of only accepting a maximum of 2GB of string as input too!

But on a serious note, is there any performance degradation if nvarchar(max) is used? I do not remember the link, but I read somewhere that nvarchar(max) could sometimes speed up execution compared to nvarchar(N), though it may depend on the logic of the code. BTW, in my initial testing, I did see some improved performance when I changed the return type to, let's say, nvarchar(20).

I have never tried it. There quite probably would be, since handling LOB types always incurs overhead: a little in the CLR and a great deal more in the SQL Server engine. Improving performance by shortening the maximum length of the output from nvarchar(4000) to nvarchar(20) surprises me; I can't think of a reason that would affect anything. On the general topic of increasing the speed of the CLR splitter - no, I haven't tried much on that at all. It has always been the fastest method for me, and I have never seen it be the performance bottleneck for a real problem I have had. I also value keeping the implementation nice and simple.
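For anyone who wants to check the LOB overhead for themselves, a rough harness along these lines would do it. This is a sketch only: dbo.Split_N4000, dbo.Split_NMax, and dbo.TestStrings are hypothetical names for two otherwise identical splitter builds (one returning NVARCHAR(4000), one NVARCHAR(MAX)) and a test table.

```sql
-- Hypothetical harness: run two identical splitters, differing only in
-- declared return type, over the same data and compare elapsed times.
SET STATISTICS TIME ON;

SELECT MAX(split.Item)               -- aggregate so the output is consumed
FROM dbo.TestStrings AS t
CROSS APPLY dbo.Split_N4000(t.Csv, N',') AS split;

SELECT MAX(split.Item)
FROM dbo.TestStrings AS t
CROSS APPLY dbo.Split_NMax(t.Csv, N',') AS split;

SET STATISTICS TIME OFF;
```

The MAX() aggregate matters: without consuming every row, the optimizer can short-circuit the work and the timings become meaningless.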




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #1358237
Posted Friday, September 14, 2012 5:08 AM
Grasshopper


Group: General Forum Members
Last Login: Wednesday, September 10, 2014 2:22 PM
Points: 13, Visits: 220
I was thinking that there would be a separate version for ASCII, which would avoid the conversions to and from Unicode and probably make it even faster. The convenience of having a single function counts for a lot, given that the speed is already good.

SQL Kiwi (9/11/2012)
m.t.cleary (9/11/2012)
Can you avoid the conversion to Unicode by casting to a byte array?

Yes, but it all gets very inconvenient from there on in - for example working out whether the byte stream was originally Unicode or not.
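As an aside for readers, the ambiguity Paul describes is easy to see; this illustration is not part of the thread's code. The same characters produce different byte patterns depending on the declared type, and nothing in the bytes themselves records which encoding you started from.

```sql
-- 'AB' as single-byte characters vs. as UTF-16 (nvarchar) characters:
SELECT CAST('AB'  AS VARBINARY(10)) AS AsciiBytes,    -- 0x4142
       CAST(N'AB' AS VARBINARY(10)) AS UnicodeBytes;  -- 0x41004200
```

A byte-array interface would therefore need an extra flag (or a convention) to say which encoding the caller sent, which is the inconvenience being pointed out.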
Post #1359176
Posted Sunday, September 16, 2012 7:09 PM
Grasshopper


Group: General Forum Members
Last Login: Wednesday, September 10, 2014 2:22 PM
Points: 13, Visits: 220
I saw a similar failure to make progress on a physical server - a desktop running 32-bit Vista on an aging AMD processor. When I changed the script to select into local variables instead of inserting into temp tables, it completed in good time. There is probably a problem with tempdb allocation or expansion.
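The variable-assignment trick mentioned above looks roughly like this (a sketch; dbo.CsvTest and its Csv column are placeholder names, and DelimitedSplit8K is the article's splitter):

```sql
-- Consume every row of the splitter output without writing to tempdb:
-- assigning each row to local variables forces the work but stores nothing.
DECLARE @ItemNumber INT,
        @Item       VARCHAR(8000);

SELECT @ItemNumber = split.ItemNumber,
       @Item       = split.Item
FROM dbo.CsvTest AS t
CROSS APPLY dbo.DelimitedSplit8K(t.Csv, ',') AS split;
```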

This is OK for micro-benchmarks, but if you want a realistic test you should use the splitter output rather than discarding it. If you can't fix tempdb, I'd change the script to delete the temporary tables as you go, or just run one splitter variant at a time.

-mark

ErikEckhardt (3/19/2012)
Second try: failed. 3 days and 20-some-odd hours later, exact same results as before, only with wait time 334,605,998. :)

I think I will not be running this again on my VM SQL instance. That's okay, there are plenty of other places I can try it...

The entire SQL Server could not be used for any other purpose for 2 days. I have something logging to a table every minute and it could not operate.
Post #1359939
Posted Friday, October 5, 2012 3:36 AM
Forum Newbie


Group: General Forum Members
Last Login: Friday, October 5, 2012 6:12 AM
Points: 1, Visits: 1
I had an online csv splitter made that can be used on any platform including mac...


meh won't link...

http://www.promotional-products.org/automatic-online-csv-file-splitter.html
Post #1368913
Posted Friday, October 5, 2012 6:08 AM


SSC-Dedicated


Group: General Forum Members
Last Login: Today @ 2:20 PM
Points: 37,099, Visits: 31,651
david 97103 (10/5/2012)
I had an online csv splitter made that can be used on any platform including mac...


meh won't link...

http://www.promotional-products.org/automatic-online-csv-file-splitter.html


That's nice and thanks for the freebie, but it's not exactly what we're talking about here. That splits a file into multiple files; it doesn't actually split the rows into elements and insert them into a database table.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1368965
Posted Wednesday, November 28, 2012 6:39 PM
Grasshopper


Group: General Forum Members
Last Login: Monday, February 3, 2014 5:52 PM
Points: 11, Visits: 51
Thanks, Jeff!

I was inspired by this to make a CSV splitter that also respects double quotes, such as 10,"20,30",40, so it doesn't split on commas inside quotes.
Style note: commas come before the CTE names (instead of after) so there is room to insert a testing statement for seeing the results of an intermediate CTE by uncommenting and running just that section.

IF object_id('dbo.parseCSV') IS NOT NULL DROP FUNCTION dbo.parseCSV
GO
--parseCSV.sql Derived from: http://www.sqlservercentral.com/articles/Tally+Table/72993/
/*
We assume input data is comma-separated values with no quotes EXCEPT for fields completely surrounded by quotes.
That is, the opening quote is either column 1 or just after a comma, and the closing quote is either end-of-string or just before a comma.
Usage Example:

SELECT * FROM dbo.parseCSV('Fat,"10,20,30",and,Lazy')
--*/
CREATE FUNCTION dbo.parseCSV (@pString VARCHAR(8000))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000)
--You can test by executing from this Declare to the end or an uncommented SELECT after a cte definition.
--declare @pString varchar(1000)='1,"4,6",9'; set @pString='a1,",xb,",c1';--set @pString='Fat,"10,20,30",and,Lazy';--set @pString='a,b,"c,d"';--set @pString='"1,2","3,4"';
WITH E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
) --10E+1 or 10 rows
,E2(N) AS (SELECT 1 FROM E1 a, E1 b) --10E+2 or 100 rows
,E4(N) AS (SELECT 1 FROM E2 a, E2 b) --10E+4 or 10,000 rows max
,cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
,cteQtCom AS (--find all quotes and commas
SELECT t.N
, CASE WHEN SUBSTRING(@pString,t.N,1)=',' THEN 1 ELSE 0 END [isComma]
, CASE WHEN SUBSTRING(@pString,t.N,1)='"' THEN 1 ELSE 0 END [isQuote]
FROM cteTally t
WHERE SUBSTRING(@pString,t.N,1) IN (',','"')
)--cteQtCom
--SELECT * FROM cteQtCom
,cteNumQuotesBefore AS (
SELECT t.N, t.isComma, t.isQuote
, ISNULL((SELECT SUM(t2.isQuote) FROM cteQtCom t2 WHERE t2.N<t.N),0) [qb4]
FROM cteQtCom t
)--cteNumQuotesBefore
--SELECT * FROM cteNumQuotesBefore
,cteCommasNotWithinQuotes AS (--if the number of quotes before is EVEN then this comma is not inside quotes.
SELECT t.N
FROM cteNumQuotesBefore t
WHERE t.isComma=1 AND (qb4%2)=0
)--cteCommasNotWithinQuotes
--SELECT *, SUBSTRING(@pString,1,N)[substr] FROM cteCommasNotWithinQuotes
,cteStart(N1,this_startquote,prev_endquote) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1[N1], CASE WHEN SUBSTRING(@pString,1,1)='"' THEN 1 ELSE 0 END[this_startquote], 0[prev_endquote]
UNION ALL
SELECT t.N+1[N1]
, CASE WHEN SUBSTRING(@pString,t.N+1,1)='"' THEN 1 ELSE 0 END[this_startquote]--this field starts with quote
, CASE WHEN SUBSTRING(@pString,t.N-1,1)='"' THEN 1 ELSE 0 END[prev_endquote]--previous field ended with quote
--both could be true for contiguous quoted fields: "1","2"
FROM cteCommasNotWithinQuotes t
)--cteStart
--SELECT * FROM cteStart
,cteLen AS(--==== Return start and length (for use in substring)
SELECT s.N1, s.this_startquote
,CASE WHEN ending_comma.N1 IS NOT NULL THEN ending_comma.N1-s.N1-1
ELSE LEN(@pString)-s.N1+1 END [L1]
,CASE WHEN ending_comma.prev_endquote IS NOT NULL THEN ending_comma.prev_endquote
ELSE CASE WHEN RIGHT(@pString,1)='"' THEN 1 ELSE 0 END END[this_endquote]
FROM cteStart s
OUTER APPLY (
SELECT TOP 1 e.N1,e.prev_endquote
FROM cteStart e
WHERE e.N1>s.N1
ORDER BY N1
) ending_comma
)--cteLen
--SELECT *,SUBSTRING(@pString,N1,L1)[substr] FROM cteLen
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY LL.N1),
Item = SUBSTRING(@pString, LL.N1+LL.this_startquote, LL.L1-this_startquote-this_endquote)
FROM cteLen LL
;
GO

And here is a little tester code:
--Create a few hundred test cases by all combinations of A,B,C where each can be bare or "surrounded by quotes"
DECLARE @a TABLE (x VARCHAR(200))
INSERT INTO @a VALUES (''),('a1'),('qa1,qa2'),(',xa,')
DECLARE @q TABLE (iq TINYINT, q VARCHAR(1))
INSERT INTO @q VALUES (0,''),(1,'"')

IF object_id('tempdb..#T') IS NOT NULL DROP TABLE #T
CREATE TABLE #T (id INT IDENTITY(1,1),s VARCHAR(200))
INSERT INTO #T
SELECT QA.q+A.x+QA.q
+','+QB.q+B.x+QB.q
+','+QC.q+C.x+QC.q
FROM @a A CROSS JOIN @q QA
CROSS JOIN @a B CROSS JOIN @q QB
CROSS JOIN @a C CROSS JOIN @q QC


--Run the test -- 512 input rows, 2112 output rows -- up to you to eyeball the results
IF 1=0
SELECT id,s, r.*
FROM #T
OUTER APPLY dbo.parseCSV(s) r

--show fields in separate columns so more compact:
SELECT s, [1][one],[2][two],[3][three],[4][four],[5][five],[6][six],[7][seven]
FROM (
SELECT t.id, t.s, r.ItemNumber, r.Item
FROM #T t
OUTER APPLY dbo.parseCSV(s) r
) AS SRC
PIVOT (
MAX(Item)
FOR ItemNumber IN ([1],[2],[3],[4],[5],[6],[7])
) AS pvt


IF 1=0 BEGIN--testa --prints as we go, so we can find where a problem happened:
DECLARE @n INT=0
WHILE EXISTS(SELECT * FROM #T WHERE id>@n) BEGIN--n
SELECT @n=MIN(id) FROM #T WHERE id>@n
DECLARE @s1 VARCHAR(200)
SELECT @s1=s FROM #T WHERE id=@n
PRINT CONVERT(VARCHAR,@n)+': '+@s1
SELECT @n[n],* FROM dbo.parseCSV(@s1) v
END--n
END--testa

Post #1390224
Posted Friday, December 28, 2012 1:06 AM
Forum Newbie


Group: General Forum Members
Last Login: Thursday, May 2, 2013 4:10 AM
Points: 2, Visits: 7
Hi Jeff,

Can you please test your new functions' performance against the following function:


CREATE FUNCTION [dbo].[ParseString]
(
@String VarChar(8000),
@Delimiter VarChar(1)
) RETURNS TABLE
AS

RETURN
(
WITH Tokens(LinePos, StartPos, EndPos) AS
(
SELECT
1,
1,
CharIndex(@Delimiter, @String)
UNION ALL
SELECT
LinePos + 1,
EndPos + 1,
CharIndex(@Delimiter, @String, EndPos + 1)
FROM
Tokens
WHERE
EndPos > 0
)
SELECT
CONVERT(Int, LinePos) AS RowNumber,
SubString
(
@String,
StartPos,
CASE
WHEN EndPos > 0 THEN (EndPos - StartPos)
ELSE 8000
END
)
AS StringValue
from Tokens
)



and let us know which performs best?

Regards,
Dirk van der Watt
Post #1400817
Posted Friday, December 28, 2012 7:00 AM


SSC-Dedicated


Group: General Forum Members
Last Login: Today @ 2:20 PM
Points: 37,099, Visits: 31,651
Dirk vd Watt (12/28/2012)
Hi Jeff,

Can you please test your new functions' performance against the ParseString function above and let us know which performs best?

Regards,
Dirk van der Watt


That's a recursive CTE that counts. Please see the following article.
http://www.sqlservercentral.com/articles/T-SQL/74118/

Considering the findings in that article and the fact that I provided a complete test harness so that anyone could do a full test on their own functions, please understand when I say "Can you please test {your own} new functions' performance ... and let us know which performs best?"
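For anyone trying ParseString as posted above, there is also a practical wrinkle with the recursive approach (a hypothetical call, not from the article's test harness): a recursive CTE is capped at 100 recursion levels by default, so strings with more than about 100 elements fail unless the caller adds a MAXRECURSION hint.

```sql
DECLARE @s VARCHAR(8000) = REPLICATE('x,', 150);  -- roughly 150 elements

-- Fails with the "maximum recursion 100 has been exhausted" error:
SELECT * FROM dbo.ParseString(@s, ',');

-- Works once the default recursion limit is lifted by the caller:
SELECT * FROM dbo.ParseString(@s, ',') OPTION (MAXRECURSION 0);
```

Note that the hint cannot be baked into the inline function itself; every calling query has to supply it.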



--Jeff Moden
Post #1400899
Posted Friday, December 28, 2012 7:02 AM
Grasshopper


Group: General Forum Members
Last Login: Wednesday, December 18, 2013 12:21 PM
Points: 18, Visits: 558
what about xml?

declare @table table (id int identity(1,1), textdata varchar(4000))
insert @table
exec xp_cmdshell 'wmic volume get capacity, drivetype, freespace, caption /format:csv'

; with cte as (
select Data.value('(/root/r/text())[1]','varchar(1000)') Node
, convert(bigint,Data.value('(/root/r/text())[2]','varchar(1000)')) / (1024 * 1024 * 1024) Capacity
, Data.value('(/root/r/text())[3]','varchar(1000)') Caption
, Data.value('(/root/r/text())[4]','varchar(1000)') DriveType
, convert(bigint,replace(Data.value('(/root/r/text())[5]','varchar(1000)'),char(10),'')) / (1024 * 1024 * 1024) FreeSpace
from @table
cross apply (select convert(xml,'<root><r>' + replace(textdata,',','</r><r>') + '</r></root>') as Data) textdataXML
where Data.value('(/root/r/text())[4]','varchar(1000)') = '3'
)
select Caption DriveName
, Capacity
, FreeSpace
, convert(money,FreeSpace) / convert(money,Capacity) PercentFree
from cte
where Capacity > 0
order by 3 desc

Post #1400900
Posted Friday, December 28, 2012 7:40 AM


SSChampion


Group: General Forum Members
Last Login: Today @ 1:33 PM
Points: 13,472, Visits: 12,328
telcogod (12/28/2012)
what about xml?


Did you read the whole article and look at the performance comparisons? This style of XML splitter was on the list, and it was not as fast.

As Jeff said above, there is a full test harness posted. Try out your XML splitter next to DelimitedSplit8K. You might be surprised.
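A minimal side-by-side call looks like this (a sketch only; note the XML route also needs the input escaped for &amp;, &lt; and &gt; before it is safe on arbitrary data, which is part of why it loses ground):

```sql
DECLARE @s VARCHAR(8000) = 'Fat,10,20,30,and,Lazy';

-- DelimitedSplit8K from the article:
SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8K(@s, ',');

-- The XML-style equivalent of the same split:
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ItemNumber,
       node.r.value('(./text())[1]', 'VARCHAR(8000)') AS Item
FROM (SELECT CAST('<r>' + REPLACE(@s, ',', '</r><r>') + '</r>' AS XML) AS d) AS x
CROSS APPLY x.d.nodes('/r') AS node(r);
```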


_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs
Understanding and Using APPLY (Part 1)
Understanding and Using APPLY (Part 2)
Post #1400917