Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase «««5657585960»»»

Tally OH! An Improved SQL 8K “CSV Splitter” Function Expand / Collapse
Author
Message
Posted Sunday, April 13, 2014 11:34 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 12:23 PM
Points: 35,403, Visits: 31,965
Miller (4/13/2014)
Thanks Jeff,
I think you forgot to add the link in your quote.


Not sure what happened but I've repaired the link. Here it is, again.
http://www.sqlservercentral.com/articles/SQL+Server+2012/106903/


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1561288
Posted Sunday, April 13, 2014 12:31 PM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Thursday, October 16, 2014 12:00 PM
Points: 67, Visits: 58
I made a few improvements: handles space delimiters now. Also added DISTINCT so you can use in JOINS

CREATE FUNCTION [dbo].[BetterSplit]
(@L NVARCHAR(MAX),@D NVARCHAR(100))
RETURNS TABLE AS RETURN
WITH A(N)AS(SELECT 1UNION ALL SELECT 1UNION ALL SELECT 1UNION ALL SELECT 1UNION ALL SELECT 1)
,B(N)AS(SELECT 1FROM A,A B),C(N)AS(SELECT 1FROM B A,B),D(N)AS(SELECT 1FROM C A,C)
,T(N)AS(SELECT TOP(LEN(@L))ROW_NUMBER()OVER(ORDER BY(SELECT 1))FROM D)
,S AS(SELECT S=1UNION ALL SELECT S=N+DATALENGTH(@D)/2FROM T WHERE SUBSTRING(@L,N,DATALENGTH(@D)/2)=@D)
,L AS(SELECT S,L=LEAD(S,1,LEN(@L)+DATALENGTH(@D)/2+1)OVER(ORDER BY S)-S-DATALENGTH(@D)/2FROM S)
SELECT DISTINCT Value=LTRIM(RTRIM(SUBSTRING(@L,S,L)))FROM L



Post #1561290
Posted Monday, April 14, 2014 3:26 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Today @ 9:49 AM
Points: 6,828, Visits: 14,070
Miller (4/13/2014)
I made a few improvements: handles space delimiters now. Also added DISTINCT so you can use in JOINS

CREATE FUNCTION [dbo].[BetterSplit]
(@L NVARCHAR(MAX),@D NVARCHAR(100))
RETURNS TABLE AS RETURN
WITH A(N)AS(SELECT 1UNION ALL SELECT 1UNION ALL SELECT 1UNION ALL SELECT 1UNION ALL SELECT 1)
,B(N)AS(SELECT 1FROM A,A B),C(N)AS(SELECT 1FROM B A,B),D(N)AS(SELECT 1FROM C A,C)
,T(N)AS(SELECT TOP(LEN(@L))ROW_NUMBER()OVER(ORDER BY(SELECT 1))FROM D)
,S AS(SELECT S=1UNION ALL SELECT S=N+DATALENGTH(@D)/2FROM T WHERE SUBSTRING(@L,N,DATALENGTH(@D)/2)=@D)
,L AS(SELECT S,L=LEAD(S,1,LEN(@L)+DATALENGTH(@D)/2+1)OVER(ORDER BY S)-S-DATALENGTH(@D)/2FROM S)
SELECT DISTINCT Value=LTRIM(RTRIM(SUBSTRING(@L,S,L)))FROM L


Jeff’s “Tally Oh!” article and this discussion thread are a brilliant example of continuous improvement – incremental changes, tested and peer-reviewed. An excellent example of this is Eirikur’s recent performance-boosting modification using LEAD(). It’s been rigorously tested to ensure it meets requirements and also meets the claims made – that it’s significantly faster than the original. It does, and it is.
“BetterSplit” is perhaps a little optimistic. It fails if the delimiter has a trailing space
SELECT * FROM [dbo].[BetterSplit] ('the; quick;    ,brown; fox; jumped; over; the; lazy; dog; ', '; ')
Msg 537, Level 16, State 3, Line 1
Invalid length parameter passed to the LEFT or SUBSTRING function.

Resolved elements have leading spaces stripped. The ordinal position of elements within the string is discarded and the elements are re-ordered alphabetically by a distinct sort which adds considerably to the cost of the function. Apparently it hasn’t been performance-tested since MAX strings are known to spoil performance. Finally, DelimitedSplit8K does handle a single space as a delimiter – and correctly resolves a string with a terminal space delimiter, which “BetterSplit” does not:
SELECT DISTINCT Item FROM [dbo].[DelimitedSplit8K] ('the quick brown fox jumped over the lazy dog ', ' ')
 9 rows returned
SELECT * FROM [dbo].[BetterSplit] ('the quick brown fox jumped over the lazy dog ', ' ')
 8 rows returned

Whilst there are always going to be edge cases which are better handled by modifications to DelimitedSplit8K than by the original function, it’s important to define exactly why such a modification might be better and for exactly what purpose. Returning DISTINCT results from the function is probably insufficient as it’s trivial to return distinct results from DelimitedSplit8K. Multi-character delimiters is certainly worth investigation as a desirable modification, once the bugs (and the performance) in “BetterSplit” have been fixed.


“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Post #1561395
Posted Monday, April 14, 2014 7:16 AM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Thursday, October 16, 2014 12:00 PM
Points: 67, Visits: 58
Yes, Thanks for the test. I fixed the problem noted. Had to use DATALENGTH throughout instead of LEN to handle trailing spaces.

SELECT * FROM [dbo].[BetterSplit] ('the; quick; ,brown; fox; jumped; over; the; lazy; dog; ', '; ')
WORKS NOW.

CREATE FUNCTION [dbo].[BetterSplit]
(@L NVARCHAR(MAX),@D NVARCHAR(100))
RETURNS TABLE AS RETURN
WITH A(N)AS(SELECT 1UNION ALL SELECT 1UNION ALL SELECT 1UNION ALL SELECT 1UNION ALL SELECT 1)
,B(N)AS(SELECT 1FROM A,A B),C(N)AS(SELECT 1FROM B A,B),D(N)AS(SELECT 1FROM C A,C)
,T(N)AS(SELECT TOP(DATALENGTH(@L)/2)ROW_NUMBER()OVER(ORDER BY(SELECT 1))FROM D)
,S AS(SELECT S=1UNION ALL SELECT S=N+DATALENGTH(@D)/2FROM T WHERE SUBSTRING(@L,N,DATALENGTH(@D)/2)=@D)
,L AS(SELECT S,L=LEAD(S,1,DATALENGTH(@L)/2+DATALENGTH(@D)/2+1)OVER(ORDER BY S)-S-DATALENGTH(@D)/2FROM S)
SELECT Value=SUBSTRING(@L,S,L)FROM L




Post #1561464
Posted Monday, April 14, 2014 8:34 AM


SSC-Insane

SSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-Insane

Group: General Forum Members
Last Login: Today @ 2:18 PM
Points: 20,755, Visits: 32,571
Miller (4/14/2014)
Yes, Thanks for the test. I fixed the problem noted. Had to use DATALENGTH throughout instead of LEN to handle trailing spaces.

SELECT * FROM [dbo].[BetterSplit] ('the; quick; ,brown; fox; jumped; over; the; lazy; dog; ', '; ')
WORKS NOW.

CREATE FUNCTION [dbo].[BetterSplit]
(@L NVARCHAR(MAX),@D NVARCHAR(100))
RETURNS TABLE AS RETURN
WITH A(N)AS(SELECT 1UNION ALL SELECT 1UNION ALL SELECT 1UNION ALL SELECT 1UNION ALL SELECT 1)
,B(N)AS(SELECT 1FROM A,A B),C(N)AS(SELECT 1FROM B A,B),D(N)AS(SELECT 1FROM C A,C)
,T(N)AS(SELECT TOP(DATALENGTH(@L)/2)ROW_NUMBER()OVER(ORDER BY(SELECT 1))FROM D)
,S AS(SELECT S=1UNION ALL SELECT S=N+DATALENGTH(@D)/2FROM T WHERE SUBSTRING(@L,N,DATALENGTH(@D)/2)=@D)
,L AS(SELECT S,L=LEAD(S,1,DATALENGTH(@L)/2+DATALENGTH(@D)/2+1)OVER(ORDER BY S)-S-DATALENGTH(@D)/2FROM S)
SELECT Value=SUBSTRING(@L,S,L)FROM L



Now, you only need to do two more things. One format your code for readability. Two, change your CTE to use the value constructor instead of SELECT 1 UNION's. The latter only makes sense since you are using SQL Server 2012 features in the code anyway.



Lynn Pettis

For better assistance in answering your questions, click here
For tips to get better help with Performance Problems, click here
For Running Totals and its variations, click here or when working with partitioned tables
For more about Tally Tables, click here
For more about Cross Tabs and Pivots, click here and here
Managing Transaction Logs

SQL Musings from the Desert Fountain Valley SQL (My Mirror Blog)
Post #1561497
Posted Monday, April 14, 2014 9:39 AM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Thursday, October 16, 2014 12:00 PM
Points: 67, Visits: 58
How about this? I'd be curious to see results of performance testing with this.

CREATE FUNCTION [dbo].[BetterSplit]
(
@L NVARCHAR(MAX)
,@D NVARCHAR(100)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN

WITH
--A,B,C,D used to generate large number of rows for tally
A AS(SELECT N FROM(VALUES(1),(1),(1),(1),(1),(1))A(N))
,B(N)AS(SELECT 1FROM A,A B)
,C(N)AS(SELECT 1FROM B A,B)
,D(N)AS(SELECT 1FROM C A,C)

--Tally table
,T(N)AS(SELECT TOP(DATALENGTH(@L)/2)ROW_NUMBER()OVER(ORDER BY(SELECT 1))FROM D)

--Start of each delimiter
,S AS(SELECT S=1UNION ALL SELECT S=N+DATALENGTH(@D)/2FROM T WHERE SUBSTRING(@L,N,DATALENGTH(@D)/2)=@D)

--Start, Length of each value in list.
--Use LEAD() to get next row start, then subtract current row start and adjust for size of delimiters
,L AS(SELECT S,L=LEAD(S,1,DATALENGTH(@L)/2+DATALENGTH(@D)/2+1)OVER(ORDER BY S)-S-DATALENGTH(@D)/2FROM S)

--do final select here based on start, length
SELECT Value=SUBSTRING(@L,S,L)FROM L




Post #1561528
Posted Monday, April 14, 2014 9:55 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Today @ 9:49 AM
Points: 6,828, Visits: 14,070
Miller (4/14/2014)
How about this? I'd be curious to see results of performance testing with this.

CREATE FUNCTION [dbo].[BetterSplit]
(
@L NVARCHAR(MAX)
,@D NVARCHAR(100)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN

WITH
--A,B,C,D used to generate large number of rows for tally
A AS(SELECT N FROM(VALUES(1),(1),(1),(1),(1),(1))A(N))
,B(N)AS(SELECT 1FROM A,A B)
,C(N)AS(SELECT 1FROM B A,B)
,D(N)AS(SELECT 1FROM C A,C)

--Tally table
,T(N)AS(SELECT TOP(DATALENGTH(@L)/2)ROW_NUMBER()OVER(ORDER BY(SELECT 1))FROM D)

--Start of each delimiter
,S AS(SELECT S=1UNION ALL SELECT S=N+DATALENGTH(@D)/2FROM T WHERE SUBSTRING(@L,N,DATALENGTH(@D)/2)=@D)

--Start, Length of each value in list.
--Use LEAD() to get next row start, then subtract current row start and adjust for size of delimiters
,L AS(SELECT S,L=LEAD(S,1,DATALENGTH(@L)/2+DATALENGTH(@D)/2+1)OVER(ORDER BY S)-S-DATALENGTH(@D)/2FROM S)

--do final select here based on start, length
SELECT Value=SUBSTRING(@L,S,L)FROM L



Then tell us. If you've taken the time to read the article, you will know that it includes a full test harness.


“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Post #1561542
Posted Monday, April 14, 2014 12:20 PM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Thursday, October 16, 2014 12:00 PM
Points: 67, Visits: 58
OK, I ran the splitter tests.
I had to modify the BetterSplit a little: Switched to varchar(8000) instead of nvarchar(max) for list parameter. This needed to get good comparison between other algorithms using these datatypes. Nvarchar(max) is much slower but how else can you get more than 8000 characters?
Also added ItemNumber to result, even though in my usage the ItemNumber is not very useful.
Also I increased the first Cte to 10 items and eliminated the 4th Cte. This results in fewer nested loops.
I didn't bother taking out the logic to handle multi-character delimiters (it might be faster without this)
Anyway, after all this I ended up with BetterSplit being about 11% faster than the DelimitedSplit8K.

Not much, but I think it shows that the LEAD() function must be fundamentally faster than the CHARINDEX() for some reason. Everything else is about the same.


IF OBJECT_ID('dbo.[BetterSplit]') IS NOT NULL
DROP FUNCTION dbo.BetterSplit;
GO
create FUNCTION [dbo].[BetterSplit]
(
@L VARCHAR(8000)
,@D VARCHAR(10)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN

WITH
--A,B,C used to generate large number of rows for tally
A AS(SELECT N FROM(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1))A(N))
,B(N)AS(SELECT 1FROM A,A B)
,C(N)AS(SELECT 1FROM B A,B)

--Tally table
,T(N)AS(SELECT TOP(ISNULL(DATALENGTH(@L),0))ROW_NUMBER()OVER(ORDER BY(SELECT 1))FROM C)

--Start of each delimiter
,S AS(SELECT S=1UNION ALL SELECT S=N+DATALENGTH(@D)FROM T WHERE SUBSTRING(@L,N,DATALENGTH(@D))=@D)

--Start, Length of each value in list.
--Use LEAD() to get next row start, then subtract current row start and adjust for size of delimiters
,L AS(SELECT S,L=LEAD(S,1,DATALENGTH(@L)+DATALENGTH(@D)+1)OVER(ORDER BY S)-S-DATALENGTH(@D)FROM S)
--do final select here based on start, length
SELECT ItemNumber=ROW_NUMBER()OVER(ORDER BY S),Item=SUBSTRING(@L,S,L)FROM L
;



Post #1561628
Posted Monday, April 14, 2014 12:25 PM


SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: Today @ 3:13 PM
Points: 1,917, Visits: 19,630
Miller (4/14/2014)
How about this? I'd be curious to see results of performance testing with this.



Hi....seems interesting...but really haven't the time right now to test it for you....any reason why you cannot test yourself?


______________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
Post #1561630
Posted Monday, April 14, 2014 12:35 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Today @ 7:44 AM
Points: 13,093, Visits: 12,574
Miller (4/14/2014)
OK, I ran the splitter tests.
I had to modify the BetterSplit a little: Switched to varchar(8000) instead of nvarchar(max) for list parameter. This needed to get good comparison between other algorithms using these datatypes. Nvarchar(max) is much slower but how else can you get more than 8000 characters?
Also added ItemNumber to result, even though in my usage the ItemNumber is not very useful.
Also I increased the first Cte to 10 items and eliminated the 4th Cte. This results in fewer nested loops.
I didn't bother taking out the logic to handle multi-character delimiters (it might be faster without this)
Anyway, after all this I ended up with BetterSplit being about 11% faster than the DelimitedSplit8K.


Can you post the actual testing results and the testing you did? It would be great to see the whole thing put together.


_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs
Understanding and Using APPLY (Part 1)
Understanding and Using APPLY (Part 2)
Post #1561637
« Prev Topic | Next Topic »

Add to briefcase «««5657585960»»»

Permissions Expand / Collapse