RE: Tally OH! An Improved SQL 8K “CSV Splitter” Function

SSC Guru

Points: 61136

October 7, 2014 at 3:49 pm

samirabrahao1 60347 (10/7/2014)
Hello Jeff, Paul, and everyone else involved in this discussion.
I hope you are still following up on this. Great article indeed.
I took the liberty to make a minor change to your split function and I think I managed to achieve performance improvement proportional to the size of each @pString.
CREATE FUNCTION [dbo].[DelimitedSplit8K]
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),
E2(N) AS (SELECT 1 FROM E1 a, E1 b),
E4(N) AS (SELECT 1 FROM E2 a, E2 b),
cteTally(N) AS (
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(
SELECT s.N1,
ISNULL(lead(s.N1) over (order by s.N1)-s.N1-1,8000) -- REPLACED CHARINDEX OVER @PSTRING WITH LEAD
FROM cteStart s
)
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
As we already have a list of starting positions for each element, we can assume that the length of each element will be the starting position of the NEXT element, minus it's own starting position, minus the length of the delimiter. We can easily obtain the next element starting position by using the LEAD function and ordering by the starting position value. This also eliminates the need for the NULLIF function, because when there is no lead value, NULL is returned.
By replacing the CHARINDEX function over the @pString variable for length column calculation in the cteLen CTE, we avoid having to iterate through the whole @pString variable again (it was already done once to create the list of starting positions).
I was able to obtain a 22% performance improvement in my environment with this change 😉

Eirikur Eiriksson discusses how to use lead to make DelimitedSplit8K faster in this very excellent article:

Reaping the benefits of the Window functions in T-SQL[/url]

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001