RE: Tally OH! An Improved SQL 8K “CSV Splitter” Function

SSC-Insane

Points: 20039

May 3, 2011 at 9:42 am

#1319180

Jeff Moden (5/2/2011)
Well, that makes 3 of us. We have a quorum! 😛 Phil Factor wrote an article fairly recently on all the XML, JSON, and other gobilty-gook and suggested that someone should come up with a better way. I was considering writing an article about the all-too-"secret" characters 28-31 and a couple of other goodies in "control character land". Whatcha tink? Worthwhile or not?
I haven't run it but your real Tally Table splitter looks right. You picked up on everything including the use of the "TOP".

Change your test harness script to use CHAR(30) instead of ',' and start leading by example :); examples are what will help with the control characters. Peoplesoft names, valid uses of tabs in text (formatted text blurbs, for example), valid uses of pipes in VARCHAR fields (program documentation, storing REGEX expressions in a database, secure usernames with special characters allowed), etc.

Also, an easy way to get SQL Server to output SELECT results to a special character delimited VARCHAR() table, and directly to a text file would be very worthwhile for the article. The main uses I see are getting data to/from arbitrary SELECT results into a string with one or two delimiters/dimensions, and the same thing to/from a text file.

As far as my previous single CTE real Tally Table (0 based) splitter, please ignore it. It takes too long on higher numbers of elements, much like the original tally table splitter's problem. Instead, try one or both of the two double CTE versions below:

CREATE FUNCTION [dbo].[YourFunctionUnion](@pString [varchar](8000), @pDelimiter [char](1))

RETURNS TABLE

RETURN

WITH

cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front

-- for both a performance gain and prevention of accidental "overruns"

SELECT 0 UNION ALL

SELECT TOP (DATALENGTH(ISNULL(@pString,1))) N FROM YourDB.dbo.YourTallyTable1Based WITH (nolock) -- the 1 based + Union All is almost identical in performance to a 0 based TOP (Datalength()+1)

),

cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

SELECT t.N+1

FROM cteTally t

WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)

)

--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1), -- not including StringNumber actually increases duration and CPU, but uses slightly fewer writes

Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))

FROM cteStart s WITH (nolock)

;

CREATE FUNCTION [dbo].[YourFunction0Based](@pString [varchar](8000), @pDelimiter [char](1))

RETURNS TABLE

RETURN

WITH

cteTally(N) AS (--==== This limits the number of rows right up front

-- for both a performance gain and prevention of accidental "overruns"

SELECT TOP (DATALENGTH(ISNULL(@pString,1))+1) N FROM YourDB.dbo.YourTallyTable0Based WITH (nolock) -- the 1 based + Union All is almost identical in performance to a 0 based TOP (Datalength()+1)

),

cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

SELECT t.N+1

FROM cteTally t

WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)

)

--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1), -- not including StringNumber actually increases duration and CPU, but uses slightly fewer writes

Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))

FROM cteStart s WITH (nolock)

;

The UNION version is up to about 5% faster on small but reasonably useful numbers of elements and uses the "standard" 1 based tally table, and the 0 based version is up to about 5% faster on medium to large numbers of elements, and uses a special tally table that starts at 0. Both can be slightly slower than your CTE Tally example on very small numbers of elements/string sizes, depending on some combination of SQL version and platform parameters.