RE: Tally OH! An Improved SQL 8K “CSV Splitter” Function

Hall of Fame

Points: 3186

May 22, 2012 at 12:46 am

Hi,

I have used and referred the same splitter many times but need this to be implemented in a real time environment recently. So, I had to do some more research on it. I started with some minor improvements and ask for suggestions on the sub-site. Many thanks to Oleg, who provided another pointer with a very fruitful discussion and with an interesting 1133 elements, 10000 rows performance test (I am not going into the details and performance comparisons made). I tweaked the physical structure of the tally table a little bit according to the requirement by adding a pre-calculated column N1 which would have the value N+1 but there was only a minor gain as it reduces some IO. But the most important change was conversion of tally table’s column’s data-type i.e. from BIGINT to INT. I have come up with 2 versions so far.

Now, most importantly the performance test. I followed the article’s performance test and I saw major improvement and it beats all the versions. The margin was more significant when the number of elements increases or when the size of the string was bigger. When I used this version of the splitter, for 1133 elements & 10000 rows, the test completed in approximately 19 seconds, which seems very promising as these are still iTVFs.

USE [tempdb]

GO

SET NOCOUNT ON;

GO

IF OBJECT_ID('dbo.CsvTest') IS NOT NULL

DROP TABLE CsvTest

GO

DECLARE @NumberOfElements INT,

@NumberOfRows INT

/*======== PARAMETER VALUES ==============================*/

SELECT @NumberOfElements= 1133

, @NumberOfRows= 10000

/*========================================================*/

SELECT TOP (@NumberOfRows) --Controls the number of rows in the test table

ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,

(

SELECT CAST(STUFF( --=== STUFF get's rid of the leading comma

( --=== This builds CSV row with a leading comma

SELECT TOP (@NumberOfElements) --Controls the number of CSV elements in each row

','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))

FROM sys.All_Columns t3 --Classic cross join pseudo-cursor

CROSS JOIN sys.All_Columns t4 --can produce row sets up 121 million.

WHERE t1.Object_ID <> t3.Object_ID --Without this line, all rows would be the same

FOR XML PATH('')

)

,1,1,'') AS VARCHAR(8000))

) AS Csv

INTO CsvTest

FROM sys.All_Columns t1 --Classic cross join pseudo-cursor

CROSS JOIN sys.All_Columns t2 --can produce row sets up 16 million rows

GO

ALTER TABLE CsvTest

ADD PRIMARY KEY CLUSTERED (RowNum) WITH FILLFACTOR = 100;

GO

IF OBJECT_ID('dbo.Tally','U') IS NOT NULL

DROP TABLE dbo.Tally;

--===== Create and populate the Tally table on the fly.

SELECT TOP 11000

IDENTITY(INT,1,1) AS N, --PREVIOUSLY WAS BIGINT

-- This ISNULL function makes the column NOT NULL

ISNULL(ISNULL(CAST(ROW_NUMBER() OVER ( ORDER BY (SELECT NULL)) AS INT),0)+ 1, 0) N1

INTO dbo.Tally

FROM sys.all_columns ac1

CROSS JOIN sys.all_columns ac2

;

GO

--===== Add a CLUSTERED Primary Key to maximize performance

ALTER TABLE dbo.Tally

ADD CONSTRAINT PK_Tally_N

PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

;

--===== Update Stats

UPDATE STATISTICS dbo.Tally WITH FULLSCAN

;

GO

--=====================================================================================================================

-- Conditionally drop and recreate the Split8KTally_With_Column_N1 splitter function.

--=====================================================================================================================

IF OBJECT_ID(N'dbo.Split8KTally_With_Column_N1', N'IF') IS NOT NULL

DROP FUNCTION dbo.Split8KTally_With_Column_N1;

GO

--===== Tally Table (Split8KTally_With_Column_N1 ITVF) ===============================================================================

/*===== THIS FUNCTION USES COLUMN "N1" OF TALLY TABLE HAVING PRE-CALCULATED VALUES FOR N+1 ===========*/

CREATE FUNCTION dbo.Split8KTally_With_Column_N1 (

@pString VARCHAR(8000), @pDelimiter VARCHAR(1)

)

RETURNS TABLE

AS

RETURN

/*======= FIND THE FIRST DELIMITED ITEM EXPLICITLY, AS WE SKIPPED THE FIRST ITEM TO AVOID "OR" CONDITION IN THE WHERE CLAUSE=====*/

SELECT ItemNumber = CAST(0 AS BIGINT) --STARTING WITH ZERO FOR SORT AND TO AVOID ROW_NUMBER + 1 LATER

, ItemValue = SUBSTRING(@pString, 1,

ISNULL(NULLIF(CHARINDEX(@pDelimiter,

@pString COLLATE Latin1_General_BIN), 0)

- 1, 8000 ))

UNION ALL -- CONCATENATE THE OUTPUTS

/*======= FIND REST OF THE DELIMITED ITEMS USING TALLY TABLE ======================================*/

SELECT ItemNumber = ROW_NUMBER() OVER ( ORDER BY [N] )

, ItemValue = SUBSTRING(@pString, [N1],

ISNULL(NULLIF(CHARINDEX(@pDelimiter ,

@pString COLLATE Latin1_General_BIN, [N1]), 0)

- [N1], 8000) )

FROM [dbo].[Tally] WITH ( NOLOCK ) --NOLOCK HINT IS NOT NECESSARY

WHERE [N] BETWEEN 1 AND LEN(@pString)

AND SUBSTRING(@pString, [N], 1) = @pDelimiter

GO

--=====================================================================================================================

-- Conditionally drop and recreate the Split8KTally_With_N_PLUS_1 splitter function.

--=====================================================================================================================

IF OBJECT_ID('dbo.Split8KTally_With_N_PLUS_1','IF') IS NOT NULL

DROP FUNCTION [dbo].Split8KTally_With_N_PLUS_1

GO

--===== Tally Table (Split8KTally_With_N_PLUS_1 ITVF) ===============================================================================

/*===== THIS FUNCTION USES RUNTIME CALCULATION "N + 1" INSTEAD OF USING THE PRE-CALCULATED COLUMN "N1" OF THE TALLY TABLE===========*/

CREATE FUNCTION dbo.Split8KTally_With_N_PLUS_1 (

@pString VARCHAR(8000), @pDelimiter VARCHAR(1)

)

RETURNS TABLE

AS

RETURN

/*======= FIND THE FIRST DELIMITED ITEM EXPLICITLY, AS WE SKIPPED THE FIRST ITEM TO AVOID "OR" CONDITION IN THE WHERE CLAUSE=====*/

SELECT ItemNumber = CAST(0 AS BIGINT) --STARTING WITH ZERO FOR SORT AND TO AVOID ROW_NUMBER + 1 LATER

, ItemValue = SUBSTRING(@pString, 1,

ISNULL(NULLIF(CHARINDEX(@pDelimiter,

@pString COLLATE Latin1_General_BIN), 0)

- 1, 8000 ))

UNION ALL -- CONCATENATE THE OUTPUTS

/*======= FIND REST OF THE DELIMITED ITEMS USING TALLY TABLE ======================================*/

SELECT ItemNumber = ROW_NUMBER() OVER ( ORDER BY [N] )

, ItemValue = SUBSTRING(@pString, (N+1),

ISNULL(NULLIF(CHARINDEX(@pDelimiter ,

@pString COLLATE Latin1_General_BIN, (N+1)), 0)

- (N+1), 8000) )

FROM [dbo].[Tally] WITH ( NOLOCK )--NOLOCK HINT IS NOT NECESSARY

WHERE [N] BETWEEN 1 AND LEN(@pString)

AND SUBSTRING(@pString, [N], 1) = @pDelimiter

GO

--Tally Test

PRINT '/*====== dbo.Split8KTally_With_N_PLUS_1 =================*/'

DBCC FREEPROCCACHE WITH NO_INFOMSGS

SET STATISTICS TIME ON

DECLARE @ItemNumber BIGINT

, @Item VARCHAR(8000) ;

SELECT @ItemNumber = V.ItemNumber

, @Item = V.ItemValue

FROM dbo.CsvTest D

CROSS APPLY dbo.Split8KTally_With_N_PLUS_1(D.Csv, ',') V

SET STATISTICS TIME OFF

PRINT '/*========================================================*/'

PRINT CHAR(10) + CHAR(13)

GO

PRINT '/*====== dbo.Split8KTally_With_Column_N1 =================*/'

DBCC FREEPROCCACHE WITH NO_INFOMSGS

SET STATISTICS TIME ON

DECLARE @ItemNumber BIGINT

, @Item VARCHAR(8000) ;

SELECT @ItemNumber = V.ItemNumber

, @Item = V.ItemValue

FROM dbo.CsvTest D

CROSS APPLY dbo.Split8KTally_With_Column_N1(D.Csv, ',') V

SET STATISTICS TIME OFF

PRINT '/*========================================================*/'

PRINT CHAR(10) + CHAR(13)

GO

Last but not least, thanks to all who are contributing voluntarily to help the community, however, cannot thanks enough to one of my favorite mentors Jeff Moden.

P.S. The naming conventions are not according to the standards as the purpose is to convey the core of the function 🙂 Moreover, we all know everyone’s mileage may differ.