SQLServerCentral is supported by Red Gate Software Ltd.

Tally OH! An Improved SQL 8K “CSV Splitter” Function
Posted Wednesday, May 25, 2011 12:35 PM


SSCertifiable


Group: General Forum Members
Last Login: Yesterday @ 7:58 AM
Points: 5,370, Visits: 9,010
mark hutchinson (5/25/2011)
@Jeff

Here's an idea to kick around for varchar(max) data...use a tally table (or cte equivalent) to split the very large string into strings less than 8000 characters and then rejoin those parsed strings.

I have no idea if it can or how it would be implemented, but it was an out-of-the-box idea that just struck me. Thought I would share the idea in case I get struck by a bus.

Mark


Good idea, but you can't just split it every 8000 characters. You have to split that varchar(max) into chunks of <= 8000 chars AND split at a delimiter. Any ideas?


Wayne
Microsoft Certified Master: SQL Server 2008
If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
Links: For better assistance in answering your questions, How to ask a question, Performance Problems, Common date/time routines,
CROSS-TABS and PIVOT tables Part 1 & Part 2, Using APPLY Part 1 & Part 2, Splitting Delimited Strings
Post #1114984
Posted Wednesday, May 25, 2011 12:56 PM
SSC Rookie


Group: General Forum Members
Last Login: Saturday, November 29, 2014 2:36 PM
Points: 42, Visits: 382
@Wayne

It depends on the ability to reliably split the max string at a delimiter into chunks of less than 8000 characters. Otherwise, the ends of the parsed string sets would need to be concatenated.

If I had the delimiter positions, I might be able to split first on the greatest position less than 8000, and then on the greatest delimiter position whose difference from the prior delimiter position was less than 8000.
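Mark's idea (cut at the greatest delimiter position under the limit, repeat for each remaining tail) can be sketched outside T-SQL. A minimal Python model follows; the function name, the 7999-character limit, and the 0-based indexing are all illustrative assumptions, not anything from the thread:

```python
def chunk_at_delimiters(s, delim=",", max_len=7999):
    """Cut s into chunks of at most max_len characters, cutting only at
    delimiter positions so that every element stays whole within its chunk."""
    chunks = []
    start = 0
    while len(s) - start > max_len:
        # greatest delimiter index within the next max_len+1 characters
        cut = s.rfind(delim, start, start + max_len + 1)
        if cut == -1:
            raise ValueError("a single element exceeds max_len; cannot cut")
        chunks.append(s[start:cut])  # the seam delimiter itself is dropped
        start = cut + 1
    chunks.append(s[start:])
    return chunks
```

Because every seam falls on a delimiter, splitting each chunk independently and concatenating the per-chunk results reproduces the split of the whole string, which is exactly the property the varchar(max) approach needs.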

I have no idea how to write the T-SQL nor even if it can be written. It might perform like a dog. I just don't know.

Mark



Post #1114990
Posted Wednesday, May 25, 2011 1:49 PM


SSC-Dedicated


Group: General Forum Members
Last Login: Yesterday @ 7:27 AM
Points: 35,769, Visits: 32,437
mark hutchinson (5/25/2011)
@Jeff

Here's an idea to kick around for varchar(max) data...use a tally table (or cte equivalent) to split the very large string into strings less than 8000 characters and then rejoin those parsed strings.

I have no idea if it can or how it would be implemented, but it was an out-of-the-box idea that just struck me. Thought I would share the idea in case I get struck by a bus.

Mark


That's the basis of the code I wrote. It runs very, very fast. Wayne and I have done some back and forth "what ifs"... I just haven't had the time to bring it up to releasable code, yet. By "Releasable Code", I mean code that's been fully documented and tested for functionality, performance, and scalability on at least a half dozen different types of machines across at least 2 revs of SQL. I've just gotta get some time to make the right kind of test harness, test data, and package to send to a very willing and helpful group of external testers. Considering the number of fires I currently have sticks in, it's not going to happen in the next week or two.

Besides... everyone knows to use a CLR for these types of things, right?


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1115016
Posted Wednesday, May 25, 2011 1:58 PM


SSCertifiable


Group: General Forum Members
Last Login: Yesterday @ 7:58 AM
Points: 5,370, Visits: 9,010
Jeff Moden (5/25/2011)
Besides... everyone knows to use a CLR for these types of things, right?

Yes, I know. But, when you're selling software solutions, you can't mandate their use to your customers.


Wayne
Microsoft Certified Master: SQL Server 2008
If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
Links: For better assistance in answering your questions, How to ask a question, Performance Problems, Common date/time routines,
CROSS-TABS and PIVOT tables Part 1 & Part 2, Using APPLY Part 1 & Part 2, Splitting Delimited Strings
Post #1115023
Posted Wednesday, May 25, 2011 2:02 PM


SSCertifiable


Group: General Forum Members
Last Login: Yesterday @ 7:58 AM
Points: 5,370, Visits: 9,010
mark hutchinson (5/25/2011)
@Wayne

it depends on the ability to split the max string at a delimiter into <8000 sized chunks reliably. Otherwise, the ends of the parsed strings sets would need to be concatenated.

If I had the delimiter positions, I might be able to split first on the greatest position less than 8000 and then the greatest delimiter position whose difference form the prior delimiter position was <8000.

I have no idea how to write the T-SQL nor even if it can be written. It might perform like a dog. I just don't know.

Mark


And what happens if the delimiter is at position 8000?
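For what it's worth, that boundary case falls out naturally if the cut-point search range is inclusive of the limit. A tiny Python sketch (0-based indexing, so position 8000 in the thread's 1-based terms is index 7999; the helper name is made up):

```python
def last_cut(s, delim=",", limit=7999):
    """Greatest delimiter index at or before `limit` (inclusive), or -1 if none.
    A delimiter sitting exactly on the boundary is still a legal cut point."""
    return s.rfind(delim, 0, limit + 1)

# delimiter at 0-based index 7999, i.e. 1-based position 8000
s = "x" * 7999 + "," + "y" * 100
```

Here `last_cut(s)` returns 7999, so the first chunk is the 7999 x's and the remainder begins right after the boundary delimiter.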


Wayne
Microsoft Certified Master: SQL Server 2008
If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
Links: For better assistance in answering your questions, How to ask a question, Performance Problems, Common date/time routines,
CROSS-TABS and PIVOT tables Part 1 & Part 2, Using APPLY Part 1 & Part 2, Splitting Delimited Strings
Post #1115028
Posted Wednesday, May 25, 2011 8:07 PM


SSC-Dedicated


Group: General Forum Members
Last Login: Yesterday @ 7:27 AM
Points: 35,769, Visits: 32,437
WayneS (5/25/2011)
Jeff Moden (5/25/2011)
Besides... everyone knows to use a CLR for these types of things, right?

Yes, I know. But, when you're selling software solutions, you can't mandate their use to your customers.


Heh... absolutely true. That's why I had all the smiley faces on my last.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1115159
Posted Tuesday, June 21, 2011 1:32 PM


Forum Newbie


Group: General Forum Members
Last Login: Friday, November 7, 2014 3:46 PM
Points: 4, Visits: 84
How about this to get rid of the CHARINDEX() string operation:

CREATE FUNCTION dbo.DelimitedSplit8K
--===== Define I/O parameters
        (@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
     -- enough to cover VARCHAR(8000)
WITH E1(N) AS (
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
     ),                                  --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b),     --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b),     --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front
                    -- for both a performance gain and prevention of accidental "overruns"
        SELECT 0 UNION ALL
        SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
     ),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
        SELECT t.N+1
          FROM cteTally t
         WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0
     ),
cteStartEnd(Nb, Ne) AS ( -- Associate each starting position with the next one.
                         -- The sentinel must be 8002, not 8001: with a full 8000-character
                         -- string, Ne - Nb - 1 would otherwise clip the final character.
        SELECT N1, ISNULL((SELECT MIN(N1) FROM cteStart ce WHERE ce.N1 > cs.N1), 8002)
          FROM cteStart cs
     )
--===== Do the actual split.
 SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY se.Nb),
        Item       = SUBSTRING(@pString, se.Nb, se.Ne - se.Nb - 1)
   FROM cteStartEnd se
;

...how does that affect performance?
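The start/end pairing those CTEs build can be modeled in a few lines of Python (0-based indexing here; the past-the-end sentinel plays the same role as the fixed constant in the T-SQL above, and it must reach past the longest possible string for the last element to survive intact):

```python
def split_by_positions(s, delim=","):
    """Split s by pairing each element's start with the next element's start."""
    # start of each element: index 0, plus the index after each delimiter
    starts = [0] + [i + 1 for i, ch in enumerate(s) if ch == delim]
    # end bound for each element is the next start; a past-the-end
    # sentinel (len(s) + 1) serves the final element
    ends = starts[1:] + [len(s) + 1]
    # each element occupies s[b : e - 1], the slice up to its closing delimiter
    return [s[b:e - 1] for b, e in zip(starts, ends)]
```

Consecutive delimiters correctly yield empty elements, the same behavior the T-SQL version exhibits.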

Edited to add: actually, performance goes down a bit, because the subquery that builds cteStartEnd is not indexed. This version does exhibit performance gains in my very modest testing:

CREATE FUNCTION dbo.DelimitedSplit8K
--===== Define I/O parameters
        (@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS @Result TABLE (ItemNumber BIGINT, Item VARCHAR(8000))
WITH SCHEMABINDING AS
BEGIN -- A DECLARE is not legal in an inline TVF, so this version must be a multi-statement TVF.
DECLARE @t_delim TABLE (offset SMALLINT PRIMARY KEY); -- for the index

-- Get the offsets into the in-memory indexed table
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
     -- enough to cover VARCHAR(8000)
WITH E1(N) AS (
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
     ),                                  --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b),     --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b),     --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front
                    -- for both a performance gain and prevention of accidental "overruns"
        SELECT 0 UNION ALL
        SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
     )
INSERT INTO @t_delim (offset) --==== This returns N+1 (starting position of each "element" just once for each delimiter)
 SELECT t.N+1
   FROM cteTally t
  WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0
;

WITH cteStartEnd(Nb, Ne) AS ( -- Associate each starting position with the next one.
                              -- Sentinel is 8002 so a full 8000-character string keeps its last character.
        SELECT offset, ISNULL((SELECT MIN(offset) FROM @t_delim ce WHERE ce.offset > cs.offset), 8002)
          FROM @t_delim cs
     )
--===== Do the actual split.
INSERT INTO @Result (ItemNumber, Item)
 SELECT ROW_NUMBER() OVER (ORDER BY se.Nb),
        SUBSTRING(@pString, se.Nb, se.Ne - se.Nb - 1)
   FROM cteStartEnd se
;
RETURN;
END;

(Note: I'm doing my version as a SP so the above exact code hasn't actually been tested by me...)
Post #1129283
Posted Monday, August 8, 2011 2:11 PM


Old Hand


Group: General Forum Members
Last Login: Wednesday, November 26, 2014 12:33 PM
Points: 392, Visits: 820
mark hutchinson (5/25/2011)
@Jeff

Here's an idea to kick around for varchar(max) data...use a tally table (or cte equivalent) to split the very large string into strings less than 8000 characters and then rejoin those parsed strings.


Mark, we used Jeff's function and expanded it out to varchar(max) with a fair amount of linearity, though I'm looking forward to seeing Jeff's with max handled.

String Length    Elements    CPU Time (ms)
         74           24           2
        148           48           2
        296           96           2
        592          192           3
      1,184          384          32
      2,368          768           7
      4,736        1,536          13
      9,472        3,072          56
     18,944        6,144          49
     37,888       12,288          96
     75,776       24,576         193
    151,552       49,152         385
    303,104       98,304       5,006
    606,208      196,608      10,085
  1,212,416      393,216      19,989

UDF
/*
For improvements to this, see: http://www.sqlservercentral.com/articles/Tally+Table/72993/
Jeff Moden 2011-05-02
*/
CREATE FUNCTION udf_StrList2Table (
@List NVARCHAR(MAX) = N'',
@Delimiter NCHAR(1) = N','
)
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
/*
The following inline delimited splitter was written by Jeff Moden.
http://www.sqlservercentral.com/articles/Tally+Table/72993/
*/
WITH E1(N) AS
( --=== Create Ten 1's
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 --10
),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --100
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10,000

E8(N) AS
(
--==== Limits the number of rows right up front for both a performance gain and prevention of accidental "overruns"
SELECT TOP (DATALENGTH(ISNULL(@List,1))) 1 FROM E4 a, E4 b -- 100,000,000
),
cteTally(N) AS
( --==== This provides the "zero base"
SELECT 0 UNION ALL
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E8
),
cteStart(N1) AS
( --==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT t.N+1 FROM cteTally t WHERE (SUBSTRING(@List,t.N,1) = @Delimiter OR t.N = 0)
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT
ROW_NUMBER() OVER(ORDER BY s.N1) as ListPos,
SUBSTRING(@List,s.N1,ISNULL(NULLIF(CHARINDEX(@Delimiter,@List,s.N1),0)-s.N1,DATALENGTH(ISNULL(@List,1)))) as Value
FROM cteStart s
GO

Testing
DECLARE
@List NVARCHAR(MAX),
@Delimiter NCHAR(1) = N',',
@ListCnt int,
@StartTime datetime2,
@EndTime datetime2,
@ReplicationSeed int = 2,
@ReplicationMultiplier int = 2

DECLARE @Results TABLE (ListLength int, ElementCount int, MSTime int)
WHILE @ReplicationSeed <= 50000
BEGIN
SELECT @List = REPLICATE(CAST(N'ab,a,aeae,3,3,a3,23,4,asa,,434,q4,345' as nvarchar(max)), @ReplicationSeed)
SELECT @StartTime = SYSDATETIME()
SELECT @ListCnt = COUNT(*) FROM udf_StrList2Table(@List, @Delimiter)
SELECT @EndTime = SYSDATETIME()
INSERT INTO @Results (ListLength, ElementCount, MSTime)
SELECT LEN(@List), LEN(@List) - LEN(REPLACE(@List, ',', '')), DATEDIFF(MS, @StartTime, @EndTime)
SELECT
@ReplicationSeed = @ReplicationSeed * @ReplicationMultiplier
END
SELECT * FROM @Results



/* Anything is possible but is it worth it? */
Post #1156384
Posted Friday, August 12, 2011 3:01 AM
Old Hand


Group: General Forum Members
Last Login: Monday, December 15, 2014 3:27 AM
Points: 337, Visits: 2,282
Gatekeeper,


Be careful testing sample data this way, with the constants in the same batch as the code you are benchmarking. The REPLICATE call does not "hide" the final benchmark string from the optimizer. There is a reason Jeff's benchmarking involves a table of random strings; one of those reasons is to avoid tainted results from giving the optimizer a chance to skip logic it would normally have to perform at runtime.

I am not saying your test code necessarily falls victim to this, but make sure it does not!

And thank you for your contribution. I will have to test how that E8 addition affects performance on input that never needs those high-end numbers; if it doesn't affect it, I am going to incorporate it into the fastest version so far and use it in my own projects.

Cheers!
Post #1159040
Posted Friday, August 12, 2011 5:39 AM
SSC Rookie


Group: General Forum Members
Last Login: Saturday, November 29, 2014 2:36 PM
Points: 42, Visits: 382
@Peter

Jeff's new article describes this method and states that the optimizer knows, when the number of rows requested is less than what would be produced at or above one of the joining expressions, not to evaluate that expression.
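A loose analogy for that short-circuiting (not how SQL Server's engine literally works, just a way to picture it): TOP behaves like slicing a lazy generator — rows past the requested count, and the join work that would have produced them, are simply never evaluated:

```python
from itertools import islice

def tally(n, width=10_000):
    """Model of SELECT TOP (n) ... FROM E4 a, E4 b: the cross join could
    produce width * width rows, but the generator is consumed lazily, so
    nothing past the n-th row is ever materialized."""
    rows = (1 for a in range(width) for b in range(width))
    return sum(islice(rows, n))
```

`tally(40)` returns immediately even though the full cross join would be 100,000,000 rows, which mirrors why the E8 addition can be cheap when the input never needs the high end.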

Mark



Post #1159107