String tokenizing / splitting
Posted Monday, March 07, 2011 7:54 AM


Old Hand


Group: General Forum Members
Last Login: Wednesday, April 16, 2014 9:10 PM
Points: 393, Visits: 782
Craig Sunderland (3/7/2011)
Surely other factors like CPU, memory, and disk configuration could affect these timings, which is why the XML approach was a better fit in my tests. There probably isn't a one-size-fits-all answer.


Yes, it would skew the tests from one machine to another, but he's been testing on the same machine and instance. If the results vary widely, that shows the approach doesn't scale well. However, each of the three ideas tested has its own disadvantages: one requires CLR procs to be enabled, another needs a tally table, and the third only works well with smaller delimited lists. If you never expect a list of hundreds or thousands of items, XML will work without any other object. But if you need that flexibility, it's not a big performance hit to use the tally function for smaller sets too. Testing, and revisiting those tests from time to time, will tell you whether you're still using the best function for your current hardware and data distribution.
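
For reference, here is a minimal sketch of the XML technique being discussed (the list and variable names below are illustrative, not taken from the tests above): replace each delimiter with an XML tag boundary, cast the result, and shred the nodes.

-- Minimal sketch of the XML split technique (illustrative input).
-- Caveat: the CAST to XML fails if the list contains characters that
-- are special in XML, such as &, < or >.
DECLARE @List      VARCHAR(MAX) = 'alpha bravo charlie',
        @Delimiter CHAR(1)      = ' ',
        @Xml       XML;

SET @Xml = CAST('<r>' + REPLACE(@List, @Delimiter, '</r><r>') + '</r>' AS XML);

SELECT n.value('.', 'VARCHAR(8000)') AS Token
FROM @Xml.nodes('/r') AS x(n);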


/* Anything is possible but is it worth it? */
Post #1074176
Posted Monday, March 07, 2011 7:40 PM
Grasshopper


Group: General Forum Members
Last Login: Saturday, March 10, 2012 8:06 PM
Points: 15, Visits: 51
My theory on the XML problem was that it was actually a CLR call-out under the covers. The tally approach seems to work well: it shows an up-front execution cost the others don't have, but it performs better overall.

I'm going to tweak each idea a bit and run them through more concurrent scenarios to see which ones fare better at scale under parallel use. For anyone following along, the tally technique under discussion is sketched below.
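
Roughly, a tally-based splitter is an inline function driven by a table of sequential integers. The function name here is hypothetical, and it assumes an existing dbo.Tally table of integers starting at 1:

-- Sketch of a tally-table splitter. The list is padded with delimiters so
-- every token is bounded on both sides; each tally-table row that lands on
-- a delimiter marks the start of the token that follows it.
CREATE FUNCTION dbo.fn_TallySplit_Demo  -- hypothetical name
(
    @List      VARCHAR(8000),
    @Delimiter CHAR(1)
)
RETURNS TABLE
AS
RETURN
    SELECT SUBSTRING(@Delimiter + @List + @Delimiter, t.N + 1,
                     CHARINDEX(@Delimiter, @Delimiter + @List + @Delimiter, t.N + 1)
                         - t.N - 1) AS Token
    FROM dbo.Tally AS t
    WHERE t.N < LEN(@Delimiter + @List + @Delimiter)
      AND SUBSTRING(@Delimiter + @List + @Delimiter, t.N, 1) = @Delimiter;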
Post #1074571
Posted Monday, March 07, 2011 10:17 PM
Grasshopper


Group: General Forum Members
Last Login: Saturday, March 10, 2012 8:06 PM
Points: 15, Visits: 51
I had a little more fun... again using 17,000+ tokens.

I put the CTE version inside the original function: it first carves the input into 200-character chunks using (more or less) the original method, assuming a single-character delimiter, and then uses a recursive CTE to split each chunk into tokens. It still has a few bugs, but it did improve the overall results (hopefully not because of the bugs):
Original + CTE Hybrid: CPU time = 468 ms, elapsed time = 473 ms.
Original: CPU time = 686 ms, elapsed time = 734 ms.
Tally: CPU time = 172 ms, elapsed time = 157 ms.
CREATE FUNCTION [dbo].[fn_Split_Hybrid]
(
    @InputString VARCHAR(MAX),
    @Delimiter   CHAR(1)
)
RETURNS @Values TABLE (VALUE VARCHAR(MAX))
AS
BEGIN
    -- Carve the input into chunks of at most 200 characters, backing off to
    -- the last delimiter in each chunk so no token is cut at the boundary.
    -- Assumes a single-character delimiter and no token longer than 200
    -- characters. Chunking also keeps the recursive CTE's depth under the
    -- default MAXRECURSION limit of 100, which cannot be raised inside a
    -- function.
    DECLARE @ChunkLen INT = 200;
    DECLARE @Chunk    VARCHAR(200);
    DECLARE @Cut      INT;

    WHILE LEN(@InputString) > 0
    BEGIN
        IF LEN(@InputString) <= @ChunkLen
        BEGIN
            -- Final piece: take everything that is left.
            SET @Chunk = @InputString;
            SET @InputString = '';
        END
        ELSE
        BEGIN
            -- Position of the last delimiter within the first 200 characters.
            SET @Cut = @ChunkLen + 1
                     - CHARINDEX(@Delimiter, REVERSE(LEFT(@InputString, @ChunkLen)));
            SET @Chunk = LEFT(@InputString, @Cut - 1);
            SET @InputString = SUBSTRING(@InputString, @Cut + 1, LEN(@InputString));
        END;

        -- Recursive CTE: peel one token per level off the front of the chunk.
        WITH split (token, remainder) AS
        (
            SELECT CAST(CASE WHEN CHARINDEX(@Delimiter, @Chunk) > 0
                             THEN LEFT(@Chunk, CHARINDEX(@Delimiter, @Chunk) - 1)
                             ELSE @Chunk END AS VARCHAR(200)),
                   CAST(CASE WHEN CHARINDEX(@Delimiter, @Chunk) > 0
                             THEN SUBSTRING(@Chunk, CHARINDEX(@Delimiter, @Chunk) + 1, LEN(@Chunk))
                             ELSE '' END AS VARCHAR(200))
            UNION ALL
            SELECT CAST(CASE WHEN CHARINDEX(@Delimiter, remainder) > 0
                             THEN LEFT(remainder, CHARINDEX(@Delimiter, remainder) - 1)
                             ELSE remainder END AS VARCHAR(200)),
                   CAST(CASE WHEN CHARINDEX(@Delimiter, remainder) > 0
                             THEN SUBSTRING(remainder, CHARINDEX(@Delimiter, remainder) + 1, LEN(remainder))
                             ELSE '' END AS VARCHAR(200))
            FROM split
            WHERE LEN(remainder) > 0
        )
        INSERT INTO @Values (VALUE)
        SELECT token FROM split;
    END;

    RETURN;
END

GO
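
It is called like any other table-valued function, for example:

-- Example call (illustrative input):
SELECT VALUE
FROM dbo.fn_Split_Hybrid('alpha,bravo,charlie', ',');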


Post #1074601
Posted Saturday, April 02, 2011 11:28 PM


Old Hand


Group: General Forum Members
Last Login: Wednesday, April 16, 2014 9:10 PM
Points: 393, Visits: 782
As promised, I did some testing (sorry it took this long). This was run from my laptop in a Win2K3 Server VM. The results showed a fairly consistent run time of ~40 ms. Running the script in a loop 20 times, the max was 50 ms, the min was 30 ms, the average 40.45 ms, and the median 41 ms. I will note that yes, there is a penalty for the first load of the tally table, which I saw the first time I ran it in tempdb. However, if it's used often, it will probably stay in memory, which it does in our system.

But to be fair, I did run DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS between the tests below, and I did see the occasional doubling of time you described, but it wasn't consistent.

Other changes made to your test script:
1) The udf takes nvarchar so I changed chars to nchar to remove the implicit conversion
2) Used datetime2
3) DATEDIFF for time difference calculation
4) Added GO 20 to repeat the script 20 times

declare @i int = 26, @x nvarchar(max) = N'', @d nchar(1) = N' ', @j int;
declare @t table (id tinyint primary key, tokens int, which nvarchar(32), start datetime2, finish datetime2);
--Note: this is going to take a while, so if you want to run this more than once, store this data somewhere...

-- Build a space-delimited string of 26 * 26 * 26 = 17,576 single-character tokens.
set @j = @i * @i;
while @j > 0 begin
    while @i > 0 begin
        set @x = @x + @d + NCHAR(91 - @i);  -- appends A through Z
        set @i = @i - 1;
    end
    set @j = @j - 1;
    set @i = 26;
end

declare @c int;
update @t set tokens = @c, finish = sysdatetime() where id = 1;
insert into @t (id, which, start) values (2, 'udf_StrList2Table', sysdatetime());
select @c = COUNT(*) from ..[udf_StrList2Table] (@x, @d);
update @t set tokens = @c, finish = sysdatetime() where id = 2;
select *, DATEDIFF(millisecond, start, finish) as runtime from @t;
GO 20



/* Anything is possible but is it worth it? */
Post #1087839
Posted Monday, May 02, 2011 10:30 AM


Old Hand


Group: General Forum Members
Last Login: Wednesday, April 16, 2014 9:10 PM
Points: 393, Visits: 782
Jeff Moden posted a very nice article about CSV parsing. His code was written for a max input string of 8K, so if you need more, stay tuned for an upcoming article from him. I tested his code against mine, and his blows mine away. Not to mention that his CTE tally table did much better than relying on a physical Tally table.

Tally Oh!


/* Anything is possible but is it worth it? */
Post #1101725
Posted Saturday, February 25, 2012 3:11 PM


SSC-Dedicated


Group: General Forum Members
Last Login: Today @ 8:11 AM
Points: 36,015, Visits: 30,302
Gatekeeper (5/2/2011)
Not to mention his CTE tally table...


Thank you for the wonderful kudo, but I absolutely cannot take the credit for the cteTally. The original concept was first published by Itzik Ben-Gan, and he used a "base 2" generator. The "base 10" generator that I included in my article was the result of a lot of good people's efforts here on SSC (Lynn Petis started the ball rolling in this area), where we tested many different "bases". I chose to stick with the "base 10" generator because it appeared to have a minor speed advantage over the "base 2" generator, took fewer CROSS JOINs to ramp up to the larger numbers and, as a result, took a bit less code.

Still, I have to give the credit to Itzik for the wonderful and very useful idea especially since it generates virtually no reads. The use of a physical Tally Table will still beat it by a bit in some applications but you simply have to love the "no reads" aspect of Itzik's fine idea.
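
For anyone who hasn't seen it, the general shape of a "base 10" generator is along these lines (a sketch of the idea, not the exact code from the article):

-- "Base 10" cteTally sketch: E1 is 10 constant rows, and each CROSS JOIN
-- multiplies the row count by 10. ROW_NUMBER() turns the rows into a
-- sequence. No table or index is touched, so it generates no reads.
WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),
                                     (1),(1),(1),(1),(1)) AS v(N)), -- 10 rows
     E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),                 -- 100 rows
     E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b),                 -- 10,000 rows
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4)
SELECT N
FROM cteTally
WHERE N <= 1000;  -- take only as many numbers as needed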


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

"Change is inevitable. Change for the better is not." -- 04 August 2013
(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1257871
Posted Saturday, February 25, 2012 7:35 PM


Old Hand


Group: General Forum Members
Last Login: Wednesday, April 16, 2014 9:10 PM
Points: 393, Visits: 782
Jeff Moden (2/25/2012)
Still, I have to give the credit to Itzik for the wonderful and very useful idea especially since it generates virtually no reads. The use of a physical Tally Table will still beat it by a bit in some applications but you simply have to love the "no reads" aspect of Itzik's fine idea.


Yes, the no-reads aspect is very hard to pass up when thinking about tally tables. Thanks to Itzik.


/* Anything is possible but is it worth it? */
Post #1257886