SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


String tokenizing / splitting


String tokenizing / splitting

Author
Message
Gatekeeper
Gatekeeper
SSChasing Mays
SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)

Group: General Forum Members
Points: 613 Visits: 888
Craig Sunderland (3/7/2011)
Surely other factors like CPU, Memory and Disk Configuration could affect these timings, hence why from my tests the XML was a better fit. There probably isn't one size fits all.


Yes, it would skew the tests from one machine to the other but he's been testing on the same machine and instance. If the results widely vary, then it shows it doesn't scale well. However, the three ideas tested has their own disadvantages. One requires CLR procs to be enabled, another needs a tally table, and the other works well with smaller delimited lists. If you plan to never have a list of hundreds or thousands, XML will work without any other object. But if you need to have that flexibility, it's not a big performance hit to use the tally function for smaller sets too. Testing and re-visiting those tests from time to time will let you know if you're still using the best function for your current hardware/data distribution.

/* Anything is possible but is it worth it? */
davecason
davecason
SSC Journeyman
SSC Journeyman (97 reputation)SSC Journeyman (97 reputation)SSC Journeyman (97 reputation)SSC Journeyman (97 reputation)SSC Journeyman (97 reputation)SSC Journeyman (97 reputation)SSC Journeyman (97 reputation)SSC Journeyman (97 reputation)

Group: General Forum Members
Points: 97 Visits: 51
My theory on the XML problem was that it was actually a CLR call-out. Tally seems to work well, it shows an up front execution cost that is not comparable to the others, but it seems to perform better.

I'm going to try to tweak each idea a bit and put them under more competition scenarios to see which ones fare better at scale with parallel use.
davecason
davecason
SSC Journeyman
SSC Journeyman (97 reputation)SSC Journeyman (97 reputation)SSC Journeyman (97 reputation)SSC Journeyman (97 reputation)SSC Journeyman (97 reputation)SSC Journeyman (97 reputation)SSC Journeyman (97 reputation)SSC Journeyman (97 reputation)

Group: General Forum Members
Points: 97 Visits: 51
I had a little more fun... again using 17,000+ tokens.

I put the CTE version inside the original function by creating 200 character tokens using the original method (sort of), assuming a single character delimiter, then using CTE to further split them into arrays. It still has a few bugs but it did improve the overall results (hopefully, not due to the bugs):
Original + CTE Hybrid: CPU time = 468 ms, elapsed time = 473 ms.
Original: CPU time = 686 ms, elapsed time = 734 ms.
Tally: CPU time = 172 ms, elapsed time = 157 ms.

CREATE FUNCTION [dbo].[fn_Split_Hybrid]
(
@InputString VARCHAR(max),
@Delimiter char(1)
)
RETURNS @Values TABLE (
VALUE VARCHAR(max)
-- ,IPS VARCHAR(MAX)
--,DCI BIGINT
)
AS
BEGIN
DECLARE @DelimitierLen tinyint = 200
DECLARE @DelimiterCharIndex bigint = @DelimitierLen --BIGINT = CHARINDEX(@Delimiter,@InputString)

WHILE (@DelimiterCharIndex >= @DelimitierLen )
BEGIN
declare @newval varchar(200) = left(@InputString, @DelimitierLen)
SET @InputString = SUBSTRING(@InputString,@DelimitierLen+1, LEN(@InputString)-@DelimitierLen)
SET @DelimiterCharIndex = LEN(@inputString)
--INSERT INTO @Values VALUES (@newval)--, @InputString,@DelimiterCharIndex)
BEGIN
with split(i, token, remainder) as
(select 1
, left(@newval,charindex(@Delimiter,@newval)-1)
, LTRIM(right(@newval,len(@newval)-CHARINDEX(@Delimiter,@newval)))
union all
select i + 1
,case when charindex(@Delimiter,remainder) > 0 then
left(remainder,charindex(@Delimiter,remainder)-1)
else remainder end as token
,LTRIM(right(remainder,len(remainder)-CHARINDEX(@Delimiter,remainder))) as remainder
from split
where charindex(@Delimiter,remainder) >= 0 and token != remainder
)
insert into @Values
Select token
from split
END
END
set @newval = @InputString
BEGIN

with split(i, token, remainder) as
(select 1
, left(@newval,charindex(@Delimiter,@newval)-1)
, LTRIM(right(@newval,len(@newval)-CHARINDEX(@Delimiter,@newval)))
union all
select i + 1
,case when charindex(@Delimiter,remainder) > 0 then
left(remainder,charindex(@Delimiter,remainder)-1)
else remainder end as token
,LTRIM(right(remainder,len(remainder)-CHARINDEX(@Delimiter,remainder))) as remainder
from split
where charindex(@Delimiter,remainder) >= 0 and token != remainder
)
insert into @Values
Select token
from split
END

RETURN
END

GO



Gatekeeper
Gatekeeper
SSChasing Mays
SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)

Group: General Forum Members
Points: 613 Visits: 888
As promised, I did some testing (sorry it took this long). This was run from my laptop in a Win2K3 Server VM. The results showed a pretty consistent run time of ~40 ms. Running the script in a loop of 20 times, the max ms were 50 and min was 30. The average was 40.45 and a median of 41. I will note that yes, there is a penalty for the first load of the tally table, which I saw the first time I ran it in tempdb. However, if it's used often, it will probably stay in memory, which it does in our system.

But to be fair, I did a DBCC freeproccache and DBCC dropcleanbuffers between the tests below and did experience the occasional doubling of time you experienced but it wasn't consistent.

Other changes made to your test script:
1) The udf takes nvarchar so I changed chars to nchar to remove the implicit conversion
2) Used datetime2
3) DATEDIFF for time difference calculation
4) Added GO 20 to repeat the script 20 times


declare @i int = 26, @x nvarchar(max) = N'', @d nchar(1) = N' ', @j int;
declare @t table (id tinyint primary key, tokens int, which nvarchar(32), start datetime2, finish datetime2);
--Note: this is going to take a while, so if you want to run this more than once, store this data somewhere...

set @j = @i*@i;
while @j > 0 BEGIN
while @i > 0 BEGIN
set @x = @x + @d +NCHAR(91 - @i);
set @i = @i - 1;
END
set @j = @j - 1;
set @i = 26
END

declare @c int;
update @t set tokens = @c, finish = sysdatetime() where id = 1;
insert into @t (id,which,start) values (2,'udf_StrList2Table',getdate());
select @c = COUNT(*) from ..[udf_StrList2Table] (@x,@d)
update @t set tokens = @c, finish = sysdatetime() where id = 2;
select *, DATEDIFF(millisecond, start, finish) as runtime from @t;
GO 20



/* Anything is possible but is it worth it? */
Gatekeeper
Gatekeeper
SSChasing Mays
SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)

Group: General Forum Members
Points: 613 Visits: 888
Jeff Moden posted a very nice article about CSV parsing. His code was written for a max input string of 8K so if you need more, stay tuned for an upcoming article from him. I tested his code vs mine and it completely blows it away. Not to mention his CTE tally table did much better than relying on a physical Tally table.

Tally Oh!

/* Anything is possible but is it worth it? */
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (210K reputation)SSC Guru (210K reputation)SSC Guru (210K reputation)SSC Guru (210K reputation)SSC Guru (210K reputation)SSC Guru (210K reputation)SSC Guru (210K reputation)SSC Guru (210K reputation)

Group: General Forum Members
Points: 210325 Visits: 41973
Gatekeeper (5/2/2011)
Not to mention his CTE tally table...


Thank you for the wonderful kudo but I absolutely cannot take the credit for the cteTally. The original concept was first published by Itzik Ben-Gan and he used a "base 2" generator. The "base 10" generator that I included in my article was the result of a lot of good people's (Lynn Petis started the ball rolling in this area) efforts here on SSC where we tested many different "bases". I chose to stick with the "base 10" generator because, as with many different based generators, it appeared to have a minor speed advantage over the "base 2" generator, took fewer Cross Joins to ramp up to the larger numbers and, as a result, took a bit less code.

Still, I have to give the credit to Itzik for the wonderful and very useful idea especially since it generates virtually no reads. The use of a physical Tally Table will still beat it by a bit in some applications but you simply have to love the "no reads" aspect of Itzik's fine idea.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Gatekeeper
Gatekeeper
SSChasing Mays
SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)SSChasing Mays (613 reputation)

Group: General Forum Members
Points: 613 Visits: 888
Jeff Moden (2/25/2012)
Still, I have to give the credit to Itzik for the wonderful and very useful idea especially since it generates virtually no reads. The use of a physical Tally Table will still beat it by a bit in some applications but you simply have to love the "no reads" aspect of Itzik's fine idea.


Yes, the no reads is very hard to not pass up when thinking about tally tables, thanks to Itzik.

/* Anything is possible but is it worth it? */
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search