The tally table method seems to work fine, but the estimated and actual execution plans show a fairly severe penalty; my guess is that the initial file I/O and the memory allocation for the table's index account for those costs. So I put together a small test on roughly 17,000 tokens comparing the initial method, the XML method, and the tally table. When parsing a large array, the tally table won by a long shot:
declare @i int = 26, @x varchar(max) = '', @d char(1) = ' ', @j int;
declare @t table (id tinyint primary key, tokens int, which varchar(32), start datetime, finish datetime);
--Note: this is going to take a while, so if you want to run this more than once, store this data somewhere...
set @j = @i * @i;
while @j > 0 BEGIN
    while @i > 0 BEGIN
        set @x = @x + @d + CHAR(91 - @i);
        set @i = @i - 1;
    END
    set @i = 26;
    set @j = @j - 1;  -- without this the outer loop never terminates
END
declare @C int;
-- assumes the split functions live in the dbo schema
insert into @t (id, which, start) values (1, 'fnStringSplit_2005_Distinct_CHAR', getdate());
select @C = COUNT(*) from dbo.[fnStringSplit_2005_Distinct_CHAR](@x, @d);
update @t set tokens = @C, finish = GETDATE() where id = 1;
insert into @t (id, which, start) values (2, 'udf_StrList2Table', getdate());
select @C = COUNT(*) from dbo.[udf_StrList2Table](@x, @d);
update @t set tokens = @C, finish = GETDATE() where id = 2;
insert into @t (id, which, start) values (3, 'fn_Split', getdate());
select @C = COUNT(*) from dbo.[fn_Split](@x, @d);
update @t set tokens = @C, finish = GETDATE() where id = 3;
select *, CONVERT(float, finish) - CONVERT(float, start) as runtime from @t;
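As an aside, the nested WHILE loops above build the test string one concatenation at a time. A set-based sketch (it assumes master..spt_values is available, as it is on stock SQL Server installs) generates the same 26 * 676 = 17,576 tokens with far fewer concatenations:

-- Sketch: build ' A B ... Z' once, then repeat it 26*26 times.
-- Assumes master..spt_values (type 'P' covers numbers 0-2047 on stock installs).
declare @block varchar(max) = '', @big varchar(max);
select @block = @block + ' ' + CHAR(64 + number)
from master..spt_values
where type = 'P' and number between 1 and 26
order by number;
set @big = REPLICATE(@block, 26 * 26);  -- 17,576 tokens total

Note that string concatenation via variable assignment has no guaranteed order across rows, so treat this as a convenience for test data rather than production code.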
The results show that the clear winner is the tally table. Volume appears to be a major factor in choosing a method: arrays of enormous scale should be parsed with the tally table approach:
id  tokens  which                             start    finish   runtime
--  ------  --------------------------------  -------  -------  ----------------------
2   17577   udf_StrList2Table                 54:52.0  54:52.0  0.00000057870056480169
3   17577   fn_Split                          54:52.0  55:04.0  0.00013854166900273400
1   17577   fnStringSplit_2005_Distinct_CHAR  51:26.9  54:52.0  0.00237364969507325000
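One note on the runtime column: CONVERT(float, finish) - CONVERT(float, start) yields elapsed time in days, which is why the numbers are so tiny. A sketch of a more readable variant (run against the same @t table variable as the script above) reports milliseconds instead:

-- Sketch: elapsed time in milliseconds rather than fractional days.
-- DATETIME resolution is about 3.33 ms, so very fast runs may report 0.
select *, DATEDIFF(ms, start, finish) as runtime_ms
from @t;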
I added my CTE function back in and reran the test with a limit of 100 tokens (the CTE recursion max):
declare @i int = 26, @x varchar(max) = '', @d char(1) = ' ', @j int;
declare @t table (id tinyint primary key, tokens int, which varchar(32), start datetime, finish datetime);
set @j = @i;
while @j > 0 BEGIN
    while @i > 0 BEGIN
        set @x = @x + @d + CHAR(91 - @i);
        set @i = @i - 1;
    END
    set @i = 26;
    set @j = @j - 1;  -- without this the outer loop never terminates
END
set @x = LEFT(@x,100)
declare @C int;
-- assumes the split functions live in the dbo schema
insert into @t (id, which, start) values (0, 'fnStringSplit_CTE', getdate());
select @C = COUNT(*) from dbo.[fnStringSplit_CTE](@x, @d);
update @t set tokens = @C, finish = GETDATE() where id = 0;
insert into @t (id, which, start) values (1, 'fnStringSplit_2005_Distinct_CHAR', getdate());
select @C = COUNT(*) from dbo.[fnStringSplit_2005_Distinct_CHAR](@x, @d);
update @t set tokens = @C, finish = GETDATE() where id = 1;
insert into @t (id, which, start) values (2, 'udf_StrList2Table', getdate());
select @C = COUNT(*) from dbo.[udf_StrList2Table](@x, @d);
update @t set tokens = @C, finish = GETDATE() where id = 2;
insert into @t (id, which, start) values (3, 'fn_Split', getdate());
select @C = COUNT(*) from dbo.[fn_Split](@x, @d);
update @t set tokens = @C, finish = GETDATE() where id = 3;
select *, CONVERT(float, finish) - CONVERT(float, start) as runtime from @t;
The results show the CTE as the winner here, though the others are fairly close, with the tally table coming in last:
id  tokens  which                             start    finish   runtime
--  ------  --------------------------------  -------  -------  ----------------------
0   51      fnStringSplit_CTE                 10:29.7  10:29.7  0.00000003858440322801
1   51      fnStringSplit_2005_Distinct_CHAR  10:29.7  10:29.7  0.00000027006171876565
3   51      fn_Split                          10:29.7  10:29.8  0.00000027006171876565
2   51      udf_StrList2Table                 10:29.7  10:29.7  0.00000038580037653446
The funny part is that when I ran this a dozen times, only the tally table's result changed dramatically when splitting a small array; it sometimes took far longer, seemingly at random.
So what I take from this is that the CTE is best for small arrays, and the tally table is best for large ones. The one test I have not tried is running several thousand small splits at volume (maybe 200 sets of 100 tokens would do it). The only way I can think of to mock this up is a highly parallel package call via SSIS (essentially the same function called 2,000 times at once). If I feel inspired, I'll give it a shot. Suggestions for an easier parallelization test are welcome.
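Short of SSIS, a serial loop can at least approximate the volume half of that test. This is only a sketch: it measures repeated-call overhead rather than true parallelism, and it assumes a dbo.fnStringSplit_CTE function like the one tested above exists:

-- Sketch: 2,000 back-to-back small splits; volume, not parallelism.
declare @small varchar(100) = 'A B C D E F G H I J', @d char(1) = ' ';
declare @n int = 2000, @c int, @s datetime = GETDATE();
while @n > 0
begin
    select @c = COUNT(*) from dbo.fnStringSplit_CTE(@small, @d);
    set @n = @n - 1;
end
select DATEDIFF(ms, @s, GETDATE()) as total_ms;

For genuine parallelism you would still need multiple connections; tools like ostress can fire the same script from many sessions at once.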