String tokenizing / splitting
nhaberl
Hi,

@davecason
Your function only returns 2 rows when given a string like 'bla;bla;bla', and returns 4 rows when given a string like 'bla;bla;bla;'.

Could you correct that?

regards
Norbert
Irfan.Baig
In your first scenario 3 rows are returned, and in the second scenario 4 rows, according to the delimiter count; e.g. for 'bla;bla;bla;' the last row will be NULL. You can handle that in your code wherever you use this function.
davecason
Norbert, try to fix it the way you want it, and post your code if you get stuck. I intentionally attempted to keep the last token if the delimiter was followed by nothing.
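For reference, a minimal sketch of two ways to get the behaviour Norbert describes; the splitter name below is only a placeholder, not the function from the article:

DECLARE @List VARCHAR(MAX) = 'bla;bla;bla;',
        @Delim CHAR(1) = ';';

-- Option 1: trim a single trailing delimiter before splitting
IF RIGHT(@List, 1) = @Delim
    SET @List = LEFT(@List, LEN(@List) - 1);

-- Option 2: split as-is and discard empty tokens afterwards
-- (dbo.YourSplitFunction is a placeholder for whichever splitter you actually use)
-- SELECT Value FROM dbo.YourSplitFunction(@List, @Delim) WHERE Value <> '';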
Gatekeeper
We've found good use of the following tally table and UDF. It handles empty elements and runs fast for us; we've tested it against both large and small sets.


--=============================================================================
-- Create and populate a Tally table
-- Taken from the following website:
-- http://www.sqlservercentral.com/articles/TSQL/62867/
--=============================================================================
-- Table structure for reference (the script below drops and repopulates it):
CREATE TABLE dbo.Tally (N INT)

--===== Conditionally drop
IF OBJECT_ID('dbo.Tally') IS NOT NULL
    DROP TABLE dbo.Tally

--===== Create and populate the Tally table on the fly
SELECT TOP 3000000 --equates to more than 30 years of dates
       IDENTITY(INT,1,1) AS N
  INTO dbo.Tally
  FROM Master.dbo.SysColumns sc1,
       Master.dbo.SysColumns sc2

--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.Tally
  ADD CONSTRAINT PK_Tally_N
      PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

--===== Let the public use it
GRANT SELECT, REFERENCES ON dbo.Tally TO PUBLIC
GO

-- Splitter: returns one row per delimiter position in @Delimiter + @List + @Delimiter,
-- so empty elements (including one after a trailing delimiter) come back as empty strings
CREATE FUNCTION udf_StrList2Table (
    @List      NVARCHAR(MAX) = N'',
    @Delimiter NCHAR(1)      = N','
)
RETURNS TABLE
AS
RETURN
    SELECT SUBSTRING( @Delimiter + @List + @Delimiter, n + 1,
                      CHARINDEX( @Delimiter, @Delimiter + @List + @Delimiter, n + 1 ) - n - 1 ) AS Value,
           ROW_NUMBER() OVER ( ORDER BY n ) AS ListPos
    FROM Tally
    WHERE SUBSTRING( @Delimiter + @List + @Delimiter, n, 1 ) = @Delimiter
      AND n < CAST(LEN( @Delimiter + @List + @Delimiter ) AS int)
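As a quick sanity check of the trailing-delimiter case discussed above, based on the function definition (and assuming it was created in the dbo schema), the fourth element comes back as an empty string rather than being dropped:

SELECT Value, ListPos
FROM dbo.udf_StrList2Table(N'bla;bla;bla;', N';')
ORDER BY ListPos;
-- Expected: ('bla', 1), ('bla', 2), ('bla', 3), ('', 4)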



/* Anything is possible but is it worth it? */
davecason
The tally table method seems to work fine, but the estimated and actual execution plans show a pretty severe penalty. My guess is that the initial file I/O and the memory allocation for the table's index account for those costs. I then put together a little test on about 17,000 tokens using the initial method, the XML method, and the tally table. That said, the tally table won by a long shot when parsing a large array:

declare @i int = 26, @x varchar(max) = '', @d char(1) = ' ', @j int;
declare @t table (id tinyint primary key, tokens int, which varchar(32), start datetime, finish datetime);
--Note: this is going to take a while, so if you want to run this more than once, store this data somewhere...
--Build a delimited string of 26 * 26 * 26 = 17,576 single-letter tokens
set @j = @i*@i;
while @j > 0 BEGIN
    while @i > 0 BEGIN
        set @x = @x + @d + CHAR(91 - @i);
        set @i = @i - 1;
    END
    set @j = @j - 1;
    set @i = 26
END

--Time each splitter by counting the tokens it returns
declare @c int;
insert into @t (id,which,start) values (1,'fnStringSplit_2005_Distinct_CHAR',getdate());
select @c = COUNT(*) from ..[fnStringSplit_2005_Distinct_CHAR](@x,@d);
update @t set tokens = @c, finish = GETDATE() where id = 1;
insert into @t (id,which,start) values (2,'udf_StrList2Table',getdate());
select @c = COUNT(*) from ..[udf_StrList2Table] (@x,@d)
update @t set tokens = @c, finish = GETDATE() where id = 2;
insert into @t (id,which,start) values (3,'fn_Split',getdate());
select @c = COUNT(*) from ..[fn_Split](@x,@d)
update @t set tokens = @c, finish = GETDATE() where id = 3;
select *, CONVERT(float, finish) - convert(float,start) as runtime from @t;



The results show that the clear winner is the tally table. Volume seems to be a very large consideration in which method you choose; it looks like arrays of enormous scale should be parsed with a tally table method:

id  tokens  which                             start    finish   runtime
2   17577   udf_StrList2Table                 54:52.0  54:52.0  0.00000057870056480169
3   17577   fn_Split                          54:52.0  55:04.0  0.00013854166900273400
1   17577   fnStringSplit_2005_Distinct_CHAR  51:26.9  54:52.0  0.00237364969507325000

I added back my CTE function and reran it with the input cut down to 100 characters (100 being the default CTE recursion limit):

declare @i int = 26, @x varchar(max) = '', @d char(1) = ' ', @j int;
declare @t table (id tinyint primary key, tokens int, which varchar(32), start datetime, finish datetime);
--Build a short delimited string and cap it at 100 characters
set @j = @i;
while @j > 0 BEGIN
    while @i > 0 BEGIN
        set @x = @x + @d + CHAR(91 - @i);
        set @i = @i - 1;
    END
    set @j = @j - 1;
    set @i = 26
END
set @x = LEFT(@x,100)

--Time each splitter, this time including the CTE version
declare @c int;
insert into @t (id,which,start) values (0,'fnStringSplit_CTE',getdate());
select @c = COUNT(*) from ..[fnStringSplit_CTE](@x,@d)
update @t set tokens = @c, finish = GETDATE() where id = 0;
insert into @t (id,which,start) values (1,'fnStringSplit_2005_Distinct_CHAR',getdate());
select @c = COUNT(*) from ..[fnStringSplit_2005_Distinct_CHAR](@x,@d);
update @t set tokens = @c, finish = GETDATE() where id = 1;
insert into @t (id,which,start) values (2,'udf_StrList2Table',getdate());
select @c = COUNT(*) from ..[udf_StrList2Table] (@x,@d)
update @t set tokens = @c, finish = GETDATE() where id = 2;
insert into @t (id,which,start) values (3,'fn_Split',getdate());
select @c = COUNT(*) from ..[fn_Split](@x,@d)
update @t set tokens = @c, finish = GETDATE() where id = 3;
select *, CONVERT(float, finish) - convert(float,start) as runtime from @t;



The results show the CTE is the winner, but the others are fairly close, with the tally table last:

id  tokens  which                             start    finish   runtime
0   51      fnStringSplit_CTE                 10:29.7  10:29.7  0.00000003858440322801
1   51      fnStringSplit_2005_Distinct_CHAR  10:29.7  10:29.7  0.00000027006171876565
3   51      fn_Split                          10:29.7  10:29.8  0.00000027006171876565
2   51      udf_StrList2Table                 10:29.7  10:29.7  0.00000038580037653446

The funny part about this one is that when I ran it a dozen times, only the tally table's result changed dramatically when splitting a small array of values (it sometimes took a lot longer, seemingly at random).

So what I take from this is that the CTE is best for small arrays, and the tally table is best for large ones. The one last test I did not try is to run several thousand small splits at volume (maybe about 200 sets of 100 would do it). The only way I could think of to mock this up is to make a highly parallel package call via SSIS (basically, the same function called 2000 times at the same time). If I feel inspired, I'll give it a shot. Any suggestions on an easier parallelization test are welcome.
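Not the parallel SSIS run described above, but a rough single-connection stand-in for the "many small splits" idea (about 200 repeated calls against a short list); the splitter name is a placeholder:

DECLARE @i INT = 26, @x VARCHAR(MAX) = '', @d CHAR(1) = ' ';
-- Build a short list of 26 single-letter tokens
WHILE @i > 0 BEGIN
    SET @x = @x + @d + CHAR(91 - @i);
    SET @i = @i - 1;
END

DECLARE @n INT = 200, @c INT, @start DATETIME = GETDATE();
WHILE @n > 0 BEGIN
    SELECT @c = COUNT(*) FROM dbo.YourSplitFunction(@x, @d);  -- placeholder splitter
    SET @n = @n - 1;
END
SELECT DATEDIFF(ms, @start, GETDATE()) AS elapsed_ms;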
Gatekeeper
Interesting about the varying times with smaller sets of data. I'll do some testing and see what I can find.

/* Anything is possible but is it worth it? */
nhaberl
Hey guys,

have a look here --> udf_ListToTable

maybe you could run it within your performance tests,
Norbert


PS: For the opposite direction (table back to list) take this --> udf_TableToList
davecason
That function looks pretty similar to the one from the original article, which I called "fn_Split". It does fine, too, and it will work on older versions of SQL Server.

For joining, there are a couple of basic ways to do it. The tough part can be when you want to pass the table name involved as a variable, as either you have to do some SQL-Server-2008-only stuff using a newly declared type, or you have to use string concatenation with EXEC, so be careful with that one regarding SQL injection... and some of the XML functionality might let you find a back door, too.

The basic premise for joining can use either a CTE (as I wrote in my very first blog post), a COALESCE(), or perhaps a CURSOR (which will be a little slow). Here is a quick and dirty example using a table variable, a 2008-only insert statement to fill that test table variable, and then a COALESCE() to join all the rows in that table:

DECLARE @table TABLE (token VARCHAR(max));
INSERT INTO @table VALUES ('xyz'),('pdq'),('abc'),('123')
DECLARE @joined VARCHAR(max)
SELECT @joined=COALESCE(@joined+',', '')+token FROM @table
SELECT @joined
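Along the lines of the "SQL-Server-2008-only stuff using a newly declared type" mentioned above, here is a rough sketch of a user-defined table type passed as a read-only parameter, which sidesteps the EXEC-plus-concatenation injection worry; the type and procedure names are made up for illustration:

-- Declare a table type once (SQL Server 2008+)
CREATE TYPE dbo.TokenList AS TABLE (token VARCHAR(MAX));
GO
-- A procedure that accepts the type as a table-valued parameter
CREATE PROCEDURE dbo.UseTokens
    @Tokens dbo.TokenList READONLY
AS
    SELECT token FROM @Tokens;
GO
-- Fill an instance of the type and pass it in
DECLARE @t dbo.TokenList;
INSERT INTO @t VALUES ('xyz'),('pdq'),('abc'),('123');
EXEC dbo.UseTokens @Tokens = @t;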


davecason
I ran the three types again, with "SET STATISTICS TIME ON" and...

...8788 tokens:
XML: CPU time = 203722 ms
Tally: CPU time = 47 ms
Original: CPU time = 686 ms
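For anyone repeating this, a minimal sketch of the measurement pattern (the splitter name is a placeholder for whichever function is being timed):

SET STATISTICS TIME ON;
DECLARE @x VARCHAR(MAX) = 'a b c', @d CHAR(1) = ' ', @c INT;
-- CPU time and elapsed time for the statement show up on the Messages tab
SELECT @c = COUNT(*) FROM dbo.YourSplitFunction(@x, @d);
SET STATISTICS TIME OFF;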
Craig Sunderland
Surely other factors like CPU, memory, and disk configuration could affect these timings, which is why, from my tests, the XML approach was a better fit. There probably isn't one size that fits all.

Cheap toilet paper is a false economy, beware!