
String tokenizing / splitting
Posted Monday, February 28, 2011 3:32 AM
Forum Newbie

Group: General Forum Members
Last Login: Tuesday, November 19, 2013 9:29 AM
Points: 9, Visits: 183
Hi,

@davecason
Your function returns only 2 rows when given a string like 'bla;bla;bla', and when given a string like 'bla;bla;bla;' it returns 4 rows.

Could you correct that?

regards
Norbert
Post #1070396
Posted Monday, February 28, 2011 4:19 AM
Forum Newbie

Group: General Forum Members
Last Login: Sunday, February 26, 2012 3:26 AM
Points: 1, Visits: 30
In your first scenario 3 rows are returned, and in the second 4 rows, one per delimiter-separated position; e.g. for 'bla;bla;bla;' the last row will be NULL. You can handle that in your own code wherever you use this function.
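
For example, the trailing NULL row can be filtered at the call site. A minimal sketch, assuming the splitter is the article's dbo.fn_Split and returns a column named token (names may differ in your copy):

-- Drop the NULL/empty row that a trailing delimiter produces.
SELECT token
FROM dbo.fn_Split('bla;bla;bla;', ';')
WHERE token IS NOT NULL AND token <> '';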
Post #1070415
Posted Monday, February 28, 2011 10:38 PM
Grasshopper

Group: General Forum Members
Last Login: Saturday, March 10, 2012 8:06 PM
Points: 15, Visits: 51
Norbert, try to fix it the way you want it, and post your code if you get stuck. I intentionally kept the last token even when the delimiter was followed by nothing.
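
One way to get the behavior you want would be to strip a trailing delimiter before calling the function; a minimal sketch, assuming the input and delimiter are held in @List and @Delimiter:

-- Remove one trailing delimiter so 'bla;bla;bla;' yields 3 rows, not 4.
IF RIGHT(@List, 1) = @Delimiter
    SET @List = LEFT(@List, LEN(@List) - 1);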
Post #1070989
Posted Wednesday, March 02, 2011 12:20 PM


Old Hand

Group: General Forum Members
Last Login: 2 days ago @ 9:10 PM
Points: 393, Visits: 782
We've found good use for the following tally table and UDF. It handles empty elements and performs quickly for us on both large and small sets.

-- Tally table setup, taken from the following website:
-- http://www.sqlservercentral.com/articles/TSQL/62867/
--=============================================================================
-- Create and populate a Tally table
--=============================================================================
--===== Conditionally drop
IF OBJECT_ID('dbo.Tally') IS NOT NULL
   DROP TABLE dbo.Tally

--===== Create and populate the Tally table on the fly
-- (SELECT ... INTO creates the table, so no separate CREATE TABLE is needed)
SELECT TOP 3000000 -- enough rows to split strings up to 3,000,000 characters
       IDENTITY(INT,1,1) AS N
  INTO dbo.Tally
  FROM Master.dbo.SysColumns sc1,
       Master.dbo.SysColumns sc2

--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.Tally
  ADD CONSTRAINT PK_Tally_N
      PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

--===== Let the public use it
GRANT SELECT, REFERENCES ON dbo.Tally TO PUBLIC
GO
CREATE FUNCTION dbo.udf_StrList2Table (
    @List      NVARCHAR(MAX) = N'',
    @Delimiter NCHAR(1)      = N','
)
RETURNS TABLE
AS
RETURN
    -- Wrap the list in delimiters so every element, including empty ones,
    -- is bounded by a delimiter on both sides.
    SELECT SUBSTRING( @Delimiter + @List + @Delimiter, n + 1,
                      CHARINDEX( @Delimiter, @Delimiter + @List + @Delimiter, n + 1 ) - n - 1 ) AS Value,
           ROW_NUMBER() OVER ( ORDER BY n ) AS ListPos
      FROM dbo.Tally
     -- Every Tally number that lands on a delimiter marks the start of the
     -- element that follows it.
     WHERE SUBSTRING( @Delimiter + @List + @Delimiter, n, 1 ) = @Delimiter
       AND n < CAST(LEN( @Delimiter + @List + @Delimiter ) AS int)
GO
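
For anyone trying it out, a quick usage sketch (assuming the Tally table and function above exist in the current database):

-- Empty elements come back as empty strings; ListPos preserves order.
SELECT Value, ListPos
FROM dbo.udf_StrList2Table(N'alpha,,beta,gamma', N',')
ORDER BY ListPos;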



/* Anything is possible but is it worth it? */
Post #1072130
Posted Wednesday, March 02, 2011 9:26 PM
Grasshopper

Group: General Forum Members
Last Login: Saturday, March 10, 2012 8:06 PM
Points: 15, Visits: 51
The tally table method seems to work fine, but the estimated and actual execution plans show a pretty severe penalty; my guess is that the initial file I/O and memory allocation for the table's index account for those costs. I then devised a little test on about 17,000 tokens using the initial method, the XML method, and the tally table. That said, the tally table won by a long shot when parsing a large array:
declare @i int = 26, @x varchar(max) = '', @d char(1) = ' ', @j int;
declare @t table (id tinyint primary key, tokens int, which varchar(32), start datetime, finish datetime);
--Note: this is going to take a while, so if you want to run this more than once, store this data somewhere...
set @j = @i*@i;
while @j > 0 BEGIN
    while @i > 0 BEGIN
        set @x = @x + @d + CHAR(91 - @i);
        set @i = @i - 1;
    END
    set @j = @j - 1;
    set @i = 26;
END

declare @c int;
insert into @t (id,which,start) values (1,'fnStringSplit_2005_Distinct_CHAR',getdate());
select @c = COUNT(*) from ..[fnStringSplit_2005_Distinct_CHAR](@x,@d);
update @t set tokens = @c, finish = GETDATE() where id = 1;
insert into @t (id,which,start) values (2,'udf_StrList2Table',getdate());
select @c = COUNT(*) from ..[udf_StrList2Table](@x,@d);
update @t set tokens = @c, finish = GETDATE() where id = 2;
insert into @t (id,which,start) values (3,'fn_Split',getdate());
select @c = COUNT(*) from ..[fn_Split](@x,@d);
update @t set tokens = @c, finish = GETDATE() where id = 3;
select *, CONVERT(float, finish) - CONVERT(float, start) as runtime from @t;

The results show that the clear winner is the tally table. Volume seems to be a very large consideration in which method you choose; it looks like arrays of enormous scale should be parsed with a tally table method:
id  tokens  which                             start    finish   runtime
2   17577   udf_StrList2Table                 54:52.0  54:52.0  0.00000057870056480169
3   17577   fn_Split                          54:52.0  55:04.0  0.00013854166900273400
1   17577   fnStringSplit_2005_Distinct_CHAR  51:26.9  54:52.0  0.00237364969507325000

I added back my CTE function and reran it with a limit of 100 tokens (the default CTE recursion limit):
declare @i int = 26, @x varchar(max) = '', @d char(1) = ' ', @j int;
declare @t table (id tinyint primary key, tokens int, which varchar(32), start datetime, finish datetime);
set @j = @i;
while @j > 0 BEGIN
    while @i > 0 BEGIN
        set @x = @x + @d + CHAR(91 - @i);
        set @i = @i - 1;
    END
    set @j = @j - 1;
    set @i = 26;
END
set @x = LEFT(@x,100);

declare @c int;
insert into @t (id,which,start) values (0,'fnStringSplit_CTE',getdate());
select @c = COUNT(*) from ..[fnStringSplit_CTE](@x,@d);
update @t set tokens = @c, finish = GETDATE() where id = 0;
insert into @t (id,which,start) values (1,'fnStringSplit_2005_Distinct_CHAR',getdate());
select @c = COUNT(*) from ..[fnStringSplit_2005_Distinct_CHAR](@x,@d);
update @t set tokens = @c, finish = GETDATE() where id = 1;
insert into @t (id,which,start) values (2,'udf_StrList2Table',getdate());
select @c = COUNT(*) from ..[udf_StrList2Table](@x,@d);
update @t set tokens = @c, finish = GETDATE() where id = 2;
insert into @t (id,which,start) values (3,'fn_Split',getdate());
select @c = COUNT(*) from ..[fn_Split](@x,@d);
update @t set tokens = @c, finish = GETDATE() where id = 3;
select *, CONVERT(float, finish) - CONVERT(float, start) as runtime from @t;

The results show the CTE is the winner, but the others are fairly close, with the tally table last:
id  tokens  which                             start    finish   runtime
0   51      fnStringSplit_CTE                 10:29.7  10:29.7  0.00000003858440322801
1   51      fnStringSplit_2005_Distinct_CHAR  10:29.7  10:29.7  0.00000027006171876565
3   51      fn_Split                          10:29.7  10:29.8  0.00000027006171876565
2   51      udf_StrList2Table                 10:29.7  10:29.7  0.00000038580037653446

The funny part about this one is that when I ran it a dozen times, only the tally table's result changed dramatically when splitting a small array of values (it sometimes took much longer, seemingly at random).

So what I take from this is that the CTE is best for small arrays, and the tally table is best for large ones. The one test I did not try is running several thousand small splits at volume (maybe 200 sets of 100 would do it). The only way I could think of to mock this up is a highly parallel package call via SSIS (basically, the same function called 2000 times at the same time). If I feel inspired, I'll give it a shot. Any suggestions on an easier parallelization test are welcome.
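
In the meantime, a crude serial baseline is easy in plain T-SQL; it is not truly parallel, but it gives a volume reference point (the function name and counts below are placeholders):

-- 200 serial splits of a ~100-token list, timed as one batch.
DECLARE @s NVARCHAR(MAX) = REPLICATE(N'x,', 99) + N'x';
DECLARE @runs INT = 200, @c INT, @t0 DATETIME = GETDATE();
WHILE @runs > 0
BEGIN
    SELECT @c = COUNT(*) FROM dbo.udf_StrList2Table(@s, N',');
    SET @runs = @runs - 1;
END
SELECT DATEDIFF(ms, @t0, GETDATE()) AS total_ms;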
Post #1072321
Posted Wednesday, March 02, 2011 10:21 PM


Old Hand

Group: General Forum Members
Last Login: 2 days ago @ 9:10 PM
Points: 393, Visits: 782
Interesting about the varying times with smaller sets of data. I'll do some testing and see what I can find.

/* Anything is possible but is it worth it? */
Post #1072336
Posted Thursday, March 03, 2011 4:44 AM
Forum Newbie

Group: General Forum Members
Last Login: Tuesday, November 19, 2013 9:29 AM
Points: 9, Visits: 183
Hey guys,

have a look here --> udf_ListToTable

Maybe you could run it within your performance tests.
Norbert


PS: For the opposite direction, take this --> udf_TableToList
Post #1072436
Posted Friday, March 04, 2011 8:52 PM
Grasshopper

Group: General Forum Members
Last Login: Saturday, March 10, 2012 8:06 PM
Points: 15, Visits: 51
That function looks pretty similar to the one from the original article, which I called "fn_Split". It does fine, too, and it will work on older versions of SQL Server.

For joining, there are a couple of basic ways to do it. The tough part comes when you want to pass the table name as a variable: either you do some SQL-Server-2008-only stuff using a newly declared type, or you use string concatenation with EXEC, so be careful with that one regarding SQL injection... and some of the XML functionality might open a back door, too.
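
To make that 2008-only route concrete: the idea is a user-defined table type passed as a READONLY parameter, which avoids dynamic SQL entirely. A minimal sketch with hypothetical names:

-- SQL Server 2008+: declare a reusable table type once...
CREATE TYPE dbo.TokenList AS TABLE (token VARCHAR(100));
GO
-- ...then pass it to procedures instead of concatenating a table name into EXEC.
CREATE PROCEDURE dbo.UseTokens @tokens dbo.TokenList READONLY
AS
    SELECT token FROM @tokens;
GO
DECLARE @t dbo.TokenList;
INSERT INTO @t VALUES ('xyz'), ('pdq');
EXEC dbo.UseTokens @tokens = @t;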

The basic premise for joining can use a CTE (as I wrote in my very first blog post), a COALESCE(), or perhaps a CURSOR (which will be a little slow). Here is a quick and dirty example using a table variable, a 2008-only row-constructor INSERT to fill that test table variable, and then a COALESCE() to join all the rows in that table:
DECLARE @table TABLE (token VARCHAR(MAX));
INSERT INTO @table VALUES ('xyz'), ('pdq'), ('abc'), ('123');
DECLARE @joined VARCHAR(MAX);
-- Variable-assignment concatenation: COALESCE supplies '' for the first row
-- so the result doesn't start with a comma.
SELECT @joined = COALESCE(@joined + ',', '') + token FROM @table;
SELECT @joined;
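
And since the XML functionality came up: the usual FOR XML PATH idiom (SQL Server 2005+) does the same join. A sketch, assuming it runs in the same batch as the snippet above:

-- STUFF strips the leading comma; note XML-reserved characters get entitized.
DECLARE @joined2 VARCHAR(MAX);
SELECT @joined2 = STUFF(
    (SELECT ',' + token FROM @table FOR XML PATH('')), 1, 1, '');
SELECT @joined2;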

Post #1073653
Posted Saturday, March 05, 2011 2:47 PM
Grasshopper

Group: General Forum Members
Last Login: Saturday, March 10, 2012 8:06 PM
Points: 15, Visits: 51
I ran the three types again with "SET STATISTICS TIME ON" and...

...8788 tokens:
XML: CPU time = 203722 ms
Tally: CPU time = 47 ms
Original: CPU time = 686 ms
Post #1073774
Posted Monday, March 07, 2011 2:34 AM
Grasshopper

Group: General Forum Members
Last Login: Tuesday, March 04, 2014 8:50 AM
Points: 13, Visits: 61
Surely other factors like CPU, memory, and disk configuration could affect these timings, which is why the XML method was a better fit in my tests. There probably isn't a one-size-fits-all answer.

Cheap toilet paper is a false economy, beware!
Post #1074000