|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Wednesday, February 20, 2013 3:54 AM
Points: 9,
Visits: 159
|
|
Hi,
@davecason Your function does only return 2 rows if there is a string like that given: 'bla;bla;bla'. And if a string like this is given 'bla;bla;bla;' 4 rows are returned.
Could you correct that?
regards Norbert
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Sunday, February 26, 2012 3:26 AM
Points: 1,
Visits: 30
|
|
| Dear in your first scenario it being returned 3 rows and in second scenario that will be returned 4 rows according to delimiter counting e.g. ‘bla;bla;bla;’ the last row will be NULL. You can handle it in your code where you want to use this function.
|
|
|
|
|
Grasshopper
      
Group: General Forum Members
Last Login: Saturday, March 10, 2012 8:06 PM
Points: 15,
Visits: 51
|
|
| Norbert, try to fix it the way you want it, and post your code if you get stuck. I intentionally attempted to keep the last token if the delimiter was followed by nothing.
|
|
|
|
|
Old Hand
      
Group: General Forum Members
Last Login: Wednesday, May 15, 2013 6:29 PM
Points: 387,
Visits: 681
|
|
We've found good use of the following tally table and UDF. It handles empty elements and works fast for us. We've done large and small sets against it.
CREATE TABLE dbo.Tally (N INT) -- Default Data -- Taken from the following website: -- http://www.sqlservercentral.com/articles/TSQL/62867/ --============================================================================= -- Create and populate a Tally table --============================================================================= --===== Conditionally drop IF OBJECT_ID('dbo.Tally') IS NOT NULL DROP TABLE dbo.Tally
--===== Create and populate the Tally table on the fly SELECT TOP 3000000 --equates to more than 30 years of dates IDENTITY(INT,1,1) AS N INTO dbo.Tally FROM Master.dbo.SysColumns sc1, Master.dbo.SysColumns sc2
--===== Add a Primary Key to maximize performance ALTER TABLE dbo.Tally ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
--===== Let the public use it GRANT SELECT, REFERENCES ON dbo.Tally TO PUBLIC GO CREATE FUNCTION udf_StrList2Table ( @List NVARCHAR(MAX) = N'', @Delimiter NCHAR(1) = N',' ) RETURNS TABLE AS RETURN SELECT SUBSTRING( @Delimiter + @List + @Delimiter, n + 1, CHARINDEX( @Delimiter, @Delimiter + @List + @Delimiter, n + 1 ) - n - 1 ) AS Value, ROW_NUMBER() OVER ( ORDER BY n ) AS ListPos FROM Tally WHERE SUBSTRING( @Delimiter + @List + @Delimiter, n, 1 ) = @Delimiter AND n < CAST(LEN( @Delimiter + @List + @Delimiter ) as int)
/* Anything is possible but is it worth it? */
|
|
|
|
|
Grasshopper
      
Group: General Forum Members
Last Login: Saturday, March 10, 2012 8:06 PM
Points: 15,
Visits: 51
|
|
The tally table method seems to work fine, but the estimated and actual execution plans show a pretty severe penalty. My guess is that the initial file I/O and memory allocation of the table's index are at fault for those costs. I then derived a little test on about 17,000 tokens using the initial method, the XML method, and the tally table. With that said, the tally table won by a long shot when parsing a large array:
declare @i int = 26, @x varchar(max) = '', @d char(1) = ' ', @j int; declare @t table (id tinyint primary key, tokens int, which varchar(32), start datetime, finish datetime); --Note: this is going to take a while, so if you want to run this more than once, store this data somewhere... set @j = @i*@i; while @j > 0 BEGIN while @i > 0 BEGIN set @x = @x + @d +CHAR(91 - @i); set @i = @i - 1; END set @j = @j - 1; set @i = 26 END
declare @c int; insert into @t (id,which,start) values (1,'fnStringSplit_2005_Distinct_CHAR',getdate()); select @c = COUNT(*) from ..[fnStringSplit_2005_Distinct_CHAR](@x,@d); update @t set tokens = @c, finish = GETDATE() where id = 1; insert into @t (id,which,start) values (2,'udf_StrList2Table',getdate()); select @c = COUNT(*) from ..[udf_StrList2Table] (@x,@d) update @t set tokens = @c, finish = GETDATE() where id = 2; insert into @t (id,which,start) values (3,'fn_Split',getdate()); select @c = COUNT(*) from ..[fn_Split](@x,@d) update @t set tokens = @c, finish = GETDATE() where id = 3; select *, CONVERT(float, finish) - convert(float,start) as runtime from @t;
The results show us that the clear winner as the tally table. Volume seems to be a very large consideration in which method you choose. It looks like arrays of enormous scale should be parsed with a tally table method: id tokens which start finish runtime 2 17577 udf_StrList2Table 54:52.0 54:52.0 0.00000057870056480169 3 17577 fn_Split 54:52.0 55:04.0 0.00013854166900273400 1 17577 fnStringSplit_2005_Distinct_CHAR 51:26.9 54:52.0 0.00237364969507325000
I added back my CTE function and reran it with a limit of 100 tokens (CTE max):
declare @i int = 26, @x varchar(max) = '', @d char(1) = ' ', @j int; declare @t table (id tinyint primary key, tokens int, which varchar(32), start datetime, finish datetime); set @j = @i; while @j > 0 BEGIN while @i > 0 BEGIN set @x = @x + @d +CHAR(91 - @i); set @i = @i - 1; END set @j = @j - 1; set @i = 26 END set @x = LEFT(@x,100)
declare @c int; insert into @t (id,which,start) values (0,'fnStringSplit_CTE',getdate()); select @c = COUNT(*) from ..[fnStringSplit_CTE](@x,@d) update @t set tokens = @c, finish = GETDATE() where id = 0; insert into @t (id,which,start) values (1,'fnStringSplit_2005_Distinct_CHAR',getdate()); select @c = COUNT(*) from ..[fnStringSplit_2005_Distinct_CHAR](@x,@d); update @t set tokens = @c, finish = GETDATE() where id = 1; insert into @t (id,which,start) values (2,'udf_StrList2Table',getdate()); select @c = COUNT(*) from ..[udf_StrList2Table] (@x,@d) update @t set tokens = @c, finish = GETDATE() where id = 2; insert into @t (id,which,start) values (3,'fn_Split',getdate()); select @c = COUNT(*) from ..[fn_Split](@x,@d) update @t set tokens = @c, finish = GETDATE() where id = 3; select *, CONVERT(float, finish) - convert(float,start) as runtime from @t;
The results show CTE is the winner, but the others are fairly close with the tally table last: id tokens which start finish runtime 0 51 fnStringSplit_CTE 10:29.7 10:29.7 0.00000003858440322801 1 51 fnStringSplit_2005_Distinct_CHAR 10:29.7 10:29.7 0.00000027006171876565 3 51 fn_Split 10:29.7 10:29.8 0.00000027006171876565 2 51 udf_StrList2Table 10:29.7 10:29.7 0.00000038580037653446
The funny part about this one is that if I ran it a dozen times, only the tally table's result changed dramatically when splitting a small array of values (it sometimes took a lot longer... randomly).
So what I take from this is CTE is best for small arrays, and the tally table is best for large ones. The one last test I did not try is to run several thousand small splits at volume (may about 200 sets of 100 would do it). The only way I could think of to mock this up is to make a highly parallel package call via SSIS (basically, the same function called 2000 times at the same time). If I feel inspired, I'll give it a shot. Any suggestions on an easier parallelization test are welcome.
|
|
|
|
|
Old Hand
      
Group: General Forum Members
Last Login: Wednesday, May 15, 2013 6:29 PM
Points: 387,
Visits: 681
|
|
Interesting about the varying times with smaller sets of data. I'll do some testing and see what I can find.
/* Anything is possible but is it worth it? */
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Wednesday, February 20, 2013 3:54 AM
Points: 9,
Visits: 159
|
|
Hey guys,
look at here --> udf_ListToTable
maybe you could run it within your performance tests, Norbert
PS: For the way upsidedown take this --> udf_TableToList
|
|
|
|
|
Grasshopper
      
Group: General Forum Members
Last Login: Saturday, March 10, 2012 8:06 PM
Points: 15,
Visits: 51
|
|
That function looks pretty similar to the one from the original article, which I called "fn_Split". It does fine, too, and it will work on older versions of SQL Server.
For joining, there are a couple of basic ways to do it. The tough part can be when you want to pass the table name involved as a variable, as you either you have to do some SQL-Server-2008-only stuff using a newly declared type, or you have to use string concatenation with EXEC, so be careful with that one regarding SQL Injection... AND some of the XML functionality might let you find a back door, too.
The basic premise for joining can use either CTE (as I wrote in my very first blog post) or a COALESCE() or perhaps a CURSOR (which will be a little slow). Here is a quick and dirty example using a table variable and a 2008-only insert statement to fill that test table variable, and then a COALESCE() to join all the rows in that table:
DECLARE @table TABLE (token VARCHAR(max)); INSERT INTO @table VALUES ('xyz'),('pdq'),('abc'),('123') DECLARE @joined VARCHAR(max) SELECT @joined=COALESCE(@joined+',', '')+token FROM @table SELECT @joined
|
|
|
|
|
Grasshopper
      
Group: General Forum Members
Last Login: Saturday, March 10, 2012 8:06 PM
Points: 15,
Visits: 51
|
|
I ran the three types again, with "SET EXECUTION TIME ON" and...
...8788 tokens: XML: CPU time = 203722 ms Tally: CPU time = 47 ms Original: CPU time = 686 ms
|
|
|
|
|
Grasshopper
      
Group: General Forum Members
Last Login: Thursday, February 21, 2013 4:51 AM
Points: 13,
Visits: 51
|
|
Surely other factors like CPU, Memory and Disk Configuration could affect these timings, hence why from my tests the XML was a better fit. There probably isn't one size fits all.
Cheap toilet paper is a false economy, beware!
|
|
|
|