Convert CSV values in three columns to rows

Jeff Moden (SSC Guru)
Ok... here we go...

First, here are 4 functions. I believe Oleg will recognize the one called "XML-1 (Split8KXML1 mlTVF)"....

--=====================================================================================================================
-- Create the various functions to test
--=====================================================================================================================
--===== Do this in a nice safe place that everyone has
USE TempDB;

--===== Tally Table (Split8KTally iTVF) ===============================================================================
IF OBJECT_ID('dbo.Split8KTally') IS NOT NULL
DROP FUNCTION dbo.Split8KTally;
GO
CREATE FUNCTION dbo.Split8KTally
(@Parameter VARCHAR(8000), @Delimiter VARCHAR(1))
RETURNS TABLE AS
RETURN
SELECT CAST(ROW_NUMBER() OVER (ORDER BY N) AS INT) AS ItemNumber,
SUBSTRING(@Parameter,N,CHARINDEX(@Delimiter,@Parameter+@Delimiter,N)-N) AS ItemValue
FROM dbo.Tally
WHERE N BETWEEN 1 AND LEN(@Parameter)
AND SUBSTRING(@Delimiter+@Parameter,N,1) = @Delimiter; --Notice how we find the comma
GO
--===== XML-1 (Split8KXML1 mlTVF) =====================================================================================
IF OBJECT_ID('dbo.Split8KXML1') IS NOT NULL
DROP FUNCTION dbo.Split8KXML1;
GO
CREATE FUNCTION dbo.Split8KXML1
(@Parameter VARCHAR(MAX), @Delimiter VARCHAR(1))
RETURNS @Result TABLE
(ItemNumber INT, ItemValue VARCHAR(8000)) AS
BEGIN
DECLARE @XML XML;
SELECT @XML = '<r>'+REPLACE(@Parameter, @Delimiter, '</r><r>')+'</r>';

INSERT INTO @Result (ItemNumber, ItemValue)
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ItemNumber,
Item.value('text()[1]', 'varchar(max)') AS ItemValue
FROM @XML.nodes('//r') R(Item);
RETURN;
END;
GO
--===== XML-3 (Split8KXML3 iTVF) ======================================================================================
IF OBJECT_ID('dbo.Split8KXML3') IS NOT NULL
DROP FUNCTION dbo.Split8KXML3;
GO
CREATE FUNCTION dbo.Split8KXML3
(@Parameter VARCHAR(8000), @Delimiter VARCHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
SELECT CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS INT) AS ItemNumber,
R.Item.value('text()[1]', 'varchar(max)') AS ItemValue
FROM (SELECT CAST('<r>'+REPLACE(@Parameter, @Delimiter, '</r><r>')+'</r>' AS XML)) X(N)
CROSS APPLY N.nodes('//r') R(Item)
;
GO
--===== Loop Method 1 (Split8KL1 mlTVF) ===============================================================================
IF OBJECT_ID('dbo.Split8KL1') IS NOT NULL
DROP FUNCTION dbo.Split8KL1;
GO
CREATE FUNCTION dbo.Split8KL1
(@Parameter VARCHAR(8000), @Delimiter VARCHAR(1))
RETURNS @Result TABLE (ItemNumber INT IDENTITY(1,1), ItemValue VARCHAR(8000))
AS
BEGIN
--===== Declare a variable to remember the position of the current comma
DECLARE @N INT;
--===== Add start and end commas to the Parameter so we can handle
-- single elements
SELECT @Parameter = @Delimiter + @Parameter + @Delimiter,
--===== Preassign the current comma as the first character
@N = 1;
--===== Loop through and find each comma, then insert the string value
-- found between the current comma and the next comma. @N is
-- the position of the current comma.
WHILE @N < LEN(@Parameter) --Don't include the last comma
BEGIN
--==== Do the insert using the value between the commas
INSERT INTO @Result (ItemValue)
SELECT SUBSTRING(@Parameter, @N+1, CHARINDEX(@Delimiter, @Parameter, @N+1)-@N-1);
--==== Find the next comma
SELECT @N = CHARINDEX(@Delimiter, @Parameter, @N+1);
END; --END While
RETURN;
END; --END Function
GO
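
With all four functions in place, here's a quick sanity check (an addition for this write-up, assuming dbo.Tally already exists; its creation script appears later in the thread) to confirm the splitters agree before timing them:

--===== Sanity check (added): all four splitters should return the same three rows
SELECT 'Tally' AS Method, ItemNumber, ItemValue FROM dbo.Split8KTally('123,45,6789', ',')
UNION ALL SELECT 'XML1', ItemNumber, ItemValue FROM dbo.Split8KXML1('123,45,6789', ',')
UNION ALL SELECT 'XML3', ItemNumber, ItemValue FROM dbo.Split8KXML3('123,45,6789', ',')
UNION ALL SELECT 'Loop', ItemNumber, ItemValue FROM dbo.Split8KL1('123,45,6789', ',')
ORDER BY Method, ItemNumber;
GO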




Next, some test data. Read the comments for where to make changes to get the data to vary, please...

--=====================================================================================================================
-- Build the test data
--=====================================================================================================================
--===== Do this in a nice safe place that everyone has
USE TempDB;

--===== Conditionally drop the test table to make reruns easier
IF OBJECT_ID('TempDB..CsvTest','U') IS NOT NULL
DROP TABLE CsvTest;

--===== This creates and populates a test table on the fly containing a
-- sequential column and a randomly generated CSV Parameter column.
SELECT TOP (10000) --Controls the number of rows in the test table
ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
(
SELECT CAST(STUFF( --=== STUFF gets rid of the leading comma
( --=== This builds CSV row with a leading comma
SELECT TOP (16) --Controls the number of CSV elements in each row
','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
FROM dbo.Tally t3 --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t4 --can produce row sets of up to 121 million rows.
WHERE t1.N <> t3.N --Without this line, all rows would be the same
FOR XML PATH('')
)
,1,1,'') AS NVARCHAR(MAX))
) AS CsvParameter
INTO CsvTest
FROM dbo.Tally t1 --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t2; --can produce row sets of up to 121 million rows.

--===== Let's add a PK just for grins. Since it's a temp table, we won't name it.
ALTER TABLE CsvTest
ADD PRIMARY KEY CLUSTERED (RowNum) WITH FILLFACTOR = 100;
GO
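
Before running the timed tests, a quick added check (not in the original post) confirms the generator produced what we expect and that no row exceeds the splitters' 8K limit:

--===== Eyeball a couple of rows and confirm the longest CSV fits in 8K
SELECT TOP (2) * FROM CsvTest;
SELECT COUNT(*) AS RowCnt, MAX(LEN(CsvParameter)) AS MaxLen FROM CsvTest;
GO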




Last but not least, let's test the functions. I ran the following for 4 different sets of test data with SQL Profiler running. You can tell what's what just by reading the following SQL Profiler output.

--=====================================================================================================================
-- Run the functions (Profiler turned on for this given SPID)
--=====================================================================================================================
--===== Do this in a nice safe place that everyone has
USE TempDB;
GO
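--===== Note (added comment): each test below assigns its results to local
--      variables rather than returning a result set, so client display time
--      is kept out of the measured durations.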
--===== Tally Table (Split8KTally iTVF) ===============================================================================
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;

SELECT @RowNum = csv.RowNum, @ItemNumber = split.ItemNumber, @ItemValue = CAST(split.ItemValue AS INT)
FROM CsvTest csv
CROSS APPLY dbo.Split8KTally(csv.CsvParameter,',') AS split;
GO
--===== XML-1 (Split8KXML1 mlTVF) =====================================================================================
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;

SELECT @RowNum = csv.RowNum, @ItemNumber = split.ItemNumber, @ItemValue = CAST(split.ItemValue AS INT)
FROM CsvTest csv
CROSS APPLY dbo.Split8KXML1(csv.CsvParameter,',') AS split;
GO
--===== XML-3 (Split8KXML3 iTVF) ======================================================================================
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;

SELECT @RowNum = csv.RowNum, @ItemNumber = split.ItemNumber, @ItemValue = CAST(split.ItemValue AS INT)
FROM CsvTest csv
CROSS APPLY dbo.Split8KXML3(csv.CsvParameter,',') AS split;
GO
--===== Loop Method 1 (Split8KL1 mlTVF) ===============================================================================
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;

SELECT @RowNum = csv.RowNum, @ItemNumber = split.ItemNumber, @ItemValue = CAST(split.ItemValue AS INT)
FROM CsvTest csv
CROSS APPLY dbo.Split8KL1(csv.CsvParameter,',') AS split;
GO




Here's the profiler run. Even the WHILE loop beat both versions of the XML. You don't really want to see the 25 minutes it takes the XML to do 100 elements for 10K rows.

[Attachment: XML Split Race.gif (Profiler results)]

The reason I initially said the XML was looking good is that I'd made a mistake in the creation of my Tally table... I used BIGINT instead of INT and the implicit conversions made a mess of the Tally table solution. Obviously, I fixed that for this experiment.
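
For anyone curious what that mistake looks like, here's a small illustration (my sketch, not from the original post): CHARINDEX returns INT, so mixing it with a BIGINT N silently promotes the whole expression to BIGINT on every row.

--===== Added illustration: INT minus BIGINT is implicitly widened to BIGINT
DECLARE @N BIGINT;
SELECT @N = 1;
SELECT SQL_VARIANT_PROPERTY(CHARINDEX(',', 'a,b', @N) - @N, 'BaseType') AS ExprType; --returns bigint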

My machine is an 8-year-old single P4 1.8GHz with 1GB RAM running Windows XP Pro SP3 and SQL Server 2005 Developer's Edition SP3 with no CUs.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think it's expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

When you put the right degree of spin on it, the number 318 is also a glyph that describes the nature of a DBA's job. ;-)

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Attachments
XML Split Race.gif (533 views, 15.00 KB)
Jeff Moden (SSC Guru)
Long story short... I wouldn't use XML for splitting if I were you. :-)

--Jeff Moden

Oleg Netchaev (SSCarpal Tunnel)
Jeff,

I am terribly sorry, but there is no way that your test reflects the whole picture. I ran only the Split8KTally and XML tests, and the results I get are totally different. First, I would like to clarify: you did mean 25 seconds, not 25 minutes, to process 10,000 records with 100 items per record, right?

The biggest problem with Split8KTally was pointed out a while ago by Lynn Pettis, who correctly said that it becomes pretty slow as the datalength of the items in the record gets close to 8K. This assessment is absolutely correct, yet you tested a rather small number of items per record (4, 16, etc.). When I ran your tests I saw that the smaller the number of items to split, the better Split8KTally looks. So, for 16 items per record, Split8KTally takes 1 second while XML takes 4. However, when I increased the records to hold 100 items each, Split8KTally took 15 seconds while XML took 20; still faster, but not by as much. When I increased the number to 1,000 per record, which basically pushes Split8KTally to about as much as it can handle, I saw that

Split8KTally took 4 minutes 39 seconds;
XML still took 20 seconds.

Then I could no longer test Split8KTally (it does not handle more than 8K worth of CSV data per record), but continued with XML. I increased the number of items to split in each record to 10,000, which made the longest record 117,840 characters and the number of rows resulting from the CROSS APPLY 100 million (10,000 records with 10,000 CSV values in each). This took 45 seconds.

This clearly shows that XML handily beats the other methods (outside of a CLR procedure, which can be written to complete the whole thing in a heartbeat) once the data to split becomes big enough in number of CSV values.

Honestly, I would not even have gone for all this re-testing, but after I saw the 25 minutes instead of the expected 25 seconds, coupled with the fact that a WHILE loop (which should not even exist in T-SQL) could beat my XML solution, I had to do it.

Please let me know if my assumptions are incorrect in some way. I used your harness without making any changes.

Thanks,

Oleg
Jeff Moden (SSC Guru)
My tests DO reflect the total picture... just not on your machine. ;-) They show the test results on my machine. It's obvious that my machine is different from yours (mine cannot do parallelism... it only has 1 CPU, for example), so I'm setting up to do the tests on a more modern machine than my 8-year-old desktop. We'll see soon.

Please post the CREATE TABLE statement for your Tally Table. Be sure to include any indexes it may have. It would also be helpful to know what your default COLLATION is, please. We need to make sure we're comparing apples to apples. Thanks.

And to be sure, the only thing I care about right now is splitting things that are 8k or less. We already know that JOINing to a MAX column is terrible for performance and that a While Loop is one of the ways to go there.
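
For reference, that loop approach carries over to long strings mostly by widening the types; here's a minimal sketch (my illustration only, the function name is invented, and it simply mirrors Split8KL1 above):

CREATE FUNCTION dbo.SplitMaxL1
(@Parameter VARCHAR(MAX), @Delimiter VARCHAR(1))
RETURNS @Result TABLE (ItemNumber INT IDENTITY(1,1), ItemValue VARCHAR(MAX))
AS
BEGIN
--===== @N is BIGINT because LEN/CHARINDEX return BIGINT for MAX types
DECLARE @N BIGINT;
SELECT @Parameter = @Delimiter + @Parameter + @Delimiter, @N = 1;
WHILE @N < LEN(@Parameter)
BEGIN
INSERT INTO @Result (ItemValue)
SELECT SUBSTRING(@Parameter, @N+1, CHARINDEX(@Delimiter, @Parameter, @N+1)-@N-1);
SELECT @N = CHARINDEX(@Delimiter, @Parameter, @N+1);
END;
RETURN;
END;
GO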

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

When you put the right degree of spin on it, the number 318 is also a glyph that describes the nature of a DBAs job. Wink

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Jeff Moden (SSC Guru)
Ok, Oleg... not sure what the differences are between your machine and the two that I'm running but the Tally table beats the two XML methods pretty badly on both.

First, here's my Tally table setup just so we're talking apples and apples. Also, make sure you're using my test generator so you're testing for "real life" very random data with commas in different positions in each row.

--===== Do this in a nice safe place that everyone has
USE TempDB;
--===================================================================
-- Create a Tally table from 1 to 11000
--===================================================================
--===== Create and populate the Tally table on the fly.
-- The IDENTITY function makes the column NOT NULL
-- so we can put a Primary Key on it
SELECT TOP 11000
IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM Master.sys.ALL_Columns ac1
CROSS JOIN Master.sys.ALL_Columns ac2
;
--===== Add a CLUSTERED Primary Key to maximize performance
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
;
--===== Allow the general public to use it
GRANT SELECT ON dbo.Tally TO PUBLIC
;
GO
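
A one-line check (an addition, not in the original post) confirms the table covers the range the 8K splitters assume, i.e. N running from 1 up past 8000:

SELECT MIN(N) AS MinN, MAX(N) AS MaxN, COUNT(*) AS RowCnt FROM dbo.Tally;
GO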




Here are the run results from my desktop machine (8-year-old single P4 1.8GHz, 1GB RAM). Do notice the 100 element test, please. And, hell no... with speeds like that, there's no way I'm running a 1,000 element test.

[Attachment: DeskTop Runs.gif (desktop timings)]

Here are the exact same runs from a real server running four 32-bit Xeons at 3GHz on Windows 2003 and SQL Server 2005 SP3 with 2GB of memory allocated to SQL Server. Again... notice the 100 element runs.

[Attachment: Real Server Runs.gif (server timings)]

I just might try the 1000 element runs on the server (well... probably NOT the XML3 version).

--Jeff Moden

Attachments
DeskTop Runs.gif (539 views, 25.00 KB)
Real Server Runs.gif (559 views, 33.00 KB)
Jeff Moden (SSC Guru)
Oh yeah... I almost forgot. Here's the default collation on my desktop box...

SQL_Latin1_General_CP1_CI_AS

... and on my server box...

SQL_Latin1_General_CP1_CI_AS

What we need to find out now is why XML seems to run so nice and fast on your box. Please let me know what the configuration is so I can try to find one similar to test on. Thanks.

--Jeff Moden

Oleg Netchaev (SSCarpal Tunnel)
Jeff,

So far, I can see that I use the same setup as far as the tally creation is concerned. The collation is, of course, the same as well.

You are correct in not using XML3; it is designed to fail due to the CROSS APPLY inside of it. This is what prompted me to modify my answer on Ask after I saw Kevan's. My original answer included the same dreaded CROSS APPLY, but once I removed it, it became faster than Kevan's tally for sizeable strings. I still deleted the answer from Ask altogether because I respect your opinion (though I cannot agree with it yet). I will try to play with settings and also check whether it is possible to tweak the XML a bit.

One thing I saw for sure, and I hope you will agree: with huge strings, XML execution time does not increase much, but the tally split's sure does.

I think the most relevant test we can have will be with the string size somewhere in the middle, say 4K per record, meaning about 400 items to split. More than 800 makes the string greater than 8K in size.
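
For concreteness, reproducing that middle case with the test generator from earlier in the thread is a one-line change (a sketch; the exact string length still depends on the element width):

SELECT TOP (400) --Controls the number of CSV elements in each row
','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))

--===== After rebuilding CsvTest, confirm whether anything crossed the 8K boundary
SELECT MAX(LEN(CsvParameter)) AS MaxLen FROM CsvTest;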

I would like to point out that while you do mention an apples-to-apples comparison, the XML function you wrote does use VARCHAR(MAX). I will leave it as is, though, because XML-ifying the input bloats it somewhat, so let VARCHAR(MAX) stay in the XML function.

Oleg
Jeff Moden (SSC Guru)
Oleg Netchaev (9/18/2010)
Jeff,

So far, I can see that I use the same setup as far as the tally creation is concerned. The collation is, of course, the same as well. You are correct in not using XML3; it is designed to fail due to the CROSS APPLY inside of it. This is what prompted me to modify my answer on Ask after I saw Kevan's. My original answer included the same dreaded CROSS APPLY, but once I removed it, it became faster than Kevan's tally for sizeable strings. I still deleted the answer from Ask altogether because I respect your opinion (though I cannot agree with it yet). I will try to play with settings and also check whether it is possible to tweak the XML a bit. One thing I saw for sure, and I hope you will agree: with huge strings, XML execution time does not increase much, but the tally split's sure does.

I think the most relevant test we can have will be with the string size somewhere in the middle, say 4K per record, meaning about 400 items to split. More than 800 makes the string greater than 8K in size.

I would like to point out that while you do mention an apples-to-apples comparison, the XML function you wrote does use VARCHAR(MAX). I will leave it as is, though, because XML-ifying the input bloats it somewhat, so let VARCHAR(MAX) stay in the XML function.

Oleg


Hi Oleg,

I appreciate the respect and I do wish you'd put your good post back up on Ask.

I definitely agree that the Tally table stumbles pretty badly on things above 8k and I'm setting up to test that.

On the apples-to-apples thing, I left VARCHAR(MAX) in the XML code because, in order to get anything close to 8K of data, you have to use VARCHAR(MAX) to account for the addition of the 7-character tag information at each delimiter. Just to be complete, though, I'll try it with just VARCHAR(8000) at the same 100 elements I've been testing with so far.
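
For a concrete sense of that overhead (an added check, not from the original post): each delimiter is replaced by the 7-character '</r><r>' tag pair (a net +6 characters), and the outer '<r>'...'</r>' adds 7 more, so a near-8K string with many elements overflows VARCHAR(8000) as soon as it's tagged.

DECLARE @s VARCHAR(8000);
SELECT @s = '123,45,6789';
SELECT LEN(@s) AS RawLen, --11
LEN('<r>'+REPLACE(@s, ',', '</r><r>')+'</r>') AS TaggedLen; --30 = 11 + 2*6 + 7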

I've got some folks participating in the 100 element test right now. If it works out OK, I'll try to figure out a way to do a bit more automated and extensive testing across a wider range. I didn't think the statistics reporting was going to work out as well as it currently is, and I didn't want to bog people's machines down too much.

You WILL be pleased to know that it's YOUR method in XML-1 in that testing and, even though it's an mlTVF, it blows the other XML methods away for performance. Under 8K, the Tally table still beats up on it pretty good. Like I said, I'm setting up for >8K testing... it just takes some time to do it all so people don't have to mess with things when they test it.

--Jeff Moden

Pavel Pawlowski (Ten Centuries)
Jeff,

I've done some testing on this on my machine as well, but instead of the XML version I tested a CLR RegEx solution.

Here is the source for the RegEx CLR function:

using System;
using System.Collections;
using System.Data.SqlTypes;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Server;

public class SQLRegEx
{
    private class RegExRow
    {
        /// <summary>
        /// Private class for passing matches of RegExMatches to the FillRow method
        /// </summary>
        /// <param name="rowId">ID of the row</param>
        /// <param name="matchId">ID of the match</param>
        /// <param name="groupID">ID of the group within the match</param>
        /// <param name="value">Value of the particular group</param>
        public RegExRow(int rowId, int matchId, int groupID, string value)
        {
            RowId = rowId;
            MatchId = matchId;
            GroupID = groupID;
            Value = value;
        }

        public int RowId;
        public int MatchId;
        public int GroupID;
        public string Value;
    }

    /// <summary>
    /// Applies a regular expression to the source string and returns all matches and groups
    /// </summary>
    /// <param name="sourceString">Source string the regular expression should be applied to</param>
    /// <param name="pattern">Regular expression pattern</param>
    /// <returns>List of RegExRows representing the group values</returns>
    [SqlFunction(FillRowMethodName = "FillRegExRow")]
    public static IEnumerable RegExMatches(string sourceString, string pattern)
    {
        Regex r = new Regex(pattern, RegexOptions.Compiled);
        int rowId = 0;
        int matchId = 0;
        foreach (Match m in r.Matches(sourceString))
        {
            matchId++;
            for (int i = 0; i < m.Groups.Count; i++)
            {
                yield return new RegExRow(++rowId, matchId, i, m.Groups[i].Value);
            }
        }
    }

    /// <summary>
    /// FillRow method to populate the output table
    /// </summary>
    /// <param name="obj">RegExRow passed as object</param>
    /// <param name="rowId">ID of the returned row</param>
    /// <param name="matchId">ID of the returned match</param>
    /// <param name="groupID">ID of the group in the match</param>
    /// <param name="value">Value of the group</param>
    public static void FillRegExRow(Object obj, out int rowId, out int matchId, out int groupID, out SqlChars value)
    {
        RegExRow r = (RegExRow)obj;
        rowId = r.RowId;
        matchId = r.MatchId;
        groupID = r.GroupID;
        value = new SqlChars(r.Value);
    }
}



CREATE ASSEMBLY [SQLRegEx]
AUTHORIZATION [dbo]
FROM 'C:\CLR\SQLRegEx.dll'
WITH PERMISSION_SET = SAFE
GO
CREATE FUNCTION [dbo].[fn_RegExMatches](
@sourceString nvarchar(max), --Source string to be processed by the regular expression
@pattern nvarchar(4000)) --Regular expression (pattern) to be applied to the source string
RETURNS TABLE (
[rowId] int, --ID of each returned row
[matchId] int, --ID of the particular match (starts from 1)
[groupId] int, --ID of the particular group in the RegEx match (groupId = 0 represents a complete match)
[value] [nvarchar](4000) --Value of the group
)
WITH EXECUTE AS CALLER
AS EXTERNAL NAME [SQLRegEx].[SQLRegEx].[RegExMatches]
GO



I've tested it on the test tables you provided and it's true that, for a small number of elements, the Tally table solution is unbeatable. But, as Oleg said, for larger element counts the Tally is not usable.

For 16 elements on my machine, the Tally runs about 1 sec and the CLR RegEx about 20 sec.

For 100 elements, the Tally runs 20550 ms and the CLR RegEx 26773 ms.

But the situation differs as the number of elements increases. I tested it on 1333 elements (nearly 8K); the Tally table version I stopped after 19 minutes of execution time.

Instead, the CLR RegEx solution took 46 seconds to finish.

Test query for the CLR RegEx:
DBCC DROPCLEANBUFFERS
GO
DBCC FREEPROCCACHE
GO

SET STATISTICS IO ON
SET STATISTICS TIME ON
GO

DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;

SELECT
@RowNum = D.RowNum,
@ItemNumber = V.matchId,
@ItemValue = Cast(V.value AS int)
FROM dbo.CsvTest2 D
CROSS APPLY dbo.fn_RegExMatches(D.CsvParameter, '(\d+?)(?:,|$)') V
WHERE V.groupId = 1
GO

SET STATISTICS TIME OFF
SET STATISTICS IO OFF
GO



Here are the results for the CLR RegEx:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 33 ms.
Table 'CsvTest2'. Scan count 5, logical reads 217, physical reads 5, read-ahead reads 68, lob logical reads 320872, lob physical reads 2500, lob read-ahead reads 170000.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 140775 ms, elapsed time = 46082 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

Profiler output: [screenshot attachment]

Run on my work machine, a Core 2 Quad 2.5 GHz, * GB RAM.

So, from my tests, the Tally solution is unusable for splitting large arrays. Instead, the CLR works perfectly even with very large arrays. When I tested it on a single array of 100 000 elements, the execution time was 745 ms, and for an array with 1 000 000 elements it was 7.8 sec.

Also, with the RegEx there are no AppDomain reloads/restarts, which Phil Factor mentioned somewhere happen when using the String.Split method for splitting arrays. The RegEx doesn't have that kind of problem.
Jeff Moden (SSC Guru)
But, as Oleg said, for larger element counts the Tally is not usable.

For 16 elements on my machine, the Tally runs about 1 sec and the CLR RegEx about 20 sec.

For 100 elements, the Tally runs 20550 ms and the CLR RegEx 26773 ms.

But the situation differs as the number of elements increases. I tested it on 1333 elements (nearly 8K); the Tally table version I stopped after 19 minutes of execution time.


Thanks for the feedback... I'll run your code and see if I come up with the same results.

--Jeff Moden
