SQLServerCentral Article

Splitting Strings Based on Patterns

,

Background

In a recent forum post: A Quick Query Puzzle, the OP (original poster) posed an interesting problem that I will paraphrase:

“Given an input string such as: 1234ABC123456XYZ1234567890ADS, I would like to replace any string of digits that is longer than 5 characters with some other character (e.g., ‘x’), while leaving the remaining characters in the string unchanged.”

My first thought to solve this was that, if the string could be split based on the digit pattern, i.e., substrings containing digits only into one row set and substrings not containing digits into another row set, it should be relatively simple to make the replacement on substrings greater than five in length and then put it all back together using the familiar STUFF/FOR XML PATH/GROUP BY technique.

As it turned out, splitting the string on a pattern was a bit more challenging than I initially expected, and I ended up posting a hideously ugly solution using a recursive CTE.  Thinking more about this ides of “pattern splitting,” I realized that a general technique would also have been useful to various other forum questions I’ve tackled in the past.  A quick trip to Google identified this blog post by Adam Machanic: Pattern-based Split String

Unfortunately, while playing with the FUNCTION that Adam proposed, I realized that it wasn’t exactly suited for what I had in mind for the following reasons:

  1. It returns only substrings that match the supplied pattern.  I also want the substrings that do not match the pattern, so that when necessary these can be concatenated to reproduce the original string.
  2. It doesn’t return an item number for ordering the occurrence of the substrings (a very useful feature of DelimitedSplit8K popularized by Jeff Moden here).
  3. It uses a permanent Numbers TABLE, which is not too much of an issue, merely a slight inconvenience.

Various other code modifications would be needed for compatibility with what I had in mind, so I felt it might be unfair to pit it against the solutions I will offer here.  Firstly, I will offer a loop-based solution of my own, followed by three other set-based solutions, one of which was suggested by my good friend, TSQL developer Chris Morris.

The Loop-based Solution

Those of you that know me are probably wondering why I’d propose a loop-based solution to this problem at the get go.  Let’s just say that if it was good enough for Mr. Machanic perhaps it will be good enough for me.  But more importantly, the loop-based solution makes the problem quite simple to understand from an algorithmic point of view.

Let’s first mention the specific requirements that will drive our efforts:

  1. The Pattern Splitter should handle VARCHAR strings up to 8000 characters in length.
  2. One pattern must be supplied to the Pattern Splitter and should be of a form that is compatible with LIKE and the first argument to PATINDEX.  Ideally, the Pattern Splitter should handle a wide variety of at least the simplest patterns that conform to this syntax.
  3. The number of substrings matching the pattern (and what I’ll call the “not pattern”) appearing within the string is unbounded except by the overall length of the string.
  4. The Pattern Splitter should return both matched and unmatched strings, using the pattern to delimit the split boundaries, so that if necessary the original string can be reassembled from the parts.
  5. The Pattern Splitter should be encapsulated in a Table Valued FUNCTION.

I have not mentioned performance but of course it should be as fast as possible.  More on that also will appear later.

Finally, when I reviewed Chris’s contribution I realized that the additional column he provided in the results set [Matched] can be quite useful as it allows the user to distinguish which of the strings are matched and which are not, thus avoiding the overhead of doing an extra LIKE in the calling query.

So let’s now look at a loop-based Table Valued FUNCTION (TVF) for solving this.


-- PatternSplitLoop will split a string based on a pattern of the form 
-- supported by LIKE and PATINDEX 
-- 
-- Created by: Dwain Camps 11-Oct-2012 
CREATE FUNCTION [dbo].[PatternSplitLoop]
 (  @String    VARCHAR(400)
   ,@Pattern   VARCHAR(500)
  ) RETURNS 
  @Results TABLE ( ItemNumber  INT
                  ,Item        VARCHAR(400)
                  ,[Matched]   INT     )
   WITH SCHEMABINDING 
AS 
BEGIN;
-- DECLARE a couple of variables we'll need in our loop     
DECLARE 
  @ItemNumber  INT = 0 
, @Remaining   VARCHAR(400) = ISNULL(@String, '')         
-- Create the "not pattern" 
, @NotPattern  VARCHAR(500) = REPLACE(REPLACE(@Pattern, '[', '[^'), '^^', '')
, @Matched     INT
IF @String IS NULL OR @Pattern IS NULL 
  INSERT INTO @Results SELECT NULL, NULL, NULL
WHILE DATALENGTH(@Remaining) > 0     
 BEGIN
   SELECT @ItemNumber = @ItemNumber + 1
   -- The item returned from the cascaded CROSS APPLY b below               ,@String = CASE
            -- When a+b = 1, then either a=1 and b=0 (the pattern was found but not pattern                     
            -- was not found) or a=0 and b=1 (the not pattern was found but pattern was           
            -- not found).
            -- This means that no meaninful patterns are found in what remains so we’re done.   
               WHEN a+b = 1 THEN @Remaining   
            -- This case returns the chunk up to the start of the next pattern/not pattern  
               WHEN (a=1 AND b>0) OR (b=1 AND a>0)                                  THEN SUBSTRING(@Remaining, 1, CASE a 
                                  WHEN 1 THEN b 
                                  ELSE a 
                                 END - 1)
               ELSE @Remaining                            
                END 
           ,@Matched=CASE a WHEN 1 THEN 1 ELSE 0 END                    FROM (
       -- Find the next occurrence of the Pattern and the NotPattern
        SELECT PATINDEX(@Pattern, @Remaining)
             , PATINDEX(@NotPattern, @Remaining)
           ) a(a, b)
       -- Now that we have our ItemNumber and Item (in @String) INSERT them into our results
       INSERT INTO @Results SELECT @ItemNumber, @String, @Matched
        -- Find the remaining characters in the string         
        SELECT @Remaining = CASE
                         WHEN DATALENGTH(@Remaining) = DATALENGTH(@String) THEN ''
                         ELSE SUBSTRING(@Remaining, DATALENGTH(@String)+1, DATALENGTH(@Remaining)) 
         END
 END
RETURN
END

The approach is reasonably straightforward and hopefully explained adequately by the comments in the code.  Some further notes:

  • The construction of the “not pattern” using REPLACE, as a highly simplified approach, imposes a limitation on patterns that will be recognized.
  • Using DATALENGTH instead of LEN is important when determining remaining characters because if the remaining characters only contain spaces  it can cause an infinite loop.

Initially we’ll try this on some degenerate cases and also the string proposed in the first paragraph of this article.  We’ll use OUTER APPLY to ensure that our expected result for a NULL string is returned.

CREATE TABLE #t1 (MyString VARCHAR(8000), Pattern VARCHAR(50))
INSERT INTO #t1 -- Check that degenerate cases are handled properly  SELECT NULL, '%[0-9]%'
 -- Returns NULL for Item (any pattern)
 UNION ALL
 SELECT 'ABC', '%[0-9]%'
 -- Returns the single unmatched pattern
 UNION ALL
 SELECT '123', '%[0-9]%'
 -- Returns the single matched pattern
 -- Split out the numeric components of a string 
 UNION ALL
 SELECT '1234ABC123456XYZ1234567890ADS', '%[0-9]%' 
SELECT MyString
     , ItemNumber
     , Item
     , Matched 
 FROM #t1 
  OUTER APPLY PatternSplitLoop(MyString, Pattern) 
  --WHERE Matched = 1
DROP TABLE #t1

The results we get from this are:

MyString                         ItemNumber   Item        Matched
NULL                             NULL         NULL        NULL
ABC                              1            ABC         0
123                              1            123         1
1234ABC123456XYZ1234567890ADS    1            1234        1
1234ABC123456XYZ1234567890ADS    2            ABC         0
1234ABC123456XYZ1234567890ADS    3            123456      1
1234ABC123456XYZ1234567890ADS    4            XYZ         0
1234ABC123456XYZ1234567890ADS    5            1234567890  1
1234ABC123456XYZ1234567890ADS    6            ADS         0

We see that our NULL string is handled as expected and for our two other degenerate cases, the entire string is returned.  If we are only interested in returning strings that match our supplied pattern, we merely need to uncomment the WHERE clause shown in the code snippet above.

So now we can make short work of the originally proposed problem:


CREATE TABLE #t1 (MyString VARCHAR(8000))
DECLARE @Pattern VARCHAR(50) = '%[0-9]%'
INSERT INTO #t1 SELECT '1234ABC123456XYZ1234567890ADS'
;WITH ReplaceDigits AS 
( 
  SELECT MyString, ItemNumber
        ,Item=CASE WHEN LEN(Item) >= 6 AND [Matched] = 1            
                   THEN REPLICATE('x', LEN(Item)) 
                   ELSE Item 
              END
   FROM #t1 
    CROSS APPLY PatternSplitLoop(MyString, @Pattern)
)
SELECT ( SELECT '' + Item
 FROM ReplaceDigits b
 WHERE a.MyString = b.Mystring  
 ORDER BY ItemNumber  
 FOR XML PATH('')
)
 FROM ReplaceDigits a
 GROUP BY MyString
DROP TABLE #t1
-- Returns: 1234ABCxxxxxxXYZxxxxxxxxxxADS

Note that we are not saying that this approach will outperform some of the more specific solutions posted to the linked forum thread.  We are only saying that as a generalized solution, ours is reasonably elegant and probably simpler to comprehend.

Some Usage Examples

So now let’s look at a few other examples of how this pattern split concept can be employed.  To shorten the results we’ll only list matched Items.


CREATE TABLE #t1 (MyString VARCHAR(8000), Pattern VARCHAR(50))
INSERT INTO #t1 
 -- Split out the non-numeric components of a string
 SELECT 'cbv736456XYZ543534534545XLS', '%[^0-9]%' 
    -- Retrieve [Tuples] regardless of bracketing characters 
 UNION ALL 
 SELECT '[001][0021][00043]', '%[0-9]%' 
 -- Retrieve only the alphabetic [Tuples] 
 UNION ALL 
 SELECT '<001><AAA><00043>', '%[A-Z]%' 
 -- Retrieve binary strings 
 UNION ALL 
 SELECT '01010101A011001B01900C', '%[0-1]%' 
 -- Retrieve hexadecimal strings 
 UNION ALL 
 SELECT '01ABXX02AYY0005ABCZZ', '%[0-9A-F]%' 
 -- Retrieve words and remove spaces & punctuation between them (from A. Machanic blog) 
 UNION ALL 
 SELECT 'The quick-brown  fox jumped! over the lazy dog.', '%[a-zA-Z]%' 
 -- Retrieve masked credit card numbers 
 UNION ALL 
 SELECT 'Dwain532500010044****Charley432800245431****', '%[0-9*]%'
SELECT MyString, ItemNumber, Item 
 FROM #t1 
  CROSS APPLY PatternSplitLoop(MyString, Pattern) 
 WHERE [Matched] = 1
DROP TABLE #t1

The results from this script are all pretty straightforward:

MyString                                         ItemNumber  Item
cbv736456XYZ543534534545XLS                      1           cbv
cbv736456XYZ543534534545XLS                      3           XYZ
cbv736456XYZ543534534545XLS                      5           XLS
[001][0021][00043]                               2           001
[001][0021][00043]                               4           0021
[001][0021][00043]                               6           00043
<001><AAA><00043>                                2           AAA
01010101A011001B01900C                           1           01010101
01010101A011001B01900C                           3           011001
01010101A011001B01900C                           5           01
01010101A011001B01900C                           7           00
01ABXX02AYY0005ABCZZ                             1           01AB
01ABXX02AYY0005ABCZZ                             3           02A
01ABXX02AYY0005ABCZZ                             5           0005ABC
The quick-brown  fox jumped! over the lazy dog.  1           The
The quick-brown  fox jumped! over the lazy dog.  3           quick
The quick-brown  fox jumped! over the lazy dog.  5           brown
The quick-brown  fox jumped! over the lazy dog.  7           fox
The quick-brown  fox jumped! over the lazy dog.  9           jumped
The quick-brown  fox jumped! over the lazy dog.  11          over
The quick-brown  fox jumped! over the lazy dog.  13          the
The quick-brown  fox jumped! over the lazy dog.  15          lazy
The quick-brown  fox jumped! over the lazy dog.  17          dog
Dwain532500010044****Charley432800245431****     2           532500010044****
Dwain532500010044****Charley432800245431****     4           432800245431****

You’ll notice in our examples that you can pass a different pattern to each row to be split or use the same pattern across all rows.  Let’s look at another example that is perhaps not quite as intuitive:


CREATE TABLE #t1 (MyString VARCHAR(8000), Pattern VARCHAR(50))
INSERT INTO #t1 
 -- Retrieve file names with full path 
 SELECT 'c:\abc.txt###d:\test\abc1s2.xls**e:\def.exe**', '%[a-zA-Z0-9:\.]%' 
 -- Retrieve drive letter (e.g., c:) and file name without path 
 UNION ALL 
 SELECT '***c:\mypath\test\abc.txt***', '%[a-zA-Z0-9:\.]%' 
 -- Retrieve drive letter (e.g., c:) and file name without path 
 UNION ALL 
 SELECT 'c:\abc.txt###d:\test\abc1s2.xls**e:\def.exe**', '%[a-zA-Z0-9.^:]%'
SELECT MyString, ItemNumber, Item
 FROM #t1
   CROSS APPLY PatternSplitLoop(MyString, Pattern) 
 WHERE Item LIKE Pattern
DROP TABLE #t1

These results are:

MyString                                         ItemNumber  Item
c:\abc.txt###d:\test\abc1s2.xls**e:\def.exe**    1           c:\abc.txt
c:\abc.txt###d:\test\abc1s2.xls**e:\def.exe**    3           d:\test\abc1s2.xls
c:\abc.txt###d:\test\abc1s2.xls**e:\def.exe**    5           e:\def.exe
***c:\mypath\test\abc.txt***                     2           c:\mypath\test\abc.txt
c:\abc.txt###d:\test\abc1s2.xls**e:\def.exe**    1           c:
c:\abc.txt###d:\test\abc1s2.xls**e:\def.exe**    3           abc.txt
c:\abc.txt###d:\test\abc1s2.xls**e:\def.exe**    5           d:
c:\abc.txt###d:\test\abc1s2.xls**e:\def.exe**    7           test
c:\abc.txt###d:\test\abc1s2.xls**e:\def.exe**    9           abc1s2.xls
c:\abc.txt###d:\test\abc1s2.xls**e:\def.exe**    11          e:
c:\abc.txt###d:\test\abc1s2.xls**e:\def.exe**    13          def.exe

Ignoring for a moment that you may want to include other characters in your file names (e.g., blanks) or perhaps that your file name does not include the file’s extension, the above shows that you can:

  1. [Example 1] Split multiple file names with the full path out of a string.
  2. [Example 2] Clean up a string that contains a file name and other (garbage) characters.
  3. [Example 3] Split out only file names, path names and/or drive letters.  If all you want returned is the drive letters, simply add AND LIKE ‘%:%’ to the WHERE clause.  For just the file names (without the path), use AND NOT LIKE ‘%:%’ AND NOT LIKE ‘%.%’. The final AND NOT LIKE ‘%.%’ removes item 7 (the folder – test) in this case.

As mentioned earlier, our PatternSplit FUNCTION does not handle every conceivable pattern that you can construct using the rules described in Microsoft Books on Line.  However our PatternSplit FUNCTION does at least gracefully handle cases where the pattern is not supported by simply returning the original string.  Try this for an example:


CREATE TABLE #t1 (MyString VARCHAR(8000), Pattern VARCHAR(50))
INSERT INTO #t1 
 -- Unsupported patterns return the whole string 
 SELECT 'abc0031001abacass532466aa43', '%[0-9][0-9][0-9][0-9]%'
SELECT MyString, ItemNumber, Item, [Matched]
 FROM #t1 
  CROSS APPLY PatternSplitLoop(MyString, Pattern)
DROP TABLE #t1

The limitation applies to patterns that attempt to locate a specific character at a specific position in the substring, as for a telephone number, SSN (with hyphens) or postal code.

It seems that other limitations of this approach are in our collective imaginations.

Alternative Solutions

Before Jeff Moden revokes my anti-RBAR card for the above atrocity (besides we’d be remiss if we didn’t), let’s explore some set-based solutions to this problem.  Three alternative solutions will be described and then tested against PatternSplitLoop to see which approach is the fastest.

  1. A recursive CTE approach similar to what we did in our forum post.
  2. A quirky update.
  3. A tally table split and re-grouping approach suggested by Chris Morris.

Alternate Solution #1 – A Recursive CTE (rCTE) Pattern Splitter

In our loop-based solution, we used a TVF; however we could not craft that as an inline TVF (iTVF) because of the requirement for multiple statements within the loop.  Using a rCTE, it is possible to handle the process in a single statement, so the ability to put this into an iTVF brings with it the sincere hope that it will perform better than the loop, even with the recursion overhead.

-- PatternSplitrCTE will split a string based on a pattern of the form
-- supported by LIKE and PATINDEX
--
-- Created by: Dwain Camps 11-Oct-2012
ALTER FUNCTION [dbo].[PatternSplitrCTE]
        (@String                    VARCHAR(8000)      -- The string to be split
        ,@Pattern                   VARCHAR(500))     -- The pattern to split
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
-- Note that this is an in-line Table Valued Function (iTVF)
WITH PatternSplitter AS (
    -- PatternSplitter is a recursive CTE - here is the anchor leg:
    SELECT ItemNumber=CASE WHEN @String IS NULL THEN NULL ELSE 1 END
        -- The first item from cascaded CROSS APPLY c below
        ,Item
        -- The remaining elements of the string is created (for this recursion) here
        ,Remaining=CASE
            WHEN DATALENGTH(@String) = DATALENGTH(Item) THEN ''
            ELSE SUBSTRING(@String, DATALENGTH(Item)+1, DATALENGTH(@String)) END
        -- NotPattern is the converse of Pattern (calculated in CROSS APPLY a below)
        ,NotPattern
        ,[Matched]=CASE WHEN @String IS NULL THEN NULL WHEN a=1 THEN 1 ELSE 0 END
    FROM (
        -- Only create the "NotPattern" once in the anchor leg
        SELECT REPLACE(REPLACE(@Pattern, '[', '[^'), '^^', '')) a(NotPattern)
    -- Note that the original version of this FUNCTION included this CROSS APPLY to calculate
    -- a and b in the recursive leg also.  Removing it there and embedding PATINDEX wherever
    -- a and b appear in the final CROSS APPLY improved performance measurably at the expense
    -- of some readability.
    CROSS APPLY (
        -- Find the first occurrence of the Pattern and the NotPattern
        SELECT PATINDEX(@Pattern, @String)
            ,PATINDEX(NotPattern, @String)
        ) b(a, b)
    CROSS APPLY (
        -- Identify the first item (chunk)
        SELECT CASE
            -- When a+b = 1, then either a=1 and b=0 (pattern found but not pattern is not found)
            -- or a=0 and b=1 (not pattern found but pattern is not found). 
            -- In either case we're done.
            WHEN a+b = 1 THEN @String
            WHEN (a=1 AND b>0) OR (b=1 AND a>0)
                THEN SUBSTRING(@String, 1, CASE a WHEN 1 THEN b ELSE a END-1)
            ELSE @String        -- Return value for unsupported patterns
            END) c(Item)
    UNION ALL
    -- Recursive leg of PatternSplitter rCTE
    SELECT ItemNumber+1
        -- The next item from CROSS APPLY b below
        ,b.Item
        -- The remaining elements of the string is created (for the remainding recursions) here
        ,Remaining=CASE
            WHEN DATALENGTH(Remaining) = DATALENGTH(b.Item) THEN ''
            ELSE SUBSTRING(Remaining, DATALENGTH(b.Item)+1, DATALENGTH(Remaining)) END
        -- The NotPattern is only calculated once (to avoid added cost of the REPLACE)
        ,NotPattern
        ,CASE PATINDEX(@Pattern, Remaining) WHEN 1 THEN 1 ELSE 0 END
    FROM PatternSplitter
    CROSS APPLY (
        -- Identify the next item (chunk)
        SELECT CASE
            -- For below: a=PATINDEX(@Pattern, Remaining) and b=PATINDEX(NotPattern, Remaining)
            -- When a+b = 1, then either a=1 and b=0 (pattern found but not pattern is not found)
            -- or a=0 and b=1 (not pattern found but pattern is not found). 
            -- In either case we're done.
            WHEN PATINDEX(@Pattern, Remaining) + PATINDEX(NotPattern, Remaining) = 1
                THEN Remaining
            WHEN (PATINDEX(@Pattern, Remaining)=1 AND PATINDEX(NotPattern, Remaining)>0) OR
                (PATINDEX(NotPattern, Remaining)=1 AND PATINDEX(@Pattern, Remaining)>0)
                THEN SUBSTRING(Remaining, 1,
                    CASE PATINDEX(@Pattern, Remaining)
                        WHEN 1 THEN PATINDEX(NotPattern, Remaining)
                        ELSE PATINDEX(@Pattern, Remaining) END-1)
            ELSE Remaining          -- Should never occur
            END) b(Item)
    -- When Remaining is an empty string, there's nothing left to process so we're done.
    WHERE DATALENGTH(Remaining) > 0
    )
SELECT ItemNumber, Item, [Matched]
FROM PatternSplitter

We have tested this solution on all of the proposed samples above and confirmed that identical results are obtained for each case.

Alternate Solution #2 – A Pattern Splitter Using the Infamous Quirky Update

It may not be immediately apparent how the Quirky Update (QU) can be used to solve this problem, but consider the string split by a tally table into one row for each character.  If we add a column that is based on PATINDEX(‘%[0-9]%’, char), we come up with something that might look like the following:

ItemNumber  Char  PATINDEX(‘%[0-9]%’, Char)
1           A     0
2           B     0
3           1     1
4           2     1
5           C     0
6           D     0
7           3     1
8           4     1

In other words, the results appear as a series of alternating 0’s and 1’s that, when combined with a clustered index on ItemNumber can be used to enable/drive the Quirky Update.  So in code, our FUNCTION looks as follows.


-- PatternSplitQU will split a string based on a pattern of the form 
-- supported by LIKE and PATINDEX 
-- 
-- Created by: Dwain Camps 11-Oct-2012 
ALTER FUNCTION [dbo].[PatternSplitQU] 
(  @String                 VARCHAR(8000)
  ,@Pattern               VARCHAR(500)
) RETURNS @Results
            TABLE
(     ItemNumber              INT
     ,Item                   VARCHAR(8000)
     ,[Matched]              INT     
) WITH SCHEMABINDING AS BEGIN;
-- Holding table for tally split by character     
DECLARE @Strings        TABLE     
(
 -- With a clustered index to facilitate the quirky update     
  ID         INT PRIMARY KEY CLUSTERED
 ,MyString   CHAR(1)
 ,[Matched]  INT
 ,Pattern    INT
);
-- Use a Tally table to split out the single characters     
WITH Nbrs_3(n) 
AS 
(
  SELECT 1 
  UNION ALL 
  SELECT 1 
  UNION ALL 
  SELECT 1 
  UNION ALL 
  SELECT 1
)
,Nbrs_2 (n)
 AS
 (SELECT 1 
   FROM Nbrs_3 n1 
    CROSS JOIN Nbrs_3 n2
)
,Nbrs_1 (n)
 AS
 (SELECT 1 
   FROM Nbrs_2 n1 
    CROSS JOIN Nbrs_2 n2)      
,Nbrs_0 (n) 
 AS
 (SELECT 1 
   FROM Nbrs_1 n1 
    CROSS JOIN Nbrs_1 n2)
,Tally  (n)
 AS
 (SELECT ROW_NUMBER() OVER (ORDER BY n) As n 
  FROM Nbrs_0)     
INSERT INTO @Strings
 SELECT n, SUBSTRING(@String, n, 1)
       ,PATINDEX(@Pattern, SUBSTRING(@String, n, 1))
       ,PATINDEX(@Pattern, SUBSTRING(@String, n, 1))
   FROM (SELECT TOP (ISNULL(DATALENGTH(@String), 0)) n
          FROM Tally) a
-- Local variables to control the quirky update     
DECLARE @CharID   INT = -1
       ,@Matched  INT = 0
-- Perform the Quirky Update     
-- At each change in Pattern (0-->1 or 1-->0) increment to a grouping value     
UPDATE @Strings     
  SET @Matched = CASE 
                   WHEN Pattern <> @CharID THEN @Matched + 1 
                   ELSE @Matched 
                 END
      ,@CharID = CASE 
                   WHEN Pattern <> @CharID THEN Pattern 
                   ELSE @CharID 
                 END     
     ,Pattern =  @Matched
-- Contenate strings from each group into the final Items returned INSERT INTO @Results 
 SELECT ItemNumber=ROW_NUMBER() OVER (ORDER BY (SELECT NULL))             ,Item=(
             SELECT '' + MyString
             FROM @Strings b
             WHERE a.Pattern = b.Pattern
             ORDER BY ID
             FOR XML PATH(''), TYPE).value('.', 'VARCHAR(8000)')          ,MIN([Matched])
  FROM @Strings a
  GROUP BY Pattern
RETURN 
END

Unfortunately, once again due to the multiple statements required, we were unable to formulate this solution as an iTVF.  But since the Quirky Update is known to be extremely fast when computing accumulated sums, this method also warrants attention as a reasonable set-based alternative to the WHILE loop.  We can only hope that we have abided by all the rules expounded upon in the seminal article on the Quirky Update:

Solving the Running Totals and Ordinal Rank Problems

Even though the QU approach does not need to construct the “not pattern” in the previous two proposed solutions, it is still subject to the same limitation that it will only work with relatively simple patterns.  Once again, we tested this solution on all of the previously proposed examples and found the results to be identical with a bonus – the percent (%) signs do not need to be included in the pattern.

We also tried a few variations on the tally table used to perform the initial single character split, all using a TOP clause based on the data length of the string, including:

  1. A Numbers table with a clustered index on the number – in this case TOP was not required because we used BETWEEN.
  2. A ROW_NUMBER() based on rows of a CROSS JOIN of sys.all_columns against itself.
  3. An Itzik Ben-Gan style in-line CTE tally table.
  4. A variant that uses CROSS JOINs of VALUES (Table Value Constructors), which Chris chose to use in the next alternate solution.

Option 3 was selected because it seemed to perform the best.  Note that options 1 and 2 cannot be used WITH SCHEMABINDING.

Alternate Solution #3 – A Tally Table Split with Regrouping

While examining my solutions I felt that a significant epiphany was eluding me.  While I could not put my finger on it, somehow the approaches I was exploring just didn’t seem quite right.  I brought this problem to the attention of Chris Morris, whom I know to be relentless in his quest for high performing solutions.  In a number of prior performance competitions, I have come in second to his submissions, thus I have a healthy respect for what he’s capable of.  Since he was also a contributor on the original forum thread upon which this article is based, I thought he might have some good ideas and I didn’t want to miss the opportunity of including them should he be available to provide them.

What Chris came up with was a solution that starts with a character by character (tally table) split of the original string, followed by a numbering and regrouping that is vaguely reminiscent of what Jeff Moden did in his article Group Islands of Contiguous Dates.


-- PatternSplitCM will split a string based on a pattern of the form 
-- supported by LIKE and PATINDEX 
-- 
-- Created by: Chris Morris 12-Oct-2012 
ALTER FUNCTION [dbo].[PatternSplitCM]
(
       @List                VARCHAR(8000) = NULL
       ,@Pattern            VARCHAR(50)
) RETURNS TABLE WITH SCHEMABINDING 
AS 
RETURN
    WITH numbers AS (
      SELECT TOP(ISNULL(DATALENGTH(@List), 0))
       n = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
      FROM
      (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n),
      (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n),
      (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) f (n),
      (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) g (n))
    SELECT
      ItemNumber = ROW_NUMBER() OVER(ORDER BY MIN(n)),
      Item = SUBSTRING(@List,MIN(n),1+MAX(n)-MIN(n)),
      [Matched]
     FROM (
      SELECT n, y.[Matched], Grouper = n - ROW_NUMBER() OVER(ORDER BY y.[Matched],n)
      FROM numbers
      CROSS APPLY (
          SELECT [Matched] = CASE WHEN SUBSTRING(@List,n,1) LIKE @Pattern THEN 1 ELSE 0 END
      ) y
     ) d
     GROUP BY [Matched], Grouper

Just look at that bad boy!  Simple and elegant, packaged as an iTVF and waiting to be run it looks to me like a Ferrari sitting at a stoplight, its engines growling in anticipation of making its run.  Chris has this to say to help us understand his thinking behind this approach:

“Like Jeff Moden’s Islands ‘n’ Gaps solution, this function employs what I call the staggered ROW_NUMBER() trick, and to understand how it works I’ll use a few diagrams and a tweaked version of the inner query. Here’s the tweaked query:


DECLARE @List VARCHAR(8000), @Pattern VARCHAR(50)
SELECT @List = '0123a', @Pattern = '[^0-9]'
;WITH numbers AS (
  SELECT TOP(DATALENGTH(@List))
   n = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
  FROM
  (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n),
  (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n),
  (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) f (n),
  (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) g (n))
SELECT  n,  rn = ROW_NUMBER() OVER(ORDER BY y.Match DESC, n),
  letter = SUBSTRING(@List,n,1),
  y.Match,
  [n-rn] = n - ROW_NUMBER() OVER(ORDER BY y.Match DESC,n)
 FROM numbers CROSS APPLY (
     SELECT Match = CASE
         WHEN SUBSTRING(@List,n,1) LIKE @Pattern THEN 1
         ELSE 0 END) y
 ORDER BY Match, n

All I’ve changed is the sort order for ‘rn’. It’s much easier to follow if y.Match is sorted in descending order. With a suitable ORDER BY thrown in for good measure, it all makes sense.

Here’s the output:

Photobucket

Column ‘n’ is the position of the character in the string - ‘0’ is at position 1 and ‘a’ is at position 5. The column ‘rn’ renumbers the rows in the same sort order as ‘n’ but by matches first, then non-matches. Adjacent matching (or non-matching) characters in the string increment both n and rn by 1. Subtracting rn from n yields the same number for them – so long as they are adjacent. Okay so far?

Here’s what happens if we add another alpha (‘b’) to the end of the string:

Photobucket

And finally a trailing ‘c’:

Photobucket

Conveniently, the grouping remains intact. All of the contiguous matches have the same value for [n-rn], as do the non-matches – albeit decremented by 1 for every matching letter added.

Let’s see what happens when we take the original input string ‘0123a' and double it up to give four contiguous groups (or ranges) of alphas and digits, ‘0123’ + ‘a’ + ‘0123’ + ‘b’:

Photobucket

Then treble it up to 6 ranges, ‘0123’ + ‘a’ + ‘0123’ + ‘b’ + ‘0123’ + ‘c’:

Photobucket

No matter how many characters are added to the end of the string, the match groups are resolved as increasing positive values for [n-rn], and the non-matching groups as negative values. There’s absolutely no chance of a collision, where two valid groups are calculated to have the same [n-rn] value.

Now that we’ve grouped the data into contiguous sequences of alphas and numerics using [n-rn], it’s trivial to determine the MIN(n), which is the position of the first character of the group, and MAX(n), which is the position of the last. Here’s the expression:

Item = SUBSTRING(@List,MIN(n),1+MAX(n)-MIN(n))

And here’s the GROUP BY: 

GROUP BY [Matched], Grouper

This rolls up the rows from one row per character in the input string, to one row per partition of ‘match’ + [n-rn], and uses SUBSTRING to pluck the word from the string. Note that [Matched] isn’t required for the grouping in this example; this could change with sort orders. Always test, then test again.

I know I know, it sounds complicated until you see it, then it looks like it just has to be expensive. It’s going to be expensive with all of those sorts. But the sort for the inline tally table is free, and the sort for ItemID in the output isn’t strictly necessary if you can get away with MIN(n) (or MAX) instead of renumbering. Dwain will now show you how this code can really fly and what you can do with it. We’ve only just begun to play with it and it seems every day a relevant post comes up on SSC.”

-- Chris Morris – Bogstandard TSQL developer

I did test Chris’s solution, like the others, against all of the test case examples described and got identical results.  Once again, like the QU solution, in Chris’s solution the percent (%) signs are optional in the pattern.

SQL 2012 Solutions

While we were unable to explore them due to lack of a SQL 2012 test environment, we believe that the new windowing capabilities may be exploited to address this particular problem using a setup that is similar to what we did for the Quirky Update.

We are anxious to explore ROWS BETWEEN 1 PRECEEDING AND CURRENT ROW since we believe that because it provides relief for the running totals problem, it may also be helpful here.  Performance will remain an open question for now.

Performance Comparison Between Alternatives

In order to ascertain which of our proposed solutions to this problem performs the best, we constructed a relatively small test harness that can be tweaked to produce strings that will contain varying numbers of chunks that will be split into returned items.


CREATE TABLE #t1 (MyString VARCHAR(8000)) 
DECLARE @Pattern VARCHAR(500) = '%[0-9]%'
;WITH Tally (n) AS (
     SELECT TOP 500 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
     FROM sys.all_columns a CROSS JOIN sys.all_columns b)
INSERT INTO #t1
 SELECT REPLICATE('abc0123hij456nopq789uvw1y', 
        1 + ABS(CHECKSUM(NEWID())) % 30)
 FROM Tally
DECLARE @MyString VARCHAR(8000), @ItemNumber INT, @Item VARCHAR(8000)
PRINT '--- PatternSplitLoop'
SET STATISTICS TIME ON 
SELECT @MyString=MyString, @ItemNumber=ItemNumber, @Item=Item
 FROM #t1 CROSS APPLY dbo.PatternSplitLoop(MyString, @Pattern) 
--WHERE [Matched] = 1 
SET STATISTICS TIME OFF
PRINT '--- PatternSplitQU' 
SET STATISTICS TIME ON 
SELECT @MyString=MyString, @ItemNumber=ItemNumber, @Item=Item
 FROM #t1 CROSS APPLY dbo.PatternSplitQU(MyString, @Pattern) 
--WHERE [Matched] = 1 
SET STATISTICS TIME OFF
PRINT '--- PatternSplitrCTE' 
SET STATISTICS TIME ON 
SELECT @MyString=MyString, @ItemNumber=ItemNumber, @Item=Item
 FROM #t1 CROSS APPLY dbo.PatternSplitrCTE(MyString, @Pattern) 
--WHERE [Matched] = 1 
 OPTION (MAXRECURSION 0)
SET STATISTICS TIME OFF
PRINT '--- PatternSplitCM' 
SET STATISTICS TIME ON 
SELECT @MyString=MyString, @ItemNumber=ItemNumber, @Item=Item
 FROM #t1 CROSS APPLY dbo.PatternSplitCM(MyString, @Pattern) 
--WHERE [Matched] = 1 
SET STATISTICS TIME OFF
DROP TABLE #t1

OPTION (MAXRECURSION 0) is required for PatternSplitrCTE because at 30 replications of the 25 character string containing 9 items, 270 recursions are required to resolve the list of Items.

The timing results from this test are shown below.

--- PatternSplitLoop

 SQL Server Execution Times:    CPU time = 1077 ms,  elapsed time = 1278 ms.

--- PatternSplitQU

 SQL Server Execution Times:    CPU time = 5710 ms,  elapsed time = 6274 ms.

--- PatternSplitrCTE

 SQL Server Execution Times:    CPU time = 1778 ms,  elapsed time = 1845 ms.

--- PatternSplitCM

 SQL Server Execution Times:    CPU time = 63 ms,  elapsed time = 64 ms.

The results are surprising only in the huge margin by which PatternSplitCM came out the winner.

The less interesting results show that the loop-based TVF, despite the RBAR-nightmare that it is, performs better than the rCTE iTVF and the QU TVF alternatives.  All RBAR jokes aside, we do appreciate the fact the Mr. Moden has often commented that a well-crafted WHILE loop will beat even set-based solutions under certain circumstances.  In this case, not all of the available set-based solutions though.

Performance Scalability for the Tally/Regroup Solution

These performance tests of the PatternSplitCM FUNCTION are designed to give some idea of the scalability of the solution.  They were done on a Dell Inspiron Core i5-2430M CPU at 2.4GHz with 4GB of RAM, running Windows 7 and SQL Server 2008 R2. The Loop, QU and rCTE versions were not tested as they clearly did not perform as well. 

Essentially the same test harness as above is used, and the WHERE clause that reduces the row set to only matched patterns remains commented out to preserve the raw timing of the PatternSplit TVF.

There are two dimensions to the test, which we will call breadth and depth.  

  • Breadth is the number of embedded patterns that end up getting matched.  This variable is controlled by the random number assigned to the modulo function in the test harness.  In each case, a 25 character string including 4 patterns that will be matched (total of 9 Item rows) is replicated a random number of times up to the maximum 8000 character length of the supported string.  Breadth therefore can be measured by the replication factor (the value of the second argument to the REPLICATE built-in function) to a maximum of 320.  So if the replication factor is 160 it produces between 9 and 1,440 Items for each row in the test table.
  • Depth is the number of rows run through the test harness.  For a 100 row test harness and a replication factor of 160, the test harness should generate output of approximately 720,000 rows (80*9*1000).  So with 10,000 rows in the test harness and 320 replications, the output row set is approximately 14,400,000 rows.

Each timing was run 5 times and average results reported for CPU and Elapsed milliseconds. 

Photobucket

In all cases, results were dumped into local holding variables.  Elapsed times appear to scale approximately linearly.  Note how in the cells shaded with a lighter shade of orange, SQL has obviously parallelized the query because CPU is much higher than elapsed time.  I find it odd that this pattern is not particularly predictable.

We also tried reducing the allowable size of the @String parameter to the iTVF to 400 instead of 8000.  This produced no measurably conclusive improvement in the speed of the utility.  In fact, increasing it to a VARCHAR(MAX) anecdotally produced no reduction in performance!

Flexibility of Pattern Splitting

It should be pretty obvious that splitting strings based on a pattern is pretty flexible.  So flexible in fact, that even delimited strings can be split by an auspicious choice of pattern.  So let’s compare how PatternSplitCM works against a well-known, high-performance delimited string splitter, such as the famous DelimitedSplit8K community string splitter popularized by Jeff Moden.

Hardly a fair test to be sure, since Jeff’s splitter has been performance optimized by a wide audience, but I figure since Chris’s solution beat mine so handily it’s time for payback!

I’ve uploaded the test harness consisting of these cases comparing a split of a comma delimited string using ‘%[^,]%’ as the pattern for PatternSplitCM vs. DelimitedSplit8K on the same rows.

The test results for this case are:

--- Split on Pattern , delimiter

 SQL Server Execution Times:    CPU time = 1201 ms,  elapsed time = 3082 ms.

--- DelimitedSplit8K on , (single char) delimiter

 SQL Server Execution Times:    CPU time = 249 ms,  elapsed time = 247 ms.

No surprises – DelimitedSplit8K is the clear winner because it is so focused on its task.

Final Words, Cautions and Conclusions

Prior to seeing Chris’s solution run with blinding speed and contrary to my mantra (“No Loops.  No CURSORs.  No RBAR.  Hoo-uh!”), I would have been delighted to propose a looping solution as the first choice solution to this problem because it clearly performs better than the other set-based alternatives that I could come up with.  There I said it, but it still feels weird.  However because Chris’s set-based solution is so blazingly fast, I am much happier to say that set-based solutions rule!

Congratulations Chris!  You have really shown your stuff and produced a winner that should be in everybody’s tool chest.  I know it will be in mine, although I may keep my other solutions around just for the sake of nostalgia.

Soon we also may hear from the CLR-enthusiasts that this is best handled in that manner.  But I suspect that a general CLR solution that handles even a small subset of the plethora of patterns acceptable to LIKE and PATINDEX may be a daunting challenge.  But please prove me wrong as I’d love to see some examples, even if they may not be usable in environments that don’t allow CLRs.

We’ve managed to address this problem in four different ways and shown what works and what does not.  Hopefully we’ve provided enough performance information to serve as a guideline for suitability for any specific application.

In the discussion thread that follows, hopefully others will come along and propose an even faster, killer approach.  Thankfully this community is well stocked with such performance experts that won’t hesitate to rise to the task of contributing.  If only this article gets people thinking about better approaches to this sort of problem, I feel that I have performed my task adequately.

We have learned that pattern-based string splitting can be challenging, but we have provided a fairly generalized tool that can work fast under many circumstances.  The constraints on this tool were identified (that it only works for certain patterns), so you have been forewarned.  Nonetheless, given the huge number of possible patterns available, possibly there is one out there that can be applied to fit your needs.

As always, we would like to thank our readers for their attention and especially those that will follow with better approaches to this problem.

Most importantly though, we must thank Chris Morris.  Without his contribution this article would have been just a footnote in the archives of the SQLverse.  I have it on good authority that he’s working on a version that will handle more patterns and hopefully he’ll be kind enough to share that with us all in the discussion thread.

Resources

Rate

4.94 (54)

You rated this post out of 5. Change rating

Share

Share

Rate

4.94 (54)

You rated this post out of 5. Change rating