Set-Based String Splitting table function

,

Background

Splitting a string into segments is something I do on a daily basis for one reason or another. From splitting out a folder name from a path to extracting delimited values from a long string. I’ve had a few go-to functions for years that loop character by character, I parse a variable or cross apply against the function and get the required outcome. An example might be:

SELECT * 
 FROM dbo.udf_split (‘something,to,split’,’,’)

OR

SELECT * 
 FROM dbo.tExample e CROSS APPLY dbo.udf_split(e.StringToSplit,',')

The routines that I’ve used have only ever used a single split character and looped character by character - surely worse than R.B.A.R.!

With another long flight and a dead work laptop battery, I started writing a set-based method to split strings while accepting multiple character delimiters. Given I have a number of personal experiments running and don't have an effective method to split large strings it seemed a useful way to spend time. I’m aware there is more than SQL in the world but am keen to keep this task in database for now.

This note describes the method as a contribution to the community and I welcome any comments or thoughts.

Approach

The approach is to identify all the delimiter locations across the string in one pass. Then using the lead() window function to define the character range for each interval between the delimiter pairs as well as dealing with the first and last position. These intervals then drive return a record per delimited value with an ordinal.

The code is written making use of features for SQL Server 2012+.

The first prerequisite is a numberS table with as many numbers as your maximum string length is long/ Depending on the task, this may not be trivial. If you don’t have a number table the most efficient way to construct one I’ve found is describe by Itzik Ben-Gan. In this case the table is physical rather than a dynamic table function as described by Itzik. For the sake of containment here is a simple short 1000 number table. It is critical to size this as required as this is not tested for in the code!

CREATE TABLE dbo.tNUM
(
    NUM INT PRIMARY KEY
);
WITH L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
     L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
     L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
     L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
     L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
     Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM L4)
INSERT INTO dbo.tNUM ( NUM )
SELECT TOP (1000)
       n
FROM Nums
ORDER BY n;

Code Walk through

As a table function the results can directly returned or can be used via cross apply to drive output for each input record.

SELECT *
 FROM dbo.ufnt_split('a,b,c',',')

OR

SELECT a.name, b.Ordinal, b.String
 FROM sys.objects a 
  CROSS APPLY dbo.ufnt_split(name,'_') b

The first step in the code is to determine the length of the string to be split and the length of the delimiter string.

SET @Len = DATALENGTH(@String)/2
SET @Dlen = DATALENGTH(@Delimiter)/2

@Len stores the length of the string for splitting and @Dlen stores the length of the delimiter

One noteworthy feature I learnt as part of this exercise was that LEN() does not count trailing spaces. The following example shows the need to use the DATALENGTH() function. The value 'Space ' is 6 characters long in total however len() returns only 5. 

SELECT LEN('Space ') [Len example],
       DATALENGTH('Space ') [DATALENGTH example]
+------------------+----------------------------+
| Len example  | DATALENGTH example  |
+------------------+----------------------------+
|    5                  |   6                             |
+------------------+----------------------------+

The result then has to be divided by two as the type is nvarchar.

For the next part consider the simple case a,b,c and splitting at the position of the commas. The delimiter length is 1 (@Dlen) and the string length is 5 (@Len).

To be able to identify the positions of each comma, all of the 5 potential locations need to be tested. Each possible location is 1 character long. So using SELECT from the tNUM records where NUM <= 5, a substring expression can be used to limit the number table to only the positions where the substring is equal to the delimiter being searched for. In other words the result of the query is the positions of the delimiters.

SELECT NUM, @String
 from dbo.tNUM
 WHERE NUM <= @len and SUBSTRING(@String,NUM,@Dlen) = @Delimiter

Notice that the length of the delimiter here is used to set the substring length to test for.

There are a couple other parts of locating the positions to split worth explaining. Firstly, a 0 is unioned to the results for the start of the first interval. To be able to list the location of the next delimiter position, the LEAD(NUM) is used. This will "pull" the next delimiter position onto the record, forming the start and end position for a single split term. The next point to make, is that the last split needs to end at the length of the string. To do this coalescing LEAD(NUM) with the @Len ensures the final split position is the overall string length. The count of the number of segments is also captured into the @Cut table variable. This table now has one record for each part to be split out of the original string.

INSERT into @Cut (CutStart, CutEnd, MaxOrdinal)
SELECT NUM, coalesce(lead(NUM) over (order by NUM),@Len), count(*) over (partition by null) 
 FROM (SELECT 0 NUM UNION
       SELECT NUM
        FROM dbo.tNUM
        WHERE NUM <= @Len and SUBSTRING(@String,NUM,@Dlen) = @Delimiter) x

The @Cut table variable can be used to split the string into parts. So by selecting from the @Cut table and using the substring function starting at the CutStart for a length of the CutEnd-CutStart the parts are returned. The conditions to be catered for as part of this is that after the first Ordinal the length of the delimiter must be added to the CutStart. Then the length should be reduced by the length of the delimiter, except for the first and last ordinal where no reduction is required.

INSERT INTO @tblOUT (
    Ordinal,
    String
)
SELECT Ordinal,
       StringValue
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY Ordinal) Ordinal,
           SUBSTRING(   @String,
                        CutStart + CASE
                                       WHEN Ordinal = 1 THEN
                                           0
                                       ELSE
                                           @Dlen
                                   END,
                        CutEnd - CutStart - CASE
                                                WHEN Ordinal = 1
                                                     OR Ordinal = MaxOrdinal THEN
                                                    0
                                                ELSE
                                                    @Dlen
                                            END
                    ) StringValue
    FROM @Cut
) x;

The select of @tblOUT is then returned by the function.

Conclusion

I have been using this on and off over the past few months in personal projects. I have been able to get through logs of text and have not encountered any issues.

The handling of multiple character delimiters has been useful in a number of places. Certainly not a best-case method but interesting is to split a table definition into rows and columns as below.

SELECT
MIN(CASE WHEN C = 1 THEN V END) [1],
MIN(CASE WHEN C = 2 THEN V END) [2],
MIN(CASE WHEN C = 3 THEN V END) [3]
FROM ( SELECT a.Ordinal R, b.Ordinal C, ltrim(rtrim(replace(replace(b.String,'<tr>',''),'<td>',''))) V
        FROM (SELECT *
               FROM dbo.ufnt_split('<tr><td>Apple</td><td>Orange</td><td>Pear</td></tr><tr><td>Green</td><td>Orange</td><td>Green</td></tr><tr><td>Medium</td><td>Medium</td><td>Medium</td></tr>','</tr>'))a
 CROSS APPLY dbo.ufnt_split (a.String,'</td>') b) c
WHERE LEN(V)>0
GROUP BY R
ORDER BY R

I hope that you find the function useful.

The attached function creates the tNUM table and ufnt_split table function in tempdb so stay out of the way of user environments. Please test carefully and use and feed back any useful comments. Thanks for reading hope you find some enjoyable ways to split.

Resources

Rate

3.43 (7)

Share

Share

Rate

3.43 (7)