Technical Article

New T-SQL Fixed Width "Splitter"

,

Intro

One day while playing with Solomon Rutzky's SQL# String_FixedWidthSplit I thought, why not crerate a purely set-based T-SQL fixed width splitter? Enter NSplit8K and NSplit2B. Both splitters are identical except that the 8K version handles varchar(8000) and the 2B version (2B for two billion*) handles varchar(max). The 8K version is faster but the 2B version handles strings longer than 8,000 characters. For situations where you need to split a CSV and you know that each token (AKA item) is the same size, you don't need a traditional "splitter" for the job, instead you can use a fixed width splitter instead. As you will see from the examples below, you can use these functions for more than just your traditional "string splitting" tasks. 

Details

Each splitter takes three parameters:

@String           varchar(8000) for NSplit8K, varchar(max) for NSplit2B; represents the input string to "split" into equal sized tokens
@TokenLen     bigint; the size of the output token
@DelimLen      bigint; the size of the delimiter; can be set to 0 for when there's no delimiter or set to negetive if you desire overlap (note the N-Grams example below)

Usages Examples 

Example 1: Split the string, "ab,cd,ef,gg" into tokens

Here's how you would split a string when you know that each token is two characters long and the delimiter is one character long.

 SELECT ItemNumber, Token 
 FROM dbo.NSplit8K('ab,cd,ef,gg',2,1);
Results
ItemNumber           Token
-------------------- --------------------
1                    ab
2                    cd
3                    ef
4                    gg

Example 2: Extract the values "client01", "client02", etc... from string below

This is an example of how you could parse values from an XML fragment where the elements are the same size. 

 DECLARE @string varchar(100) = '<client01>,<client02>,<client03>,<client04>';
 SELECT ItemNumber, Token FROM dbo.NSplit8K(SUBSTRING(@string,2,LEN(@string)-2),8,3);

Results

ItemNumber           Token
-------------------- --------------------
1                    client01
2                    client02
3                    client03
4                    client04

Example 3: Nsplit into an N-Gram function

Here we're using NSplit to create an N-Gram function, a very powerful text mining tool.

 DECLARE @string varchar(8000) = 'abcdefghi',
         @N int = 1; -- change this to 2 for bigrams, 3 for trigrams, 4 for 4-grams, etc...
 SELECT TOP (LEN(@string)-(@N-1)) ItemNumber, Token 
 FROM dbo.NSplit8K('abcdefghi',@N,-(@N-1));

Results when running the above code with @N = 1

ItemNumber           Token
-------------------- --------------------
1                    a
2                    b
3                    c
4                    d
5                    e
6                    f
7                    g
8                    h
9                    i

When @N = 2

ItemNumber           Token
-------------------- --------------------
1                    ab
2                    bc
3                    cd
4                    de
5                    ef
6                    fg
7                    gh
8                    hi

When @N = 3

ItemNumber           Token
-------------------- --------------------
1                    abc
2                    bcd
3                    cde
4                    def
5                    efg
6                    fgh
7                    ghi

Notice that, in the above N-Gram examples, we're not just returning the N-sized token, we're also returning it's position in the stringBelow are the two functions. Happy text manipulating! 

NSplit8K - The varchar(8000) version

CREATE FUNCTION dbo.NSplit8K
(
  @String   varchar(8000),
  @TokenLen bigint,
  @DelimLen bigint
)
/****************************************************************************************
Purpose:
 Splits an input string (@String) into @TokenLen sized tokens with @DelimLen sized gaps 
 between each token. Intended to be a better performing alternative to the traditional 
 "splitter" for cases where the size of the token (AKA "item") is always the same.
Compatibility:
 SQL Server 2008+, Azure SQL Database, Azure SQL Data Warehouse & Parallel Data Warehouse
Syntax:
 SELECT ItemNumber, Token
 FROM dbo.NSplit8K(@string, @length, @delimiterLength);
Parameters:
 @String          = varchar(8000); the input string to "split" into tokens
 @TokenLen        = bigint; the size of the output token
 @DelimLen        = bigint; the size of the delimiter. Can be set to 0 for when there's 
                    no delimiter or set to negetive if you desire overlap
Returns:
 ItemNumber = bigint; represents the order that the token appears in the input string
 Token      = varchar(8000); the @lengh-sized token returned by the function.
Developer notes:
 1. Requires GetNumsAB: http://www.sqlservercentral.com/scripts/Set+Based/139370/
    Alternatively you can incorporate your own tally table logic at your own risk.
 2. If the last token is shorter then @TokenLen then it will be truncated.
 3. Tends to perform better with a serial execution plan; it may be adventageous to test
    with OPTION (MAXDOP 1) and compare that performance with a parallel plan. To force a 
    parallel execution plan see make_parallel by Adam Machanic:
  sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
 4. There are no safeguards built into the function to make sure the tokens are all the
    same size. The only safeguard is to test and know your data.
 5. If any of the parameters are NULL the function will return a a single NULL value.
 6. NSplit8K is deterministic. For more about deterministic functions see:
    https://msdn.microsoft.com/en-us/library/ms178091.aspx
Examples:
--===== (1) Extract values "ab", "cd", "ef", and "gg"; token-size=2; delimiter = ","
 SELECT ItemNumber, Token FROM dbo.NSplit8K('ab,cd,ef,gg',2,1);
--===== (2) Extract values "xxx", "yyy", "bbb", etc... token-size=3; delimiter = ","
 DECLARE @string1 varchar(100) = 'xxx,yyy,bbb,123,zzz';
 SELECT ItemNumber, Token FROM dbo.NSplit8K(@string1,3,1);
--===== (3) extract the values "client01", "client02", etc...
 DECLARE @string varchar(100) = '<client01>,<client02>,<client03>,<client04>';
 SELECT ItemNumber, Token FROM dbo.NSplit8K(SUBSTRING(@string,2,LEN(@string)-2),8,3);
--===== (4) If you dont need a delimiter to split the string
 SELECT ItemNumber, Token FROM dbo.NSplit8K('abcdefghi',3,0);
--===== (5) Turn it into an N-Gram function
 DECLARE @string varchar(8000) = 'abcdefghi',
         @N int = 1; -- change this to 2 for bigrams, 3 for trigrams, 4 for 4-grams, etc...
 SELECT TOP (LEN(@string)-(@N-1)) ItemNumber, Token
 FROM dbo.NSplit8K('abcdefghi',@N,-(@N-1));
 --===== (6) Split a string into unigrams then count the vowels
 SELECT VowelCount = COUNT(*)
 FROM dbo.NSplit8K('abcdefghi',1,0)
 WHERE Token LIKE '[aeiou]';
 --===== (7) Behavior for NULL inputs
 SELECT ItemNumber, Token FROM dbo.NSplit8K(NULL,1,0);
 SELECT ItemNumber, Token FROM dbo.NSplit8K('ABC',NULL,0);
 SELECT ItemNumber, Token FROM dbo.NSplit8K('ABC',1,NULL);
----------------------------------------------------------------------------------------
Rev 00 - 20151030 - Alan Burstein - Initial Development 
Rev 01 - 20160407 - Alan Burstein 
       - Rewrote using getnumsAB
       - Added logic for handling NULL inputs
****************************************************************************************/RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT 
  ItemNumber = rn, 
  Token      = SUBSTRING(@string,n1,@TokenLen)
FROM dbo.GetNumsAB(1,CONVERT(bigint,DATALENGTH(@string),0),@TokenLen+@DelimLen,1)
WHERE @string IS NOT NULL AND @TokenLen+@DelimLen IS NOT NULL
UNION ALL
SELECT 1, NULL   -- if any of the parameters are NULL then return a single NULL value:
WHERE REPLICATE(@string,@TokenLen+@DelimLen) IS NULL;

NSplit2B - The varchar(max) version

CREATE FUNCTION dbo.NSplit2B
(
  @String   varchar(max), 
  @TokenLen bigint, 
  @DelimLen bigint
)
/****************************************************************************************
Purpose:
 Identical to NSplit8K but handles an input string of data type varchar(max). NSplit2B
 splits an input string (@String) into @TokenLen sized tokens with @DelimLen sized gaps 
 between each token. Intended to be a better performing alternative to the traditional 
 "splitter" for cases where the size of the token (AKA "item") is always the same.
Compatibility:
 SQL Server 2008+, Azure SQL Database, Azure SQL Data Warehouse & Parallel Data Warehouse
Syntax:
 SELECT ItemNumber, Token
 FROM dbo.NSplit2B(@string, @length, @delimiterLength)
Parameters:
 @String          = varchar(max); the input string to "split" into tokens
 @TokenLen        = bigint; the size of the output token
 @DelimLen        = bigint; the size of the delimiter. Can be set to 0 for when there's 
                    no delimiter or set to negetive if you desire overlap
Returns:
 ItemNumber = bigint; represents the order that the token appears in the input string
 Token      = varchar(max); the @lengh-sized token returned by the function.
Developer notes:
 1. Requires GetNumsAB: http://www.sqlservercentral.com/scripts/Set+Based/139370/
    Alternatively you can incorporate your own tally table logic at your own risk.
 2. If the last token is shorter then @TokenLen then it will be truncated.
 3. Tends to perform better with a serial execution plan; it may be adventageous to test
    with OPTION (MAXDOP 1) and compare that performance with a parallel plan. To force a
    parallel execution plan see make_parallel by Adam Machanic:
  sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
 4. There are no safeguards built into the function to make sure the tokens are all the
    same size. The only safeguard is to test and know your data.
 5. If any of the parameters are NULL the function will return a a single NULL value.
 6. Will perform slower than NSplit8K due to the fact that it's handling varchar(max) 
    strings. If you're sure all strings contain 8,000 characters or less use NSPlit8K.
 7. NSplit2B is deterministic. For more about deterministic functions see:
    https://msdn.microsoft.com/en-us/library/ms178091.aspx
Examples:
--===== (1) extract the values "ab", "cd", "ef", and "gg"; token-size=2; delimiter=","
 SELECT ItemNumber, Token FROM dbo.NSplit2B('ab,cd,ef,gg',2,1);
--===== (2) extract the values "xxx", "yyy", "bbb", etc... token-size=3; delimiter=","
 DECLARE @string1 varchar(100) = 'xxx,yyy,bbb,123,zzz';
 SELECT ItemNumber, Token FROM dbo.NSplit2B(@string1,3,1);
--===== (3) extract the values "client01", "client02", etc...
 DECLARE @string varchar(100) = '<client01>,<client02>,<client03>,<client04>';
 SELECT ItemNumber, Token FROM dbo.NSplit2B(SUBSTRING(@string,2,LEN(@string)-2),8,3);
--===== (4) If you dont need a delimiter to split the string
 SELECT ItemNumber, Token FROM dbo.NSplit2B('abcdefghi',3,0);
--===== (5) Turn it into an N-Gram function
 DECLARE @string varchar(max) = 'abcdefghi',
         @N int = 1; -- change this to 2 for bigrams, 3 for trigrams, 4 for 4-grams, etc...
 SELECT TOP (LEN(@string)-(@N-1)) ItemNumber, Token 
 FROM dbo.NSplit2B('abcdefghi',@N,-(@N-1));

--===== (6) Split a string into unigrams then count the vowels
 SELECT VowelCount = COUNT(*)
 FROM dbo.NSplit2B('abcdefghi',1,0)
 WHERE Token LIKE '[aeiou]';
--===== (7) Behavior for NULL inputs
 SELECT ItemNumber, Token FROM dbo.NSplit2B(NULL,1,0);
 SELECT ItemNumber, Token FROM dbo.NSplit2B('ABC',NULL,0);
 SELECT ItemNumber, Token FROM dbo.NSplit2B('ABC',1,NULL);
----------------------------------------------------------------------------------------
Rev 00 - 20151030 - Alan Burstein - Initial Development 
Rev 01 - 20160407 - Alan Burstein 
       - Rewrote using getnumsAB
       - Added logic for handling NULL inputs
****************************************************************************************/RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT 
  ItemNumber = rn, 
  Token      = SUBSTRING(@string,n1,@TokenLen)
FROM dbo.GetNumsAB(1,CONVERT(bigint,DATALENGTH(@string),0),@TokenLen+@DelimLen,1)
WHERE @string IS NOT NULL AND @TokenLen+@DelimLen IS NOT NULL
UNION ALL
SELECT 1, NULL   -- if any of the parameters are NULL then return a single NULL value
WHERE REPLICATE(@string,@TokenLen+@DelimLen) IS NULL;
* varchar(max) handles 2,147,483,647 characters. 2B just represents that number rounded down. 
  

Rate

5 (3)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (3)

You rated this post out of 5. Change rating