Splitting Strings in SQL Server 2016

  • Ed Wagner (4/19/2016)


    Great article, Wayne. Thanks for putting it together. I especially liked your comparison of the different approaches and what each one does and does not do. I really hope that Microsoft finishes the function by adding an ordinal to the return result set.

    I must say that I fully agree - a function should return something even if it's NULL.

    Thanks Ed

    Now all I have to do is wait until we get SQL 2016 here at work. I figure I have another 4 or 5 years. :w00t:

    Did you hear that SQL Server Developer Edition is now free[/url]?

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (4/19/2016)


    Now all I have to do is wait until we get SQL 2016 here at work. I figure I have another 4 or 5 years. :w00t:

    Did you hear that SQL Server Developer Edition is now free[/url]?

    Yes I did. My new laptop is supposed to arrive this Thursday. I'm hoping it has enough horsepower to run SQL Server. That was the whole reason for getting it.

  • Good to know there would be built in function to split string now. And i have various ways to use them.

    But right now i use function which can work with multiply-character delimiter (like "||" , "&&") and output table has 2 columns. First one is the ID its the member place from left to right. Some time i had 2 or 3 inputs in procedure with string and after convertin them to tables had to join them by their ID. So it is usefull.

    So right now i don't see any bonuses in built in function, except performance. But since i use those function not in OLTP envirement and not so frequent and not with big string i don't see any improvement. Results are anyway in macro seconds...

  • etheral1980 (5/14/2016)


    Good to know there would be built in function to split string now. And i have various ways to use them.

    But right now i use function which can work with multiply-character delimiter (like "||" , "&&") and output table has 2 columns. First one is the ID its the member place from left to right. Some time i had 2 or 3 inputs in procedure with string and after convertin them to tables had to join them by their ID. So it is usefull.

    So right now i don't see any bonuses in built in function, except performance. But since i use those function not in OLTP envirement and not so frequent and not with big string i don't see any improvement. Results are anyway in macro seconds...

    It's a funny thing about "fast" code... when you call them 100,000 times a day. 😀 I continue to find and repair such things where I work. Individually, they don't amount to much. Collectively, they take literally hours of CPU time and add to the IO bottleneck by a substantial amount.

    In the last couple of years, I've eliminated nearly 3 days per day of CPU time and a couple of hundred trillion bytes of IO on our "money maker" system and it's really helped with timeouts and deadlocks on the user end a whole lot.

    That's the problem with such things. No one thinks that "milliseconds matter" and then when performance tanks (it was seriously tanked when I first joined the company), it's a real bugger to fix because it's a system cancer that has occurred instead of just a handful of places that need help. It's not just front-end code, either. I've started working on some of the batch jobs that are getting out of hand because of bad practices that people paid little attention to because they didn't look ahead. For example, there are dozens of places where someone populates a Temp Table as a very necessary loop control. Ironically, the loops work fine but the Temp Tables are a killer because they select the TOP 1 ID and when the loop completes, they DELETE that row from the Temp Table. These Temp Tables get quite large and the "find" for the DELETEs cause a full table scan of the Temp Table. I'm sure none of that was a problem per se` when the tables were much smaller but they're in the top 10 of things that take too much CPU/IO now. It's 1) stupid to delete from a Temp Table ((N2+N)/2+N rows read, in this case... one pass worse than a Triangular Join) and 2) it amazing that they didn't thing of using a Clustered Index on an IDENTITY column for the control. The loop, of course, is also a problem but it pales in comparison to just doing the DELETEs from the control table.

    My recommendation is to never justify poor performance by saying "good enough" based on a small number of rows, which have the habit of growing or someone else using the code as a example for larger problems (which you have NO control over, ever).

    Do like Granny used to say... "Mind the Pennies and the Dollars will take care of themselves". It just doesn't take any extra time and, in the long run, will save a huge amount of time because rework is about 8 times more expensive than doing it right the first time.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/15/2016)


    etheral1980 (5/14/2016)


    Good to know there would be built in function to split string now. And i have various ways to use them.

    But right now i use function which can work with multiply-character delimiter (like "||" , "&&") and output table has 2 columns. First one is the ID its the member place from left to right. Some time i had 2 or 3 inputs in procedure with string and after convertin them to tables had to join them by their ID. So it is usefull.

    So right now i don't see any bonuses in built in function, except performance. But since i use those function not in OLTP envirement and not so frequent and not with big string i don't see any improvement. Results are anyway in macro seconds...

    It's a funny thing about "fast" code... when you call them 100,000 times a day. 😀 I continue to find and repair such things where I work. Individually, they don't amount to much. Collectively, they take literally hours of CPU time and add to the IO bottleneck by a substantial amount.

    In the last couple of years, I've eliminated nearly 3 days per day of CPU time and a couple of hundred trillion bytes of IO on our "money maker" system and it's really helped with timeouts and deadlocks on the user end a whole lot.

    That's the problem with such things. No one thinks that "milliseconds matter" and then when performance tanks (it was seriously tanked when I first joined the company), it's a real bugger to fix because it's a system cancer that has occurred instead of just a handful of places that need help. It's not just front-end code, either. I've started working on some of the batch jobs that are getting out of hand because of bad practices that people paid little attention to because they didn't look ahead. For example, there are dozens of places where someone populates a Temp Table as a very necessary loop control. Ironically, the loops work fine but the Temp Tables are a killer because they select the TOP 1 ID and when the loop completes, they DELETE that row from the Temp Table. These Temp Tables get quite large and the "find" for the DELETEs cause a full table scan of the Temp Table. I'm sure none of that was a problem per se` when the tables were much smaller but they're in the top 10 of things that take too much CPU/IO now. It's 1) stupid to delete from a Temp Table ((N2+N)/2+N rows read, in this case... one pass worse than a Triangular Join) and 2) it amazing that they didn't thing of using a Clustered Index on an IDENTITY column for the control. The loop, of course, is also a problem but it pales in comparison to just doing the DELETEs from the control table.

    My recommendation is to never justify poor performance by saying "good enough" based on a small number of rows, which have the habit of growing or someone else using the code as a example for larger problems (which you have NO control over, ever).

    Do like Granny used to say... "Mind the Pennies and the Dollars will take care of themselves". It just doesn't take any extra time and, in the long run, will save a huge amount of time because rework is about 8 times more expensive than doing it right the first time.

    Jeff,

    i totally agree with you on all points you mentioned. But may be i was not clear enough 🙂

    I talked specificly about string to table function and i told i don't use them in OLTP envirement. There are no million calls per day, not even hundreds.

    If it is OLTP and input for procedure would be multiply rows for insert/update i tell programmers to send me table and not string and then i convert it to table to make set based operation. It is qiuete obvious.

    Right now i use this function in DWH. To send reports in HTML i need this function in dynamic procedure which build html. So about 10 times a day? Build in function of 2016 will not give me better performance, but have less options my function has.

    In one case i had a table with 2 columns i needed to process. One for example is brand and other products. So the products column is a string. All products for this brand. Delimited string in one columns. Ok, design is bad from start. But needed to do something.

    So there is another use for string to table function if you need result set with 1 row for each product and in the same row corresponding brand.

    And if you started about how much macroseconds are importand ... there is another thing i fruequently see programmers and even other DBA don't do. Write table/procedure and etc names with it's schema name. In heavy OLTP envirement it will also save huge amount of cpu time.

  • @etheral1980,

    Understood. But that's how it all starts and your example is exactly what I'm talking about. It seems totally harmless right now.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I wonder if any change is needed to STRING_SPLIT for getting the order of values that are split. The following is an example of how I use the ROW_NUMBER function to get the order of values, either within a group or across all entries (second SELECT).

    CREATE TABLE

    #test_table

    (

    group_name

    VARCHAR (20)

    ,value_array

    VARCHAR (MAX)

    )

    ;

    INSERT

    #test_table

    VALUES

    ('first', '1,2,3,4,7,6,5')

    ,('second', '2,1,6,5,3,4')

    ;

    SELECT

    group_name

    ,ROW_NUMBER () OVER (PARTITION BY group_name ORDER BY GETDATE())

    AS record_id

    ,[value]

    AS file_record

    FROM

    #test_table

    CROSS APPLY

    STRING_SPLIT

    (value_array, N',')

    ;

    SELECT

    group_name

    ,ROW_NUMBER () OVER (PARTITION BY 1 ORDER BY GETDATE())

    AS record_id

    ,[value]

    AS file_record

    FROM

    #test_table

    CROSS APPLY

    STRING_SPLIT

    (value_array, N',')

    ;

    DROP TABLE

    #test_table

    ;


    Have Fun!
    Ronzo

  • Hmm... I kind of like the XML-based splitter method, if only because part of the process is it can throw out the data in the output tableset as a specific data type (varchar(x), int, date, etc), and it looks like this data type could be specified dynamically too...

    http://www.sqlservercentral.com/articles/XML/66932/ et al.

    Again, yes, the XML-based splitter is not as performant as SQLCLR function, Jeff Moden's splitter, etc., but if it's just used at the beginning of a stored proc, to split a parameter value into a table set, probably not a big deal in the grand scheme of things.

  • corey lawson (9/8/2016)


    Hmm... I kind of like the XML-based splitter method, if only because part of the process is it can throw out the data in the output tableset as a specific data type (varchar(x), int, date, etc), and it looks like this data type could be specified dynamically too...

    http://www.sqlservercentral.com/articles/XML/66932/ et al.

    You can't specify the data type dynamically using the XML method; the datatype must be specified as a string literal. Using a tally table based splitter you can simply use CAST to specify the datatype you'd like returned either inside the function or similar splitter logic nested in subquery.

    Again, yes, the XML-based splitter is not as performant as SQLCLR function, Jeff Moden's splitter, etc., but if it's just used at the beginning of a stored proc, to split a parameter value into a table set, probably not a big deal in the grand scheme of things.

    Probably not but why not just have the fastest splitter available and call it a day?

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B (9/8/2016)


    corey lawson (9/8/2016)


    Hmm... I kind of like the XML-based splitter method, if only because part of the process is it can throw out the data in the output tableset as a specific data type (varchar(x), int, date, etc), and it looks like this data type could be specified dynamically too...

    http://www.sqlservercentral.com/articles/XML/66932/ et al.

    You can't specify the data type dynamically using the XML method; the datatype must be specified as a string literal. Using a tally table based splitter you can simply use CAST to specify the datatype you'd like returned either inside the function or similar splitter logic nested in subquery.

    Again, yes, the XML-based splitter is not as performant as SQLCLR function, Jeff Moden's splitter, etc., but if it's just used at the beginning of a stored proc, to split a parameter value into a table set, probably not a big deal in the grand scheme of things.

    Probably not but why not just have the fastest splitter available and call it a day?

    Better yet, if the splitter is only being used to dice up parameters passed as a string, why not pass those parameters as XML and do normal shredding, which will be quite fast for many reasons as well as no longer being limited to 4000 or 8000 characters and, maybe, actually having a chance of passing special characters without the problem that a lot of homegrown XML splitters have and that's choking on a special character like "&".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hmm... good point, but...

    but thinking more of user-supplied list of values (like, say, an SSRS report, that passes the parameter to a stored proc to get the data for the report), and not wanting/being able to convert that delimited string into XML somewhere else to then pass in to a proc, expect the end user to properly wrap values in some other delimiter, etc...

    Like, send in a list of dates (like, I haven't wrapped the individual values in double-quotes, etc... )

    8/17/2016,8/24/2016, 9/12/2016, 4/1/2015, ...

    ...then the XML splitter, using the explicit type cast to a date or datetime in the XQuery statement, just works, reasonably enough...

    Also, the XML splitter code will work for multiple-character delimiters, too...

  • Maybe someone already brought it up, but I am a little curious as to why the OPENJSON function was even mentioned in this article. I can't see that it really added any value. It can't handle a straight delimited string without manipulating the string, even if that string is comma delimited (must add brackets, enclose each string in double-quotes, and escape existing double-quotes, and it is slower than STRING_SPLIT. The overhead involved in making a string compatible with using OPENJSON seems like making mention of it only fills the article with useless fluff. When you initially mentioned it, I was expecting to read some benefit of using it over STRING_SPLIT. Great for breaking down a JSON, but not for splitting strings.

    Otherwise, a good article.

  • dbishop - Thursday, August 10, 2017 4:49 PM

    Maybe someone already brought it up, but I am a little curious as to why the OPENJSON function was even mentioned in this article. I can't see that it really added any value. It can't handle a straight delimited string without manipulating the string, even if that string is comma delimited (must add brackets, enclose each string in double-quotes, and escape existing double-quotes, and it is slower than STRING_SPLIT. The overhead involved in making a string compatible with using OPENJSON seems like making mention of it only fills the article with useless fluff. When you initially mentioned it, I was expecting to read some benefit of using it over STRING_SPLIT. Great for breaking down a JSON, but not for splitting strings.

    Otherwise, a good article.

    Wayne addressed that in his post on page 1 at https://www.sqlservercentral.com/Forums/FindPost1777077.aspx.  It's a built-in function that can be used to split string, so he included it in the performance comparison.

  • dbishop - Thursday, August 10, 2017 4:49 PM

    Maybe someone already brought it up, but I am a little curious as to why the OPENJSON function was even mentioned in this article. I can't see that it really added any value. It can't handle a straight delimited string without manipulating the string, even if that string is comma delimited (must add brackets, enclose each string in double-quotes, and escape existing double-quotes, and it is slower than STRING_SPLIT. The overhead involved in making a string compatible with using OPENJSON seems like making mention of it only fills the article with useless fluff. When you initially mentioned it, I was expecting to read some benefit of using it over STRING_SPLIT. Great for breaking down a JSON, but not for splitting strings.

    Otherwise, a good article.

    I can't actually speak for Wayne but I suspect that it's for the same reasons why I would include it and several other methods in such an article. A lot of people have come up with many different ways of splitting strings and I'd want to be sure to cover them all if, for nothing else, to demonstrate the fallacy of using some of those "clever" methods because other people writing about them don't.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Great article and read!

Viewing 15 posts - 16 through 30 (of 33 total)

You must be logged in to reply to this topic. Login to reply