Set-Based String Splitting table function

  • bevanward

    SSC Eights!

    Points: 855

    Comments posted to this topic are about the item Set-Based String Splitting table function

  • ChrisM@Work

    SSC Guru

    Points: 186094

    Thanks for taking the time to write this up. Interesting - you've arrived at a few of the shortcuts of the house splitter

    http://www.sqlservercentral.com/articles/Tally+Table/72993/ written by Jeff Moden (and others).
    I would imaging that storing the results into a table variable then querying that for the output will cost some time, is it really necessary?

    [font="Arial"]“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw[/font]


    For fast, accurate and documented assistance in answering your questions, please read this article[/url].
    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]
    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]
    [url

  • Carlo Romagnano

    SSC-Insane

    Points: 21948

    bevanward - Saturday, March 17, 2018 11:21 AM

    Comments posted to this topic are about the item Set-Based String Splitting table function

    You  can't use the function with  very large string because of the tally table with only 1000 numbers.
    insert into dbo.tNUM (NUM)
    SELECT TOP (1000) n FROM Nums ORDER BY n;

  • bevanward

    SSC Eights!

    Points: 855

    Carlo Romagnano - Monday, March 19, 2018 3:28 AM

    bevanward - Saturday, March 17, 2018 11:21 AM

    Comments posted to this topic are about the item Set-Based String Splitting table function

    You  can't use the function with  very large string because of the tally table with only 1000 numbers.
    insert into dbo.tNUM (NUM)
    SELECT TOP (1000) n FROM Nums ORDER BY n;

    Hi Carlo

    I have a few million in my table however did not want to add so much for the example. Just increase the number of records or update the function to use an existing if you have one?

    Thanks
    Bevan

  • bevanward

    SSC Eights!

    Points: 855

    ChrisM@Work - Monday, March 19, 2018 3:05 AM

    Thanks for taking the time to write this up. Interesting - you've arrived at a few of the shortcuts of the house splitter

    http://www.sqlservercentral.com/articles/Tally+Table/72993/ written by Jeff Moden (and others).
    I would imaging that storing the results into a table variable then querying that for the output will cost some time, is it really necessary?

    Hi Chris

    I had not seen this before so will have to have a look - from a brief look it seems to be very comprehensive.
    I had not tried without the table variable so will have to experiment. 
    Thanks for the link!
    Cheers
    Bevan

  • gbritton1

    SSCertifiable

    Points: 6520

    ChrisM@Work - Monday, March 19, 2018 3:05 AM

    Thanks for taking the time to write this up. Interesting - you've arrived at a few of the shortcuts of the house splitter

    http://www.sqlservercentral.com/articles/Tally+Table/72993/ written by Jeff Moden (and others).
    I would imaging that storing the results into a table variable then querying that for the output will cost some time, is it really necessary?

    Yes, this is the gold standard for string splitting in pure SQL.  Surprised the author was not aware of it

  • bevanward

    SSC Eights!

    Points: 855

    gbritton1 - Monday, March 19, 2018 6:35 AM

    ChrisM@Work - Monday, March 19, 2018 3:05 AM

    Thanks for taking the time to write this up. Interesting - you've arrived at a few of the shortcuts of the house splitter

    http://www.sqlservercentral.com/articles/Tally+Table/72993/ written by Jeff Moden (and others).
    I would imaging that storing the results into a table variable then querying that for the output will cost some time, is it really necessary?

    Yes, this is the gold standard for string splitting in pure SQL.  Surprised the author was not aware of it

    Hi gbritton1

    Have just read through Jeff's work and it does seem to be VERY complete and I'm sure anything I've done is a subset of this!
    I have been using my function for splitting million character files based on various patterns however need to compare to what Jeff has done as I'm sure it will improve performance. In the meantime was a useful learning exercise but certainly doesn't extend what is out there!

    Thanks for the comment
    Bevan

  • Jeff Moden

    SSC Guru

    Points: 996502

    bevanward - Monday, March 19, 2018 10:09 AM

    gbritton1 - Monday, March 19, 2018 6:35 AM

    ChrisM@Work - Monday, March 19, 2018 3:05 AM

    Thanks for taking the time to write this up. Interesting - you've arrived at a few of the shortcuts of the house splitter

    http://www.sqlservercentral.com/articles/Tally+Table/72993/ written by Jeff Moden (and others).
    I would imaging that storing the results into a table variable then querying that for the output will cost some time, is it really necessary?

    Yes, this is the gold standard for string splitting in pure SQL.  Surprised the author was not aware of it

    Hi gbritton1

    Have just read through Jeff's work and it does seem to be VERY complete and I'm sure anything I've done is a subset of this!
    I have been using my function for splitting million character files based on various patterns however need to compare to what Jeff has done as I'm sure it will improve performance. In the meantime was a useful learning exercise but certainly doesn't extend what is out there!

    Thanks for the comment
    Bevan

    Just for the sake of curiosity, how many "fields" per "line",how many "lines", and how many average characters per line?  I'm always up for different ideas, especially on this subject, and would like to knock off a test or two.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Forum FAQ

  • bevanward

    SSC Eights!

    Points: 855

    Jeff Moden - Monday, March 19, 2018 2:21 PM

    bevanward - Monday, March 19, 2018 10:09 AM

    gbritton1 - Monday, March 19, 2018 6:35 AM

    ChrisM@Work - Monday, March 19, 2018 3:05 AM

    Thanks for taking the time to write this up. Interesting - you've arrived at a few of the shortcuts of the house splitter

    http://www.sqlservercentral.com/articles/Tally+Table/72993/ written by Jeff Moden (and others).
    I would imaging that storing the results into a table variable then querying that for the output will cost some time, is it really necessary?

    Yes, this is the gold standard for string splitting in pure SQL.  Surprised the author was not aware of it

    Hi gbritton1

    Have just read through Jeff's work and it does seem to be VERY complete and I'm sure anything I've done is a subset of this!
    I have been using my function for splitting million character files based on various patterns however need to compare to what Jeff has done as I'm sure it will improve performance. In the meantime was a useful learning exercise but certainly doesn't extend what is out there!

    Thanks for the comment
    Bevan

    Just for the sake of curiosity, how many "fields" per "line",how many "lines", and how many average characters per line?  I'm always up for different ideas, especially on this subject, and would like to knock off a test or two.

    Hi Jeff

    To be open I'm sure I should be doing a lot of these tasks in Python but first tend to try and use SQL as a challenge. 
    I would be interested in any of your comments however can see your method is well tested!
    I don't have a normal as such - I've started using it in lots of places.

    For example finding the last sentence after a .(space), !(space), or ?(space) of 10 thousand records, maybe 10 sentences, less than 2000 characters each line.

    One other example at the moment is to remove Google stop words for example

    use tempdb
    go

    drop table if exists dbo.tStop
    drop function if exists dbo.fnt_remove_stopwords

    create table dbo.tStop (Word nvarchar(50) unique)
    insert into dbo.tStop values ('a'),('about'),('above'),('after'),('again'),('against'),('all'),('am'),('an'),('and'),('any'),('are'),('as'),('at'),('be'),('because'),('been'),('before'),('being'),('below'),('between'),('both'),('but'),('by'),('could'),('did'),('do'),('does'),('doing'),('down'),('during'),('each'),('few'),('for'),('from'),('further'),('had'),('has'),('have'),('having'),('he'),('he''d'),('he''ll'),('he''s'),('her'),('here'),('here''s'),('hers'),('herself'),('him'),('himself'),('his'),('how'),('how''s'),('i'),('i''d'),('i''ll'),('i''m'),('i''ve'),('if'),('in'),('into'),('is'),('it'),('it''s'),('its'),('itself'),('let''s'),('me'),('more'),('most'),('my'),('myself'),('nor'),('of'),('on'),('once'),('only'),('or'),('other'),('ought'),('our'),('ours'),('ourselves'),('out'),('over'),('own'),('same'),('she'),('she''d'),('she''ll'),('she''s'),('should'),('so'),('some'),('such'),('than'),('that'),('that''s'),('the'),('their'),('theirs'),('them'),('themselves'),('then'),('there'),('there''s'),('these'),('they'),('they''d'),('they''ll'),('they''re'),('they''ve'),('this'),('those'),('through'),('to'),('too'),('under'),('until'),('up'),('very'),('was'),('we'),('we''d'),('we''ll'),('we''re'),('we''ve'),('were'),('what'),('what''s'),('when'),('when''s'),('where'),('where''s'),('which'),('while'),('who'),('who''s'),('whom'),('why'),('why''s'),('with'),('would'),('you'),('you''d'),('you''ll'),('you''re'),('you''ve'),('your'),('yours'),('yourself'),('yourselves')
    go

    create function dbo.fnt_remove_stopwords (@Value nvarchar(max))
    returns nvarchar(max)
    as
    begin

    declare @Out nvarchar(max)

    set @Value = replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(@Value, '&amp',''),',',''),';',''),'.',''),'(',''),')',''),'?',''),':',''),'[',''),']',''),'!','')

    select @Out = (select ' '+String from
        (select top 100 percent Ordinal, String
        from dbo.ufnt_split(@Value, ' ')
        where String not in
            (select Word
            from tStop)
            order by Ordinal) subqry
    for xml path,TYPE).value('.[1]','nvarchar(max)')

    return @out

    end
    go

    select dbo.fnt_remove_stopwords('The shortbeard codling can grow to a maximum standard length of about 33 cm (13 in) and takes the form of a somewhat spindle-shaped cylinder. The snout is fairly blunt and the upper jaw is slightly longer than the lower jaw. There is a short barbel on the chin and a flattened spine on the operculum. There are two dorsal fins with no spines and a total of from 69 to 76 soft rays. The first dorsal fin has 6 rays, the second of which is the longest, and the second dorsal fin originates immediately behind the first fin. Each pectoral fin has 17 to 23 soft rays. The pelvic fins are widely separated and each has two elongated soft rays; these fins originate to the front of the pectoral fins. The anal fin has no spines and has 57 to 63 soft rays. The caudal fin is small and rounded. The general colour of this fish is tan, the dorsal and anal fins having dark edges.')

    Thanks for taking the time to comment
    Bevan

  • Jeff Moden

    SSC Guru

    Points: 996502

    Cool feedback.  Thanks, Bevan.

    I think the concern of some of the other poster's is having to do with performance.  Generally speaking and with few exceptions, an mTVF (Multi-statement Table Valued Function) is going to be slower than an iTVF (Inline Table Valued Function).  Your use of LEAD may help overcome that but, if you can use lead, take a look at a modernized version of DelimitedSplit8K that a good online friend, Eirikur Eiriksson , wrote.  I've test it and it's twice as fast as DelimitedSplit8k and approaches CLR speeds very closely.  Of course, it requires 2012+.

    http://www.sqlservercentral.com/articles/SQL+Server+2012/106903/

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Forum FAQ

  • Scott In Sydney

    Hall of Fame

    Points: 3331

    gbritton1 - Monday, March 19, 2018 6:35 AM

    ChrisM@Work - Monday, March 19, 2018 3:05 AM

    Thanks for taking the time to write this up. Interesting - you've arrived at a few of the shortcuts of the house splitter

    http://www.sqlservercentral.com/articles/Tally+Table/72993/ written by Jeff Moden (and others).
    I would imaging that storing the results into a table variable then querying that for the output will cost some time, is it really necessary?

    Yes, this is the gold standard for string splitting in pure SQL.  Surprised the author was not aware of it

    While I've seen (and am using) Jeff's approach noted above (thanks Jeff!), I'm not going to ding @bevanward for not being aware of this previous article.

    I say kudos to bevan for posting an approach that, while perhaps not as comprehensive as Jeff's, was clearly arrived at independently of Jeff's work.  And just taking the time to share some ideas is a cool thing.

  • bevanward

    SSC Eights!

    Points: 855

    Jeff Moden - Monday, March 19, 2018 7:06 PM

    Cool feedback.  Thanks, Bevan.

    I think the concern of some of the other poster's is having to do with performance.  Generally speaking and with few exceptions, an mTVF (Multi-statement Table Valued Function) is going to be slower than an iTVF (Inline Table Valued Function).  Your use of LEAD may help overcome that but, if you can use lead, take a look at a modernized version of DelimitedSplit8K that a good online friend, Eirikur Eiriksson , wrote.  I've test it and it's twice as fast as DelimitedSplit8k and approaches CLR speeds very closely.  Of course, it requires 2012+.

    http://www.sqlservercentral.com/articles/SQL+Server+2012/106903/

    Hi Jeff
    Thanks for this - I had never thought through, but makes complete sense the multi-statement versus the single statement return. Since I have been using this more performance is going to start becoming an issue so will have a look at Eirikur's version once I've got my head around all of your work! Thanks for this is highly valued feedback!!
    Take care
    Bevan

  • bevanward

    SSC Eights!

    Points: 855

    Scott In Sydney - Tuesday, March 20, 2018 12:56 AM

    gbritton1 - Monday, March 19, 2018 6:35 AM

    ChrisM@Work - Monday, March 19, 2018 3:05 AM

    Thanks for taking the time to write this up. Interesting - you've arrived at a few of the shortcuts of the house splitter

    http://www.sqlservercentral.com/articles/Tally+Table/72993/ written by Jeff Moden (and others).
    I would imaging that storing the results into a table variable then querying that for the output will cost some time, is it really necessary?

    Yes, this is the gold standard for string splitting in pure SQL.  Surprised the author was not aware of it

    While I've seen (and am using) Jeff's approach noted above (thanks Jeff!), I'm not going to ding @bevanward for not being aware of this previous article.

    I say kudos to bevan for posting an approach that, while perhaps not as comprehensive as Jeff's, was clearly arrived at independently of Jeff's work.  And just taking the time to share some ideas is a cool thing.

    Hi Scott
    Thanks for your message! It is hard to know how long to look for something before starting to write. What Jeff wrote looks perfect.
    Had you seen the modernised version that Jeff mentions above? Seems to be worth considering from the looks of the performance.

    Thanks again keep well
    Bevan

  • bevanward

    SSC Eights!

    Points: 855

    Jeff Moden - Monday, March 19, 2018 7:06 PM

    Cool feedback.  Thanks, Bevan.

    I think the concern of some of the other poster's is having to do with performance.  Generally speaking and with few exceptions, an mTVF (Multi-statement Table Valued Function) is going to be slower than an iTVF (Inline Table Valued Function).  Your use of LEAD may help overcome that but, if you can use lead, take a look at a modernized version of DelimitedSplit8K that a good online friend, Eirikur Eiriksson , wrote.  I've test it and it's twice as fast as DelimitedSplit8k and approaches CLR speeds very closely.  Of course, it requires 2012+.

    http://www.sqlservercentral.com/articles/SQL+Server+2012/106903/

    Hi Jeff

    Have modified Eirikur's version for multiple character delimiters and longer fields. Certainly runs fast from what I've seen so far have to test more.
    The only issue I have stumbled into so far is where delimiters overlap and it splits multiple times - for example when splitting with three dots and try and pass through a area with lots of dots. Any thoughts appreciated. Thanks for pointing this all out.
    Thanks
    Bevan

  • Jeff Moden

    SSC Guru

    Points: 996502

    bevanward - Tuesday, March 20, 2018 3:04 PM

    Jeff Moden - Monday, March 19, 2018 7:06 PM

    Cool feedback.  Thanks, Bevan.

    I think the concern of some of the other poster's is having to do with performance.  Generally speaking and with few exceptions, an mTVF (Multi-statement Table Valued Function) is going to be slower than an iTVF (Inline Table Valued Function).  Your use of LEAD may help overcome that but, if you can use lead, take a look at a modernized version of DelimitedSplit8K that a good online friend, Eirikur Eiriksson , wrote.  I've test it and it's twice as fast as DelimitedSplit8k and approaches CLR speeds very closely.  Of course, it requires 2012+.

    http://www.sqlservercentral.com/articles/SQL+Server+2012/106903/

    Hi Jeff

    Have modified Eirikur's version for multiple character delimiters and longer fields. Certainly runs fast from what I've seen so far have to test more.
    The only issue I have stumbled into so far is where delimiters overlap and it splits multiple times - for example when splitting with three dots and try and pass through a area with lots of dots. Any thoughts appreciated. Thanks for pointing this all out.
    Thanks
    Bevan

    Post the modifications you made.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Forum FAQ

Viewing 15 posts - 1 through 15 (of 26 total)

You must be logged in to reply to this topic. Login to reply