Extracting a specific number of words from a string in sql

  • I'm currently trying to write a query that will extract the first two words from a string. I'm using charindex and substrings and it's getting quite messy! Has anyone ever written a function may be to extract a specific number of words i.e. text around spaces please? It would need to be able to cope with data where there are different lengths and number of words such as:

    This is the first line

    Second line

    Word

    This is the fourth line

    and if I was asking it to return a string containing the first two words the results would be:

    This is

    Second line

    Word

    This is

    Thanks.

  • I know that by using regular expressions, the \b is a word boundary; it might be a space/tab/Crlf, or it might be punctuation...period comma exclamation semicolon etc.

    using regualr expressions, you'd be able to grab x number of words, based on word boundaries.

    that would be the best way to go, but I've only installed the extended stored procedures from the SSC Toolkit on SQL2000.

    let me install on 2005 and test a few expressions.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell - if you're on 2005 - use CLR instead to create the REGEX expressions instead. The XP is nice in 2000, but you can do better in 2005.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • This may look a little funky, but it does the job 🙂

    declare @x table ( words varchar(50) not null )

    insert @x

    select 'This is the first line' union all

    select 'Second line' union all

    select 'Word' union all

    select 'This is the fourth line'

    selectcase when charindex(' ', words, charindex(' ', words) + 1) = 0

    then words

    else left(words, charindex(' ', words, charindex(' ', words) + 1))

    end

    from @x

    go

    --------------------------------------------------

    This is

    Second line

    Word

    This is

    (4 row(s) affected)

    /Kenneth

  • martin.griffiths (2/27/2008)


    I'm currently trying to write a query that will extract the first two words from a string. I'm using charindex and substrings and it's getting quite messy! Has anyone ever written a function may be to extract a specific number of words i.e. text around spaces please? It would need to be able to cope with data where there are different lengths and number of words such as:

    This is the first line

    Second line

    Word

    This is the fourth line

    and if I was asking it to return a string containing the first two words the results would be:

    This is

    Second line

    Word

    This is

    Thanks.

    Just curious... what will this be used for? In other words, why do you need to do this?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I need to do this to parse out data required by an agency we send address details too to place the contents of our address line 1 to their separate Flat No, house number, and street name fields. So I'm basically going to try to select the first two words if address line 1 starts with Flat in to Flat No, if it begins with a number then place the number in to house number field, and then put the remainder (or the else) in to the street name field. Not entirely 100% reliable but will do. Kenneth's response above will work nicely if I'm only looking for two words but will not work with instances like my fourth line where there's multiple spaces. Would prefer a function though that I can re-use taking in a string and a number representing the number of words required. Like the way it works though thanks Kenneth.

  • I found a function that may be more along the lines of what you are looking for. This function requires two inputs and the number of words you would like returned. I did not write the code and will give credit to the programmer. His link is http://www.tom-muck.com/blog/index.cfm?newsid=72

    CREATE FUNCTION udf_GetNumberOfWords (

    @stringToSplit varchar(8000),

    @numberOfWords int

    )

    RETURNS varchar(8000) AS

    BEGIN

    DECLARE @currentword varchar(8000)

    DECLARE @returnstring varchar(8000)

    DECLARE @wordcount int

    SET @wordcount = 0

    SET @returnstring = ''

    SET @currentword = ''

    SET @stringToSplit = ltrim(rtrim(@stringToSplit))

    Declare @index int

    WHILE @wordcount 0

    BEGIN

    Select @index = CHARINDEX(' ', @stringToSplit)

    if @index = 0

    BEGIN

    SELECT @currentword = ltrim(rtrim(@stringToSplit))

    SELECT @wordcount = @numberOfWords

    END

    else

    BEGIN

    IF (len(@stringToSplit) - @index > 0) BEGIN

    SELECT @currentword = ltrim(rtrim(LEFT(@stringToSplit, @index-1)))--the new shortened string

    SELECT @stringToSplit = RIGHT(@stringToSplit,LEN(@stringToSplit) - @index) -- the rest

    END

    END

    SELECT @returnstring = @returnstring + ' ' + @currentword

    SELECT @wordcount = @wordcount + 1

    END

    SET @returnstring = LTRIM(@returnstring)

    RETURN @returnstring

    END

    The function can be called like this: (2 is the number of words to return)

    select dbo.udf_GetNumberOfWords(mycolumn,2)

    from mytable

  • Matt Miller (2/27/2008)


    Lowell - if you're on 2005 - use CLR instead to create the REGEX expressions instead. The XP is nice in 2000, but you can do better in 2005.

    agreed Matt; I think I'm waiting for someone to put together a package of "CLR must have functions", they way someone did for the Toolkit.

    Regular expressions, with specific calls for common string manipulations (strip html for example) is something I'm playing with now, but i get distracted easily.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • I have the start to one. I need to spiffy them up and I will see what I can do about posting them.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • That function's worked a dream thanks. I added a little while loop in to take care of the multiple spaces and it looks really good now. Many thanks.

  • Just for fun, another way to skin the cat.

    This specific example relies on a numbers table named 'nums' with a column named 'n' that has numbers at least up to the number of charachters in the longest string.

    It's also possible to do a numtable 'on the fly' as a derived table.

    No considerations to performance, though 😉

    create table #x ( words varchar(50) not null )

    insert #x

    select 'This is the first line' union all

    select 'Second line'union all

    select 'Word' union all

    select 'This is the fourth line'

    go

    declare @numWords int

    set @numWords = 3 -- set the 'number of words here'

    selectleft(z.words, z.n) as shortString

    from (

    selecty.words, y.n, row_number() over (partition by y.words order by y.words, y.n) as b

    from (

    selectx.words, n.n, substring(x.words, n.n, 1) as letter

    from#x x

    joinnums n

    onn.n <= len(x.words)

    ) y

    where y.letter = ' '

    ) z

    wherez.b = @numWords

    union all

    selectwords

    from#x

    wherelen(words) - len(replace(words, ' ', '')) < @numWords

    go

    shortString

    ------------------------

    This is the

    This is the

    Second line

    Word

    (4 row(s) affected)

    /Kenneth

  • Ya beat me to it with the numbers table... well done...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Nice. I like this one because you are not using while loops :D.

  • Thanks everyone. You've been a great help here!

  • I agree that this is a nice function, but it only works with this correction:

    Instead of

    While @wordcount 0

    there should be

    WHILE @wordcount < @numberOfWords AND len(@stringToSplit) > 0

    like in the original code posted on the external page.

    CREATE FUNCTION fnGetNumberOfWords (

    @stringToSplit varchar(8000),

    @numberOfWords int

    )

    RETURNS varchar(8000) AS

    BEGIN

    DECLARE @currentword varchar(8000)

    DECLARE @returnstring varchar(8000)

    DECLARE @wordcount int

    SET @wordcount = 0

    SET @returnstring = ''

    SET @currentword = ''

    SET @stringToSplit = ltrim(rtrim(@stringToSplit))

    Declare @index int

    WHILE @wordcount < @numberOfWords AND len(@stringToSplit) > 0

    BEGIN

    Select @index = CHARINDEX(' ', @stringToSplit)

    if @index = 0

    BEGIN

    SELECT @currentword = ltrim(rtrim(@stringToSplit))

    SELECT @wordcount = @numberOfWords

    END

    else

    BEGIN

    IF (len(@stringToSplit) - @index > 0) BEGIN

    SELECT @currentword = ltrim(rtrim(LEFT(@stringToSplit, @index-1)))--the new shortened string

    SELECT @stringToSplit = RIGHT(@stringToSplit,LEN(@stringToSplit) - @index) -- the rest

    END

    END

    SELECT @returnstring = @returnstring + ' ' + @currentword

    SELECT @wordcount = @wordcount + 1

    END

    SET @returnstring = LTRIM(@returnstring)

    RETURN @returnstring

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply