Tame Those Strings - Finding Carriage Returns

  • Comments posted here are about the content posted at http://www.sqlservercentral.com/columnists/sjones/2843.asp

  • Nice article.

     

    So what's the next tame those strings article in this serie?

  • Being an old UNIX hack, I would have preferred to see the CHR(13) and CHR(10) in seperate REPLACEs. That would allow it to work with data from other sources than Windows.

    Also, it would have been possible to use the CR themselves as your "markers", though this would rely on the format being consistent (allways a bad idea IMHO).

    All in all - string manipulation is a pain. But what it comes down to is seeing the pattern, and writing your code to be as flexible as possible in dealing with it.

    My application UI has a small function in JavaScript which goed through any text field that is being entered and removes a whole bunch of "illegal" character strings (Copy and paste is the devil's work and can introduce so many fun strings, e.g. Tab and other unprintable characters). About once every 6 months or so I discover a new character that we have to add to the forbidden list. I suspect Smilies will be next!

    Good to see that someone is still paying attention to the basics!

  • Toby,

    When you are ready to post the self learning "stripper" code, I am ready to take a look.

    Pun intended.

    Andy

  • I love these puzzles. If the objective is to get at those pesky numbers, how about this? The code is slightly more complex so as to cope with those records that have a blank string or no number! (this sort of routine can do all sorts of string manipulation). I added the test code so one can try out different approaches.

    DECLARE @rawData TABLE (TheID INT IDENTITY(1,1),String VARCHAR(255))
    INSERT INTO @rawData (string)
    SELECT 'We need to be sure that this purchase order is processed:
               34030
            Please check on the status'
    INSERT INTO @rawData (string)
    SELECT 'We need to be sure that this purchase order is processed:
                34031
            Please check on the status'
    INSERT INTO @rawData (string)
    SELECT 'We need to be sure that this purchase order is processed:
             34032
            Please check on the status'
    INSERT INTO @rawData (string)
    SELECT 'We need to be sure that this purchase order is processed:
               35932
            Please check on the status'
    INSERT INTO @rawData (string)
    SELECT 'We need to be sure that this purchase order is processed:
               35939
            Please check on the status'
    INSERT INTO @rawData (string)
    SELECT 'We need to be sure that this purchase order is processed:
               
            Please check on the status'
    /*
    to get the first number out of a string (0 if it cannot find one). Normally
    one would put it in a function for clarity. 
    */
    SELECT LEFT(start+' ',PATINDEX('%[^0-9]%',start+' ')) FROM 
           (SELECT [Start]=SUBSTRING(string+'0', PATINDEX('%[0-9]%',
                                           string+'0'),
                                           LEN(string+'0'))
           FROM @RawData
           )f

    Best wishes,
    Phil Factor

  • Here's a function which will extract the 1st, 2nd, 3rd or whatever number string, optionally with a length greater than whatever you specify (say you only want strings which are 5 or longer and want to ignore street addresses):

     

    SET

    QUOTED_IDENTIFIER ON

    GO

    SET

    ANSI_NULLS ON

    GO

    /****** Object: User Defined Function dbo.fnExtractNumberStrings Script Date: 03/05/2006 13:52:49 ******/

    if

    exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[fnExtractNumberStrings]') and xtype in (N'FN', N'IF', N'TF'))

    drop

    function [dbo].[fnExtractNumberStrings]

    GO

     

    CREATE

    function fnExtractNumberStrings

    /**********************************************************************

    Purpose : Extracts the required numeric string from a given input string

    Author : Caroline Barnard (Airborne Consulting)

    Created : May 2006

    Version : 1.00

    Changed :

    ***********************************************************************/

    (

    @inputstring

    varchar(255),

    @strno

    int,

    @strsize

    int

    )

    returns

    varchar(30)

    as

    /* Used to fix non-normalised data. @inputstring is the string to be passed to the function

    ** @strno is the instance of numeric data to be returned, so the caller can specify the 1st, 2nd

    ** 3rd numeric string etc, and would generally call it repeatedly.

    ** @strsize is the minimum size of the string to be returned and defaults to 3. If, for example, the

    ** caller requests the 3rd string with this parameter set to 4, and passes the following data in:

    ** AB12345CD 12DE7891%$55678 ,

    ** the function will return 55678 as the 3rd numeric string because the 2nd string (12) has fewer than 4 digits and will

    ** be ignored

    */

    begin

    declare

    @outnumber varchar(30),

    @char1

    char(1),

    @dlen

    int,

    @dlenremains

    int,

    @ctr

    int,

    @substart

    int,

    @sublen

    int,

    @subend

    int,

    @wipstring

    varchar(255)

    if isnull(@inputstring,' ') = ' ' --input parameter not supplied with valid value

    begin

    set @outnumber = 'ERROR'

    return @outnumber

    end

    select

    @wipstring = ltrim(rtrim(@inputstring))

    select

    @dlen = len(@wipstring)

    select

    @ctr = 1

    --

    --search for numerical string

    --

    while

    @ctr <= @strno

    begin

    select @substart = PATINDEX ( '%[0-9]%' , @wipstring ) --look for next numeric string

    if @substart > 0 --numeric data found

    begin

    --

    --Calculate length of string from 1st numerical position to end

    --remove all leading non-numeric characters

    --

    select @dlenremains = @dlen + 1 - @substart

    select @wipstring = right(@wipstring,@dlenremains )

    select @dlen = @dlenremains

    --

    --Now find end of number string

    --

    select @subend = PATINDEX ( '%[^0-9]%' , @wipstring ) --next non-numeric character

    if @subend = 0 --no non-numeric data left

    select @subend = @dlen + 1 --fake the "endpoint"

    if @subend >= @strsize + 1 --the minimum string size is met so we output the string

    begin

    select @outnumber = substring(@wipstring, 1, @subend - 1) --populate the output number

    end

    else

    begin

    select @outnumber = 'NOT FOUND' --no numeric string found on this iteration

    select @ctr = @ctr - 1 --subtract one from loop counter to compensate for increment

    end

    --

    --strip off the number just found

    --

    select @dlenremains = @dlenremains - @subend + 1

    select @wipstring = right(@wipstring,@dlenremains )

    select @dlen = @dlenremains

    end

    else

    begin

    select @outnumber = 'NOT FOUND' --no numeric string found

    select @ctr = @strno + 1 --exit the loop

    end

    select @ctr = @ctr + 1

    end

    return

    @outnumber

    end

     

    GO

    SET

    QUOTED_IDENTIFIER OFF

    GO

    SET

    ANSI_NULLS ON

    GO

     

     

  • Good!

    This is exactly what I was going to suggest. To use the function. Maybe I would input a string into the function, and check for each character using IsNumeric() function.

    Regards,Yelena Varsha

  • I'll admit that I don't have all of the ASCII codes memorized. However, we've reached a point where rote memorization of such things is silly. If you need it look it up:

    http://www.asciitable.com

    Though, there are a few which get memorized because you use them too often, 13 and 9 decimal are good examples.

    That said, good article, wrangling in strings can be an annoying task.

  • I agree, whenever I use CHR(13) and CHR(10) I always treat them as separate characters.  Even in the windows environment you can never be sure that a return will be the combination of these two, it just depends on the application which created the file.

  • I have come up with a more general function with some more features such as case sensitive and optional replacement character.

    CREATE FUNCTION dbo.fnFilterString
    (
    @Source VARCHAR(8000),
    @Filter VARCHAR(8000),
    @Replacement VARCHAR(8000)
    )
    RETURNS VARCHAR(8000)
    AS
    BEGIN
    DECLARE@Index SMALLINT
    SET@Index = DATALENGTH(@Source)
    WHILE @Index > 0
    IF SUBSTRING(@Source, @Index, 1) COLLATE Latin1_General_BIN LIKE @Filter
    SET@Index = @Index - 1
    ELSE
    SELECT@Source = STUFF(@Source, @Index, 1, @Replacement),
    @Index =  @Index - 1
    RETURN @Source
    END

    Use this test code!

    declare @s varchar(500)
    select @s = 'Peso' + char(10) + 'Helsingborg' + CHAR(13) + 'Sweden' + CHAR(13) + CHAR(10) + 'SQL'
    print @s
    print ''
    print dbo.fnFilterString(@s, '[^' + CHAR(13) + CHAR(10) + ']', '_')
    print dbo.fnFilterString(@s, '[e]', '*')
    print dbo.fnFilterString(@s, '[^n]', '')
     

    N 56°04'39.16"
    E 12°55'05.25"

  • What about simply using a Regex function?  As discussed here on MSDN,

    http://msdn.microsoft.com/msdnmag/issues/07/02/SQLRegex/default.aspx

    you can create a CLR function that just passes along the parameters to the .NET regex methods.  Save yourself the trouble of reimplementing complex string matching and replacement algorithms, when the functionality can be added so easily to any SQL Server 2005 instance.

  • I think the best would be to change the table design and normalize this thing a little to put order numbers in one table and standard text in another message table?

    Regards,Yelena Varsha

  • Steve, I'm confused on how you loaded the data into SQL Server in the first place using DTS.  How did it know that the three lines were supposed to be part of the same record?

     

    Aunt Kathi Data Platform MVP
    Author of Expert T-SQL Window Functions
    Simple-Talk Editor

  • In looking at the problem, I thought to extend it to extracting all of the numbers in a string, not just the first. Then you could choose to deal with the first or the nth.

    This function only deals with int data, thus calling it for a string that held "Invoice total is $123.45' will return 123 and 45. There are other number patterns such as IP that would also need an alternate pattern handling and it will only pull the first number in a SSN type pattern as it tries to allow negative numbers but doesn't know what 333-22-4444 means... ( 333, -22, -4444 or 333, 22, 4444?)

    String parsing is fun, but a lack of domain makes for unpredictable results...

    Cheers!


    CREATE FUNCTION dbo.listInt( @value varchar(8000) )

    RETURNS @out table (

    ordinal int identity(1,1) primary key,

    number int

    )

    AS

    BEGIN

    INSERT INTO @out ( number )

    SELECT convert( int, substring(

    @value,

    n,

    patindex('%[0-9][^0-9]%', substring( @value+' ', n, 8000 ) ) -- find this number's end from its start (and thus length)

    ))

    FROM

    (

    SELECT (1 +n1.n +N10.n +n100.n +n1000.n) AS n

    FROM

    (select 0 as n union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) as n1

    cross join (select 0 as n union select 10 union select 20 union select 30 union select 40 union select 50 union select 60 union select 70 union select 80 union select 90) as n10

    cross join (select 0 as n union select 100 union select 200 union select 300 union select 400 union select 500 union select 600 union select 700 union select 800 union select 900) as n100

    cross join (select 0 as n union select 1000 union select 2000 union select 3000 union select 4000 union select 5000 union select 6000 union select 7000 ) as n1000

    WHERE

    (1 +n1.n +N10.n +n100.n +n1000.n) BETWEEN 1 AND len(@value) -- only indexes for characters that exist

    ) AS numbers

    WHERE

    substring(' '+@value, n, 2) LIKE '[^-0-9][-0-9]' -- each number's start

    ORDER BY

    n ASC

    RETURN

    END

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply