Stripping out all non-numerical characters

  • SELECT accountid,New_Column = REPLACE(telephone1,SUBSTRING(telephone1,PATINDEX('[^0-9]',telephone1),1),'')

    INTO #YourNewResults

    FROM #TelephoneTable

    I have the script above which I thought would remove all non-numerical characters from the field but this not the case?

    I have some fields with numbers+characters and some with characters only regardless of the combination I want all characters removed.

    I have seen there are a lot of functions is this the best way? If so where would the function be applied in the script?

  • Hi,

    The best way in my opinion would be to use Regex and some CLR functions to replace the characters. Read this article:

    http://www.simple-talk.com/sql/t-sql-programming/clr-assembly-regex-functions-for-sql-server-by-example/[/url]

    Once you've got the functions installed then it's as simple as:

    SELECT dbo.RegExReplace('ABC123ABC','[A-Z]','')



    MCSE: Data Platform
    MCSE: Business Intelligence
    Follow me on Twitter: @WazzTheBadger
    LinkedIn Profile: Simon Osborne

  • @Sachin,

    I agree that the CLR Regex method should be the fastest although I've not seen any performance testing on it.

    The questions I have for you is...

    1. Can you use CLR or do you need a 100% T-SQL solution?

    2. Is the maximum length of the data <= 8K bytes?

    I no longer believe this. In the last 2 weeks, I've seen CLR Regex get its doors blown off by more than one example of some good, ol' fashioned T-SQL. CLR is still the fastest for splitting a string and it might be the fastest for cleaning a string, but CLR Regex probably won't be. Regex itself seems to have been over promised and under delivered. It appears that you pay in performance what you thought you gained in flexibility.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • 1. I would prefer t-SQL as CLR is way beyond my understanding

    2. At the moment I'm not concerned about performance

  • Here's one way to do it in T-SQL:

    DECLARE @String VARCHAR(8000) = 'ABC12D34E56';

    WITH Seeds(Seed)

    AS (SELECT *

    FROM ( VALUES ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1) ) AS V (C)),

    Numbers(Number)

    AS (SELECT TOP (8000)

    ROW_NUMBER() OVER (ORDER BY S1.Seed)

    FROM Seeds AS S1

    CROSS JOIN Seeds AS S2

    CROSS JOIN Seeds AS S3

    CROSS JOIN Seeds AS S4)

    SELECT (

    SELECT SUBSTRING(@String, Number, 1)

    FROM Numbers

    WHERE Number <= LEN(@String)

    AND SUBSTRING(@String, Number, 1) LIKE '[0-9]'

    ORDER BY Number

    FOR XML PATH(''),

    TYPE).value('.[1]', 'VARCHAR(8000)');

    Alternatively, if you need to apply this to a table instead of a variable:

    IF OBJECT_ID(N'tempdb..#T') IS NOT NULL

    DROP TABLE #T;

    CREATE TABLE #T

    (ID INT IDENTITY

    PRIMARY KEY,

    Col1 VARCHAR(8000));

    INSERT INTO #T

    (Col1)

    VALUES ('123A'),

    ('B1C2D3');

    WITH Seeds(Seed)

    AS (SELECT *

    FROM ( VALUES ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1) ) AS V (C)),

    Numbers(Number)

    AS (SELECT TOP (8000)

    ROW_NUMBER() OVER (ORDER BY S1.Seed)

    FROM Seeds AS S1

    CROSS JOIN Seeds AS S2

    CROSS JOIN Seeds AS S3

    CROSS JOIN Seeds AS S4)

    SELECT *

    FROM #T AS T

    CROSS APPLY (SELECT (

    SELECT SUBSTRING(Col1, Number, 1)

    FROM Numbers

    WHERE Number <= LEN(Col1)

    AND SUBSTRING(Col1, Number, 1) LIKE '[0-9]'

    ORDER BY Number

    FOR XML PATH(''),

    TYPE).value('.[1]', 'VARCHAR(8000)') AS Stripped) AS Parser;

    In either case, it uses a "Runtime Numbers Table" to parse the string into individual characters, then strips out anything other than digits 0-9, then uses For XML to put it all back together. (FOR XML PATH with a zero-length root indicated by (''), and no column name for the query, will concatenate strings together nicely. It's a documented trick that comes in very handy.)

    Does that help?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • This can also be done using the PatternSplitCM function described in the fourth article in my signature line (Splitting Strings Based on Patterns).

    IF OBJECT_ID(N'tempdb..#T') IS NOT NULL

    DROP TABLE #T;

    CREATE TABLE #T

    (ID INT IDENTITY

    PRIMARY KEY,

    Col1 VARCHAR(8000));

    INSERT INTO #T

    (Col1)

    VALUES ('123A'),

    ('B1C2D3');

    ;WITH CTE AS (

    SELECT *

    FROM #T a

    CROSS APPLY PatternSplitCM(a.Col1, '[0-9]')

    WHERE [Matched] = 1)

    SELECT ID, Col1=MAX(Col1), Col2=(

    SELECT '' + Item

    FROM CTE b

    WHERE a.ID = b.ID

    ORDER BY ItemNumber

    FOR XML PATH(''))

    FROM CTE a

    GROUP BY ID

    IF OBJECT_ID(N'tempdb..#T') IS NOT NULL

    DROP TABLE #T;

    In fact the inspiration for that article (described therein) was a forum-posted question that was quite similar to this one.

    One caveat though. Since you're working in SQL 2005, you'll need to replace the numbers table with a Ben-Gan style Tally table something like this one (from the PatternSplitQU FUNCTION also in that article):

    WITH Nbrs_3(n) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1)

    ,Nbrs_2 (n) AS (SELECT 1 FROM Nbrs_3 n1 CROSS JOIN Nbrs_3 n2)

    ,Nbrs_1 (n) AS (SELECT 1 FROM Nbrs_2 n1 CROSS JOIN Nbrs_2 n2)

    ,Nbrs_0 (n) AS (SELECT 1 FROM Nbrs_1 n1 CROSS JOIN Nbrs_1 n2)

    ,Tally (n) AS (SELECT ROW_NUMBER() OVER (ORDER BY n) As n FROM Nbrs_0)


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • GSquared (12/18/2012)


    Here's one way to do it in T-SQL:

    DECLARE @String VARCHAR(8000) = 'ABC12D34E56';

    WITH Seeds(Seed)

    AS (SELECT *

    FROM ( VALUES ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1) ) AS V (C)),

    Numbers(Number)

    AS (SELECT TOP (8000)

    ROW_NUMBER() OVER (ORDER BY S1.Seed)

    FROM Seeds AS S1

    CROSS JOIN Seeds AS S2

    CROSS JOIN Seeds AS S3

    CROSS JOIN Seeds AS S4)

    SELECT (

    SELECT SUBSTRING(@String, Number, 1)

    FROM Numbers

    WHERE Number <= LEN(@String)

    AND SUBSTRING(@String, Number, 1) LIKE '[0-9]'

    ORDER BY Number

    FOR XML PATH(''),

    TYPE).value('.[1]', 'VARCHAR(8000)');

    Alternatively, if you need to apply this to a table instead of a variable:

    IF OBJECT_ID(N'tempdb..#T') IS NOT NULL

    DROP TABLE #T;

    CREATE TABLE #T

    (ID INT IDENTITY

    PRIMARY KEY,

    Col1 VARCHAR(8000));

    INSERT INTO #T

    (Col1)

    VALUES ('123A'),

    ('B1C2D3');

    WITH Seeds(Seed)

    AS (SELECT *

    FROM ( VALUES ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1) ) AS V (C)),

    Numbers(Number)

    AS (SELECT TOP (8000)

    ROW_NUMBER() OVER (ORDER BY S1.Seed)

    FROM Seeds AS S1

    CROSS JOIN Seeds AS S2

    CROSS JOIN Seeds AS S3

    CROSS JOIN Seeds AS S4)

    SELECT *

    FROM #T AS T

    CROSS APPLY (SELECT (

    SELECT SUBSTRING(Col1, Number, 1)

    FROM Numbers

    WHERE Number <= LEN(Col1)

    AND SUBSTRING(Col1, Number, 1) LIKE '[0-9]'

    ORDER BY Number

    FOR XML PATH(''),

    TYPE).value('.[1]', 'VARCHAR(8000)') AS Stripped) AS Parser;

    In either case, it uses a "Runtime Numbers Table" to parse the string into individual characters, then strips out anything other than digits 0-9, then uses For XML to put it all back together. (FOR XML PATH with a zero-length root indicated by (''), and no column name for the query, will concatenate strings together nicely. It's a documented trick that comes in very handy.)

    Does that help?

    Gus - I think the OP may have a problem with your Tally table (Seeds CTE) for the same reason I noted above.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Jeff Moden (12/18/2012)


    @Sachin,

    I agree that the CLR Regex method should be the fastest although I've not seen any performance testing on it.

    The questions I have for you is...

    1. Can you use CLR or do you need a 100% T-SQL solution?

    2. Is the maximum length of the data <= 8K bytes?

    Actually, I have to take that back. CLR would be the fastest method but only if you DON"T use Regex. I've recently seen many performance tests and either some dedicated CLR or some dedicated T-SQL will usually smoke RegEx according to the tests I've seen.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sachin 80451 (12/18/2012)


    1. I would prefer t-SQL as CLR is way beyond my understanding

    2. At the moment I'm not concerned about performance

    I whole heartedly agree with #1 above. I never agree with #2 above because they always come back about the performance problem that a given solution is having. The only thing you should be more concerned about than performance is accuracy and the two should be virtually tied as being the most important. If you think not, spend an hour or two looking at the thousands of performance issues people are asking about on this forum alone.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (12/25/2012)


    Jeff Moden (12/18/2012)


    @Sachin,

    I agree that the CLR Regex method should be the fastest although I've not seen any performance testing on it.

    The questions I have for you is...

    1. Can you use CLR or do you need a 100% T-SQL solution?

    2. Is the maximum length of the data <= 8K bytes?

    Actually, I have to take that back. CLR would be the fastest method but only if you DON"T use Regex. I've recently seen many performance tests and either some dedicated CLR or some dedicated T-SQL will usually smoke RegEx according to the tests I've seen.

    I've got your back Jeff.

    Here's one example: http://www.sqlservercentral.com/Forums/Topic1390297-3122-5.aspx


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • I have no experience in CRL, so would like T-SQL for now please?

    the answer to the second question is yes.

  • @GSquared:

    Sorry to be thick-headed (read: slow-witted) but I have to ask: why did you SELECT TOP (8000) ROW_NUMBER... when it [seems] to work with TOP (10). Since we're using only the numbers 0-9, I would have figured using a generic "number generator" like SELECTING ROW_NUMBER() ... from a "known" source that would have at least 10 records (like sys.objects), you could skinny down the code and maybe processing as well:

    DECLARE @String VARCHAR(8000) = 'ABC12D34E56';

    SELECT (

    SELECT SUBSTRING(@String, n.[Number], 1)

    FROM (

    SELECT TOP (10) ROW_NUMBER()

    OVER (ORDER BY [object_id]) AS [Number]

    FROM sys.objects) n

    WHERE n.[Number] <= LEN(@String)

    AND SUBSTRING(@String, n.[Number], 1) LIKE '[0-9]'

    ORDER BY n.[Number]

    FOR XML PATH(''), TYPE).[value]('.[1]', 'VARCHAR(8000)');

    Factoring in inflation, health care, taxes and cost of living, my two cents isn't worth much so I apologize if I missed something. I'll blame the lack of coffee for any oversight on my part.

    Cheers,

    Ken

  • Your Name Here (1/4/2013)


    @GSquared:

    Sorry to be thick-headed (read: slow-witted) but I have to ask: why did you SELECT TOP (8000) ROW_NUMBER... when it [seems] to work with TOP (10). Since we're using only the numbers 0-9, I would have figured using a generic "number generator" like SELECTING ROW_NUMBER() ... from a "known" source that would have at least 10 records (like sys.objects), you could skinny down the code and maybe processing as well:

    DECLARE @String VARCHAR(8000) = 'ABC12D34E56';

    SELECT (

    SELECT SUBSTRING(@String, n.[Number], 1)

    FROM (

    SELECT TOP (10) ROW_NUMBER()

    OVER (ORDER BY [object_id]) AS [Number]

    FROM sys.objects) n

    WHERE n.[Number] <= LEN(@String)

    AND SUBSTRING(@String, n.[Number], 1) LIKE '[0-9]'

    ORDER BY n.[Number]

    FOR XML PATH(''), TYPE).[value]('.[1]', 'VARCHAR(8000)');

    Factoring in inflation, health care, taxes and cost of living, my two cents isn't worth much so I apologize if I missed something. I'll blame the lack of coffee for any oversight on my part.

    Cheers,

    Ken

    Try it on a much larger string and see why the 8000 becomes important.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • :exclamationmark:

    I get it now. Thanks!

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply