TSQL RegEx find and replace without CLR

  • Hi,

    I am having a table with a column which is having data like # followed by some number for Eg. #123.

    I want to strip out the # and the number from the column using TSQL without using clr regex replace.

    Any help here would be highly appreciated.

    Script for generating test data is given below.

    DECLARE @BadData TABLE(DataID INT, Data VARCHAR(50), ExpectedData VARCHAR(50))

    INSERT INTO @BadData(DataID,Data,ExpectedData)

    SELECT 1,'#123 data','data' UNION

    SELECT 2,'#456data','data' UNION

    SELECT 3,'data #789','data' UNION

    SELECT 4,'data#012','data' UNION

    SELECT 5,'data#345 some more data','data some more data' UNION

    SELECT 6,'data #678some more data','data some more data' UNION

    SELECT 7,'data #901 some more data','data some more data' UNION

    SELECT 8,'data#234some more data','datasome more data'

    SELECT * FROM @BadData

    The column 'Data' has bad values and the column 'ExpectedData' has the expected values.

    I will soon post the queries i have tried so far but they didn't work, hence posting it to the forum.

  • there was an article posted this week in relation to RegEx for DBA's, it can be found here[/url]

  • Thanks for the link, but i was looking for a non-clr solution.

  • I have tried the below query but it does not cover all the cases, especially when the pattern is like #[number][string]

    The output of query is attached to this post.

    SELECT

    DataID

    ,Data

    ,LEN(Data) AS [Length of Data]

    ,CHARINDEX('#',Data) AS [Position of #]

    ,CHARINDEX(' ',Data) AS [Position of space]

    ,CHARINDEX(' ',Data,CHARINDEX('#',Data)) AS [Position of space after #]

    ,CASE

    WHEN (CHARINDEX(' ',Data,CHARINDEX('#',Data)) > 0)

    THEN SUBSTRING(Data,CHARINDEX('#',Data),CHARINDEX(' ',Data,CHARINDEX('#',Data))-CHARINDEX('#',Data))

    ELSE SUBSTRING(Data,CHARINDEX('#',Data),LEN(Data)-CHARINDEX('#',Data))

    END AS [The word containing #]

    ,CASE

    WHEN (CHARINDEX(' ',Data,CHARINDEX('#',Data)) > 0)

    THEN LTRIM(RTRIM(REPLACE(Data,SUBSTRING(Data,CHARINDEX('#',Data),CHARINDEX(' ',Data,CHARINDEX('#',Data))-CHARINDEX('#',Data)),'')))

    ELSE LTRIM(RTRIM(REPLACE(Data,SUBSTRING(Data,CHARINDEX('#',Data),LEN(Data)-CHARINDEX('#',Data)),'')))

    END AS [Output]

    ,ExpectedData

    FROM

    @BadData

  • How about nested replace. You said all you want is to remove "#" and any numbers.

    SELECT rtrim(ltrim(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(REPLACE(Data, '#', ''), '1', ''), '2', ''), '3', ''), '4', ''), '5', ''), '6', ''), '7', ''), '8', ''), '9', ''), '0', '')))

    , ExpectedData

    FROM @BadData

    The nested replace approach is easy to read and understand, it also has the advantage that is super fast. Then just trim left and right. This at least works on your sample data.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Have to ask, is it always a # and three digits or is that just the way you configured your sample data?

  • Here is a shot at it:

    DECLARE @BadData TABLE(DataID INT, Data VARCHAR(50), ExpectedData VARCHAR(50));

    INSERT INTO @BadData(DataID,Data,ExpectedData)

    SELECT 1,'#123 data','data' UNION

    SELECT 2,'#456data','data' UNION

    SELECT 3,'data #789','data' UNION

    SELECT 4,'data#012','data' UNION

    SELECT 5,'data#345 some more data','data some more data' UNION

    SELECT 6,'data #678some more data','data some more data' UNION

    SELECT 7,'data #901 some more data','data some more data' UNION

    SELECT 8,'data#234some more data','datasome more data' UNION

    SELECT 9,'data#2some more data','datasome more data';

    --SELECT * FROM @BadData;

    SELECT

    SUBSTRING(Data,1,PATINDEX('%#[0-9]%', Data) - 1) +

    REVERSE(SUBSTRING(REVERSE(SUBSTRING(Data,PATINDEX('%#[0-9]%', Data) + 1,DATALENGTH(Data) - PATINDEX('%#[0-9]%', Data))),1,PATINDEX('%[0-9]%',REVERSE(SUBSTRING(Data,PATINDEX('%#[0-9]%', Data) + 1,DATALENGTH(Data) - PATINDEX('%#[0-9]%', Data)))) - 1)),

    Data,ExpectedData

    FROM

    @BadData;

  • Lynn, why not the replace I posted above? It is super fast because it doesn't have to do all sorts of string manipulation.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Shaun-884394 (5/11/2012)


    Thanks for the link, but i was looking for a non-clr solution.

    Any particular reason why not? Let me guess...political reasons in your organization?

    RegEx is one of the primary use-cases for using SQLCLR in the database and there are tons of good examples of UDFs and Procedures already written. RegEx can be used in assemblies marked as SAFE as well meaning security is far less of a concern when using SQLCLR this way.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Sean Lange (5/11/2012)


    Lynn, why not the replace I posted above? It is super fast because it doesn't have to do all sorts of string manipulation.

    Didn't see it. Looks good actually as long as there aren't other numbers in the string that need to be kept. Spec did say # followed by numbers.

  • Lynn Pettis (5/11/2012)


    Sean Lange (5/11/2012)


    Lynn, why not the replace I posted above? It is super fast because it doesn't have to do all sorts of string manipulation.

    Didn't see it. Looks good actually as long as there aren't other numbers in the string that need to be kept. Spec did say # followed by numbers.

    Good point about the spec, guess it worked on the sample data so I figured it was good enough. 😉

    If there are other numbers you need to keep and CLR is off the table then go with Lynn's approach. Of course CLR is still the easiest way to handle this.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (5/11/2012)


    Lynn Pettis (5/11/2012)


    Sean Lange (5/11/2012)


    Lynn, why not the replace I posted above? It is super fast because it doesn't have to do all sorts of string manipulation.

    Didn't see it. Looks good actually as long as there aren't other numbers in the string that need to be kept. Spec did say # followed by numbers.

    Good point about the spec, guess it worked on the sample data so I figured it was good enough. 😉

    If there are other numbers you need to keep and CLR is off the table then go with Lynn's approach. Of course CLR is still the easiest way to handle this.

    Actually, mine will fail if there are extra numbers. I just thought of that.

  • This should work:

    DECLARE @BadData TABLE(DataID INT, Data VARCHAR(50), ExpectedData VARCHAR(50));

    INSERT INTO @BadData(DataID,Data,ExpectedData)

    SELECT 1,'#123 data','data' UNION

    SELECT 2,'#456data','data' UNION

    SELECT 3,'data #789','data' UNION

    SELECT 4,'data#012','data' UNION

    SELECT 5,'data#345 some more data','data some more data' UNION

    SELECT 6,'data #678some more data','data some more data' UNION

    SELECT 7,'data #901 some more data','data some more data' UNION

    SELECT 8,'data#234some more data','datasome more data' UNION

    SELECT 9,'data#2some more data','datasome more data' UNION

    SELECT 10, 'data#2 4 some more data','data 4 some more data';

    --SELECT * FROM @BadData;

    SELECT

    LEFT(Data, PATINDEX('%#[0-9]%', Data) - 1)

    + RIGHT(Data,CASE WHEN PATINDEX('%[^0-9]%',SUBSTRING(Data, PATINDEX('%#[0-9]%', Data) + 1, DATALENGTH(Data))) = 0 THEN 0 ELSE DATALENGTH(Data) - (PATINDEX('%#[0-9]%', Data) + PATINDEX('%[^0-9]%',SUBSTRING(Data, PATINDEX('%#[0-9]%', Data) + 1, DATALENGTH(Data)))) + 1 END)

    ,Data,ExpectedData

    FROM

    @BadData;

  • Hi Lynn,

    What if the source data contained the same pattern that is repeated many times? Example:

    SELECT 1,'001N data 002N003N004N','data' UNION

    SELECT 2,'005N005Ndata007N','data' UNION

    SELECT 3,'data 008N009N010N','data ' UNION

    SELECT 4,'data011Nsome012Nmore013Ndata014N','datasomemoredata' UNION

    SELECT 5,'data015N some 016N017Nmore data','data some more data' UNION

    SELECT 6,'data 050Nsome more 051N052Ndata','data some more data' UNION

    SELECT 7,'data 090N some more data091N092N','data some more data' UNION

    SELECT 8,'data093Nsome more 094N095Ndata','datasome more data' UNION

    SELECT 9,'data096N some more097N098N data','data some more data' UNION

    SELECT 10, 'data099N 4 some100N 5more 101Ndata','data 4 some 5more data';

    Note that in my example, the string I want to remove is always 3 characters long, datatype Text, in the format of [0-9][0-9][0-9]N, ie. 001N, 002N, ... 999N, which repeats up to 999 times in each string, and no numbers repeat, they just increment upwards by 1. In addition, eliminating all but 1 space in between all words would also be helpful too. Thanks!

    EDIT: I've solved the problem myself by creating a function as follows:

    CREATE FUNCTION [dbo].[fnCleanN]

    (

    @BadString nvarchar(MAX)

    )

    RETURNS nvarchar(MAX)

    AS

    BEGIN

    DECLARE @npos INTEGER

    SELECT @npos = PATINDEX('%[0-9][0-9][0-9]N%', @BadString)

    WHILE @npos > 0

    BEGIN

    SELECT @BadString = STUFF(@BadString, @npos, 4, '')

    SELECT @npos = PATINDEX('%[0-9][0-9][0-9]N%', @BadString)

    END

    RETURN @BadString

    END

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply