Remove non alpha chars from a column

  • Hi there,

    How to remove non alpha chars from a column ?

    I googled the following code

    Create Function [dbo].[RemoveNonAlphaCharacters](@Temp VarChar(1000))

    Returns VarChar(1000)

    AS

    Begin

    Declare @KeepValues as varchar(50)

    Set @KeepValues = '%[^a-z]%'

    While PatIndex(@KeepValues, @Temp) > 0

    Set @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '')

    Return @Temp

    End

    is there any other possible way ?

    Thanks in advance

  • Yes, there are many other possible ways that use things like Tally Tables, cascading CTEs, etc, but... none of them are faster than what you have. This is one of those rare places where procedural code beats set-based code. I've been trying to beat it since 2007 using set based code and the best I've been able to do is tie it. Oh sure, there's an optimization using nested REPLACEs if you only have a small handful of characters to replace but I don't believe that anyone can beat this particular method using set-based code.

    The only function that I've found to be a little faster is an optimization of what you have. The optimization makes it so that PATINDEX is only used once in the loop. Like this...

    CREATE FUNCTION dbo.CleanString

    /*******************************************************************************

    Purpose:

    Given a string and a pattern of characters to remove, remove the patterned

    characters from the string.

    Usage:

    --===== Basic Syntax Example

    SELECT CleanedString = dbo.CleanString(@pSomeString,@pPattern)

    ;

    --===== Remove all but Alpha characters

    SELECT CleanedString = dbo.CleanString(st.SomeString,'%[^A-Za-z]%');

    FROM dbo.SomeTable st

    ;

    --===== Remove all but Numeric digits

    SELECT CleanedString = dbo.CleanString(st.SomeString,'%[^0-9]%');

    FROM dbo.SomeTable st

    ;

    Programmer Notes:

    1. @pPattern is case sensitive.

    2. The pattern set of characters must be for just one character.

    Revision History:

    Rev 00 - Circa 2007 - Author Unknown

    - Initial find on the web

    Rev 01 - 29 Mar 2007 - Jeff Moden

    - Optimize to remove one instance of PATINDEX from the loop.

    - Add code to use the pattern as a parameter.

    Rev 02 - 26 May 2013 - Jeff Moden

    - Add case sensitivity

    *******************************************************************************/

    (@pString VARCHAR(8000),@pPattern VARCHAR(100))

    RETURNS VARCHAR(8000) AS

    BEGIN

    DECLARE @Pos SMALLINT;

    SELECT @Pos = PATINDEX(@pPattern,@pString COLLATE Latin1_General_BIN);

    WHILE @Pos > 0

    SELECT @pString = STUFF(@pString,@Pos,1,''),

    @Pos = PATINDEX(@pPattern,@pString COLLATE Latin1_General_BIN);

    RETURN @pString;

    END

    ;

    GO

    Shifting gears a bit, if anyone would like to "race" their function against this one, I'll be happy to setup a test harness. Like I said, I've only been able to tie this one with set-based technology and I'd love to find a faster function because, as you know, I hate it when I have to resort to a WHILE loop and a Scalar Function. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Strange. I would have thought something like this would be faster:

    CREATE FUNCTION [dbo].[RemoveUnmatchedPatterns]

    (

    @STR VARCHAR(8000)

    ,@Pattern VARCHAR(100)

    )

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    WITH Tally (n) AS

    (

    SELECT TOP (LEN(@Str)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)

    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)

    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)

    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0)) d(n)

    ),

    SplitString AS

    (

    SELECT code=@Str, n, s=SUBSTRING(@Str, n, 1)

    FROM Tally

    WHERE PATINDEX(@Pattern, SUBSTRING(@Str COLLATE Latin1_General_BIN, n, 1)) <> 0

    )

    SELECT ReducedString=

    (

    SELECT s + ''

    FROM SplitString b

    ORDER BY n

    FOR XML PATH('')

    );

    GO

    WITH Tally (n) AS

    (

    SELECT TOP 10000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns a CROSS JOIN sys.all_columns b

    )

    SELECT MyStr=REPLICATE(SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('0123456789', 1+ABS(CHECKSUM(NEWID()))%10, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('0123456789', 1+ABS(CHECKSUM(NEWID()))%10, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('0123456789', 1+ABS(CHECKSUM(NEWID()))%10, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('0123456789', 1+ABS(CHECKSUM(NEWID()))%10, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('0123456789', 1+ABS(CHECKSUM(NEWID()))%10, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('0123456789', 1+ABS(CHECKSUM(NEWID()))%10, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('0123456789', 1+ABS(CHECKSUM(NEWID()))%10, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('0123456789', 1+ABS(CHECKSUM(NEWID()))%10, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('0123456789', 1+ABS(CHECKSUM(NEWID()))%10, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('0123456789', 1+ABS(CHECKSUM(NEWID()))%10, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('0123456789', 1+ABS(CHECKSUM(NEWID()))%10, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('0123456789', 1+ABS(CHECKSUM(NEWID()))%10, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    +REPLICATE(SUBSTRING('abcdefghijklmnopqrstuvwxyz', 1+ABS(CHECKSUM(NEWID()))%26, 1), 1+ABS(CHECKSUM(NEWID()))%26)

    INTO #Temp

    FROM Tally;

    DECLARE @StartDT DATETIME = GETDATE()

    ,@S1 VARCHAR(8000), @S2 VARCHAR(8000);

    SELECT @S1=MyStr, @S2=ReducedString

    FROM #Temp

    CROSS APPLY dbo.RemoveUnmatchedPatterns(MyStr, '%[^a-zA-Z]%');

    SELECT [ElapsedMS]=DATEDIFF(ms, @StartDT, GETDATE());

    SELECT @StartDT = GETDATE();

    SELECT @S1=MyStr, @S2=dbo.CleanString(MyStr, '%[a-zA-Z]%')

    FROM #Temp;

    SELECT [ElapsedMS]=DATEDIFF(ms, @StartDT, GETDATE());

    GO

    DROP TABLE #Temp;

    Perhaps my test harness is not appropriate because my iTVF appears faster.

    Note that I had to reverse the pattern passed to CleanString to return the same results. Or perhaps I misread the requirements and had to reverse the pattern I passed to my iTVF.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • I'm with Dwain on this one.

    CREATE FUNCTION dbo.CleanString_bh

    (@pString VARCHAR(8000),@pPattern VARCHAR(100))

    RETURNS TABLE

    AS

    RETURN

    (

    with tally as (select top(len(@pstring)) N from dbo.tally order by N)

    ,cte as (select N, substring(@pstring,N,1) as chr

    from tally

    where substring(@pstring,N,1) NOT LIKE @pPattern)

    select convert(varchar(8000), (select ''+chr

    from cte

    ORDER BY N

    FOR XML PATH(''))) AS CleanedString

    )

    GO

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • I'll test these but similar code has been quite a bit slower in previous testing.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Don't know what to tell you, Jeff. On my development box it's neck and neck for a short string but the margin of victory for the inline table valued function gets wider and wider as the input string gets longer. Please share your test results when you're done.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (3/4/2014)


    I'm with Dwain on this one.

    CREATE FUNCTION dbo.CleanString_bh

    (@pString VARCHAR(8000),@pPattern VARCHAR(100))

    RETURNS TABLE

    AS

    RETURN

    (

    with tally as (select top(len(@pstring)) N from dbo.tally order by N)

    ,cte as (select N, substring(@pstring,N,1) as chr

    from tally

    where substring(@pstring,N,1) NOT LIKE @pPattern)

    select convert(varchar(8000), (select ''+chr

    from cte

    ORDER BY N

    FOR XML PATH(''))) AS CleanedString

    )

    GO

    I can't wait to hear the results either. Thought you'd abandoned the thread there Jeff!

    BTW. I think the TOP(LEN(@pString)) is going to have a problem if the string is NULL, a deficiency shared by my version. It needs to be wrapped in an ISNULL. I kind of just threw mine together because I was curious about the postulate (that the loop was faster) and so didn't fully complete the example.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • You're right about the issue with NULL strings, Dwain. The following change should take care of it.

    with tally as (select top(ISNULL(len(@pstring),0)) N from dbo.tally order by N)

    If @pstring is null then len(@pstring) is null. If we select top 0 rows from the tally table, we will get a null value for cleaned string.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (3/4/2014)


    You're right about the issue with NULL strings, Dwain. The following change should take care of it.

    with tally as (select top(ISNULL(len(@pstring),0)) N from dbo.tally order by N)

    If @pstring is null then len(@pstring) is null. If we select top 0 rows from the tally table, we will get a null value for cleaned string.

    Yep. I actually ran across that issue the first time in the PatternSplitCM function I included in the fourth article in my signature links. It was the only thing I had to change in Chris Morris's (CM's) proposed solution.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • I just read that fourth article on splitting strings based on patterns: Good stuff.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (3/4/2014)


    I just read that fourth article on splitting strings based on patterns: Good stuff.

    Thanks! I like to think of PatternSplitCM as sort of a cousin of DelimitedSplit8K - not as fast but a little more flexible.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (3/4/2014)


    The Dixie Flatline (3/4/2014)


    I'm with Dwain on this one.

    CREATE FUNCTION dbo.CleanString_bh

    (@pString VARCHAR(8000),@pPattern VARCHAR(100))

    RETURNS TABLE

    AS

    RETURN

    (

    with tally as (select top(len(@pstring)) N from dbo.tally order by N)

    ,cte as (select N, substring(@pstring,N,1) as chr

    from tally

    where substring(@pstring,N,1) NOT LIKE @pPattern)

    select convert(varchar(8000), (select ''+chr

    from cte

    ORDER BY N

    FOR XML PATH(''))) AS CleanedString

    )

    GO

    I can't wait to hear the results either. Thought you'd abandoned the thread there Jeff!

    BTW. I think the TOP(LEN(@pString)) is going to have a problem if the string is NULL, a deficiency shared by my version. It needs to be wrapped in an ISNULL. I kind of just threw mine together because I was curious about the postulate (that the loop was faster) and so didn't fully complete the example.

    Ahhhhh... THERE you are! I couldn't find this thread for the longest time. I'm getting ready to give my Tally Table presentation at an SQL Saturday and I came to the final page where I talk about the Taly Table not being a panacea of all problems and the "Alpha-Numeric Only" problem is what I cited. That made me think back to this thread.

    You know me. I love the Tally Table but I stood adamant that no one could beat the Scalar While Loop function that I first posted on this thread and then, seemingly, you guys did it. I could NOT figure it out. Every time I wrote a Tally Table solution for this problem, the best I could do was almost tie the Scalar While Loop function.

    Heh... It's the little things that count and I kept missing it. Let's see if you pick up on the difference. Build the function quoted above again and then run the following code. Then look at the output and see the difference. That's why your Tally Table version is faster... it doesn't take that little problem into consideration whereas it just doesn't matter to the Scalar While Loop function.

    DECLARE @pString VARCHAR(8000);

    SELECT @pString = 'This-divided-by,that|&/multiplied\by(100)give*you,%;<percent>.';

    SELECT SomeString, ca.CleanedString

    FROM #TestTable

    CROSS APPLY dbo.CleanString_bh (SomeString,'[^0-9A-Za-z&%<>]') ca

    ;

    Fixing that little nuance with TYPE causes the Tally Table code to run almost twice as slow even with a binary collation.

    I'll post some test code tomorrow for you good folks to play with but, after all this time, I'm going to sleep a little better tonight (I know, I know... I really need to get a life :-P).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Here's the testing I did on this...

    First, create the two original functions with no modifications, Bob's CleanString_bh and my CleanString.

    CREATE FUNCTION dbo.CleanString_bh

    (@pString VARCHAR(8000),@pPattern VARCHAR(100))

    RETURNS TABLE

    AS

    RETURN

    (

    with tally as (select top(len(@pstring)) N from dbo.tally order by N)

    ,cte as (select N, substring(@pstring,N,1) as chr

    from tally

    where substring(@pstring,N,1) NOT LIKE @pPattern)

    select convert(varchar(8000), (select ''+chr

    from cte

    ORDER BY N

    FOR XML PATH(''))) AS CleanedString

    )

    GO

    CREATE FUNCTION dbo.CleanString

    /*******************************************************************************

    Purpose:

    Given a string and a pattern of characters to remove, remove the patterned

    characters from the string.

    Usage:

    --===== Basic Syntax Example

    SELECT CleanedString = dbo.CleanString(@pSomeString,@pPattern)

    ;

    --===== Remove all but Alpha characters

    SELECT CleanedString = dbo.CleanString(st.SomeString,'%[^A-Za-z]%');

    FROM dbo.SomeTable st

    ;

    --===== Remove all but Numeric digits

    SELECT CleanedString = dbo.CleanString(st.SomeString,'%[^0-9]%');

    FROM dbo.SomeTable st

    ;

    Programmer Notes:

    1. @pPattern is case sensitive.

    2. The pattern set of characters must be for just one character.

    Revision History:

    Rev 00 - Circa 2007 - Author Unknown

    - Initial find on the web

    Rev 01 - 29 Mar 2007 - Jeff Moden

    - Optimize to remove one instance of PATINDEX from the loop.

    - Add code to use the pattern as a parameter.

    Rev 02 - 26 May 2013 - Jeff Moden

    - Add case sensitivity

    *******************************************************************************/

    (@pString VARCHAR(8000),@pPattern VARCHAR(100))

    RETURNS VARCHAR(8000) AS

    BEGIN

    DECLARE @Pos SMALLINT;

    SELECT @Pos = PATINDEX(@pPattern,@pString COLLATE Latin1_General_BIN);

    WHILE @Pos > 0

    SELECT @pString = STUFF(@pString,@Pos,1,''),

    @Pos = PATINDEX(@pPattern,@pString COLLATE Latin1_General_BIN);

    RETURN @pString;

    END

    ;

    GO

    Here's one use-case... including the 3 special XML characters (<&>) in the data and the pattern.

    SET NOCOUNT ON;

    DECLARE @pString VARCHAR(8000);

    SELECT @pString = 'This-divided-by,that|&/multiplied\by(100)give*you,;<percent>''".';

    PRINT '========== CleanString_bh w/Special XML characters ============================'

    SELECT @pString --Show the original

    UNION ALL

    SELECT ca.CleanedString --Show the cleaned string below it

    FROM dbo.CleanString_bh (@pString,'[^0-9A-Za-z&<>''"]') ca

    ;

    PRINT '========== CleanString w/Special XML characters ==============================='

    SELECT @pString --Show the original

    UNION ALL

    SELECT dbo.CleanString (@pString,'%[^0-9A-Za-z&<>''"]%') --Show the cleaned string below it

    ;

    Here are the results. The Tally Table based function entitized the special characters and the While Loop version did not. (Note that I had to manually insert a space after the "&" in the results so that the forum software wouldn't de-entitize the entities that were produced)

    ========== CleanString_bh w/Special XML characters ============================

    ----------------------------------------------------------------------------------------------------

    This-divided-by,that|&/multiplied\by(100)give*you,;<percent>'".

    Thisdividedbythat& amp;multipliedby100giveyou& lt;percent& gt;'"

    ========== CleanString w/Special XML characters ===============================

    ----------------------------------------------------------------------------------------------------

    This-divided-by,that|&/multiplied\by(100)give*you,;<percent>'".

    Thisdividedbythat&multipliedby100giveyou<percent>'"

    There's also another problem with the original code. It's not actually case sensitive when played against servers that aren't case sensitive as a default collation. Let's run the code from above again but without the lower case letters in the pattern.

    SET NOCOUNT ON;

    DECLARE @pString VARCHAR(8000);

    SELECT @pString = 'This-divided-by,that|&/multiplied\by(100)give*you,;<percent>''".';

    PRINT '========== CleanString_bh w/Special XML characters ============================'

    SELECT @pString --Show the original

    UNION ALL

    SELECT ca.CleanedString --Show the cleaned string below it

    FROM dbo.CleanString_bh (@pString,'[^0-9A-Z&<>''"]') ca

    ;

    PRINT '========== CleanString w/Speccial XML characters ==============================='

    SELECT @pString --Show the original

    UNION ALL

    SELECT dbo.CleanString (@pString,'%[^0-9A-Z&<>''"]%') --Show the cleaned string below it

    ;

    Here are the results showing that the original function isn't case sensitive. Again, I had to manually add spaces after the "&" in the results so this site's software wouldn't convert the entitized characters.

    ========== CleanString_bh w/Speccial XML characters ============================

    ----------------------------------------------------------------------------------------------------

    This-divided-by,that|&/multiplied\by(100)give*you,;<percent>'".

    Thisdividedbythat& amp;multipliedby100giveyou& lt;percent& gt;'"

    ========== CleanString w/Speccial XML characters ===============================

    ----------------------------------------------------------------------------------------------------

    This-divided-by,that|&/multiplied\by(100)give*you,;<percent>'".

    T&100<>'"

    Let's make the same repairs that I was automatically including in all of my previous attempts to do this with a Tally Table...

    CREATE FUNCTION [dbo].[CleanString_bhDE]

    (@pString VARCHAR(8000),@pPattern VARCHAR(100))

    RETURNS TABLE

    AS

    RETURN

    (

    with tally as (select top(len(@pstring)) N from dbo.tally order by N)

    ,cte as (select N, substring(@pstring,N,1) as chr

    from tally

    where substring(@pstring ,N,1) NOT LIKE @pPattern COLLATE Latin1_General_BIN)

    select (select ''+chr

    from cte

    ORDER BY N

    FOR XML PATH(''),TYPE).value('(./text())[1]','varchar(8000)') AS CleanedString

    )

    ... and let's run the same test that we just did but using the modified CleanString_bhDE function (the "DE" stands for "DeEntitized")...

    SET NOCOUNT ON;

    DECLARE @pString VARCHAR(8000);

    SELECT @pString = 'This-divided-by,that|&/multiplied\by(100)give*you,;<percent>''".';

    PRINT '========== CleanString_bhDE w/Speccial XML characters ============================'

    SELECT @pString --Show the original

    UNION ALL

    SELECT ca.CleanedString --Show the cleaned string below it

    FROM dbo.CleanString_bhDE (@pString,'[^0-9A-Z&<>''"]') ca

    ;

    PRINT '========== CleanString w/Speccial XML characters ==============================='

    SELECT @pString --Show the original

    UNION ALL

    SELECT dbo.CleanString (@pString,'%[^0-9A-Z&<>''"]%') --Show the cleaned string below it

    ;

    Here are the results. Both functions work correctly for handling the 3 special XML characters and both are case sensitive.

    ========== CleanString_bhDE w/Speccial XML characters ============================

    ----------------------------------------------------------------------------------------------------

    This-divided-by,that|&/multiplied\by(100)give*you,;<percent>'".

    T&100<>'"

    ========== CleanString w/Speccial XML characters ===============================

    ----------------------------------------------------------------------------------------------------

    This-divided-by,that|&/multiplied\by(100)give*you,;<percent>'".

    T&100<>'"

    Now, let's do a performance check to see how the Tally Table function was affected.

    First, some "non-grooved" test data. "Non-grooved" means that the characters to be dropped aren't all in the same position in the string, which can greatly affect such results. Having "grooved" data (simply repeating the same data on every row in the test harness) is a common testing mistake that a lot of people make.

    --DROP TABLE #TestTable;

    GO

    SELECT TOP 1000000

    SomeString = REPLACE(REPLACE(REPLACE(

    SUBSTRING(CAST(NEWID() AS CHAR(36)),ABS(CHECKSUM(NEWID()))%36,36)

    + SUBSTRING(CAST(NEWID() AS CHAR(36)),ABS(CHECKSUM(NEWID()))%36,36)

    ,'5','<')

    ,'6','&')

    ,'7','>')

    INTO #TestTable

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    ;

    Here's the test harness. Note that we cannot use "SET STATISTICS" to measure performance here because there's a scalar function involved. "SET STATISTICS" greatly skews the performance results when scalar functions are present making the scalar function look a whole lot worse than it actually is. Please see the following article for more on that little testing nuance...

    http://www.sqlservercentral.com/articles/T-SQL/91724/

    DBCC FREEPROCCACHE;

    DBCC DROPCLEANBUFFERS;

    --===== Declare the variable that we'll use to take the display processing time out of the picture

    -- and a timer variable.

    DECLARE @Bitbucket VARCHAR(8000)

    ,@StartDT DATETIME

    ;

    PRINT '========== CleanString_bh w/Special XML characters ============================';

    SELECT @StartDT = GETDATE()

    ;

    SELECT @Bitbucket = ca.CleanedString

    FROM #TestTable

    CROSS APPLY dbo.CleanString_bh (SomeString,'[^0-9A-Za-z&%<>]') ca

    ;

    PRINT 'Duration(ms) = ' + CONVERT(VARCHAR(10),DATEDIFF(ms,@StartDT,GETDATE()))

    ;

    GO

    PRINT '--------------------------------------------------------------------------------';

    DBCC FREEPROCCACHE;

    DBCC DROPCLEANBUFFERS;

    --===== Declare the variable that we'll use to take the display processing time out of the picture

    -- and a timer variable.

    DECLARE @Bitbucket VARCHAR(8000)

    ,@StartDT DATETIME

    ;

    PRINT '========== CleanString_bhDE w/Special XML characters ============================';

    SELECT @StartDT = GETDATE()

    ;

    SELECT @Bitbucket = ca.CleanedString

    FROM #TestTable

    CROSS APPLY dbo.CleanString_bhDE (SomeString,'[^0-9A-Za-z&%<>]') ca

    ;

    PRINT 'Duration(ms) = ' + CONVERT(VARCHAR(10),DATEDIFF(ms,@StartDT,GETDATE()))

    ;

    GO

    PRINT '--------------------------------------------------------------------------------';

    DBCC FREEPROCCACHE;

    DBCC DROPCLEANBUFFERS;

    --===== Declare the variable that we'll use to take the display processing time out of the picture

    -- and a timer variable.

    DECLARE @Bitbucket VARCHAR(8000)

    ,@StartDT DATETIME

    ;

    PRINT '========== CleanString w/Special XML characters ===============================';

    SELECT @StartDT = GETDATE()

    ;

    SELECT @Bitbucket = dbo.CleanString (SomeString,'[^0-9A-Za-z&%<>]')

    FROM #TestTable

    ;

    PRINT 'Duration(ms) = ' + CONVERT(VARCHAR(10),DATEDIFF(ms,@StartDT,GETDATE()))

    ;

    Here are the results from that run...

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    ========== CleanString_bh w/Special XML characters ============================

    Duration(ms) = 34370

    --------------------------------------------------------------------------------

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    ========== CleanString_bhDE w/Special XML characters ============================

    Duration(ms) = 56693

    --------------------------------------------------------------------------------

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    ========== CleanString w/Special XML characters ===============================

    Duration(ms) = 3970

    Interestingly enough, even the original Tally Table function doesn't fair so well. That could be because someone use "grooved" data in their testing. Lets make such "grooved" data and run the same test again.

    Here's the data...

    /*******************************************************************************

    Create the "grooved" test data

    *******************************************************************************/

    DROP TABLE #TestTable;

    GO

    DECLARE @SomeString VARCHAR(72);

    SELECT @SomeString = CONVERT(VARCHAR(36),NEWID())+CONVERT(VARCHAR(36),NEWID())

    SELECT TOP 1000000

    SomeString = @SomeString

    INTO #TestTable

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    ;

    ... and here are the results from running the same test...

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    ========== CleanString_bh w/Special XML characters ============================

    Duration(ms) = 9643

    --------------------------------------------------------------------------------

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    ========== CleanString_bhDE w/Special XML characters ============================

    Duration(ms) = 3180

    --------------------------------------------------------------------------------

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    ========== CleanString w/Special XML characters ===============================

    Duration(ms) = 3996

    Now it looks like the modified Tally Table function beat the While Loop but, as we proved in the test previous to this one, it doesn't in real life.

    As an interesting side bar, this proves the value of statistics. SQL server didn't have to work so hard on this type of data because it recognized that it really only had to solve the problem once and then duplicate the results a million times instead of solving the a different problem a million times.

    There are other possible errors (such as the NULL thing mentioned in the other posts) that may creep in but that's not the current point that I'm trying to make. The point is that I believe that "grooved" data may have been used in the original testing on this thread and the original Tally Table function didn't work correctly (didn't de-entitize) to begin with. Correcting those two problems shows that this isn't actually a good place to use the Tally Table.

    Heh... you can bet there will be an article forthcoming on this. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    You are correct. Adding that TYPE/value to the FOR XML is going to kill the performance of the string cleaner I posted. Which of course, I mistakenly tried to generalize to all of the pattern cases (which you've shown doesn't always work).

    Still, by hard-coding the alpha-only result that the OP wanted, the iTVF version would probably be faster, just not general to any pattern.

    Yes indeed, you do need to get a life! 🙂 Can't believe you fished this one out of obscurity.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (9/29/2014)


    Jeff,

    You are correct. Adding that TYPE/value to the FOR XML is going to kill the performance of the string cleaner I posted. Which of course, I mistakenly tried to generalize to all of the pattern cases (which you've shown doesn't always work).

    Still, by hard-coding the alpha-only result that the OP wanted, the iTVF version would probably be faster, just not general to any pattern.

    That was the first thing I tried and it didn't/doesn't matter. Using a single copy of a string across multiple rows makes just about everything faster. Change it to real world and the iTVF version is left gasping for air.

    Yes indeed, you do need to get a life! 🙂 Can't believe you fished this one out of obscurity.

    It's amazing what a need will cause the lengths for one to suffer. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 27 total)

You must be logged in to reply to this topic. Login to reply