• Raymond van Laake (11/16/2012)


    Hi there,

    Sometimes customers add data with unwanted characters in the ascii range 0 to 31. How do I remove them? I'd prefer a regular expression type of solution because this would be fastest.

    Thanks,

    Raymond

    Regular expressions may be the fastest in another language but they're not necessarily the fastest in T-SQL because of the bit of overhead that a CLR to call RegEx would take. Please see the following article and the comprehensive discussion (click on "Join the Discussion") attached to that for proof.

    http://www.sqlservercentral.com/articles/RegEx/88586/

    Scott is correct, though. Nested REPLACEs will be faster than most anything else I can come up with especially when you create and use a high performance Inline Table Valued Function instead of using a Scalar UDF.

    Most people also forget about the control character at the other end of the basic ASCII table, the DELETE character.

    Here's the function that uses nested REPLACEs...

    CREATE FUNCTION dbo.DropControlCharacters

    (@pString VARCHAR(8000))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    SELECT CleanedString =

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    REPLACE(

    @pString

    ,CHAR(0),'') COLLATE Latin1_General_BIN

    ,CHAR(1),'') COLLATE Latin1_General_BIN

    ,CHAR(2),'') COLLATE Latin1_General_BIN

    ,CHAR(3),'') COLLATE Latin1_General_BIN

    ,CHAR(4),'') COLLATE Latin1_General_BIN

    ,CHAR(5),'') COLLATE Latin1_General_BIN

    ,CHAR(6),'') COLLATE Latin1_General_BIN

    ,CHAR(7),'') COLLATE Latin1_General_BIN

    ,CHAR(8),'') COLLATE Latin1_General_BIN

    ,CHAR(9),'') COLLATE Latin1_General_BIN

    ,CHAR(10),'') COLLATE Latin1_General_BIN

    ,CHAR(11),'') COLLATE Latin1_General_BIN

    ,CHAR(12),'') COLLATE Latin1_General_BIN

    ,CHAR(13),'') COLLATE Latin1_General_BIN

    ,CHAR(14),'') COLLATE Latin1_General_BIN

    ,CHAR(15),'') COLLATE Latin1_General_BIN

    ,CHAR(16),'') COLLATE Latin1_General_BIN

    ,CHAR(17),'') COLLATE Latin1_General_BIN

    ,CHAR(18),'') COLLATE Latin1_General_BIN

    ,CHAR(19),'') COLLATE Latin1_General_BIN

    ,CHAR(20),'') COLLATE Latin1_General_BIN

    ,CHAR(21),'') COLLATE Latin1_General_BIN

    ,CHAR(22),'') COLLATE Latin1_General_BIN

    ,CHAR(23),'') COLLATE Latin1_General_BIN

    ,CHAR(24),'') COLLATE Latin1_General_BIN

    ,CHAR(25),'') COLLATE Latin1_General_BIN

    ,CHAR(26),'') COLLATE Latin1_General_BIN

    ,CHAR(27),'') COLLATE Latin1_General_BIN

    ,CHAR(28),'') COLLATE Latin1_General_BIN

    ,CHAR(29),'') COLLATE Latin1_General_BIN

    ,CHAR(30),'') COLLATE Latin1_General_BIN

    ,CHAR(31),'') COLLATE Latin1_General_BIN

    ,CHAR(127),'') COLLATE Latin1_General_BIN

    ;

    GO

    If you want to test performance, here's some code to build a wad o' test data. Details, as usual, are in the comments in the code.

    --===== Conditionally drop the test table to make reruns in SSMS easier.

    IF OBJECT_ID('tempdb..#TestTable','U') IS NOT NULL

    DROP TABLE #TestTable

    ;

    --===== Create and populate the test table with test data.

    -- Most rows will have 2 embedded control characters although some may have just 1

    -- just due to random selection.

    SELECT TOP 100000

    RowNum = IDENTITY(INT,1,1),

    SomeString =

    STUFF(

    STUFF(

    '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz`~!@#$%^&*()_-+={[}]|\:;"''<,>.?/',

    ABS(CHECKSUM(NEWID()))%94+1, 1, CHAR(ABS(CHECKSUM(NEWID()))%30+1)),

    ABS(CHECKSUM(NEWID()))%94+1, 1, CHAR(ABS(CHECKSUM(NEWID()))%30+1))

    INTO #TestTable

    FROM master.sys.all_columns ac1

    CROSS JOIN master.sys.all_columns ac2

    ;

    GO

    Here's how to use the iTVF (iSF because it returns a scalar value) function against the test data.

    SELECT ca.CleanedString

    FROM #TestTable tt

    CROSS APPLY dbo.DropControlCharacters(tt.SomeString) ca

    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)