Non-Printable Character Detection

  • The following was found as a solution to detecting non-printable characters in a string. I can find no explanation of how the COLLATE function is used.

    Can someone explain how the COLLATE function affects the search string to find the non-printable characters?

    patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, titlel) as [Position]

    Thanks,

  • robertharrison (3/25/2015)


    The following was found as a solution to detecting non-printable characters in a string. I can find no explanation of how the COLLATE function is used.

    Can someone explain how the COLLATE function affects the search string to find the non-printable characters?

    patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, titlel) as [Position]

    Thanks,

    The pattern you have above has nothing to do with non-printable characters. The code above finds the first character that is NOT a space, exclamation point, dash, or tilde.

    {Edit} Please see Eirikur's note below.

    The purpose of the COLLATE, in this particular case, is an attempt to improve the performance of the code by forcing it to do a straight forward binary comparison of characters due to the Latin1_General_BIN collation.

    Still, the code above has nothing to do with finding non-printable characters.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/11/2015)


    robertharrison (3/25/2015)


    The following was found as a solution to detecting non-printable characters in a string. I can find no explanation of how the COLLATE function is used.

    Can someone explain how the COLLATE function affects the search string to find the non-printable characters?

    patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, titlel) as [Position]

    Thanks,

    The pattern you have above has nothing to do with non-printable characters. The code above finds the first character that is NOT a space, exclamation point, dash, or tilde.

    The purpose of the COLLATE, in this particular case, is an attempt to improve the performance of the code by forcing it to do a straight forward binary comparison of characters due to the Latin1_General_BIN collation.

    Still, the code above has nothing to do with finding non-printable characters.

    Quick note, Jeff is almost right, the code will detect non lower ascii characters, that is characters not between ASCII 32 and 126, more than half of those are printable! Jeff, the dash in this case means "between";-)

    😎

    I normally use something like this code

    USE tempdb;

    GO

    SET NOCOUNT ON;

    /* Sample string */

    DECLARE @CRAPSTRING VARCHAR(100) =

    CHAR(10) -- Line Feed

    + CHAR(12) -- Form Feed

    + CHAR(31) -- Unit Separator

    + CHAR(67) -- Uppercase C

    + CHAR(108) -- Lowercase l

    + CHAR(101) -- Lowercase e

    + CHAR(97) -- Lowercase a

    + CHAR(110) -- Lowercase n

    + CHAR(127) -- Delete

    + CHAR(17) -- Device Control 1 (oft. XON)

    + CHAR(101) -- Lowercase e

    + CHAR(15) -- Shift In / X-Off

    + CHAR(100) -- Lowercase d

    ;

    ;WITH T(N) AS (SELECT N FROM (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) AS X(N))

    , NUMS(N) AS (SELECT TOP(LEN(@CRAPSTRING)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)

    SELECT

    (SELECT

    SUBSTRING(@CRAPSTRING,NM.N,1)

    FROM NUMS NM

    WHERE ASCII(SUBSTRING(@CRAPSTRING COLLATE Latin1_General_BIN,NM.N,1)) NOT BETWEEN 0 AND 31

    AND ASCII(SUBSTRING(@CRAPSTRING COLLATE Latin1_General_BIN,NM.N,1)) <> 127

    FOR XML PATH(''), TYPE).value('.[1]','VARCHAR(MAX)') AS CLEANED_OUTPUT;

  • Ah crud. More coffee please. Thanks for the correction, Eirikur. You're absolutely correct about the dash.

    That also adds more functionality to the COLLATE because it will find only the exact characters according to a binary match instead of possibly finding accented characters, as well.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply