Finding special characters within character strings

  • Try this for the Tab and pipe

    SELECT * FROM myTable WHERE MyField LIKE '%' + CHAR(9) + '%' OR MyField LIKE '%' + CHAR(124) + '%'

    The number passed to CHAR() is the ASCII character code (9 = Tab, 124 = pipe).
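
    A more general variation (just a sketch, reusing the hypothetical myTable / MyField names) uses PATINDEX with a binary collation so that a [x-y] range over control codes behaves predictably, and catches any ASCII control character rather than just Tab:

    -- Report the position of the first ASCII control character (codes 1-31).
    SELECT *,
           PATINDEX('%[' + CHAR(1) + '-' + CHAR(31) + ']%',
                    MyField COLLATE Latin1_General_BIN) AS FirstControlCharPos
    FROM myTable
    WHERE PATINDEX('%[' + CHAR(1) + '-' + CHAR(31) + ']%',
                   MyField COLLATE Latin1_General_BIN) > 0;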

  • Katharine Mahon (8/29/2014)


    How do you find an embedded Tab or pipe in an address field? I ran into some problems when exporting text files containing those special characters.

    As begulley demonstrated above, it's pretty easy to find such rows of data. My question would be, what do you want to do about them? Delete them, replace them with spaces, use them as delimiters, or ???

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • We replace the characters with a blank or a space, depending on what they are. We remove them because they cause issues when exporting to Excel from Reporting Services.
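
    A minimal sketch of that kind of cleanup, reusing the hypothetical myTable / MyField names from the earlier post:

    -- Strip embedded tabs and turn pipes into spaces before the export.
    UPDATE myTable
    SET MyField = REPLACE(REPLACE(MyField, CHAR(9), ''), CHAR(124), ' ')
    WHERE MyField LIKE '%' + CHAR(9) + '%'
       OR MyField LIKE '%' + CHAR(124) + '%';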

  • Unicode defines potentially 1,114,112 characters. So two bytes won't be enough for all of them!

    What actually goes on is that the data is encoded as UTF-16. For code points up to 65535 the data is stored in 2 bytes. Above this you get 4 bytes per character, using what are called surrogate pairs (two 2-byte sequences).

    Just to make life more entertaining, to correctly store characters above the Basic Multilingual Plane (code points 65536 and above) you need to choose a collation which supports supplementary character sets (the collation name contains SC).

    This means an nchar(10) column might only store 5 characters (in the event they are all in the supplementary character range).
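
    A quick way to see this in action (a sketch; U+1F142 is just an arbitrary supplementary code point, built here from its surrogate pair):

    -- One supplementary character = one surrogate pair = two UTF-16 code units = 4 bytes.
    DECLARE @s NVARCHAR(10) = NCHAR(0xD83C) + NCHAR(0xDD42);
    SELECT DATALENGTH(@s); -- 4: a single character already uses 4 of the 20 available bytes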

  • Nice article. One thing to be careful of when using LEN and DATALENGTH is that it matters whether the column was declared as CHAR or VARCHAR. CHAR will return different numbers for LEN and DATALENGTH.

    For instance:

    DECLARE @v VARCHAR(10), @c CHAR(10);

    SET @v = 'Donna';
    SET @c = 'Donna';

    SELECT '@v', LEN(@v), DATALENGTH(@v); -- returns 5 and 5
    SELECT '@c', LEN(@c), DATALENGTH(@c); -- returns 5 and 10
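
    The same padding effect doubles for the fixed-width Unicode type, since DATALENGTH counts bytes and each UTF-16 code unit is 2 bytes (a small extension of the example above):

    DECLARE @n NCHAR(10) = N'Donna';
    SELECT '@n', LEN(@n), DATALENGTH(@n); -- returns 5 and 20 (10 characters x 2 bytes)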

  • david woakes (4/8/2016)


    Unicode defines potentially 1,114,112 characters. So two bytes won't be enough for all of them!

    What actually goes on is that the data is encoded as UTF-16. For code points up to 65535 the data is stored in 2 bytes. Above this you get 4 bytes per character, using what are called surrogate pairs (two 2-byte sequences).

    Just to make life more entertaining, to correctly store characters above the Basic Multilingual Plane (code points 65536 and above) you need to choose a collation which supports supplementary character sets (the collation name contains SC).

    This means an nchar(10) column might only store 5 characters (in the event they are all in the supplementary character range).

    David, thanks for mentioning Supplementary Characters / Surrogate Pairs. There is only one correction to make related to the statements quoted above: a Supplementary Character-aware Collation is not necessary to store and retrieve Code Points above U+FFFF (i.e. values above 65535). All UTF-16 Surrogate Pairs are made up of completely valid UCS-2 (i.e. Code Points U+0 through U+FFFF == values 0 - 65535 == Basic Multilingual Plane a.k.a. BMP) Code Points. Those Code Points just happen to be unmapped individually and were reserved prior to UTF-16 existing in order to serve this purpose. Storing and retrieving Unicode data across the full UTF-16 spectrum (i.e. all 1.1 million Code Points, only 200k or so of which are mapped at this point) is not a matter of Collation, and hence works as far back as at least SQL Server 2005, if not even SQL Server 2000 or prior, though I have no way of testing 2000 / 7.0 / 6.5.

    What the Supplementary Character-aware Collations (those ending in "_SC", introduced in SQL Server 2012) get you is the ability for SQL Server (in terms of the built-in functions as well as sorting) to properly interpret them as a single Code Point instead of two double-byte sequences. For example:

    /*
    Supplementary Character = 🅂
    Code Point = U+1F142
    Decimal Value = 127298
    Surrogate Pair = D83C + DD42
    Little Endian representation = 3CD8 + 42DD
    */
    SELECT NCHAR(0x01F142),                 -- 🅂 or NULL if DB default Collation does not end in _SC
           NCHAR(127298),                   -- 🅂 or NULL if DB default Collation does not end in _SC
           NCHAR(0xD83C) + NCHAR(0xDD42),   -- 🅂
           CONVERT(NVARCHAR(50), 0x3CD842DD) -- 🅂

    If you run the code above in a database that has a default Collation ending in "_SC" (again, these were introduced in SQL Server 2012), then all four fields will properly show the "🅂" character / glyph. But in databases having a default Collation not ending in "_SC", the first two fields will return NULL because they were only coded to handle the initial 0 - 65535 range, which is the UCS-2 specification. However, you can see that the second two fields will always display the correct character / glyph due to injecting the correct UTF-16 Little Endian sequence.

    The following example shows how the LEN() built-in function returns a different value for the same four bytes between the "_SC" and non-"_SC" collated columns (and I included a "_BIN2" column just in case there was a question of whether or not they properly interpret Surrogate Pairs / Supplementary Characters, since they are more recent and were designed to properly sort Unicode data, whereas the deprecated "_BIN" Collations do not).

    CREATE TABLE #SupplementaryCharacterTest
    (
        NonScCollation NVARCHAR(10) COLLATE Latin1_General_100_CI_AS NOT NULL,
        ScCollation NVARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC NOT NULL,
        BIN2Collation NVARCHAR(10) COLLATE Latin1_General_100_BIN2 NOT NULL
    );

    DECLARE @Glyph NVARCHAR(10) = NCHAR(0xD83C) + NCHAR(0xDD42);

    INSERT INTO #SupplementaryCharacterTest (NonScCollation, ScCollation, BIN2Collation)
    VALUES (@Glyph, @Glyph, @Glyph);

    SELECT tmp.NonScCollation,
           LEN(tmp.NonScCollation) AS [LenNonSC],
           DATALENGTH(tmp.NonScCollation) AS [DataLengthNonSC],
           '---' AS [---],
           tmp.ScCollation,
           LEN(tmp.ScCollation) AS [LenSC],
           DATALENGTH(tmp.ScCollation) AS [DataLengthSC],
           '---' AS [---],
           tmp.BIN2Collation,
           LEN(tmp.BIN2Collation) AS [LenBIN2],
           DATALENGTH(tmp.BIN2Collation) AS [DataLengthBIN2]
    FROM #SupplementaryCharacterTest tmp;

    NonScGlyph  LenNonSC  DataLengthNonSC  ---  ScGlyph  LenSC  DataLengthSC  ---  BIN2Glyph  LenBIN2  DataLengthBIN2
    🅂           2         4                ---  🅂        1      4             ---  🅂          2        4

    Other notes on the article:

    • VARCHAR data is not always 1 to 1 between LEN and DATALENGTH. VARCHAR is always 8-bit, but there are four Code Pages that are Double-Byte Character Sets (DBCS). I provided details and examples on this topic in the following DBA.StackExchange answer: Storing Japanese characters in a table
    • The pattern matching capabilities of LIKE and PATINDEX ([char], [rangeBegin-rangeEnd], [^char], [^rangeBegin-rangeEnd], etc.) are very much not Regular Expressions, and they should not be referred to as such; doing so only increases the confusion that is already so prevalent on this topic, especially for those who are just starting to learn this stuff. What is offered by the LIKE operator and PATINDEX function is simple pattern matching that just happens to share a very similar syntax with RegEx (see the brief illustration after this list).
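
    A brief illustration (a sketch with made-up literals): bracket patterns look like RegEx character classes, but there are no quantifiers, no alternation, and no anchors.

    -- LIKE must match the entire string; PATINDEX returns the first match position.
    SELECT CASE WHEN 'A1' LIKE '[A-Z][0-9]' THEN 1 ELSE 0 END AS [Matches], -- 1
           PATINDEX('%[^a-z0-9 ]%', 'hello world!') AS [FirstOddCharPos];   -- 12 (the '!')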

    SQL# (https://SQLsharp.com/) (SQLCLR library of over 340 Functions and Procedures)
    Sql Quantum Lift (https://SqlQuantumLift.com/) (company)
    Sql Quantum Leap (https://SqlQuantumLeap.com/) (blog)
    Info sites: Collations • Module Signing • SQLCLR

  • We have lots of different input sources, and bad data just gets in. We don't have the luxury of rejecting the data, so we have to deal with it afterward. Thanks for this.

  • Great ideas in the post!

    What I would really like to see, and still don't understand why it is not part of SQL Server, is full regular expression support. Regular expressions seem to be a natural fit for relational data, yet SQL Server still only supports a hobbled implementation. Yes, it is possible to expose this through .NET and the CLR, but sadly, where I work, CLR is rarely if ever allowed.

    I can hope!
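
    In the meantime, simple cases can sometimes be approximated in pure T-SQL (a rough sketch, not real RegEx): for example, checking that a value consists only of digits.

    -- The RegEx equivalent would be ^[0-9]+$ ; here the negated class does the work.
    DECLARE @val VARCHAR(20) = '12345';
    SELECT CASE WHEN @val NOT LIKE '%[^0-9]%' AND @val <> '' THEN 'all digits'
                ELSE 'not all digits'
           END;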
