Finding special characters within character strings

  • michael.f.morin

    Old Hand

    Points: 386

    Comments posted to this topic are about the item Finding special characters within character strings

  • wes_springob

    Grasshopper

    Points: 19

    You say you run this every morning? Why not just set up a constraint on the column to prevent the problem characters from getting in, in the first place?

  • halifaxdal

    SSCoach

    Points: 19584

    interesting summary article, worth to save for later use

  • Jeff Moden

    SSC Guru

    Points: 994266

    wes_springob (8/19/2014)


    You say you run this every morning? Why not just set up a constraint on the column to prevent the problem characters from getting in, in the first place?

    Heh... because if someone fails to mail me a check because of a typo in a bulk load, there will be hell to pay! 😀 For example, if I live at 5678 Main Street and someone accidently had the SHIFT key depressed when they typed the "5" so that the address came out as %678 Main Street (just as an example), I'd want someone working on the correction rather than just rejecting the row of bulk data. Why not do those types of checks in the front end before it makes it to a bulk file? Heh... yeah... people always do those kinds of checks in the front end, don't they? 😛

    It just doesn't hurt to do these types of checks in the database, especially where any form of keyboard input was ever involved. It would be nice if they did it in the front end, but there's no guarantees there. These types of checks might also be good for capturing examples of what people have been typing to design new "validation" code for the front end or whatever source the data came from.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."
    When you put the right degree of spin on it, the number 3|8 is also a glyph that describes the nature of a DBAs job. 😉

    Helpful Links:
    How to post code problems
    Create a Tally Function (fnTally)

  • Chris Hurlbut

    SSCarpal Tunnel

    Points: 4052

    What gets me most of the time are Carriage Returns, Line Feeds and tabs.

    We receive many different file formats from different clients, while we try to standardize feeds as much as possible, these sometimes end up in the data I have to process.

    Usually Excel is involved in some way...Analysts love excel but for ETL it makes me cringe.

  • alexander.oss

    SSC Veteran

    Points: 253

    Beware the Unicode character FULLWIDTH APOSTROPHE ' (U+FF07), because we had a problem with it once since it looks like SQL Server will treat it like any other apostrophe, i.e. as a string constant delimiter or apostrophe escape. So if you're building dynamic SQL (which of course you aren't, and of course neither were we) and you're manually looking for characters to escape or otherwise handle before executing your SQL string, you may need to handle it too.

  • michael.f.morin

    Old Hand

    Points: 386

    If you own, or are a member of, the project that developed the system that would be great. Sadly, many times you are administering a system that you cannot change, at least in the short term, or have very little control over at that level.

  • NBSteve

    Hall of Fame

    Points: 3203

    Thanks for the article! One clarification:

    Also notice that “Ça va” is missing. If the column is case sensitive, should that not have been captured? Again this is not a course on regular expressions but note: the regular expression does not see the accented characters as special. Meaning (é and e) or (Ä and A) are evaluated as the same, as far as the expression is concerned.

    This is incorrect, they are not evaluated the same. When checking ranges in regular expressions, the collation setting is used. When you specify [A-Z], this includes every character alphabetically sorted between 'A' and 'Z' according to the collation. 'Ç' is sorted between 'C' and 'D' alphabetically in this collation, and so is included in [A-Z].

    SELECT 'IT WORKS'

    WHERE 'C' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[C-C]%');

    SELECT 'IT WORKS'

    WHERE 'C' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[B-C]%');

    SELECT 'IT WORKS'

    WHERE 'C' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[C-D]%');

    SELECT 'IT WORKS'

    WHERE 'Ç' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[C-C]%');

    SELECT 'IT WORKS'

    WHERE 'Ç' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[B-C]%');

    SELECT 'IT WORKS'

    WHERE 'Ç' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[C-D]%');

    In 1-3, you can see that a plain 'C' matches all 3 ranges. However, in 4-6, the accented 'Ç' character only matches the range [C-D].

  • tom.w.brannon

    Hall of Fame

    Points: 3946

    Thanks for the post! I set up a search on all the data in my modest sized database and found that LEN also ignores CRLF characters. I found I had a field that had been loaded with trailing blanks up to the last two characters 0D 0A. The value from LEN varied but DATALEN was always 65. Fixed with LEFT on the LEN.

  • Robert.Sterbal

    SSCrazy

    Points: 2855

    I'm not sure why we don't check our data as many ways as possible. I have an environment that assumes everything is installed as expected, and a help desk busy with inquiries for when it isn't.

    The key is to develop testing that is efficient to run.

  • nicholas.mace

    Newbie

    Points: 9

    NBSteve (8/19/2014)


    This is incorrect, they are not evaluated the same. When checking ranges in regular expressions, the collation setting is used. When you specify [A-Z], this includes every character alphabetically sorted between 'A' and 'Z' according to the collation. 'Ç' is sorted between 'C' and 'D' alphabetically in this collation, and so is included in [A-Z].

    This is also the reason why [0-9] is not a good choice for finding "numbers". Using a UK installation of SQL Server and the same collation (Latin1_General_CS_AS) as the article, I get 16 matches in the first 255 characters - it includes characters for one-quarter, one-half, three-quarters, superscript 1, superscript 2 and superscript 3! There are 249 matches in 16-bit character space.

    I think using [0123456789] is safe, but do please tell me if I'm wrong.

  • LearnModeOn

    SSC Enthusiast

    Points: 154

    I really like the idea and article. I ran everything to follow along, but I kept getting different results than the example. What gives???

    using sql 2008 r2

    select this!!

  • begulley

    SSC Rookie

    Points: 41

    When looking for a range, is this according to the ascii table? For example:

    where FixSubject like '%[' + Char(1) + '-' + Char(11) + ']%'

    Does this find everything on the Ascii table between (start of heading) to (Vertical Tab)? or is it indeterminate what it will return?

    Also, would the performance be better using individual like's or a range?

  • NBSteve

    Hall of Fame

    Points: 3203

    begulley (8/25/2014)


    When looking for a range, is this according to the ascii table? For example:

    where FixSubject like '%[' + Char(1) + '-' + Char(11) + ']%'

    Does this find everything on the Ascii table between (start of heading) to (Vertical Tab)? or is it indeterminate what it will return?

    Also, would the performance be better using individual like's or a range?

    It's based on collation sorting, which is likely to be related to the ascii table, but NOT the same. Best bet is to do a simple test:

    DECLARE @startchar int = 1, @endchar int = 11;

    WITH

    TALLY10(N) AS (

    SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL

    SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL

    SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 0

    ),

    TALLY100(N) AS (SELECT 1 FROM TALLY10 a, TALLY10 b),

    TALLY10K(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM TALLY100 a, TALLY100 b)

    SELECT N ASCIICODE, CHAR(N) CHARVALUE

    FROM TALLY10K

    WHERE CHAR(N) LIKE '[' + CHAR(@startchar) + '-' + CHAR(@endchar) + ']'

    ORDER BY N

    This just uses an inline tally table to return a list of all ascii code's and char values that match the range between @startchar and @endchar. On my system, it does appear to work as you expect for char(1)-char(11), but I don't know if this is true for all collations.

    Note that 1-47 seems like it might be a 1-1 match, but 1-48 will suddenly return 132 "matching" rows as many punctuation marks with higher ascii values are sorted prior to zero (char(48)) in my collation.

  • Katharine Mahon

    Grasshopper

    Points: 14

    How to find an embedded Tab or Pipe in an address field? I got into some problems when exported the text files with those special characters.

Viewing 15 posts - 1 through 15 (of 23 total)

You must be logged in to reply to this topic. Login to reply