Finding special characters within character strings

Question

Finding special characters within character strings

michael.f.morin

SSC-Addicted

Points: 406
More actions
August 18, 2014 at 11:23 pm

#298482

Comments posted to this topic are about the item Finding special characters within character strings

Viewing 15 posts - 1 through 15 (of 23 total)

You must be logged in to reply to this topic. Login to reply

wes_springob Grasshopper Points: 19 More actions · Answer 1

You say you run this every morning? Why not just set up a constraint on the column to prevent the problem characters from getting in, in the first place?

halifaxdal SSCoach Points: 19741 More actions · Answer 2

interesting summary article, worth to save for later use

Jeff Moden SSC Guru Points: 1004749 More actions · Answer 3

wes_springob (8/19/2014)
You say you run this every morning? Why not just set up a constraint on the column to prevent the problem characters from getting in, in the first place?

Heh... because if someone fails to mail me a check because of a typo in a bulk load, there will be hell to pay! 😀 For example, if I live at 5678 Main Street and someone accidently had the SHIFT key depressed when they typed the "5" so that the address came out as %678 Main Street (just as an example), I'd want someone working on the correction rather than just rejecting the row of bulk data. Why not do those types of checks in the front end before it makes it to a bulk file? Heh... yeah... people always do those kinds of checks in the front end, don't they? 😛

It just doesn't hurt to do these types of checks in the database, especially where any form of keyboard input was ever involved. It would be nice if they did it in the front end, but there's no guarantees there. These types of checks might also be good for capturing examples of what people have been typing to design new "validation" code for the front end or whatever source the data came from.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Chris Hurlbut SSCarpal Tunnel Points: 4162 More actions · Answer 4

What gets me most of the time are Carriage Returns, Line Feeds and tabs.

We receive many different file formats from different clients, while we try to standardize feeds as much as possible, these sometimes end up in the data I have to process.

Usually Excel is involved in some way...Analysts love excel but for ETL it makes me cringe.

alexander.oss Old Hand Points: 304 More actions · Answer 5

Beware the Unicode character FULLWIDTH APOSTROPHE ' (U+FF07), because we had a problem with it once since it looks like SQL Server will treat it like any other apostrophe, i.e. as a string constant delimiter or apostrophe escape. So if you're building dynamic SQL (which of course you aren't, and of course neither were we) and you're manually looking for characters to escape or otherwise handle before executing your SQL string, you may need to handle it too.

michael.f.morin SSC-Addicted Points: 406 More actions · Answer 6

If you own, or are a member of, the project that developed the system that would be great. Sadly, many times you are administering a system that you cannot change, at least in the short term, or have very little control over at that level.

NBSteve SSCarpal Tunnel Points: 4333 More actions · Answer 7

Thanks for the article! One clarification:

Also notice that “Ça va” is missing. If the column is case sensitive, should that not have been captured? Again this is not a course on regular expressions but note: the regular expression does not see the accented characters as special. Meaning (é and e) or (Ä and A) are evaluated as the same, as far as the expression is concerned.

This is incorrect, they are not evaluated the same. When checking ranges in regular expressions, the collation setting is used. When you specify [A-Z], this includes every character alphabetically sorted between 'A' and 'Z' according to the collation. 'Ç' is sorted between 'C' and 'D' alphabetically in this collation, and so is included in [A-Z].

SELECT 'IT WORKS'
WHERE 'C' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[C-C]%');
SELECT 'IT WORKS'
WHERE 'C' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[B-C]%');
SELECT 'IT WORKS'
WHERE 'C' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[C-D]%');
SELECT 'IT WORKS'
WHERE 'Ç' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[C-C]%');
SELECT 'IT WORKS'
WHERE 'Ç' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[B-C]%');
SELECT 'IT WORKS'
WHERE 'Ç' COLLATE SQL_Latin1_General_CP1_CI_AS LIKE ('%[C-D]%');

In 1-3, you can see that a plain 'C' matches all 3 ranges. However, in 4-6, the accented 'Ç' character only matches the range [C-D].

tom.w.brannon Hall of Fame Points: 3976 More actions · Answer 8

Thanks for the post! I set up a search on all the data in my modest sized database and found that LEN also ignores CRLF characters. I found I had a field that had been loaded with trailing blanks up to the last two characters 0D 0A. The value from LEN varied but DATALEN was always 65. Fixed with LEFT on the LEN.

Robert.Sterbal SSCrazy Points: 2855 More actions · Answer 9

I'm not sure why we don't check our data as many ways as possible. I have an environment that assumes everything is installed as expected, and a help desk busy with inquiries for when it isn't.

The key is to develop testing that is efficient to run.

nicholas.mace Newbie Points: 9 More actions · Answer 10

NBSteve (8/19/2014)
This is incorrect, they are not evaluated the same. When checking ranges in regular expressions, the collation setting is used. When you specify [A-Z], this includes every character alphabetically sorted between 'A' and 'Z' according to the collation. 'Ç' is sorted between 'C' and 'D' alphabetically in this collation, and so is included in [A-Z].

This is also the reason why [0-9] is not a good choice for finding "numbers". Using a UK installation of SQL Server and the same collation (Latin1_General_CS_AS) as the article, I get 16 matches in the first 255 characters - it includes characters for one-quarter, one-half, three-quarters, superscript 1, superscript 2 and superscript 3! There are 249 matches in 16-bit character space.

I think using [0123456789] is safe, but do please tell me if I'm wrong.

LearnModeOn SSC Enthusiast Points: 154 More actions · Answer 11

I really like the idea and article. I ran everything to follow along, but I kept getting different results than the example. What gives???

using sql 2008 r2

select this!!

begulley SSC Rookie Points: 41 More actions · Answer 12

When looking for a range, is this according to the ascii table? For example:

where FixSubject like '%[' + Char(1) + '-' + Char(11) + ']%'

Does this find everything on the Ascii table between (start of heading) to (Vertical Tab)? or is it indeterminate what it will return?

Also, would the performance be better using individual like's or a range?

NBSteve SSCarpal Tunnel Points: 4333 More actions · Answer 13

begulley (8/25/2014)
When looking for a range, is this according to the ascii table? For example:
where FixSubject like '%[' + Char(1) + '-' + Char(11) + ']%'
Does this find everything on the Ascii table between (start of heading) to (Vertical Tab)? or is it indeterminate what it will return?
Also, would the performance be better using individual like's or a range?

It's based on collation sorting, which is likely to be related to the ascii table, but NOT the same. Best bet is to do a simple test:

DECLARE @startchar int = 1, @endchar int = 11;

WITH

TALLY10(N) AS (

SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL

SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL

SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 0

),

TALLY100(N) AS (SELECT 1 FROM TALLY10 a, TALLY10 b),

TALLY10K(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM TALLY100 a, TALLY100 b)

SELECT N ASCIICODE, CHAR(N) CHARVALUE

FROM TALLY10K

WHERE CHAR(N) LIKE '[' + CHAR(@startchar) + '-' + CHAR(@endchar) + ']'

ORDER BY N

This just uses an inline tally table to return a list of all ascii code's and char values that match the range between @startchar and @endchar. On my system, it does appear to work as you expect for char(1)-char(11), but I don't know if this is true for all collations.

Note that 1-47 seems like it might be a 1-1 match, but 1-48 will suddenly return 132 "matching" rows as many punctuation marks with higher ascii values are sorted prior to zero (char(48)) in my collation.

Katharine Mahon Grasshopper Points: 19 More actions · Answer 14

How to find an embedded Tab or Pipe in an address field? I got into some problems when exported the text files with those special characters.