March 24, 2016 at 1:03 pm
I am doing a data cleansing activity.
The original Source was in Access Database
Current Source in SQL Server 2008 R2
nvarchar(50)
The reference table is in SQL server 2012.
varchar(100)
I used a DT package to convert the reference data to nvarchar(50)
I placed them in some table
There is an exact match for the value of lastname which equals: Kirkland‐Wallace
when I run the query
select LastName from
DT5
where LastName like '%-%'
no results
when I run the query
SELECT LastName FROM DT5
WHERE CHARINDEX('-', LastName) > 0
I get everything but Kirkland‐Wallace that has a hyphen
when I run the query
select LastName from --
DT5
where LastName like '%Wallace%' -- switch these or '%Kirkland%
I get the correct result
For some reason it does not see the hyphen as a hyphen but it display's itself as a result set.
Has anyone come across this situation before?
I would be appreciate any feedback
March 24, 2016 at 1:12 pm
there's at least two other longer than normal hyphens that has a different ascii code than the "normal hyphen ASCII('-')=45,without even going into nvarchar data sets.
i'm betting it's one of these others?
/*Results
NormalHyphenChar150Char151
-–—
*/
SELECT CHAR(45) As NormalHyphen,
CHAR(150) Char150,
CHAR(151) As Char151
Lowell
March 24, 2016 at 1:23 pm
Lowell (3/24/2016)
there's at least two other longer than normal hyphens that has a different ascii code than the "normal hyphen ASCII('-')=45,without even going into nvarchar data sets.i'm betting it's one of these others?
/*Results
NormalHyphenChar150Char151
-–—
*/
SELECT CHAR(45) As NormalHyphen,
CHAR(150) Char150,
CHAR(151) As Char151
I'll bet one of them is the annoying one that's pasted in from MS Word.
March 24, 2016 at 1:37 pm
That was a good thought!
SELECT LastName FROM DT5
WHERE CHARINDEX(CHAR(151), LastName) > 0
no results
SELECT LastName FROM DT5
WHERE CHARINDEX(CHAR(150), LastName) > 0
no results
SELECT LastName FROM DT5
WHERE CHARINDEX(CHAR(45), LastName) > 0
same results as hyphen above
I appreciate the rapid turrnaround
March 24, 2016 at 1:46 pm
bdkdavid (3/24/2016)
That was a good thought!SELECT LastName FROM DT5
WHERE CHARINDEX(CHAR(151), LastName) > 0
no results
SELECT LastName FROM DT5
WHERE CHARINDEX(CHAR(150), LastName) > 0
no results
SELECT LastName FROM DT5
WHERE CHARINDEX(CHAR(45), LastName) > 0
same results as hyphen above
I appreciate the rapid turrnaround
test for high ascii, like the word hyphen:
Select LastName FROM DT5 Where LastName Like '%[^0-9a-zA-Z -]%' COLLATE Latin1_General_BIN
Lowell
March 24, 2016 at 2:07 pm
What do you get if you run this query?
WITH
E(n) AS(
SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n)
),
E2(n) AS(
SELECT a.n FROM E a, E b
),
E4(n) AS(
SELECT a.n FROM E2 a, E2 b
),
cteTally(n) AS(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) n
FROM E4
)
SELECT n, SUBSTRING(LastName, n, 1), ASCII( SUBSTRING(LastName, n, 1))
FROM cteTally t
JOIN DT5 ON t.n <= LEN(DT5.LastName)
WHERE LastName like '%Kirkland%Wallace%'
March 24, 2016 at 2:18 pm
The hyphen might be a non-breaking hyphen. Here's an example of how it behaves:
declare @test-2 table (a nvarchar(10));
insert @test-2 values ('abc-def'), ('abc' + nchar(0x02010) + 'def')
select a, ascii(SUBSTRING(a, 4, 1)), UNICODE(SUBSTRING(a, 4, 1))
from @test-2
--where a like '%-%'
The result without the where clause is:
abc-def4545
abc-def458208
And with the where clause:
abc-def4545
March 24, 2016 at 2:40 pm
Hi Luis
Here are the Results:
1K75
2i105
3r114
4k107
5l108
6a97
7n110
8d100
9-45
10W87
11a97
12l108
13l108
14a97
15c99
16e101
March 24, 2016 at 2:46 pm
I just realized that you're querying the source table. Change the ASCII function for UNICODE. That should show the difference as shown by Stephanie.
March 24, 2016 at 2:50 pm
Hi Stephanie,
SELECT LastName FROM DT5
WHERE CHARINDEX(nchar(0x02010), LastName) > 0
or where Charindex(nchar(8208),LastName) > 0
this get the desired out put
Kirkland-Wallace
Is there a cause for this behavior?
I do not understand the source of the problem.
Can anyone shed some light on the subject?
March 24, 2016 at 2:54 pm
Hi Luis
Here is the correction output:
1K75
2i105
3r114
4k107
5l108
6a97
7n110
8d100
9-8208
10W87
11a97
12l108
13l108
14a97
15c99
16e101
March 24, 2016 at 3:17 pm
bdkdavid (3/24/2016)
Is there a cause for this behavior?
I do not understand the source of the problem.
Can anyone shed some light on the subject?
The unicode character that looks like a hyphen is not, in fact, a hyphen. It's a "non-breaking hyphen" which can be created fairly easily in word processing. In a program like Word, that means that "Kirkland-Wallace" (which I typed with a hyphen, but you get the point) will not break across lines. The entire phrase will move to the next line if there isn't room at the end of the previous line for all of it.
For your data cleansing, you need to add the non-breaking hyphen to whatever logic you are currently applying to hyphens.
March 29, 2016 at 2:09 pm
Luis, that is an awesome query! I suspected that was the problem when I first saw this post, but it was already answered and quite well at that.
I would really like to smack Word upside the head and disable things like smart quotes and their various hyphens. I understand and appreciate that they have a purpose, but it don't do much for a poor li'l code slinger like me.
-----
[font="Arial"]Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson[/font]
Viewing 13 posts - 1 through 12 (of 12 total)
You must be logged in to reply to this topic. Login to reply