|
|
|
SSC Rookie
      
Group: General Forum Members
Last Login: Monday, February 18, 2013 2:25 PM
Points: 32,
Visits: 87
|
|
Our content has been contaminated by UNICODE NULL characters in our SQL Server 2008 database that don't play nicely with our java libraries that depend on UTF8 characters.
I am trying to compose SQL statements that will identify the characters within the nVarchars and then replace with an empty value ''. The main problem I'm having is creating a check to see whether a particular character is a UNICODE NULL or not.
When I try to view the values in the database, they look like 'u' instead of 'username'. I suspect there is an invisible UNICODE NULL just after the first letter. My goal is to simply delete that second, invisible character (UNICODE NULL).
SELECT name FROM users WHERE name LIKE '%[^ -~]%' COLLATE Latin1_General_BIN
Returns the following:
B M M N S S S S ÿþA
select name, CAST(RIGHT(name,1) AS varbinary(128)) AS RIGHTER_1, from users where id=1
returns:
B 0x4200
Any ideas are greatly appreciated.
mtf
|
|
|
|
|
SSChampion
        
Group: General Forum Members
Last Login: Today @ 9:24 AM
Points: 11,791,
Visits: 28,073
|
|
something like this can help you identify the actual char; i don't know what unicode null is (is that ASCII code(0)?
but this will help figure it out;
select name, N as Position, SUBSTRING(name,N,1) As TheChar, ASCII(SUBSTRING(name,N,1)) TheAsciiCode from users CROSS APPLY (SELECT TOP 255 ROW_NUMBER() OVER (ORDER BY (SELECT NULL) -1) AS n FROM sys.columns) MiniTally WHERE MiniTally.n BETWEEN 0 AND 255 AND MiniTally.n < LEN(name) ORDER BY name,N
Lowell
--There is no spoon, and there's no default ORDER BY in sql server either. Actually, Common Sense is so rare, it should be considered a Superpower. --my son
|
|
|
|
|
SSC Rookie
      
Group: General Forum Members
Last Login: Monday, February 18, 2013 2:25 PM
Points: 32,
Visits: 87
|
|
Hey, that's a pretty cool piece of SQL! Thanks for offering it.
I think the +1 helps it traverse the string, but honestly, this is using some structures I've never even seen before (CROSS APPLY -- I need to study up on that!), so I just took a guess on the +1....
select name, N as Position, SUBSTRING(name,N,1) As TheChar, ASCII(SUBSTRING(name,N,1)) TheAsciiCode from my_user_table CROSS APPLY (SELECT TOP 255 ROW_NUMBER() OVER (ORDER BY (SELECT NULL) -1) AS n FROM sys.columns) MiniTally WHERE MiniTally.n BETWEEN 0 AND 255 AND MiniTally.n < LEN(name) +1 ORDER BY name,N
The Results:
A 1 A 65 A 2 0 A 3 H 72 A 4 0 A 5 S 83 A 6 0 A 7 G 71 A 8 0 A 9 u 117 A 10 0 A 11 e 101 A 12 0 A 13 s 115 A 14 0 A 15 t 116
I think the ascii 0 value is where the UNICODE NULLS are. The above example should be 'AHSGuest', but in the table, it just looks like 'A' with nothing else after the letter A.
Here's an ugly recursion using the SQL you provided to explicitly display the names contaminated with the UNICODE NULLS:
select id, name, N as Position, SUBSTRING(name,N,1) As TheChar, ASCII(SUBSTRING(name,N,1)) TheAsciiCode from my_user_table CROSS APPLY (SELECT TOP 255 ROW_NUMBER() OVER (ORDER BY (SELECT NULL) -1) AS n FROM sys.columns) MiniTally WHERE MiniTally.n BETWEEN 0 AND 255 AND MiniTally.n < LEN(name)+1 AND ID in (select DISTINCT B.id FROM (select id, name, N as Position, SUBSTRING(name,N,1) As TheChar, ASCII(SUBSTRING(name,N,1)) TheAsciiCode from my_user_table CROSS APPLY (SELECT TOP 255 ROW_NUMBER() OVER (ORDER BY (SELECT NULL) -1) AS n FROM sys.columns) MiniTally WHERE MiniTally.n BETWEEN 0 AND 255 AND MiniTally.n < LEN(name)+1) B WHERE B.TheAsciiCode=0) ORDER BY name,N
Any idea on how I can get rid of these ASCII(0) characters within my strings?
Thanks for your help on this!!!
mtf
|
|
|
|
|
SSChampion
        
Group: General Forum Members
Last Login: Today @ 9:24 AM
Points: 11,791,
Visits: 28,073
|
|
well, so far it looks like a single offending character, so i think you can do a simple replace();
if this produces the desired name , i would update it with the sample below:
select id, name, REPLACE(name,CHAR(0),'') As Fixed from my_user_table --only bad data for clarity WHERE CHARINDEX(name,CHAR(0),1) > 0
If that's cleaning up the results correctly, a simple replace should fix it:
UPDATE my_user_table SET name = REPLACE(name,CHAR(0),'') --only bad data for to match the #rows in our original select WHERE CHARINDEX(name,CHAR(0),1) > 0
Lowell
--There is no spoon, and there's no default ORDER BY in sql server either. Actually, Common Sense is so rare, it should be considered a Superpower. --my son
|
|
|
|
|
Ten Centuries
      
Group: General Forum Members
Last Login: Today @ 9:33 AM
Points: 1,333,
Visits: 4,017
|
|
Just wondering what is the datalength() of one of those names, such as AHSGuest ?
MM
|
|
|
|
|
SSC Rookie
      
Group: General Forum Members
Last Login: Monday, February 18, 2013 2:25 PM
Points: 32,
Visits: 87
|
|
Lowell,
Sadly, this isn't matching any rows:
select id, name FROM my_user_table WHERE CHARINDEX(name,CHAR(0),1) > 0 No rows returned. Likewise, the replace operation doesn't touch any of the rows. Ideas?
Thank you very much for the help you've provided thus far. I feel like I'm on the verge of clobbering this beast thanks to your assistance!
mtf
|
|
|
|
|
SSC Rookie
      
Group: General Forum Members
Last Login: Monday, February 18, 2013 2:25 PM
Points: 32,
Visits: 87
|
|
Magoo,
For the 'AHSGUEST' value, here's what I see:
select id, name, DATALENGTH(name) as 'thedatalengthName' FROM my_user_table WHERE id='123'
123 A 30 mtf
|
|
|
|
|
Right there with Babe
      
Group: General Forum Members
Last Login: Today @ 11:14 AM
Points: 772,
Visits: 1,501
|
|
mrTexasFreedom (1/17/2013)
Lowell, Sadly, this isn't matching any rows: select id, name FROM my_user_table WHERE CHARINDEX(name,CHAR(0),1) > 0 No rows returned. Likewise, the replace operation doesn't touch any of the rows. Ideas? Thank you very much for the help you've provided thus far. I feel like I'm on the verge of clobbering this beast thanks to your assistance! mtf
What do you get if you replace the ASCII function with the UNICODE function in Lowell's original diagnostic code (the one that splits the character string into single characters and returns the character code)? If I understood correctly that your troublesome column is NVARCHAR, the UNICODE function should give you the character code to weed out, and if it's a non-printing control character, it may be different than ASCII 0.
|
|
|
|
|
SSChampion
        
Group: General Forum Members
Last Login: Today @ 9:24 AM
Points: 11,791,
Visits: 28,073
|
|
or another version: a scalar function that would remove anything that was not A-Z/numeric:
same basic logic, i think:
select id, name, dbo.StripNonAlphaNumeric(name) As Fixed from my_user_table
If that's cleaning up the results correctly, a simple replace should fix it:
UPDATE my_user_table SET name = StripNonAlphaNumeric(name)
and the function definition:
CREATE FUNCTION StripNonAlphaNumeric(@OriginalText VARCHAR(8000)) RETURNS VARCHAR(8000) BEGIN DECLARE @CleanedText VARCHAR(8000) ;WITH tally (N) as (SELECT TOP 10000 row_number() OVER (ORDER BY sc1.id) FROM Master.dbo.SysColumns sc1 CROSS JOIN Master.dbo.SysColumns sc2) SELECT @CleanedText = ISNULL(@CleanedText,'') + CASE --ascii numbers are 48(for '0') thru 57 (for '9') WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 48 AND 57 THEN SUBSTRING(@OriginalText,Tally.N,1) --ascii upper case letters A-Z is 65 thru 90 WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 65 AND 90 THEN SUBSTRING(@OriginalText,Tally.N,1) --ascii lower case letters a-z is 97 thru 122 WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 97 AND 122 THEN SUBSTRING(@OriginalText,Tally.N,1) ELSE '' END FROM tally WHERE Tally.N <= LEN(@OriginalText) RETURN @CleanedText END
Lowell
--There is no spoon, and there's no default ORDER BY in sql server either. Actually, Common Sense is so rare, it should be considered a Superpower. --my son
|
|
|
|
|
Ten Centuries
      
Group: General Forum Members
Last Login: Today @ 9:33 AM
Points: 1,333,
Visits: 4,017
|
|
mrTexasFreedom (1/17/2013)
Magoo, For the 'AHSGUEST' value, here's what I see: select id, name, DATALENGTH(name) as 'thedatalengthName' FROM my_user_table WHERE id='123'
123 A 30 mtf
I suspect that somewhere along the line, a unicode string has been badly converted, first to non-unicode with every unicode character being converted to two non-unicode characters, like you are seeing - for example, an A followed by an ascii 0, then converted back to unicode leaving you a unicode string that contains the null (zero) character in every other position.
So, it is easy to remove every other character from the strings, but the big question will be Have you lost any information in that process? if you remove the zeros and the results are OK, then fine...otherwise you may need to find the source of the data and import it correctly.
edit: Try this conversion CONVERT(NVARCHAR(64),CONVERT(VARBINARY(128),CONVERT(CHAR(128),name))) to see if you get something that looks vaguely right...
MM
|
|
|
|