Lowell,
Sadly, this isn't matching any rows:
select id, name
FROM my_user_table
WHERE CHARINDEX(name,CHAR(0),1) > 0
No rows returned. Likewise, the replace operation doesn't touch any of the rows. Ideas?
Thank you very much for the help you've provided thus far. I feel like I'm on the verge of clobbering this beast thanks to your assistance!
mtf
Magoo,
For the 'AHSGUEST' value, here's what I see:
select id, name, DATALENGTH(name) as 'thedatalengthName'
FROM my_user_table
WHERE id='123'
123A30
mtf
mrTexasFreedom (1/17/2013)
Lowell,Sadly, this isn't matching any rows:
select id, name
FROM my_user_table
WHERE CHARINDEX(name,CHAR(0),1) > 0
No rows returned. Likewise, the replace operation doesn't touch any of the rows. Ideas?
Thank you very much for the help you've provided thus far. I feel like I'm on the verge of clobbering this beast thanks to your assistance!
mtf
What do you get if you replace the ASCII function with the UNICODE function in Lowell's original diagnostic code (the one that splits the character string into single characters and returns the character code)? If I understood correctly that your troublesome column is NVARCHAR, the UNICODE function should give you the character code to weed out, and if it's a non-printing control character, it may be different than ASCII 0.
Jason Wolfkill
or another version: a scalar function that would remove anything that was not A-Z/numeric:
same basic logic, i think:
select
id,
name,
dbo.StripNonAlphaNumeric(name) As Fixed
from my_user_table
If that's cleaning up the results correctly, a simple replace should fix it:
UPDATE my_user_table
SET name = StripNonAlphaNumeric(name)
and the function definition:
CREATE FUNCTION StripNonAlphaNumeric(@OriginalText VARCHAR(8000))
RETURNS VARCHAR(8000)
BEGIN
DECLARE @CleanedText VARCHAR(8000)
;WITH tally (N) as
(SELECT TOP 10000 row_number() OVER (ORDER BY sc1.id)
FROM Master.dbo.SysColumns sc1
CROSS JOIN Master.dbo.SysColumns sc2)
SELECT @CleanedText = ISNULL(@CleanedText,'') +
CASE
--ascii numbers are 48(for '0') thru 57 (for '9')
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 48 AND 57
THEN SUBSTRING(@OriginalText,Tally.N,1)
--ascii upper case letters A-Z is 65 thru 90
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 65 AND 90
THEN SUBSTRING(@OriginalText,Tally.N,1)
--ascii lower case letters a-z is 97 thru 122
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 97 AND 122
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END
FROM tally WHERE Tally.N <= LEN(@OriginalText)
RETURN @CleanedText
END
Lowell
mrTexasFreedom (1/17/2013)
Magoo,For the 'AHSGUEST' value, here's what I see:
select id, name, DATALENGTH(name) as 'thedatalengthName'
FROM my_user_table
WHERE id='123'
123A30
mtf
I suspect that somewhere along the line, a unicode string has been badly converted, first to non-unicode with every unicode character being converted to two non-unicode characters, like you are seeing - for example, an A followed by an ascii 0, then converted back to unicode leaving you a unicode string that contains the null (zero) character in every other position.
So, it is easy to remove every other character from the strings, but the big question will be Have you lost any information in that process? if you remove the zeros and the results are OK, then fine...otherwise you may need to find the source of the data and import it correctly.
edit:
Try this conversion CONVERT(NVARCHAR(64),CONVERT(VARBINARY(128),CONVERT(CHAR(128),name)))
to see if you get something that looks vaguely right...
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
Magoo,
I think you're right about how there are embedded NULLS between each character. I ran the series of convert functions you provided and it showed AHSGuest††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††.
I really appreciate you providing this. I might use a modified version of this in a stored procedure as I create a programmatic method of cleaning all the columns of all our tables.
I'll update this thread with that stored procedure.
mtf
Lowell,
That was just the ticket!
The select StripNonAlphaNumeric(name) produces exactly the ASCII without the UNICODE NULLs embedded. The update also seems to work in the testing I've done so far!!!
For applying it to the entire table, I wanted to limit it to just the affected rows, so I use the earlier nested selects to narrow it down:
update my_user_table
set name=dbo.StripNonAlphaNumeric(name)
WHERE ID in
(select DISTINCT B.id FROM
(select id, name, N as Position,
SUBSTRING(name,N,1) As TheChar,
ASCII(SUBSTRING(name,N,1)) TheAsciiCode
from my_user_table
CROSS APPLY (SELECT TOP 255 ROW_NUMBER() OVER (ORDER BY (SELECT NULL) -1) AS n
FROM sys.columns) MiniTally
WHERE MiniTally.n BETWEEN 0 AND 255
AND MiniTally.n < LEN(name)+1) B
WHERE B.TheAsciiCode=0)
We have these UNICODE NULLS infecting many of the columns in many of our tables. I'm going to write a stored procedure that utilizes this function to programmatically clean every column in every table. I'll update this thread with my solution soon.
For now, though, I wanted to let you know how much I appreciate the people who have helped me in this thread. I am HUGELY appreciative. Thank you so much.
mtf
mrTexasFreedom I'm glad my code helped a bit, but mister magoo really identified the culprit, i think;
some process imported(and maybe still imports) data that should be nvarchar instead of varchar;
you should try to track down whatever that process is and fix it at the source; otherwise this cleanup-after-the-mess thing is going to be needed every time that other process runs.
Lowell
Thank you for the endorsement on Magoo's convert trick. I am going to adjust it so the final character isn't repeated and if I can get that to work, I'll use it in my stored proc.
The source of those documents has been identified and we're putting processes in place to prevent more corrupt data from being imported. Our developers are also working on a front-end filter to dump anything that isn't standard UTF8. I'm the person tasked with cleaning up the mess that's already been created.
Have a great weekend!
mtf
You should be able to just wrap the existing method in LEFT(....,LEN(name)) to get the correct answer...
but only if the original data was no more than half as long as your column can hold, otherwise you have lost some data...
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
Viewing 15 posts - 1 through 15 (of 17 total)
You must be logged in to reply to this topic. Login to reply