August 22, 2012 at 11:11 am
I'm storing information in an XML file that is UTF-8.
I have a stored procedure I use to query some tables and structure the data in XML format, which I then save into an XML file with UTF-8.
I noticed that there are some characters that make it into this XML file, that violate the UTF-8 encoding. How do I avoid that?
August 23, 2012 at 2:23 pm
Looks like this might be in the wrong section, how do I move it to the XML secion?
August 23, 2012 at 2:42 pm
ok, i just googled "Non UTF-8 Characters", and the only thing i can find is when something got malformed; otherwise, It looks like UTF covers teh whole range of possible characters.
http://stackoverflow.com/questions/1379416/insert-utf8-data-into-a-ms-sql-server-2008
http://magp.ie/2011/01/06/remove-non-utf8-characters-from-string-with-php/
the two links above are describing fixes for malformed strings;
can you be more specific about what it is you want to remove?
do you really mean high ascii characters, ie >127 like some of these?
('ÀAlbèert ËEîinstêeiìn ÌInstìitúutëe - MPG')
do you mean escaping special characters that xml requires to be htmlized? like > to &l t ;?
Lowell
August 24, 2012 at 9:57 am
I didn't write this code and can't remember where I got it so apologies to the original author. In my case I only cared about non-ASCII characters outside the range 32-127 because the SQL XML parser handles things like '<' and '>' etc. If you need to convert other characters or tags then the case statement will need a little tweaking. Undesirable characters are replaced by '?' and then optionally all question marks are replaced with blanks.
CREATE FUNCTION dbo.svfCleanXML
(
@OriginalText VARCHAR(8000)
)
RETURNS VARCHAR(8000)
BEGIN
/* This function requires a Tally or Numbers table */
/* http://www.sqlservercentral.com/articles/T-SQL/62867/ */
DECLARE @CleanedText VARCHAR(8000)
SELECT
@CleanedText = ISNULL(@CleanedText,'')
+ CASE
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 32 AND 127
THEN SUBSTRING(@OriginalText,Tally.N,1)
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) <= 32
THEN CONVERT(VARCHAR,ASCII(SUBSTRING(@OriginalText,Tally.N,1)))
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) >= 128
THEN CONVERT(VARCHAR,ASCII(SUBSTRING(@OriginalText,Tally.N,1)))
END
FROM
dbo.Tally
WHERE
Tally.N <= LEN(@OriginalText)
SET @CleanedText = REPLACE(@CleanedText,'?','') --optional
RETURN @CleanedText
/*
SELECT dbo.svfCleanXML('[xml_string_here]') AS cleanString
SELECT CONVERT(XML,dbo.svfCleanXML('[xml_string_here]')) AS cleanXML
*/
END
August 24, 2012 at 11:01 am
and here's a similar one i wrote long ago which converts things like © to the html entity ©
I think it was specifically due to XML:
CREATE FUNCTION HTMLEncode(@OriginalText VARCHAR(4000))
RETURNS VARCHAR(8000)
BEGIN
DECLARE @CleanedText VARCHAR(8000);
WITH E01(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1), -- 10 or 10E01 rows
E02(N) AS (SELECT 1 FROM E01 a, E01 b), -- 100 or 10E02 rows
E04(N) AS (SELECT 1 FROM E02 a, E02 b), -- 10,000 or 10E04 rows
-- E08(N) AS (SELECT 1 FROM E04 a, E04 b), --100,000,000 or 10E08 rows
-- E16(N) AS (SELECT 1 FROM E08 a, E08 b), --10E16 or more rows than you'll EVER need
Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E04)
SELECT @CleanedText = ISNULL(@CleanedText,'') +
CASE
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =34) THEN '"' --" " " quotation mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =39) THEN ''' --' ' ' apostrophe
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =38) THEN '&' --& & & ampersand
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =60) THEN '<' --< < < less-than
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =62) THEN '>' --> > > greater-than
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =160) THEN ' ' -- non-breaking space
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =161) THEN '¡' --¡ ¡ ¡ inverted exclamation mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =162) THEN '¢' --¢ ¢ ¢ cent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =163) THEN '£' --£ £ £ pound
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =164) THEN '¤' --¤ ¤ ¤ currency
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =165) THEN '¥' --¥ ¥ ¥ yen
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =166) THEN '¦' --¦ ¦ ¦ broken vertical bar
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =167) THEN '§' --§ § § section
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =168) THEN '¨' --¨ ¨ ¨ spacing diaeresis
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =169) THEN '©' --© © © copyright
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =170) THEN 'ª' --ª ª ª feminine ordinal indicator
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =171) THEN '«' --« « « angle quotation mark (left)
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =172) THEN '¬' --¬ ¬ ¬ negation
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =173) THEN '' -- soft hyphen
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =174) THEN '®' --® ® ® registered trademark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =175) THEN '¯' --¯ ¯ ¯ spacing macron
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =176) THEN '°' --° ° ° degree
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =177) THEN '±' --± ± ± plus-or-minus
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =178) THEN '²' --² ² ² superscript 2
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =179) THEN '³' --³ ³ ³ superscript 3
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =180) THEN '´' --´ ´ ´ spacing acute
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =181) THEN 'µ' --µ µ µ micro
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =182) THEN '¶' --¶ ¶ ¶ paragraph
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =183) THEN '·' --· · · middle dot
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =184) THEN '¸' --¸ ¸ ¸ spacing cedilla
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =185) THEN '¹' --¹ ¹ ¹ superscript 1
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =186) THEN 'º' --º º º masculine ordinal indicator
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =187) THEN '»' --» » » angle quotation mark (right)
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =188) THEN '¼' --¼ ¼ ¼ fraction 1/4
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =189) THEN '½' --½ ½ ½ fraction 1/2
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =190) THEN '¾' --¾ ¾ ¾ fraction 3/4
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =191) THEN '¿' --¿ ¿ ¿ inverted question mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =215) THEN '×' --× × × multiplication
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =247) THEN '÷' --÷ ÷ ÷ division
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =192) THEN 'À' --À À À capital a, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =193) THEN 'Á' --Á Á Á capital a, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =194) THEN 'Â' --Â Â Â capital a, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =195) THEN 'Ã' --Ã Ã Ã capital a, tilde
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =196) THEN 'Ä' --Ä Ä Ä capital a, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =197) THEN 'Å' --Å Å Å capital a, ring
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =198) THEN 'Æ' --Æ Æ Æ capital ae
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =199) THEN 'Ç' --Ç Ç Ç capital c, cedilla
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =200) THEN 'È' --È È È capital e, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =201) THEN 'É' --É É É capital e, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =202) THEN 'Ê' --Ê Ê Ê capital e, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =203) THEN 'Ë' --Ë Ë Ë capital e, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =204) THEN 'Ì' --Ì Ì Ì capital i, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =205) THEN 'Í' --Í Í Í capital i, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =206) THEN 'Î' --Î Î Î capital i, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =207) THEN 'Ï' --Ï Ï Ï capital i, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =208) THEN 'Ð' --Ð Ð Ð capital eth, Icelandic
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =209) THEN 'Ñ' --Ñ Ñ Ñ capital n, tilde
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =210) THEN 'Ò' --Ò Ò Ò capital o, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =211) THEN 'Ó' --Ó Ó Ó capital o, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =212) THEN 'Ô' --Ô Ô Ô capital o, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =213) THEN 'Õ' --Õ Õ Õ capital o, tilde
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =214) THEN 'Ö' --Ö Ö Ö capital o, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =216) THEN 'Ø' --Ø Ø Ø capital o, slash
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =217) THEN 'Ù' --Ù Ù Ù capital u, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =218) THEN 'Ú' --Ú Ú Ú capital u, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =219) THEN 'Û' --Û Û Û capital u, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =220) THEN 'Ü' --Ü Ü Ü capital u, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =221) THEN 'Ý' --Ý Ý Ý capital y, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =222) THEN 'Þ' --Þ Þ Þ capital THORN, Icelandic
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =223) THEN 'ß' --ß ß ß small sharp s, German
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =224) THEN 'à' --à à à small a, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =225) THEN 'á' --á á á small a, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =226) THEN 'â' --â â â small a, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =227) THEN 'ã' --ã ã ã small a, tilde
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =228) THEN 'ä' --ä ä ä small a, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =229) THEN 'å' --å å å small a, ring
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =230) THEN 'æ' --æ æ æ small ae
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =231) THEN 'ç' --ç ç ç small c, cedilla
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =232) THEN 'è' --è è è small e, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =233) THEN 'é' --é é é small e, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =234) THEN 'ê' --ê ê ê small e, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =235) THEN 'ë' --ë ë ë small e, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =236) THEN 'ì' --ì ì ì small i, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =237) THEN 'í' --í í í small i, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =238) THEN 'î' --î î î small i, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =239) THEN 'ï' --ï ï ï small i, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =240) THEN 'ð' --ð ð ð small eth, Icelandic
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =241) THEN 'ñ' --ñ ñ ñ small n, tilde
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =242) THEN 'ò' --ò ò ò small o, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =243) THEN 'ó' --ó ó ó small o, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =244) THEN 'ô' --ô ô ô small o, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =245) THEN 'õ' --õ õ õ small o, tilde
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =246) THEN 'ö' --ö ö ö small o, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =248) THEN 'ø' --ø ø ø small o, slash
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =249) THEN 'ù' --ù ù ù small u, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =250) THEN 'ú' --ú ú ú small u, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =251) THEN 'û' --û û û small u, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =252) THEN 'ü' --ü ü ü small u, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =253) THEN 'ý' --ý ý ý small y, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =254) THEN 'þ' --þ þ þ small thorn, Icelandic
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =255) THEN 'ÿ' --ÿ ÿ ÿ small y, umlaut mark
ELSE SUBSTRING(@OriginalText,Tally.N,1)
END
FROM Tally
WHERE Tally.N <= LEN(@OriginalText)
RETURN @CleanedText
END
Lowell
August 24, 2012 at 11:04 am
well, the forum inteprets the escaped values as something it should render...whoops.
link to the function as raw text here:
Lowell
August 31, 2012 at 3:44 pm
The character that was giving me problems is â (0xE2)
August 31, 2012 at 6:39 pm
Khades (8/31/2012)
The character that was giving me problems is â (0xE2)
Here's a revised version of the function I posted above which in addition to checking for normal ASCII characters will remove any other extended characters either in their native form (â) or encoded form (&#x). There are some examples to run at the bottom of the code.
CREATE FUNCTION [dbo].[svfHTMLEncodeXML]
(
@OriginalText VARCHAR(8000)
)
RETURNS VARCHAR(8000)
BEGIN
DECLARE
@CleanedText VARCHAR(8000)
,@pos INT
,@chunk VARCHAR(10)
SELECT @CleanedText = COALESCE(@CleanedText + '','')
+ CAST(CleanText.CharVal AS VARCHAR(8000))
FROM
(
SELECT
Result.CharVal
FROM
(
SELECT
N AS RowNum
,(CASE
WHEN ASCII(SUBSTRING(@OriginalText,t.N,1)) BETWEEN 32 AND 127
THEN CAST(SUBSTRING(@OriginalText,t.N,1) AS VARCHAR(50))
WHEN ASCII(SUBSTRING(@OriginalText,t.N,1)) < 32
THEN NULL
WHEN ASCII(SUBSTRING(@OriginalText,t.N,1)) >= 127
THEN NULL
END)
AS CharVal
FROM
dbo.Tally AS t
WHERE
t.N <= LEN(@OriginalText)
) AS Result
WHERE
Result.CharVal IS NOT NULL
) CleanText
SET @pos = PATINDEX('%[&#x]%',@CleanedText)
WHILE @pos <> 0
BEGIN
SET @chunk = SUBSTRING(@CleanedText,@pos,6)
SET @CleanedText = REPLACE(@CleanedText,@chunk,'')
SET @pos = PATINDEX('%[&#x]%',@CleanedText)
SET @chunk = NULL
END
SET @pos = PATINDEX('%[0x]%',@CleanedText)
WHILE @pos <> 0
BEGIN
SET @chunk = SUBSTRING(@CleanedText,@pos,6)
SET @CleanedText = REPLACE(@CleanedText,@chunk,'')
SET @pos = PATINDEX('%[0x]%',@CleanedText)
SET @chunk = NULL
END
/* This optional section removes tabs and excess spaces */
SET @CleanedText = REPLACE(@CleanedText,CHAR(9),' ')-- convert tabs to spaces
SET @CleanedText =
REPLACE(
REPLACE(
REPLACE(
@CleanedText
,' ',' '+CHAR(7))
,CHAR(7)+' ','')
,CHAR(7),'')
SET @CleanedText = REPLACE(@CleanedText,'> <','><')
RETURN @CleanedText
/*
SELECT dbo.svfHTMLEncodeXML('O&xxxx;Brien') -- sub '#x0D' for 'xxxx' to test
SELECT dbo.svfHTMLEncodeXML('OÇBrien')
SELECT dbo.svfHTMLEncodeXML('O¢Brien')
SELECT dbo.svfHTMLEncodeXML('O&xxxx;Brien') -- sub '#x0F' for 'xxxx' to test
SELECT dbo.svfHTMLEncodeXML('OâBrien')
*/
/* --this is text with an invisible line break character in it
SELECT dbo.svfHTMLEncodeXML('O
Brien')
*/
/* -- notice the â (0xE2) in the 2nd position which the function will remove
SELECT CONVERT(XML,dbo.svfHTMLEncodeXML(
'<âRoot type="demographics">
<PersonalData>
<PersonName type="primary">
<GivenName>John</GivenName>
<MiddleName>A</MiddleName>
<FamilyName>Doe</FamilyName>
</PersonName>
</PersonalData>
</Root>'))
*/
END
September 4, 2012 at 3:10 pm
Thank you once again. I appreciate your help.
What does the dbo.Tally table hold? I notice we keep querying the N column.
September 4, 2012 at 3:16 pm
Khades (9/4/2012)
Thank you once again. I appreciate your help.What does the dbo.Tally table hold? I notice we keep querying the N column.
A "Tally" table (sometimes called a "Numbers" table) is just a table with a large column of sequentially ordered numbers which can be joined to other tables to avoid using cursors or while loops.
This article will giet you started on the subject: http://www.sqlservercentral.com/articles/T-SQL/62867/
BTW, in my last post above I did include some WHILE loops which could almost certainly be replaced by joining to a tally table. But it was late when I posted that and I wanted to go home.
Besides, in my particular case where I'm using a function very similar to this, the number of loops are likely to be minimal as these stray characters only occur infrequently and I'm using the function against a single block of code at a time and not in a SELECT statement. If using this function in a SELECT statement it would be worth the effort to try and replace them with some sort of non-looping construct.
Viewing 10 posts - 1 through 10 (of 10 total)
You must be logged in to reply to this topic. Login to reply
This website stores cookies on your computer.
These cookies are used to improve your website experience and provide more personalized services to you, both on this website and through other media.
To find out more about the cookies we use, see our Privacy Policy