Weird issue with unicode and pattern matching

  • Hiya,

    I've been trying to find and filter out data that can cause FOR XML to fail with the error:

    FOR XML could not serialize the data for node ?? because it contains a character (0x0006) which is not allowed in XML.

    I thought it would be simple, just identify the rows that don't match the XML spec http://www.w3.org/TR/REC-xml/#charsets and filter them out.

    However, the following doesn't work.

    select fieldname

    from tablename

    where fieldname

    like N'%[^'+nchar(0x9)+nchar(0xA)+nchar(0xD)+nchar(0x20)+N'-'+nchar(0xD7FF)+N']%'

    Not only does this fail to find the offending rows, when I checked by inverted it by taking out the "^" it still returns no rows.

    I'm at a loss as to why it's doing this.

    When I reduced 0xD7FF down to 0xFF it returns data, but I cannot add the other range back in without all rows not matching.

    I've experimented with values and found 0x02E9 was the highest I could go without all values vanishing. Even then, numbers lower than that caused a large variation in the number of rows returned.

    My Field is NVarchar(100) in SQL_Latin1_General_CP1_CI_AS.

    Thanks

    Ian.

  • Maybe you could do this to find offending rows.

    select fieldname

    from tablename

    EXCEPT

    select fieldname

    from tablename

    where fieldname

    NOT LIKE N'%[^'+nchar(0x9)+nchar(0xA)+nchar(0xD)+nchar(0x20)+N'-'+nchar(0xD7FF)+N']%'

    The NOT LIKE N'%[^Pattern]%' will return non-offending rows. The EXCEPT will return the other rows.

    You have a wide range in there, that might be causing the lack of results.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Right, I think I've found the issue.

    As you can see from this SQLFiddle http://sqlfiddle.com/#!6/7801f/6 (watch out, sqlfiddle doesn't display the char that caused the issue for me in the first place!), in the first recordset there's two columns, the first is a patindex without changing the collation, the second is the same patindex with an enforced "bin" collation.

    You can see that the normal collation doesn't match any characters, but the second does.

    What's more interesting about this is when you look at the second query.

    For for some reason it would seem that the patindex is matching the range, but excluding the hyphen from the characters matched despite the hyphen being within the range selected.

    Unless I'm very much mistaken, this would look like a bug in SQL server.

    As soon as you enforce a binary collation this behaviour goes away.

    Ian.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply