Identifying ASCII characters in NVARCHAR columns

  • Hello,

    I have an issue where I am storing various international characters in nvarchar columns, but need to branch the data at one point of processing so that ASCII characters are run through an additional cleansing process and all non-ASCII characters are set aside.

    Is there a way to identify which nvarchar values are within the ASCII range and can be converted to varchar without corruption? Also, the strings may contain a mix of english and international character sets, so the entire string must be checked and not just the first character.

    i.e.

    Pass:

    'Hello', 'abc123'

    Fail:

    '太平市', 'abc太123'

    Any help would be appreciated and let me know if examples would make this more clear.

    Regards,

    Chris

  • I know CAST will allow you to change it to a VARCHAR, and you'll end up with a variety of ? in the items for non-ASCII characters.

    Perhaps you can somehow search for those placeholders and if they aren't there (CHARINDEX=0), you can do your processing?

  • Thanks Steve, this may be an option if we cannot figure out a better way around this. I just need to confirm that we are not allowing ? characters to exist in any of the strings I am cleansing and, if so, create some type of internal process for escaping these out so they do not trigger the 'international' flag.

    Appreciate the help.

  • proof of concept for Steve's idea;i thought that was kind of neat:

    --results:

    Status cnv id val id val

    ------ ------------------------------ ----------- ------- ----------- -------

    Pass Hello 1 Hello 1 Hello

    Pass abc123 2 abc123 2 abc123

    Fail ??? 3 ??? 3 ???

    Fail abc?123 4 abc?123 4 abc?123

    the test code:

    SELECT 1 as id,N'Hello' As val

    INTo #TMP UNION ALL

    SELECT 2 as id,N'abc123' UNION ALL

    SELECT 3 as id,N'???' UNION ALL

    SELECT 4 as id,N'abc?123'

    SELECT

    CASE

    WHEN t1.val = CONVERT(varchar,t2.val)

    THEN 'Pass'

    ELSE 'Fail'

    END As Status,

    CONVERT(varchar,t2.val) As cnv,*

    FROM #TMP t1

    LEFT OUTER JOIN #TMP t2

    ON t1.Id = t2.ID

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Thanks for the code Lowell, I have been playing around with this myself and it looks like it is going to work out well for my problem.

    Thanks again both of you for the help!

  • Also, once failures are found, you could distinguish actual '?' values from Unicode values and get the percentage of Unicode characters with ...

    SELECT t.val

    , UnicodeStringLen = len(t.val)

    , UnicodeNonQmarkChars = len(replace(t.val,'?',''))

    , NonQmarkChars = len(replace(convert(varchar, t.val),'?',''))

    , UnicodeChars = len(replace(t.val,'?',''))

    - len(replace(convert(varchar, t.val),'?',''))

    , PercentUnicodeChars =

    (LEN(replace(t.val,'?','')) - len(replace(convert(varchar, t.val),'?','')))

    * 100.0 / len(t.val)

    FROM #TMP t

    WHERE t.val <> CONVERT(varchar,t.val) --FAILURES ONLY

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply