Japanese characters as question marks

  • Hi,

    I'm having a bit of a strange issue with Japanese characters.

    1) They're in the Excel file I'm importing.

    2) They show up in the raw table I first load the file to, when I select results to text or file, but not to grid. In grid they show up as black rectangles that are hollow in the center.

    3) They don't show up at all in the secondary staging table, only question marks.

    I will post anything I can to help, but I'm not sure how to generate dummy data. I can't post what I have, because it's from a client file.

    I can post my table creation code, but they're kind of ridiculous stored procedures full of cursors (which will probably give some of you a twitch) that have been in place since before I ever had an inkling I'd be touching SQL at work, and a year and a half into my SQL life I have no idea how to improve upon.

    What I'm really looking for are some things to check on to see why this might be happening. Maybe a setting somewhere, but I'm not counting on being that lucky.

    I've tried various collate selects, but they don't change how the data displays in either table. I've put in a request for the MS OLEDB driver to be updated (it's currently 4.0), but I'm really not even sure if that could be an issue. Windows *seems* to have the language packs I would need, since I can see the characters in some places.

    Anyway, I'd appreciate any suggestions.

    Thanks

  • First thing that comes to mind is Unicode conversion problems. That data must be stored as nvarchar() or nchar(). If any table application, column, temp table column, or variable touched that data along its path to the table was char() and not nchar(), then it would get mangled.

    Eddie Wuerch
    MCM: SQL

  • The ? marks are a dead giveaway that non-printable ANSI characters are being pushed into an ANSI column or variable (i.e. VARCHAR or CHAR). The hollow squares might be a problem if some of your characters are multi-byte Unicode characters, of which I think Japanese has a few. SQL Server (at least 2005) does not handle multi-byte character sets. SQL 2012 introduced some support for them through Collations. PS The squares could also be a simple display issue depending on what you're viewing the data with. You can always check by using the SUBSTRING() and UNICODE() or ASCII() functions to see what is actually stored.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (4/11/2013)


    The ? marks are a dead giveaway that non-printable ANSI characters are being pushed into an ANSI column or variable (i.e. VARCHAR or CHAR). The hollow squares might be a problem if some of your characters are multi-byte Unicode characters, of which I think Japanese has a few. SQL Server (at least 2005) does not handle multi-byte character sets. SQL 2012 introduced some support for them through Collations. PS The squares could also be a simple display issue depending on what you're viewing the data with. You can always check by using the SUBSTRING() and UNICODE() or ASCII() functions to see what is actually stored.

    Hm. All the columns I have are nvarchar. I can SEE them as Japanese characters correctly using SSMS 2008 in the raw table. 2005 is hopeless, I'm convinced. However, they're both still the same in the second table.

    Using ascii() I get a value of 63. Using unicode() I get a 5 digit string.

    So, I guess I have to see if the columns that the characters are going into are ANSI, and change them to Unicode, if that's possible?

    Thanks

  • For terminology's sake so we're speaking the same language:

    ANSI = VARCHAR/CHAR

    Unicode = NVARCHAR/NCHAR

    In SQL Server, Unicode = UCS-2

    Still not clear if all your columns are Unicode or whether you have some ANSI columns. If all Unicode then you can stop using the ASCII() function and stick with the UNICODE() function.

    SQL Server will not munge data inserted into Unicode columns. If you're seeing ? marks in a Unicode column then those characters were munged prior to entering the table.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (4/11/2013)


    For terminology's sake so we're speaking the same language:

    ANSI = VARCHAR/CHAR

    Unicode = NVARCHAR/NCHAR

    In SQL Server, Unicode = UCS-2

    Still not clear if all your columns are Unicode or whether you have some ANSI columns. If all Unicode then you can stop using the ASCII() function and stick with the UNICODE() function.

    SQL Server will not munge data inserted into Unicode columns. If you're seeing ? marks in a Unicode column then those characters were munged prior to entering the table.

    All I know is that when I bypass the stored procedure (kind of) and just do a straight update from the raw table matching on email address, the Japanese characters display properly in SSMS 2008, and as squares in 2005. Which makes about as much sense as I can ask for.

    If you're feeling charitable, I can post the stored procedure that copies data from raw to process table. Maybe you can make heads or tails of where it "munges" things. If not, I absolutely understand.

    Thanks again

  • Sure. Post it. If you want to attach some sample UPDATE statements and the table defs too as a text file so the browser doesn't get in the way that might be helpful too.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (4/11/2013)


    Sure. Post it. If you want to attach some sample UPDATE statements and the table defs too as a text file so the browser doesn't get in the way that might be helpful too.

    I've zipped up the five procedures I run to import a file and set up tables. I've also included the fairly innocuous Prefix column from the sample file I have so you can see some data that I'm working with. Let me know if you need anything else.

    https://www.dropbox.com/s/xcbn1gx0blbj55p/erikd.zip

  • Cool. I can't get to dropbox from here but I'll have a look when I am home.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I need the contents of Sample.dbo.sample_fields in order to know the second table definition. Can you simply script out the resulting table definitions for the project_id in question?

    This query in dbo.CreateTargetTable looks like it cannot allow for a Unicode column to be created:

    DECLARE Column_Cursor CURSOR

    FOR

    SELECT source_name,

    coltype = CASE WHEN source_type = 1 THEN 'varchar(' + CAST(field_varSize AS VARCHAR(4)) + ')'

    WHEN source_type = 2 THEN 'int'

    WHEN source_type = 3 THEN 'datetime'

    /* Code 4 converts Float to Varchar to get around Excel opendatasource */

    WHEN source_type = 4 THEN 'varchar(' + CAST(field_varSize AS VARCHAR(4)) + ')'

    END

    FROM sample_fields

    WHERE ID_Project = @ProjectID

    AND source_name IS NOT NULL

    ORDER BY source_sequence

    I am assuming that either source_type 1 or 4 should generate an NVARCHAR column? If the second table is ANSI then you will lose the Unicode characters on INSERT and that is likely where the ? marks are coming from.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Absolutely. Here you go.

    https://www.dropbox.com/s/sp6b6kf6z997g55/erikd.csv

  • Where I have your workflow:

    1. dbo.LoadPreLoadFromExcel brings the contents of the Excel into a new "load" table

    2. dbo.CreateTargetTable creates a new "target" table with typed columns from data in sample_fields (this proc will not create tables with Unicode columns)

    3. dbo.LoadTargetFromFile copies data from "load" table created into "target" table

    4. dbo.CreateProcessTable creates a new "process" table

    5. dbo.LoadProcessfromTarget copies data from "target" to "process" table

    I am assuming you sent me the definition for the "process" table because it has NVARCHAR columns and dbo.CreateTargetTable does not know how to create NVARCHAR columns in the "target" table. I think your problem is in the creation of the "target" table, i.e. in step 2 above. Can you please send the table definitions for the "load", "target" and "process" tables for this project, scripted out of SSMS by right-clicking the each and selecting "Script Table As" > "Create" as opposed to a copy/paste from the grid?

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Yeah, I sort of forgot to mention that I numbered them in the order I use them. Heh.

    Here's the scripted output you asked for.

    https://www.dropbox.com/s/inhtv19uyo04xyk/erikd_tables.zip

    I've also included the full table create statement, along with _raw and _load, just in case.

  • As suspected

    [Prefix] [varchar](64) NULL,

    in the "load" table is the best candidate for causing the issue. Should be NVARCHAR.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • The other fields it's occurring in are first name, last name, and company name. Crud.

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply