Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««123»»

Bad data import Expand / Collapse
Author
Message
Posted Thursday, April 29, 2010 9:19 AM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Wednesday, September 24, 2014 4:48 PM
Points: 1,142, Visits: 2,687
skjoldtc (4/29/2010)
65533 is a valid unicode number but represents a special replacement character. It is the highest value in the character set. dbowlin makes a good point about what values to check for. You would ahve to know the source and the target databases to know what's reasonable. I doubt that 65533 is reasonable under 99.9% of the cases but there may be rare instances.

Good question, though.


I believe 65533 is the highest unicode value. I don't think you identified what the junk data actually was but that it is outside of what can be interpreted. Out of curiosity did you look as the offending line, specifically around this area to see what the actuall hex was?


---------------------------------------------------------------------
Use Full Links:
KB Article from Microsoft on how to ask a question on a Forum
Post #912970
Posted Thursday, April 29, 2010 9:47 AM


SSCoach

SSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoach

Group: General Forum Members
Last Login: Today @ 10:55 AM
Points: 17,707, Visits: 15,569
Great question. I learned something from this one.

Nice job.




Jason AKA CirqueDeSQLeil
I have given a name to my pain...
MCM SQL Server


SQL RNNR

Posting Performance Based Questions - Gail Shaw
Posting Data Etiquette - Jeff Moden
Hidden RBAR - Jeff Moden
VLFs and the Tran Log - Kimberly Tripp
Post #913018
Posted Thursday, April 29, 2010 11:13 AM
SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: Wednesday, September 17, 2014 9:21 AM
Points: 1,676, Visits: 1,757
skjoldtc (4/29/2010)
65533 is a valid unicode number but represents a special replacement character. It is the highest value in the character set. dbowlin makes a good point about what values to check for. You would ahve to know the source and the target databases to know what's reasonable. I doubt that 65533 is reasonable under 99.9% of the cases but there may be rare instances.

Good question, though.

Do you know why 65533 is the highest value? Theoretically, the highest should be 65535. This is consistent with nchar implementation, i.e.

select nchar(65533); -- returns value
select nchar(65535); -- returns value
select nchar(65536); -- returns null because 65536 is obviously not valid
-- as it cannot really fit into 2 bytes

Good question, I learned something new today.

Oleg
Post #913118
Posted Thursday, April 29, 2010 12:12 PM
SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Today @ 9:57 AM
Points: 2,580, Visits: 3,875
Oleg Netchaev (4/29/2010)
skjoldtc (4/29/2010)
65533 is a valid unicode number but represents a special replacement character. It is the highest value in the character set. dbowlin makes a good point about what values to check for. You would ahve to know the source and the target databases to know what's reasonable. I doubt that 65533 is reasonable under 99.9% of the cases but there may be rare instances.

Good question, though.

Do you know why 65533 is the highest value? Theoretically, the highest should be 65535. This is consistent with nchar implementation, i.e.

select nchar(65533); -- returns value
select nchar(65535); -- returns value
select nchar(65536); -- returns null because 65536 is obviously not valid
-- as it cannot really fit into 2 bytes

Good question, I learned something new today.

Oleg


No. I don't know why that is.

I now recall that the AS400 has a max CCSID (coded character set ID) of 65533. That makes sense since the original data came from a mainframe. IBM mainframes and AS400 both use EBCDIC. There are some unprintable and undisplayable characters and 65533 is used as a replacement on those systems. Basically, IIRC, it ends being a printable/displayble character defined system-wide that is substituted. So, you could define that a ~ (or any other printable/displayable character) prints or displays instead of throwing an error.

At my age, that's a lot to recall, so, I may be mistaken.
Post #913165
Posted Thursday, April 29, 2010 12:57 PM
SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: Wednesday, September 17, 2014 9:21 AM
Points: 1,676, Visits: 1,757
skjoldtc (4/29/2010)


No. I don't know why that is.

I now recall that the AS400 has a max CCSID (coded character set ID) of 65533. That makes sense since the original data came from a mainframe. IBM mainframes and AS400 both use EBCDIC. There are some unprintable and undisplayable characters and 65533 is used as a replacement on those systems. Basically, IIRC, it ends being a printable/displayble character defined system-wide that is substituted. So, you could define that a ~ (or any other printable/displayable character) prints or displays instead of throwing an error.

At my age, that's a lot to recall, so, I may be mistaken.

Thank you very much for information, now everything makes perfect sense. I just found on the List of Unicode Characters page that the maximum available printable character code is not FFFF (65535) like I assumed, but is indeed FFFD (65533), and it is officially called "Replacement Character". 65534 and 65535 do not represent anything, they are so-called noncharacters much like anything in the range from FDD0 to FDEF (64976 to 65007).

Oleg
Post #913200
Posted Thursday, April 29, 2010 1:42 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: Administrators
Last Login: Today @ 10:44 AM
Points: 31,078, Visits: 15,521
Very interesting question. Once again the debate has proven extremely valuable.






Follow me on Twitter: @way0utwest

Forum Etiquette: How to post data/code on a forum to get the best help
Post #913232
Posted Thursday, April 29, 2010 3:17 PM
Right there with Babe

Right there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with Babe

Group: General Forum Members
Last Login: Monday, September 22, 2014 3:13 PM
Points: 774, Visits: 1,179
As usual the discussion adds to the knowledge I gain from the QotD.

Thanks for the knowledge.
Post #913285
Posted Thursday, April 29, 2010 11:31 PM


SSC Veteran

SSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC Veteran

Group: General Forum Members
Last Login: Thursday, September 18, 2014 11:12 AM
Points: 212, Visits: 1,116
Hey,

There is missing any concrete example "Not Concrete Programming " to get rid of problem "Data could not be imported because text was truncated or characters do not exist in the destination codepage"

If i identified it is the code page conflict... how it can be done with in the code, rather than to manually change the destination/code pages.

Any Concrete or Flexible bee over it?

Thanks
Post #913424
Posted Friday, April 30, 2010 2:51 AM


SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Tuesday, July 29, 2014 8:47 AM
Points: 2,296, Visits: 1,427
The CODEPOINT() is a rarely used function in the SSIS expression language, but can be an effective tool in your ETL arsenal in some cases.
Post #913521
Posted Friday, April 30, 2010 3:17 AM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Today @ 6:46 AM
Points: 3,982, Visits: 5,228
Abrar Ahmad_ (4/29/2010)
Hey,

There is missing any concrete example "Not Concrete Programming " to get rid of problem "Data could not be imported because text was truncated or characters do not exist in the destination codepage"

If i identified it is the code page conflict... how it can be done with in the code, rather than to manually change the destination/code pages.

Any Concrete or Flexible bee over it?

Thanks


In SSIS, an example of how to identify and correct would be:
CODEPOINT(ColumnName) < MinCodePageValue || CODEPOINT(ColumnName) >MaxCodePageValue ? (DT_WSTR,26)NULL(DT_WSTR,26): (DT_WSTR,26)ColumnName

a T-SQL equivalent would be:
SELECT CASE 
WHEN ASCII(ColumnName) < MinCodePageValue OR ASCII(ColumnName) > MaxCodePageValue THEN NULL
ELSE ColumnName
END AS ColumnName
.
.
.



____________________________________________
Space, the final frontier? not any more...
All limits henceforth are self-imposed.
“libera tute vulgaris ex”
Post #913534
« Prev Topic | Next Topic »

Add to briefcase ««123»»

Permissions Expand / Collapse