Remove non printable characters from source file

  • In SSIS 2008 importing from AS400 mainframe and got truncation errors. Could not redirect row. Could not handle with replace unwanted characters control from Task Factory . So tried script and although it would strip out the "standard" non-printable characters 0-31 there were other Unicode characters that made it puke at the end of some records. (http://www.ssec.wisc.edu/~tomw/java/unicode.html#x0000)

    So after trying alot of things finally got the following script to work in VB.NET. (Rant: Pardon me all you C# Mr. Spok types I am just a SQL DBA and have no desire to ever learn C#, I have enough to worry about with partitioning, replication, clustering etc..! But I did some VB6 and VBScript back in the day so I had to go with what I knew 🙂

    At any rate this worked.

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

    Dim LoopCounter As Integer 'Loop Through valid Character Checks for each Row

    Dim StringCounter As Integer 'Loop Through positions in field

    Dim LastGoodPosition As Integer

    LoopCounter = 0

    'Unicode values Between 32 and 591 acceptable Latin characters

    'See http://www.ssec.wisc.edu/~tomw/java/unicode.html#x0000

    'Remove 0-31 as non printable characters

    'If over 591 then remove trailing characters

    While LoopCounter <= 591

    'Remove non printable characters 0-31

    If Row.NAME78.ToString.Contains(ChrW(LoopCounter)) Then

    'MsgBox(Row.PRODNU.ToString & " " & Row.NAME78.ToCharArray) 'Debug record key and string value

    'MsgBox(AscW(ChrW(LoopCounter))) 'Debug Unicode value

    Replace(Row.NAME78.ToString, AscW(ChrW(LoopCounter)), "")

    End If

    LoopCounter = LoopCounter + 1

    '32 through 591 ok skip these

    If LoopCounter = 32 Then

    LoopCounter = 592

    End If

    'Remove characters > than 591

    If LoopCounter > 591 Then

    'MsgBox(Row.PRODNU.ToString & " " & Row.NAME78.ToCharArray) 'Debug record key and string value

    'MsgBox("Unicode Character > 591") 'Over value we want to accept

    StringCounter = Len(Row.NAME78.ToString) 'Length of NAME78 field

    'Start from end of string and work backwards to get last good character

    While StringCounter >= 1

    'Keep position of last valid value

    If AscW(Mid(Row.NAME78.ToString, StringCounter, 1)) > 591 Then

    LastGoodPosition = StringCounter - 1

    End If

    StringCounter = StringCounter - 1

    End While

    'Keep record up to bad character

    Row.NAME78 = Left(Row.NAME78.ToString, LastGoodPosition)

    End If

    End While

    End Sub

  • OK corrected last part of script. Still somewhat of a hack but cannot seem to trap character coming in some value 65533 max Unicode value so just try to keep entire record if I can if not then arbitrarily keep first 45 as when I keep 50 it bombs.

    'Remove characters over 591

    If LoopCounter > 591 Then

    'MsgBox(Row.PRODNU.ToString & " " & Row.NAME78.ToCharArray) 'Debug record key and string value

    'MsgBox("Unicode Character > 591") 'Over value we want to accept

    StringCounter = Len(Row.NAME78.ToString) 'Length of NAME78 field

    LastGoodPosition = StringCounter 'Initialize to full string

    While StringCounter >= 1

    If Len(Mid(Row.NAME78.ToString, StringCounter, 1)) > 0 Then

    LastGoodPosition = StringCounter

    End If

    StringCounter = StringCounter - 1

    End While

    'Keep record up to bad character

    If StringCounter >= 1 Then

    Row.NAME78 = Left(Row.NAME78.ToString, LastGoodPosition)

    Else

    'Bad character keep what you can

    Row.NAME78 = Left(Row.NAME78.ToString, 45)

    End If

    End If

  • My apologies please disregard all prior posts. Corrected and yes tested code below to remove non printable and other unwanted characters.

    (Rant: People who post code not thoroughly tested = (schmuck, fool, moron, idiot, Senior Dumas, Me!!)

    1000 apologies for prior posts.

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

    Dim LoopCounter As Integer 'Loop Through valid Character Checks for each Row

    LoopCounter = 0 'Unicode starts with null character which is 0

    'Unicode values Between 32 and 126 acceptable Latin characters

    'Unicode values Between 192 and 563 accepatble Latin characters

    'See http://www.ssec.wisc.edu/~tomw/java/unicode.html#x0000

    'Remove 0-31 as non printable characters and 127-191

    While LoopCounter <= 563

    'Remove non printable characters 0-31 and 127-191

    If Row.NAME78.ToString.Contains(ChrW(LoopCounter)) Then

    Row.NAME78 = Replace(Row.NAME78.ToString, ChrW(LoopCounter).ToString, String.Empty)

    End If

    LoopCounter = LoopCounter + 1

    '32 to 126 keep

    If LoopCounter = 32 Then

    LoopCounter = 127

    End If

    '192 to 563 keep

    If LoopCounter = 192 Then

    LoopCounter = 564

    End If

    End While

    End Sub

  • Oh and FYI the other characters besides 0-31 were 147 and 148 that caused the truncation errors. Those were the ones I was not expecting.

  • jsquared24 (7/13/2011)


    Oh and FYI the other characters besides 0-31 were 147 and 148 that caused the truncation errors. Those were the ones I was not expecting.

    Well, you solved your problem entirely yourself in this thread 😀

    Anyway, thanks for posting the solution, it will come in handy in the future.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

Viewing 5 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply