Need to use a script data source to pull in 10,000+ characters of data in one Excel column

  • I have spent two days trying to set up a data source for an Excel file with a variety of data types in it, one of which is a column with values ranging from a few dozen characters up to over 10,000 characters. The Excel data source option never takes the largest values into consideration, so it fails every time with a data truncation error. I found a blog article from a savvy SSIS user that explains how to control the data types with a Script data source. That blog is here: https://egilhansen.com/2015/07/28/ssis-solution-excel-data-source-text-truncation-column-more-than-255-characters/.

    This solution works fine for all the columns except the one with 10,000+ characters in it. I think the final solution is going to add the CommandBehavior.SequentialAccess to the data reader and then subdivide the value somehow. Maybe the script needs to separate the raw string into two or more [DT_STR] output columns? Or maybe it can assign the full value to a text stream [DT_TEXT] output column a few characters at a time?

    If someone out there has done this and can explain how to code it, I'd be greatly appreciative. 

  • I have no clue how to fix this because I've never run into it before.  I am interested, though, so this will act as a "bump" for your post.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • What is the format of your destination column, the input from Excel is Unicode, if your output is varchar then you will have truncation issues.

    For a test a created an Excel 97-2003 workbook with one worksheet containing a single row with mixed values and one column with 10,000 chars.
    I created an SSIS (SQL Server 2008 R2) package with an Excel data connection pointing to the workbook
    It identified the large column as DT_NTEXT
    I add a conversion transformation to change DT_NTEXT to DT_TEXT
    I then added an output flat file connection (using the converted column)
    Ran the package and it successfully read the workbook and output the correct values including the 10,000 char column.

    Far away is close at hand in the images of elsewhere.
    Anon.

  • If any of the first eight rows of data in the Excel sheet contains the extra-long string of more than 8,000 characters, then the Excel Source will apply the DT_NTEXT data type, but if not it applies DT_STRING. Most of the rows in the source document are well below 8,000 characters; only a few exceed that maximum. So rather than rely on some sort of artificial manipulation of the Excel document or on some arbitrary "template" file to sort of "trick" the Excel Source into using DT_NTEXT, I wanted to proactively control the data type, and the Script Component is the only way to do that, or so I've concluded at this point.

    After many hours of struggling with my own ignorance of using the SSIS Script Component, of handling BLOBs in general, and the C# language in particular, I have finally been able to work out the solution. It begins with the approach outlined in the link from my original post, with these variations and additions:

    • The data reader must be executed with the CommandBehavior.SequentialAccess option. This enables the handling of the extra-large text (I'll call it BLOB text) in chunks that will not overload the memory buffer. But it also means that you have to handle the data in the same order it is extracted by the SQL command the data reader uses. So you might want to manually write the SQL code rather than rely on the automated process shown in the original solution.The sequential access restriction means you cannot work with a column of data and subsequently work with data from a column listed earlier in the SQL SELECT clause.
    • You must add code to consume the BLOB text in chunks. This kind of code is familiar to anyone who works with binary data regularly, but it was unfamiliar territory for me. And this was complicated by the fact that the data had to be extracted first as a character array and then later converted to a byte array. In my code I used a .GetChars() method from the reader to extract the characters in 8K chunks. 
    • You cannot use a simple assignment expression for DT_NTEXT output columns; you will get a "this property is read only" error, as I did. Instead you have to use a .AddBlobData method to append the chunks of the BLOB text to the output column value. (It was discovering this .AddBlobData method that was the real breakthrough leading me to the final solution).  The .AddBlobData method only accepts binary array data, so you have to convert the character array that holds the extracted data into a byte array, but this is pretty easy using an encoding object's .GetBytes method.
    • If you have any non-ASCII characters in the BLOB text, you will probably want to use a .Replace method to strip them or put ASCII-compatible equivalents in their places. I found that the "curly apostrophe" in the UDF8 encoding was coming out with funky characters in the final result. I find that odd because I specified UDF8 for the encoding of the .GetBytes method, but it still came out weird. So I just replaced them with the plain straight single-quotation mark, and it works fine.

    Here is the key portion of my modified script showing the process of iterating through the BLOB text in 8K chunks and using .AddBlobData to put it into the output column value. The first column of the SQL SELECT clause is the one with BLOB text, so it is referenced as the "0" column of the data because the .GetChars method only allows referencing the column by position and not by name. The other two columns are non-BLOB data and are referenced by name rather than by position.

    public override void CreateNewOutputRows()
      {
       int buffer_size = 8192;
       char[] blob_chars = new char[buffer_size];
       long start_index = 0;
       long ret_value = 0;
       string blob_chunk;

       while (reader.Read())
       {
        Sheet1Buffer.AddRow();

        start_index = 0;
        ret_value = reader.GetChars(0, start_index, blob_chars, 0, buffer_size);
        blob_chunk = new string(blob_chars);
        blob_chunk = blob_chunk.Replace('\u2018', '\u0027').Replace('\u2019', '\u0027').Replace('\u201c', '\u0022').Replace('\u201d', '\u0022');
        while (ret_value == buffer_size)
        {
          Sheet1Buffer.MajorDesc.AddBlobData(Encoding.UTF8.GetBytes(blob_chunk));
          start_index += buffer_size;
          blob_chars = new char[buffer_size];
          ret_value = reader.GetChars(0, start_index, blob_chars, 0, buffer_size);
          blob_chunk = new string(blob_chars);
          blob_chunk = blob_chunk.Replace('\u2018', '\u0027').Replace('\u2019', '\u0027').Replace('\u201c', '\u0022').Replace('\u201d', '\u0022');
        }
        Sheet1Buffer.MajorDesc.AddBlobData(Encoding.UTF8.GetBytes(blob_chunk));

        Sheet1Buffer.Title = (string)reader["Title"];
        Sheet1Buffer.Code = (string)reader["Code"];
       }
      }

  • Jeff Moden - Monday, September 18, 2017 8:33 AM

    I have no clue how to fix this because I've never run into it before.  I am interested, though, so this will act as a "bump" for your post.

    Thanks for the "bump." I have used your "numbers table" solutions to avoid RBAR on many occasions, so I'm honored that you took an interest in my problem and its solution.
    By the way, you may be interested to know I plan on extending the script in order to add a [RowNumber] column to the output columns of the Script Component as a way to avoid RBAR assignment of an iterating uniqueness key column value in the database

  • Thanks for the feedback both on your fix and on nonRBAR solutions, Geoff.  I appreciate it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I found this topic via a Google search, and it kinda-sorta relates to my issue.   I need a Script Component in SSIS to serve as a data source for an inbound text file.   The component reads the data into a variable using the StreamReader objects' ReadLine() method in Visual Basic.   The problem I run into is exactly the same...   I am not allowed to assign a value to the BLOB column I specified for the Output, as it complains that the property is "Read Only".   My data is a string, but I can't be sure that the row I get will be 8,000 characters or less, so my staging table to hold the data uses VARCHAR(MAX) as the data type.   I'm not sure that the method being used here can actually work for me, as I need this to be fairly simple instead of code that is going to be troublesome to modify in the future.   Also, the data from the file is coming into a String data type variable named RowData.   Might there be a way to use a loop to just grab 8,000 characters from the string at a time and then put them into a BLOB somehow?   That is what your solution does, but it also uses C# as well as it's reading the data from the reader object, and I don't need to do that.   Just not sure how to translate that code, or if the methods will be available if the reader isn't being used.   Anyone?

  • sgmunson wrote:

    I found this topic via a Google search, and it kinda-sorta relates to my issue.   I need a Script Component in SSIS to serve as a data source for an inbound text file.   The component reads the data into a variable using the StreamReader objects' ReadLine() method in Visual Basic.   The problem I run into is exactly the same...   I am not allowed to assign a value to the BLOB column I specified for the Output, as it complains that the property is "Read Only".   My data is a string, but I can't be sure that the row I get will be 8,000 characters or less, so my staging table to hold the data uses VARCHAR(MAX) as the data type.   I'm not sure that the method being used here can actually work for me, as I need this to be fairly simple instead of code that is going to be troublesome to modify in the future.   Also, the data from the file is coming into a String data type variable named RowData.   Might there be a way to use a loop to just grab 8,000 characters from the string at a time and then put them into a BLOB somehow?   That is what your solution does, but it also uses C# as well as it's reading the data from the reader object, and I don't need to do that.   Just not sure how to translate that code, or if the methods will be available if the reader isn't being used.   Anyone?

    Instead of using SSIS  or Visual Basic, why not just do a BULK INSERT from the file, which would handle the whole 9 yards quite nicely.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden wrote:

    sgmunson wrote:

    I found this topic via a Google search, and it kinda-sorta relates to my issue.   I need a Script Component in SSIS to serve as a data source for an inbound text file.   The component reads the data into a variable using the StreamReader objects' ReadLine() method in Visual Basic.   The problem I run into is exactly the same...   I am not allowed to assign a value to the BLOB column I specified for the Output, as it complains that the property is "Read Only".   My data is a string, but I can't be sure that the row I get will be 8,000 characters or less, so my staging table to hold the data uses VARCHAR(MAX) as the data type.   I'm not sure that the method being used here can actually work for me, as I need this to be fairly simple instead of code that is going to be troublesome to modify in the future.   Also, the data from the file is coming into a String data type variable named RowData.   Might there be a way to use a loop to just grab 8,000 characters from the string at a time and then put them into a BLOB somehow?   That is what your solution does, but it also uses C# as well as it's reading the data from the reader object, and I don't need to do that.   Just not sure how to translate that code, or if the methods will be available if the reader isn't being used.   Anyone?

    Instead of using SSIS  or Visual Basic, why not just do a BULK INSERT from the file, which would handle the whole 9 yards quite nicely.

    I need a row number based on the actual order in the original file, plus I need to include other metadata that I'm supplying via SSIS variables.   That row number might not be reproducible some other way, and it's not something we can do without.  It's an essential part of the process.   If you're aware of a way to do that via BULK INSERT, I'm all ears.   Unfortunately, there's nothing in the file itself that would provide that order beyond the order of the rows in the file.

  • It now appears I may have found the solution:

    Row.MyVARCHARMAXColumn.AddBlobData(System.Text.Encoding.UTF8.GetBytes("something")) or

    Row.MyVARCHARMAXColumn.AddBlobData(System.Text.Encoding.ASCII.GetBytes("something"))

    At least now my script doesn't generate an error for compilation.   We'll see what happens at runtime...

  • sgmunson wrote:

    Jeff Moden wrote:

    sgmunson wrote:

    I found this topic via a Google search, and it kinda-sorta relates to my issue.   I need a Script Component in SSIS to serve as a data source for an inbound text file.   The component reads the data into a variable using the StreamReader objects' ReadLine() method in Visual Basic.   The problem I run into is exactly the same...   I am not allowed to assign a value to the BLOB column I specified for the Output, as it complains that the property is "Read Only".   My data is a string, but I can't be sure that the row I get will be 8,000 characters or less, so my staging table to hold the data uses VARCHAR(MAX) as the data type.   I'm not sure that the method being used here can actually work for me, as I need this to be fairly simple instead of code that is going to be troublesome to modify in the future.   Also, the data from the file is coming into a String data type variable named RowData.   Might there be a way to use a loop to just grab 8,000 characters from the string at a time and then put them into a BLOB somehow?   That is what your solution does, but it also uses C# as well as it's reading the data from the reader object, and I don't need to do that.   Just not sure how to translate that code, or if the methods will be available if the reader isn't being used.   Anyone?

    Instead of using SSIS  or Visual Basic, why not just do a BULK INSERT from the file, which would handle the whole 9 yards quite nicely.

    I need a row number based on the actual order in the original file, plus I need to include other metadata that I'm supplying via SSIS variables.   That row number might not be reproducible some other way, and it's not something we can do without.  It's an essential part of the process.   If you're aware of a way to do that via BULK INSERT, I'm all ears.   Unfortunately, there's nothing in the file itself that would provide that order beyond the order of the rows in the file.

    Just double checking... what's the target here... An Excel spreadsheet or a table?  I'm pretty sure that it's not possible to do in a single blob in Excel.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • sgmunson wrote:

    Jeff Moden wrote:

    sgmunson wrote:

    I found this topic via a Google search, and it kinda-sorta relates to my issue.   I need a Script Component in SSIS to serve as a data source for an inbound text file.   The component reads the data into a variable using the StreamReader objects' ReadLine() method in Visual Basic.   The problem I run into is exactly the same...   I am not allowed to assign a value to the BLOB column I specified for the Output, as it complains that the property is "Read Only".   My data is a string, but I can't be sure that the row I get will be 8,000 characters or less, so my staging table to hold the data uses VARCHAR(MAX) as the data type.   I'm not sure that the method being used here can actually work for me, as I need this to be fairly simple instead of code that is going to be troublesome to modify in the future.   Also, the data from the file is coming into a String data type variable named RowData.   Might there be a way to use a loop to just grab 8,000 characters from the string at a time and then put them into a BLOB somehow?   That is what your solution does, but it also uses C# as well as it's reading the data from the reader object, and I don't need to do that.   Just not sure how to translate that code, or if the methods will be available if the reader isn't being used.   Anyone?

    Instead of using SSIS  or Visual Basic, why not just do a BULK INSERT from the file, which would handle the whole 9 yards quite nicely.

    I need a row number based on the actual order in the original file, plus I need to include other metadata that I'm supplying via SSIS variables.   That row number might not be reproducible some other way, and it's not something we can do without.  It's an essential part of the process.   If you're aware of a way to do that via BULK INSERT, I'm all ears.   Unfortunately, there's nothing in the file itself that would provide that order beyond the order of the rows in the file.

    I know you mentioned other metadata, but to get the line numbers could you run the file through a simple powershell script to append these to each line in the source file before using Jeff's suggestion? Or is that going to cause you other grief?

    "Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it. When we enquire into any subject, the first thing we have to do is to know what books have treated of it. This leads us to look at catalogues, and at the backs of books in libraries."
    — Samuel Johnson
    I wonder, would the great Samuel Johnson have replaced that with "GIYF" now?

  • Jeff Moden wrote:

    sgmunson wrote:

    Jeff Moden wrote:

    sgmunson wrote:

    I found this topic via a Google search, and it kinda-sorta relates to my issue.   I need a Script Component in SSIS to serve as a data source for an inbound text file.   The component reads the data into a variable using the StreamReader objects' ReadLine() method in Visual Basic.   The problem I run into is exactly the same...   I am not allowed to assign a value to the BLOB column I specified for the Output, as it complains that the property is "Read Only".   My data is a string, but I can't be sure that the row I get will be 8,000 characters or less, so my staging table to hold the data uses VARCHAR(MAX) as the data type.   I'm not sure that the method being used here can actually work for me, as I need this to be fairly simple instead of code that is going to be troublesome to modify in the future.   Also, the data from the file is coming into a String data type variable named RowData.   Might there be a way to use a loop to just grab 8,000 characters from the string at a time and then put them into a BLOB somehow?   That is what your solution does, but it also uses C# as well as it's reading the data from the reader object, and I don't need to do that.   Just not sure how to translate that code, or if the methods will be available if the reader isn't being used.   Anyone?

    Instead of using SSIS  or Visual Basic, why not just do a BULK INSERT from the file, which would handle the whole 9 yards quite nicely.

    I need a row number based on the actual order in the original file, plus I need to include other metadata that I'm supplying via SSIS variables.   That row number might not be reproducible some other way, and it's not something we can do without.  It's an essential part of the process.   If you're aware of a way to do that via BULK INSERT, I'm all ears.   Unfortunately, there's nothing in the file itself that would provide that order beyond the order of the rows in the file.

    Just double checking... what's the target here... An Excel spreadsheet or a table?  I'm pretty sure that it's not possible to do in a single blob in Excel.

    Data source is the Script Component, which must populate the column in the output buffer.   Data destination is a SQL Server 2019 database table.

  • david.edwards 76768 wrote:

    sgmunson wrote:

    Jeff Moden wrote:

    sgmunson wrote:

    I found this topic via a Google search, and it kinda-sorta relates to my issue.   I need a Script Component in SSIS to serve as a data source for an inbound text file.   The component reads the data into a variable using the StreamReader objects' ReadLine() method in Visual Basic.   The problem I run into is exactly the same...   I am not allowed to assign a value to the BLOB column I specified for the Output, as it complains that the property is "Read Only".   My data is a string, but I can't be sure that the row I get will be 8,000 characters or less, so my staging table to hold the data uses VARCHAR(MAX) as the data type.   I'm not sure that the method being used here can actually work for me, as I need this to be fairly simple instead of code that is going to be troublesome to modify in the future.   Also, the data from the file is coming into a String data type variable named RowData.   Might there be a way to use a loop to just grab 8,000 characters from the string at a time and then put them into a BLOB somehow?   That is what your solution does, but it also uses C# as well as it's reading the data from the reader object, and I don't need to do that.   Just not sure how to translate that code, or if the methods will be available if the reader isn't being used.   Anyone?

    Instead of using SSIS  or Visual Basic, why not just do a BULK INSERT from the file, which would handle the whole 9 yards quite nicely.

    I need a row number based on the actual order in the original file, plus I need to include other metadata that I'm supplying via SSIS variables.   That row number might not be reproducible some other way, and it's not something we can do without.  It's an essential part of the process.   If you're aware of a way to do that via BULK INSERT, I'm all ears.   Unfortunately, there's nothing in the file itself that would provide that order beyond the order of the rows in the file.

    I know you mentioned other metadata, but to get the line numbers could you run the file through a simple powershell script to append these to each line in the source file before using Jeff's suggestion? Or is that going to cause you other grief?

    That would cause one of the basic principles I adhere to, to be grossly violated:  "NEVER destroy your inputs."  Not now, not ever.  We retain the original files in an Archive folder, so that's not going to be allowed, even if I had wanted to.

  • sgmunson wrote:

    Jeff Moden wrote:

    Just double checking... what's the target here... An Excel spreadsheet or a table?  I'm pretty sure that it's not possible to do in a single blob in Excel.

    Data source is the Script Component, which must populate the column in the output buffer.   Data destination is a SQL Server 2019 database table.

    Ok... what does that script look like?  Is it pulling from another table, a file, or ????  Can you post that script, please?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply