Data truncation reported in both EM and SSIS when importing from multiple data sources

  • First, I'm retired so have time to play with my own data.  I have two data sources, an extensive collection of digital music files and many years of financial data.

I use a program called MP3tag to load the TAG data from 70k tracks and export it to .CSV files.  The program is a 32-bit application, so I must export my music tag data in parts to avoid memory issues with it.

    I used Quicken 2014 and a program called QIF2CSV to export 30 years of all types of financial data (checking, credit card, investments) to .CSV files.  Multiple-year files and even individual-year files all appear to contain valid data when loaded into text file editing utilities.

I have tried both the Import Wizard in Enterprise Manager and custom SSIS packages to import the .CSV files from both sources.  In both cases I consistently get reported data truncation errors, even though I have defined all fields in my SQL Server import tables as NVARCHAR(MAX) just to get the data in so I can query it and clean it.  For both types of data I have tried various input definitions (varchar, nvarchar, text, ntext) but still get the reported truncation errors.

While I understand that the audio file tag data may not be exactly 'clean', my financial data is all personally entered, not even downloaded from financial institutions, so I feel it should be fairly clean.

    Any ideas what I can do to avoid the reported 'truncation' errors?

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • skeleton567 - Friday, August 25, 2017 11:42 AM

    The size depends upon the source files, the destination columns and the mapping of those. So you would want to check the data types in both.
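One frequent culprit worth checking: the flat-file source in both the Wizard and SSIS defaults each incoming column to a 50-character string, so truncation can be reported on the source side even when the destination column is NVARCHAR(MAX).  Once some data has landed in a staging table, a width check along these lines can confirm what is actually arriving (table and column names here are illustrative):

    -- widest values actually present in the staging table
    SELECT MAX(LEN(Payee)) AS MaxPayeeChars,
           MAX(DATALENGTH(Payee)) AS MaxPayeeBytes
    FROM dbo.DataConversionWork;

    -- declared sizes on the destination side
    SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'DataConversionWork';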

    Sue

  • Sue_H - Monday, August 28, 2017 12:47 PM

    The size depends upon the source files, the destination columns and the mapping of those. So you would want to check the data types in both.

    Sue

Well, Sue, it's not quite that simple.  I've checked the source files in text editing software, and no fields are visually anywhere near the limits.  I've checked the input files with Notepad++ for anything other than ASCII characters, because I suspected invalid characters were causing the problem.  The input data was cleaned of commas before the CSV files were created, and the fields as defined inside the import are larger than any actual input content.  It's just strange that this problem occurs in CSV files from two entirely different software sources, with no problems until I hit the import.

25 years of historical data has been separated by year and examined individually, both visually and with various software.  Actually, I've been impressed with the consistency and reliability of the data exported from Quicken, as it has been created with a number of different versions of that product over the years.  The music file tag data is a different story, but that too has been manually cleaned of foreign-language characters from various publishers and online tag sources.

    But I'm not giving up on this.  Like President Reagan said, 'It CAN be done'.

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

Have you checked the end-of-record delimiter?  It could be LF, CR, or CR/LF.
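If the terminator doesn't match what the import expects, stray CR or LF characters often end up embedded in the first or last column of each row.  Once the data is staged, a check along these lines will flush them out (table and column names are illustrative):

    -- look for stray carriage returns or line feeds left in a column
    SELECT *
    FROM dbo.DataConversionWork
    WHERE Payee LIKE '%' + CHAR(13) + '%'
       OR Payee LIKE '%' + CHAR(10) + '%';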

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen - Monday, August 28, 2017 3:02 PM


Drew, I haven't actually tried this, but reason suggests it's not the case.  Certain imports of partial data files work fine, producing smaller numbers of rows that are indeed correct, which indicates the mechanics of the import work.  I suppose this could happen if the file creation process intermittently failed to append the correct characters.  However, 'truncation' to me indicates a failure in a particular field, which would mean the comma delimiters would also have to be wrong, wouldn't it?  I'm not seeing any incorrectly divided or combined fields in the files.

Also arguing against this is the fact that the failing files are created by two distinct commercial products, which would both need to have a similar problem.

At this point I'm thinking there is something I'm missing in the Import Wizard, as the lady above suggested.  This is why I have defined all input as variable-length character fields, with non-character data to be validated and then converted internally with SQL (a sketch of that step follows below).  Back to the drawing board.

    Incidentally, I found the article on ISNUMERIC a few days ago very interesting and informative.
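For what it's worth, on SQL Server 2012 and later TRY_CONVERT is often a safer validation tool than ISNUMERIC, which happily accepts strings like '$' and '1e4'.  A sketch of the validate-then-convert step mentioned above (column names are illustrative):

    -- flag staged rows whose Amount column won't convert cleanly to money
    SELECT *
    FROM dbo.DataConversionWork
    WHERE Amount IS NOT NULL
      AND TRY_CONVERT(money, Amount) IS NULL;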

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • Well, I believe the Quicken .csv import problem is also solved.  The failure reported was 'String or binary data would be truncated'.
By eliminating one column at a time, I determined that the column 'Payee' in the input table was causing the failure.  Since the corresponding column in the table being INSERTed into is varchar(50), I used DATALENGTH(Payee) against the input table.  No rows were reported to contain a value longer than 50 characters.
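One caveat on these length checks: DATALENGTH counts bytes rather than characters (two per character for nvarchar), and LEN ignores trailing spaces, so the two can disagree with each other and with the declared column size.  A quick illustration:

    -- LEN ignores trailing spaces; DATALENGTH counts bytes, not characters
    DECLARE @v varchar(50) = 'abc   ';    -- three trailing spaces
    DECLARE @nv nvarchar(50) = N'abc';
    SELECT LEN(@v) AS LenV,               -- 3
           DATALENGTH(@v) AS BytesV,      -- 6
           LEN(@nv) AS LenNV,             -- 3
           DATALENGTH(@nv) AS BytesNV;    -- 6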

So then I used the following query to identify the row that SQL THOUGHT was longer than 50 characters.

    SELECT * FROM dbo.DataConversionWork WHERE SUBSTRING(Payee,1,50) <> Payee 

    This compared the first 50 characters of the column to the complete column.  Weird, right?
    There was a single row returned with the contents of the Payee column being:
    'United Airlines     848.18 ???'

Once the question marks were removed from the data in the table being SELECTed, the INSERT into the second table no longer fails.  Apparently the repeated question marks were being interpreted in a way that made the receiving VARCHAR(50) column report truncation.
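The cleanup itself can be a one-liner along these lines (a sketch against the same staging table):

    -- strip the offending question marks from the staged column
    UPDATE dbo.DataConversionWork
    SET Payee = REPLACE(Payee, '?', '')
    WHERE Payee LIKE '%?%';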

So a varchar(50) column containing three question mark characters is 'longer' than 50 characters, apparently.  We never stop learning.

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • skeleton567 - Monday, October 23, 2017 9:40 PM


VARCHAR(50) and question marks have no particular problem with each other.  What may be the problem is how the Import Wizard is interpreting them.  I avoid the Import Wizard like the plague, so I don't know for sure, but three question marks mean nothing special to a VARCHAR(50).

    Also, "EM" (Enterprise Manager) hasn't been around since SQL Server 2000.  It's now called "SSMS" or "SQL Server Management Studio".

    Here's one example...


DECLARE @SomeVarchar VARCHAR(50);
SELECT @SomeVarchar = 'United Airlines 848.18 ???'; -- 26 characters between the quotes
SELECT LEN(@SomeVarchar); -- will return 26

    It may also be that you're not actually working with plain text.  The 3 question marks may be substitutions for unprintable control characters.
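If control characters are suspected, a range test under a binary collation will surface anything outside printable ASCII (using the staging table mentioned earlier):

    -- find rows containing anything outside printable ASCII (0x20-0x7E);
    -- the binary collation makes the bracketed range a true code-point range
    SELECT Payee
    FROM dbo.DataConversionWork
    WHERE Payee LIKE '%[^ -~]%' COLLATE Latin1_General_BIN;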

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Monday, October 23, 2017 10:15 PM


    Well, Jeff, first of all, at my 75 years of age, I'll call it EM if I jolly well feel like it.  ;>)

You're correct that the Import Wizard is a pain most of the time.  HOWEVER, both the Wizard and a single-step SSIS bulk-load package had the same issue with the question marks.  And, no, I'm pretty sure they were not control characters.  I hand-keyed that entry myself, with the question marks as a reminder for later attention.  Besides, I further scrubbed the data using Notepad++ to verify that there were no non-ASCII characters in the file.

I didn't try the Wizard on the file again, but the SSIS step had no problem on the first try after the question marks were removed.  I have to assume I am working with plain text, but can't prove that because the data is a .csv file exported from commercial software, and you know how that goes.

    I know there is a tendency to dismiss a problem after it is solved as 'Sure, that's what it was.'  But I don't think it is always that simple, at least in my 42 years of working with data, 7 years of retirement and still learning!  SSIS didn't 'just happen' to work that time.

OK, it's 9:30 AM.  Time to put down the coffee cup and get out of bed!  Thanks for the examples.  That's pretty close to what I did to find the problem.

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • skeleton567 - Tuesday, October 24, 2017 7:42 AM


You can call it whatever you want.  I was just trying to help someone not sound 12 years out of touch, especially if they're trying to give it a go at 75. 😉

    --Jeff Moden



  • 'Thank you for your support'.  The luxury at my age is that it's OK to sound a bit out of touch, 'cause I don't have to interview for jobs any more, and don't have to go through annual performance reviews.  Of course, I get the daily reviews from my wife!  ;>)  Bless her for putting up with me.  We just passed 38 years the other day, second time around the block for both of us. 

    It is good to have all of you good folks to run things past when I get into a situation.

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )
