RE: Problems importing CSV files using SSIS.

SSC Guru

Points: 1003863

July 3, 2015 at 11:19 pm

Phil Parkin (6/2/2015)
P Jones (6/2/2015)
We do lots of csv import and have found that | isn't a good delimiter character as it can easily be mistyped into data when hitting the shift key or even used to "draw pictures" in data text fields. Now we use the character § which can't be easily typed.
Good tip.

I'm still amazed that much of what has been done with CSVs and TSVs is to support spreadsheets and the fact that humans want to be able to see such things using the likes of NotePad to see things.

Way back when a lot of the folks on these fine forums weren't even born yet (I'm 62 and that includes me), there were simple non-printable single characters that were used VERY effectively for character based transmissions such as CSV files, et al.

Take a look at the ASCII table, a decent copy of which can be found at the following URL.

http://www.asciitable.com/

In particular, have a look at characters 28 through 31 (according to the DEC column). You have a file separator, a group separator, a record (line or row as humans think of them) separator, and a unit separator (divides the record into units and we know these as "delimiters"). A CSV-like file could easily be created using the record separator (CHAR(30)) to replace the "newline" or ^C_r^L_f combinations and the unit separator (CHAR(31)) would replace the comma delimiters. The neat thing about doing this is you would NOT have to worry about whether or not commas were embedded in the actual data and you wouldn't need text qualifiers (double quotes, usually) at all.

The group separator (CHAR(29)) could be used to build hierarchical data, if you wanted to screw up a good thing, and you could use the file separator (CHAR(28)) could be used to include a layout file (think header with datatypes like the old DBase III files used to have) followed by the data "file", followed by a summary file (think footer) that would give you the total number of rows, a total byte count and an end-of-transmission character (CHAR(4) or CHAR(23)) to ensure that you got the whole file.

As in the old days when transmission speeds where incredibly slow (I was thrilled to finally go from 110 Baud to 300 Baud), integer values could be sent as byte values which would have the effect of greatly "compressing" the transmission for integers with more than 2 places.

Personally, I think it makes the likes of XML and JSON look silly and really bulky for character based transmissions. And, yeah... it even works with {gasp!} Unicode if you need it to.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)