SED and the Big Bad UNIX File

Question

SED and the Big Bad UNIX File

Peter Ward

Grasshopper

Points: 13
More actions
December 19, 2005 at 5:06 pm

#167400

Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/pward/sedandthebigbadunixfile.asp

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply

Bill Geake Ten Centuries Points: 1352 More actions · Answer 1

Good stuff about dealing with big files, but for the delimiter it's easy enough to set LF rather than CRLF in the file's connection object. I do this for data from a well-known provider of financial information which seems unable to standardize on CRLF or LF for its various files.

Bill.

cneuhold Hall of Fame Points: 3375 More actions · Answer 2

You could also achieve this result by a simple Shell script using RegEx-Replace and an ActiveX task before the actual transformation (dealt successfully with such things before)

Johan Kotze Valued Member Points: 57 More actions · Answer 3

There are also two *NIX and DOS utilities to convert between the *NIX and DOS LF/CRLF formats. They are:

UNIX2DOS and DOS2UNIX

Simple, open source and widely available.

SALIM ALI SSC Eights! Points: 859 More actions · Answer 4

Yeah, I use UNIX2DOS & DOS2UNIX, very handy little utilities.

Peter Thomas Clarke Grasshopper Points: 23 More actions · Answer 5

or open the file in an editor like Testpad and save it in Windows format.

Edmond Shamon Larson SSC Enthusiast Points: 103 More actions · Answer 6

Nice article. I would have never thought to use SED on Windows for that purpose.

I'll keep it in my favorites for just in case situations.

Thanks,

Edmond Shamon Larson

Antares686 SSC Guru Points: 125444 More actions · Answer 7

Great article. Fortunately I have not come across this yet but always keep this type of info around just in case.

For me the biggest issue has been mainframe files where the columns are fixed width but the file truncates to the line when a particular coulmn is not there. For me it is simply to import the whole thing into a single char column then export back to a new file and run thru the import as originally design. Would be nice to have a tool to auto check this and fix the file or data (source code would be great so I could just make a DTS object) while importing. Anyone got such an animal per chance?

sqltung Hall of Fame Points: 3962 More actions · Answer 8

Good article. I'll have to check out SED.

Regarding the bad record: if the record was embarassed, you could call it a rouge record. But I think you meant rogue.

Yelena Varshal SSC-Dedicated Points: 34286 More actions · Answer 9

Hi,

Good article.

The problem is even more general then just importing from UNIX-generated files. I had recently to show one of the business ladies why the data import jobs for her application done by the third-party SW fails based on what files another company sends her. I created 5 files generated by DTS, VBScript and export from Excel using different row and field delimeters to show her the file will contan delimeters that a particular programmer specifies. Also, I showed to her that if the Comments field contains one of her row or field delimier characters her data import job will fail too with the message about incorrect number of fields.

We all have to determine what are row and field delimeters before setting up any file processing and after that to work with the file supplier to make sure they don't change their processes and technologies to produce the files. Here is a part of the script that helps me to see what characters are used. It posts a message for each character (for the demo), re-write it to output into the file.

Set objFSO = CreateObject("Scripting.FileSystemObject")

Set objFile = objFSO.OpenTextFile("C:\Temp\Files\MyTextFile.txt", 1)

Do Until objFile.AtEndOfStream

strCharacters = objFile.Read(1)

Wscript.Echo strCharacters & " " & Cstr(asc(strCharacters))

Loop

Regards,Yelena Varsha

TDuffy SSCarpal Tunnel Points: 4170 More actions · Answer 10

Good article on SED. But could you not just specify LF as the Row Delimiter in the properties of the Text File source within the package?

This may not have solved identifying the problem row, but there are some very good GUI based, inexpensive, text editors for opening large files. My favorite is vedit http://www.vedit.com/

Terry

mosaic-287947 Ten Centuries Points: 1053 More actions · Answer 11

Thanks for a helpful article - and all the replies with other utilities and work-rounds. My favourite text editor is Notepad2.

I would like to make a suggestion: although it is nice to see the error messages as images, but it would help if you also mentioned the initial error message:

Error calling OpenRowset on the provider

in the text. Then this article can be found when someone else hits this problem and uses google to find a solution.

Rich James Default port Points: 1484 More actions · Answer 12

"less" is a command that I find most useful for browsing through text based files.

Ability to scroll forward, backwards, jump to line number, and so on, all make this an invaluable tool.

Again though, based in UNIX (and available on the UNIX ports from sourceforge).

corey lawson Hall of Fame Points: 3773 More actions · Answer 13

Awk, Perl, Ruby are probably better than sed, as could some VBScript programming (hint: open the file as a stream, and replace LF with CRLF, or whatever...). sed has some limitations on line lengths...

Me? I would have used awk (or, more specifically, gawk). In the Textutils section on the gnuwin32 website you can d/l its verson of Less for win32, as well as gawk.

Kernighan, I think, keeps the 'original' awk on his site (he's the 'K' of awk).

Chuck Ritenour SSC-Addicted Points: 420 More actions · Answer 14

Regarding the post that talked about mainframe file dumps that used column width as the delimiter vs csv, pipes, tabs etc. I run across this a lot because the folks that actually purchase our data absolutely refuse to even discuss data formats - it's a 'not my problem to load it issue'.

There's no easy answer, but the sweat equity answer is to build a parser in the programming language of you choice to replace two spaces with one (except when found between quote delimiters). Run that multiple times until there are no double spaces, then replace the space with the column delimiter of your choice.

Now if I could just get mainframe weenies to understand that giving me a data dictionary of the columns, data types and widths IN column name ALPHABETICAL ORDER vs. ordinal position doesn't help at all....(especially when the first row of the file is not column names, which for some reason just never happens on main frame files).

Thanks, and don't forget to Chuckle

SED and the Big Bad UNIX File

Cookies on SQLServerCentral