SED and the Big Bad UNIX File

  • Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/pward/sedandthebigbadunixfile.asp

  • Good stuff about dealing with big files, but for the delimiter it's easy enough to set LF rather than CRLF in the file's connection object. I do this for data from a well-known provider of financial information which seems unable to standardize on CRLF or LF for its various files.

    Bill.

  • You could also achieve this result by a simple Shell script using RegEx-Replace and an ActiveX task before the actual transformation (dealt successfully with such things before)

  • There are also two *NIX and DOS utilities to convert between the *NIX and DOS LF/CRLF formats. They are:

    UNIX2DOS and DOS2UNIX

    Simple, open source and widely available. 😉

  • Yeah, I use UNIX2DOS & DOS2UNIX, very handy little utilities.

  • or open the file in an editor like Testpad and save it in Windows format.

  • Nice article. I would have never thought to use SED on Windows for that purpose.

    I'll keep it in my favorites for just in case situations.


    Thanks,

    Edmond Shamon Larson

  • Great article. Fortunately I have not come across this yet but always keep this type of info around just in case.

    For me the biggest issue has been mainframe files where the columns are fixed width but the file truncates to the line when a particular coulmn is not there. For me it is simply to import the whole thing into a single char column then export back to a new file and run thru the import as originally design. Would be nice to have a tool to auto check this and fix the file or data (source code would be great so I could just make a DTS object) while importing. Anyone got such an animal per chance?

  • Good article. I'll have to check out SED.

    Regarding the bad record: if the record was embarassed, you could call it a rouge record. But I think you meant rogue.

  • Hi,

    Good article.

    The problem is even more general then just importing from UNIX-generated files. I had recently to show one of the business ladies why the data import jobs for her application done by the third-party SW fails based on what files another company sends her. I created 5 files generated by DTS, VBScript and export from Excel using different row and field delimeters to show her the file will contan delimeters that a particular programmer specifies. Also, I showed to her that if the Comments field contains one of her row or field delimier characters her data import job will fail too with the message about incorrect number of fields.

    We all have to determine what are row and field delimeters before setting up any file processing and after that to work with the file supplier to make sure they don't change their processes and technologies to produce the files.  Here is a part of the script that helps me to see what characters are used. It posts a message for each character (for the demo), re-write it to output into the file.

    Set objFSO = CreateObject("Scripting.FileSystemObject")

    Set objFile = objFSO.OpenTextFile("C:\Temp\Files\MyTextFile.txt", 1)

    Do Until objFile.AtEndOfStream

        strCharacters = objFile.Read(1)

        Wscript.Echo strCharacters & " " & Cstr(asc(strCharacters))

    Loop

     

    Regards,Yelena Varsha

  • Good article on SED. But could you not just specify LF as the Row Delimiter in the properties of the Text File source within the package? 

    This may not have solved identifying the problem row, but there are some very good GUI based, inexpensive, text editors for opening large files. My favorite is vedit http://www.vedit.com/

    Terry

     

     

  • Thanks for a helpful article - and all the replies with other utilities and work-rounds. My favourite text editor is Notepad2.

    I would like to make a suggestion: although it is nice to see the error messages as images, but it would help if you also mentioned the initial error message:

    Error calling OpenRowset on the provider

    in the text. Then this article can be found when someone else hits this problem and uses google to find a solution.

  • "less" is a command that I find most useful for browsing through text based files.

    Ability to scroll forward, backwards, jump to line number, and so on, all make this an invaluable tool.

    Again though, based in UNIX (and available on the UNIX ports from sourceforge).

  • Awk, Perl, Ruby are probably better than sed, as could some VBScript programming (hint: open the file as a stream, and replace LF with CRLF, or whatever...). sed has some limitations on line lengths...

    Me? I would have used awk (or, more specifically, gawk). In the Textutils section on the gnuwin32 website you can d/l its verson of Less for win32, as well as gawk.

    Kernighan, I think, keeps the 'original' awk on his site (he's the 'K' of awk).

  • Regarding the post that talked about mainframe file dumps that used column width as the delimiter vs csv, pipes, tabs etc. I run across this a lot because the folks that actually purchase our data absolutely refuse to even discuss data formats - it's a 'not my problem to load it issue'.

    There's no easy answer, but the sweat equity answer is to build a parser in the programming language of you choice to replace two spaces with one (except when found between quote delimiters). Run that multiple times until there are no double spaces, then replace the space with the column delimiter of your choice.

    Now if I could just get mainframe weenies to understand that giving me a data dictionary of the columns, data types and widths IN column name ALPHABETICAL ORDER vs. ordinal position doesn't help at all....(especially when the first row of the file is not column names, which for some reason just never happens on main frame files).


    Thanks, and don't forget to Chuckle

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply