Need to remove an arrow char at end of each file as well as a trailing comma in header

  • I've been trying find an approach to take to remove a trailing comma and an EOF -> (arrow) character at the end of hundreds of files (some very large). I just got these files to import into SQL Server and I'm wondering if I should use a Windows.bat script or Powershell to clean the files first or is there a better way within SSIS (a Script Task) to do it? I'm still a beginner with SSIS but I do have some months of experience off and on with it.

    This .bat file worked but took c. 30 min for a 5.4MB file :

    @echo off

    setlocal enabledelayedexpansion

    set InFile=SGMCLARE_PE20160601_REBAL760.txt

    set OutFile=test-nolastline.txt

    if exist "%OutFile%" del "%OutFile%"

    set Line=

    for /f "delims=" %%a in ('type "%InFile%"') do (

    if not "!Line!"=="" >>"%OutFile%" echo.!Line!

    set Line=%%a

    )

    Anyone have a better idea? Also how to remove the end of header line comma?

    I'll keep searching and experimenting so I'll post again if I resolve.

    Thanks in advance.

  • Since Powershell is also "interpreted" rather than "compiled", I'm not sure that PowerShell would be any faster. Short of writing a compiled, automatic file editor (which could be incredibly fast because it could do a direct byte level change of the file), I'd be tempted to do a blob-style BULK INSERT and then use something like xp_Cmdshell to rewrite the file back out without the offending characters. Then turn around and do a normal columnar BULK INSERT. Even with those 3 evolutions, it's likely to be faster than using a batch file or Powershell processor.

    Shifting gears a bit, someone has a strange notion of what a file should contain. The extra comma/EOF/arrow thing is undoubtedly someone's attempt to be clever about how to clearly indicate that the end of file has been reached instead of providing a control file with a rowcount in it. I'd find out who's making the files and tell them to fix it and to stop being "clever". 😉 If that's not possible and this is going to be an on-going task, then it would definitely be worth having someone write a "file repair" SQLCLR as a preprocessor.

    Either that, or you might be using the wrong "code page" for the file type. Some code page styles include a 4 header (such as a Unicode or UTF-8 File), some contain a byte footer (real files actually do end with an end-of-file character that is absorbed by the OS, and some contain both. It might be important to find out what the file type/code page that they're sending the file as actually is.

    BTW, SQL Server doesn't support the very common UTF-8 until you get to SQL Server 2016.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • You could do this in SSIS but it's a bit clumsy.

    To do this for a single file:

    Create a file connector so that each line is read into a single column

    In a data flow task, use the file as a source

    Create a script component to add a row number e.g. https://hansmichielscom.wordpress.com/2013/06/20/how-to-add-a-rownumber-to-a-data-flow-task-in-ssis-2012/

    Add a derived column component, test for the first row (the header) and then remove the last comma e.g. (RowNumber ==1)?SUBSTRING(<<column_name>>,1,LEN(<<column_name>>)-1):<<column_name>>

    To remove the EOF marker you would need a nested test e.g.(RowNumber ==1)?SUBSTRING(<<column_name>>,1,LEN(<<column_name>>)-1):(<<column_name>>=="<<eof>>")?"":<<column_name>>

    Output this to a new file with a different file connection.

    To do this for all files, you would need a For Each loop and have the DFT within it. You would need a package variable to hold the file name which would then be an expression for the file connection manager so that the source changes for each loop. Same for the output connection manager.

    I've not tested so my expression might be a bit wrong but it should give you the idea.

    This might be quicker but it will be limited by your network speed.

    Jez

  • I run into poorly formatted files on a fairly regularly when trying to consume external files via SSIS. My first choice is always to see if the source system can clean them up. Otherwise, I tend to write a C# script component in my control flow to pre-process the file before I hit my dataflow.

    I have one file in particular that is supposed to be comma delimited with double quote text qualifiers, but then they imbed commas and double quotes inside the text data (and a few other quarks) which SSIS does not like. My script task processes the file, cleans it up, and rebuilds it out to another directory. It's an 80 MB file and it takes 3-4 minutes to run out.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • Thanks Jeff Moden and Jez. Jeff's answer was at times above my skill/experience level (CLR) but I still learned from it. Jez, I will follow your ideas to see if I can get it done.

    Today I have been using a sed command (using GnuWin32 on Windows) to remove the trailing comma from the header of the files but it is slow. These are big files. The command I used is simply:

    sed "s/,$//" nameOfSourceFile > nameOfDestinationFile

    When the file is written out the stray arrow character at the end of the file is no longer there so it solves both problems.

  • Thanks John, that sounds like what I should do.

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply