Guarantee entire flat file loaded

  • cycle90210 (1/14/2016)


I may do a write-up on this, as it seems to be generating several good remarks and approaches!

I think there are 2 real ways to address this:

1. If you are able to customize the input/output from the source system, a checksum file and a header/footer line can be a sure way to guarantee things.

2. If you are not able to customize the source system, things get to be a "best guess" situation.

That might mean setting up a transaction, computing a checksum of sorts, or reading the rows separately and then comparing row counts as a best guess.

    I will need to experiment but how does a StreamReader determine a "line"? If I have a CRLF in the middle of a text field; will stream reader count this as a line even though it is in a text field?

Just to be sure, in case I'm misunderstanding... I wouldn't add a header/footer to the data file itself. Too many fast importing techniques, including BCP and BULK INSERT, can't tolerate different types of lines in the same file. Row totals, file checksums, and the like belong in a separate file, and it only needs to be one row of information.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
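The separate, one-row control file described above takes only a few lines to validate in a Script Task. This is a minimal sketch, assuming a made-up pipe-delimited control format of "rowcount|bytecount"; the class name and format are hypothetical, so adapt them to whatever your feed actually sends:

```csharp
using System;

static class LoadValidator
{
    // Parse a one-line control file of the assumed form "rowcount|bytecount",
    // e.g. "40000|1234567", and compare it against what was actually loaded.
    public static bool Validate(string controlLine, int loadedRows, long dataFileBytes)
    {
        string[] parts = controlLine.Trim().Split('|');
        int expectedRows = int.Parse(parts[0]);
        long expectedBytes = long.Parse(parts[1]);
        return expectedRows == loadedRows && expectedBytes == dataFileBytes;
    }
}
```

The control line itself would come from something like `File.ReadAllText`, and the data file's size from `new FileInfo(path).Length`.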

  • Orlando Colamatteo (1/13/2016)


    Counting the number of lines in a flat-file in a Script Task in SSIS is trivial and wicked fast. You can then take that count and compare it to the number of rows in the table after your Data Flow Task completes.

    string fileName = @"C:\Document1.txt";
    int lineCount = 0;

    using (StreamReader sr = new StreamReader(fileName))
    {
        while (sr.ReadLine() != null)
        {
            lineCount++;
        }
    }

    Knowing that the number of lines in the file matches the number of rows in the table only confirms you loaded all lines in the file, however. It tells you nothing about what the data provider intended to or was supposed to send you. For that you need a header or trailer line (or separate file as some mentioned) with a line count you can use to validate against the line/row count.

    Exactly. And I agree with such data being in a separate file. Think of it as "normalizing" the files where each file (the "file info" file and the "data" file) has one and only one purpose and format.

    --Jeff Moden

  • cycle90210 (1/14/2016)


    I will need to experiment but how does a StreamReader determine a "line"? If I have a CRLF in the middle of a text field; will stream reader count this as a line even though it is in a text field?

It will retrieve text until it sees a \r, a \n, or a \r\n and treat that as a line. So the technique is insulated from the different EOLs of different source systems, but it will not handle things like embedded line delimiters, because it is not parsing the file. You would need to write a custom line counter that is aware of the file format.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato
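One possible shape for such a format-aware counter, sketched here for double-quoted CSV where a CRLF inside a quoted field must not count as a record break (the quoting rules here are an assumption; match them to your actual file format):

```csharp
using System.IO;

static class CsvLineCounter
{
    // Count logical CSV records from a reader. Newlines inside double-quoted
    // fields are ignored; an escaped quote ("") toggles the quote state twice,
    // which nets out correctly. A final record without a trailing newline
    // still counts.
    public static int CountRecords(TextReader reader)
    {
        int records = 0;
        bool inQuotes = false;
        bool pendingRecord = false;
        int c;
        while ((c = reader.Read()) != -1)
        {
            if (c == '"')
            {
                inQuotes = !inQuotes;
                pendingRecord = true;
            }
            else if (c == '\n' && !inQuotes)
            {
                records++;
                pendingRecord = false;
            }
            else if (c != '\r')
            {
                pendingRecord = true;
            }
        }
        if (pendingRecord) records++;
        return records;
    }
}
```

Taking a `TextReader` rather than a file name keeps the counter easy to test with a `StringReader` before pointing it at real files via `new StreamReader(fileName)`.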

  • Jeff Moden (1/14/2016)


    Orlando Colamatteo (1/13/2016)


    Counting the number of lines in a flat-file in a Script Task in SSIS is trivial and wicked fast. You can then take that count and compare it to the number of rows in the table after your Data Flow Task completes.

    string fileName = @"C:\Document1.txt";
    int lineCount = 0;

    using (StreamReader sr = new StreamReader(fileName))
    {
        while (sr.ReadLine() != null)
        {
            lineCount++;
        }
    }

    Knowing that the number of lines in the file matches the number of rows in the table only confirms you loaded all lines in the file, however. It tells you nothing about what the data provider intended to or was supposed to send you. For that you need a header or trailer line (or separate file as some mentioned) with a line count you can use to validate against the line/row count.

    Exactly. And I agree with such data being in a separate file. Think of it as "normalizing" the files where each file (the "file info" file and the "data" file) has one and only one purpose and format.

I have had to ingest and regurgitate files in both schemes. Separate files are easier for ETL but complicate file transmission. I have found that mainframe systems tend to love the header/trailer approach, and multi-record file formats in general, which makes for a self-contained file that is easy to transmit and archive but complicates ETL.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Orlando Colamatteo (1/14/2016)


    Jeff Moden (1/14/2016)


    Orlando Colamatteo (1/13/2016)


    Counting the number of lines in a flat-file in a Script Task in SSIS is trivial and wicked fast. You can then take that count and compare it to the number of rows in the table after your Data Flow Task completes.

    string fileName = @"C:\Document1.txt";
    int lineCount = 0;

    using (StreamReader sr = new StreamReader(fileName))
    {
        while (sr.ReadLine() != null)
        {
            lineCount++;
        }
    }

    Knowing that the number of lines in the file matches the number of rows in the table only confirms you loaded all lines in the file, however. It tells you nothing about what the data provider intended to or was supposed to send you. For that you need a header or trailer line (or separate file as some mentioned) with a line count you can use to validate against the line/row count.

    Exactly. And I agree with such data being in a separate file. Think of it as "normalizing" the files where each file (the "file info" file and the "data" file) has one and only one purpose and format.

I have had to ingest and regurgitate files in both schemes. Separate files are easier for ETL but complicate file transmission. I have found that mainframe systems tend to love the header/trailer approach, and multi-record file formats in general, which makes for a self-contained file that is easy to transmit and archive but complicates ETL.

Heh... same here. One man's boon is another man's bane. I went through that with AT&T when I was working for a nationwide CLEC. The tapes (literally reel-to-reel tapes back in the late 90s; we had a desktop tape reader) contained 4 different sections, each having a header and a footer. They wouldn't let us send our requests that way, though. Oh no... instead, they wanted us to send them new customers in a file that could contain only 100 or fewer customers, a file control file with the number of rows and total bytes, a customer control file with a separate listing of the customers in the customer file and a repeat of the number of bytes in the customer file, and a file that had our company information in it. All the files had to be named with our company ID and a rather cryptic date/time, which they thought would force us to use their software to make the 3.5 inch floppy disks (even though the disks could handle dozens of such files, only one set of 4 per floppy was allowed), at least until I broke their date code. It was kinda fun to write the code to do all of that (I also had it automagically print disk labels), but you can just imagine what our postal costs were to send the floppies when we wanted to put on 40,000 new customers each week.

    I suppose what you say is why some people like XML. The root tag serves as both a starting and ending tag and you can embed just about anything you want and "easily" extract it.

    Really REALLY glad I don't have to work with either of those methods anymore.

    --Jeff Moden

  • Jeff Moden (1/14/2016)


    I suppose what you say is why some people like XML. The root tag serves as both a starting and ending tag and you can embed just about anything you want and "easily" extract it.

    I would include myself in "some people." With XML we can pass an entire database across a wire, or any subset of data. Conceptually XML has solved all the problems we're talking about and added a ton more benefit as well. That said, tag-overhead is a big disadvantage but well worth the trade for me.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • It has, indeed, solved a whole lot of problems for a whole lot of people. I'm just not one of those people. It's caused me more problems than not and not just because of the tag bloat. But, that's a topic for future discussion. 🙂

    --Jeff Moden

Fair enough. The list of topics on which I would like to do a deep dive with you someday just keeps on growing.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Steve Jones - SSC Editor (1/13/2016)


My only thought here is that if you know you have Windows-formatted files with a \r\n at the end of each line, you can count those with a separate, non-SSIS process. Then you can compare that to the lines read.

Perhaps a bit of the mouse guarding the cheese, but it depends on whether you're trying to ensure SSIS works correctly or to verify something else.

Personally, I'd trust SSIS if this is working and handle error rows inside the ETL. While you might have issues with some files, I'd deal with those at the time rather than coding to prevent them.

    Counting LF (ASCII 10) is a fairly safe approach which works for most files except the old Mac format where the line ending is CR (ASCII 13).

    😎
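That LF count can be taken over raw bytes, so no text decoding or parsing is involved. A small sketch (as noted above, it will under-count CR-only classic-Mac files):

```csharp
using System.IO;

static class LfCounter
{
    // Count LF (0x0A) bytes in a stream. For CRLF- or LF-terminated files
    // this equals the number of line endings in the file.
    public static long CountLf(Stream stream)
    {
        long count = 0;
        byte[] buffer = new byte[64 * 1024];
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < bytesRead; i++)
            {
                if (buffer[i] == 0x0A) count++;
            }
        }
        return count;
    }
}
```

Open the file with `File.OpenRead`, pass the stream in, and compare the result to the loaded row count (minus a header line, if the file has one).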

Viewing 9 posts - 16 through 24 (of 24 total)
