Simple Method for Importing Ragged files

  • Cool... let's see your solution for this, Brandon.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Heh... Hour or so? Like I said previously, I've used this method to do 5.1 million rows in about 3 minutes. I'm sure there's a faster method, but it worked for me.

    --Jeff Moden



  • Just to clarify - the "hour or so" being referred to was the time taken to create the package.

    Rgds,

    Paul Ibison

  • The "Ragged Right" Files I have to deal with are Mixed Row Type files as well. Because of that, I cannot use column names from the csv file, but also, each row starts with a TypeID.

    So, in DTS, I create a Transform Data task for each row type and use this ActiveX script on the first column:

    Function Main()
        If DTSSource("Col001") = "HDRID" Then
            Main = DTSTransformStat_OK
        Else
            Main = DTSTransformStat_SkipRow
        End If
    End Function

    In the example for the article, your initial transform would just check the length and skip the row if it was a header or footer. There's no need to import-export-import, even to get the column headings.

    --

    JimFive
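    The per-row-type check above can be sketched outside DTS as well. A minimal Python version (the "HDRID" type ID and the sample rows are just illustrative assumptions):

```python
import csv
import io

def rows_of_type(lines, type_id):
    """Yield only the CSV rows whose first column matches type_id,
    mirroring the per-row-type DTS transform described above."""
    for row in csv.reader(lines):
        if row and row[0] == type_id:
            yield row

# Mixed row types in one file: two HDRID rows and one DTLID row.
sample = io.StringIO("HDRID,a,b\nDTLID,1,2\nHDRID,c,d\n")
print(list(rows_of_type(sample, "HDRID")))
```

    Each row type would get its own pass (or its own destination), just as each gets its own transform task in DTS.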

  • Brandon,

    all due respect but I think you're missing the point.

    Judging by some of the questions, some others (Jeff apart) have also never really tried to solve this problem in practice.

    Take a CSV file with 50 columns and try your conditional split starting from a single column, and see how long it takes. You'll have to define each of the 50 columns separately, and define the substring function separately for each one. In the example, the CSV columns have no fixed width, so your substring expressions will have to account for finding the commas, which may be nested inside quoted strings. How long before this is robust? How tedious will all this coding be? And the resulting code will be huge.

    This method can do the same in an hour or so and have a very simple resultant package.

    Cheers,

    Paul Ibison
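    The robustness worry above, commas nested inside quoted strings, is easy to demonstrate. A small Python sketch (the sample line is invented for illustration):

```python
import csv

# A comma nested inside a quoted field: naive splitting on ","
# yields four pieces, while a real CSV parser yields three fields.
line = 'HDR,"Smith, John",20080306'
naive = line.split(",")
parsed = next(csv.reader([line]))
print(len(naive), len(parsed))
```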

  • paul.ibison (3/6/2008)


    Brandon,

    all due respect but I think you're missing the point.

    Judging by some of the questions, some others (Jeff apart) have also never really tried to solve this problem in practice.

    Take a CSV file with 50 columns and try your conditional split starting from a single column, and see how long it takes. You'll have to define each of the 50 columns separately, and define the substring function separately for each one. In the example, the CSV columns have no fixed width, so your substring expressions will have to account for finding the commas, which may be nested inside quoted strings. How long before this is robust? How tedious will all this coding be? And the resulting code will be huge.

    This method can do the same in an hour or so and have a very simple resultant package.

    Cheers,

    Paul Ibison

    I'll admit I've never had to solve this exact problem before, where simply stripping headers and footers away would do the trick. The header-and-footer problems I have had to solve sometimes involved files with complex hierarchical structures from legacy systems, like this:

    FILE
    ORDERS|2
    HDR|100293|987|20080326
    ITM|897654|9876.87|3
    ITM|098643|76.34|12
    FTR|100293|2
    HDR|100294|456|20080326
    ITM|765432|11.99|6
    FTR|100294|1
    ENDORDERS|2
    CUST|2
    HDR|987
    ITM|Joe|Jackson|98 Palomino Way|Los Angeles|CA|90823
    FTR|987
    HDR|456
    ITM|Lisa|Lewis|123 Sesame Street|New York|NY|10014
    FTR|456
    ENDCUST|2
    ENDFILE

    How does your process work for files like this? Ignoring the header and footer information in this file isn't an option, since you would lose the order numbers and order dates, the line item counts, and other auditing information included in the file, such as record counts.
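    For illustration, a minimal Python sketch of what parsing the ORDERS block of a file like this involves, keeping the header and footer data that a strip-and-discard approach would lose (the field meanings are inferred from the sample and are assumptions):

```python
def parse_orders(lines):
    """Group ITM lines under their HDR, keeping the order number,
    customer, and date, and auditing the FTR item count."""
    orders, current = [], None
    for line in lines:
        parts = line.strip().split("|")
        if parts[0] == "HDR":
            current = {"order": parts[1], "cust": parts[2],
                       "date": parts[3], "items": []}
        elif parts[0] == "ITM" and current is not None:
            current["items"].append(parts[1:])
        elif parts[0] == "FTR" and current is not None:
            # The footer carries the expected line-item count.
            assert len(current["items"]) == int(parts[2])
            orders.append(current)
            current = None
    return orders

sample = ["HDR|100293|987|20080326",
          "ITM|897654|9876.87|3",
          "ITM|098643|76.34|12",
          "FTR|100293|2"]
print(parse_orders(sample))
```

    The CUST block would need its own handler, since its HDR and ITM rows carry different fields.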

  • paul.ibison (3/6/2008)


    Brandon,

    all due respect but I think you're missing the point.

    Judging by some of the questions, some others (Jeff apart) have also never really tried to solve this problem in practice.

    Take a CSV file with 50 columns and try your conditional split starting from a single column, and see how long it takes. You'll have to define each of the 50 columns separately, and define the substring function separately for each one. In the example, the CSV columns have no fixed width, so your substring expressions will have to account for finding the commas, which may be nested inside quoted strings. How long before this is robust? How tedious will all this coding be? And the resulting code will be huge.

    This method can do the same in an hour or so and have a very simple resultant package.

    Cheers,

    Paul Ibison

    Unfortunately, in my line of work...as an IT solutions vendor...it's part of my job to be a tedious coder. Again, with all due respect to you, I've never seen an example in practice of having header and trailer records along with field names.

  • pnewhart (3/6/2008)


    Unfortunately, in my line of work...as an IT solutions vendor...it's part of my job to be a tedious coder. Again, with all due respect to you, I've never seen an example in practice of having header and trailer records along with field names.

    With all due respect to all of y'all after these many years I've seen some zany stuff. I'm rarely surprised any more. I write file parsers all the time. I just don't have to use SQL to do the job. I'm also not reading millions of rows at a crack either.

    We all have to play the cards we are dealt. We don't have to like them and we get to complain about them.

    ATBCharles Kincaid

  • Charles Kincaid (3/6/2008)


    With all due respect to all of y'all after these many years I've seen some zany stuff. I'm rarely surprised any more. I write file parsers all the time. I just don't have to use SQL to do the job. I'm also not reading millions of rows at a crack either.

    We all have to play the cards we are dealt. We don't have to like them and we get to complain about them.

    I agree with your point of view, as I've done the same type of thing and have written many custom file parsers and custom ETL applications over the years. Honestly I don't think I've ever seen a header/footer file format in which all of the headers and footers could be so easily discarded without a second thought.

  • How does your process work for files like this?

    I don't believe it was meant to... wasn't advertised that way except maybe by a slightly misleading title.

    --Jeff Moden



  • Jeff Moden (3/6/2008)


    I don't believe it was meant to... wasn't advertised that way except maybe by a slightly misleading title.

    Yes, catchy title. When I first clicked on it I actually expected it would be about the type of file I was asking about. The author's right though, I've never had any need to solve this very specific problem. I was kind of wondering how common this specific problem is compared to, say, ANSI X12 EDI and other formats where simply stripping headers and footers is not an option?

  • Mike C (3/6/2008)


    Yes, catchy title. When I first clicked on it I actually expected it would be about the type of file I was asking about. The author's right though, I've never had any need to solve this very specific problem. I was kind of wondering how common this specific problem is compared to, say, ANSI X12 EDI and other formats where simply stripping headers and footers is not an option?

    Heh... guess it depends on what business you're in... I've been fortunate in not ever having to mess with EDI files. I have had to do exactly what's in this article a couple of dozen times. Not necessarily using the methods in this article each time, but same problem files.

    --Jeff Moden



  • Jeff Moden (3/6/2008)


    Mike C (3/6/2008)


    Heh... guess it depends on what business you're in... I've been fortunate in not ever having to mess with EDI files. I have had to do exactly what's in this article a couple of dozen times. Not necessarily using the methods in this article each time, but same problem files.

    I guess I should have suspected, but I've gotten used to working with file formats where the headers and footers (if there are any) are used to group detail lines for specific reasons. I guess there's still a proliferation of headers and footers in files out there with the sole purpose of getting in the way to make the ETL process harder 🙂

  • Is it possible to use SSIS to import this same flat file if it doesn't have a ROW delimiter?

    I get these files which are basically continuous text streams; the records are fixed-length, with no delimiter between them.

    I've written a utility to handle it now (it's slow and difficult to manage on all my remote sites), but is there anything already in SSIS that would just delimit the row at a specified length?

    Perhaps there's a different approach that doesn't require an external utility?
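    Outside SSIS, cutting an undelimited stream into fixed-length records takes only a few lines. A hedged Python sketch of what such a utility might do (the record length of 8 is illustrative):

```python
import io

def split_records(stream, record_len):
    """Cut an undelimited text stream into fixed-length records."""
    while True:
        chunk = stream.read(record_len)
        if not chunk:
            break
        yield chunk

data = io.StringIO("AAAA1111BBBB2222CCCC3333")
print(list(split_records(data, 8)))
```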

  • johnr (3/19/2008)


    Is it possible to use SSIS to import this same flat file if it doesn't have a ROW delimiter?

    I get these files which are basically continuous text streams; the records are fixed-length, with no delimiter between them.

    I've written a utility to handle it now (it's slow and difficult to manage on all my remote sites), but is there anything already in SSIS that would just delimit the row at a specified length?

    Perhaps there's a different approach that doesn't require an external utility?

    You can use the Bulk Insert task with a BCP format file to import fixed-length record files. You can probably do it in the data flow task as well with some transformations, although I haven't had occasion to do that myself yet.
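    As a rough illustration, a non-XML BCP format file for fixed-length, undelimited records might look like this (the column names, field lengths, and version number are assumptions, not taken from the thread):

```text
9.0
3
1  SQLCHAR  0  10  ""  1  OrderID    ""
2  SQLCHAR  0  8   ""  2  OrderDate  ""
3  SQLCHAR  0  6   ""  3  Amount     ""
```

    With empty terminators on every field, bcp and BULK INSERT read each record purely by the host-file data lengths, so no row delimiter is required.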

Viewing 15 posts - 31 through 45 (of 51 total)
