XML/CSV Data Source

  • Hi everyone

    We are currently in the process of defining the technical requirements for a new client we have acquired. All of the source data files we will receive as based on insurance data.

    The structure of the data for our largest import is as follows:

    A Policy has one or many Sections

    A Section has one or many Items

    We will be importing around 1 million policies per source file on a monthly basis. Due to the structure of the hierarchy mentioned above, we are debating whether to go with XML or Flat File source files.

    Could I please have your ideas on using XML as an ETL data source with large transaction volumes?

    Thank you

  • Although XML would have a grand draw here, I loath it because of the duplicated data in the form of tags. A 1 mega-byte flat file can take 8 or 16 mega-bytes or more to transmit/store the same information.

    Look at what you've defined...

    Policy

    Section

    Item

    In an RDBMS, those would be separate tables... yet, when it comes to creating transmission data, people loose their minds and try to cram it all into a single file!

    CSV's aren't so good either. One bad name with a comma in it and BOOOM! That's unless you go through the ardur of double quoting alpha-numeric data.

    No, I have two favorites for extremely high speed ETL and transmission of data... TAB Delimited and Fixed Field. I particularly like fixed field because it also allows the double check of a fixed line length just to be sure nothing was lost or added in the transmission.

    I'd recommend 3 separate files using Fixed field data. If they don't want to go for that, then 3 separate files using the proper TAB delimited format.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff

    Thank you very much. I really appreciate the feedback

  • I rarely get the luxury of "calling the ball" on what kinds of data I get, but given the choice - I would definitely also steer towards three separate files. The only reason I might request XML in this case is if the data were to be stored in your system as semi-structured (i.e. in a XML fragment in one table).

    In my opinion, the biggest downside there is to have to "decompose" the file into the three data elements. The extra size of the file shouldn't begin to play in as a factor unless your uploads are truly epic; the processing time on the other hand would start adding up quickly. (It's good on the other hand if you don't have the luxury of knowing what the structure of the data is before you receive it: you KNOW what you're getting, so you can go a more direct route).

    Of course - I probably WOULD use CSV format (which by definition, REQUIRES quoting alpha data in my book) for the three separate files, but that's just me.... I've had my share of troubles with both fixed-width and tab-delimited, so my "backup format" is pipe-delimited.

    At the end of the day - as long as the data is consistent (meaning - all three files have the correctly typed data in the right places, the children files don't have orphans, the identifiers are carried on each file,....) the internals of the file aren't going to matter (meaning fixed width or delimited). conversely - if the source system isn't smart enough to put out a consistent set of files, it's usually not smart enough to do it in any format....

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Man, I agree with that... personal choice on such things is rare indeed. If you do get to make the choice, despite what I or anyone else says, choose wisely because you're the one that's gonna have to live with it. And, doesn't matter what the choice is, Matt is spot on... at the end of the day, the data must be in the database and it must be accurate to the bit.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for the feedback guys.

Viewing 6 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply