February 24, 2008 at 11:19 pm
Hi everyone
We are currently in the process of defining the technical requirements for a new client we have acquired. All of the source data files we will receive as based on insurance data.
The structure of the data for our largest import is as follows:
A Policy has one or many Sections
A Section has one or many Items
We will be importing around 1 million policies per source file on a monthly basis. Due to the structure of the hierarchy mentioned above, we are debating whether to go with XML or Flat File source files.
Could I please have your ideas on using XML as an ETL data source with large transaction volumes?
Thank you
February 25, 2008 at 6:53 am
Although XML would have a grand draw here, I loath it because of the duplicated data in the form of tags. A 1 mega-byte flat file can take 8 or 16 mega-bytes or more to transmit/store the same information.
Look at what you've defined...
Policy
Section
Item
In an RDBMS, those would be separate tables... yet, when it comes to creating transmission data, people loose their minds and try to cram it all into a single file!
CSV's aren't so good either. One bad name with a comma in it and BOOOM! That's unless you go through the ardur of double quoting alpha-numeric data.
No, I have two favorites for extremely high speed ETL and transmission of data... TAB Delimited and Fixed Field. I particularly like fixed field because it also allows the double check of a fixed line length just to be sure nothing was lost or added in the transmission.
I'd recommend 3 separate files using Fixed field data. If they don't want to go for that, then 3 separate files using the proper TAB delimited format.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 25, 2008 at 6:58 am
Hi Jeff
Thank you very much. I really appreciate the feedback
February 25, 2008 at 7:23 am
I rarely get the luxury of "calling the ball" on what kinds of data I get, but given the choice - I would definitely also steer towards three separate files. The only reason I might request XML in this case is if the data were to be stored in your system as semi-structured (i.e. in a XML fragment in one table).
In my opinion, the biggest downside there is to have to "decompose" the file into the three data elements. The extra size of the file shouldn't begin to play in as a factor unless your uploads are truly epic; the processing time on the other hand would start adding up quickly. (It's good on the other hand if you don't have the luxury of knowing what the structure of the data is before you receive it: you KNOW what you're getting, so you can go a more direct route).
Of course - I probably WOULD use CSV format (which by definition, REQUIRES quoting alpha data in my book) for the three separate files, but that's just me.... I've had my share of troubles with both fixed-width and tab-delimited, so my "backup format" is pipe-delimited.
At the end of the day - as long as the data is consistent (meaning - all three files have the correctly typed data in the right places, the children files don't have orphans, the identifiers are carried on each file,....) the internals of the file aren't going to matter (meaning fixed width or delimited). conversely - if the source system isn't smart enough to put out a consistent set of files, it's usually not smart enough to do it in any format....
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
February 25, 2008 at 8:00 am
Man, I agree with that... personal choice on such things is rare indeed. If you do get to make the choice, despite what I or anyone else says, choose wisely because you're the one that's gonna have to live with it. And, doesn't matter what the choice is, Matt is spot on... at the end of the day, the data must be in the database and it must be accurate to the bit.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 25, 2008 at 9:43 pm
Thanks for the feedback guys.
Viewing 6 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply