• Irozenberg 1347 (3/1/2013)


    Thank you Jeff!

    IMHO CSV data in database is a pure hack - some developers just cutting corners to speed up first cut and leave before it causing HUGE maintenance problem, especially with reporting.

    Last year have to deal with data migration project when salesforce.com legacy system had been implemented by storing comma separate values inside of table cells. I used CSV split function based on XML in order to tranform a column that contained comma-separated strings into a new table with ID and atomic items. Pure job creation scheme!

    You're welcome.

    Just to expose my opinion a bit...

    There are two types of CSV...

    Comma Separated Values - This is the "true" CSV and will always have quotes around aything that is character based. It's extremely useful but has the same problem as XML... if a quote is embedded in the actual data, it needs to be "escaped" in some fashion. Most people and programs don't format this type of file correctly including EXCEL.

    Comedy Separated Values - This is what most people think of when you say CSV. The only rules are that commas separate values and commas cannot be embedded in the values. It's a real PITA unless those rules are followed explicitly. Otherwise, it's pretty easy to make and use.

    Both types of CSV are easy for humans to understand and read which is why they both gained favor in the past.

    If it were me, I'd go back to the old ways and use things like ASCII characters 28 through 31 as they were meant to be along with the CrLf characters to help make it a bit easier to troubleshoot.

    Shifting gears, I think XML is probably the worst hack of them all. I just can't imagine why anyone would have written such a dirty, bloated, PITA to use data transmission format even if we didn't have to move the data to databases but especially since databases and spreadsheets DO make the world go'round. Consider the following.

    1. The tags cause extreme bloat. If communications were slower than what they are, no one in their right might would transmit so much useless repetative data as that contained in the tags.

    2. The tags don't actually do what they were advertised to do in the very beginning. There are things like OPENXML in SQL Server that will correctly shred even the most complex (even if not perfectly "well formed") of XML hierarchical data into an "edge" table but nothing to tell the human what the structure is especially when it comes to nested levels (think indented list of tags). The human has to figure that out and then write code to shred it correctly.

    3. The data format doesn't follow just one standard. It follows at least 2. One for entity data and one for element data both of which are just row data at different levels.

    4. XML data sucks for relational databases because multiple entity data (tables) are nested in a hierarchy. Ask most relational database programmers what their least favorite subject is and chances are good that they'll say "hierarchies".

    5. Unless you build a separate style sheet, XML doesn't do much to tell you what the datatype of each "cell" (whether it be entity or element data) is. That's another short coming, IMHO.

    6. All of that was done for the sake of supposed human readability and, I've got to tell you, if you have XML with more than about 3 levels (entities), a human is going to have some good difficulty with shredding it in the mind well enough to actually write queries against it.

    7. The data violates every rule of normalization there is if you store it in a database in an unshredded format. Holy shades of old IMS systems!

    One of my favorite ways to transmit data in the old days included a combination of the ASCII characters 28-31 with CrLf just for readability and the old DBase III table format. Each file had a header that named the columns and identified the datatype much like a CREATE TABLE statement. It was database friendly because each file formed an entity (table). It was incredibly compact because there were no tags and the separators weren't something that most humans could type via a keyboard which meant you didn't have to worry about things like embedded quotes or separators.

    The only good thing about XML, IMHO, is that it is so bloated and so resource intensive to shred that people HAD to make communications and CPUs faster just to handle XML and it's equally bloated father, HTML. Who in their right mind would use a 4 character starting tag and a 5 character ending tag just to identify a "cell" in a table for a single digit? Bleah!!! :sick: What the hell were they thinking?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)