Do You Have ALL the YAML?

  • Comments posted to this topic are about the item Do You Have ALL the YAML?

  • With the proliferation of file formats, perhaps it is time to revisit the delimited file and how its problems can be overcome.

    CSV stands for comma separated values. As we know there is always the issue that your data might contain either the value delimiter or the line delimiter (carriage return, line break).

    This problem can be avoided by adopting the surprisingly little used standard Unicode control characters for record separator ("\u001E") and unit separator ("\u001F"). We can add the column names in the first row by separating them with the Unicode group separator ("\u001D"). Mapping the column names to the values is then simply a matter of indexing the names to the column index. This data format is far more compact than JSON or XML and just as readable (any program that can split a file into columns by delimiter can parse it).

    If we want to confirm that all the data has been delivered then we use the End of Transmission block ("\u0017").

    These control characters have been in ASCII and Unicode from the beginning and are very unlikely to go away.
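
    The format described above can be sketched in a few lines of JavaScript. This is only an illustration under some assumptions: every value is a string, no value contains one of the four control characters, and the function names `encode` and `decode` are made up for the example.

```javascript
// Separators described in the post above.
const GS = "\u001D";  // group separator: between column names in the header row
const RS = "\u001E";  // record separator: between rows
const US = "\u001F";  // unit separator: between values in a row
const ETB = "\u0017"; // end of transmission block: marks a complete payload

// Encode column names and rows (arrays of strings) into one string.
function encode(columns, rows) {
  const header = columns.join(GS);
  const body = rows.map((row) => row.join(US));
  return [header, ...body].join(RS) + ETB;
}

// Decode the string; throw if the ETB marker is missing (likely truncation).
function decode(text) {
  if (!text.endsWith(ETB)) {
    throw new Error("payload is missing the ETB marker; it may be truncated");
  }
  const [header, ...records] = text.slice(0, -1).split(RS);
  return {
    columns: header.split(GS),
    rows: records.map((record) => record.split(US)),
  };
}

// Values containing commas and line breaks survive the round trip untouched.
const payload = encode(
  ["id", "note"],
  [["1", "contains, a comma"], ["2", "contains\na line break"]]
);
const decoded = decode(payload);
```

    Because the ETB marker is written last, a payload that was cut off anywhere before the end fails the `endsWith` check instead of being silently accepted as a shorter data set.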

  • Since I am the Accidental TFS Administrator, and since I'm trying to move my organization to use the cloud for VCS instead of our out-of-support TFS, I've gotten into YAML. But like everything else where I work, they don't provide any training or support. I've just been thrown headfirst into the deep end of the pool.

    Until I read today's editorial, Steve, I'd never even heard of a "closing key:value". After posting this, I'm going to have to look that up.

    To answer your question, no, I've not done anything to ensure that the YAML I'm working with is complete. Although, I'm as sure as I can be that it's complete, as much as anything else in a Git repo is complete. The YAML I work with is part of our Git repos. So, either Git restores the whole file, or it doesn't. And that's just as true of any other source code file in the Git repo.

    Rod

  • When it comes to data interchange file formats, the merits of one versus another come down to:

    • how efficiently it can be parsed (ingesting a million records in 1 minute as opposed to 1 hour has obvious advantages to a DBA)
    • how gracefully it can recover from truncation or corruption (does one malformed record or tag prevent import of the entire file?)
    • how widely it is adopted (when exchanging data with 3rd parties, it's great to speak the same language)

    Delimited format is superior in all these aspects.

    However, the other markup and object notation file formats are better for things like configuration files.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Regarding MS Excel, that's my least favorite data interchange format. In a previous job, I found out the hard way that when someone in accounting hides a range of rows from view, the SSMS Import Wizard considers them deleted and will ignore them.

  • I forgot to mention earlier that I like YAML, because it's easier for me to understand than the convoluted PowerShell script the former TFS admin wrote for lots of our on-premises builds. And our TFS is so old it doesn't work with YAML.

    Kindest Regards, Rod

  • I do like YAML for the most part, but I'd never considered that a truncated file might not be noticed. If you upload something manually for config, might not be a problem, but if you're sending this across systems automatically, it could be something to worry about if this controls critical processes.
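
    One way an automated receiver could notice truncation is the closing key:value idea mentioned earlier in the thread: the sender always writes a known key as the last line, and the receiver rejects the file when that line is missing. A minimal sketch, assuming a hypothetical sentinel line `end_of_file: true` (a made-up convention, not a YAML feature) and a plain text check rather than a YAML parser:

```javascript
// Hypothetical sentinel line the sender agrees to write last, at top level.
const SENTINEL = "end_of_file: true";

// Return true when the last non-blank line of the text is the sentinel.
function looksComplete(yamlText) {
  const lines = yamlText.split(/\r?\n/).filter((line) => line.trim() !== "");
  return lines.length > 0 && lines[lines.length - 1].trimEnd() === SENTINEL;
}

// Made-up sample config for illustration.
const full = "job: nightly\nsteps:\n  - build\n  - deploy\nend_of_file: true\n";
const cut = full.slice(0, 30); // simulate a transfer that stopped after "- build"

console.log(looksComplete(full)); // true
console.log(looksComplete(cut));  // false
```

    The cut-off text here is still well-formed YAML (one job with a single step), which is why a parser on its own would not flag it.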

  • will 58232 wrote:

    With the proliferation of file formats, perhaps it is time to revisit the delimited file and how its problems can be overcome.

    CSV stands for comma separated values. As we know there is always the issue that your data might contain either the value delimiter or the line delimiter (carriage return, line break).

    This problem can be avoided by adopting the surprisingly little used standard Unicode control characters for record separator ("\u001E") and unit separator ("\u001F"). We can add the column names in the first row by separating them with the Unicode group separator ("\u001D"). Mapping the column names to the values is then simply a matter of indexing the names to the column index. This data format is far more compact than JSON or XML and just as readable (any program that can split a file into columns by delimiter can parse it).

    If we want to confirm that all the data has been delivered then we use the End of Transmission block ("\u0017").

    These control characters have been in ASCII and Unicode from the beginning and are very unlikely to go away.

    +1000.  That and control characters 28 thru 31 ("\u001D", "\u001E", and "\u001F" are 3 of those 4). And, to be sure, there is no problem with Delimited Data... it works perfectly when done correctly.  It's what people do with it and to it that puts the screws to it. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks Jeff

    That and control characters 28 thru 31.

    I'm writing them how I would decode them in JavaScript -

    Row split:

    data.split("\u001e")

    column split:

    data.split("\u001f")

    It is possible to define a whole set of JavaScript functions that map column names to data and so on - or for that matter implement the relational operators (restrict, project, union, intersect, minus and join).

    I'm principally concerned at present with sending data to a client. In this scenario things like JSON, XML and YAML are simply too bulky - when you consider that the network is one of the slower components in the system - and I want to be a good citizen and not hog bandwidth.
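
    The two split calls above extend to mapping column names onto values and to a couple of the relational operators. What follows is a rough sketch, not a finished library: the names `toObjects`, `restrict`, and `project` and the sample data are invented for the example, and the header row is assumed to use the group separator as described earlier in the thread.

```javascript
const GS = "\u001D"; // between column names in the header row
const RS = "\u001E"; // between rows
const US = "\u001F"; // between values in a row

// Turn the delimited text into an array of objects keyed by column name.
function toObjects(data) {
  const [header, ...records] = data.split(RS);
  const names = header.split(GS);
  return records.map((record) => {
    const values = record.split(US);
    return Object.fromEntries(names.map((name, i) => [name, values[i]]));
  });
}

// restrict: keep only the rows that satisfy a predicate.
const restrict = (rows, predicate) => rows.filter(predicate);

// project: keep only the named columns of each row.
const project = (rows, names) =>
  rows.map((row) => Object.fromEntries(names.map((name) => [name, row[name]])));

// Hypothetical sample data, not from the thread.
const data = [
  "id" + GS + "name" + GS + "city",
  "1" + US + "Ann" + US + "Leeds",
  "2" + US + "Bob" + US + "York",
].join(RS);

const rows = toObjects(data);
const result = project(restrict(rows, (row) => row.city === "York"), ["name"]);
// result is [{ name: "Bob" }]

// Rough size comparison: delimited text versus the same rows as JSON.
console.log(data.length, JSON.stringify(rows).length);
```

    The delimited form comes out shorter mainly because the column names appear once in the header instead of once per row, so the gap widens as the row count grows.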

  • Something else folks may not consider is the "native" format for transmitting data from SQL Server to SQL Server.  It does really cool stuff such as leaving INTs in a 4 byte format and DATETIME in an 8 byte format or DATE in a 3 byte format.  You can even generate a BCP format file and send it as a separate "meta-data" file alongside the data.

    --Jeff Moden

  • will 58232 wrote:

    Thanks Jeff

    That and control characters 28 thru 31.

    I'm writing them how I would decode them in JavaScript -

    Row split:

    data.split("\u001e")

    column split:

    data.split("\u001f")

    It is possible to define a whole set of JavaScript functions that map column names to data and so on - or for that matter implement the relational operators (restrict, project, union, intersect, minus and join).

    I'm principally concerned at present with sending data to a client. In this scenario things like JSON, XML and YAML are simply too bulky - when you consider that the network is one of the slower components in the system - and I want to be a good citizen and not hog bandwidth.

    Yeah... we "crossed streams" on this one a bit.  I was in the process of updating my post to say that you used 3 of the 4 control characters that I was talking about.  Totally agree with you.

    --Jeff Moden

  • Steve Jones - SSC Editor wrote:

    I do like YAML for the most part, but I'd never considered that a truncated file might not be noticed. If you upload something manually for config, might not be a problem, but if you're sending this across systems automatically, it could be something to worry about if this controls critical processes.

    Very good point. In the scenarios I've been working with, it's simpler than what you've described.

    Rod
