The struggle to renormalize data

  • Comments posted to this topic are about the item "The struggle to renormalize data".

    Best wishes,
    Phil Factor

  • In the old days of receiving data on 1/2" tape, the list brokers used to supply the tapes with half a page of A4 telling you the record format, encoding, number of records, record terminator, file terminator, etc.  Maybe, if you were lucky, you would get a data profile and an explanation of the enum values and codes embedded in the data.

    The discipline of documenting record structures seems to have been lost.  Having spent the past year using Apache Spark to ingest JSON data into a data warehouse, I can tell you that a significant amount of project time was wasted on software archaeology.

    The theory of constraints tells us that there are very few constraints in a process; for simplicity, say there is one.  Time is wasted on other, perceived constraints while that one constraint remains the bottleneck.  Has the development speed gained by using JSON for front-end systems been negated by the time taken to get the data into a BI system?  Have we tackled the wrong constraint, and in the wrong way?

    Storing data in Apache Avro format does at least give us some form of schema for JSON data, and Parquet supports arrays within its tabular format.
    JSON is a convenient format for passing data between web services, but as a bulk data-ingestion medium I'm struggling to find it appropriate.
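    To make that concrete, here's a rough sketch, in Python rather than Spark, of the kind of checking an Avro-style schema buys you over bare JSON.  The schema, field names, and records below are all invented for illustration:

```python
import json

# A hypothetical Avro-style schema; record and field names are invented.
SCHEMA = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
        {"name": "spend", "type": "double"},
    ],
}

# Map Avro primitive type names onto the Python types json.loads produces.
AVRO_TO_PY = {"long": int, "string": str, "double": float}

def validate(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field in schema["fields"]:
        name, ftype = field["name"], field["type"]
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], AVRO_TO_PY[ftype]):
            problems.append(f"{name}: expected {ftype}")
    return problems

good = json.loads('{"id": 7, "email": "a@b.com", "spend": 12.5}')
bad = json.loads('{"id": "seven", "email": "a@b.com"}')
print(validate(good, SCHEMA))  # []
print(validate(bad, SCHEMA))   # ['id: expected long', 'missing field: spend']
```

    Real Avro tooling (e.g. the fastavro package) goes further and enforces the schema at write time, so a malformed record never reaches the consumer at all.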

  • Personally, I think all of the "new" ways of sharing data are just stupid.  XML, JSON, EDI, whatever.  It's all stupid.  If it's not tag-bloated hierarchical crap that doesn't need to be, it's some other junk that you need a 100-page spec to write the code to disassemble the garbage into normalized tables.  I was really excited when XML first came out with the promise of "not having to know the structure of the data".  Nothing could be further from the truth.

    I think back to when I was using dBASE II and dBASE III ("DBF" files) for a short while.  THAT was amazing.  If someone wanted me to have the data in 3 related tables, we wouldn't go through all this hierarchical XML junk.  They would just send me 3 files that I could attach to a database and it was a done deal, because the header of the file (the dBASE III structure) had the column names and the datatypes described in it, and the rest of the rows were data.  ASCII in all its glory, sans tag bloat, hierarchical nonsense, or having to be careful about special characters.
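    For anyone who never met one, the DBF header really was that self-describing.  Here's a rough Python sketch of reading it; the synthetic two-field header below is invented for the demo, but the offsets follow the published dBASE III layout:

```python
import struct

def parse_dbf_header(buf):
    """Parse the fixed 32-byte dBASE III file header plus field descriptors."""
    version, yy, mm, dd, nrecords, header_len, record_len = struct.unpack(
        "<BBBBIHH", buf[:12])
    fields, pos = [], 32
    while buf[pos] != 0x0D:  # 0x0D terminates the field-descriptor array
        name, ftype, _, flen, fdec = struct.unpack("<11sc4sBB", buf[pos:pos + 18])
        fields.append((name.rstrip(b"\x00").decode("ascii"),
                       ftype.decode("ascii"), flen, fdec))
        pos += 32  # each descriptor occupies 32 bytes
    return {"version": version, "records": nrecords,
            "record_len": record_len, "fields": fields}

# Synthetic two-field header (NAME C(20), QTY N(5)), purely for illustration.
hdr = struct.pack("<BBBBIHH", 0x03, 95, 7, 26, 42, 32 + 2 * 32 + 1, 26) + b"\x00" * 20
f1 = b"NAME".ljust(11, b"\x00") + b"C" + b"\x00" * 4 + bytes([20, 0]) + b"\x00" * 14
f2 = b"QTY".ljust(11, b"\x00") + b"N" + b"\x00" * 4 + bytes([5, 0]) + b"\x00" * 14
info = parse_dbf_header(hdr + f1 + f2 + b"\x0d")
print(info["fields"])  # [('NAME', 'C', 20, 0), ('QTY', 'N', 5, 0)]
```

    No external spec required: the column names, types, and widths travel with the file, which is exactly the point being made above.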


    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Monday, September 11, 2017 11:04 PM

    Personally, trying to conform the world to one way of thinking, one tool, or one format is extremely hard. It's better to focus on how you can cut through all the politics and come to some kind of middle ground that bridges the gaps. Maybe not so much in smaller organizations, where you have more control, but in larger ones, especially where you may have separate NoSQL and RDBMS teams, it can be counterproductive to even think about it.

    On the article, I do think it's easier said than done to bridge the two worlds with the same agreed-upon schemas so that they complement each other and make things like plucking data out of one system and easily putting it into another possible. This is the kind of place where roles like mine (the data architect) should shine, in trying to ensure that can happen, or at least in foreseeing that it WILL happen at some point down the line. But it's extremely complicated, much as it's still extremely complicated to build a single data warehouse where everyone can agree on the final model and on what questions (or potentially new questions) that warehouse should be answering for the business and its units.

  • xsevensinzx - Wednesday, September 13, 2017 1:09 AM

    It's not actually difficult to get the world to conform to a particular methodology.  All you have to do is advertise it as the latest "must have to be cool" thing, publish a tool or two to make it "easy to use", and make it "self-bandwagoning", so that you're an unpopular, "behind the times" person if you're not using it.  Look what happened with XML (and now JSON, etc.) despite all its disadvantages in transmission length, storage size, and performance compared with other, much simpler methods.

    --Jeff Moden



  • I loathe JSON... I don't see the point of it over XML or even flat files.

    Of course, I've only worked with it in POSH, which is a bit of a PITA, so I may be a bit biased 🙂

  • ManicStar - Thursday, September 14, 2017 11:28 AM

    You mentioned what you don't like.  What method for large data transfers do you prefer?

    --Jeff Moden



  • Jeff Moden - Thursday, September 14, 2017 5:38 PM

    That's an easy one for me - text files.  I can read a 1 GB text file into staging, normalize a few things, and write it to a permanent structure faster than the query to parse the equivalent XML would take to run.  The files aren't as huge, they process faster, and it's relatively simple to do.  Compared with XML, well... nevermind, there's no comparison.  The problem with text files is that they aren't shiny, new, cool, or hip.  What some people call boring, I call efficient.

    To the point of the excellent article, it can be difficult to renormalize data once it's been denormalized into a structure for transfer.  Granted, it can usually be done, especially if you're working with the group that's sending it to you, but it can be cumbersome.  But when you stop and consider some of the stuff we do in T-SQL and how fast it runs, it is pretty amazing.  For a historical perspective, most of us routinely carry around a flash drive in our pocket with more data than it took to put the first man on the moon.
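    In miniature, that stage-then-normalize pattern looks something like the following.  Python and SQLite stand in here for the real toolchain, and all table and column names are invented:

```python
import csv
import io
import sqlite3

# A hypothetical denormalized extract: order rows repeating customer data.
raw = io.StringIO(
    "order_id\tcustomer\tcity\tamount\n"
    "1\tAcme\tDetroit\t100\n"
    "2\tAcme\tDetroit\t250\n"
    "3\tZenith\tChicago\t75\n")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging(order_id INT, customer TEXT, city TEXT, amount INT)")

# Bulk-load the text file straight into staging, header row driving the columns.
con.executemany(
    "INSERT INTO staging VALUES (:order_id, :customer, :city, :amount)",
    csv.DictReader(raw, delimiter="\t"))

# Renormalize: customers get their own table, orders keep a foreign key.
con.executescript("""
    CREATE TABLE customer(id INTEGER PRIMARY KEY, name TEXT UNIQUE, city TEXT);
    INSERT INTO customer(name, city) SELECT DISTINCT customer, city FROM staging;
    CREATE TABLE orders(order_id INT, customer_id INT, amount INT);
    INSERT INTO orders
        SELECT s.order_id, c.id, s.amount
        FROM staging s JOIN customer c ON c.name = s.customer;
""")
print(con.execute("SELECT COUNT(*) FROM customer").fetchone()[0])  # 2
```

    The same two-step shape (dumb bulk load, then set-based INSERT...SELECT to split out the repeating groups) is what the T-SQL version of this would do with BULK INSERT and a staging table.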

  • Ed Wagner - Thursday, September 14, 2017 9:13 PM

    Yup... I absolutely agree on the subject of text files being head and shoulders above any kind of markup-language file.  What were you referring to as "flat files"?

    --Jeff Moden



  • Jeff Moden - Thursday, September 14, 2017 10:35 PM

    My preference is tab-separated values, but CSV and fixed-width are fine.  I process some files that are fixed-width AND delimited by a semicolon.
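    Both variants are trivial to parse.  A quick sketch, with an invented column layout (the real widths would come from the file spec):

```python
# Hypothetical record layout: (field name, start column, end column).
LAYOUT = [("name", 0, 10), ("qty", 10, 15), ("price", 15, 23)]

def parse_fixed(line):
    """Slice a fixed-width record into named, trimmed fields."""
    return {name: line[a:b].strip() for name, a, b in LAYOUT}

def parse_fixed_semicolon(line):
    """Handle the oddball files that are fixed-width AND semicolon-delimited."""
    parts = line.rstrip("\n").split(";")
    return {name: part.strip() for (name, _, _), part in zip(LAYOUT, parts)}

rec = parse_fixed("widget    00042  129.99")
print(rec)  # {'name': 'widget', 'qty': '00042', 'price': '129.99'}
```

    Either way, there is no parser state to speak of: position or a single split character tells you everything.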

  • Jeff Moden - Thursday, September 14, 2017 5:38 PM

    I actually prefer pipe-delimited text files.  I've almost never seen the pipe used in the value of a column, as opposed to commas, which appear much more frequently, and it makes the file more compact than a fixed-width text file.  I do prefer fixed-width to comma-delimited, though.

  • Chris Harshman - Friday, September 15, 2017 7:39 AM

    Yes, pipe-delimited text files with column headings in the first row are my favs.  So easy to feed to a staging table.
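    In miniature (Python's csv module here; the column names are invented):

```python
import csv
import io

# Hypothetical pipe-delimited extract, header row first.
data = io.StringIO("id|name|city\n1|Acme|Detroit\n2|Zenith|Chicago\n")
rows = list(csv.DictReader(data, delimiter="|"))
print(rows[0])  # {'id': '1', 'name': 'Acme', 'city': 'Detroit'}

# If a value ever does contain the delimiter, csv-style quoting covers it:
out = io.StringIO()
csv.writer(out, delimiter="|", quoting=csv.QUOTE_MINIMAL).writerow(
    ["3", "Pipe|Co", "NYC"])
print(out.getvalue())  # 3|"Pipe|Co"|NYC plus a line terminator
```

    The header row is what makes it "so easy to feed to a staging table": the column names ride along with the data, so the load needs no side-channel spec.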

  • Ah... ok.  Then we're still talking about the same methodology of having some "unique" single-character delimiter.  The unfortunate part of using a pipe, or any other printable character, along with "standard" end-of-line characters is that I have seen each of those turn up in the data itself.

    What I'd like to see is a return to an ancient standard that would work well even in this day and age: the proper use of ASCII characters 28 through 31.
    Dec  Hex  Abbr  Name
    28   1C   FS    File Separator
    29   1D   GS    Group Separator
    30   1E   RS    Record Separator
    31   1F   US    Unit Separator
    The reason this fell out of favor was that humans found it difficult to read, and they had a hell of a time trying to import such files into things like spreadsheets.  But for machine-to-machine transfers, it's nasty fast.

    As for fixed-field, it's even faster, because nothing has to be split at delimiters behind the scenes.  It can be faster still if you transmit numeric data in binary (I prefer big-endian for purposes of performance but can, of course, handle little-endian as well).  It's much faster to pass a 4-byte integer "field" than a 9-digit varchar, but I'll settle for the latter rather than use the likes of any markup "language".
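    The size argument is easy to demonstrate; Python's struct module stands in here for whatever actually packs the bytes:

```python
import struct

n = 123456789
as_text = str(n).encode("ascii")  # 9 bytes as a character "field"
as_int = struct.pack(">i", n)     # 4 bytes, big-endian (network byte order)
print(len(as_text), len(as_int))  # 9 4
print(struct.unpack(">i", as_int)[0] == n)  # True: lossless round trip
```

    Less than half the bytes on the wire, and the receiver does no digit parsing at all, just a byte-order-aware copy.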

    --Jeff Moden



  • Chris Harshman - Friday, September 15, 2017 7:39 AM

    Depends on the industry.  In advertising, pipes, hyphens, and even underscores are commonly entered by end users on the front end, which forces you either to escape them in a pipe-delimited export or to have the application reject them.

    This is why I use þorn (þ) over the pipe.

  • xsevensinzx - Friday, September 15, 2017 6:00 PM

    If I had my preference, I'd use CHAR(31) for the delimiter and CHAR(30) for the "end-of-line".
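    Those two characters still work exactly as designed, and nothing in ordinary data needs escaping:

```python
US, RS = chr(31), chr(30)  # ASCII unit separator and record separator

rows = [["1", "Acme", "Detroit"], ["2", "Zen|ith", "Chi,cago"]]
payload = RS.join(US.join(row) for row in rows)

# Pipes, commas, tabs, and quotes in the values need no quoting at all:
decoded = [record.split(US) for record in payload.split(RS)]
print(decoded == rows)  # True
```

    Because CHAR(30) and CHAR(31) never appear in human-entered text, the split is unambiguous with no quoting rules, which is the whole appeal over pipe or comma.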

    --Jeff Moden


