SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


The struggle to renormalize data


The struggle to renormalize data

Author
Message
Phil Factor
Phil Factor
SSCertifiable
SSCertifiable (6.8K reputation)SSCertifiable (6.8K reputation)SSCertifiable (6.8K reputation)SSCertifiable (6.8K reputation)SSCertifiable (6.8K reputation)SSCertifiable (6.8K reputation)SSCertifiable (6.8K reputation)SSCertifiable (6.8K reputation)

Group: General Forum Members
Points: 6841 Visits: 3050
Comments posted to this topic are about the item The struggle to renormalize data


Best wishes,

Phil Factor
Simple Talk
Dave Poole
Dave Poole
SSC-Insane
SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)

Group: General Forum Members
Points: 24089 Visits: 3483
In the old days of receiving data on 1/2" tape the list brokers used to supply the tapes with 1/2 page of A4 telling you the record format, encoding, number of records, record terminator, file terminator etc. May be, if you were lucky, you would get a data profile and explanation of enum values and codes embedded in the data.

The discipline of documenting the record structures seems to have been lost. Having spent the past year using Apache Spark to ingest JSON data into a data warehouse I can tell you that a significant amount of project time was wasted on software archaeology.

The theory of constraints tells us that there are very few constraints or for simplicity one constraint in a process. Time is wasted on other perceived constraints as our one constraint is the bottleneck. Has the development speed gained by using JSON for front-end systems been negated by the time taken to get the data into a BI system? Have we tackled the wrong constraint and in the wrong way?

Storing data in Apache Avro format does at least give us some form of schema for JSON data. Parquet format does arrays within its tabular format.
JSON is a convenient format for passing data between web services but as a bulk data ingestion medium I'm struggling to find it appropriate.

LinkedIn Profile
www.simple-talk.com
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)

Group: General Forum Members
Points: 337216 Visits: 42591
Personally, I think all of the "new" ways of sharing data are just stupid. XML, JSON, EDI, whatever. It's all stupid. If it's not tag bloated hierarchical crap that does need to be, its some other junk the you need a 100 page spec to write the code to disassemble the garbage into normalized tables. I was really excited when XML first came out with the promise of "not having to know the structure of the data". Nothing could be further from the truth.

I think back to when I was using DB2 and DB3 ("DBF" files) for a short while. THAT was amazing. If someone want to me to have the data in 3 related tables, we wouldn't go through all this hierarchical XML junk. They would just send me 3 files that I could attach to a database and it was a done deal because the header of the file (the DB III structure) had the column names and the datatypes described in it and the rest of the rows were data. ASCII in all it's glory sans tag bloat, hierarchical nonsense, or having to be careful about special characters.

Image result for make ascii great again hat

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
xsevensinzx
xsevensinzx
SSCertifiable
SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)

Group: General Forum Members
Points: 7172 Visits: 3311
Jeff Moden - Monday, September 11, 2017 11:04 PM
Personally, I think all of the "new" ways of sharing data are just stupid. XML, JSON, EDI, whatever. It's all stupid. If it's not tag bloated hierarchical crap that does need to be, its some other junk the you need a 100 page spec to write the code to disassemble the garbage into normalized tables. I was really excited when XML first came out with the promise of "not having to know the structure of the data". Nothing could be further from the truth.

I think back to when I was using DB2 and DB3 ("DBF" files) for a short while. THAT was amazing. If someone want to me to have the data in 3 related tables, we wouldn't go through all this hierarchical XML junk. They would just send me 3 files that I could attach to a database and it was a done deal because the header of the file (the DB III structure) had the column names and the datatypes described in it and the rest of the rows were data. ASCII in all it's glory sans tag bloat, hierarchical nonsense, or having to be careful about special characters.


Personally, trying to conform the world to one way of thinking or one tool or one format is extremely hard. It's better to focus on how you can cut through all the politics and come to some type of middle ground that bridges the gaps. Maybe not so much in smaller organizations where you have more control, but in larger ones, especially where you may have NoSQL and RDBMS teams, it can be counter productive to even think about it.

On the article, I do think it's easier said than done to bridge the worlds to using the same agreed upon schemas so they compliment each other and make things like plucking out data from one system and easily putting it into another system. This is kind of like where roles like mine should shine (the data architect) in trying to ensure that can happen or at least foresee that it WILL happen at some point down the line. But, it's extremely complicated much like it's even still extremely complicated to build a single data warehouse where everyone can agree on the final model and what questions or potentially new questions that warehouse should be answering for the business and it's units.

Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)

Group: General Forum Members
Points: 337216 Visits: 42591
xsevensinzx - Wednesday, September 13, 2017 1:09 AM
Jeff Moden - Monday, September 11, 2017 11:04 PM
Personally, I think all of the "new" ways of sharing data are just stupid. XML, JSON, EDI, whatever. It's all stupid. If it's not tag bloated hierarchical crap that does need to be, its some other junk the you need a 100 page spec to write the code to disassemble the garbage into normalized tables. I was really excited when XML first came out with the promise of "not having to know the structure of the data". Nothing could be further from the truth.

I think back to when I was using DB2 and DB3 ("DBF" files) for a short while. THAT was amazing. If someone want to me to have the data in 3 related tables, we wouldn't go through all this hierarchical XML junk. They would just send me 3 files that I could attach to a database and it was a done deal because the header of the file (the DB III structure) had the column names and the datatypes described in it and the rest of the rows were data. ASCII in all it's glory sans tag bloat, hierarchical nonsense, or having to be careful about special characters.


Personally, trying to conform the world to one way of thinking or one tool or one format is extremely hard. It's better to focus on how you can cut through all the politics and come to some type of middle ground that bridges the gaps. Maybe not so much in smaller organizations where you have more control, but in larger ones, especially where you may have NoSQL and RDBMS teams, it can be counter productive to even think about it.

On the article, I do think it's easier said than done to bridge the worlds to using the same agreed upon schemas so they compliment each other and make things like plucking out data from one system and easily putting it into another system. This is kind of like where roles like mine should shine (the data architect) in trying to ensure that can happen or at least foresee that it WILL happen at some point down the line. But, it's extremely complicated much like it's even still extremely complicated to build a single data warehouse where everyone can agree on the final model and what questions or potentially new questions that warehouse should be answering for the business and it's units.

It's not actually difficult to get the world to conform to a particular methodology. All you have to do is advertise it as the latest "must have to be cool" thing, publish a tool or two to make it "easy to use", and make it "self bandwagoning" to make it so that you're an unpopular "behind the times" person if you're not using it. Look what happened with XML (and now JSON, etc) even with all its disadvantages when it comes to transmission length, storage size, and lack of performance compared to other, much simpler, methods.


--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
ManicStar
ManicStar
SSCertifiable
SSCertifiable (5.1K reputation)SSCertifiable (5.1K reputation)SSCertifiable (5.1K reputation)SSCertifiable (5.1K reputation)SSCertifiable (5.1K reputation)SSCertifiable (5.1K reputation)SSCertifiable (5.1K reputation)SSCertifiable (5.1K reputation)

Group: General Forum Members
Points: 5083 Visits: 4201
I loathe json... don't see the point of it over XML or even flat files.

Of course i've only worked with it in POSH, which is a bit of a PITA, so i may be a bit biased Smile
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)

Group: General Forum Members
Points: 337216 Visits: 42591
ManicStar - Thursday, September 14, 2017 11:28 AM
I loathe json... don't see the point of it over XML or even flat files.

Of course i've only worked with it in POSH, which is a bit of a PITA, so i may be a bit biased Smile


You mentioned what you don't like. What method for large data transfers do you prefer?

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Ed Wagner
Ed Wagner
SSC Guru
SSC Guru (93K reputation)SSC Guru (93K reputation)SSC Guru (93K reputation)SSC Guru (93K reputation)SSC Guru (93K reputation)SSC Guru (93K reputation)SSC Guru (93K reputation)SSC Guru (93K reputation)

Group: General Forum Members
Points: 93596 Visits: 11140
Jeff Moden - Thursday, September 14, 2017 5:38 PM
ManicStar - Thursday, September 14, 2017 11:28 AM
I loathe json... don't see the point of it over XML or even flat files.

Of course i've only worked with it in POSH, which is a bit of a PITA, so i may be a bit biased Smile


You mentioned what you don't like. What method for large data transfers do you prefer?

That's an easy one for me - text files. I can read a 1 GB text file into staging, normalize a few things and write it to a permanent structure faster than the query to parse the XML would take to run. The files aren't as huge, it processes faster and it's relatively simple to do. Compared with XML, well...nevermind, there's no comparison. The problem with text files is that they aren't shiny, new, cool or hip. What some people call boring, I call efficient.

To the point of the excellent article, it can be difficult to renormalize data once it's denormalized into a structure for transfer. Granted, it can usually be done, especially if you're working with the group that's sending it to you, but it can be cumbersome. But when you stop and consider some of the stuff we do in T-SQL and how fast it runs, it is pretty amazing. For a historical perspective, most of us routinely carry around a flash drive in our pocket with more data than it took to put the first man on the moon.



Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)SSC Guru (337K reputation)

Group: General Forum Members
Points: 337216 Visits: 42591
Ed Wagner - Thursday, September 14, 2017 9:13 PM
Jeff Moden - Thursday, September 14, 2017 5:38 PM
ManicStar - Thursday, September 14, 2017 11:28 AM
I loathe json... don't see the point of it over XML or even flat files.

Of course i've only worked with it in POSH, which is a bit of a PITA, so i may be a bit biased Smile


You mentioned what you don't like. What method for large data transfers do you prefer?

That's an easy one for me - text files. I can read a 1 GB text file into staging, normalize a few things and write it to a permanent structure faster than the query to parse the XML would take to run. The files aren't as huge, it processes faster and it's relatively simple to do. Compared with XML, well...nevermind, there's no comparison. The problem with text files is that they aren't shiny, new, cool or hip. What some people call boring, I call efficient.

To the point of the excellent article, it can be difficult to renormalize data once it's denormalized into a structure for transfer. Granted, it can usually be done, especially if you're working with the group that's sending it to you, but it can be cumbersome. But when you stop and consider some of the stuff we do in T-SQL and how fast it runs, it is pretty amazing. For a historical perspective, most of us routinely carry around a flash drive in our pocket with more data than it took to put the first man on the moon.


Yup... I absolutely agree on the subject of text files being head and shoulders above any kind of markup language file. What were you referring to as "flat files"?

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Ed Wagner
Ed Wagner
SSC Guru
SSC Guru (93K reputation)SSC Guru (93K reputation)SSC Guru (93K reputation)SSC Guru (93K reputation)SSC Guru (93K reputation)SSC Guru (93K reputation)SSC Guru (93K reputation)SSC Guru (93K reputation)

Group: General Forum Members
Points: 93596 Visits: 11140
Jeff Moden - Thursday, September 14, 2017 10:35 PM
Ed Wagner - Thursday, September 14, 2017 9:13 PM
Jeff Moden - Thursday, September 14, 2017 5:38 PM
ManicStar - Thursday, September 14, 2017 11:28 AM
I loathe json... don't see the point of it over XML or even flat files.

Of course i've only worked with it in POSH, which is a bit of a PITA, so i may be a bit biased Smile


You mentioned what you don't like. What method for large data transfers do you prefer?

That's an easy one for me - text files. I can read a 1 GB text file into staging, normalize a few things and write it to a permanent structure faster than the query to parse the XML would take to run. The files aren't as huge, it processes faster and it's relatively simple to do. Compared with XML, well...nevermind, there's no comparison. The problem with text files is that they aren't shiny, new, cool or hip. What some people call boring, I call efficient.

To the point of the excellent article, it can be difficult to renormalize data once it's denormalized into a structure for transfer. Granted, it can usually be done, especially if you're working with the group that's sending it to you, but it can be cumbersome. But when you stop and consider some of the stuff we do in T-SQL and how fast it runs, it is pretty amazing. For a historical perspective, most of us routinely carry around a flash drive in our pocket with more data than it took to put the first man on the moon.


Yup... I absolutely agree on the subject of text files being head and shoulders above any kind of markup language file. What were you referring to as "flat files"?

My preference is tab-separated values, but CSV and fixed-width are fine. I process some that are fixed-width AND delimited by a semi-colon.



Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search