Serialising tabular and relational data.

Question

Serialising tabular and relational data.

Phil Factor

SSC-Insane

Points: 20244
More actions
March 12, 2011 at 11:34 am

#150052

Comments posted to this topic are about the item Serialising tabular and relational data.
Best wishes,
Phil Factor

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

djackson 22568 SSChampion Points: 11733 More actions · Answer 1

The universities would distain such a practical task.

Oh please, let's not involve the universities! Seriously, I am tired of the better than you attitude of American universities. From trying to control the political spectrum to engaging in overcharging students so professors can spend their time chasing dreams - I do not see much value in them. My opinion only, but I prefer for working people, who actually understand what it means to hold a job, to come up with solutions to problems.

As to your concerns, I just don't get it. I move data from system to system reliably every day. Between CSV and XML, not to even mention EDI, I am able to share data between systems with very little effort. One of my systems even allows for feeding PDF, DOC, DOCX, RTF, JPG, TIFF and other formatted documents into it with very little effort. Mostly I work with SQL Server, but I have moved data from systems before where I had no access to the relational model, and had to examine the data to determine what to do to create what I needed. In that case I actually restored data that the vendor was unable to find.

For those systems that don't support importing data natively, why are we buying them? For those that don't support exporting data, what do we expect? Should a vendor provide a product that allows you to extract the data in order to move to another vendor's product? We may want that, but do you really expect a company to ease your transition to a competitor?

Admittedly my experience where I work is going to be different than in another industry, but I wonder if the issue is one of true need. Nobody said being a DBA would be easy.

Dave

Jeff Moden SSC Guru Points: 1003861 More actions · Answer 2

djackson 22568 (3/12/2011)
The universities would distain such a practical task.
Oh please, let's not involve the universities! Seriously, I am tired of the better than you attitude of American universities. From trying to control the political spectrum to engaging in overcharging students so professors can spend their time chasing dreams - I do not see much value in them. My opinion only, but I prefer for working people, who actually understand what it means to hold a job, to come up with solutions to problems.

Why the hell are you picking on American universities? The word "universities" was used as a general term in the article. All universities in all countries breed their own generations of ring knockers. 😉

As to your concerns, I just don't get it. I move data from system to system reliably every day. Between CSV and XML, not to even mention EDI, I am able to share data between systems with very little effort. One of my systems even allows for feeding PDF, DOC, DOCX, RTF, JPG, TIFF and other formatted documents into it with very little effort. Mostly I work with SQL Server, but I have moved data from systems before where I had no access to the relational model, and had to examine the data to determine what to do to create what I needed.

Now THAT I agree with!

For those systems that don't support importing data natively, why are we buying them? For those that don't support exporting data, what do we expect? Should a vendor provide a product that allows you to extract the data in order to move to another vendor's product? We may want that, but do you really expect a company to ease your transition to a competitor?

Hear here and spot on!

Nobody said being a DBA would be easy.

BWAA-HAAA!!!! Especially when XML is floating around. 😀

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1003861 More actions · Answer 3

Phil, I have to mostly agree with djackson above. The editorial almost sounds like whining about the disparities of transmittable data formats. A "universal" data format has been tried and failed miserably more times than I care to mention. True CSV. EDI. XML. All failures in my eyes and all tremendous successes in the eyes of those that praise them. Heh... why not try something like DBIII table files which was actually better than any format I've come across so far because it had built in metadata by column instead of by element like XML does.

Do I think we should keep looking for a more universal, metadata and format rich format? Absolutely because, right now, you can't even just copy and paste between MS products without something going wrong. :hehe:

In the meantime, let the spirit of competition thrive in the data transmission arena. To do otherwise is almost like wanting all SQL languages to be the same... there’d be no competition and no incentive for improvement. 🙂

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Phil Factor SSC-Insane Points: 20244 More actions · Answer 4

Do I think we should keep looking for a more universal, metadata and format rich format? Absolutely because, right now, you can't even just copy and paste between MS products without something going wrong.

I couldn't agree more. I don't know why you think I'm whining. I'm merely pointing out that there is no reasonable standard way of doing this.I just want to shake you guys out of your Comedy-Limited mentality for sharing tabular data. It isn't good enough. Do we want another generation to wince when asked to export a report to a spreadheet? Does anybody seriously attempt to suggest that XML is a reasonable attempt at doing so? CSV, bless it, has no built-in way of transmitting the metadata and has design flaws (e.g. the decimal comma, and the 'escaping' of the end-of-line)

Given just a little bit of encouragement, I think might shall just go away and produce a standard myself. After all, if Doug Crockford can produce JSON virtually single-handed, then it must be achievable.

Best wishes,
Phil Factor

Jeff Moden SSC Guru Points: 1003861 More actions · Answer 5

Phil Factor (3/13/2011)
Do I think we should keep looking for a more universal, metadata and format rich format? Absolutely because, right now, you can't even just copy and paste between MS products without something going wrong.
I couldn't agree more. I don't know why you think I'm whining. I'm merely pointing out that there is no reasonable standard way of doing this.I just want to shake you guys out of your Comedy-Limited mentality for sharing tabular data. It isn't good enough. Do we want another generation to wince when asked to export a report to a spreadheet? Does anybody seriously attempt to suggest that XML is a reasonable attempt at doing so? CSV, bless it, has no built-in way of transmitting the metadata and has design flaws (e.g. the decimal comma, and the 'escaping' of the end-of-line)
Given just a little bit of encouragement, I think might shall just go away and produce a standard myself. After all, if Doug Crockford can produce JSON virtually single-handed, then it must be achievable.

Ah. Bloody hurriedly-typed written word. I didn't mean to make it sound like your were personally whining. It's a whine throughout the industry hence attempts at a universal protocol such as EDI, XML, and, now, JSON. My most sincere apologies.

I agree... we've a long way to go and things like CSV (comedy or true) and XML aren't exactly the bees-knees when it comes to meta-data or even the data itself.

Actually, my ol' friend, this conversation may have done just what you've intended. I've had an epiphany that just knocked me out of my chair and, now, I have some research to do. 🙂 As you say, if Doug Crockford can do it, so can you and I. :w00t:

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

GPO SSCarpal Tunnel Points: 4574 More actions · Answer 6

Oh ye of little faith.

HL7[/url] is the answer to your prayers as surely as a wagon needs a cummerbund. Quiet please. This is important. You must consider the HL7 protocol. It slices. It dices. It makes ShamWOW look like... a dishcloth. It can cure your halitosis and lumbago too. Old Septimus Cragg from down Cornpone Road had twenty-five years struck down with the palsy, but two weeks with HL7 and he was back climbing the walls again. Jumpy as he ever was, people. Tis no lie, I speaks here today.

To all the naysayers, the blaggards, the snarks and the scoundrels who wrongly accuse dear, dear HL7 of putting the electronic health record back 25 years, I ask you this, my poor myopic impatient friends: "What other transmission protocol gives you a free set of steak knives with every brace of pipe delimiters?"

Not only is the capacious, capricious, even carpaccious HL7 changing the world of healthcare forever, in ways never even dreamt of at its conception, it is doing so with style, sophistication, and, dare I say it, ... love. How many other interoperability protocols can boast accreditation under the ANSI Love (version 1.1 draft) standard? None frankly.

My detractors will call me a zealot. To that charge I plead GUILTY. True zealotry only comes when you have something this good. Scrutiny, rational evaluation, utility... these feeble irritations must be relegated to the status of impudent impediment. The game is now about belief my friends. When was the last time you had interoperability you could truly believe in.

Ladies and gentlemen I give you... HL7.

...One of the symptoms of an approaching nervous breakdown is the belief that ones work is terribly important.... Bertrand Russell

Michael.Beeby Ten Centuries Points: 1282 More actions · Answer 7

Michael.Beeby

Ten Centuries

Points: 1282

March 14, 2011 at 4:07 am

#1298246

Aka http://www.hl7.org/ (http://http//www.hl7.org/).

dfalso SSC-Addicted Points: 491 More actions · Answer 8

It sounds like you guys haven't worked much with 3rd-party data vendors and their data. If you had, you'd appreciate how big an issue this is. Widespread use of a text file format with metadata (so not CSV), scaling well (so not XML), and supporting long file names (so not dBase) would be a huge win. I don't mean to come off as hostile in the following, but some of the tone was...not nice. Anyway, to reply:

djackson 22568 (3/12/2011)

As to your concerns, I just don't get it. I move data from system to system reliably every day. Between CSV and XML, not to even mention EDI, I am able to share data between systems with very little effort. One of my systems even allows for feeding PDF, DOC, DOCX, RTF, JPG, TIFF and other formatted documents into it with very little effort.

Well, yeah, because you're moving data between your systems. So you can go look up metadata (heck, you can even query your metadata schema in the source system, if it's open) and know everything there is to know. But imagine if you couldn't - if it's a closed system.

djackson 22568 (3/12/2011)

...but I have moved data from systems before where I had no access to the relational model, and had to examine the data to determine what to do to create what I needed. In that case I actually restored data that the vendor was unable to find.

You probably haven't worked with very extensive data, then. My last project was with US Census ACS data: ~20,000 fields, CSV format. The only thing they documented was that each was a number of some kind. Try examining that data by hand and see how long that takes you. And with most data vendors, it's take-it-or-leave-it, so you're left to figure out formats yourself.

djackson 22568 (3/12/2011)

For those systems that don't support importing data natively, why are we buying them? For those that don't support exporting data, what do we expect? Should a vendor provide a product that allows you to extract the data in order to move to another vendor's product? We may want that, but do you really expect a company to ease your transition to a competitor?

Oftentimes one doesn't have a choice. Whether it's a vendor your CIO/CTO/whoever has decided to use, a product that's entrenched, or a price issue, it happens. And it's especially true in the data vendor world. There aren't going to be alternatives to government data.

And yes, I do expect something like SQL Server to export to different formats. Realistically places have a mish-mosh of systems, platforms, analysis software, etc. For instance, we need to migrate data from our internal analysis SQL Servers to an offsite linux Postgres database for the web server.

-Dom

rdxman SSC Journeyman Points: 84 More actions · Answer 9

You've been reading my mind. I've been thinking about this issue for quite a while. In fact I made a brief start on the project before Christmas. The format I propose is called RDX - for Relational Data eXchange. It is backwards compatible with many existing delimited file formats - even some (but not all) CSV files would qualify.

The basics of it are this (in no particular order):

- You can use either a comma, pipe (|) or tab as a delimiter

- The delimiter is undefined until one of the potential delimiters is read in a heading line

- Blank lines are completely ignored

- Records cannot span lines - i.e. the eol character(s) always specify the end of the line

- Every data line must contain the correct number of delimiters for the current table definition

- The format supports either a single table per file, in which case the first non-blank line is regarded as a line of headings

- Or multiple tables per file, in which case the table names are prefixed with ">>" as the first 2 characters of the line, and column lines which are prefixed with a ">". Any other non-blank lines are regarded as data lines.

For example:

[font="Courier New"]>>People

>PersonID,FirstName,LastName,DateOfBirth

1,Fred,Smith,1999-11-22

2,Mary,Hussain,1998-02-03

>>BooksOnLoan

>PersonId,BookTitle

1,Gone with the wind

1,Dr No

2,Catcher in the Rye[/font]

The (optional) metadata would be held in the file itself using reserved table names. Something like the information_schema views of the SQL standard. I've not yet defined these. We would also need to use some sort of escape code to enable the special characters to be included in the data where necessary.

I have created a website www.rdxman.com that I will use to define the standard further, but I'm interested in any thoughts others may have.

Chris Harshman SSC-Forever Points: 42192 More actions · Answer 10

While it would be nice to have a format that meets all of Phil's expectations, I don't know how practical it would be for vendors to create and support such a format for what you would gain from such a standard. The biggest problem I've had in my 17 years in IT, is that the tools we use for ETL, in an attempt to become more powerfull or flexible, have actually made it harder to work with the formats we do have available.

CSV, tab, pipe symbol delimited files, or even fixed width files may not have metadata and may not support unicode, but are still frequently used because they are the lowwest common denominator, and will do for the vast majority of situations. However, thinking back to the early-mid 90's in my FoxPro days, it seemed you could read in and export out these text files, or even spreadsheets or other simple formats quite easily compared with trying to do the same thing in DTS with SQL Server, and is even more difficult in SSIS. The ETL tools are what is failing us and make our lives more difficult than it needs to be

Phil Factor SSC-Insane Points: 20244 More actions · Answer 11

Well, what I have in mind is a strategic attack on the problem, once we are agreed as to what the problem is, and we find enough enthusiastic people to help out.

1/ First phase. Work towards a standard that is simple and resilient, by discussion and experiment. Use the JSON philosophy of keeping the standard as bullet-proof as possible. I think that one might even get a backward compatibility with the real CSV. Test it out with code that we can throw together quickly; get something that meets the core requirements well. Learn from history (e.g. the over-complexity of YAML, the ridiculous volubility of XML). Try informal tests such as copying AdventureWorks to PostgreSQL, and back again, and then comparing them.

2/ Define the first draft standard. Test it until it really hurts.

3/ Publish v1 of the draft standard

4/ Create good parsers in a variety of languages, and ways of testing files for correctness.

5/ Develop ETL tools for SQL Server and other RDBMS, and conversion tools.

6/ Campaign for filters for the main office tools.

7/ When it becomes worthwhile for ordinary folks to use it then publicize it.

I have certain questions

Should the standard support merged cells (by row and columns) to support spreadsheet import/export and representation as HTML tables.

Should a file allow more than one 'result' or table?

What form should foreign references take?

How should one specify constraints?

...and so on...

Best wishes,
Phil Factor

rdxman SSC Journeyman Points: 84 More actions · Answer 12

Phil, I think I agree with your approach, and I would count myself as one of the enthusiastic people.

To have my say on your questions:

- I wouldn't support merged cells. I think it would cause too much confusion in situations that didn't involve spreadsheets.

- The format should allow more than one table, otherwise it's not adding enough value

- Foreign key references and constraints should themselves take the form of tables in the file.

Here's another example:

>>RdxForeignKeys

>PkTable|PkColumn|FkTable|FkColumn

Authors|AuthorId|Books|AuthorId

>>Authors

>AuthorId|Name|Address

1|Joe Brighton|2 Simple Terrace[nl]Manchester

2|Mary Bondi|18 Ocean Road[nl]Sydney

3|Sam Shore|21 Breaker Way[nl]Adelaide

>>Books

>BookId|Title|AuthorId

1|The Road to Brighton|1

2|Brighton and beyond|1

3|Back to Bondi|2

4|By the shore|3

This example illustrates a few more points about the RDX format:

- The foreign key constraints are defined by the reserved table RdxForeignKeys. You can see that the AuthorId in the Books table refers to the AuthorId in the Authors table.

- Other constraints (datatypes, primary keys, value ranges) could similarly be defined by other reserved tables defined in the standard. This has the benefit of not mixing up the formatting of data within the file. Everything just remains a table.

- Blank lines are supported by the format by being completely ignored. They are useful for separating tables though.

- The [nl] escape sequence represents my current thinking on how to embed a new line character within a record. I prefer it to as it seems easier to read, which I think is a key requirement for the standard. Other escape sequences would be needed to represent the other special characters in the format.

Chris Anderson-432071 SSC Enthusiast Points: 134 More actions · Answer 13

I think that was EDI was/is/was supposed to be. Including the metadata (i.e. XSD) along with the data seems to conflict with the file being compact, and easy to be read by commodity tools. For many of my needs, where 100's of the same signature data is being processed, it's less about communicating the meta data with each file as much as asserting that the file meets some standard of metadata.

With this standard that is being proposed, how about the ability to either 1) include the metadata inline the file, or 2) reference a URI for the metadata?