Representing XML data in SQL Server

  • Comments posted to this topic are about the item Representing XML data in SQL Server

  • This is a nicely written article.  Just as a suggestion for future articles, it would have been great if you identified where you got the Sales and SaleItems data from or included it as readily consumable data for people to experiment and confirm with.  If you come up with a link to get the data or want to append some code to create and populate simple test tables, send a note to Steve and I'm sure he'll be glad to oblige.
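
    In the meantime, for anyone who wants to follow along, here's a minimal guess at what such tables might look like (the column names are invented for illustration, not taken from the article):

    CREATE TABLE Sales
    (
        SaleID   int IDENTITY(1,1) PRIMARY KEY,
        SaleDate date NOT NULL,
        Customer varchar(50) NOT NULL
    );

    CREATE TABLE SaleItems
    (
        SaleItemID int IDENTITY(1,1) PRIMARY KEY,
        SaleID     int NOT NULL REFERENCES Sales (SaleID),
        Product    varchar(50) NOT NULL,
        Quantity   int NOT NULL,
        Price      decimal(10,2) NOT NULL
    );

    INSERT INTO Sales (SaleDate, Customer)
    VALUES ('2018-08-01', 'Acme'), ('2018-08-02', 'Globex');

    INSERT INTO SaleItems (SaleID, Product, Quantity, Price)
    VALUES (1, 'Widget', 2, 9.99), (1, 'Gadget', 1, 19.99),
           (2, 'Widget', 5, 9.99);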

    Shifting gears a bit, from the article...

    However, there are situations where it is necessary to visualize and export the results in a more structured format that can be read by individuals with little technical knowledge


    In direct contrast to what that states, I've found that XML is one of the worst formats for readability whether a person has technical knowledge or not.  There is very little that's actually easier to understand than a typical row/column/tabular format possibly accompanied by a representative chart or two.

    Then, there's this, also from the article...

    In general, when exporting data to other applications, you should follow an XML formation scheme.

    While I agree that presenting data in another format is sometimes necessary, I can't agree that exporting data with XML does anything that's actually good.  It's bloated with repetitive tags (doubled for "Elements") and is so verbose that the bulk of the time to display the data is spent on decoding the XML just to extract the data.  It's also frequently 8 to 16 times longer byte-wise (my personal observation), so it makes data transmissions take longer, takes more memory to process and display, and needs much more disk space to store.
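
    To make the tag bloat concrete, here's a throwaway query (my own contrived example, not from the article) and what it emits:

    SELECT SaleID = 1, Product = 'Widget', Price = 9.99
    FOR XML PATH('Sale'), ROOT('Sales');

    -- Result: every value is wrapped in an opening AND a closing tag:
    -- <Sales><Sale><SaleID>1</SaleID><Product>Widget</Product><Price>9.99</Price></Sale></Sales>
    -- The same row as TSV is just:  1 {tab} Widget {tab} 9.99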

    Just because millions of people are doing it, doesn't make it right.  XML is a prime example of that.

    p.s.  Ah... there IS one good thing about using XML... because millions of people have kowtowed to using XML, it has driven the industry to make equipment much faster and hard disks much larger, faster, and cheaper.  They had to, because XML is so very inefficient in all ways. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Personally, I find XML (with an XML parser to properly do the indentation) much easier to read than JSON, and it is a shame that JSON seems to be the default format for DTO activities such as API results.  Glad that MS finally saw the light and included JSON in 2016/vNext.
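
    For reference, the 2016+ syntax mirrors FOR XML.  A quick sketch against a hypothetical Sales table:

    SELECT SaleID, SaleDate, Customer
    FROM   Sales
    FOR JSON PATH, ROOT('sales');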

    A couple of things to be aware of with XML.  All of the nested SELECT statements effectively run independently, so they are a hidden RBAR.  If you have a structure like

    <Customers>
      <Customer>
        ...
        <Invoices>
          <Invoice>
            ...
            <Lines>
              <Line>
                ...
                <SerialNumbers>
                  <SerialNumber>
                    ...

    Then that is a lot of processing overhead.
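
    That shape typically comes from correlated subqueries like the sketch below (table names invented), and each inner FOR XML runs once per outer row, which is where the hidden RBAR lives:

    SELECT  c.CustomerID,
            (SELECT i.InvoiceID,
                    (SELECT l.LineID, l.Qty
                     FROM   dbo.InvoiceLines l
                     WHERE  l.InvoiceID = i.InvoiceID
                     FOR XML PATH('Line'), ROOT('Lines'), TYPE)
             FROM   dbo.Invoices i
             WHERE  i.CustomerID = c.CustomerID
             FOR XML PATH('Invoice'), ROOT('Invoices'), TYPE)
    FROM    dbo.Customers c
    FOR XML PATH('Customer'), ROOT('Customers');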

    Secondly, if you have a really large dataset that you have output as a single XML results field, you might have issues saving the results.  I once created a file for 25,000 rent letters with up to 5 levels of nested data; the resulting XML file was 1.4M elements and wouldn't fit into memory.  I ended up having to save the letter fragments to a temp table and then BCP the table contents to a file.
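
    The workaround looked roughly like this (table, column, and server names below are placeholders, not the real ones):

    -- Stage one XML fragment per letter instead of building one giant document:
    SELECT  r.LetterID,
            Fragment = (SELECT r.TenantName, r.Balance
                        FOR XML PATH('Letter'))
    INTO    dbo.LetterStage
    FROM    dbo.RentLetters r;

    -- Then stream the fragments out from the command line:
    -- bcp "SELECT Fragment FROM MyDb.dbo.LetterStage ORDER BY LetterID" queryout letters.xml -c -T -S MyServer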

  • Very useful article to understand the XML conversion options available in SQL. Concise examples are the key.

    I agree with Jeff in his views around XML.  Bloated and not that easy to interpret for a human being, but it accomplished its objective of becoming a common standard.  The equivalent of the IRS here uses it to represent our invoices and record keeping.

  • I appreciate this article and thank you for writing it. In my current job they store XML data in several databases. The curious thing, to me, is that the XML data is never stored as XML data. It is always stored as either a large NVARCHAR or maybe NVARCHAR(MAX). I do not know why. In some cases I imagine it was done before SQL Server supported the XML data type. But I know that's not true in other cases. In other cases they're creating new databases with XML data, but still storing it as NVARCHAR. Anyway, thank you for this article, as it helps me understand how XML probably should be handled in SQL Server.
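
    For what it's worth, XML stored in NVARCHAR can still be queried by casting it on the fly.  A minimal sketch with made-up table and column names:

    -- Query XML that was stored in an NVARCHAR(MAX) column:
    SELECT  CAST(d.Payload AS xml).value('(/Order/Total)[1]', 'decimal(10,2)') AS OrderTotal
    FROM    dbo.Documents d;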

    Kindest Regards, Rod
    Connect with me on LinkedIn.

  • There is no question that XML causes the data files to bloat.  On the other hand, it is much less ambiguous than swapping CSV files with knuckleheads who don't know how to deal with qualifying fields, embedded delimiters, columns with multi-line text values, and so on.  Have you ever dealt with someone who thought it was your fault that your code couldn't handle \" instead of "" in a quote-qualified field?

    One application I dealt with that benefited from XML data files was downloading reports (as XML) from a web service, where the report columns were subject to change without notice.  An XML query could return all the field names from the first record, which were used to create a dynamic SQL query to read the columns in that report.  The data columns came from a limited domain of possibilities, so a data dictionary table was used to provide data types for each column.  The biggest argument against CSV in this case was text columns with multi-line values, that were guaranteed to include commas, quotes, tabs, line feeds, and other obnoxious characters.
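
    The field-name extraction can be done with a sketch like this (document shape and names invented for illustration):

    DECLARE @report xml = N'<report>
      <record><CustomerName>Acme</CustomerName><Amount>12.50</Amount></record>
      <record><CustomerName>Globex</CustomerName><Amount>8.00</Amount></record>
    </report>';

    -- Pull the column names from the first record only:
    SELECT  ColumnName = n.value('local-name(.)', 'nvarchar(128)')
    FROM    @report.nodes('/report/record[1]/*') AS t(n);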

    In another case I had to replicate tables from an old Progress system to an SQL 2005 server.  The ODBC connection between the servers had issues; querying a large table created so many locks in the Progress database that it would blow up.  But Progress code dumping the table to a text file ran efficiently and without killing the server.  (The reason for not using CSV escapes me at the moment.)  The largest table was too big for either server to digest at once as an XML DOM object, but I could write a Progress program to export the data to a file as a text stream in XML format, and in SSIS I wrote a script task that could read the file token-by-token with the XmlReader class.  I'm not sure I want to write generic input code to read any possible XML file, but I controlled the source file format and thus could simplify the parsing.  There's no rule that putting the entire XML file into memory at once is the only way to deal with it.

  • Jeff Moden - Tuesday, August 7, 2018 6:37 AM

    This is a nicely written article.  Just as a suggestion for future articles, it would have been great if you identified where you got the Sales and SaleItems data from or included it as readily consumable data for people to experiment and confirm with.  If you come up with a link to get the data or want to append some code to create and populate simple test tables, send a note to Steve and I'm sure he'll be glad to oblige.

    Shifting gears a bit, from the article...

    However, there are situations where it is necessary to visualize and export the results in a more structured format that can be read by individuals with little technical knowledge


    In direct contrast to what that states, I've found that XML is one of the worst formats for readability whether a person has technical knowledge or not.  There is very little that's actually easier to understand than a typical row/column/tabular format possibly accompanied by a representative chart or two.

    Then, there's this, also from the article...

    In general, when exporting data to other applications, you should follow an XML formation scheme.

    While I agree that presenting data in another format is sometimes necessary, I can't agree that exporting data with XML does anything that's actually good.  It's bloated with repetitive tags (doubled for "Elements") and is so verbose that the bulk of the time to display the data is spent on decoding the XML just to extract the data.  It's also frequently 8 to 16 times longer byte-wise (my personal observation), so it makes data transmissions take longer, takes more memory to process and display, and needs much more disk space to store.

    Just because millions of people are doing it, doesn't make it right.  XML is a prime example of that.

    p.s.  Ah... there IS one good thing about using XML... because millions of people have kowtowed to using XML, it has driven the industry to make equipment much faster and hard disks much larger, faster, and cheaper.  They had to, because XML is so very inefficient in all ways. 😀

    I feel sheepish disagreeing with a SQL legend but here goes:  Regarding the point "XML is one of the worst formats for readability whether a person has technical knowledge or not", I think you need to keep context in mind.  XML is not a good replacement for tables - which I agree are superior for displaying data.  Rather, XML is the superior and more readable option when outputting data into files.  When you have to open a file for examination, the XML file format is easier to read than the CSV equivalent.  That is the context that XML is usually used in: containerizing information, typically for consumption.

    For what it's worth, I agree with your bloated point; this may partially explain JSON's popularity.

    Regarding the article I say "well done."  I certainly learned something.

  • Jeff Rivera - Tuesday, August 7, 2018 6:10 PM

    I feel sheepish disagreeing with a SQL legend but here goes:  Regarding the point "XML is one of the worst formats for readability whether a person has technical knowledge or not", I think you need to keep context in mind.  XML is not a good replacement for tables - which I agree are superior for displaying data.  Rather, XML is the superior and more readable option when outputting data into files.  When you have to open a file for examination, the XML file format is easier to read than the CSV equivalent.  That is the context that XML is usually used in: containerizing information, typically for consumption.

    For what it's worth, I agree with your bloated point; this may partially explain JSON's popularity.

    Regarding the article I say "well done."  I certainly learned something.

    Thank you for the kind words but I'm just like everyone else... not a legend.  I just try to be helpful.  Or, to be more truthful, I aim to please... I sometimes miss but I'm always aiming. 😀

    The following isn't an argument... It's just my humble opinion based on my own humble experiences.

    I believe that what you're referring to as ease of readability in XML files is when you have to look at data halfway or more through a 10,000-"record" file (for example) and the file has a lot of columns.  Trying to manually discern which column is which in a CSV or even a TSV file is admittedly a bugger, and I do agree that properly rendered/formatted XML (which bloats it even more in a lot of cases unless you have a tool that automatically does the proper indentation, etc.) can make life easier in such cases.

    However, if you're in the business of working with such files on a regular basis and they are CSV or TSV or however delimited, then you should also have the right tools to examine the files.  That's IF you need to examine the files.  How many times does someone need to actually examine a file at the "element" level?  In a previous position, I was responsible for the successful download, import, validation, and final merge of hundreds of files per day.  Each file consisted of about 30,000 rows and was between 50 and 800 columns wide and varied from hour to hour (DoubleClick.net files) even for the same "client" source.  Imagine the sheer volume of data that was.  The files were (fortunately) all TSV files.  I think I may have looked at a total of 5 of them over an 18 month period and I certainly didn't use NotePad, TextPad, or NotePad++ or any similar tool to do so.

    I developed a system that would read the "two row" headers in each file to figure out what they were (each file had the same first 9 columns as a reference) and I was doing the full monty (download, import, validate, merge) at a rate of 8 files every 2 minutes.  Remember that the number and names of columns would change incessantly.

    Now, imagine the size of those files if they were XML.  Then also try to imagine automatically flattening those files so that they could be unpivoted based on pairs and quads of columns and aligned with the first 9 columns using XML and also imagine how long that may take.

    I know, I know... I'm one of those technical wire-heads that don't actually need to look directly at data in the files themselves and, when I do, I have the tools to do so.  I'm not one of those non-technical people that might want or need to look at the data in the file directly.

    So why would I be required to use XML just because someone else doesn't know what the heck they're doing?  Why would anyone send such volumes of tag-bloated, slow-transmit (8 to 16 times slower in this case), slow-to-parse data just on the off chance that someone non-technical might want to look at the data directly in the file?

    Heh... and I've seen non-technical people look for something in XML files... the term "Deer in Headlights" figures prominently when I watch them try to figure it out.  Most of the ones that I've worked with can't even figure out where one "record" ends and another starts.

    Do I think XML is helpful when looking at execution plans?  In some cases, sure, but it is hierarchical data to begin with.  Not data that belongs in one or more two dimensional tables.

    So, to summarize... I'm not totally against XML.  It DOES have its uses, but the word "uses" should never be taken to mean "everywhere".  And, IMHO, it should never be used to transmit data for import into databases; I don't care who thinks they need to read the files with the ol' Mark I Mod 1 Eyeball. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thank you for this article Diogo, nice write-up.
    😎

    Quick thoughts: I think it would benefit from slightly more technical detail on SQL Server's implementation of XML, such as schema collections and the difference between typed and untyped XML. On the storage side, the method used to store XML in SQL Server is a binary representation of the XML, not an identical copy. This saves roughly 20% compared to storing the XML in text files.
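
    To illustrate the typed/untyped distinction, a minimal sketch (the schema itself is made up):

    -- A typed XML column is validated against a schema collection on every insert:
    CREATE XML SCHEMA COLLECTION SaleSchema AS
    N'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:element name="Sale">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Total" type="xs:decimal"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:schema>';

    CREATE TABLE dbo.TypedSales
    (
        SaleID  int IDENTITY(1,1) PRIMARY KEY,
        SaleDoc xml(SaleSchema)  -- untyped would be plain "xml"
    );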

  • Eirikur Eiriksson - Tuesday, August 7, 2018 11:49 PM

    Thank you for this article Diogo, nice write-up.
    😎

    Quick thoughts: I think it would benefit from slightly more technical detail on SQL Server's implementation of XML, such as schema collections and the difference between typed and untyped XML. On the storage side, the method used to store XML in SQL Server is a binary representation of the XML, not an identical copy. This saves roughly 20% compared to storing the XML in text files.

    Uh huh... what about during data transmission?  And, I could certainly be incorrect, but considering that the XML text files are 8 to 16 times larger than delimited text to begin with, I'm thinking that's not much of a savings.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Wednesday, August 8, 2018 12:12 AM

    Eirikur Eiriksson - Tuesday, August 7, 2018 11:49 PM

    Thank you for this article Diogo, nice write-up.
    😎

    Quick thoughts: I think it would benefit from slightly more technical detail on SQL Server's implementation of XML, such as schema collections and the difference between typed and untyped XML. On the storage side, the method used to store XML in SQL Server is a binary representation of the XML, not an identical copy. This saves roughly 20% compared to storing the XML in text files.

    Uh huh... what about during data transmission?  And, I could certainly be incorrect, but considering that the XML text files are 8 to 16 times larger than delimited text to begin with, I'm thinking that's not much of a savings.

    Of course it does depend on the structure of the XML, whether it's element- or attribute-bound, etc., but because of the binary storage, there is little difference between short and long element and attribute names.
    😎

    When it comes to the storage size comparison, let's include the final target, the SQL Server database 😉
    Here is a recent example (SQL Server 2017 RTM):

    XML files on disk 42.7 MB
    XML content in XML column 36.8 MB
    The content parsed into a normalized and optimized schema 41.3 MB
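
    For anyone who wants to reproduce that kind of comparison on their own data, DATALENGTH reports the stored size of both representations (the sample document here is arbitrary):

    -- Compare the text size of a document with its size in SQL Server's binary XML storage:
    DECLARE @txt nvarchar(max) =
        N'<Sales><Sale><SaleID>1</SaleID><Amount>19.99</Amount></Sale></Sales>';

    SELECT  TextBytes = DATALENGTH(@txt),
            XmlBytes  = DATALENGTH(CAST(@txt AS xml));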

  • You confuse the features of the desktop application SSMS with SQL Server.  SSMS != SQL Server.
    Most, if not all, relational database managers return row-sets formatted according to a protocol standard, such as TDS for MS SQL Server.

    The desktop application SSMS displays that row-set in one of several formats depending on the user's chosen settings.  Other applications (e.g. SQLCMD, OSQL, BCP, custom web pages, ...) display the row-set in different formats.

    When your query tells the server to format the output as an XML string, that is what it returns.  That XML string will then be formatted for display by the application (SSMS, a web page, WinForm, etc.).

    It is important to be clear about the relationship and features of the various clients as opposed to the server itself.
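
    A small illustration of that separation (a contrived query): the server returns either a string or an xml-typed value, and what you see after that is purely up to the client.

    SELECT (SELECT 1 AS SaleID FOR XML PATH('Sale'))       AS AsString,  -- nvarchar(max)
           (SELECT 1 AS SaleID FOR XML PATH('Sale'), TYPE) AS AsXmlType; -- xml data type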

  • Jeff's experience notwithstanding, I've actually had good experiences with XML. But then, I was the one who established the format and oversaw its implementation and use. I have seen XML horribly used, and most default XML exports are terribly bloated because they don't utilize attributes and treat everything as an element. Situations like this are why XML gets such a bad reputation. It's also why JSON gets trumpeted as a "superior" format. However, while a well-crafted XML format can be validated (through an XML schema definition, or XSD) for structure and content, JSON cannot. Yet the very things that give XML its great flexibility are generally why it gets abused, misused, and horribly twisted. I have seen some XML scenarios that just make me cringe (HL7 v3 is a prime example) due to over-complexity and generally poor design.
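
    The attribute/element difference is easy to see side by side (a contrived row; FOR XML RAW emits attributes by default and elements only with the ELEMENTS directive):

    SELECT 1 AS SaleID, 19.99 AS Amount FOR XML RAW('Sale');
    -- <Sale SaleID="1" Amount="19.99"/>

    SELECT 1 AS SaleID, 19.99 AS Amount FOR XML RAW('Sale'), ELEMENTS;
    -- <Sale><SaleID>1</SaleID><Amount>19.99</Amount></Sale>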

  • Nice, concise review of XML handling in SQL Server!  Thanks for the article.

  • Eirikur Eiriksson wrote:

    Jeff Moden - Wednesday, August 8, 2018 12:12 AM

    Eirikur Eiriksson - Tuesday, August 7, 2018 11:49 PM

    Thank you for this article Diogo, nice write-up.

    😎

    Quick thoughts: I think it would benefit from slightly more technical detail on SQL Server's implementation of XML, such as schema collections and the difference between typed and untyped XML. On the storage side, the method used to store XML in SQL Server is a binary representation of the XML, not an identical copy. This saves roughly 20% compared to storing the XML in text files.

    Uh huh... what about during data transmission?  And, I could certainly be incorrect, but considering that the XML text files are 8 to 16 times larger than delimited text to begin with, I'm thinking that's not much of a savings.

    Of course it does depend on the structure of the XML, whether it's element- or attribute-bound, etc., but because of the binary storage, there is little difference between short and long element and attribute names.

    😎

    When it comes to the storage size comparison, let's include the final target, the SQL Server database 😉

    Here is a recent example (SQL Server 2017 RTM):

    XML files on disk 42.7 MB

    XML content in XML column 36.8 MB

    The content parsed into a normalized and optimized schema 41.3 MB

    It would be interesting to see the full monty on the storage size comparisons you've posted, Eirikur.  I've never seen such small differences.  Perhaps I've never been exposed to anything but bad XML, but my experience has been much different than the size comparison you've posted.  Generally speaking, the size difference between the actual XML and the actual data contained in elements and attributes is about 6-8 to 1.  The culprit, from what I can see, is the tag bloat, and I've seen much worse than the 6-8 ratio.

    That being said, I still think that XML and JSON are two of the worst ways to transmit data from one place to another.  I'm seriously torn about storing it in a database.  For example, it can be a great thing for bulk storage of, say, error and run logs because of its mostly freeform capabilities, but the tag bloat is horrendous and, once people start doing it, it's like anything else.  They use it for bloody everything without thinking of the ramifications for storage, pipe bloat, memory usage, and searchability.  It and JSON are both extreme violations of 1st normal form and resource conservation.


    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
