SQL 2005: Enter XML

  • Comments posted here are about the content posted at http://www.sqlservercentral.com/columnists/mcoles/2826.asp

  • What a great communicator!


    If you build it they will come.

    Joseph A. Montione-SQLServerCentral

  • I have been waiting a long time for someone to remove the mystery and complexity of the XML datatype.  It is one of those features of SQL 2005 that has great potential, but I have been cautious in building any solutions that use this feature.  I am anxiously looking forward to Part II.

  • I can't imagine ever wanting to use the XML data type.  The only time I'd ever store XML of any sort in the database is if there is some kind of real requirement (regulation) to store the exact messages used in an exchange.  In that case it gets stored as a varchar(max).  Data stored in the XML data type is not guaranteed to be the SAME as the original XML document/message, so for most regulatory/audit requirements it is not sufficient.

    If there is any need to use the data contained within the XML documents, beyond simple storage and retrieval, it gets "shredded" and stored in a properly normalized structure.

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • Hi all, thanks for the feedback!

    To DC - One of the advantages of the XML data type is that you can easily shred your XML documents into relational data using the XML data type's built-in methods like .query() and .nodes(), for relational storage.  Even if you don't want to store data using the XML data type itself, it's still useful for manipulating XML data being passed in and out.

    And of course XML is an excellent choice for semi-structured data which might not fit well into the relational model.  Even our good friends over at ANSI have seen the need to store, retrieve, and manipulate XML in SQL databases (the ANSI SQL/XML standard was approved and is currently under revision AFAIK).

    Thanks again!

  • I am still confused as to why I would choose the XML datatype as opposed to writing something in C# to manage an XML document.

  • Hi Frank,

    It all depends on what you want to do with it.  For instance, when combined with some of SQL Server's other new features like HTTP Endpoints, etc., you can potentially "eliminate the middle man" and send XML documents directly to SQL Server for shredding via Web Methods.  Done right, this could potentially eliminate network traffic, cut down on development time, and eliminate potential points of failure in the system.

    Consider sending a 100 MB XML document to a .NET app, an app who's sole purpose is to shred the XML and send it to SQL Server.  You have a potential point of failure between the front end and the .NET app in the middle, and another potential point of failure between the .NET app and the SQL Server.  You also have the additional overhead of opening and maintaining a SQL Server connection while transferring the data from the .NET app to SQL Server, and the additional bandwidth involved in effectively sending the data over your network twice (once from the front end to the .NET app and once from the .NET app to SQL Server).

    By setting up an HTTP Endpoint and sending the XML data directly to SQL Server for shredding you eliminate one potential point of failure, remove the additional overhead of opening and maintaining a connection, and only have to transfer your XML data once.  Of course this is just one example, which might or might not be applicable to your particular situation.

    Also keep in mind that not all data is best represented relationally.  SQL is great for highly structured, highly "regular" data.  XML is often better at representing semi-structured and hierarchical data (granted CTE's in SQL do help a bit with the hierarchical data).

    Thanks!

  • Now there you go...

    Alarms should go off when you say something like "not all data is best represented relationally."  Semi structured data is an oxymoron.  Granted, it is an oft-repeated oxymoron, but an oxymoron none the less. 

    And "hierarchical" data is something that is very rare.  Sure, most data CAN be viewed hierarchically, and we, as humans, often find it very useful to categorize things.  The problem is that once you start to manage your data in hierarchical structures, you have limited the ability of other users to view the same data in different hierarchies.  The process of categorizing data into hierarchies (taxonomy) necessarily implies assigning more "value" to some attributes and less to others; this is one of the main problems that lead to the demise (almost) of hierarchical DBMS's.  And if more practitioners weren't totally ignorant of history, XML would have been relegated to forever be a bit player, instead of being a "major feature" in SQL Server or any other DBMS.

    To qualify as "hierarchical" the data must be able to be accurately represented in one and only one hierarchy.  Otherwise, the hierarchy is merely a mechanism imposed upon the data for a specific purpose, and thus should NOT be managed in hierarchial structures (XML etc...)

    To point to XML as a means of eliminating network traffic or storage space is the very definition of irony, although I'm sure that in this case it was purely unintentional.

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • DC - Why should alarms go off?  There is a lot of data in the world that is better represented in a hierarchical structure than in having a relational structure imposed upon it.  Consider the simple example of HTML.  Obviously an HTML page could have a relational structure imposed upon it, but that would be awfully painful (look at the source code of this page and imagine it in relational format, for example).

    For a more business-oriented example, consider ERP with its Bills of Material and Bills of Labor.  I've seen detailed BOMs that can go very, very deep indeed.  And I've seen the pain caused by trying to force a relational format on those BOMs (particularly in older versions of SQL Server where CTEs were not available).

    As far as BOMs are concerned, and most hierarchical structures, there is more "value" (in one form or another) assigned as you move up the tree.  For instance, the $ value of a bicycle in a BOM increases as you add both tires to the frame.  And as you go up another level in the BOM, the $ value increases again when you add pedals.  In a standard business Org. Chart, the "value" you speak of might be measured in units of "authority", "accountability", or some other metric.  This is not a function of XML or even hierarchical databases themselves, it's standard practice when representing data hierarchically using any method, so I'm not really sure what you're driving at.  And yes, one of the features of XML is that position is important, as it can provide context to the data, as opposed to the SQL model which is order-independent.

    Do you have a reference for all hierarchical data being able to be represented in "one and only one hierarchy"?  I would need a reference for that, as I can easily show a hierarchy like the following:

    Bishop

       |----Priest

                  |-----Nun

    And I could represent the same data in a second hierarchy:

    Pope

       |----Bishop

                  |-----Priest

                              |-----Nun

    So I'm not seeing what you're getting at there.  One subset of data represented in two different hierarchies.

    And of course, using the SQL model I either have to assign them all the same attributes (i.e., Pointy Hat Color, Habit Size, etc.) even though for many of them it will be NULL (Nuns don't have Pointy Hats and the Pope doesn't wear a Habit), or create a separate table for each of Pope, Bishop, Priest, Nun representing their attributes.  And of course another table for Altar Boy, one for Church Janitor, etc.

    Finally, in the example I gave, I'm talking about transferring a 100 MB XML file from the front end to the .NET app in the middle.  Further, after it's shredded we'll be transmitting maybe another 100 MB of network traffic from the .NET app to the SQL Server (depending on how much of the XML data we actually decide to keep).  If you eliminate the .NET app in the middle, you've just cut your network traffic in half.  I suppose I'm missing the irony of essentially cutting your network traffic in half...

  • "And I could represent the same data in a second hierarchy"

    Thank you for making my point.  In order to represent data stored in a hierarchical structure, in a different hierarchy, redundancy is required.

    Data stored in relational structures can be represented in many different hierarchies.  It is much more difficult, if not impossible, to do with data stored in hierarchical structures.  Granted, representing complex hierarchies is...well...complex, unless you store it that way.  But in so doing you limit the data's usability for other purposes, and you loose the ability to declare any but the most rudimentary constraints on it.

    So, now you've flushed data independence and data integrity down the drain...  Let me think... What was the purpose of a DBMS in the first place?...

    If you need any further proof that hierarchical structures are very poor for general data management, just look at all the problems with file systems.  The organization of file systems is notoriously difficult because what makes sense for one person or group, makes no sense for others.  So what happens?  Multiple file structures are created to support the different useful views of the data.  Then files are duplicated all over the place, many are fogotten and very quickly, no one has any idea which files are current etc...

    Hierarchical data management is a mess.  Those who are smart would avoid it.

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • Data stored in relational structures can be represented in many different hierarchies.  It is much more difficult, if not impossible, to do with data stored in hierarchical structures.  Granted, representing complex hierarchies is...well...complex, unless you store it that way.  But in so doing you limit the data's usability for other purposes, and you loose the ability to declare any but the most rudimentary constraints on it.

    Hierarchical data stored in relational structures is difficult to maintain and often has artificial limits imposed upon it by software developers because it can be extremely inefficient to retrieve, manipulate, etc.  Not to say that it cannot be done, but not every problem is a nail...  As for constraints, XML Schema provides a very strong typing system, and you can define a wide variety of constraints.  It's not as complex a system of constraints as SQL, but then again it hasn't been around for 3 decades either.  We can probably expect the power of XML to increase over time, as people begin to demand more power.

    Hierarchical data management is a mess.  Those who are smart would avoid it.

    If you expect XML to be equivalent to, or think that it is a replacement for, relational DBMS then you'll definitely be in for a shock.  I believe you already mentioned that you do have to deal with XML in various forms, and that you currently shred it and put it in the database in relational format.  What's wrong with using SQL Server to do that?

    I suppose there is a school of thought that sees the RDBMS as a glorified file cabinet whose only function in life is (or should be) to store relational data.  Then there are those that see more potential in the modern RDBMS as a full-blown application development platform.  Not that there's anything wrong with either view (in my opinion), but I do think it's a little limiting not to consider the full power of your RDBMS platform.

  • Here's a real world situation. I have a web based application that tracks games and points for a poker league. There are certain things that are easiest entered into the backend database using a bulk import.

    The poker league is hosted on a provider that does not allow file upload. My current solution uploads player statistics into a MemoryStream object which is converted into an XmlDocument to process into the database. The XML datatype may have been a solution if I needed to keep a copy of the actual uploaded file?

    The XML part of the solution is a vast improvement over CSV because of the data typing and heirarchy that the XML allows. 

    Season--Location--Game--Players--Player--Points

    Without getting too involved in my own solution, the XML hierarchy allows me to put these records into the database in a transaction with rollback and commit features.  I can validate the entire tournament against specific rules. Rules like, you can't have two players in first place. You can only play at one table during a round. So, it takes advantange of the self-describing, self-validating features of XML.

    Here's an interesting side effect of this XML file. If I wanted to display the touranment results, I could generate the same XML document and use a style sheet to display it as a web page.  Or, I could store the file as XML in the database...which I am not currently doing.

    VARCHAR(8000) is sufficient to handle this situation. The XML datatype could offer an additional schema stamp of authentication to the document. Currently, the benefits don't outweigh the hassle.

  • In your situation, where the XML is in a strongly structured, regular format it will probably be advantageous to shred it and store it in relational form.

    Without even needing to store the data in an XML column, however, you can still use the SQL Server platform to:

    1) validate the XML against an XML Schema,

    2) shred the original XML document and store it in your database directly,

    3) regenerate the XML from the relational data using the built-in XML functionality (FOR XML PATH clause, for instance),

    4) generate the HTML page for you (using FLWOR expressions or the SQLCLR),

    5) and accept requests/deliver the HTML page directly to you (via HTTP Endpoints)

    One of the chief advantages (IMHO) of the XML data type is that (unlike the old TEXT, NTEXT, and IMAGE LOB data types) you don't have to create columns to hold your XML data in order to take full advantage of its power and flexibility.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply