Is XML the Answer?

  • OK, a lot of comments here, and no one is 100% correct, as this is a subjective issue.

    The article was titled "Is XML the Answer...". The answer to what? I agree that XML is not the answer to a large enterprise data store. A good RDBMS is what's needed here: SQL Server, Oracle, MySQL, whatever... A large dataset is always going to be best off stored and MANAGED in a fully relational database server.

    But XML is the answer to many problems and issues. One of the biggest is data transportation. XML has made this so much easier over the last 3-4 years. It is a standard, and a standard that is going to stick. It may not be as strict a standard as we are used to, but an easy-to-use standard way of transporting data it is.

    It also offers so much more: easy transformation for user display, easy parsing, human readability (for debugging and development). But it's not the answer to everything, and I don't think anyone has ever said it is.

    XML does not lend itself very well to data management, and I don't think it was ever intended to be used as such. It is purely a markup language for transporting data between systems.

    Let it do what it was intended for and XML is fantastic!

  • quote:


    The relational model is set-theory and logic applied to database management, what is xml?


    A simple solution for conveying meaningful information about simple entities and simple relationships. Like you say - different solutions for different things.

    quote:


    The complete set of constraints for all tables in a database is what describes the database and is the only thing the DBMS can use to 'understand' the data. How does xml handle this?


    An XSD schema and the Document Object Model. Granted, a standard XML schema is not enough to control referential integrity and relationships in itself. However, if XML is being mapped onto an RDBMS (such as SQL Server 2000), a schema using the SQL namespace and processed by the SQLXML engine is more than adequate to 'understand' the data in the way you suggested.

    (In fact you get two levels of schema, the XML-level schema AND the database schema (the former is generally born of the latter). That means that the data, relationships and integrity are validated both during code-execution by the SQLXML engine and during the standard database transaction. Bonus!)

    quote:


    We as developers and DBAs can do our part by not using these features and by continuing to request more important features and functionality, especially a better implementation of the relational model.


    I'm not quite sure why we should be protesting by not using features when they can be (and in my case have been) useful. That seems like a political decision rather than a practical one. Can't we have both - data services features as well as improvements to the relational implementation? After all, data access is one of the roles of a DBMS.

  • XML addresses two major needs: transferring data from one place to another and storing "sparse" data. The usefulness of XML in transferring data has been addressed well in the previous comments so I won't go into it.

    But let's talk about storing "sparse" data. You have a sparse data situation when most of your records don't have actual data in them, i.e., they have a lot of missing fields. As an example, think about people's addresses. It used to be that a person's address referred to a street address, but now it could be an email address, or a web site address, or any number of different phone numbers. In my contacts list I have a whole bunch of people with just their street addresses and a whole bunch with just their email address. So the question is how do I store this information economically. If I use an RDB one choice is to have a StreetAddress table and an EmailAddress table both related to NameTable. Some of the people won't have an EmailAddress record and some won't have a StreetAddress record. It's trivial to get a listing of people with street addresses and it's trivial to get a listing of people with email addresses. But what if I want a listing of everyone with the appropriate address shown? Then I have to start doing outer join queries which are always a mess to get right. If I throw in telephone numbers I just add to the mess. XML can solve the problem by providing a single Address document that is subdivided into StreetAddress elements (which in turn can be subdivided into street, city, state, postal code elements), EMailAddress elements (which using attributes might distinguish between business email and personal email) and Telephone elements (also divided possibly into home phone, work phone, cell phone, pager, etc.). If, for a particular person, some (or most) of the data is missing, then those elements are left out of the Address document. Nothing extraordinary has to happen to create my contact list. XSL and XQuery easily deal with the lack of data.

    Addresses are just a simple example of this kind of situation. Think about what all the possibilities are to catalog a book (author names, available languages, editions, reprints, etc.) and imagine having to create an RDB that handles all the situations. It's possible but the schema will be incredibly complex. (Go look at the MODS XML schema being defined by the Library of Congress to see the gory details). It's these situations that XML does better. No one is ever going to produce an efficient XML implementation of an accounting system, but on the other hand there are clearly many situations where RDBs are too structured to provide an efficient repository mechanism.
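
    As a minimal sketch of the Address document idea above, the listing "everyone with the appropriate address shown" might look like this using Python's standard xml.etree.ElementTree. The element names, attribute names and sample contacts are all invented for illustration and are not taken from any real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical contact list: element and attribute names are assumptions.
CONTACTS_XML = """
<Contacts>
  <Person name="Ann">
    <Address>
      <StreetAddress>
        <Street>1 Main St</Street>
        <City>Springfield</City>
      </StreetAddress>
    </Address>
  </Person>
  <Person name="Bob">
    <Address>
      <EMailAddress type="business">bob@example.com</EMailAddress>
      <Telephone type="cell">555-0100</Telephone>
    </Address>
  </Person>
  <Person name="Cy">
    <Address/>
  </Person>
</Contacts>
"""


def describe(person):
    """Return (name, list of whatever address parts are present)."""
    parts = []
    street = person.find("Address/StreetAddress")
    if street is not None:
        # findtext returns the given default when a sub-element is absent.
        fields = [street.findtext(tag, "")
                  for tag in ("Street", "City", "State", "PostalCode")]
        parts.append("street: " + ", ".join(f for f in fields if f))
    for email in person.findall("Address/EMailAddress"):
        parts.append(f"email ({email.get('type', 'unknown')}): {email.text}")
    for phone in person.findall("Address/Telephone"):
        parts.append(f"phone ({phone.get('type', 'unknown')}): {phone.text}")
    return person.get("name"), parts


root = ET.fromstring(CONTACTS_XML)
listing = [describe(p) for p in root.findall("Person")]
for name, parts in listing:
    print(name, "->", "; ".join(parts) or "(no address on file)")
```

    Missing elements simply produce no output for that part. Note that nothing here stops a document from containing a misspelled element or a nonsense value; that would need a schema or application code.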

  • quote:


    The Information Technology arena has no room for extremists. Use the right paradigm/methodology/tool for the job.


    I guess that's what it's all about!!!

    Every technique is fine for its specific niche, nothing more and nothing less.

    Btw, a truly relational database does not know such a thing as a field or a record. It's all about columns and rows. To quote Joe Celko's usual rants on the MS newsgroups:

    HUGE DIFFERENCE !!!!

    Frank

    http://www.insidesql.de

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • Taking a step back, thanks for the article, dc -- I enjoyed it and the debate it has provoked. I certainly agree with your views on the overhyping and misapplication of technologies, although I think you've given an unbalanced description of XML to help make your case.

    However, one comment in your follow-up really piques my curiosity.

    >>XML is far from being a true standard.

    Could you elaborate on this? I think this could be worthy of a separate article.

  • quote:


    I also find it somewhat amusing that people still believe relational databases are inherently better than hierarchical. There's no one solution fits all. Have the OO arguments taught these people nothing?


    NO! They have taught me that the OO and hierarchical DB proponents have one thing in common; a sad lack of education in data management fundamentals. YES! The relational model is superior to either the hierarchical or OO approach to data management. This argument was finished 30 years ago, but now we have a brand new crop of "IT Professionals" who are not acquainted with the past and the lessons that should be learned from it.

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • quote:


    NO! They have taught me that the OO and hierarchical DB proponents have one thing in common; a sad lack of education in data management fundamentals. YES! The relational model is superior to either the hierarchical or OO approach to data management. This argument was finished 30 years ago, but now we have a brand new crop of "IT Professionals" who are not acquainted with the past and the lessons that should be learned from it.


    I couldn't agree more. The problem is often one of perspective. To a programmer, a database is a place to persist the state of his application. The programmer's narrow view misses the real value of data.

    The following is a quote from a programmer who's "seen the light":

    quote:


    For an OO guy, taught that behavior was everything and data was an implementation technique, that's a startling conclusion. However, the beauty of a database is that it's devoid of behavior, or, if there is behavior, it's layered in on top of the data. Programming languages come and go along with the ideas that underlie them and the applications that are built with them. Relational data, on the other hand, is a model that's simple enough, but complete enough, to move forward from application to application, accumulating value as you go in the data itself. And, since the relational model is so entrenched, no technology for the last 10 years or the next 1000 would be complete without support for it. Even Microsoft, IBM, GM and AT&T will prove to be less enduring than relational data, the tools to program against it and the tools to slice and dice it w/o programming anything (the latter are amazing strong already and continue to grow).

    http://www.sellsbrothers.com/spout/#ooDeadInTheWater


    My two cents ...

    Jason


    JasonL

  • quote:


    You have a sparse data situation when most of your records don't have actual data in them, i.e., they have a lot of missing fields. ... In my contacts list I have a whole bunch of people with just their street addresses and a whole bunch with just their email address. So the question is how do I store this information economically. If I use an RDB one choice is to have a StreetAddress table and an EmailAddress table both related to NameTable. Some of the people won't have an EmailAddress record and some won't have a StreetAddress record. ... But what if I want a listing of everyone with the appropriate address shown? Then I have to start doing outer join queries which are always a mess to get right. ... XML can solve the problem by providing a single Address document that is subdivided into StreetAddress elements (which in turn can be subdivided into street, city, state, postal code elements), EMailAddress elements (which using attributes might distinguish between business email and personal email) and Telephone elements (also divided possibly into home phone, work phone, cell phone, pager, etc.). If, for a particular person, some (or most) of the data is missing, then those elements are left out of the Address document. Nothing extraordinary has to happen to create my contact list. XSL and XQuery easily deal with the lack of data.


    Why would modeling this in a relational database be more difficult than modeling it in XML? And working with it, why would it be easier to query it with XQuery and XSL than with SQL, or even Chris Date's language Tutorial D (which is a real relational language)? On the contrary, it would be much more difficult to maintain the integrity and constraints of this XML tree than it would be to do so for a properly designed RDB. That is the hype of XML: just because it can solve some problems, it is touted by some to be the best (and often only) solution to these problems, when the real problem is (to quote Don) "a sad lack of education in data management fundamentals".
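
    For comparison, here is a minimal sketch of the relational version of the same contact listing, run against an in-memory SQLite database from Python. The table names NameTable, StreetAddress and EmailAddress come from the quoted post; the column names and sample rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE NameTable (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE StreetAddress (
        person_id INTEGER NOT NULL REFERENCES NameTable(person_id),
        street    TEXT NOT NULL,
        city      TEXT NOT NULL
    );
    CREATE TABLE EmailAddress (
        person_id INTEGER NOT NULL REFERENCES NameTable(person_id),
        email     TEXT NOT NULL
    );
    INSERT INTO NameTable VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO StreetAddress VALUES (1, '1 Main St', 'Springfield');
    INSERT INTO EmailAddress VALUES (2, 'bob@example.com');
""")

# Everyone is listed once; missing addresses come back as NULL (None).
rows = conn.execute("""
    SELECT n.name, s.street, s.city, e.email
    FROM NameTable AS n
    LEFT OUTER JOIN StreetAddress AS s ON s.person_id = n.person_id
    LEFT OUTER JOIN EmailAddress  AS e ON e.person_id = n.person_id
    ORDER BY n.person_id
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

    Each person appears exactly once here only because the sample data has at most one row per child table. With several emails or phone numbers per person, chaining joins this way multiplies rows, so a real query would need to be shaped accordingly.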

    --

    Chris Hedgate @ Apptus Technologies (http://www.apptus.se)

    http://www.sql.nu

  • quote:


    XML addresses two major needs: transferring data from one place to another and storing "sparse" data. The usefulness of XML in transferring data has been addressed well in the previous comments so I won't go into it.

    But let's talk about storing "sparse" data. You have a sparse data situation when most of your records don't have actual data in them, i.e., they have a lot of missing fields. As an example, think about people's addresses. It used to be that a person's address referred to a street address, but now it could be an email address, or a web site address, or any number of different phone numbers. In my contacts list I have a whole bunch of people with just their street addresses and a whole bunch with just their email address. So the question is how do I store this information economically. If I use an RDB one choice is to have a StreetAddress table and an EmailAddress table both related to NameTable. Some of the people won't have an EmailAddress record and some won't have a StreetAddress record. It's trivial to get a listing of people with street addresses and it's trivial to get a listing of people with email addresses. But what if I want a listing of everyone with the appropriate address shown? Then I have to start doing outer join queries which are always a mess to get right. If I throw in telephone numbers I just add to the mess. XML can solve the problem by providing a single Address document that is subdivided into StreetAddress elements (which in turn can be subdivided into street, city, state, postal code elements), EMailAddress elements (which using attributes might distinguish between business email and personal email) and Telephone elements (also divided possibly into home phone, work phone, cell phone, pager, etc.). If, for a particular person, some (or most) of the data is missing, then those elements are left out of the Address document. Nothing extraordinary has to happen to create my contact list. XSL and XQuery easily deal with the lack of data.

    Addresses are just a simple example of this kind of situation. Think about what all the possibilities are to catalog a book (author names, available languages, editions, reprints, etc.) and imagine having to create an RDB that handles all the situations. It's possible but the schema will be incredibly complex. (Go look at the MODS XML schema being defined by the Library of Congress to see the gory details). It's these situations that XML does better. No one is ever going to produce an efficient XML implementation of an accounting system, but on the other hand there are clearly many situations where RDBs are too structured to provide an efficient repository mechanism.


    It is hard to know where to begin... If your data is so "sparse" you have obviously made a major blunder in gathering that data so how will ANY method of storing it make it any better? As for the Library of Congress example, the relational model to handle books with authors and versions etc... would be simpler and more flexible than any hierarchical model you could come up with. I have no doubt that the XML schema being dreamed up is full of "gory details" and it is undoubtedly vastly more complex than it has to be.

    Look, if you have no interest in educating yourself, that's your business. My article is hopefully a bit of a wake-up call for those who actually want to understand the theoretical underpinnings of good data management and stop being "blown to and fro" by every marketing wind that comes along.

    A well designed database preserves the MEANING of the data, it does not care, for the most part, how the data is used. This means that as the data is used in different ways, by different applications, the database will not need to be modified. The OO and hierarchical models are very concerned with how the data is used. This means that the data cannot easily be used by multiple applications and if the business process which the database supports changes, there will be MAJOR changes to the database.

  • quote:


    Look, if you have no interest in educating yourself, that's your business. My article is hopefully a bit of a wake-up call for those who actually want to understand the theoretical underpinnings of good data management and stop being "blown to and fro" by every marketing wind that comes along.

    A well designed database preserves the MEANING of the data, it does not care, for the most part, how the data is used.


    Mr. Peterson, you are obviously an intelligent and well-spoken individual who is very familiar with the topic of data management. I agree with many of the points that you made in your article. However, I feel that you elected to reiterate the weakest aspects of XML implementation, without reiterating the strength of XML when used appropriately. As such, your article is unbalanced and misrepresents the value of XML, both in general, and as it is used within a database.

    Consider the difficulty of replacing an entrenched legacy application within a business. Many applications have "tentacles" that reach into dozens of other places. This makes them hard to remove and replace.

    XML, as a data communication mechanism, has allowed me to work with "uncertainty." I can create a set of interfaces between existing systems. I can send more complete data than is currently required, using XML. (XML simply ignores extra data). In this way, I can create one interface that will work for both an OLD system and a NEW one. Note: I may not know what the new system is, or what it requires. Using XML, the BUSINESS can describe the transactions, and I can implement them.

    Once I've reworked the integration points, I can replace a legacy system with a new one, and NONE of the integrating partners care. THIS IS A MAJOR BONUS. Your article missed it completely.
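
    A minimal sketch of the "send more than is currently required" idea, using Python's standard xml.etree.ElementTree. Strictly speaking it is the consumer that ignores elements it does not look for, not XML itself. The message layout and the two reader functions are hypothetical, invented only to illustrate the point.

```python
import xml.etree.ElementTree as ET

# One message carrying more than the old system needs (names are invented).
ORDER_XML = """
<Order id="1001">
  <Customer>ACME</Customer>
  <Total currency="USD">250.00</Total>
  <ShipVia>ground</ShipVia>
  <LoyaltyTier>gold</LoyaltyTier>
</Order>
"""


def read_for_old_system(xml_text):
    """The legacy consumer only knows about Customer and Total."""
    order = ET.fromstring(xml_text)
    return {
        "id": order.get("id"),
        "customer": order.findtext("Customer"),
        "total": order.findtext("Total"),
    }


def read_for_new_system(xml_text):
    """A later consumer picks up the extra elements from the same message."""
    order = ET.fromstring(xml_text)
    record = read_for_old_system(xml_text)
    record["ship_via"] = order.findtext("ShipVia")
    record["loyalty_tier"] = order.findtext("LoyaltyTier")
    return record


old_view = read_for_old_system(ORDER_XML)
new_view = read_for_new_system(ORDER_XML)
print(old_view)
print(new_view)
```

    This tolerance holds only as long as the consumers read by name and do not validate against a schema that forbids unknown elements.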

    As for storing XML in a database... have you ever seen a database that stored long text strings? Say quotes from authors, or forum messages? The database is not required to maintain the meaning of these text strings. The database maintains the relationships between these strings and the data that surrounds them (who wrote it, dates, sources, etc).

    XML, stored in a database, can be used in the same way, if the same conditions apply. XML provides a good way to store data as a "document" that is self consistent. This document can have a meaning to some other system component. However, there is no REQUIREMENT that this meaning must be enforced in the database that holds it.

    At the risk of being redundant, just because XML is structured and has meaning, that doesn't mean that it MUST have meaning to the database.

    In this case, it makes good sense to use the database to store XML strings. If there are values that need to be mined out of the XML string to make it useful, they should be stored and managed in relational columns, as you'd expect (and, I suspect, defend). Yes, you'd be simply "storing" the XML in the db, and not managing it. So what? Do you "manage" the text in author's quotes?

    In conclusion: there are specific times when it is OK to store XML in a database. Also, XML is immediately and intensely useful in application integration, far more so than the CSV example you relied upon.

    As for the tone of your article, I must ask you to consider something. I trust your apparent knowledge on data management. Please trust me on application integration. With all due respect, if you blast those who have not learned data management principles, realize that you are perilously close to falling into the camp of those who have a poor understanding of enterprise application integration and system communication.

    Please, take a deep breath. If top tier data analysts like yourself and solution designers (like myself and others) work together, and use the appropriate tools for the appropriate tasks, we can create good designs that solve current and future needs efficiently.

    With utmost respect,

    Nick Malik

  • As someone who is guilty of coding 2-digit years back in the day when disk space was expensive, I’m a little offended and a little amused by the profligacy of the XML database. I’m pretty sure that it’s a transitional methodology; more a management tool than a technology.

    As the Hubble telescope fiasco demonstrates, agreement upon and reliable communication of metadata is not a trivial matter. It’s useful for geeks and non-geeks to speak the same language for a while to facilitate hammering out agreements among client segments. During this phase, we’ll probably develop formal methods for working out the agreements. That will pave the way for the re-abstraction of data. I suspect that we’ll arrive at a new generation of RDBMS in which metadata is incorporated as a fourth dimension.

    Seth Wilpan

  • quote:


    Mr. Peterson, you are obviously an intelligent and well-spoken individual who is very familiar with the topic of data management. I agree with many of the points that you made in your article. However, I feel that you elected to reiterate the weakest aspects of XML implementation, without reiterating the strength of XML when used appropriately. As such, your article is unbalanced and misrepresents the value of XML, both in general, and as it is used within a database.


    First off, thank you for your kind words.

    The whole point of the article is to show how the supposed strengths of XML are not significant when compared to its problems. Yes you can use XML to solve some problems, but there are pre-existing, and more efficient means in every case I can think of. In several years of discussing the subject, I have not had a good example provided where the same task could not be performed more efficiently by using some other means.

    quote:


    Consider the difficulty of replacing an entrenched legacy application within a business. Many applications have "tentacles" that reach into dozens of other places. This makes them hard to remove and replace.

    XML, as a data communication mechanism, has allowed me to work with "uncertainty." I can create a set of interfaces between existing systems. I can send more complete data than is currently required, using XML. (XML simply ignores extra data). In this way, I can create one interface that will work for both an OLD system and a NEW one. Note: I may not know what the new system is, or what it requires. Using XML, the BUSINESS can describe the transactions, and I can implement them.


    Here you are talking strictly of data transport. I do acknowledge that XML can be of some use here; however, the same thing can be done using ANY agreed-on data file format. Unfortunately for XML, almost any other physical file format will be more efficient... The business will not be describing the transactions using XML, you do that. No matter the method of transport, there must be agreement on the PRECISE meaning of the data being sent and received. XML tags are not a sufficient description in and of themselves.

    quote:


    Once I've reworked the integration points, I can replace a legacy system with a new one, and NONE of the integrating partners care. THIS IS A MAJOR BONUS. Your article missed it completely.


    This is nothing more than providing a layer of abstraction between systems, and again it can be done in a number of more efficient ways than to use XML.

    quote:


    As for storing XML in a database... have you ever seen a database that stored long text strings? Say quotes from authors, or forum messages? The database is not required to maintain the meaning of these text strings. The database maintains the relationships between these strings and the data that surrounds them (who wrote it, dates, sources, etc).

    XML, stored in a database, can be used in the same way, if the same conditions apply. XML provides a good way to store data as a "document" that is self consistent. This document can have a meaning to some other system component. However, there is no REQUIREMENT that this meaning must be enforced in the database that holds it.

    At the risk of being redundant, just because XML is structured and has meaning, that doesn't mean that it MUST have meaning to the database.

    In this case, it makes good sense to use the database to store XML strings. If there are values that need to be mined out of the XML string to make it useful, they should be stored and managed in relational columns, as you'd expect (and, I suspect, defend). Yes, you'd be simply "storing" the XML in the db, and not managing it. So what? Do you "manage" the text in author's quotes?

    In conclusion: there are specific times when it is OK to store XML in a database. Also, XML is immediately and intensely useful in application integration, far more so than the CSV example you relied upon.


    Of course I have seen large character strings stored in a database. I am usually very suspicious of them and do not allow them without GOOD reason. There are good reasons to allow them in some systems. And yes, there are times when a given kind of "document" is actually an attribute of some entity. In those cases the DBMS must assume that the document is internally consistent. However, XML is by definition not just another text document. It has entities and attributes which are likely important to the business (or else why bother?) If there are distinct entities and attributes (to speak very loosely) to be stored in the database then at a minimum I want datatype constraints enforced. This is impossible with raw XML.
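
    To make the datatype point concrete, here is a minimal sketch using Python's sqlite3 module. It covers only the case described above, where the XML is stored as an opaque string. The table names, column names and the sample document are assumptions invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw XML kept as an opaque string: the DBMS cannot see inside it.
    CREATE TABLE order_doc (doc TEXT NOT NULL);
    -- The same attribute pulled out into a typed, constrained column.
    CREATE TABLE order_line (
        quantity INTEGER NOT NULL
            CHECK (typeof(quantity) = 'integer' AND quantity > 0)
    );
""")

bad_doc = "<OrderLine><Quantity>lots</Quantity></OrderLine>"

# The text column accepts the document even though the quantity is nonsense.
conn.execute("INSERT INTO order_doc (doc) VALUES (?)", (bad_doc,))
stored_docs = conn.execute("SELECT COUNT(*) FROM order_doc").fetchone()[0]

# The typed column rejects the same bad value at insert time.
try:
    conn.execute("INSERT INTO order_line (quantity) VALUES (?)", ("lots",))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

conn.execute("INSERT INTO order_line (quantity) VALUES (?)", (3,))
valid_rows = conn.execute("SELECT COUNT(*) FROM order_line").fetchone()[0]
print(stored_docs, rejected, valid_rows)
conn.close()
```

    The bad document is stored without complaint, while the constrained column refuses the same value and accepts a valid one.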

    Perhaps the worst aspect of XML is its hierarchical nature. Hierarchical data structures are inflexible and inefficient for general data management and storage purposes. The other objection I have with storing XML in the database is that it takes up too much space.

    quote:


    As for the tone of your article, I must ask you to consider something. I trust your apparent knowledge on data management. Please trust me on application integration. With all due respect, if you blast those who have not learned data management principles, realize that you are perilously close to falling into the camp of those who have a poor understanding of enterprise application integration and system communication.

    Please, take a deep breath. If top tier data analysts like yourself and solution designers (like myself and others) work together, and use the appropriate tools for the appropriate tasks, we can create good designs that solve current and future needs efficiently.

    With utmost respect,

    Nick Malik


    I am familiar with enterprise application integration having come from that background. I know that integration efforts are non-trivial. However, I do not believe that there is room to take shortcuts when it comes to data integrity.

    I fully agree that data management professionals and application developers must cooperate to solve problems, but when it comes to storing XML in the databases I manage, the answer is NO! If you need a place to store XML use a file. If however you want a place to properly manage data, use a properly designed database.

  • Thank you for the article, I truly appreciate the in-depth analysis of XML. Too often I feel change is being made for the wrong reasons.

    I have limited use of XML, but I have tried to keep up with technology. There are some scenarios I thought XML could help, and I would like input on.

    I do quite a bit of file importing and exporting. I thought a move to XML in these areas, could bring some benefits.

    1. Does XML not bring flexibility to the data source?

    If I am the consumer of a csv file, changing the order or quantity of fields for most processes would require redevelopment. More often than not, the file I receive is also going to several others, which means when anyone wants a change, everyone must accommodate that change.

    2. Does XML not bring integrity where there was none?

    In an XML file there is the ability to create referential integrity and from what I hear a fairly simple means to validate the data via another document. There are times when the file import is an all or nothing process, and I like the idea of creating a document as opposed to programmatic validation.

    3. One area I have been promoting XML, hopefully with good reason, is between the data level of hard coding and RDBMS. Is there not a place for XML between the two?

    I am regularly troubled by the idea of producing a read-only database, which requires at least minimal administrator and computing cost. The loss of flexibility in hard coding is equally troubling.

    I understand there is a performance loss with XML, but I feel in my current environment an increase in computing cost (CPU cycles, disk space and bandwidth) is acceptable if there is a noticeable decrease in developer cost.

    Thank you for your time, and I look forward to your response.

  • I agree with the article. I have to ask the possibly naive question: why can't you do all that you can do with XML using a more efficient file format?

    I also have to ask: if vendors were to put all of the time and effort into building standards for transforming other file formats, say CSV, and developing frameworks of code for reading other more efficient file formats, parsing them, treating them like data sets, etc. as they've done with XML, essentially creating the same base of standards and support as they've done for XML, then what would be the advantage of using XML over those more efficient file formats?

    Why haven't vendors spent more time and energy in developing and supporting more efficient file formats?

  • quote:


    1. Does XML not bring flexibility to the data source?

    If I am the consumer of a csv file, changing the order or quantity of fields for most processes would require redevelopment. More often than not, the file I receive is also going to several others, which means when anyone wants a change, everyone must accommodate that change.


    Yes and no. Sure the fields can be rearranged in an XML file, but what if the tags change? Besides, how often does the data source change for no good reason? In every case I have been involved in where the data source has changed, it has involved more than just a reordering of fields and thus required some effort to redefine the import processes downstream.

    I see this argument as somewhat of a "straw man" created to show off XML's one strength but not tied to reality at all.
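
    Both halves of this exchange can be shown in a few lines. The sketch below, using only Python's standard library, reads a headerless CSV by position and an XML document by element name. The field layout, element names and sample values are invented for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

ORIGINAL_CSV = "1001,ACME,250.00\n"
REORDERED_CSV = "ACME,1001,250.00\n"  # supplier swapped the first two fields

ORIGINAL_XML = "<Order><Id>1001</Id><Customer>ACME</Customer><Total>250.00</Total></Order>"
REORDERED_XML = "<Order><Customer>ACME</Customer><Id>1001</Id><Total>250.00</Total></Order>"
RENAMED_XML = "<Order><Client>ACME</Client><Id>1001</Id><Total>250.00</Total></Order>"


def customer_from_csv(text):
    """Positional read: the importer assumes the customer is field 1."""
    row = next(csv.reader(io.StringIO(text)))
    return row[1]


def customer_from_xml(text):
    """Named read: element order is irrelevant, element names are not."""
    return ET.fromstring(text).findtext("Customer")


csv_results = [customer_from_csv(t) for t in (ORIGINAL_CSV, REORDERED_CSV)]
xml_results = [customer_from_xml(t) for t in (ORIGINAL_XML, REORDERED_XML, RENAMED_XML)]
print(csv_results)
print(xml_results)
```

    The positional CSV read silently returns the wrong field after a reorder, the XML read survives the reorder, and both break once a name changes. A CSV file with a header row read by column name would tolerate reordering as well, so the difference is about reading by name versus by position rather than about XML as such.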

    quote:


    2. Does XML not bring integrity where there was none?

    In an XML file there is the ability to create referential integrity and from what I hear a fairly simple means to validate the data via another document. There are times when the file import is an all or nothing process, and I like the idea of creating a document as opposed to programmatic validation.


    If we are talking about data transport then the integrity must be enforced by the source database and then by the target database. Why do I need the transfer mechanism to enforce integrity?

    Now, if you are talking about using XML for data management that is another story, and that story is a real Greek tragedy (see above comments and the article.) XML's data integrity mechanisms are pretty pathetic at this time and due to its hierarchical nature, I doubt that they will ever be adequate. This is what I was referring to when I said that with the advent of XML they only succeeded in reinventing a square wheel.

    quote:


    3. One area I have been promoting XML, hopefully with good reason, is between the data level of hard coding and RDBMS. Is there not a place for XML between the two?

    I am regularly troubled by the idea of producing a read-only database, which requires at least minimal administrator and computing cost. The loss of flexibility in hard coding is equally troubling.

    I understand there is a performance loss with XML, but I feel in my current environment an increase in computing cost (CPU cycles, disk space and bandwidth) is acceptable if there is a noticeable decrease in developer cost.


    You may see a short-term decrease in developer costs, but at what price? We learned long ago that application managed data files were a bad idea. What has changed to make it a good idea? You lose data independence and ultimately you will lose the ability to maintain your applications because of the spiderweb of data dependencies that tend to develop in such an environment. Put that together with the loss of performance and the increased overhead on systems and you really have a lose/lose situation.

    Besides, we have a tendency to talk about system and network resources as if they are an insignificant expense. I don't know about your company, but our base Intel servers (2 CPU) weigh in at around $9k with the 4 way servers topping out close to $30k. We spend millions of dollars per year for connectivity. Any across the board increase in bandwidth will cost millions more. We use EMC storage for our production systems and the total cost is estimated at around $220 per GB once you count purchase costs, maintenance, administration, and backup, so disk space isn't cheap either. At those prices, we can pay a developer for a few more hours to do things right...
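
    As a rough way to put numbers on the storage argument, the sketch below serializes the same invented rows as CSV and as element-per-field XML and applies the $220 per GB figure quoted above. The sample rows, the element names and the 100 GB scale are all assumptions, and the sketch ignores compression, so the ratio it prints is illustrative only.

```python
import csv
import io
import xml.etree.ElementTree as ET

COST_PER_GB = 220  # dollars, the all-in storage estimate quoted above

# Invented sample rows; real data would give a different ratio.
FIELDS = ["id", "customer", "total"]
ROWS = [[str(1000 + i), f"customer-{i}", f"{i * 1.5:.2f}"] for i in range(1000)]


def as_csv(rows):
    """Serialize rows as headerless CSV bytes."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue().encode("utf-8")


def as_xml(rows):
    """Serialize rows as XML bytes with one element per field."""
    root = ET.Element("orders")
    for row in rows:
        order = ET.SubElement(root, "order")
        for field, value in zip(FIELDS, row):
            ET.SubElement(order, field).text = value
    return ET.tostring(root, encoding="utf-8")


csv_bytes = len(as_csv(ROWS))
xml_bytes = len(as_xml(ROWS))
ratio = xml_bytes / csv_bytes

# Hypothetical scale-up: what 100 GB of the CSV form would cost in each format.
csv_cost = 100 * COST_PER_GB
xml_cost = csv_cost * ratio
print(f"CSV {csv_bytes} bytes, XML {xml_bytes} bytes, ratio {ratio:.2f}")
print(f"100 GB as CSV: ${csv_cost:,}; same data as XML: ${xml_cost:,.0f}")
```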

