
What is XML?

By Stephen Hirsch

Note: This article was originally written in 1999, and not much has changed.

So, what the heck is XML, anyway?

The amount of hype pushing XML (eXtensible Markup Language) is amazing, even for the software industry. Bowstreet claims that its Business Web Factory "Integrates business processes (your own and your partners') seamlessly into customized business webs using XML". IBM has an XML evangelist. Microsoft "supports" XML in Internet Explorer 5.

I knew it was bad when my wife got a taxi receipt with the following message on it (I've saved it as a souvenir):



XML isn't a programming language. XML doesn't have anything to do with the web, in and of itself. It isn't really even a markup language. Technically speaking, it is a meta-language, or a language language. It is a set of guidelines for creating a markup language, or, more specifically, guidelines for creating your own set of tags. According to www.xml.com, "A markup language is a mechanism to identify structures in a document. The XML specification defines a standard way to add markup to documents."

Practically speaking, it is a data modeling language. It is a decent (though not problem-free) way for organizations to specify their data and object structures in a platform-independent manner. Done correctly, an organization could use XML to easily transfer data between systems from different software and hardware vendors.

It's just very, very, very hard to do it correctly.

A bit of background. The grandfather of markup languages is SGML (Standard Generalized Markup Language). SGML has been around for quite a while, and has been used successfully. However, its complexity creates some programming difficulties.

Meanwhile, the HyperText Markup Language (HTML) was enjoying widespread use, mainly as the markup behind web pages. Even though it was an application of SGML, it suffered from the opposite problem: it was too limited for what people wanted to do. So, XML was born as a compromise.

XML is closer to SGML than to HTML in that it is extremely generic. XML doesn't have tags itself; it just tells you how to make tags. A slightly humorous example of this appeared in a PDF developer mailing list that I subscribe to. One person asked if there were any tools to convert PDF to XML. After explaining the apples/oranges thing, another subscriber created a well-formed XML document:

<?xml version="1.0"?>
	(the binary .PDF file)
The above is a legitimate conversion of PDF to XML. It is, however, completely and utterly useless.
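
As a rough illustration of why such a conversion is legitimate yet useless, here is a minimal Python sketch. The element name `pdf`, the sample bytes, and the base64 step are all assumptions made for the example, not part of the mailing-list post; base64 is used only because raw binary cannot appear directly in XML text.

```python
import base64
import xml.etree.ElementTree as ET

def wrap_binary_as_xml(data: bytes, tag: str = "pdf") -> str:
    """Wrap arbitrary bytes in a single XML element (hypothetical tag name)."""
    root = ET.Element(tag)
    # Raw binary cannot appear in XML text, so encode it as base64 first.
    root.text = base64.b64encode(data).decode("ascii")
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")

# Stand-in bytes; not a real PDF file.
fake_pdf = b"%PDF-1.3\n\x00\x01\x02 not a real file"
document = wrap_binary_as_xml(fake_pdf)

# The parser accepts the document, so it is well-formed ...
parsed = ET.fromstring(document)
# ... but the only structure it exposes is one opaque blob.
recovered = base64.b64decode(parsed.text)
```

The parser is satisfied, yet nothing about the content (pages, text, fonts) has been modeled, which is the point of the joke.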

To use XML well, a Document Type Definition (DTD) needs to be created. The DTD is where the tags that make up the document are defined, as well as the relationship of one tag to another. It is where the data model is created, and where most of the difficulties of using XML to transfer data lie.

In many ways, the data model is the guts of any application. The reasons why the software you are using exists can be found in the data model. Modeling how one piece of data relates to another is akin to determining what frame a car will have. The ramifications of the car's width determine how heavy it will be, which determines the engine and transmission it will have, and so on.

To give a real-life example, I work for Novartis Pharmaceuticals, developing software for clinical trials. In clinical trials, which are really massive, outrageously expensive experiments, we start out with a compound. A compound is a chemical that may or may not have medicinal use. Aspirin is a compound, as are tamoxifen and penicillin, and so on. You can use compounds for several reasons, or indications. For example, you can use aspirin to get rid of a headache, or you can use aspirin to prevent heart attacks. A compound plus an indication is called a project.

Now, to test whether aspirin prevents heart attacks, an experiment is designed, which we call a trial. There can be several ways to test whether aspirin prevents heart attacks.

In data modeling speak, compounds, projects, and trials are entities. These entities have attributes. A compound has a chemical name; a trial has a visit schedule. At Novartis, a project has an attribute called the galenical aspect. This refers to the method of delivering the compound: a shot, a pill, in toothpaste, whatever. In our model, a project must have one and only one delivery method. Let's say, for the sake of argument, that Merck allows a project to have more than one compound delivery method.

Now, let's say that Novartis decides to publish a clinical trial DTD which allows a project to have only one delivery method, and gets its DTD accepted as the industry standard. Implementation of Novartis's standard could then result in Merck having to change its clinical trial budgeting process, as different trials are moved into different projects because one gives the patients aspirin in pill form, and the other gives the patients suppositories.
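
To make the conflict concrete, here is a small Python sketch of what a one-delivery-method rule might look like and how a document built on the other model fails it. Every element name (`project`, `delivery_method`, and so on) is hypothetical, and the DTD fragment is an assumption about how such a rule could be written, not an actual published standard. Python's standard library does not validate against DTDs, so the check is done by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD fragment: a project holds exactly one delivery_method.
# (Shown for reference only; the standard library does not validate DTDs.)
PROJECT_DTD = """
<!ELEMENT project (compound, indication, delivery_method, trial+)>
<!ELEMENT compound (#PCDATA)>
<!ELEMENT indication (#PCDATA)>
<!ELEMENT delivery_method (#PCDATA)>
<!ELEMENT trial (#PCDATA)>
"""

def has_single_delivery_method(project_xml: str) -> bool:
    """Hand-rolled stand-in for the cardinality rule in the DTD above."""
    project = ET.fromstring(project_xml)
    return len(project.findall("delivery_method")) == 1

# A project that follows the one-method model.
one_method = """<project>
  <compound>aspirin</compound>
  <indication>heart attack prevention</indication>
  <delivery_method>pill</delivery_method>
  <trial>Trial A</trial>
</project>"""

# A project from an organization that allows several delivery methods.
two_methods = """<project>
  <compound>aspirin</compound>
  <indication>heart attack prevention</indication>
  <delivery_method>pill</delivery_method>
  <delivery_method>suppository</delivery_method>
  <trial>Trial A</trial>
  <trial>Trial B</trial>
</project>"""

print(has_single_delivery_method(one_method))
print(has_single_delivery_method(two_methods))
```

The second document is perfectly sensible data for its owner. It only becomes "invalid" once someone else's model is declared the standard.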

That is a trivial example. Just wait until you get to things like Adverse Events (side effects), or patient demographics, or medical history; you are guaranteed to find some fundamentally irreconcilable points of view, neither of which is wrong. What information is a trial investigator required to get from a patient? Not a trivial issue.

OK, so let's say that you've managed to get in-house or external competitors to cooperate, got all those thorny issues settled, and are ready to use XML to transfer data. How are you going to look at it? It's not really practical to use a text editor to view the XML, as it is all marked up. It would be like looking at HTML source, only more so. You've got to write some XSL (eXtensible Stylesheet Language) code, which in practice means programming an XML-to-HTML conversion. That'll take you some time, both in the writing and in the execution. It'll really be fun to look at if you have a slow connection.
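
To give a feel for the kind of work such a stylesheet does, here is a minimal sketch of an XML-to-HTML conversion. It is written in plain Python rather than XSL, purely as a stand-in, and the element and attribute names (`trial`, `name`, `visit_schedule`) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

def trials_to_html(xml_text: str) -> str:
    """Render each <trial> element as one row of an HTML table."""
    root = ET.fromstring(xml_text)
    rows = []
    for trial in root.iter("trial"):
        # Escape values so stray markup in the data cannot break the page.
        name = escape(trial.get("name", ""))
        schedule = escape(trial.findtext("visit_schedule", default=""))
        rows.append(f"<tr><td>{name}</td><td>{schedule}</td></tr>")
    header = "<tr><th>Trial</th><th>Visit schedule</th></tr>"
    return "<table>" + header + "".join(rows) + "</table>"

# Hypothetical sample data.
sample = """<project>
  <trial name="Trial A"><visit_schedule>weekly</visit_schedule></trial>
  <trial name="Trial B"><visit_schedule>monthly</visit_schedule></trial>
</project>"""

html_table = trials_to_html(sample)
print(html_table)
```

Even this toy version has to know the tag names in advance, so every change to the DTD means revisiting the display code as well.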

Now that you've defined it, and now that you can look at it, you're probably going to want to do something with it (if all you want to do is look at it, you could have kept it in HTML to start with). Perhaps you'll want to convert your existing processes to handle XML, or create whole new ones from scratch.


Not that XML itself is a perfect way to transfer data. While it's generally good enough, there are some flaws that show up in the real world.

The first is that it creates a hierarchical database. In the example above, a compound would have one or more projects, which would have one or more trials, etc. However, almost all client-server and thin-client applications use relational databases, where data is stored in tables with rows and columns. They're not the same shape. To store XML data in a relational database, it'll have to be converted back and forth. Not a big problem, but yet another obstacle to overcome.
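
A minimal sketch of that conversion, assuming a made-up XML layout and a three-table schema (neither comes from a real system), might look like this in Python with the standard library's `sqlite3` module:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical hierarchical data: compound -> projects -> trials.
XML_DATA = """<compound name="aspirin">
  <project indication="headache">
    <trial name="Trial A"/>
  </project>
  <project indication="heart attack prevention">
    <trial name="Trial B"/>
    <trial name="Trial C"/>
  </project>
</compound>"""

def load_into_tables(xml_text: str) -> sqlite3.Connection:
    """Flatten the nested XML into three tables linked by foreign keys."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE compound (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE project (id INTEGER PRIMARY KEY, compound_id INTEGER, indication TEXT);
        CREATE TABLE trial (id INTEGER PRIMARY KEY, project_id INTEGER, name TEXT);
    """)
    compound = ET.fromstring(xml_text)
    cur = conn.execute("INSERT INTO compound (name) VALUES (?)", (compound.get("name"),))
    compound_id = cur.lastrowid
    for project in compound.findall("project"):
        cur = conn.execute(
            "INSERT INTO project (compound_id, indication) VALUES (?, ?)",
            (compound_id, project.get("indication")),
        )
        project_id = cur.lastrowid
        # Nesting in the XML becomes a foreign key column in the table.
        for trial in project.findall("trial"):
            conn.execute(
                "INSERT INTO trial (project_id, name) VALUES (?, ?)",
                (project_id, trial.get("name")),
            )
    conn.commit()
    return conn

conn = load_into_tables(XML_DATA)
# Rebuilding the hierarchy takes a join, which is the "back" half of the trip.
rows = conn.execute("""
    SELECT project.indication, trial.name
    FROM trial JOIN project ON trial.project_id = project.id
    ORDER BY trial.name
""").fetchall()
```

The parent-child nesting that XML gives for free has to be spelled out as key columns on the way in and as joins on the way out.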

A more serious problem is the file size. Again, according to www.xml.com, "Terseness in XML markup is of minimal importance." Well, maybe in a George Gilder dream, but in the real world, file size does matter. A lot. Bandwidth is not free, and XML is definitely not terse. For example, the SAS Institute is considering using XML to store their datasets. However, the FDA sets a 25 MB file size limit on datasets in a new drug submission. Currently, usually only lab datasets are affected by this limit. If XML is used, many more datasets would hit this barrier and have to be split up, causing much angst and gnashing of teeth.
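
The overhead is easy to measure. The sketch below writes the same made-up lab records once as CSV and once as element-per-field XML, then compares the byte counts. The field names, values, and the element-per-field layout are assumptions for illustration and have nothing to do with any actual SAS or FDA format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up lab records; field names and values are illustrative only.
records = [
    {"patient_id": str(1000 + i), "test": "ALT", "value": str(20 + i % 15), "unit": "U/L"}
    for i in range(1000)
]

def as_csv(rows) -> bytes:
    """Serialize the records as CSV with a single header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")

def as_xml(rows) -> bytes:
    """Serialize the records as XML with one element per field."""
    root = ET.Element("dataset")
    for row in rows:
        record = ET.SubElement(root, "record")
        for field, value in row.items():
            # Every value pays for an opening and a closing tag.
            ET.SubElement(record, field).text = value
    return ET.tostring(root, encoding="utf-8")

csv_size = len(as_csv(records))
xml_size = len(as_xml(records))
ratio = xml_size / csv_size
print(f"CSV: {csv_size} bytes, XML: {xml_size} bytes, ratio: {ratio:.1f}x")
```

The exact ratio depends on tag names and on how long the values are, but with short values like these the markup outweighs the data several times over. Compression narrows the gap without removing the cost of producing and parsing the tags.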

Data transfer standards are fine, and XML is as good a choice as any. Just don't think that it won't take its pound of flesh, just like every other technology known. My guess is that it will be one of those technologies of the future that always remain so.
