SQLServerCentral Article

What is XML?


Note: This article was originally written in 1999, and not much has changed.

So, what the heck is XML, anyway?

The amount of hype pushing XML (eXtensible Markup Language) is amazing,

even for the software industry. Bowstreet claims that its Business Web

Factory "Integrates business processes (your own and your partners')

seamlessly into customized business webs using XML". IBM has an XML

evangelist. Microsoft "supports" XML in Internet Explorer 5.

I knew it was bad when my wife got a taxi receipt with the following

message on it (I've saved it as a souvenir):



XML isn't a programming language. XML doesn't have anything to do with

the web, in and of itself. It isn't really even a markup language.

Technically speaking, it is a meta-language, or a language language. It

is a set of guidelines for creating a markup language, or, more

specifically, guidelines for creating your own set of tags. According to

www.xml.com , "A markup language is a mechanism to identify structures

in a document. The XML specification defines a standard way to add

markup to documents."

Practically speaking, it is a data modeling language. It is a decent

(though not problem-free) way for organizations to specify their data

and object structures in a platform-independent manner. Done correctly,

an organization could use XML to easily transfer data between different

software and hardware vendors.

It's just very, very, very hard to do it correctly.

A bit of background. The grandfather of markup languages is SGML

(Standard Generalized Markup Language). SGML has been around for quite

a while, and has been used successfully. However, its complexity

creates some programming difficulties.

Meanwhile, the HyperText Markup Language (HTML) was enjoying

widespread use, mainly as a browser programming language. Even though

it was an instantiation of SGML, it suffered from the opposite

problem: it was too limited for what people wanted to do. So, XML was

born as a compromise.

XML is closer to SGML than HTML in that it is extremely generic. XML

doesn't have tags itself; it just tells you how to make tags. A

slightly humorous example of this appeared in a PDF developer mailing

list that I subscribe to. One person asked if there were any tools to

convert PDF to XML. After explaining the apples/oranges thing, another

subscriber created a well-formed XML document:

<?xml version="1.0"?>
(the binary .PDF file)

The above is a legitimate conversion of PDF to XML. It is, however,

completely and utterly useless.
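
To make the joke concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It assumes a single hypothetical root element named "pdf" (for a parser to accept a document as well-formed, the payload needs exactly one root element) and uses stand-in bytes rather than a real PDF file. The payload is base64-encoded only so that it stays within the characters XML allows.

```python
import base64
import xml.etree.ElementTree as ET

# Stand-in bytes for a PDF file; a real file would be read from disk.
pdf_bytes = b"%PDF-1.3\n...binary content...\n%%EOF"

# Wrap the blob in one hypothetical root element named "pdf".
root = ET.Element("pdf")
root.text = base64.b64encode(pdf_bytes).decode("ascii")
document = ET.tostring(root, encoding="unicode")

# The parser accepts the result, so the document is well-formed.
parsed = ET.fromstring(document)

# But there is nothing to query: one element, no children, no data model.
child_count = len(list(parsed))
round_trip = base64.b64decode(parsed.text)
```

The parser is perfectly happy, and you get your bytes back out, but nothing about the document tells you what is inside the PDF.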

To use XML well, a Document Type Definition (DTD) needs to be created.

The DTD is where the tags that comprise the document are defined, as

well as the relationship of one tag to another. It is where the data

model is created, and where most of the difficulties of using XML to

transfer data are.

In many ways, the data model is the guts of any application. The reasons

why the software you are using exists can be found in the data model.

Modeling how one piece of data relates to another is akin to determining

what frame a car will have. The ramifications of the car's width

determine how heavy it will be, which determines the engine and

transmission it will have, and so on.

To give a real life example, I work for Novartis Pharmaceuticals,

developing software for clinical trials. In clinical trials, which are

really massive, outrageously expensive experiments, we start out with a

compound. A compound is a chemical that may or may not have medicinal

use. Aspirin is a compound, as is Tamoxifen, and penicillin, and so on.

You can use compounds for several reasons, or indications. For example,

you can use aspirin to get rid of a headache, or you can use aspirin to

prevent heart attacks. A compound plus an indication is called a project.


Now, to test whether aspirin prevents heart attacks, an experiment is

designed, which we call a trial. There can be several ways to test

whether aspirin prevents heart attacks.

In data modeling speak, compounds, projects, and trials are entities.

These entities have attributes. A compound has a chemical name, a trial

has a visit schedule. At Novartis, a project has an attribute called the

galenical aspect. This refers to the method of delivering the compound:

a shot, a pill, in toothpaste, whatever. In our model, a project must

have one and only one delivery method. Let's say, for sake of argument,

that Merck allows a project to have more than one compound delivery method.


Now, let's say that Novartis decides to publish a clinical trial DTD

which allows a project to have only one delivery method, and gets its

DTD accepted as the industry standard. Implementation of Novartis's

standard could then result in Merck having to change its clinical trial

budgeting process, as different trials are moved into different projects

because one gives the patients aspirin in pill form, and the other gives

the patients suppositories.
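
The clash between the two models can be sketched in a few lines of Python. Everything here is invented for illustration: the element names "project" and "delivery", the attribute "name", and both sample documents. In DTD terms, the first model would correspond to a content model like (delivery) and the second to (delivery+); the functions below just check and enforce that rule by hand, since Python's standard library does not validate against DTDs.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents; element and attribute names are invented.
ONE_DELIVERY = """
<project name="aspirin-heart-attack">
  <delivery>pill</delivery>
</project>
"""

MANY_DELIVERIES = """
<project name="aspirin-heart-attack">
  <delivery>pill</delivery>
  <delivery>suppository</delivery>
</project>
"""

def has_exactly_one_delivery(xml_text):
    """Return True if the project has one and only one <delivery> child."""
    project = ET.fromstring(xml_text)
    return len(project.findall("delivery")) == 1

def split_by_delivery(xml_text):
    """Split a multi-delivery project into one project per delivery method."""
    project = ET.fromstring(xml_text)
    pieces = []
    for delivery in project.findall("delivery"):
        new_name = f"{project.get('name')}-{delivery.text}"
        piece = ET.Element("project", name=new_name)
        ET.SubElement(piece, "delivery").text = delivery.text
        pieces.append(piece)
    return pieces
```

The split function is the software version of the budgeting headache: to fit the stricter standard, one project has to become two.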

That is a trivial example. Just wait until you get to things like

Adverse Events (side effects), or patient demographics, or medical

history; you are guaranteed to find some fundamentally irreconcilable

points of view, neither of which is wrong. What information is a trial

investigator required to get from a patient? Not a trivial issue.

OK, so let's say that you've managed to get in-house or external

competitors to cooperate, got all those thorny issues settled, and are

ready to use XML to transfer data. How are you going to look at it? It's

not really practical to use a text editor to view the XML, as it is all

marked up. It would be like looking at HTML source, only more so. You've

got to write some XSL (eXtensible Stylesheet Language) code, which is a

programming language for an XML=>HTML converter. That'll take you some

time, both in the writing, and the execution. It'll really be fun to

look at if you have a slow connection.
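
To give a feel for the conversion step, here is a rough Python sketch. Note that it is not XSL: Python's standard library has no XSLT processor, so this swaps in a plain ElementTree traversal that does the same kind of XML-to-HTML job. The "trials" document and its element names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document; the tag names are invented for illustration.
TRIALS_XML = """
<trials>
  <trial id="T1"><compound>aspirin</compound><indication>headache</indication></trial>
  <trial id="T2"><compound>aspirin</compound><indication>heart attack prevention</indication></trial>
</trials>
"""

def trials_to_html(xml_text):
    """Render each <trial> element as one row of an HTML table."""
    source = ET.fromstring(xml_text)
    table = ET.Element("table")
    header = ET.SubElement(table, "tr")
    for title in ("Trial", "Compound", "Indication"):
        ET.SubElement(header, "th").text = title
    for trial in source.findall("trial"):
        row = ET.SubElement(table, "tr")
        ET.SubElement(row, "td").text = trial.get("id")
        ET.SubElement(row, "td").text = trial.findtext("compound")
        ET.SubElement(row, "td").text = trial.findtext("indication")
    return ET.tostring(table, encoding="unicode", method="html")

html = trials_to_html(TRIALS_XML)
```

Even for three columns, every tag has to be mapped by hand, which is the same work an XSL stylesheet would have to spell out.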

Now that you've defined it, and now that you can look at it, you're

probably going to want to do something with it (if all you want to do

is look at it, you could have kept it in HTML to start with). Perhaps

you'll want to convert your existing processes to handle XML, or

create whole new ones from scratch.


Not that XML itself is a perfect way to transfer data. While it's

generally good enough, there are some flaws that show up in the real world.


The first is that it creates a hierarchical database. In the example

above, a compound would have one or more projects, which would have one

or more trials, etc. However, almost all client-server and thin-client

applications use relational databases, where data is stored in

tables with rows and columns. They're not the same shape. To store XML

data in a relational database, it'll have to be converted back and

forth. Not a big problem, but yet another obstacle to overcome.
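
Here is a small sketch of that conversion, going from the tree to tables. It uses Python's built-in sqlite3 and ElementTree modules. The document layout, table names, and column names are all assumptions made up for this example, not anybody's real schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical hierarchy: a compound contains projects, which contain trials.
COMPOUND_XML = """
<compound name="aspirin">
  <project indication="headache">
    <trial id="T1"/>
  </project>
  <project indication="heart attack prevention">
    <trial id="T2"/>
    <trial id="T3"/>
  </project>
</compound>
"""

def load_into_tables(xml_text, conn):
    """Flatten the nested XML into two related tables."""
    conn.execute(
        "CREATE TABLE project (id INTEGER PRIMARY KEY, compound TEXT, indication TEXT)"
    )
    conn.execute(
        "CREATE TABLE trial (id TEXT PRIMARY KEY, project_id INTEGER REFERENCES project(id))"
    )
    compound = ET.fromstring(xml_text)
    for project in compound.findall("project"):
        cursor = conn.execute(
            "INSERT INTO project (compound, indication) VALUES (?, ?)",
            (compound.get("name"), project.get("indication")),
        )
        # Nesting in the XML becomes a foreign key in the relational model.
        for trial in project.findall("trial"):
            conn.execute(
                "INSERT INTO trial (id, project_id) VALUES (?, ?)",
                (trial.get("id"), cursor.lastrowid),
            )

conn = sqlite3.connect(":memory:")
load_into_tables(COMPOUND_XML, conn)
rows = conn.execute(
    "SELECT p.indication, t.id FROM trial t "
    "JOIN project p ON p.id = t.project_id ORDER BY t.id"
).fetchall()
```

What the XML expresses by nesting, the database has to express with keys and a join, and going the other way means rebuilding the tree from those joins.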

A more serious problem is the file size. Again, according to

www.xml.com, "Terseness in XML markup is of minimal importance." Well,

maybe in a George Gilder dream, but in the real world, file size does

matter. A lot. Bandwidth is not free, and XML is definitely not terse.

For example, the SAS Institute is considering using XML to store their

datasets. However, the FDA sets a 25 MB file size limit on datasets in

a new drug submission. Currently, usually only lab datasets are affected

by this limit. If XML is used, many more datasets would hit this barrier

and have to be split up, causing much angst and gnashing of teeth.
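
You can get a rough feel for the overhead with a quick Python experiment. The records below are invented (they are not a real lab dataset, and this is not the SAS format), and the comparison is against plain CSV, which is only one possible baseline. The exact ratio will depend on tag names and data, so treat this as a sketch rather than a benchmark.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented lab-style records, for illustration only.
FIELDS = ["patient", "visit", "test", "value"]
records = [
    {"patient": str(1000 + i), "visit": str(i % 5 + 1), "test": "ALT", "value": str(20 + i % 30)}
    for i in range(1000)
]

def as_csv(rows):
    """Serialize the records as CSV with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")

def as_xml(rows):
    """Serialize the records as XML, one element per field."""
    root = ET.Element("dataset")
    for row in rows:
        record = ET.SubElement(root, "record")
        for field in FIELDS:
            ET.SubElement(record, field).text = row[field]
    return ET.tostring(root, encoding="utf-8")

csv_size = len(as_csv(records))
xml_size = len(as_xml(records))
ratio = xml_size / csv_size
```

In the CSV the field names appear once, in the header. In the XML they appear twice per value, as opening and closing tags, on every single record.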

Data transfer standards are fine, and XML is as good a choice as any.

Just don't think that it won't take its pound of flesh, just like every

other technology known. My guess is that it will be one of those

technologies of the future that always remain so.

