I am neither a big fan of XML nor an expert. Over the years I have seen XML
used and abused yet another silver bullet to solve all of our problems. I have
seen XML fail on many projects because it is too verbose - consuming too much
bandwidth and processing power where good ol' fixed-length records would have
been fine. I have also seen XML as a good solution to interfacing problems and I
(almost) unknowingly use it every day interfacing with web services and
configuring systems.
Although XML is interesting, I never found the motivation or need to go out and
buy the biggest book on XML that I can find in order to master what should
really be a simple markup language. XML gets complicated fast and before you
know it you are either confused or an XML bigot (where your confusion is hidden
behind a veneer of expertise).
I wouldn't call myself a DBA, but I have done enough late night babysitting of production databases to have an idea how my design and architectural decisions impact the database the least scalable of any part of the architecture. So in my opinion XML - as a bloated, misunderstood, misinterpreted, over-hyped and often badly implemented technology should be nowhere near the database. End of discussion no budging. Holding such a view won't get me fired and will win the welcome support of the production DBAs who could always point fingers at my bad SQL
syntax and data structures as a reason for poor performance.
I confess that I have used XML in the database in my latest project.
There, I have said it and in the company of expert DBAs who have had enough
of people like me putting all sorts of crap into the database. Amongst such
company I had better come up with an explanation and fast. Allow me to do so.
There are two situations where XML in SQL 2005 comes in quite handy. They are
when implementing classifications and temporal data designs.
SQL 2005 XML for Classifications
One thing that is missing in SQL databases is a construct to handle
classifications. It is a problem that designers always encounter and the
mechanisms for the physical implementations vary extensively.
What do I mean by a classification? Consider a system that has an order and a
customer table, where the key from the customer is put into the order table
simple normalization. Now what if your customer can be a person, with a first
name, surname and date of birth or it can be an organization, with a name and
company registration number? The problem is that it doesn't feel right to create
a single table with all of the attributes (breaking some relational model rules
in the process) and it is difficult to implement separate [Person] and
[Organization] tables.
There are two basic approaches, either a roll-up or a roll-down approach.
In the roll-up approach a single table is created with all of the fields and
discriminator or classification attributes to distinguish between the classes.
The individual classes can then be implemented as views as follows:
CREATE TABLE
Stakeholder(
StakeholderId int,
Firstname varchar(50),
Surname varchar(100),
DateOfBirth datetime,
Name varchar(100),
RegistrationNo varchar(20),
StakeholderClass int,
isPerson bit,
isOrganization bit
)
GO
CREATE VIEW Person
AS
SELECT
StakeholderId, Firstname, Surname, DateOfBirth
FROM Stakeholder
WHERE isPerson=1
GO
CREATE VIEW
Organization
AS
SELECT
StakeholderId, Name, RegistrationNo
FROM Stakeholder
WHERE
isOrganization=1
In the roll-down approach multiple tables are created with their correct
attributes and the 'leaf' tables are unioned together for an index table as
follows:
CREATE TABLE
Person(
StakeholderId int,
Firstname varchar(50),
Surname varchar(100),
DateOfBirth datetime
)
CREATE TABLE
Organization(
StakeholderId int,
Name varchar(100),
RegistrationNo varchar(20),
)
CREATE VIEW
Stakeholder
AS
SELECT
StakeholderId, Firstname+' '+Surname
AS Name, CAST(1
AS bit) AS
isPerson, CAST(0
AS bit) AS
isOrganization
FROM Person
UNION
SELECT
StakeholderId, Name,
CAST(0 AS bit)
AS isPerson,
CAST(1 AS bit)
AS isOrganization
FROM Organization
Combinations of the two approaches also exist where some fields are created on
an index table and other fields exist only on the leaf tables. Things get a bit
complicated when implementing complex classification structures. What if some
persons are employees (add an employee number) and what if some employees are
contractors (add contract period)? Not only do complex structures become
difficult to implement but CRUD across index tables and views becomes a pain.
I have implemented such mechanisms on large databases quite successfully - such as a bank that had
14 million customers. But it can become quite complex and I
was looking for a simple classification mechanism that would be reasonably easy
to implement, not contain excessive tables or attributes and would be able to be
extended or replaced. Enter the XML temptress
I create a simple index table with a column to store all the class-specific
attributes as follows:
CREATE TABLE
Stakeholder(
StakeholderId int
IDENTITY(1,1),
Name varchar(100),
ClassAttributes xml,
isPerson bit,
isOrganization bit
)
Since my application data access layer uses only sprocs to access the database,
some insert and update sprocs for the person and the organization need to be
written. For example with person:
CREATE PROCEDURE
InsertPerson
@Firstname varchar(50),
@Surname varchar(100),
@DateOfBirth datetime
AS
DECLARE @Extended
xml
SET @Extended='<person><firstName>'+@FirstName+'</firstName><surname>'+@Surname+'</surname></person>'
IF (@DateOfBirth
IS NOT NULL)
BEGIN
DECLARE @Dob
nvarchar(10)
SET @Dob=CONVERT(nvarchar(10),@DateOfBirth,20)
SET @Extended.modify('insert
<dateOfBirth>{sql:variable("@Dob")}</dateOfBirth> as last into (/person)[1]')
END
INSERT INTO
Stakeholder(Name,ClassAttributes,isPerson)
VALUES(@FirstName+'
'+@Surname,@Extended,1)
Executing the above sproc like this:
EXEC InsertPerson
'Joe',
'Soap', '1 Jan 1980'
Results in a record with the basic information and an XML structure that neatly
contains all of the other bits of information and would store the following XML:
<person>
<firstName>Joe</firstName>
<surname>Soap</surname>
<dateOfBirth>1980-01-01</dateOfBirth>
</person>
Notice the use of the XQuery insert that only adds the attribute if it is not
null, resulting in neater looking XML data.
A similar sproc for organization would store XML something like this:
<organization>
<name>SQLServerCentral</name>
<registrationNo>ABC1234</registrationNo>
</organization>
My individual [Person] and [Organization] tables are implemented as views like
this:
CREATE VIEW Person
AS
SELECT
StakeholderId, ClassAttributes.value('(/person/firstName)[1]',
'varchar(50)')
AS FirstName,
ClassAttributes.value('(/person/surname)[1]',
'varchar(100)')
AS Surname,
ClassAttributes.value('(/person/dateOfBirth)[1]',
'datetime')
AS DateOfBirth,
ClassAttributes.value('(/person/title)[1]',
'varchar(10)')
AS Title
FROM Stakeholder
WHERE (isPerson
= 1)
CREATE VIEW
Organization
AS
SELECT
StakeholderId, Name,
ClassAttributes.value('(/organization/organizationTypeId)[1]',
'int') AS
OrganizationTypeId,
ClassAttributes.value('(/organization/registrationNo)[1]',
'varchar(20)')
AS RegistrationNo
FROM Stakeholder
WHERE (isOrganization
= 1)
The views are an interesting implementation in that from a relational model
point of view they are valid relations and the syntax to use them will be
standard SQL. Consider the query where we want to search on the name of a
stakeholder, but with people we need to query the surname and on organizations
we need to query the name. The following query, even though it has XML innards
is a perfectly valid and understandable query.
SELECT
StakeholderId, Name
FROM Organization
WHERE Name LIKE
'S%'
UNION
SELECT
StakeholderId, Surname
FROM Person
WHERE Surname
LIKE 'S%'
There are other ways to query the XML directly using XQuery but I want to
stay as close to ANSI 92 syntax as possible.
Even though we are storing non-relational data in our SQL database we don't
really break that many relational rules. Technically the relational model states
that the storage of data is independent of the model so, the argument that the
use of views is non-relational is invalid (sort of) - if [Person] and
[Organization] are implemented as views, which are valid relations, then we are
not breaking any rules.
By now, any real DBA would be thinking This guy is frikkin insane, it will
perform like a dog! This is both a correct and incorrect thought no,
I am not frikkin insane and yes, performance can be an issue. I would not
recommend this approach if you have a structure with millions of records, which
would be the case with the stakeholder structure in large enterprises. But what
about structures with fewer rows, even in the tens or hundreds of thousands?
Maybe a product classification or retail outlet classification would perform
adequately? You may also notice in this example that Name is redundant, since it
is contained in the XML anyway this has been done on purpose for performance
reasons since most of the queries only want the name which is a common
attribute, so there is no point in mucking about with the XML.
Another key aspect with regard to performance is understanding the interfaces.
In my particular implementation, if I wanted to create proper fields for the
attributes there would be no far-reaching impact. The interfaces to the sprocs
wouldn't change and the fact that I may have replaced the Person view with a
table would make no difference to existing SQL.
Denormalization of Temporal Data Using XML
Consider a requirement stating
"For ordered items, display the part name and the
stock on hand". This may be turned into a query something like this:
SELECT o.OrderId,
o.PartId, o.Quantity, o.Cost, p.Name, p.StockAvailable
FROM OrderItem o
INNER JOIN Part p
ON p.PartId=o.PartId
WHERE OrderId=1
As with most requirements it is not specific enough and should read "For
ordered items, display the part name and the stock on hand when the order was
placed".
The problem with the above query is that part information, particularly the
available stock, changes continuously and the query doesn't take this into
account. Many similar problems exist with data that is date dependant (temporal
data). Again, there are many ways to resolve this problem you could have [Part]
movement records and join based on the order date to the movement tables, or you
could denormalize your structures and create an [OrderItem] table with the [Part].[Name] and
[Part].[StockAvailable] values in fields on [OrderItem]. The most common approach by far is to do
neither and land up with all sorts of issues relating to the temporality of the
data which puts the integrity of the entire database in question by the users.
Generally, unless I need to create a specific structure to handle temporality
where it may be important I tend to take the easy route and denormalize my
structures. The problem is figuring out how many of the fields from [Part] should
be stored on [OrderItem] to handle all the various combinations of queries that
the users may think up in future. Also, it looks ugly when [Part] fields are
reproduced on the [OrderItem] table apart from breaking a few relational model
rules along the way.
In a recent system there was a need to store some 'snapshot' temporal data, but
since other parts of the system were not specified, never mind developed, we
were unsure which fields to store the solution was to store most of them in an
XML field and worry about it later.
So in the above example I would create a table something like this:
CREATE TABLE
OrderItem(
OrderId int,
PartId int,
Quantity int,
Cost money,
PartStore xml)
With sprocs to handle the inserts and updates,
CREATE PROCEDURE
InsertOrderItem
@OrderId int,
@PartId int,
@Quantity int,
@Cost money
AS
DECLARE @PartStore
xml
SELECT @PartStore='<part><name>'+Name+'</name><stockAvailable>'+CAST(StockAvailable
AS varchar(10))+'</stockAvailable></part>'
FROM Part
WHERE PartId=@PartId
INSERT INTO
OrderItem(OrderId,PartId,Quantity,Cost,PartStore)
VALUES(@OrderId,@PartId,@Quantity,@Cost,@PartStore)
This would create a store of temporal data something like this
<part>
<name>Part
1</name>
<stockAvailable>10</stockAvailable>
</part>
When querying data I simply look in the XML for the part data that I need,
SELECT OrderId,
PartId, Quantity, Cost,
PartStore.value('(/part/name)[1]',
'varchar(50)')
AS Name,
PartStore.value('(/part/stockAvailable)[1]',
'int') AS
StockAvailable
FROM OrderItem
WHERE OrderId=1
And as per the classification examples above, I can wrap the queries into nicely
named views if I prefer.
The plan is that over time, provided my interfaces remain the same, I can add
Part attributes directly to the OrderItem table if needed. This can be done on a
production database and I would just need to alter the table,
ALTER TABLE
OrderItem ADD PartName
varchar(50)
UPDATE OrderItem
SET PartName=PartStore.value('(/part/name)[1]',
'varchar(50)')
and change any sprocs or views that reference the table all very backwards
compatible.
Summary
A year ago I would have recoiled at the mere suggestion of using XML directly in
the database, thinking that it was the domain of junior asp developers who can't
tell a
relation from a relationship. Over the last few months the XML Temptress
in SQL 2005 has lured me into her parlour providing a mechanism to approach old
problems in new ways.
I would still be in line violently opposing persisting objects as XML in the
database as object oriented bigots would be tempted to do, after all, they say
that the database is just a persistence mechanism for their objects. I can
picture object orientation bigots advocating a database of one table with two
attributes, ObjectId (GUID) and ObjectData (XML) such an image is concerning.
However, I hope that I have demonstrated that if carefully thought through that
XML in the database can be useful and elegant provided that it is done within
the context of the overall architecture. The biggest issue with XML in the
database is understanding when performance becomes the overriding factor as to
where and how data is stored after all it is mostly the performance of
databases that your DBA has to deal with anyway.
Simon Munro is currently pursuing the Microsoft Certified Architect (MCA) certification and maintains a blog at http://www.delphi.co.za/ that covers many interesting ideas on using Microsoft technologies.