The SQL Server 2005 XML Temptress

I am neither a big fan of XML nor an expert. Over the years I have seen XML

used and abused yet another silver bullet to solve all of our problems. I have

seen XML fail on many projects because it is too verbose - consuming too much

bandwidth and processing power where good ol' fixed-length records would have

been fine. I have also seen XML as a good solution to interfacing problems and I

(almost) unknowingly use it every day interfacing with web services and

configuring systems.

Although XML is interesting, I never found the motivation or need to go out and

buy the biggest book on XML that I can find in order to master what should

really be a simple markup language. XML gets complicated fast and before you

know it you are either confused or an XML bigot (where your confusion is hidden

behind a veneer of expertise).

I wouldn't call myself a DBA, but I have done enough late night babysitting of production databases to have an idea how my design and architectural decisions impact the database the least scalable of any part of the architecture. So in my opinion XML - as a bloated, misunderstood, misinterpreted, over-hyped and often badly implemented technology should be nowhere near the database. End of discussion no budging. Holding such a view won't get me fired and will win the welcome support of the production DBAs who could always point fingers at my bad SQL

syntax and data structures as a reason for poor performance.

I confess that I have used XML in the database in my latest project.

There, I have said it and in the company of expert DBAs who have had enough

of people like me putting all sorts of crap into the database. Amongst such

company I had better come up with an explanation and fast. Allow me to do so.

There are two situations where XML in SQL 2005 comes in quite handy. They are

when implementing classifications and temporal data designs.

SQL 2005 XML for Classifications

One thing that is missing in SQL databases is a construct to handle

classifications. It is a problem that designers always encounter and the

mechanisms for the physical implementations vary extensively.

What do I mean by a classification? Consider a system that has an order and a

customer table, where the key from the customer is put into the order table

simple normalization. Now what if your customer can be a person, with a first

name, surname and date of birth or it can be an organization, with a name and

company registration number? The problem is that it doesn't feel right to create

a single table with all of the attributes (breaking some relational model rules

in the process) and it is difficult to implement separate [Person] and

[Organization] tables.

There are two basic approaches, either a roll-up or a roll-down approach.

In the roll-up approach a single table is created with all of the fields and

discriminator or classification attributes to distinguish between the classes.

The individual classes can then be implemented as views as follows:

CREATE TABLE 
Stakeholder(
StakeholderId int,
Firstname varchar(50),
Surname varchar(100),
DateOfBirth datetime,
Name varchar(100),
RegistrationNo varchar(20),
StakeholderClass int,
isPerson bit,
isOrganization bit
)
GO
CREATE VIEW Person
AS
SELECT 
StakeholderId, Firstname, Surname, DateOfBirth
FROM Stakeholder
WHERE isPerson=1
GO
CREATE VIEW 
Organization
AS
SELECT 
StakeholderId, Name, RegistrationNo
FROM Stakeholder
WHERE 
isOrganization=1

In the roll-down approach multiple tables are created with their correct

attributes and the 'leaf' tables are unioned together for an index table as

follows:

CREATE TABLE 
Person(
StakeholderId int,
Firstname varchar(50),
Surname varchar(100),
DateOfBirth datetime
)
 
CREATE TABLE 
Organization(
StakeholderId int,
Name varchar(100),
RegistrationNo varchar(20),
)
 
CREATE VIEW 
Stakeholder
AS
SELECT 
StakeholderId, Firstname+' '+Surname
AS Name, CAST(1
AS bit) AS
isPerson, CAST(0
AS bit) AS
isOrganization
FROM Person
UNION
SELECT 
StakeholderId, Name,
CAST(0 AS bit)
AS isPerson, 
CAST(1 AS bit)
AS isOrganization
FROM Organization

Combinations of the two approaches also exist where some fields are created on

an index table and other fields exist only on the leaf tables. Things get a bit

complicated when implementing complex classification structures. What if some

persons are employees (add an employee number) and what if some employees are

contractors (add contract period)? Not only do complex structures become

difficult to implement but CRUD across index tables and views becomes a pain.

I have implemented such mechanisms on large databases quite successfully - such as a bank that had

14 million customers. But it can become quite complex and I

was looking for a simple classification mechanism that would be reasonably easy

to implement, not contain excessive tables or attributes and would be able to be

extended or replaced. Enter the XML temptress

I create a simple index table with a column to store all the class-specific

attributes as follows:

CREATE TABLE 
Stakeholder(
StakeholderId int
IDENTITY(1,1),
Name varchar(100),
ClassAttributes xml,
isPerson bit,
isOrganization bit
)

Since my application data access layer uses only sprocs to access the database,

some insert and update sprocs for the person and the organization need to be

written. For example with person:

CREATE PROCEDURE 
InsertPerson
@Firstname varchar(50),
@Surname varchar(100),
@DateOfBirth datetime
AS
DECLARE @Extended
xml
 
SET @Extended='<person><firstName>'+@FirstName+'</firstName><surname>'+@Surname+'</surname></person>'
IF (@DateOfBirth
IS NOT NULL) 
  BEGIN
    DECLARE @Dob
nvarchar(10)
    SET @Dob=CONVERT(nvarchar(10),@DateOfBirth,20)
    SET @Extended.modify('insert
<dateOfBirth>{sql:variable("@Dob")}</dateOfBirth> as last into (/person)[1]')
  END
 
INSERT INTO 
Stakeholder(Name,ClassAttributes,isPerson)
VALUES(@FirstName+'
'+@Surname,@Extended,1)

Executing the above sproc like this:

EXEC InsertPerson
'Joe', 
'Soap', '1 Jan 1980'

Results in a record with the basic information and an XML structure that neatly

contains all of the other bits of information and would store the following XML:

<person>
    <firstName>Joe</firstName>
    <surname>Soap</surname>
    <dateOfBirth>1980-01-01</dateOfBirth>
</person>

Notice the use of the XQuery insert that only adds the attribute if it is not

null, resulting in neater looking XML data.

A similar sproc for organization would store XML something like this:

<organization>
    <name>SQLServerCentral</name>
    <registrationNo>ABC1234</registrationNo>
</organization>

My individual [Person] and [Organization] tables are implemented as views like

this:

CREATE VIEW Person
AS
SELECT 
StakeholderId, ClassAttributes.value('(/person/firstName)[1]',
'varchar(50)')
AS FirstName, 
  ClassAttributes.value('(/person/surname)[1]',
'varchar(100)')
AS Surname, 
  ClassAttributes.value('(/person/dateOfBirth)[1]',
'datetime') 
AS DateOfBirth, 
  ClassAttributes.value('(/person/title)[1]',
'varchar(10)')
AS Title
FROM Stakeholder
WHERE     (isPerson
= 1)
 
CREATE VIEW 
Organization
AS
SELECT    
StakeholderId, Name, 
  ClassAttributes.value('(/organization/organizationTypeId)[1]',
'int') AS
OrganizationTypeId,
  ClassAttributes.value('(/organization/registrationNo)[1]',
'varchar(20)')
AS RegistrationNo
FROM Stakeholder
WHERE (isOrganization
= 1)

The views are an interesting implementation in that from a relational model

point of view they are valid relations and the syntax to use them will be

standard SQL. Consider the query where we want to search on the name of a

stakeholder, but with people we need to query the surname and on organizations

we need to query the name. The following query, even though it has XML innards

is a perfectly valid and understandable query.

SELECT 
StakeholderId, Name
FROM Organization
WHERE Name LIKE 
'S%'
UNION
SELECT 
StakeholderId, Surname
FROM Person
WHERE Surname
LIKE 'S%'

There are other ways to query the XML directly using XQuery but I want to

stay as close to ANSI 92 syntax as possible.

Even though we are storing non-relational data in our SQL database we don't

really break that many relational rules. Technically the relational model states

that the storage of data is independent of the model so, the argument that the

use of views is non-relational is invalid (sort of) - if [Person] and

[Organization] are implemented as views, which are valid relations, then we are

not breaking any rules.

By now, any real DBA would be thinking This guy is frikkin insane, it will

perform like a dog! This is both a correct and incorrect thought no,

I am not frikkin insane and yes, performance can be an issue. I would not

recommend this approach if you have a structure with millions of records, which

would be the case with the stakeholder structure in large enterprises. But what

about structures with fewer rows, even in the tens or hundreds of thousands?

Maybe a product classification or retail outlet classification would perform

adequately? You may also notice in this example that Name is redundant, since it

is contained in the XML anyway this has been done on purpose for performance

reasons since most of the queries only want the name which is a common

attribute, so there is no point in mucking about with the XML.

Another key aspect with regard to performance is understanding the interfaces.

In my particular implementation, if I wanted to create proper fields for the

attributes there would be no far-reaching impact. The interfaces to the sprocs

wouldn't change and the fact that I may have replaced the Person view with a

table would make no difference to existing SQL.

Denormalization of Temporal Data Using XML

Consider a requirement stating

"For ordered items, display the part name and the

stock on hand". This may be turned into a query something like this:

SELECT o.OrderId,
o.PartId, o.Quantity, o.Cost, p.Name, p.StockAvailable
FROM OrderItem o
INNER JOIN Part p
ON p.PartId=o.PartId
WHERE OrderId=1

As with most requirements it is not specific enough and should read "For

ordered items, display the part name and the stock on hand when the order was

placed".

The problem with the above query is that part information, particularly the

available stock, changes continuously and the query doesn't take this into

account. Many similar problems exist with data that is date dependant (temporal

data). Again, there are many ways to resolve this problem you could have [Part]

movement records and join based on the order date to the movement tables, or you

could denormalize your structures and create an [OrderItem] table with the [Part].[Name] and

[Part].[StockAvailable] values in fields on [OrderItem]. The most common approach by far is to do

neither and land up with all sorts of issues relating to the temporality of the

data which puts the integrity of the entire database in question by the users.

Generally, unless I need to create a specific structure to handle temporality

where it may be important I tend to take the easy route and denormalize my

structures. The problem is figuring out how many of the fields from [Part] should

be stored on [OrderItem] to handle all the various combinations of queries that

the users may think up in future. Also, it looks ugly when [Part] fields are

reproduced on the [OrderItem] table apart from breaking a few relational model

rules along the way.

In a recent system there was a need to store some 'snapshot' temporal data, but

since other parts of the system were not specified, never mind developed, we

were unsure which fields to store the solution was to store most of them in an

XML field and worry about it later.

So in the above example I would create a table something like this:

CREATE TABLE 
OrderItem(
OrderId int,
PartId int,
Quantity int,
Cost money,
PartStore xml)

With sprocs to handle the inserts and updates,

CREATE PROCEDURE 
InsertOrderItem
@OrderId int,
@PartId int,
@Quantity int,
@Cost money
AS
DECLARE @PartStore
xml
SELECT @PartStore='<part><name>'+Name+'</name><stockAvailable>'+CAST(StockAvailable
AS varchar(10))+'</stockAvailable></part>'
FROM Part
WHERE PartId=@PartId
 
INSERT INTO 
OrderItem(OrderId,PartId,Quantity,Cost,PartStore)
VALUES(@OrderId,@PartId,@Quantity,@Cost,@PartStore)

This would create a store of temporal data something like this

<part>
    <name>Part
1</name>
    <stockAvailable>10</stockAvailable>
</part>

When querying data I simply look in the XML for the part data that I need,

SELECT OrderId,
PartId, Quantity, Cost,
  PartStore.value('(/part/name)[1]',
'varchar(50)')
AS Name,
  PartStore.value('(/part/stockAvailable)[1]',
'int') AS
StockAvailable
FROM OrderItem
WHERE OrderId=1

And as per the classification examples above, I can wrap the queries into nicely

named views if I prefer.

The plan is that over time, provided my interfaces remain the same, I can add

Part attributes directly to the OrderItem table if needed. This can be done on a

production database and I would just need to alter the table,

ALTER TABLE 
OrderItem ADD PartName
varchar(50)
 
UPDATE OrderItem
SET PartName=PartStore.value('(/part/name)[1]',
'varchar(50)')

and change any sprocs or views that reference the table all very backwards

compatible.

Download the code

Summary

A year ago I would have recoiled at the mere suggestion of using XML directly in

the database, thinking that it was the domain of junior asp developers who can't

tell a

relation from a relationship. Over the last few months the XML Temptress

in SQL 2005 has lured me into her parlour providing a mechanism to approach old

problems in new ways.

I would still be in line violently opposing persisting objects as XML in the

database as object oriented bigots would be tempted to do, after all, they say

that the database is just a persistence mechanism for their objects. I can

picture object orientation bigots advocating a database of one table with two

attributes, ObjectId (GUID) and ObjectData (XML) such an image is concerning.

However, I hope that I have demonstrated that if carefully thought through that

XML in the database can be useful and elegant provided that it is done within

the context of the overall architecture. The biggest issue with XML in the

database is understanding when performance becomes the overriding factor as to

where and how data is stored after all it is mostly the performance of

databases that your DBA has to deal with anyway.

Simon Munro

Simon Munro is currently pursuing the Microsoft Certified Architect (MCA) certification and maintains a blog at http://www.delphi.co.za/ that covers many interesting ideas on using Microsoft technologies.

The SQL Server 2005 XML Temptress

SQL 2005 XML for Classifications

Denormalization of Temporal Data Using XML

Summary

Rate

Share

Categories

Share

Rate

The SQL Server 2005 XML Temptress

SQL 2005 XML for Classifications

Denormalization of Temporal Data Using XML

Summary

Rate

Share

Categories

Share

Rate

Related content

XML Options in Microsoft SQL Server 2005

Reduce Database RoundTrips Using XML

An Alternative XML Solution

An Introduction to InfoPath 2003

For SQL Server, XML Is One Answer