Stairway to XML

Stairway to XML: Level 1 - Introduction to XML

,

Support for the eXtensible Markup Language (XML) was first introduced in SQL Server with the release of SQL Server 2000. However, XML-related features were limited to data management capabilities that focused on mapping relational and XML data. For example, SQL Server 2000 added the FOR XML clause, which lets you return relational query results as XML.

However, it wasn’t until the release of SQL Server 2005—when the XML data type was added—that support for XML got interesting. The XML data type lets you natively store XML documents in columns and variables configured with that type. The data type also supports a set of methods you can use to retrieve and modify specific components within the XML document.

To take full advantage of the XML-related features supported in SQL Server, you might find it useful to have a fundamental understanding of XML itself. To that end, this first Level of the Stairway to XML series explains what XML is and describes the various components that make up an XML document.

An Overview of XML

Similar to the HyperText Markup Language (HTML), XML is a markup language that uses tags to delineate and describe the nature of the data associated with those tags. What makes XML extensible is its self-describing nature, that is, you create tags that are specific to the data values contained in the XML document. In HTML, those tags are pre-defined. (XML’s extensible nature will become clearer as we work through the XML components.)

Despite its extensibility, XML is still a standardized language that must conform to a specific set of formatting rules, as defined by the World Wide Web Consortium (W3C). Because of this standardization, the language has been widely adopted in order to transport and store data, unlike HTML, which is used to display data. XML makes it possible to easily share data among heterogeneous systems, regardless of hardware, operating system, or application type, and XML’s universal adoption means that data can be processed with little human intervention. At the same time, you can control how the data is described, while also controlling how the data is ordered and displayed.

XML Components

The primary components that make up an XML document—and the rules that govern the use of those components—are generally very straightforward, but you must adhere strictly to these rules in order for an XML document to be properly processed by the SQL Server XML parser.

There are primarily two types of information included in an XML document: the data to be stored and the tags that describe the data. A tag is made up of a set of angle brackets (< >) that enclose a descriptive word or compound word (no spaces) that describes the data associated with the tag. It’s because of the self-describing nature of these tags that XML is often considered a meta-language.

Each discrete piece of stored data is enclosed in an opening tag and a closing tag, as shown in the following example:

<Person>John Doe</Person>

In this case, the opening tag is <Person>, and the closing tag is </Person>. Notice that a forward slash precedes the tag description in the end tag. A forward slash must precede all end tags, but the language of the tag must be the same as the opening tag, which in the example above is Person. I could have chosen a name other than Person, including a name that has nothing to do with people, but a good practice is to always provide tag names that best describe the data enclosed in the opening and closing tags. In this case, the tags are describing the name of a person, John Doe, thus the tag name <Person>.

Together, the tags and enclosed data represent a single element. However, an element does not always have to contain data. An empty element can be rendered in one of two ways. The first is by specifying the opening and closing tags, but including no data, as I do in the following example:

<Person></Person>

Another way to represent an empty element is use only one set of brackets, but still include the forward slash:

<Person />

Again, this method can be used only when an element contains no value. As you’ll see later in the Stairway to XML series, a schema might require an element for which there is no value. In that case, you can use the shortened format to represent the both tags of the element.

Whether or not an element contains a value, whenever both tags are used, the opening and closing tags must match exactly, down to the capitalization (except for the forward slash in the closing tag). For instance, the following element generates an error in the SQL Server XML parser because the case is different between the two tags:

<person>John Doe</Person>

The descriptive word in the opening tag is all lowercase; however, the descriptive word in the closing tag begins with a capital letter. The opening and closing tags must match to be considered proper, or well formed, XML.

You can, however, embed elements within each other. In the following example, I embed two instances of the <Person> element within the <People> element:

<People>
<Person>John Doe</Person>
<Person>Jane Doe</Person>
</People>

Notice that each <Person> element is complete in itself. It includes the opening and closing tags and the data they enclose. Elements embedded in other elements are referred to as child elements or, in some cases, subelements. The outer element, in this case, <People>, is the parent element. The parent element at the highest level of an XML document is considered the root element. All XML documents must have one, and only one, root element. So the <People> element in the example above is the parent element to the two <Person> elements, and it is the root element for the XML document.

SQL Server also permits you to store XML fragments in an XML column or variable. A fragment is a chunk of XML code without a root element, such as the two elements shown in the following example:

<Person>John Doe</Person>
<Place>Seattle, WA</Place>

The elements must still be well formed XML, that is, have matching tags that enclose the data, but they don’t have to be an XML document. As you’ll see later in the Stairway to XML series, you can specify that only XML documents be permitted in an XML column or variable, but for now just know that SQL Server distinguishes between XML documents and fragments and can store both.

When you embed elements within other elements, you must ensure that the child elements are complete before you end the parent element. For instance, in the following example, I end the <People> element before the <Person> element, which causes the SQL Server XML parser to generate an error:

<People><Person>John Doe</People></Person>

You must ensure that your child elements are complete no matter how many levels contain embedded elements. In this following example, the <FirstName> and <LastName> elements are embedded in each <Person> element, and the <Person> elements are embedded in the <People> element:

<People>
<Person>
<FirstName>John</FirstName>
<LastName>Doe</LastName>
</Person>
<Person>
<FirstName>Jane</FirstName>
<LastName>Doe</LastName>
</Person>
</People>

In this case, the <Person> elements act as both child and parent elements. Notice, however, that each embedded element, regardless of the level, falls completely within the opening and closing tags of the parent element. For example, the first instances of the <FirstName> and <LastName> elements fall completely within the first instance of the <Person> element, and the two instances of the <Person> elements fall completely within the <People> element, which is the document’s root element.

Elements can also have attributes associated with them. An attribute is a property to which you can assign a value. The attribute is defined as part of the element’s opening tag. In the following example, I’ve added the id attribute to each instance of the <Person> element:

<People>
<Person id="1234">
<FirstName>John</FirstName>
<LastName>Doe</LastName>
</Person>
<Person id="5678">
<FirstName>Jane</FirstName>
<LastName>Doe</LastName>
</Person>
</People>

As the example demonstrates, an attribute consists of the attribute name (in this case, id), followed by an equal sign and the attribute value, enclosed in double quotes. So the id attribute for the first instance of the <Person> element has a value of 1234, and the id attribute for the second instance of the <Person> element has a value of 5678.

Another component contained in many XML documents is the declaration, which at a minimum specifies the version of the XML standard that the document conforms to. To date, there are only two versions: 1.0 and 1.1. If using XML 1.0, the declaration is not necessary; however, XML 1.1 requires one. For that reason, you should be aware of how to include a declaration in your XML document.

If you include a declaration, you must place it at the beginning of the document, start the declaration with the <? opening tag, and end it with the ?> closing tag. In addition, you must include the xml keyword (lowercase) and the version attribute (also lowercase). Another attribute commonly included, although optional, is encoding, which specifies the character encoding used for the XML document. In the following example, I include a declaration that specifies version 1.0 and an encoding of UTF-8, which means the data is stored as a sequence of 8-bit Unicode characters:

<?xml version="1.0" encoding="UTF-8"?>
<People>
<Person id="1234">
<FirstName>John</FirstName>
<LastName>Doe</LastName>
</Person>
<Person id="5678">
<FirstName>Jane</FirstName>
<LastName>Doe</LastName>
</Person>
</People>

You can also add comments to your XML documents. To do so, simply precede the comment with the <!-- tag and end it with the --> tag, as I’ve done in the following example:

<?xml version="1.0" encoding="UTF-8"?>
<!-- A list of current clients -->
<People>
<Person id="1234">
<FirstName>John</FirstName>
<LastName>Doe</LastName>
</Person>
<Person id="5678">
<FirstName>Jane</FirstName>
<LastName>Doe</LastName>
</Person>
</People>

As you can see, I’ve added the comment A list of current clients, which I’ve enclosed in the comment tags. The SQL Server XML parser will ignore anything within the tags, so you can use the commenting feature not only to provide information about the XML document and its data, but also to preserve parts of the XML code that you want to hang on to but you don’t want to have processed as part of the document.

Another consideration when working with XML is that certain characters cannot be parsed when they appear in element values. For example, you cannot include an ampersand (&) in an element’s value, as I’ve done in the <FavoriteBook> child element in the following example:

<?xml version="1.0" encoding="UTF-8"?>
<!-- A list of current clients -->
<People>
<Person id="1234">
<FirstName>John</FirstName>
<LastName>Doe</LastName>
</Person>
<Person id="5678">
<FirstName>Jane</FirstName>
<LastName>Doe</LastName>
<FavoriteBook>Crime & Punishment</FavoriteBook>
</Person>
</People>

If I try to assign this XML document to an XML column or variable, the <FavoriteBook> child element will cause the parser to generate an error because the value Crime & Punishment includes the ampersand. You must replace this type of character with an entity reference that tells the parser to preserve the character as it is originally intended. An entity reference begins with an ampersand and ends with a semi-colon and in between includes a multi-character code that represents the original value. For an ampersand, the entity reference should be &amp;, which I use in the following example:

<?xml version="1.0" encoding="UTF-8"?>
<!-- A list of current clients -->
<People>
<Person id="1234">
<FirstName>John</FirstName>
<LastName>Doe</LastName>
</Person>
<Person id="5678">
<FirstName>Jane</FirstName>
<LastName>Doe</LastName>
<FavoriteBook>Crime &amp; Punishment</FavoriteBook>
</Person>
</People>

Notice that I’ve replaced the ampersand with the &amp; entity reference. Now the XML parser will handle the <FavoriteBook> element with no problem. But note that the ampersand is not the only character that will generate an error. The XML standard identifies five characters that should be replaced with entity references, as I’ve done in the above example:

  • Less than (<): replace with &lt;
  • Greater than (>): replace with &gt;
  • Ampersand (&): replace with &amp;
  • Apostrophe ('): replace with &apos;
  • Quotation mark ("): replace with &quot;

Another issue that the example raises is the fact that the child elements do not have to be the same from one parent instance to the next. As you can see, the first instance of the <Person> element contains only the <FirstName> and <LastName> child elements, but the second instance of the <Person> element contains the <FirstName> and <LastName> child elements, as well as the <FavoriteBook> element. As long as your child elements are well formed, you can include whatever elements necessary to delineate and define your data.

Summary

In this Level, we’ve looked at the primary components that make up an XML document. Elements serve as the basic building blocks for all XML documents, with each element being delineated by an opening tag and a closing tag and the data value itself being enclosed between those two tags. Elements can be embedded within each other, but one element—the root—must act as the parent to all other elements in an XML document. An element can also include attributes, which are defined as part of an element’s opening tag.

As handy as it might be to know how to put together an XML document, the purpose of this Level has not been to train you in how to create these types of documents, but rather to provide an introduction to XML so you can more effectively work with XML in SQL Server. In the next Level, we’ll look at how the XML data type is implemented in SQL Server and how it can be assigned to columns and variables in order to store both XML documents and XML fragments.

This article is part of the parent stairway Stairway to XML

Rate

5 (1)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (1)

You rated this post out of 5. Change rating