Why XML validation?
XML is everywhere and it has been the de-facto format for document structure and information exchange for a while. Today XML or Extensible Markup Language has replaced many/most delimiter separated, binary and fixed width formats of the past. The nice thing about XML is that it is both human readable and machine readable. The downside is that it also means that it is also human (and machine) mistakable. To counter this, there are two validation mechanisms available, most noticeably the schema definition, XSD. The other is the older DTD, Document Type Definition, which has more or less been superseded by the far superior XSD.
What is XML and XSD
To level the playing field, first a very short recap on XML and XSD. XML stands for Extensible Markup Language, a simple text based markup language similar to HTML. It is designed for representing (any) structured information in a self descriptive manner. Loosely, in the context of databases, this means that if a data can be stored in a database schema, it can be represented in an XML document.
The XSD or XML Schema Definition describes the structure of a XML document and to some extent the constraints of the content. This allows the XML document to be validated by the XSD, catching discrepancies before attempting to process the document.
Toe in the Water
For the sake of argument, a company has many distributed points of sale. Each sale made must be communicated to a central database system, using XML documents. The database schema to store the Customer's Orders can be described as:
For each order made by a Customer, a single record must exist in the Header table. For each Product within the Order, a corresponding record exists in the Detail table, including information on Quantity, Product type, Price and Currency. Any Order can only relate to one Customer. Each order can contain one or more Products.
[CUSTOMER]----<[HEADER]----<[DETAIL]
Figure 1, a high level ERD of the Customer Order schema.
In an XML document for this application (Listing 1), each Order forms the scope of the content. The top level node is the <ORDER> element, here shown as the document's root element. Nested under it are the <CUSTOMER>, the <HEADER> and the <DETAIL> elements.
<ORDER>
    <CUSTOMER />
    <HEADER />
    <DETAIL />
</ORDER>Listing 1, a simplified XML for a Customer Order.
Probabilistic Relationship
In XML, when element (child) is nested under another element (parent), the two elements relate to each other. The relationship can either be an One parent to One child, or an One parent to Many children. Siblings can relate to each other in any possible way, One to One, One to Many and Many to Many. Without a XML Schema Definition, there is no way of telling which relationship type is the right one. This makes XML reverse engineering tool generated relational schemas and XML Schema Definitions unreliable.
Data Integrity
The database has constraints and schema elements which can enforce the logical correctness of the data. As an example, the Header table has a foreign key constraint to the Customer table. This makes it impossible for an Order to relate to either non or many Customers. Unlike the database, there is nothing built into a text editor to prevent errors when creating an XML Document. For the XML, this has to be defined separately in an XSD document. The XSD itself is in an XML language and is therefore also both human readable and somewhat self descriptive. Listing 2 shows an XSD at the same (Entity) level as the ERD in Figure 1, with equal constraint as the relational schema; an Order can only relate to one Customer and an order must have one or more Products.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ORDER">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="CUSTOMER" minOccurs="1" maxOccurs="1" />
        <xs:element name="HEADER"   minOccurs="1" maxOccurs="1" />
        <xs:element name="DETAIL"   minOccurs="1" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>Listing 2, an XML Schema Definition for the Customer Order XML Document.
Getting Your Feet Wet
At the attribute level on the relational schema, the ERD includes column specifications for each table.
 
Figure 2, attribute level ERD.
The XSD has Attribute tags corresponding to the Column specifications. It also has the relational constraints, by the means of the occurrence restrictions within each Element.
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ORDER">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="CUSTOMER" minOccurs="1" maxOccurs="1">
          <xs:complexType>
            <xs:attribute name="CUST_ID" type="xs:unsignedShort" use="required" />
            <xs:attribute name="CUST_NAME" type="xs:string" use="required" />
          </xs:complexType>
        </xs:element>
        <xs:element name="HEADER" minOccurs="1" maxOccurs="1">
          <xs:complexType>
            <xs:attribute name="ORDER_NO" type="xs:unsignedInt" use="required" />
            <xs:attribute name="ORDER_DATE" type="xs:date" use="required" />
          </xs:complexType>
        </xs:element>
        <xs:element minOccurs="1" maxOccurs="unbounded" name="DETAIL">
          <xs:complexType>
            <xs:attribute name="PRODUCT_NO" type="xs:string" use="required" />
            <xs:attribute name="QUANTITY" type="xs:unsignedShort" use="required" />
            <xs:attribute name="UNIT_PRICE" type="xs:decimal" use="required" />
            <xs:attribute name="CURRENCY" type="xs:string" use="required" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>Listing 3, an attribute level XML Schema Definition.
XSD in SQL Server
SQL Server supports XSD and XSD validation via the SCHEMA COLLECTION. To use the SCHEMA COLLECTION, an XML schema must be loaded using the CREATE XML SCHEMA COLLECTION statement. The statements takes three parameters, relational schema name, identifier and the actual XML schema definition. The relational schema parameter is optional and if omitted, the default relational schema is used. The identifier used to reference the schema collection in a similar way as a data type applicable to XML columns and variables. The XSD implementation in SQL Server has some limitations and restrictions, details of that are listed on Technet. If a SCHEMA COLLECTION is specified when declaring XML variables or XML columns, the XML is considered being typed, otherwise untyped. When an XML content is loaded into a typed XML, the SQL Server performs an XML Schema validation on the content.
Diving In
Then next section is a walk through, demonstration of the validation process with a simple exercise. The first step is to create a schema collection, then declare a typed XML variable and finally, load an XML content into the variable. The final part returns either a success notification or if the validation failed, an error message. Listing 4 shows the creation of a schema collection.
USE tempdb;
GO
IF EXISTS (SELECT * FROM sys.xml_schema_collections 
                    WHERE name = N'MY_XML_ORDER_SCHEMA' 
                    AND schema_id = SCHEMA_ID(N'dbo'))
   DROP XML SCHEMA COLLECTION dbo.MY_XML_ORDER_SCHEMA;
  
CREATE XML SCHEMA COLLECTION dbo.MY_XML_ORDER_SCHEMA
AS
N'<?xml version="1.0" encoding="utf-16"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ORDER">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="CUSTOMER" minOccurs="1" maxOccurs="1">
          <xs:complexType>
            <xs:attribute name="CUST_ID" type="xs:unsignedShort" use="required" />
            <xs:attribute name="CUST_CC" type="xs:unsignedShort" use="required" />
          </xs:complexType>
        </xs:element>
        <xs:element name="HEADER" minOccurs="1" maxOccurs="1">
          <xs:complexType>
            <xs:attribute name="ORDER_NO" type="xs:unsignedInt" use="required" />
            <xs:attribute name="ORDER_DATE" type="xs:date" use="required" />
          </xs:complexType>
        </xs:element>
        <xs:element minOccurs="1" maxOccurs="unbounded" name="DETAIL">
          <xs:complexType>
            <xs:attribute name="PRODUCT_NO" type="xs:unsignedByte" use="required" />
            <xs:attribute name="QUANTITY" type="xs:unsignedByte" use="required" />
            <xs:attribute name="UNIT_PRICE" type="xs:decimal" use="required" />
            <xs:attribute name="CURRENCY" type="xs:string" use="required" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>';Listing 4, creating a SCHEMA COLLECTION.
In listing 5, we can see code that creates a typed XML variable, and then an untyped XML variable.
DECLARE @TYPED_XML XML(dbo.MY_XML_ORDER_SCHEMA); DECLARE @UNTYPED_XML XML = N'<?xml version="1.0" encoding="utf-16"?> <ORDER> <CUSTOMER CUST_ID="10001" CUST_CC="20001"/> <HEADER ORDER_NO="102030" ORDER_DATE="2014-03-03"/> <DETAIL PRODUCT_NO="122" QUANTITY="1" UNIT_PRICE="15.75" CURRENCY="USD"/> <DETAIL PRODUCT_NO="123" QUANTITY="1" UNIT_PRICE="60.00" CURRENCY="USD"/> </ORDER> ';
Listing 5, declare an untyped and a typed XML variables, assigning an XML content to the untyped one.
The final step is to attempt to assign the value of the untyped XML variable to the typed XML variable, as shown in Listing 6. This causes the SQL Server to perform a validation based on the schema definition. If the validation fails, it throws it's toys out of the pram, including an error which then must be caught in a TRY/CATCH. Unfortunately the new TRY_CAST and TRY_CONVERT will not contain this kind of conversion errors, making it hard to device a set based schema validation.
BEGIN TRY SELECT @TYPED_XML = @UNTYPED_XML; SELECT MSG = N'PASSED SCHEMA VALIDATION'; END TRY BEGIN CATCH SELECT MSG=ERROR_MESSAGE(); END CATCH
Listing 6, schema validation attempt enclosed in a TRY/CATCH.
The results:
| MSG | 
| PASSED SCHEMA VALIDATION | 
Successful validation.
So far so good, the XML document passes the test with flying colours. To fully test the functionality, the test has to fail if an XML document violates the schema. Looking at the CUSTOMER element's definition (listing 7), it states that one is required and only one can exist within an ORDER element.
<xs:element name="CUSTOMER" minOccurs="1" maxOccurs="1">
Listing 7, the CUSTOMER must appear once.
An additional CUSTOMER element in the XML document will now fail the validation.
SELECT @UNTYPED_XML XML = N'<?xml version="1.0" encoding="utf-16"?> <ORDER> <CUSTOMER CUST_ID="10001" CUST_CC="20001"/> <CUSTOMER CUST_ID="10001" CUST_CC="20002"/> <HEADER ORDER_NO="102030" ORDER_DATE="2014-03-03"/> <DETAIL PRODUCT_NO="122" QUANTITY="1" UNIT_PRICE="15.75" CURRENCY="USD"/> <DETAIL PRODUCT_NO="123" QUANTITY="1" UNIT_PRICE="60.00" CURRENCY="USD"/> </ORDER> '; BEGIN TRY SELECT @TYPED_XML = @UNTYPED_XML; SELECT MSG = N'PASSED SCHEMA VALIDATION'; END TRY BEGIN CATCH SELECT MSG = ERROR_MESSAGE(); END CATCH
Listing 8, attempting to validate an invalid XML document.
The results:
| MSG | 
| XML Validation: Invalid content. Expected element(s): 'HEADER'. Found: element 'CUSTOMER' instead. Location: /*:ORDER[1]/*:CUSTOMER[2]. | 
An unsuccessful validation.
Happy Hour at the RBAR
SQL Server has no specific functions or methods for validating XML, equivalent to TRY_CATCH and TRY_CONVERT. This means that a failed validation leaves a transaction in an invalid state, forcing a rollback. When working with an XML column in a table, the only workaround is to do a row by row validation with a CURSOR. The last code listing in this article shows how to do a validation on an XML column, capturing both successes and failures, including any error messages. Before running the code, make certain that the SCHEMA COLLECTION exists by running the code in Listing 9.
/* VALIDATED XML */DECLARE @VALIDXML XML = CONVERT(XML,N'<?xml version="1.0" encoding="utf-16"?>
<ORDER>
  <CUSTOMER CUST_ID="10001" CUST_CC="20001"/>
  <HEADER ORDER_NO="102030" ORDER_DATE="2014-03-03"/>
  <DETAIL PRODUCT_NO="122" QUANTITY="1" UNIT_PRICE="15.75" CURRENCY="USD"/>
  <DETAIL PRODUCT_NO="123" QUANTITY="1" UNIT_PRICE="60.00" CURRENCY="USD"/>
</ORDER>',2);
/* INVALID XML */DECLARE @INVALIDXML XML = CONVERT(XML,N'<?xml version="1.0" encoding="utf-16"?>
<ORDER>
  <CUSTOMER CUST_ID="10001" CUST_CC="20001"/>
  <HEADER ORDER_NO="102030" ORDER_DATE="2014-03-03"/>
  <DETAIL PRODUCT_NO="" QUANTITY="1" UNIT_PRICE="15.75" CURRENCY="USD"/>
  <DETAIL PRODUCT_NO="234" QUANTITY="1" UNIT_PRICE="15.75" CURRENCY="USD"/>
  <DETAIL PRODUCT_NO="123" QUANTITY="1" UNIT_PRICE="60.00" CURRENCY="USD"/>
</ORDER>',2);
/* TABLE VARIABLE FOR THE TEST */DECLARE @TEST_DATA TABLE
(
     TEST_DATA_ID    INT IDENTITY(1,1) PRIMARY KEY CLUSTERED NOT NULL
    ,TEST_DATA_XML   XML NOT NULL
);
/* INSERT FEW SAMPLES */INSERT INTO @TEST_DATA ( TEST_DATA_XML )
VALUES ( @VALIDXML   )
      ,( @VALIDXML   )
      ,( @INVALIDXML )
      ,( @VALIDXML   )
      ,( @VALIDXML   );
/* TABLE VARIABLE FOR THE VALIDATION LOG */DECLARE @TEST_LOG TABLE
(
     TEST_LOG_ID        INT IDENTITY(1,1) PRIMARY KEY CLUSTERED NOT NULL
    ,TEST_DATA_ID       INT NOT NULL
    ,TEST_LOG_MESSAGE   NVARCHAR(1024)  NULL
);
/* VARIABLE FOR THE CURSOR */DECLARE @TEST_DATA_ID INT = 0;
/* FAST FORWARD CURSOR */DECLARE T_SET CURSOR FAST_FORWARD FOR
    SELECT TEST_DATA_ID FROM @TEST_DATA
    ORDER BY TEST_DATA_ID;
OPEN T_SET;
FETCH NEXT FROM T_SET INTO @TEST_DATA_ID;
WHILE @@FETCH_STATUS = 0
BEGIN
    BEGIN TRY
        INSERT INTO @TEST_LOG ( TEST_DATA_ID,TEST_LOG_MESSAGE)
        SELECT 
            TD.TEST_DATA_ID
/* IF THE CONVERSION IS SUCCESSFUL THEN THIS STATEMENT RETURNS NULL */           ,NULLIF(SIGN(DATALENGTH(CAST(TD.TEST_DATA_XML AS XML(dbo.MY_XML_ORDER_SCHEMA)))),1) 
        FROM @TEST_DATA TD WHERE TD.TEST_DATA_ID = @TEST_DATA_ID;
    END TRY
    BEGIN CATCH
        INSERT INTO @TEST_LOG ( TEST_DATA_ID,TEST_LOG_MESSAGE)
 /* IF IT FAILS, THEN RETURN THE ERROR MESSAGE */       SELECT 
             @TEST_DATA_ID
            ,ERROR_MESSAGE();
    END CATCH
FETCH NEXT FROM T_SET INTO @TEST_DATA_ID;
END
/* CLEAN UP */CLOSE      T_SET;
DEALLOCATE T_SET;
/* VIEW THE RESULTS */SELECT
    TEST_LOG_ID 
   ,TEST_DATA_ID  
   ,TEST_LOG_MESSAGE
FROM @TEST_LOG;Listing 9, schema validation on an XML column.
A foot in the door
This article is a brief introduction to the XML schema validation and in a way only providing a sneak peek at the capabilities. Hopefully we will see more coverage of the subject in the near future.
Learning more on XML and XSD
A must read is the Stairway to XML on SSC. Another good source on XML and XML related technologies such as the XSD is the W3 Schools. SQL Server has supported XSD and XSD validation since version 2005 and there is a fair coverage in the BOL. Furthermore Red Gate provides a free eBook, The Art of XSD, SQL Server XML Schema Collections by Jacob Sebastian.
