Documenting your SQL Server Database

One of the shocks that a developer can get when starting to program in T-SQL is that there is no simple way of generating documentation for routines, structures and interfaces, in the way that Javadocs or Doxygen provides. To embed the documentation in the source is so obvious and easy that it is a wrench to be without this facility. Phil Factor suggests a solution.

‘Understanding the existing product consumes roughly 30 percent of the total maintenance time’

“Facts and Fallacies of Software Engineering” by Robert L. Glass.

Introduction

Code likes to be surrounded by explanation. I like to keep as much information as possible together with the code I write, whatever language I use. Most languages allow you to extract this information for the purposes of reporting, documentation, integrating with bug-tracking or source control, or providing tooltip help: SQL Server doesn’t. No sensible method for doing database documentation has either been provided by Microsoft, or properly supported by the software tools that are available. In the absence of an obvious way of going about the business of documenting routines or objects in databases, many techniques have been adopted, but no standard has yet emerged.

You should never believe anyone who tells you that effective database documentation can be entirely generated from a database just by turning a metaphorical handle. Automatic database generators can help, but cannot absolve the programmer from the requirement of providing enough information to make the database and all its routines intelligible and maintainable: This requires extra detail. The puzzle is in working out the most effective way of providing this detail.

Once you have an effective way of providing details about the tables, views, routines, constraints, indexes and so on in your database how do you then publish this information in a form that can be used?

In this article I’ll show how to insert structured headers into SQL code, and attach them to various database objects, and show to extract the information from the database in a form that can then be used for whatever purpose you require.

Why bother to document databases?

When you’re doing any database development work, it won’t be long before you need to seriously consider the requirement for documenting your routines and data structures. Even if you are working solo, and you operate a perfect source-control system, it is still a requirement that kicks in pretty soon, unless you have perfect recall. Many times, I’ve sat down in front of some convoluted code, asked the rhetorical question ‘God, what idiot wrote this code?’ only to find out it was I, sometime in the past. By documenting, I don’t just mean the liberal sprinkling of in-line comments to explain particular sections of code: If you are coordinating a number of programmers on a project, then it is essential to have more than this; you’ll require at least an explanation of what it does, who wrote it or changed it, and why they did so. I would never advocate presenting the hapless code-refactorer with a sea of green, but with a reasonable commentary on the code to provide enough clues for the curious. I’d also want examples of use and a series of assertion tests that I can execute to check that I haven’t broken anything. Such thing can save a great deal of time.

Where the documentation should be held?

Most database developers like to keep the documentation for a database object together with its build-script, where possible: That way, it is easy to access and never gets out of synchronization. Certain information should be held separately in source control, but only sufficient for the purposes of continuous integration and generating the correct builds for various purposes. The source code should be the canonical version of documentation where possible.

This primary source of the essential documentation should be, in effect, stored within the database, and the ideal place is, usually, within the source script. When documentation about code needs to be extracted, it is best done, where possible, by an automatic process.

You can hold the documentation separately from the code in source control, of course, but it is difficult to match the convenience of having as much information accessible within the database. The problem is in the duplication of effort if you wish to provide extra data to source control such as check-in comments.

The obvious place to hold documentation within the SQL source is in a comment block in the actual text for routines such as stored procedures, rules, triggers, views, constraints and functions.

This sort of comment block is frequently used, held in structured headers that are normally placed at the start of a routine, but acceptable anywhere within it.

Microsoft had an attempt at a standard for doing it. Some SSMS templates have headers like this …

… (the macros delimited by the angle-brackets are filled in by SSMS), but these headers are neither consistent not comprehensive enough for practical use. These headers would have to conform to a standard, so that routines can be listed and searched. At a minimum, there should be agreement as to the choice of headings. The system should be capable of representing lists, such as revisions, or examples of use. Many different corporate-wide standards exist but I don’t know of any common shared standard for documenting these various aspects. Many conventions for ‘structured headers’ take their inspiration from JavaDocs, or from the XML comment blocks in Visual Studio. Doxygen is probably one of the best of the documenters designed for r C-style languages like C++, C, IDL, Java, and even C# or PHP.

The major difficulty that developers face with database documentation is with tables, columns and other things that are not held in the form of scripts. These scripts cannot be reverse-engineered from the metadata with the tools that Microsoft provides, so if you wish to preserve them you can’t use the GUI to modify the tables. You cannot store documentation for these in comment blocks if you like using SSMS to modify tables and their associated objects : you have to store them in extended properties. We’ll discuss this at length later on in this article.

Wherever these headers are stored, they require special formatting because the information is really hierarchical in nature when it comes to lists of revisions, and examples of use. Microsoft use XML-formatted headers with Visual Studio. I know of people who have experimented with YAML and JSON headers with homebrew methods of extracting the information. Most of these scripts extract structured headers from T-SQL routines, automatically add information that is available within the database such as name, schema, and object type, and store them in an XML file. From there on, things get murky.

What should be in the documentation?

We probably want at least a summary of what the database object does. For routines, I suspect that you’ll also need a comprehensive list of examples of use. You may want to add a quick-check test batch that you can execute when you make a minor routine change. Even though source control will tell you who wrote and provide a history of who revised it, when, why, and what they did, it is handy to have a copy here for the even if the ‘who’ was yourself. I’ve always wanted to automatically pick up this information in source control to add to the headers but never got around to it. Headers need to support extensible lists, so you can make lists of revisions, parameters, examples of use and so on. Databases already hold a lot of information that is normally in code headers, such as when the routine was first created, when last modified, its parameters if any, what its dependencies are, and what objects it is dependent on. I generally add these if necessary when extracting the documentation.

How should the documentation be published?

There is no point in keeping all this documentation if it cannot be published in a variety of ways. There are many ways that development teams need to communicate, including intranet sites, PDF files, DDL scripts, DML scripts and Help files. This usually means extracting the contents of structured headers, along with the DDL for the routine, as an XML file and transforming that into the required form. Regrettably, because there are no current standards for structured headers, no existing SQL Documenter app is able to do this effectively. Several applications can publish prettified versions of the SQL code, but none can directly use such important fields of information as summary information or examples of use, though SQL Doc can display them. We don’t have the database equivalent of Sandcastle, which takes the XML file and generates a formatted, readable, help file. However, one can easily do an XSLT transformation on the XML output to provide HTML pages of the data, all nicely formatted, or one can do corresponding transformations into a format compatible with HELP-file documentation systems.

What standards exist?

Let’s firstly have a look at the equivalent standard for C#, VB.NET and F#. The compilers for these languages can extract strings tagged with ‘well-formed’ XML markup that are in special comment lines (/// for C# and ”’ for Visual Basic) or in a special comment block delimited as /** .... **/. The compilers can add information that it gains by parsing the source file and then placing the result into XML files called XML Documentation Comments. These are then used for Visual Studio Intellisense and the object browser, and can be imported into applications such as Doxygen or Sandcastle to convert them into help files. Of course, you can transform an ‘XML Documentation Comments’ file into anything you choose with XSL.

Microsoft’s XML documentation standard isn’t much used outside Visual Studio, and all tools seem to assume that you are writing in a .NET procedural language. The standard has not been adopted with much enthusiasm except for the three items of information (Summary, Params and Remarks) that are used by Visual studio.

The basic tags used in XMLdocs are appropriate for most purposes, but if you are doing your own processing of the results, you can add to them.

Unlike most forms of structured documentation, XML documentation is usually inserted immediately before the definition of the component that you are documenting.

The standard tags are as follows:

Tag

Purpose

<c>

Set text within a paragraph in a code-like font to indicate that it is a token

<code>

Set one or more lines of  code or  output

<example>

An example of the routine’s use

<exception>

The errors or exceptions that can be thrown by the code

<list>

Create a list or table

<para>

delimits a paragraph within the text

<param>

Describe a parameter for a routine, procedure, method or constructor

<paramref>

Identify that a word is a parameter name

<permission>

Document the security accessibility of a routine or member

<remarks>

Describe a type

<returns>

Describe the results, variables, or return value of a routine or method

<see>

Specify a link,

<seealso>

Generate a See Also entry e.g. <seealso cref=”MyProcedure”/>

<summary>

Describe a member of a type

<typeparam>

The name of the type parameter, with  the name enclosed  in double quotation marks (” “).

<value>

Describe a property or variable

There is, in addition, one tag that is actually a directive

Tag

Purpose

<include>

Refers to another file that describes the types and members in your source code. This can include a XPath spec to explain where in the file the compiler should go to get the information

Not all these tags require their contents to be filled in, since information such as the parameters and the permissions is easily retrieved. With the Microsoft .NET languages, the compiler will fill in the details that it can. Some of the most useful things you normally want to put into structured headers aren’t included in this list because they are made into lists, and XML isn’t as suited for lists as JSON or YAML.. The <LIST> elements are mostly used for revision/version lists but the format is no good for searching for version numbers, or the details of particular revisions of code

In order to try to persuade you that this is not a good standard to adopt, here is a very simple SQL Server Stored Procedure with the barest minimum of documentation..

Extracting the XML file isn’t too hard. For each routine, you’d need an XML fragment like this ..

Notice that you ought to escape certain characters. This makes the translation more complicated.

It isn’t hard to see from this why the ‘XML documentation comments’ convention never really caught on as a general standard, beyond using the summary tag to generate intellisense. It is awkward. It makes the code hard to read. You are liable to miss important information.

More modern approaches to structured text, such as YAML, provide a far easier and more intuitive approach, and allow for a much more versatile way of handling paragraphs of text. Here is a JSON header, which is easier to read or write than raw XML.

If you want a human-oriented approach to headers, YAML seems a better design. Here is a YAML version of the XML header which is directly equivalent.

Here is another example of a YAML header

YAML is far more usable, but would require processing to turn it into XML. However, other approaches can be modified with little change. JSON code can, for example, normally be digested by a YAML parser. The SSMS templates need just a slight adjustment. These are provided for you to fill with the required parameters , The content between the chevron brackets is filled in by SSMS as part of the templating process and is not part of YAML.

Becomes

In our example, we have placed the comments in a structured header at the head of the text of the routine. Here is our next problem: Only the text for Check constraints, Defaults (constraint or stand-alone), stored procedures, scalar functions, ,Replication filter procedures , DML triggers, DDL triggers, inline table-valued functions, table-valued functions, and Views are retained: The comments for tables, indexes and columns are lost. We need a different way of doing this for database objects that don’t support headers. Whereas all .NET objects boil down to code, the same is not true of tables in SQL Server. When you script out a table in SSMS, or via SMO you’ll notice that table build-scripts have lost any comments that you put in them

This may or may not present a problem. It depends on how you develop databases, and how quickly you’d like information about an object. There are two basic ways of developing databases; one where the script is always the primary source, and the other where the database is, at times, considered to be the source. The first method of developing databases involves maintaining one or more build scripts which contain all the comment blocks and inline comments. This is used to generate the database, but the process is never reversed.

The latter happens if you use the GUI tool, usually SQL Server Management Studio (SSMS) , to create or edit a database object. When you wish to update what is in source control, you will need to script out what you’ve done, but you’ll find that tables, indexes and columns will have lost their comments

If the script is always the primary source, then you don’t have a problem; Otherwise, you will have to put your comment or structured header in an ‘extended property’ of the table.

Extended properties

There are problems in using ‘Extended Properties’ as an alternative to structured headers, because they don’t support lists or hierarchies. Even if you define extra properties to store different types of information such as ‘summary’ or ‘remark’, you still face problems with lists, such as revision lists, since one cannot have more than one instance of any particular extended property. If, therefore, you need to describe the parameters that you pass to routines, give examples of their use, give a revision history of the routine and so on, then it suddenly all gets more difficult unless you store the information in a structure and store that structure in an extended property. In other words, you can use ‘Extended Properties, but only if you use it merely to contain the structure.

SSMS allows you to create other extended properties besides MS_Documentation. Here is a shot of me editing a ‘documentation’ extended property to a PERSON table in SSMS (right click on the table whose documentation you want to alter in the object browser, and click on properties. In the left-hand side of the properties window, click on ‘extended properties’. Click on the ‘three-dots’ icon on the right hand side of the property you wish to edit, or the bottom-most one to create a new one.)

2183-1-4b005f1c-a43d-4bd0-935f-f2a8f0c6a

SSMS is fine for maintaining this type of extended property. SQL Doc will display this as well but cannot parse it

Here’s an example of how you might place it in a build script.

Now we are able to attach documentation to all our database objects. With extended properties, you can now document almost any conceivable database object, except for Full-text objects, objects outside the database scope such as HTTP end points, unnamed objects such as partition function parameters, certificates, symmetric keys, asymmetric keys, and credentials and system-defined objects such as system tables, catalog views, and system stored procedures. That’s quite sufficient for me, anyway.

Why didn’t I use MS_Description? We need some way of distinguishing ordinary comments, and structured headers. I’m always loath to use MS_Description for special purposes because there is always someone, me in this case, who is considering whether to use them for something else.

I provided a stored procedure that creates a table build script that can use both the table and column extended properties called sp_ScriptFor which is available here. It is not a complete table script, but provides enough for exploring a database.

Now, how about reporting from our structured headers? Before we think about parsing the structured information, we have to consider how to extract the extended properties. These were originally accessible only via a rather awkward function called fn_listextendedproperty . Mercifully, Microsoft added a system catalog view called sys.extended_properties that is much easier to use. The extended properties that we have called ‘documentation’ in the current database can be extracted for all our tables by means of this SQL

And you can do the same for the columns with code such as this

Extracting headers

Having extracted our documentation, what can we do with it? Because we’ve chosen YAML as our standard, purely because of its readability

YAML

YAML, which is a superset of JSON, was first proposed by Clark Evans in 2001. YAML is a recursive acronym for “YAML Ain’t Markup Language”. It is a means of serializing both objects and relational data in a way that is reasonably intuitive. It is intended to be data-oriented notation, rather than document markup. Instead of using delimiters such as quotation marks, brackets, braces, and open/close-brackets, it uses indentation to represent hierarchical information. The specific number of spaces in the indentation is unimportant as long as parallel elements have the same left justification and the hierarchically nested elements are indented further. Strings do not have to be delimited. YAML handles indents as small as a single space, and so achieves better compression than markup languages. One can revert to JSON “inline-style” (i.e JSON-like format) without the indentation, if indentation is inconvenient.

The white space delimiters allow YAML files to be filtered with line oriented commands within grep, awk, perl, ruby, and python.

We can extract the contents of all the headers in a database in a number of different ways. The most obvious way to do it is with PowerShell. I use PowerYAML for parsing YAML, and I use my own ConvertTo-YAML to create headers.

Here is a simple routine that creates a report of all the routines (Procedure, Replication-filter-procedure, view, DML triggers, user functions, and rules) and tables in a database. A SQL Server database contains information about a database object already, and the routine shows how to incorporate that.  Where you have used a YAML-based header, it uses that to supplement what it finds from the metadata. The result in PowerShell, ‘$Routines’, is an array of hashtables, each of which represent a routine. This can be converted into an XML fragment and then translated into any required format with XSLT. You can write it into an excel or MS ~Word report, You can use the ConvertTo-YAML with to create a YAML document of the routines that can then be consumed by Ruby or Python and converted into native object formats.

Conclusions

It isn’t surprising that the proper documentation of database code has been so neglected by Microsoft or the third-party tool providers. The existing standards of procedural languages are clumsy to use, and the tools that are developed for them cannot be used for database scripts.

YAML, which has always seemed like a solution looking for a problem, has emerged as a perfect means of embedding readable structured, hierarchical information into the script of database objects. Even if the methods of converting this information into help text or intranet sites is currently slightly awkward, the method is useful as it is easy to scan at a glance when you are inspecting the code itself.

This article is updated from a chapter of ‘SQL Server Team-Based Development’ by Phil Factor, Grant Fritchey, Alex Kuznetsov and Mladen Prajdic; Simple Talk Publishing 2010