Blog Post

SQL Standards

,

Three posts, three critical rants. It's time to be constructive.

I was given the responsibility to decide on the SQL development standards at my company. I think most people with any real interest in database development would enjoy this task. As it happens, enforcing the standard is a little difficult (see my previous post about that...). Nevermind, I proposed the standard that I've evolved through the application of theory and experience over the last 10 or so years.

I've seen standards documents that normally simply say "do this", "don't do that". I find it difficult to respect such documents. If I should "do this", there must be a reason for it. In my opinion that reason should be in the document (as an appendix, if you like). The reasons are twofold. First, it necessarily clears the air of unjustified authoritarianism that can be so noxious to a development team (because the justifications are available). Secondly, it gives people the chance to intelligently comment and critique the conclusions. If this happens it's a good thing. Either your ideas will be defended by sound argument, in which case they become more trustworthy (just as scientific hypotheses and theories become more solid through attempts at falsificaton), or the ideas will evolve in response to the criticisms, in which case they will become more correct. There's no downside.

And so, I present to you a large portion of my standards as well as the reasoning behind them.

RFC: SQL Coding Standards – Don’s Opinions (and Arguments)


Data Element Names

Table names should indicate plurality.

Good:   personnel, employees
Bad:    employee

Why?

A table is not the same kind of thing as a class in object oriented programming. A table is a set of rows. A table is not an array or a collection with items in specific positions, so the fact that singularity creates the grammatically appealing form of array subscription used in procedural languages, array_name[index], is not applicable. There is no order to the members of a set. {1, 2, 3} and {2, 1, 3} are the same set.

See the ISO11179 overview at the end of this document for further details.

Conform to ISO11179.

Data element names should be of the form:

[(class-term|qualifier-term{_qualifier-term})_]representation-term

With the more general terms to the right, and the more specific terms to the left. For example:

• order_entered_date

• customer_first_name

• invoice_number

• invoice_customer_reference_number

• computer_CPU_fried_transistor_count

The generality of a term can usually be derived by whether or not the question “what kind of (x)?” applies some other term to the term (x). For instance, in the example above we have a count

What kind of count? A count of transistors.

What kind of transistors? fried transistors.

What kind of fried transistors? The fried transistors in a CPU.

Per ISO11179, the class term may be omitted when the data element name is defined in the context of the class. In SQL-speak the table is generally the class term, so the table name need not be included in the name of every column on the table (but see the ISO11179 overview for notes on this point). In the example, the class term “computer” may be omitted if this is the name of a column on a table called computers (or something semantically equivalent).

This is not always a hierarchy in that the order of elements can sometimes be rearranged. In the example, fried_cpu_transistor_count is just as good as cpu_fried_transistor_count. On the other hand, transistor_cpu_fried_count doesn’t work (“fried” is not a valid answer to “what kind of count?”).

Why?

• ISO11179 is an established metadata standard.

• The decomposition of data element names into the components specified by ISO11179 is intuitive.

• A decomposable data element name can be validated against a data dictionary.

• An overview of the principles of ISO11179 can be found at the end of this document..

Acronyms, and only acronyms, should be capitalized.

Good:   order_AUD_amt
Bad:    ORDER_AUD_AMT, Order_Aud_Amt, OrderAudAmt, OrderAUDAmt, order_aud_amt

Why?

• In the English language capitalization is used in three places: for grammar, proper names, and acronyms. A data element name is a single token and therefore contains no grammar. A data element name is not a proper name. A data element name may contain acronyms. This rule is therefore consistent with natural English.

• All-caps identifiers hide acronyms, and all-caps is more difficult to read than all-lowercase.

• Proper case / camel case can also hide acronyms and it can be hard to distinguish between terms.

To prove this, consider a few of the possible representations of the data element name for “the net Australian dollar mark-to-market value for a set of credit default swaps”, using standard abbreviations:

1.  CDS_MTM_NET_AUD_AMT
2.  cdsMtmNetAudAmt
3.  CDSMTMNETAudAmt
4.  Cds_Mtm_Net_Aud_Amt
5.  CDS_MTM_net_AUD_amt

The fifth representation is both readable and makes the parts which are acronyms easily distinguishable from the parts which are not. This rule can be consistently applied without ambiguity or at any detriment to readability.

Underscores should be used to separate terms.

Good:   customer_full_name
Bad:    customerFullName, CustomerFullName, CUSTOMERFULLNAME, CUSTOMER_FULL_NAME

Why?

• See the examples in the section on capitalization. The common practise of using camel-case or proper-case to separate terms produces the horribly unreadable second and third representations. The third representation is also virtually impossible to automatically parse for validation against a data dictionary.

• A standard implies something we ought follow. Ought implies can, and underscores always can be used without contradicting other parts of the standard or creating problems with readability.


T-SQL Formatting

• Use any fixed width font.

• Use vertical whitespace rivers in complex, multiline SQL statements. For example, align SQL keywords, column names, table names, table aliases and join conditions with other instances of the same type of token.

• There is no need to try to continue rivers through nested statements. Indeed, having the nested statement reset the rivers easily identifies the vertical span of the nested statement.

• Consider using single line forms for simple statements, especially those that are generally a single line in similar procedural code.

• Always wrap if blocks in begin and end keywords unless they are simple, single line statements on the same line as the condition. Never put the conditionally executed statement on a new line without wrapping it in begin and end, no matter how simple.

Example:

Declare @i int set @i = 5 -- equivalent to int i = 5;
if (@i != 5) return
if (@i = 6) begin
   return
end
select      a.colA,
            b.colB,
            c.colC
from        tableA       a
join        tableB       b on  a.col = b.col
                           and a.col1 = b.col2
join        (
            select  colD 
            from    tableD
            )            c on  c.colD = b.colD
left join   tableE       e on  e.col = b.col
cross apply (func())     f

Why?

The goal of formatting is the production of readable code. Block identing is the predominant form of code formatting in procedural languages. Block indenting helps the reader to identify scope as they navigate down a page.

SQL, and by extension much of T-SQL, is declarative. Single statements may have a large vertical span, and many elements within the statement may be of a similar type. The most important readability enhancement in SQL is therefore the use of vertical whitespace rivers to align similar concepts while minimizing the excessive consumption of vertical space (so as much logic as possible can be seen on one screen).

Begin controversial part

Certain aspects of formatting can easily be automatically changed to a user’s preference by the environment. Block indenting is perhaps the easiest of any of these aspects to automatically reformat. SSMS can do this natively.

Vertical whitespace rivers are purely organizational and not strictly related to surrounding logic or formatted elements in the way block indenting is related to other indented blocks. Vertical rivers are therefore very difficult to automatically reformat. Redgate’s toolbelt contains two code reformatters that attempt to perform vertical alignment, but neither works very well.

A formatting standard for SQL should therefore be focused on maintaining the formatting and readability of vertical alignment within large and complex statements.

When a fixed width font is used, a space character can always produce vertical alignment with any character in a previous line. Fixed width font therefore must be used to guarantee the preservation of vertical alignment both within one’s own code, and when viewing the code of others.

If the IDE is set to keep tabs instead of inserting spaces, any vertical alignment set by the use of the tab key will be completely misaligned when another user views the code with a different tab size. If more than one user has worked on a section of code then trying to set the IDE to match the original tab size may fix one part of the code while breaking another part. If the IDE is set to insert spaces then vertical alignment is preserved across all users’ environments.

“Insert spaces” will also preserve alignment across different IDEs and text editors.

It is true that enforcing “insert spaces” is bad because it requires all users to set up their IDE the same way in terms of this setting. It is, however, equally true that “keep tabs” is bad because it requires all users to set up their IDE the same way in terms of tab size. The argument from vertical alignment is the trump card for “insert spaces”.

End controversial part


ISO 11179 Applied to Relational Databases

High Level Overview Of ISO11179

ISO11179 is a metadata standard. The standard mandates the description of data elements in terms of semantic concepts.

A data element is some attribute of some object which exists in a logical data model.

The semantic concept is “what that data element means”. This is best explained by way of example: Take the last name of an employee. This semantic concept may be described multiple ways:

• The last name of a person

• The surname of a person

• The family name of an employee

• EMP_LAST_NM

All of those representations have the same conceptual referent.

ISO 11179 claims that the representation of a semantic concept may be conceptually broken down into identifiable components:

• An object class, which is called the “class term”

• An attribute of that object class, which is called the “property term” or

• A qualification of that attribute, which is called the “qualification term”.

In the case of the above example:

• The class term is “employee” or “person”, or “EMP”.

• The attribute is “Name” or “Surname”

• The qualification is “Last” or “Family”

Note that “surname” is, semantically, a single word which represents both an attribute (the last name of a person) and the qualification of a broader conceptual attribute (the name of something).

ISO 11179 Applied to Tables

Tables, like all stored objects, are of a type. The type of a table is its header, or column list. The header in turn is a set of types composed of attribute names and domains. A domain defines the legal values for an attribute.

Semantically, then, tables are collections of well defined objects. From this we derive that:

• Tables are broadly equivalent to the semantic class term described by ISO11179

• Tables are sets, or collections.

Good:   personnel
    employees
    debtors
Bad:    employee
    DBT_Master

This is not to suggest that tables cannot be namespaced. For instance, the DBT_ prefix in (vendor system) is part of a namespace. It is not conceptually part of the object representation. While this is better represented by schemas in SQL Server 2005, the inclusion of the namespace in the object representation is still fairly widespread. However, if the namespace DBT_ is accepted, a more correct representation of the concept represented by this table would be DBT_debtors or, using SQL Server schemas, dbt.debtors.

ISO 11179 Applied to Columns

A column in a table represents an attribute of the object of which the table contains a set. For instance, order_entered_date represents the entry date attribute of any order in a collection of orders.

The breakdown of this representation into ISO11179 components should by now be fairly clear:

• The class term is order

• The attribute is date

• The qualification of the attribute is entered

ISO 11179 does allow, but does not require, that the class term be omitted if the attribute is defined in the context of its class. So, for instance, if this column is defined on a table “orders”, the column name may be simply “entry_date”.

Note that this holds true even if the representation of the class term is not always identical between table and column. For instance, in a table personnel, the first name of an employee may be “employee_first_name”, but may also be simply “first_name”, even though “employee” is not identical in representation to “personnel”. This is legal due to the fact that “employee” and “a member of the set of personnel” are identical concepts in the data model.

Omitting the class term is discouraged when there is no qualification term. So, for instance, “order_id” may simply be called “id”, but this is discouraged because there is no qualification term. “order_entered_date” may be called “entered_date”, and this is not discouraged because the qualification term “entered” exists.

Another point to note is potential confusion over the attribute representation “date”. As developers we are likely to think of this as a data type, with a particular storage mapping, etc. However, the semantic concept of a “date” is a calendar day. The important point here is that this is not the same as the concept of a particular time on a particular calendar day. There is thus a conceptual difference between a date and a datetime. In SQL Server 2005 there is no difference in the underlying SQL datatype. Nevertheless, the column name should represent the semantic concept, not the physical storage.

The qualification term distinguishes the entered date from any other date related to an order (such as modified date, cancelled date).

ISO11179 Data Dictionary

The application of ISO11179 standards to identifier names allows the creation of a data dictionary. The data dictionary contains class and attribute terms and their accepted abbreviation (if any exists). Qualification terms need not be stored in a data dictionary, although they certainly may be.

For instance, a sample data dictionary might look like the following:

Semantic concept                             ISO11179 Term       Representation      Abbreviation
Point in time on a particular calendar day      Attribute           datetime            DT
Point in time                                   Attribute           time
Calendar day                                    Attribute           date            
Incremented alphanumeric identifier             Attribute           number              nbr
System generated identifier                     Attribute           identifier          id
Natural language identifier                     Attribute           name
Short or abbreviated alphanumeric identifier    Attribute           code
Description                                     Attribute           description         dscr
An individual human being                       Class               person
General ledger                                  Class               generalledger       GL
United states currency                          Qualifier           USD 
Monetary amount                                 Attribute           amount              amt
Cost                                            Qualifier           cost
Purchase Order                                  Class               purchaseorder       PO
Sellable item                                   Class               product 

Such a dictionary would allow, among others, the following column names:

•   person_last_name
•   PO_USD_amt
•   PO_total_amt
•   PO_entered_DT
•   PO_cancelled_date
•   product_name
•   product_dscr

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating