Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Don Halloran

SQL Server, SSIS, SSAS and maybe even SSRS rants and raves (and, on occasion, useful code or designs).

SQL Standards

Three posts, three critical rants. It's time to be constructive.

I was given the responsibility to decide on the SQL development standards at my company. I think most people with any real interest in database development would enjoy this task. As it happens, enforcing the standard is a little difficult (see my previous post about that...). Nevermind, I proposed the standard that I've evolved through the application of theory and experience over the last 10 or so years.

I've seen standards documents that normally simply say "do this", "don't do that". I find it difficult to respect such documents. If I should "do this", there must be a reason for it. In my opinion that reason should be in the document (as an appendix, if you like). The reasons are twofold. First, it necessarily clears the air of unjustified authoritarianism that can be so noxious to a development team (because the justifications are available). Secondly, it gives people the chance to intelligently comment and critique the conclusions. If this happens it's a good thing. Either your ideas will be defended by sound argument, in which case they become more trustworthy (just as scientific hypotheses and theories become more solid through attempts at falsificaton), or the ideas will evolve in response to the criticisms, in which case they will become more correct. There's no downside.

And so, I present to you a large portion of my standards as well as the reasoning behind them.

RFC: SQL Coding Standards – Don’s Opinions (and Arguments)


Data Element Names

Table names should indicate plurality.

[pref] Good: personnel, employees Bad: employee [endpref] Why?

A table is not the same kind of thing as a class in object oriented programming. A table is a set of rows. A table is not an array or a collection with items in specific positions, so the fact that singularity creates the grammatically appealing form of array subscription used in procedural languages, array_name[index], is not applicable. There is no order to the members of a set. {1, 2, 3} and {2, 1, 3} are the same set.
See the ISO11179 overview at the end of this document for further details.

Conform to ISO11179.

Data element names should be of the form:
[pref][(class-term|qualifier-term{_qualifier-term})_]representation-term[endpref] With the more general terms to the right, and the more specific terms to the left. For example:
• order_entered_date
• customer_first_name
• invoice_number
• invoice_customer_reference_number
• computer_CPU_fried_transistor_count

The generality of a term can usually be derived by whether or not the question “what kind of (x)?” applies some other term to the term (x). For instance, in the example above we have a count
What kind of count? A count of transistors.
What kind of transistors? fried transistors.
What kind of fried transistors? The fried transistors in a CPU.

Per ISO11179, the class term may be omitted when the data element name is defined in the context of the class. In SQL-speak the table is generally the class term, so the table name need not be included in the name of every column on the table (but see the ISO11179 overview for notes on this point). In the example, the class term “computer” may be omitted if this is the name of a column on a table called computers (or something semantically equivalent).
This is not always a hierarchy in that the order of elements can sometimes be rearranged. In the example, fried_cpu_transistor_count is just as good as cpu_fried_transistor_count. On the other hand, transistor_cpu_fried_count doesn’t work (“fried” is not a valid answer to “what kind of count?”).

Why?

• ISO11179 is an established metadata standard.
• The decomposition of data element names into the components specified by ISO11179 is intuitive.
• A decomposable data element name can be validated against a data dictionary.
• An overview of the principles of ISO11179 can be found at the end of this document..

Acronyms, and only acronyms, should be capitalized.
[pref] Good: order_AUD_amt Bad: ORDER_AUD_AMT, Order_Aud_Amt, OrderAudAmt, OrderAUDAmt, order_aud_amt [endpref] Why?

• In the English language capitalization is used in three places: for grammar, proper names, and acronyms. A data element name is a single token and therefore contains no grammar. A data element name is not a proper name. A data element name may contain acronyms. This rule is therefore consistent with natural English.
• All-caps identifiers hide acronyms, and all-caps is more difficult to read than all-lowercase.
• Proper case / camel case can also hide acronyms and it can be hard to distinguish between terms.
To prove this, consider a few of the possible representations of the data element name for “the net Australian dollar mark-to-market value for a set of credit default swaps”, using standard abbreviations:
[pref] 1. CDS_MTM_NET_AUD_AMT 2. cdsMtmNetAudAmt 3. CDSMTMNETAudAmt 4. Cds_Mtm_Net_Aud_Amt 5. CDS_MTM_net_AUD_amt [endpref] The fifth representation is both readable and makes the parts which are acronyms easily distinguishable from the parts which are not. This rule can be consistently applied without ambiguity or at any detriment to readability.

Underscores should be used to separate terms.
[pref] Good: customer_full_name Bad: customerFullName, CustomerFullName, CUSTOMERFULLNAME, CUSTOMER_FULL_NAME [endpref] Why?

• See the examples in the section on capitalization. The common practise of using camel-case or proper-case to separate terms produces the horribly unreadable second and third representations. The third representation is also virtually impossible to automatically parse for validation against a data dictionary.
• A standard implies something we ought follow. Ought implies can, and underscores always can be used without contradicting other parts of the standard or creating problems with readability.


T-SQL Formatting

• Use any fixed width font.
• Use vertical whitespace rivers in complex, multiline SQL statements. For example, align SQL keywords, column names, table names, table aliases and join conditions with other instances of the same type of token.
• There is no need to try to continue rivers through nested statements. Indeed, having the nested statement reset the rivers easily identifies the vertical span of the nested statement.
• Consider using single line forms for simple statements, especially those that are generally a single line in similar procedural code.
• Always wrap if blocks in begin and end keywords unless they are simple, single line statements on the same line as the condition. Never put the conditionally executed statement on a new line without wrapping it in begin and end, no matter how simple.

Example:

[pref] Declare @i int set @i = 5 -- equivalent to int i = 5; if (@i != 5) return if (@i = 6) begin return end select a.colA, b.colB, c.colC from tableA a join tableB b on a.col = b.col and a.col1 = b.col2 join ( select colD from tableD ) c on c.colD = b.colD left join tableE e on e.col = b.col cross apply (func()) f [endpref]

Why?

The goal of formatting is the production of readable code. Block identing is the predominant form of code formatting in procedural languages. Block indenting helps the reader to identify scope as they navigate down a page.
SQL, and by extension much of T-SQL, is declarative. Single statements may have a large vertical span, and many elements within the statement may be of a similar type. The most important readability enhancement in SQL is therefore the use of vertical whitespace rivers to align similar concepts while minimizing the excessive consumption of vertical space (so as much logic as possible can be seen on one screen).

Begin controversial part Certain aspects of formatting can easily be automatically changed to a user’s preference by the environment. Block indenting is perhaps the easiest of any of these aspects to automatically reformat. SSMS can do this natively.

Vertical whitespace rivers are purely organizational and not strictly related to surrounding logic or formatted elements in the way block indenting is related to other indented blocks. Vertical rivers are therefore very difficult to automatically reformat. Redgate’s toolbelt contains two code reformatters that attempt to perform vertical alignment, but neither works very well.

A formatting standard for SQL should therefore be focused on maintaining the formatting and readability of vertical alignment within large and complex statements.

When a fixed width font is used, a space character can always produce vertical alignment with any character in a previous line. Fixed width font therefore must be used to guarantee the preservation of vertical alignment both within one’s own code, and when viewing the code of others.

If the IDE is set to keep tabs instead of inserting spaces, any vertical alignment set by the use of the tab key will be completely misaligned when another user views the code with a different tab size. If more than one user has worked on a section of code then trying to set the IDE to match the original tab size may fix one part of the code while breaking another part. If the IDE is set to insert spaces then vertical alignment is preserved across all users’ environments.

“Insert spaces” will also preserve alignment across different IDEs and text editors.

It is true that enforcing “insert spaces” is bad because it requires all users to set up their IDE the same way in terms of this setting. It is, however, equally true that “keep tabs” is bad because it requires all users to set up their IDE the same way in terms of tab size. The argument from vertical alignment is the trump card for “insert spaces”.

End controversial part


ISO 11179 Applied to Relational Databases

High Level Overview Of ISO11179

ISO11179 is a metadata standard. The standard mandates the description of data elements in terms of semantic concepts.

A data element is some attribute of some object which exists in a logical data model.

The semantic concept is “what that data element means”. This is best explained by way of example: Take the last name of an employee. This semantic concept may be described multiple ways:
• The last name of a person
• The surname of a person
• The family name of an employee
• EMP_LAST_NM

All of those representations have the same conceptual referent.

ISO 11179 claims that the representation of a semantic concept may be conceptually broken down into identifiable components:
• An object class, which is called the “class term”
• An attribute of that object class, which is called the “property term” or
• A qualification of that attribute, which is called the “qualification term”.

In the case of the above example: • The class term is “employee” or “person”, or “EMP”.
• The attribute is “Name” or “Surname”
• The qualification is “Last” or “Family”

Note that “surname” is, semantically, a single word which represents both an attribute (the last name of a person) and the qualification of a broader conceptual attribute (the name of something).

ISO 11179 Applied to Tables

Tables, like all stored objects, are of a type. The type of a table is its header, or column list. The header in turn is a set of types composed of attribute names and domains. A domain defines the legal values for an attribute.

Semantically, then, tables are collections of well defined objects. From this we derive that:
• Tables are broadly equivalent to the semantic class term described by ISO11179
• Tables are sets, or collections.

[pref] Good: personnel employees debtors Bad: employee DBT_Master [endpref]
This is not to suggest that tables cannot be namespaced. For instance, the DBT_ prefix in (vendor system) is part of a namespace. It is not conceptually part of the object representation. While this is better represented by schemas in SQL Server 2005, the inclusion of the namespace in the object representation is still fairly widespread. However, if the namespace DBT_ is accepted, a more correct representation of the concept represented by this table would be DBT_debtors or, using SQL Server schemas, dbt.debtors.

ISO 11179 Applied to Columns

A column in a table represents an attribute of the object of which the table contains a set. For instance, order_entered_date represents the entry date attribute of any order in a collection of orders.

The breakdown of this representation into ISO11179 components should by now be fairly clear:
• The class term is order
• The attribute is date
• The qualification of the attribute is entered

ISO 11179 does allow, but does not require, that the class term be omitted if the attribute is defined in the context of its class. So, for instance, if this column is defined on a table “orders”, the column name may be simply “entry_date”.
Note that this holds true even if the representation of the class term is not always identical between table and column. For instance, in a table personnel, the first name of an employee may be “employee_first_name”, but may also be simply “first_name”, even though “employee” is not identical in representation to “personnel”. This is legal due to the fact that “employee” and “a member of the set of personnel” are identical concepts in the data model.

Omitting the class term is discouraged when there is no qualification term. So, for instance, “order_id” may simply be called “id”, but this is discouraged because there is no qualification term. “order_entered_date” may be called “entered_date”, and this is not discouraged because the qualification term “entered” exists.

Another point to note is potential confusion over the attribute representation “date”. As developers we are likely to think of this as a data type, with a particular storage mapping, etc. However, the semantic concept of a “date” is a calendar day. The important point here is that this is not the same as the concept of a particular time on a particular calendar day. There is thus a conceptual difference between a date and a datetime. In SQL Server 2005 there is no difference in the underlying SQL datatype. Nevertheless, the column name should represent the semantic concept, not the physical storage.

The qualification term distinguishes the entered date from any other date related to an order (such as modified date, cancelled date).

ISO11179 Data Dictionary

The application of ISO11179 standards to identifier names allows the creation of a data dictionary. The data dictionary contains class and attribute terms and their accepted abbreviation (if any exists). Qualification terms need not be stored in a data dictionary, although they certainly may be. For instance, a sample data dictionary might look like the following:

[pref] Semantic concept ISO11179 Term Representation Abbreviation Point in time on a particular calendar day Attribute datetime DT Point in time Attribute time Calendar day Attribute date Incremented alphanumeric identifier Attribute number nbr System generated identifier Attribute identifier id Natural language identifier Attribute name Short or abbreviated alphanumeric identifier Attribute code Description Attribute description dscr An individual human being Class person General ledger Class generalledger GL United states currency Qualifier USD Monetary amount Attribute amount amt Cost Qualifier cost Purchase Order Class purchaseorder PO Sellable item Class product [endpref]

Such a dictionary would allow, among others, the following column names:
[pref] • person_last_name • PO_USD_amt • PO_total_amt • PO_entered_DT • PO_cancelled_date • product_name • product_dscr [endpref]

Comments

Posted by robertmcook on 13 December 2010

It is nice to see an increase in the articles on topics on highly debated topics so first of all thank you.  In a past job, I had to create developed standards.  Two items were done differently, plurality and capitalization.  Singularity was simply because it was just easier.  Do you name a table about index as indexes or indices?  Do you name a table about an organizational staff as staff or staffs?  Auto-complete can help find the correct table name but in a very large database or across an enterprise you could end up with multiple different names for the same item.  Capitalization was all lowercase because as you suggested it is easier to read and also for case sensitive servers, who wants to try and remember the casing schema.  Like plurality auto-complete can help find the correct name but in a very large database or across an enterprise you could end up with multiple different cases for the same name.

Posted by allmhuran on 13 December 2010

I actually like collective terms even more than just adding an "s" on the end of the underlying entity. So to me, "staff" is a good name in itself (it already denotes a set).

For the other one, I'd personally go with "indexes", but I don't think having to choose here is much of a problem is it? This seems to be more of an issue with just deciding on which name you'd prefer, and happens regardless of plurality (eg, do I want to use "staff", "employees" or "personnel" as a table name?) In the end you just pick one and run with it.

Thanks for your comment!

Posted by Anonymous on 23 December 2010

Introduction I know I said I was going to add more information about the CDC system in this post, but

Leave a Comment

Please register or log in to leave a comment.