Data Element Names and Scoping Rules in SQL


How you name data elements in databases and application programs has often been a matter of personal taste. Decades ago, when I worked for state government, there was a COBOL programmer who would pick a theme for his programs. The paragraphs and variables would be named based on the current theme. One of his programs used the names of countries, so we had statements like “GO TO Afghanistan.” Or when he was gardening, we had to figure out exactly what “PERFORM Chrysanthemum.” was actually doing. We old-timers referred to these naming conventions as “job security programming,” since nobody but him could figure out what his code was doing.

FORTRAN and BASIC

Over the decades, we have come to agree on some naming conventions across multiple programming languages. The “ancestral form” from which all modern programming languages descend began with punch cards and FORTRAN. Because of the punch cards, you had no choice but to use uppercase letters and digits, since that’s all that was available to you. But a punch card is only 80 card columns across, and some of those columns had to be used for line number labels on the left-hand side and sequential numbers on the right-hand side. The sequential numbers were there in case you dropped the deck of punch cards. Ask an old-timer what that was like.

FORTRAN started the convention that a data element name must begin with a letter. Digits are allowed to appear after the first position, but FORTRAN did not allow any punctuation marks in the name. Because FORTRAN was based on algebra, and algebraic variables are typically single letters, we didn’t think much about it. Today, the Unicode standard assures that all the language character sets can also accept Latin letters, digits, and a limited set of punctuation. This was done so that all of the standard SI (Système international, better known as the “metric system”) abbreviations can be written in any language on earth.

At this point, language standards were much more physical than they are today. For example, the method used to store an array (row major or column major order) was specified and was usually based on the hardware, as was the choice of ASCII or EBCDIC for the collation order. At the extreme, early FORTRAN compilers had a three-way flow control statement (IF (<exp>) <label_1>, <label_2>, <label_3>), which mapped to a single assembly language instruction on an early IBM computer.

Today, we try to be more abstract, and “implementation dependent” features are usually specified in other industry standards which specify a minimum. FIPS (the Federal Information Processing Standards) is probably the most common example, because of the overwhelming presence of the US federal government in the IT industry.

The original FORTRAN allowed only six character identifiers. But there was a gimmick here. If the identifier began with the letters “I” through “N”, the data element was an integer; otherwise, it was a REAL or floating-point number. Before you ask, FORTRAN did not originally have any other datatypes.

The BASIC programming language evolved from FORTRAN and extended this initial-letter convention. But the differences were that the earlier versions of BASIC had only floating-point numbers, and the names were limited to a single letter plus a digit (“<letter><digit>”). Later versions of the language added string data types and marked them by putting a $ at the front or the end of the name. Having the datatype built into the name makes parsing a simple one-pass operation.

Many of the early operating systems also used prefixes to signal the nature of an identifier. This is why the T-SQL language uses @, @@, #, and ## prefixes for local and global names. Today, programming languages have more sophisticated parsing methods. Modern programming languages also have more than just simple variables, which requires more complicated symbol tables.
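As a sketch of how those prefixes work (this is T-SQL, not Standard SQL, and the names themselves are my own invention):

```sql
-- T-SQL identifier prefixes; all names here are hypothetical.
DECLARE @emp_cnt INTEGER;             -- @  marks a local variable
SET @emp_cnt = 0;

SELECT @@VERSION;                     -- @@ marks a built-in system variable

CREATE TABLE #work_list               -- #  marks a local temporary table
  (emp_id INTEGER NOT NULL);

CREATE TABLE ##shared_list            -- ## marks a global temporary table
  (emp_id INTEGER NOT NULL);
```

The parser knows what kind of thing each name refers to from the prefix alone, before it ever consults a symbol table.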

COBOL

The most common programming language, even today, is COBOL. COBOL originally started off with 19 characters as the maximum name length in the standards; SQL-92 also followed the 19-character length. The allowed character set was based on EBCDIC rather than ASCII, but there was no problem here. The language also uses a dash in data element names; COBOL syntax prevents this symbol from being parsed as a minus sign. The express goal of COBOL was to look as much like English as possible, so we use a hyphen to break a data element name into readable units.

SQL

When SQL was created, compilers had gotten pretty sophisticated. We also had the concept of a namespace; this is an abstraction in which various types of data elements in the language have different semantic or syntactic rules.

Constraints and naming are interesting in SQL because of the namespace rules. Tables and columns live in separate namespaces, which is why a table name and a column name can be the same. Frankly, we screwed up on this one. By definition, a table models a set of rows, and a set of scalar columns creates a row. Unfortunately, it’s all too common to see a table named “employee” which also has a column named “employee”. The grammar of SQL makes it easy for the compiler to tell the difference between these two data elements, but humans aren’t quite as clever.
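A minimal sketch of the problem (the names are my own invention):

```sql
-- Legal but confusing: the table and one of its columns share a name.
CREATE TABLE employee
(employee  INTEGER NOT NULL PRIMARY KEY,  -- which "employee" is which?
 last_name VARCHAR(35) NOT NULL);

-- The compiler resolves this from context; a human has to stop and think.
SELECT employee FROM employee;
```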

We then created the double quotemark syntax, for which Microsoft also uses square brackets. An identifier enclosed in double quotemarks can contain pretty much any characters you want in a name. In fact, when we came up with this idea, other ANSI X3 language committees also considered it. We felt very clever, but we didn’t think about bad SQL programmers. Instead of giving a data element a proper name, these programmers use this convention to create display headers, with weird punctuation marks and (ugh!) embedded spaces. The result is that these data element names cannot be reused in other software.

Back in the days of FORTRAN and COBOL, porting names across the entire enterprise was not done the way it is today. We were just beginning to get the idea of data dictionaries and sharing database access among many applications. One of my little mantras is that a data element should be spelled one way, defined one way, stored with only one data type, and kept in one place. While this might seem obvious, think about how many databases you’ve seen where a birthdate is abbreviated as “dob”, “birth_date”, “birthday” and so forth. Then, depending on which one of these synonyms is used, the data is stored as a DATE using an American, British, or some other weird local date format. Or worse yet, the dates are stored as a string, an integer, or a floating-point number in various places in the database!

Thanks to CHECK() constraints, we don’t have to worry about trying to coordinate such constraints in dozens or hundreds of application programs. Decades ago, I had the problem of one program checking to see if an employee’s age was greater than or equal to 18, while another program was checking to see if it was simply greater than 18. Finding that little tiny = in a large body of code is pretty hard. I feel that people under-use CHECK() constraints. My favorite example, which everybody violates, is that some numeric values make no sense when they are zero or negative.
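Here is a sketch of that rule written once, in the DDL, instead of in every application program (the table and column names are assumptions):

```sql
CREATE TABLE Personnel
(emp_id     INTEGER NOT NULL PRIMARY KEY,
 emp_age    INTEGER NOT NULL
              CHECK (emp_age >= 18),      -- one spelling of the rule, one place
 salary_amt DECIMAL(8,2) NOT NULL
              CHECK (salary_amt > 0.00)); -- no zero or negative salaries
```

Every application that touches this table now gets the same >= 18 test, whether its programmer remembered the rule or not.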

While SQL Server does not yet have it, ANSI/ISO Standard SQL has a CREATE ASSERTION statement, with the syntax "CREATE ASSERTION <assertion name> CHECK (<search condition>);", which is a global CHECK() constraint. This means that all constraint names share a single global namespace; they are not local to a given statement or table.

The reasons for this are: (1) all constraints are TRUE on an empty table, so you have to move verification up another level (global) in the scoping rules of the language; and (2) the regular CHECK() clause is limited to the scope of one table. It can apply to a single column, such as "order_qty INTEGER NOT NULL CHECK (order_qty > 0)", or it can apply to multiple columns at the table level, such as "CHECK (birth_date < hire_date)", or even to more complicated relationships among multiple columns in the same table.
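As a sketch of the difference (the Personnel table here is hypothetical), an assertion can enforce a rule that a table-level CHECK() cannot, precisely because a CHECK() gets a free pass on an empty table:

```sql
-- ANSI/ISO Standard SQL, not T-SQL. The business rule "we always have
-- at least 100 employees" would be vacuously TRUE on an empty table if
-- written as a CHECK() constraint; as an assertion it is enforced at
-- the schema level.
CREATE ASSERTION enough_personnel
  CHECK (100 <= (SELECT COUNT(*) FROM Personnel));
```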

But wait, that's not all! If a column is NULL-able, then a CHECK() constraint in the DDL gets what we call "benefit of the doubt" logic. In the DML (queries and all that non-declarative stuff), if a predicate tests to UNKNOWN, the rows are not selected. But in the DDL, if the predicate in a CHECK() constraint tests to TRUE or UNKNOWN, the row is accepted; only rows that test to FALSE are rejected.
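A sketch of both sides of the "benefit of the doubt" rule, using a hypothetical Orders table:

```sql
CREATE TABLE Orders
(order_nbr INTEGER NOT NULL PRIMARY KEY,
 order_qty INTEGER CHECK (order_qty > 0));   -- NULL-able on purpose

INSERT INTO Orders VALUES (1, NULL);  -- accepted: (NULL > 0) is UNKNOWN
INSERT INTO Orders VALUES (2, -5);    -- rejected: (-5 > 0) is FALSE

-- The same predicate in the DML drops the UNKNOWN row:
SELECT order_nbr FROM Orders WHERE order_qty > 0;  -- row 1 is not returned
```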

ISO Naming Conventions

The ISO 11179 Standards give a base set of rules for creating data element names. The basic rules are:

  • A name should be unique within the data dictionary of the scope in which it appears. If it can be universally unique, so much the better. For example, the Vehicle Identification Number is universally known as the VIN.
  • The name of a scalar data element should be singular.

The name should tell us what the concept is rather than what it is not.

  1. The concept should be defined with a descriptive phrase or sentence.
  2. Use commonly understood abbreviations and terminology. Don’t confuse the user with jargon.
  3. Express the concept without embedding definitions of other data elements or underlying concepts.
  4. Tables, sets and other collections should be named with a collective, class, or plural name.
  5. Procedures do something, so they require a verb in their name.
  6. The name of an alias or copy of a table should include the base tablename as well as the role that is being played.

One of my pet peeves with aliases is that newbies will name them alphabetically (A, B, C, …), even though the letter doesn’t even occur in the base tablename. I think the reason for this is that this is how the drives on PC machines were named.

The general format of the data element name is: [<role>_] <attribute name>_<attribute property>

The usual example is that we have a table named “Forest”, whose rows each model a tree. Each tree has attributes (columns) such as diameter, height, species and so forth. But there’s no such thing as a generic “diameter” in any kind of data model; it has to be the diameter of something in particular. In this case, it would be “tree_diameter”, and it would have some unit of measure. And we still have to have a key for the table; in this case, forest rangers would probably have assigned some kind of unique identifier, put it on a metal disc, and nailed this disc to each individual tree. If you go to a city park, you will see such discs. Today, radio detection devices and satellite photos perform the same function for larger areas.
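Putting the "[<role>_]<attribute name>_<attribute property>" format together for that example (the column names and units of measure are my assumptions):

```sql
CREATE TABLE Forest
(tree_id          INTEGER NOT NULL PRIMARY KEY,  -- the number on the metal disc
 tree_species     VARCHAR(30) NOT NULL,
 tree_diameter_cm DECIMAL(6,2) NOT NULL
                    CHECK (tree_diameter_cm > 0),
 tree_height_m    DECIMAL(5,2) NOT NULL
                    CHECK (tree_height_m > 0));
```

Each column names the thing it measures (tree), the attribute (diameter, height), and the property or unit (cm, m); there are no generic “diameter” columns floating loose in the schema.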

Hungarian Notation

Hungarian notation is an identifier naming convention in computer programming, in which the name of a variable or function indicates its intention or kind, and its type. The original dialect was invented by Charles Simonyi, a programmer who worked at Xerox PARC circa 1972–1981, and who later became Chief Architect at Microsoft. Simonyi is of Hungarian origin, hence the name of the system. It became the internal naming standard for coding Word, Excel and other apps. As the Microsoft Windows division adopted the naming convention, they used the actual data type for naming, and this convention became widely spread through the Windows API; this is sometimes called Systems Hungarian notation.

Hungarian notation was designed to be language-independent, and found its first major use with the BCPL programming language. BCPL has no data types, since it is a low-level tool based on the length of a word in the machine hardware. That means a variable can be anything.

Hungarian notation attempts to remedy this design flaw by putting data type information in the name of the variable. Of course, the compiler does not have any way of enforcing any constraints or type checking from the name.

In Hungarian notation, a variable name starts with a group of lower-case letters which are mnemonics for the type or purpose of that variable, followed by whatever name the programmer has chosen; this last part is sometimes distinguished as the "given name", and it can be capitalized to visually separate it. The reason this is referred to as "Hungarian notation" is that Hungarian, like Japanese, is a language which puts the family name before the given name ("Smith John" and not "John Smith", as in English). This is not how the ISO 11179 and Metadata Standards Committee do it. Another problem is that we get a lot of camel case (uppercase letters embedded in the name) words. Camel case doesn't work well because, if you've grown up with a Latin, Greek, or Cyrillic alphabet, your eye has been taught to jump to an uppercase letter. The uppercase letters in those alphabets signal the start of a paragraph, sentence, or proper name; with camel case, your eye jerks to the uppercase letter, and then back to the start of the word.

Hungarian notation is concerned with semantic information about the purpose of a variable in the system. Unfortunately, a variable can change its purpose depending on where it is used. The good news is that the notation strives to encode the logical data type rather than the physical data type.

While the notation always uses initial lower-case letters as mnemonics, it does not prescribe the mnemonics themselves. There are several widely used conventions, but any set of letters can be used, as long as they are consistent within a given body of code.

While Hungarian notation can be applied to any programming language and environment, it was widely adopted by Microsoft for use with the C language, in particular for Microsoft Windows, and its use remains largely confined to that area. In particular, use of Hungarian notation was widely evangelized by Charles Petzold's "Programming Windows", the original (and for many readers, the definitive) book on Windows API programming. Thus, many commonly seen constructs of Hungarian notation are specific to Windows:

For programmers who learned Windows programming in C, probably the most memorable examples are the wParam (word-size parameter) and lParam (long-integer parameter) for the WindowProc() function. For example:

  • hwndFoo : handle to a window
  • lpszBar : long pointer to a zero-terminated string

The notation is sometimes extended in C++ to include the scope of a variable, optionally separated by an underscore. This extension is often also used without the Hungarian type-specification:

  • g_nWheels : member of a global namespace, integer
  • m_nWheels : member of a structure/class, integer
  • m_wheels, _wheels : member of a structure/class
  • s_wheels : static member of a class
  • c_wheels : static member of a function

In JavaScript code using jQuery, a $ prefix is often used to indicate that a variable holds a jQuery object (versus a plain DOM object or some other value).

Hungarian notation is redundant in languages with strong typing, such as SQL. Modern integrated development environments display variable types on demand, and automatically flag operations which use incompatible types, making the notation largely obsolete. Even worse, the notation can be confusing when you're trying to represent several different properties in the name. For example, consider "a_crszkvc30LastNameCol": a constant reference argument, holding the contents of a database column LastName of type varchar(30) which is part of the table's primary key.

Leszynski Naming Convention

The Leszynski naming convention (or LNC) is a variant of Hungarian notation popularized by consultant Stan Leszynski specifically for use with Microsoft Access development. The conventions are derived from a 1992 convention originally developed by Greg Reddick. Eventually, Leszynski and Reddick split into two separate sets of conventions. Although these naming conventions are often used within the Microsoft Access community, and are the standard in Visual Basic programming, they are not widely used elsewhere.

Conclusion

Scoping and naming conventions are not easy. When you’re working with SQL, however, you have the advantage of a strongly typed language that’s highly specialized. The specialization means it doesn’t have the range of objects that general-purpose languages do. What we would like is a set of naming conventions for the data elements that can be used everywhere in our system, and not just in SQL. This might not be as easy to do, since we have to worry about all of the languages (present and future) that will use our data. Spend some time on this problem.
