Stairway to SQL Dialects

Stairway to SQL Dialects Level 1: Overview


“England and America are two countries separated by a common language.”

George Bernard Shaw (1856 - 1950)

The fifth and final volume of the Dictionary of American Regional English, Volume V: Sl-Z, from Harvard University Press, came out at the start of 2012. Frederic G. Cassidy, an exuberant Jamaican-born linguist, inaugurated this project about 50 years ago. Mr. Cassidy died in 2000 and did not make it to the end of the alphabet.

The fifth book has about 60,000 terms. This is not quite the American answer to the Oxford English Dictionary, but it does show how dialects are born.

Much like the English-speaking world, the “SQL world” has a common language with a lot of dialects.

The SQL world is dominated by three major players: DB2 from IBM, Oracle, and Microsoft SQL Server. Then there are niche players. Teradata dominates the data warehouse niche, which also includes Vertica, Sand, WX2, Kx, and other engines based on special hardware. Most websites sit on the LAMP platform (Linux, Apache, MySQL, and PHP or Perl). And I have not even mentioned phones or embedded systems! Or small open source products.

Programming language dialects evolve for reasons similar to natural language dialects. SQL started as SEQUEL and Project R at the IBM Santa Teresa labs. It was an experiment and, as Jim Grey used to say, we were making it up as we went along. The character set was determined by what they had on IBM hardware, and the underlying model was determined by existing file systems.

Without a standard, each group of speakers uses what they had in an older language. In the case of Transact-SQL, we used a lot of old Unix and C conventions. The original Sybase compiler was written in C for 16-bit mini-computers running UNIX. That is why Transact-SQL had != instead of <>, getdate() instead of CURRENT_TIMESTAMP, if-else instead of IF-THEN-ELSE-END IF. All these bits of syntax are classic C language constructs.

ANSI standard SQL started with SQL-86. We put this out as fast as we could to avoid too many dialects. We had seen the fate of a wonderful little language you might remember: Beginners All-purpose Symbolic Instruction Code, or BASIC. It wound up with several hundred dialects and a weak ANSI standard that came years too late.

SQL-86 was a three-legged chair, with major parts obviously missing. My personal favorite was having a GRANT statement, but no corresponding REVOKE statement. I doubt your security officer would have liked that very much. But weak as it was, the major keywords and concepts were there. It was a subset of the intersection of the existing SQL products. Its purpose was to anchor the language and to prevent dialect formations like BASIC. We did not have a useable SQL standard until SQL-89 and a good SQL standard until SQL-92.

The SQL standard is highly abstract. It deliberately avoids physical implementation details. This is why CREATE INDEX is not included; we thought it was too physical since it was a particular physical access method. Other earlier programming languages assumed physically contiguous storage and various hardware features. For example, the double-plus syntax in C comes directly from the auto-incrementing options in the old DEC VAX machine assembly language.

Did we succeed in preventing dialects? No, of course not.

Because the major players had membership on the original ANSI X3H2 committee, we did a much better job of developing a common language that all vendors could implement and tailor for niches than previous efforts. Hooray for us! The current name of the same committee is the InterNational Committee for Information Technology (INCITS) H2 committee, and it still has some of the original members.

But while we were writing the standard, people were writing compilers. Where the standard was silent, the compiler writers made decisions based on what they could make work. The best example of this is temporal data types. Yes, there is an ANSI/ISO standard now. But originally you got temporal data from the operating system clocks and file systems that your SQL was written on top of. That is why the Transact-SQL DATETIME data type has that funny fraction of a second quata. But DB2 came from a mainframe COBOL world, where temporal data types are kept as strings and the hardware had better system clocks.

In the meantime, compiler writers tried to figure out what the standard meant. Reading ANSI and ISO documents is much like reading legal texts. It has specialized terms, with special meanings. The documents have approval processes and validation suites. And even then, it is better to have actually been in the discussions so you know the intent.

Is Portable SQL Possible?

The short answer is, yes, you can write very portable SQL. But this has to be a design goal from the start of your project. This is such an important goal for the US Federal Government that we have the Federal Information Processing Standard for various languages, including SQL, which requires a flagger option in the compilers. When the flagger is on, you get warning messages when a statement is not standard ANSI.

If you want to do your own validation, there are parsers for various ANSI standards. My favorite is, which has SQL-92, SQL-99, and SQL:2003 parsers.

What you will find is that most of your code already adheres to the ANSI standard. The most common data manipulation language (DML) statements—SELECT..,FROM, UPDATE, and INSERT INTO—are simple ANSI standard statements. The primary differences are in the library function names and environmental settings. Most of the vendor extensions can be done with standard features.

The Levels that follow in this Stairway series will discuss what the differences are and how to overcome them.

Classifying SQL Ports

Let's start by classifying code ports by difficulty. The first port is just getting your code into a readable format so that all the code in your shop is consistent. Thank goodness there are pretty printer programs for doing this. Your real worry is Cowboy Coder Tony, who writes proprietary code. His dialect is so strong he codes like a hillbilly. He might do this to ensure job security; the more obscure his code, the safer he feels. SQL programmers are already bad about inserting comments, and all programmers seem to hate documentation.

Another reason for his overabundance of proprietary code might be ignorance; he learned a SQL programming idiom earlier in his career and never updated his education later. Even worse, he learned a procedural or object oriented language and kept writing it using SQL.

The third reason for his approach might be due performance advantages he saw in one of the releases of his SQL product. There is almost always a trade-off between slight performance improvement now for less maintainability over the lifetime of the system. And that performance improvement often disappears in the next release.

While we use the term upgrade instead of port when we move from one version of a vendor's SQL to the next release, it is really a port. For obvious reasons, vendors try to make this as easy as possible and keep release n+1 code running on release n software. And that the data will certainly have to move if you want to stay in business.

Let's assume that our Transact-SQL is in good shape and we now want to bring in foreign SQL code. The next level of porting difficulty is getting the foreign code into good enough shape that it uses only the features implemented in your version of Transact-SQL. The foreign code should be so simple that it runs just as it is, that is, if the code you’re trying to translate was well written by a professional. Plug and play!

The next difficulty level is Portable SQL from another product. The code will not run as written, but you can write a simple regular expression text edit to get it into standard SQL. An example in Transact-SQL is the old Sybase/UNIX getdate() function. This can be bulk edited to CURRENT_TIMESTAMP and should be.

But the edit is not always a simple string substitution. To use another Transact SQL example, CONVERT() with non-temporal data types can be changed to CAST(). But CONVERT() is an overloaded function that has an optional third parameter when it casts temporal data to display strings.

The most obvious example is the library function. ANSI syntax likes to use keywords inside the parameter list of a function rather than a simple parameter list. Vendors pick their own names and parameter order. For example, the syntax for the ANSI SUBSTRING() function reads SUBSTRING (<string expr> FROM <integer start position> FOR <integer char count>), while Transact-SQL uses commas instead of the FROM and FOR keywords. Other products use SUBSTR() as the name of the function.

So far, the port has been one-for-one replacements.

The next level is Translatable SQL, and it requires some human smarts. The code will not run as written, but a simple text edit will not work either. This is often due to a feature in the source SQL not yet available in the target SQL. A recent example is the window clause OVER() in SQL Server 2008, which did not have the [RANGE | ROW] sub-clause. Now, we have it in SQL Server 2012. Hooray!

But now look back at SQL Server 2008 code you wrote to get a simple running total of porting issues. The code includes a self-join (or worse, a cursor). This takes a bit more thought than just a text edit. You have to recognize the intent of the code. That requires a human being who decides if we can use the new feature and how to re-write it.

Mother Celko's heuristics are to re-write code to the new feature when the new feature is ANSI standard or when it is fewer lines of code than the old syntax. Vendors will maintain and improve an ANSI standard feature in all future releases, while they will deprecate their own proprietary syntax. Older transaction SQL programmers remember *= before the LEFT OUTER JOIN syntax.

The “fewer lines of code” heuristic is based on the idea that we had a common recurring programming idiom that took a lot of code. Finally, someone made it into a single declarative construct and got it thru the INCITS H2 committee. This is why the MERGE statement was created.

The final level of difficulty is Untranslatable, which means you’re screwed. The source SQL has some specialized features or implementation that cannot map into Transact-SQL. The source SQL is designed for a specialized purpose such as data warehousing, streaming data, video searching, GIS, or something out of the mainstream. Most of us are working on commercial systems, and we will be porting SQL from another commercial system. This Stairway series will stick to products that you are likely to see in your work.

This article is part of the parent stairway Stairway to SQL Dialects