Having covered the procedure headers in SQL Server in the previous level, Joe tackles the subject of the contents of stored procedures. In this level, he outlines limitations of TSQL as a procedural language, and what you need to bear in mind when deciding how to use them.
In levels one to four, we built the tables, base and virtual, of a schema. But we were not done because a schema is more than that. In a real database there are more structures to consider. Among these other schema level things are: cursors, triggers and stored procedures. There are more things like collations, translations, privileges and suchlike, too. I am going to deal with only those three things - cursors, triggers and stored procedures - I named and only in general ways. T-SOL and other products can get highly proprietary in spite of the ANSI/ISO Standards. The reason is simple; these were things that were built on top of the preexisting file systems that early SQL products used. These procedural constructs were ways of making up for the lack of declarative code in the old products. The vendors have users who are locked into this "code museum" today and cannot abandon their client base.
In Level Five, we discussed what the header for a procedure has in it and how it works as if it were a black box. In Level Six, we will look inside that black box.
SQL allows procedural code modules to be kept in the schema. While there is a SQL/PSM (Persistent Stored Module) language in Standard SQL, you will probably use a proprietary vendor language like T-SQL. These proprietary languages are usually members of the C and Algol family; that mean they have an IF-THEN-ELSE, a WHILE loop and BEGIN-END blocks with scoping rules.
Most of these proprietary languages were never meant to be used for applications development. The rules of thumb for T-SOL are not to write any procedure over 50 lines (one page) and not to use PRINT. But in fact, you could avoid all set oriented programming, treat each table like a file and code as if the database is a procedural file system. You could also drive nails into your body if you like pain.
T-SQL is a simple one-pass compiler. This is why you must prefix local variables and parameters with @ (also known as at-sign, "snail" or "petite escargot"), system level variables with @@ and temporary tables with # (hash, octothrope or "number sign") and ##. Multi-pass compilers build a symbol table and then discover things about program objects with each pass. A one-pass compiler has to be told what to do with a new symbol when it finds it for the first time. Thus @ says "allocate local storage for me" , @@ says "look for me outside the program, I am global to it", # says "create me in tempdb for the session" and ## says "creaet and persist me in tempdb" as soon as they are parsed. Anything else is assumed to be defined in the DDL.
SQL is Not Computational
You cannot expect T-SQL to do procedural code optimization. That requries multiple passes. For example, most FORTRAN compilers were written to do mathematcal optimizations using algebra. The F and G series from IBM were a fast compile with detailed errors messages for students and a slow but optimized for performance for production work.
Once of the classic IT stories in the 1960's was a DoD (Department of Defense) test of IBM and Univac FORTRAN compilers. The IBM compiler ran for a long time, produced a compact executable module and produced the right answer quickly when it ran. The Univac compiler ran for a very long time, produced a tiny executable module, which contained one WRITE statement that printed the right answer instantly. The gimmick? The sample problem was a set of calculations that involved functions and their inverses that would cancel out. The point was to test floating point rounding errors. The Univac FORTRAN compiler had sucessfully paired the functions and their inverses, done the algebra and produced the answer as a constant.
Avoid FLOAT and REAL in T-SQL. These are the same data type in T-SQL, but not in Standard SQL. The problem is that FLOAT needs special handling to avoid rounding errors and do to comparisons. There is a very reason that they are called aproximate numeric data types. This special handling has to be either built into the software or (better) part of the hardware, which means you need a floating point processor. Likewise, your boss is not likely to install a video game graphics chip set for your desktop either. Servers for commerical applications do not usually need these expensive features, no matter how much company time you spend playing Halo or Doom at work.
But even if the chips were cheap, you cannot reasonably expect T-SQL to do that kind of mathematical optimization. SQL is a data retrieval and management language and not meant for computations. What you want to write is good queries that get the data to pass to a statistical package, a report writer or other specialized tools.
If you need decimal places, then use DECIMAL(s,p) data types. They work just fine and they port. The trick is to give yourself enough decimal places to get correct rounding. That means you need to know the standards of your industry. In particular, if you work with Euros then you need to learn about "euro triangulation", the rules for currency converations and bookkeeping.
It is best to do a little algebra yourself and to keep the math as simple as possible. The same advice applies to string and temporal data.
T-SQL has a function library that is based on C. This is why it uses % instead of the standard MOD() function.
SQL is Not for Display
Again, SQL is a data retrieval and management language and not meant for front-end display. Get the data together in SQL data types and pass them "over the wall" to front end programs, such as report writers and graphics packages, to make it look pretty.
But because procedural languages are welded to their files, programmers grew up writing monolithic programs. COBOL is nothing but character strings and display templates. FORTRAN has its FORMAT statement. Versions of BASIC have picture options using #'s and other symbols. Even a low level language like C has elaborate formatting options in its printf() function!
The concept of a tiered architecture is very hard for many programmers after dyers of procedural language programming. In fact, you will still hear protests that "I could do that in the database and save time!" today.
Sometimes that is true. But more often than not, it is a false saving. The display formatting will block the optimizer from using indexes on the base columns. The front end will then disassemble the formatted columns back to their source data or to another format. The real loss is that the code is harder to maintain compared to simple rows with basic data types in their columns.
Let me given two common examples. Using the proprietary CONVERT() function to make a temporal data type into a string for display. Let the applications do this for you; they have function libraries for this. You do not have to worry about national settings or appropriate rounding (which can be an application by application design decision). The worst use of CONVERT() is for string manipulations, when you have the DATEPART() and CAST() functions. Look for code where two dates are turned into strings, then the strings are compared.
The second common example is assembling a full name from the surname and firstname. This prevents using an index on the surname, gives the front end the problem of re-formatting it for available space and rules. You will see front end code that flips the name around to (firstname, surname) order for display.
Basic Declarative Programming Heuristics
Structured programming actually has mathematical proofs of correctness. You can and should read Dijkstra, Wirth and Manna. This is theory that actually helps you program. Declarative programming is not yet at that point. But we can give you heuristics. These are things to try when you see a particular situations; they are not laws of the universe, just the way the smart money bets.
I have a two part series on this topic (see references) which gives examples of procedural, semi-procedural and declarative programing styles. But for now, let me give a quick list of "high level hints" that might help.
Prefer One Statement over Many
The more work you can do in one SQL statement without the use of T-SQL flow control, the better the code will probably work. Thus, if your procedure body has two or more references to the same table, you can probably combine them and access that table only once.
You can get rid of a lot of IF-THEN-ELSE control logic with CASE expressions. Before the CASE expression, there were unexpected pitfalls in trying to apply procedural logic to SQL. The classic example is an UPDATE statement that was part of Sybase/SQL Server classes for decades. You have a bookstore and want to change the prices of the books. Any book over $25 will be discounted by 10% (we will advertise that) and books under $25 will be increased by 15% (we will not advertise that).
The classic structured programming answer in a pseudo-code skeleton is:
OPEN FILE (Bookstore);
READ (price) FROM Bookstore;
WHILE NOT EOF (bookstore)
IF price < 25.00
THEN UPDATE Books SET price = price * 1.10
ELSE UPDATE Books SET price = price * 0.85;
FETCH NEXT Bookstore;
CLOSE FILE (Bookstore);
This works and can be easily converted from pseudo-code to a cursor. The usual first attempt in pure SQL is this:
SET price = price * 1.10
WHERE price < 25.00;
SET price = price * 0.85
WHERE price >= 25.00;
But it does not work! Look at a book that sells for $24.95 currently. Its price jumps to $27.45 when the first UPDATE is done. But when we do the second UPDATE, the price goes down to $23.33 finally. That is not what we meant to do. Flipping the updates does not help; a book on the cusp can be updated twice.
This was the classic argument for cursors. We had all kinds of horrible multiple table scan procedures for this kind of problem in those days. Today, we have the CASE expression, which is declarative and makes one table scan.
WHEN price < 25.00
THEN price * 1.10
ELSE price * 0.85
This heuristic has two parts:
- Look for the same table to appear in multiple statements; there is a good chance they can be combined.
- If the statements are controlled with IF-THEN-ELSE, try to replace the branches with CASE expressions inside a single statement.
This same heuristic applies to INSERT INTO statements. One form of this is to insert an initial set of rows, followed by a selected set of rows. The skeleton is like this:
INSERT INTO Foobar (..) VALUES(..);
INSERT INTO Foobar SELECT .. FROM.. WHERE..;
Which can often be written as:
INSERT INTO Foobar (..)
(SELECT X.* FROM (VALUES (..)) AS X)
SELECT .. FROM.. WHERE..;
Again CASE expressions can be used inside the SELECT statements.
Perhaps the best example of this heuristic is the MERGE statement, which lets youy combine INSERTs and UPDATEs into one statement. I will not discuss it here, but suggest strongly that you read up on it.
Avoid Local Variables
T-SQL has to allocate local variables and they are very often not needed. A common pattern is:
CREATE FUNCTION Foobar (..)
RETURNS <data type>
DECLARE @local_return_holder <data type>;
= <scalar query>:
Which could have been simply:
CREATE FUNCTION Foobar (..)
RETURNS <data type>
RETURN (<scalar query>);
The other disadvantage of local variables is that they can hide expressions from the optimizer.
SET @local_x = (<scalar query>); -- has to load local variable
<statement using @local_x>;
Could have been:
<statement using (<scalar query>)>; --optimizes whole expression
You can also nest function calls rather than doing step-by-step processing with intermediary results kept in local variables. One of my favorite examples of this is building a sequence of REPLACE() calls like this:
SET @x = REPLACE (@x, 'a', 'A');
SET @x = REPLACE (@x, 'B', 'b');
When you can go 32 levels deep with REPLACE (REPLACE..(REPLACE (@x, 'z', 'Z') ..))
If you have trouble with the concept, talk to a LISP programmer. That entire language is nothing but nested functions calls.
Prefer JOINs over Loops
There are a lot of other tricks to avoid row-by-row processing. For example, without going into details, a for-loop can often be replaced with a join to a Series table. A Series table is a set of integers from one to some upper limit.
Look for Things that Should be in the DDL
The use of IF-THEN logic in a procedure to clean up data at run time is often a sign that you really need a CHECK() in the DDL to prevent it from being bad in the first place. For example:
SET T.x = COALESCE (T.x, 0);
IF (x > 12)..;
Might be a sign that you need to have defaults and constraints on some column. Change "x INTEGER" to something like this in its table:
CREATE TABLE T
x INTEGER DEFAULT 0 NOT NULL
CHECK (x BETWEEN 0 AND 12),
Avoid CLR and XML Contamination
Keep external languages outside of the schema. It is hard enough to maintain SQL without adding other languages to the mix. What do you do when you find a CLR module in a language you don't know? They might not follow the same definitions of such basic functions as MOD(), SUBSTRING() and arithmetic rounding. My favorite example is the differences between C# and VB, two Microsoft proprietary languages that do agree on the representation of Booleans.