Stairway to Database Design

Stairway to Database Design Level 8: Cursors

,

In levels one to four, we built the tables, base and virtual, of a schema. Levels five, six and seven dealt with procedural code in stored procedures and triggers. This level deals with a feature you need to avoid as much as possible. This is the article is on Cursors.

SQL is a set-oriented language. The majority of the host languages are not set-oriented. The application world still runs on procedural languages like COBOL, BASIC, FORTRAN, etc. and OO languages like C++, Java, VB and others. These languages are based on a sequential file model. This is called an impedance mismatch, a borrowed term from electronics that refers to parallel versus serial data transmission.

So how do you overcome this mismatch? A cursor converts a table into a sequential file structure that a host language can use. It also changes the SQL data types into the host language data types and sets INDICATORs for NULLs that the host language cannot handle.

The great secret of cursors is that they are based on one particular sequential file model; the original IBM magnetic tape drives. No, really. Each of the SQL statements that deal with cursors matches to statements in magnetic tape drive systems. Most of you probably never used magnetic tapes for your active data, but they were the most popular and best understood files in the world at one time. If you could get the SQL data to look and act like a tape, then you did not have to re-write the hundreds of application programs.

ORDER BY and NULLs

Whether a sort key value that is NULL is considered greater or less than a non-NULL value is implementation-defined. There are SQL products that do it either way and here is a quick list:

PostgreSQL - Higher

DB2 - Higher

MS SQL Server - Lower

MySQL - Lower

Oracle - Higher

The SQL-99 Standard added the optional {NULLS FIRST | NULLS LAST} subclause to the <ordering specification> which has been implemented in DB2 and Oracle among others. There is a story here.

In 1999 March, Chris Farrar brought up a question from one of their developers which caused him to examine a part of the SQL standard that I thought I understood. Chris found some differences between the general understanding and the actual wording of the specification. The situation can be described as follows: A table, Sortable, with two integer columns, a and b, containing two rows that happen to be in this physical order:

Sortable

alpha beta

============

NULL 8

NULL 4

Given the pseudo-query:

SELECT alpha, beta

FROM Sortable

ORDER BY alpha, beta;

The first question is whether it is legal SQL for the cursor to produce the result sequence:

Cursor Result Sequence

alpha beta

===========

NULL 8

NULL 4

The problem is that while the standard set up a rule to make the NULLs group together either before or after the known values, we never said that they have to act as if they were equal to each other. What is missing is a statement that when comparing NULL to NULL, the result in the context of ORDER BY is that NULL is equal to NULL, just as it is in a GROUP BY. This was the intent of the committee, so the expected result should have been:

Cursor Result Sequence

alpha beta

===========

NULL 4

NULL 8

Phil Shaw, former IBM representative and one of the smartest and oldest members of the committee, dug up the section of the SQL-89 Standard which answered this problem. In SQL-89, the last General Rule of <comparison predicate> specified this:

"Although "x = y" is unknown if both x and y are NULL values, in the context of GROUP BY, ORDER BY, and DISTINCT, a NULL value is identical to or is a duplicate of another NULL value." This is the grouping versus equality issue which causes all NULLs to go into the same group, rather than each in its own group. Apply that rule, and then apply the rules for ORDER BY, the NULL values of column alpha of the two rows are equal, so you have to order the rows by the columns to the right in the ORDER BY.

The sort keys are applied from left to right and a column name can appear only once in the list. But there is no obligation on the part of SQL to use a stable (sequence preserving) sort. A stable sort on cities, followed by a stable order on states would result in a list with cities sorted within each state, and the states sorted. While stability is a nice property, the non-stable sorts are generally much faster.

You can use computed columns to get specialized sorting orders. For example, construct a table with a character column and an integer column. The goal is to order the results so that it first sorts the integer column descending, but then groups the related character column within the integers. This is much easier with an example:

CREATE TABLE Fruit_Test

(fruit_name CHAR(10) NOT NULL,

taste_score INTEGER NOT NULL,

PRIMARY KEY (fruit_name, taste_score));

INSERT INTO Fruit_Test

VALUES ('Apples', 2), ('Apples', 1), ('Oranges', 5),

('Apples', 5), ('Banana', 2);

I'm looking to order the results as the following:

('Apples', 5)

('Apples', 2)

('Apples', 1)

('Oranges', 5)

('Banana', 2)

In the above, the first pass of the sort would have produced this by sorting on the integer column:

SELECT F1.fruit_name, F1.taste_score,

FROM Fruit_Test AS F1

ORDER BY F1.taste_score DESC;

One outcome might have been any of these:

Result #1

('Apples', 5)

('Oranges', 5)

('Apples', 2)

('Banana', 2)

('Apples', 1)

Result #2

('Oranges', 5)

('Apples', 5)

('Apples', 2)

('Banana', 2)

('Apples', 1)

Result #3

('Oranges', 5)

('Apples', 5)

('Banana', 2)

('Apples', 2)

('Apples', 1)

Result #4

('Apples', 5)

('Oranges', 5)

('Banana', 2)

('Apples', 2)

('Apples', 1)

If you use:

SELECT F1.fruit_name, F1.taste_score,

FROM Fruit_Test AS F1

ORDER BY F1.taste_score DESC, F1.fruit_name ASC;

Result

('Apples', 5)

('Oranges', 5)

('Apples', 2)

('Banana', 2)

('Apples', 1)

But this is not what we wanted -- the order within fruits has been destroyed. Likewise:

SELECT F1.fruit_name, F1.taste_score

FROM Fruit_Test AS F1

ORDER BY F1.fruit_name ASC, F1.taste_score DESC;

Results

('Apples', 5)

('Apples', 2)

('Apples', 1)

('Banana', 2)

('Oranges', 5)

But this is still not what we wanted -- the order within scores has been destroyed. We need a dummy column to preserve the ordering, thus.

SELECT F1.fruit_name, F1.taste_score,

(SELECT MAX(taste_score)

FROM Fruit_Test AS F2

WHERE F1.fruit_name = F2.fruit_name)

AS taste_score_preserver

FROM Fruit_Test AS F1

ORDER BY taste_score_preserver DESC,

F1.fruit_name ASC, F1.taste_score DESC;

CURSOR Statements

First, you allocate working storage in the host program. This had to be explicitly done in early programming languages so that records from the tape could be buffered. The difference in SQL versus a native file tape is that the SQL data has to be converted to the matching host-language data types. For example, the C language uses an ASCII nul to terminate a string while other languages have the length of the string at the start. Since other languages do not have a NULL, we also need to allocate an INDICATOR for each NULL-able column.

While the ANSI/ISO Standards have syntax for all of the ANSI Standard languages, you will find that there are some differences among the procedural and OO languages. I am goign to skip over this since it is more of a task for the application developers than the database guys.

DECLARE CURSOR Statement

You might have noticed that in SQL, the keyword CREATE builds persistent schema objects. The keyword DECLARE builds transient objects that disappear with the end of the session in which they were build. This is why you say DECLARE CURSOR and not CREATE CURSOR. T-SQL differs from Standard SQL by having more options. These options have to do with how the data is read, whether you can scroll back and forth, if it is copied over to tempdb, and so forth.

The most important characteristics are: (1) Can I update the base table with this cursor or is it read-only? (2) Can I scroll back and forth in the data? Remember that punch cards and early tape systems could only read forward. The basic syntax is:

<declare cursor> ::=

DECLARE <cursor name>  CURSOR [<< lots of options!!>>]

FOR <cursor specification>

 

<cursor specification> ::=

<query expression> [<order by clause>]

[<updatability clause>]

 

<updatability clause> ::= FOR UPDATE [OF <column name list>]

 

<order by clause> ::= ORDER BY <sort specification list>

The <query expression> is executed and the results are saved, so they can be sorted. The result set will have the column aliases, not the original column names. If you have no ORDER BY clause, then the order is non-deterministic.  The <updatability clause> says that you want to update certain columns in the result set that map back to unique rows in the base table. If you don't give an explicit list of columns, then all of them are assumed.

ORDER BY Clause

Contrary to popular belief, the ORDER BY clause is not part of the SELECT statement; it is part of a CURSOR declaration. Newbies will put an ORDER BY clause on a VIEW, derived table or CTE and not stop to think that it makes no sense. A table has no ordering by definition and virtual tables are still table.

The reason that people think it is part of the SELECT statement is that the only way you can get to the result set of a query in a host language is via a cursor. When a vendor tool builds a cursor under the covers for you, they usually allow you to include an ORDER BY clause on the query.

There are more complications to ORDER BY than most people think. Should NULLs sort high or low since they are not values? Standard SQL solved the problem of how NULLs sort with an optional <ordering specification> sub-clause; here is the syntax:

<order by clause> ::=

ORDER BY <sort specification list> [<ordering specification>]

 

<ordering specification> ::= [ASC | DESC] {NULLS FIRST | NULLS LAST}

You can use a positional number instead of a column name in T-SQL. Of course, it gives you no information about the data element and a change to the SELECT clause will screw up the sort. Oh, positional numbers are not allowed when the ORDER BY is in an OVER() window clause. It is also legal to use computed values.

Good programming practice: The sort keys should be column names that appear in the SELECT clause. If you want to sort on a computation, put the function calls or expressions in the SELECT list, name that column and use the name in the ORDER BY clause. This lets the user see what values the sorting is done on. Think about it, what good is a report or display when you have no idea how it was sorted?

Furthermore, the sorting columns pass information to middle-tier machines that can re-sort the data before distributing it to other front-end clients. The sort order is based on the collation sequence of the column to which is attached.

OPEN Statement

This is one time when SQL uses the same keyword as most old file systems. The difference is that back in the dark ages, we had to give a channel or tape drive number that connected the tape drive to the program.

The OPEN <cursor name> statement positions an imaginary read/write head before the first record in the cursor. FETCH statements can then move this imaginary read/write head from record to record. When the read/write head moves past the last record, an exception is raised, like an EOF (end of file) flag in a magnetic tape file system.

Not all sequential file systems did this. Some systems would buffer the first record when the file was opened, some set the EOF flag when they read the last record. This made for some major problems when early SQL programmers tried directly replace their READ statements with FETCH statements, but did not change the program's control logic.

Another problem has to do with a SELECT that has aggregate functions in it. The aggregate functions drop all the NULLs before they do their computations; they also set a warning message about the missing data. You could really need this information!

But when does the cursor tell you? It could do it when you DECLARE CURSOR, or when you OPEN the cursor or when you fetch the record constructed from the row that had the NULLs. The corr3eddt answer is "yes" -- it is implementation defined and different products do it at each of those three statements.

FETCH Statement

The FETCH statement is the cursor version of a READ. It gets a row out of the cursor, converts the data to the host language data types, and puts the values into the local variables. In T-SQL this means local T-SQL variables with a @ in front of them. Here is the syntax:

<fetch statement> ::=

FETCH [[<fetch orientation>]

FROM] <cursor name>

INTO <fetch target list>

 

<fetch orientation> ::= NEXT | PRIOR | FIRST | LAST

| {ABSOLUTE | RELATIVE} <integer value>

I am going to ignore NULLs and INDICATORs in this discussion.

The <fetch orientation> tells the read/write head which way to move. NEXT and PRIOR read one record forward or backward from the current position. FIRST and LAST put the read/write on the fist or last records respectively. The ABSOLUTE fetch moves to a given record number. The RELATIVE fetch moves the imaginary read/write head forward or backward (n) records from the current position. Again, this is a straight imitation of a sequential file system.

CLOSE Statement

The CLOSE <cursor name> statement resets the cursor read/write head to a position before the first row in the cursor. The cursor still exists, but must be re-opened before it can be used again. This is similar to the CLOSE FILE operations in FORTRAN or COBOL, but with an important difference! The cursor can be re-computed when it is re-opened.

DEALLOCATE Statement:

The DEALLOCATE CURSOR statement frees the working storage in the host program. Think of it as physically dismounting a tape from the tape drive and freeing buffer space in a sequential file system.

Positioned UPDATE AND DELETE Statements

You can use a cursor to update and delete rows in a base table. A positioned INSERT would not make any sense because it woudl destroy the sort order of the cursor.

Obviously, the cursor needs an explicit or implicit <updatability clause> of FOR UPDATE for this to work and has to be in the same module as the positioned statements. You get an exception when you try to change a READ ONLY cursor or if the cursor is not positioned on a row.

The clause "CURRENT OF <cursor name>" refers to the record that the imaginary read/write heads is on. This cursor has to map back to one and only one row in the base table.

UPDATE Statement:

<update statement: positioned>

::= UPDATE <table name>

SET <set clause list>

WHERE CURRENT OF <cursor name>

The cursor remains positioned on its current row, even if an exception condition is raised during the update attempt. This is straight out of the magnetic tape file model. When a record was updated, the image in the buffer sat in place and over-wrote the original tape record when the read/write head moved off of it.

DELETE FROM Statement:

<delete statement: positioned>

::= DELETE FROM <table name>

WHERE CURRENT OF <cursor name>

If, while the cursor is open, another DELETE FROM or UPDATE statement attempts to modify the current cursor record, then a cursor operation conflict warning is raised. The transaction isolation level then determines what happens. If the <delete statement: positioned> deleted the last cursor record, then the position of the cursor is after the last record; otherwise, the position of the cursor is before the next cursor record.

This is straight out of the magnetic tape file model. When a record was deleted from a tape, a bit flag at the start of the record was set. The read/write would move forward to the next active record on the tape.

How to Use a CURSOR

user them is the best advice. Cursors convert an SQL database into a 1960's tape or punch card system. It will have set a lot of locks, used tempdb for working storage and eaten up lots of resources. This is only one reason cursors run so slow. The other reason is that the host program requires time to process each record one at a time rather than buffering an entire result somewhere.

As an example, in the Dark Ages of SQL, we would write a report, building it one line at a time from a cursor. This is why T-SQL has CONVERT() and MONEY in its Sybase Code Museum. We were formatting data for display in SQL ! There was no concept of a tiered architecture yet. Every report had a stored procedure with a cursor in it that mimicked a COBOL program.

The old argument for cursors in the original Sybase SQL Server training course was this example. You own a bookstore and you want to change prices; say, all books $25 and over are reduced 10%, and all books under $25 are increased 15%. The first attempt is usually:

BEGIN

UPDATE Books

   SET price = price * 0.90

WHERE price >= $25.00;

UPDATE Books

   SET price = price * 1.15

WHERE price < $25.00;

END;

Oops! Look at a book which was $25.00 ((25.00 * .90) *1.10) = $24.75. So you were told to cursor through the table, and change each row with a cursor.

Today you solve the problem with one simple statement:

UPDATE Books

SET price

= CASE WHEN price < $25.00;

THEN price * 1.15

WHEN price >= $25.00

THEN price * 0.90

ELSE price END;

With new functions and expressions we did not have years ago, I am not sure that you will ever need to write a cursor again.

Here is a list of basic heuristics for cursors:

  1. Limit the number of rows and columns in the cursor's SELECT statement to only those required the desired result set. This will avoid unnecessary fetching of data that in turn will require fewer server resources and increase cursor performance.
  2. Use FOR READ ONLY instead of UPDATE cursors if possible. You will have to watch the transaction isolation level however.
  3. Opening a cursor with some options can cause its rows to be copied to a working table in many products or locked at the table level in others. Only use what you need.
  4. Do a CLOSE cursor as soon as you are finished with the result set. This will release any locks on the rows. Always remember to explicitly deallocate your cursors when you are finished.
  5. Look for your product options. For example, SQL Server has FAST_FORWARD and FORWARD_ONLY cursor options when working with unidirectional, read-only result sets. Using FAST_FORWARD defines a FORWARD_ONLY, READ_ONLY cursor with a number of internal performance optimizations.
  6. Be careful with modifying large number of rows via a cursor loop that are contained within a transaction. Depending on the transaction isolation level, those rows may remain locked until the transaction is committed or rolled back, possibly causing resource contention on the server.
  7. Do not nest cursors inside cursors. This can get really slow, really fast for the same reason as heuristic #6, only multiplied.
  8. Do not write more than five cursor in your entire career. This is Mother Celko's advice.

Conclusion

These eight stair levels give you the basics of an SQL database. Yes, there are more details for each level, but the idea of the series is to change your mindset. It is like learning any new language. If you have a good idea of the structure and conventions, adding new stuff is much easier.

This article is part of the parent stairway Stairway to Database Design

Rate

4.67 (6)

You rated this post out of 5. Change rating

Share

Share

Rate

4.67 (6)

You rated this post out of 5. Change rating