Declarative SQL: Using CHECK() & DEFAULT

SQL is unusual in that data is not passively stored. Instead, you use declarative SQL to specify the rules that underlie the data and protect its integrity. When used properly, constraints can avoid the need for a lot of logic elsewhere. CHECK() and DEFAULT can do a lot to ensure that your data is correct.

In this series, there are …

SQL signaled a major leap from file systems to relational database systems. A file is basically passive; it just holds the data. SQL, in contrast, allows you to specify the data, define it, and protect its integrity. While the choice of data type for a column is an important part of this, in this article we will be more interested in the CHECK() and DEFAULT options, because they are important for ensuring that your data is correct.

Traditionally, file descriptions are in the application programs, such as the DATA DIVISION in COBOL, the FORMAT statement in FORTRAN, fopen() in C, or the Pascal ‘TYPE <file name>: FILE OF <record descriptor>‘ declaration.

But the Data Definition Language (DDL) in SQL is a separate, active sub-language. It is the DDL that defines the data types, not the host program; the DDL applies CHECK constraints, not the host program; the DDL maintains certain relationships, not the host program.

CHECK()

We probably all underuse CHECK() constraints. Part of the problem is cultural: they do not exist in other programming languages. People generally rely on 'transfer of training' when picking up new skills, and it is more difficult to assimilate things they have never come across. A more subtle part of the problem is those SQL NULLs! Our language has three logical values: True, False, and Unknown. The Unknown value shows up when you try to use a predicate with a NULL value. One of the principles of SQL is that NULLs propagate. This is why good SQL avoids NULL-able columns; the old heuristic was that fewer than 5% of the columns in a schema should be NULL-able.

When you execute a Data Manipulation Language (DML) statement (INSERT, MERGE, UPDATE, DELETE or SELECT), an Unknown logical result from a predicate is treated as False. But an Unknown result in a DDL CHECK() clause is treated as True. We call this the 'benefit of the doubt' option in SQL. Look at this skeleton code to see what I mean:
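Here is a minimal sketch; the table and column names are mine, not from any real schema:

CREATE TABLE Foobar
(foo_col INTEGER          -- deliberately NULL-able
   CONSTRAINT foo_col_is_positive
     CHECK (foo_col > 0));
-- When foo_col is NULL, the predicate (foo_col > 0) is Unknown,
-- and the CHECK() gives the row the benefit of the doubt.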

Now try these test insertion statements.
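These assume the Foobar sketch above:

INSERT INTO Foobar (foo_col) VALUES (42);    -- True: accepted
INSERT INTO Foobar (foo_col) VALUES (-42);   -- False: rejected by the CHECK()
INSERT INTO Foobar (foo_col) VALUES (NULL);  -- Unknown: accepted, benefit of the doubt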

If you want to play with it a bit, try these simple queries.
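Again using the Foobar sketch, notice how the DML side treats Unknown as False:

SELECT foo_col FROM Foobar WHERE foo_col > 0;        -- the NULL row does not show up
SELECT foo_col FROM Foobar WHERE NOT (foo_col > 0);  -- the NULL row still does not show up
SELECT foo_col FROM Foobar WHERE foo_col IS NULL;    -- only IS NULL finds it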

The simplest CHECK() constraints on a column use basic predicates. They establish such rules as minimums and/or maximums for numeric values, or the allowed values for a column. They can use a LIKE predicate to define a pattern for strings. They can include expressions, but you would, in general, prefer to keep these fairly simple. This information is passed on to the optimizer along with statistics.

The most overlooked constraint for numeric data is one that will prevent negative or zero values in a column.
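A minimal sketch, with illustrative names:

CREATE TABLE Orders
(order_nbr INTEGER NOT NULL PRIMARY KEY,
 order_qty INTEGER NOT NULL
   CONSTRAINT positive_order_qty CHECK (order_qty > 0));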

There is a horror story about a simple order entry system that allowed negative quantities to be ordered. When it did the price extensions, a negative quantity produced a negative total, which the system turned into a refund to the customer. It did not take long for less-than-honest people to figure out how to game the system.

Another overlooked constraint is for strings. You can enforce upper and lower case conventions with the UPPER() or LOWER() functions. Likewise, even simple regular expressions can save a lot of grief.

The ISO convention is to use only unaccented uppercase ASCII Latin letters, digits, and a limited set of punctuation marks for standardized encoding schemes, so a simple constraint like the one sketched below saves you a lot of programming. This subset of ASCII is part of all the Unicode character sets, so everyone on earth can write ISO encoding strings, whatever their native language.
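Here is a sketch of such a constraint; the part-code layout (two uppercase letters followed by three digits) and the names are made up for illustration. The COLLATE clause keeps the pattern case-sensitive even if the database default collation is not:

CREATE TABLE Parts
(part_code CHAR(5) NOT NULL PRIMARY KEY
   CHECK (part_code COLLATE Latin1_General_BIN2
          LIKE '[A-Z][A-Z][0-9][0-9][0-9]'));   -- two uppercase letters, three digits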

The IN() predicate is useful for limiting the column to a fixed list of constant values. Think of it as a "local table lookup" in the DDL. But the question becomes: when do you use the IN() predicate, and when do you use the REFERENCES clause? My rule of thumb is that if the list of possible values is short and static, then use the IN() predicate in a CHECK() constraint. If the list of possible values is long or dynamic, then it is better to reference an actual lookup table. The definitions of "long" and "static" are fuzzy; remember, this is a heuristic, not mathematics. SQL Server supports several thousand values in the IN() predicate.
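A sketch of the short-and-static case, with made-up names and values:

CREATE TABLE Shipments
(shipment_nbr INTEGER NOT NULL PRIMARY KEY,
 shipping_method CHAR(10) NOT NULL
   CONSTRAINT valid_shipping_method
     CHECK (shipping_method IN ('GROUND', 'AIR', 'OVERNIGHT')));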

Some SQL products have special optimizations for long lists; they sort the list and use a binary search, put it in a hash table, or self-organize it. Do not depend on this in SQL Server. A really long list should probably be sorted in the code from the most likely to the least likely values.

Computed Checks

Since a constraint eventually evaluates to a predicate, we often forget that you can do a lot of math inside a predicate. Consider, for example, a simple check digit like the one used in many barcode and credit card validation applications.

Each digit in the barcode is assigned a weight by its position within the barcode string. The final position holds the check digit. If you are faced with validating the old ten-digit UPC barcode, where the 10th digit is the check digit, then you need modulus 10 arithmetic: multiply each digit by its weight and test the total. The multiplication will automatically cast each digit to an integer, but you could be explicit and use a CAST() function; this also adds documentation.
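Here is a sketch of such a constraint. The exact weights differ between check-digit schemes, so treat the weighting used here as illustrative; the point is that the whole weighted-sum test is one declarative predicate:

CREATE TABLE Products
(upc_code CHAR(10) NOT NULL PRIMARY KEY
   CHECK (upc_code LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'),
 CONSTRAINT valid_check_digit
   CHECK ((CAST(SUBSTRING(upc_code,  1, 1) AS INTEGER) * 1
         + CAST(SUBSTRING(upc_code,  2, 1) AS INTEGER) * 2
         + CAST(SUBSTRING(upc_code,  3, 1) AS INTEGER) * 3
         + CAST(SUBSTRING(upc_code,  4, 1) AS INTEGER) * 4
         + CAST(SUBSTRING(upc_code,  5, 1) AS INTEGER) * 5
         + CAST(SUBSTRING(upc_code,  6, 1) AS INTEGER) * 6
         + CAST(SUBSTRING(upc_code,  7, 1) AS INTEGER) * 7
         + CAST(SUBSTRING(upc_code,  8, 1) AS INTEGER) * 8
         + CAST(SUBSTRING(upc_code,  9, 1) AS INTEGER) * 9
         + CAST(SUBSTRING(upc_code, 10, 1) AS INTEGER) * 1) % 10 = 0));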

Some of these expressions will be a bit unnatural and complicated to write without loops. However, the payoff is that the check is done one way, one time, in one place in the database, so there is no chance of an application program forgetting to do it. People too often forget that the function of the database is to maintain data integrity, as well as to store the data.

See how easy it is to avoid procedural code in favor of declarative code?

As an exercise, try doing the Luhn algorithm (https://en.wikipedia.org/wiki/Luhn_algorithm) with CASE expressions instead of the look-up table. The CHECK() constraint is a little long, but pretty fast.

CASE Expressions

You can use a CASE expression in a CHECK() constraint. It becomes a replacement for if-then-else logic in procedural code. The general outline for this idiom is:
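Here is a sketch of that outline with made-up columns; the business rule itself (minimum hourly rates by employee type) is only for illustration:

CREATE TABLE Employees
(emp_nbr INTEGER NOT NULL PRIMARY KEY,
 emp_type CHAR(4) NOT NULL,            -- 'FULL' or 'TEMP' in this sketch
 hourly_rate DECIMAL(8,2) NOT NULL,
 CONSTRAINT valid_rate_for_type
   CHECK (CASE
            WHEN emp_type = 'FULL' AND hourly_rate >= 15.00 THEN 'T'
            WHEN emp_type = 'TEMP' AND hourly_rate >= 10.00 THEN 'T'
            ELSE 'F'
          END = 'T'));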

The CASE expression tests the WHEN clauses in the order they are written and stops at the first predicate that evaluates to TRUE. Therefore, you can get a performance boost by putting the most likely test at the front of the CASE expression. Finally, be careful about the ELSE clause; I like to write "ELSE NULL" explicitly so that I can use it as a marker when I discover a need to put more test code in the expression.

Computed Columns

Computed columns are a bit like a VIEW, but defined in the DDL rather than the DML. The syntax is simply to declare the column as an expression:
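A minimal sketch, with invented column names:

CREATE TABLE OrderItems
(item_price DECIMAL(10,2) NOT NULL,
 order_qty INTEGER NOT NULL,
 extended_price AS (item_price * order_qty) PERSISTED);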

The optional keyword PERSISTED says that the result of the formula will actually be materialized in the table. The SQL engine looks to see if any of the component columns of the formula have changed and re-computes the value of the column. Without the PERSISTED keyword, the SQL engine performs the computation only when the column name is used in the DML.

Because division by zero is a common problem, you will find this trick to be handy. The quotient of a division by zero becomes a NULL, rather than an error.
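A sketch of the trick: NULLIF() turns a zero divisor into a NULL, so the quotient becomes NULL instead of raising an error (the names are mine):

CREATE TABLE SalesSummary
(total_amount DECIMAL(12,2) NOT NULL,
 sale_cnt INTEGER NOT NULL,
 avg_sale AS (total_amount / NULLIF(sale_cnt, 0)));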

I would not use a user-defined function (UDF) in a computed column. In general, they cannot help the optimizer, they do not port to other RDBMS products, and there will be problems with indexing the column if they are not flagged as being deterministic and precise.

Persisted computed columns can be indexed in SQL Server. They must be deterministic, meaning that the formula always returns the same result when given the same inputs. There are also some precision requirements, but these are usually not a problem; we do not use approximate data types very often.

Multi-Column CHECK Constraints

CHECK() constraints come in two flavors. So far, we have talked about the simple column-level constraint. However, a CHECK() constraint can also be a multi-column predicate. Probably the most common example of this is with temporal values: a duration has to be modeled using the ISO open-interval approach, which means we have (start_date, end_date) pairs in the tables. It also means we need a constraint, CHECK(start_date <= end_date), for each interval.
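A sketch of such a pair, with illustrative names; a NULL end_date can mark an interval that is still open, and the CHECK() gives it the benefit of the doubt:

CREATE TABLE Memberships
(member_nbr INTEGER NOT NULL PRIMARY KEY,
 start_date DATE NOT NULL,
 end_date DATE,                        -- NULL = still open
 CONSTRAINT started_before_ended
   CHECK (start_date <= end_date));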

Status attributes are related to temporal constraints and have to be modeled with multiple columns. A status is a state of being and things exist in time as well as space. A transition constraint says that the status can change only in certain ways over time. For example, you have to be born before you can die, you have to get married before you can be divorced and you can be re-married after a divorce.

In order to enforce a transition constraint, we need the current state and the previous state of an entity. But we also need to limit the pairs of valid current and previous states. Transition constraints are usually modeled with the state transition diagram. This is made up of arrows and boxes that show the direction of flow in the model (see diagram).

A declarative way to enforce transition constraints is to put the state transitions into a separate table and then reference the legal transitions. This requires that the referencing table have both the previous and the current state in two separate columns. Using this example, we would have something like this:

[State transition diagram]
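Here is a sketch of the transitions table; the state names and the allowed pairs come from the marriage example above, but the exact list is illustrative:

CREATE TABLE StateChanges
(previous_state VARCHAR(10) NOT NULL,
 current_state VARCHAR(10) NOT NULL,
 PRIMARY KEY (previous_state, current_state));

INSERT INTO StateChanges (previous_state, current_state)
VALUES ('Born', 'Born'),            -- the initial state pairs with itself
       ('Born', 'Married'),
       ('Born', 'Dead'),
       ('Married', 'Divorced'),
       ('Married', 'Dead'),
       ('Divorced', 'Married'),
       ('Divorced', 'Dead');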

 The referencing table looks like this.
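Continuing the sketch:

CREATE TABLE People
(person_nbr INTEGER NOT NULL PRIMARY KEY,
 previous_state VARCHAR(10) NOT NULL,
 current_state VARCHAR(10) NOT NULL,
 FOREIGN KEY (previous_state, current_state)
   REFERENCES StateChanges (previous_state, current_state));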

DEFAULT

The DEFAULT clause is attached to a column in the table and gives a value that can be used to construct a complete row with an INSERT INTO statement. There are some other uses for defaults, but I will skip those for now.

The value is usually a constant of the same data type as the column to which it belongs. The reason I specify that the types must match is that a lot of programmers:

  • put the default value in quotation marks, even when it is a numeric type,
  • do not use standard data formats, or
  • oversize or undersize the value.

Besides being sloppy programming, these habits require the database to cast the default value to the proper data type, size and precision, which can only waste resources.

Many programmers do not know that the values list of an INSERT INTO statement can include the keyword DEFAULT. If the column is NULL-able and has no declared default, then the "default DEFAULT" is NULL. It is probably less surprising that almost nobody knows about the DEFAULT VALUES option. This is shorthand for a VALUES list made up of nothing but defaults. Obviously, this assumes that all the columns have a default value declared, which isn't often the case.
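A sketch of these idioms, with made-up names:

CREATE TABLE Tickets
(ticket_nbr INTEGER NOT NULL PRIMARY KEY,
 priority CHAR(6) NOT NULL DEFAULT 'NORMAL',
 note VARCHAR(50));                    -- NULL-able, so its "default DEFAULT" is NULL

INSERT INTO Tickets (ticket_nbr, priority, note)
VALUES (1, DEFAULT, DEFAULT);          -- priority = 'NORMAL', note = NULL

-- INSERT INTO Tickets DEFAULT VALUES; would fail here,
-- because ticket_nbr has no default declared.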

The DEFAULT option does not have to be a simple constant. It can be a system-level function, such as CURRENT_TIMESTAMP (please stop using the old Sybase/UNIX getdate() function call). This can be very handy for time-stamping a row. SQL Server has a range of such functions that, when used as DEFAULTs, are valuable for auditing changes in the data.
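For example (the table and column names are mine; the point is the system functions in the DEFAULT clauses):

CREATE TABLE ChangeLog
(change_nbr INTEGER NOT NULL PRIMARY KEY,
 change_note VARCHAR(100) NOT NULL,
 changed_by NVARCHAR(128) NOT NULL DEFAULT CURRENT_USER,
 changed_at DATETIME2(0) NOT NULL DEFAULT CURRENT_TIMESTAMP);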

But the default can also use a sequence. First, you need a CREATE SEQUENCE statement. This statement creates a schema-level object that is accessible to any user; it is not part of a table, and it is not a procedure that belongs to one user. The idea is that a user can invoke the sequence with a special syntax wherever an integer value of the sequence's data type would work.

If you want a physical model, imagine you are in the butcher store. You walk in and pull a service ticket number from a roll of tickets on the counter. Sequence numbers are generated outside the scope of the current transaction, just like the tickets. The basic options are fairly simple. Let's look at the roll of tickets (i.e., CREATE SEQUENCE). It has to be declared with parameters that define its behavior.
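Here is a sketch that matches the example in the next paragraph: the numbers run from 1 to 100 and then the roll starts over (the sequence name is mine):

CREATE SEQUENCE Ticket_Seq
 AS INTEGER
 START WITH 1
 INCREMENT BY 1
 MINVALUE 1
 MAXVALUE 100
 CYCLE;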

I can put a SEQUENCE in the DEFAULT clause of the DDL for a table:
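Continuing the sketch:

CREATE TABLE ServiceQueue
(ticket_nbr INTEGER NOT NULL
   DEFAULT NEXT VALUE FOR Ticket_Seq,
 customer_name VARCHAR(35) NOT NULL);

INSERT INTO ServiceQueue (customer_name)
VALUES ('Fred Smith');        -- ticket_nbr is filled in from the sequence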

The numbering will cycle when the sequence reaches 100 in this example and start over at 1. Besides being declarative and giving you lots of control, the NEXT VALUE can be used anywhere that allows its data type.

Conclusion

 SQL has evolved to be much more declarative than when it was first created. I would recommend that you pick a feature you do not know well, and spend an afternoon seeing how far you can push it. You will be surprised how much procedural code you can remove.