Stairway to Database Design

Stairway to Database Design Level 2: Domains, Constraints and Defaults

,

A clear understanding of SQL Data Types and domains is a fundamental requirement for the Database Developer, but it is not elementary. If you select the most appropriate data type, it can sidestep a variety of errors. Furthermore, if you then define the data domains as exactly as possible via constraints, you can catch a variety of those problems that would otherwise bedevil the work of the application programmer.

In level one, we named data elements for what they are and classified them. So we now know that we cannot ever have, say, a generic, magical, universal “id” as a data element. It has to be an exact name with a precise meaning in the schema. Even better, the name should be enterprise-wide, industry-wide or global in that order of preference.

In level two we will decide how to get that data into the computer by picking appropriate domains. The idea of a domain goes back to Dr. Codd. Well, actually before the Good Doctor, since it came from mathematics. A domain is like a data type in programming, but involves a little bit more. A domain has legal operations that can be done on its values in addition to what comes with a data type. For example, I can use INTEGER to record temperature in my database; that is the data type. When I see 100, that is a value. While I can add, subtract, multiply and divide integers, it makes no sense to do math like that on temperatures. You cannot put three people with a body temperature of 36 °C together to boil water. The type of scale determines what operations are allowed. But I need to know if the number is in degrees Fahrenheit (°F). degrees Celsius (°C) or degrees Kelvin (°K); that is the unit of measurement. Put all of these things together and you have a domain.

While Standard SQL has a CREATE DOMAIN statement, the schema object it creates is not really a full domain. It is just a shorthand for the things we will discuss next.

SQL has three major data types:

  1. Numeric
  2. String
  3. Temporal

Numeric types are broken down into exact and approximate. The exact numerics are INTEGER, SMALLINT, DECIMAL, NUMERIC and BIGINT. They hold exact numeric values and have well-defined mathematical operations. Approximate numerics are REAL, FLOAT and DOUBLE PRECISION. These are floating point numbers and they have some rounding issues; these days the IEEE Floating Standard is universally used. There is no great surprise in SQL; most other programming languages that can do math use all or some of these types because they were built into computer hardware long before SQL was invented. PC and mini-computer people are probably not aware of BCD (Binary Coded Decimal) numbers, but they were part of commercial mainframe computers for decades. If you don't understand any of these types, then Google them.

String types are broken down into fixed and varying length. The fixed length strings are CHAR (n) and NCHAR(n), where (n) is their length. The NCHAR(n) is a shorthand for “National Character”, which really means any character from any language that Unicode has (www.unicode.org) in actual implementations. Likewise, CHAR(n) is the local ASCII character set. Varying length strings are not padded out with blanks like the fixed length strings. Picking the length and the character set for a column is a constraint that you really need to think about before you do it. A column that is too narrow prevents true data while a column that is too wide allows false data. One of my favorite tricks when I see a column declared as NVARCHAR(255) is to load it with the Heart Sutra in Chinese. This is a classic text of Zen Buddhism. If I cannot teach them SQL, I might lead them to enlightenment.

Temporal types are broken down into datetime and interval types. The datetime types break down into dates and times, which makes sense. It models a point in time. A date type includes years, months and days. A time type includes hours, minutes, seconds and decimal sub-seconds. Put together, they make a TIMESTAMP data type in Standard SQL, which SQL server calls DATETIME. Interval data types are durations of time like days, hours, minutes and seconds. SQL Server does not represent them as a special type, but uses functions with integers to get similar results.

Rounding and truncation for all data types are implementation defined. All data types allow for a NULL. While there is a core set of functions for the data types in the SQL Standards, vendors all have proprietary extensions and product particulars. For example, SQL Server has a BIT data type which is an exact numeric with only the values {0, 1, NULL} allowed. You will also find more data types, but these are the “Big Three” that will do most of your work.

Use numeric types when you have to do computations on the data. That means data elements like quantities, counts, totals and so forth. If you need to do computations other than simple addition and subtraction, declare some extra decimal places for rounding and overflow.

You also want to use numeric types for codes that have an ordering. One example would be the number of stars a restaurant has (i.e. a 1-star restaurant is not as good as a 2-star restaurant, which is not as good as a 3-star restaurant, and so forth). Do not use numeric types for scales that do not do computation or comparisons. The common example of this mistake is ZIP codes; the leading zeroes have meaning, you cannot do math on them and there is no linear ordering to them. They are tag numbers that happen to be in a hierarchy.

Use strings for text, names, and encoding schemes that can be expressed with a Regular Expression. For example, ZIP codes should be declared as CHAR(5). Do not try to do math them; that works fine in COBOL, but not in SQL.

Use temporal data types for temporal data. Yes, that sounds so obvious that I should not have to say it. But one of the most common design errors is to use strings for date and time data. Of course people who do this never write constraints to prevent dates like “2010-02-31” or functions to do simple temporal math. They have committed the design error of putting display formatting into the database instead of the front end.

Real numbers and Time are examples of a continuum, while the other data types are discrete. A discrete scale has a finite number (possibly zero) between between any two distinct values. Think about the integers {4, 9}, which have {5, 6, 7, 8} between them. There was an episode of the kid's television iCarly in which the heroine convinces another kid that there is a new number named Dirf between 5 and 6. The joke is funny because it is so obviously wrong.

A continuum is a mathematical structure that has an infinite number of data values between any two distinct values. You can always add more and more decimal places to a real number or a time without any limit. Floating point numbers have built-in functions to handle rounding and computational problems, but temporal data does not. This often means that you will need to model events with a (start_time, end_time) pair of values. If the event is still current, then use a NULL for the end_time value. You can then use the COALESCE() function to convert the NULL to the CURRENT_DATE or other meaninfule value in your application programs.

Constraints are one reason that columns in a table are nothing like fields in a record. A constraint is a declarative clause that restricts the values in a column. The most important one is NOT NULL. Declare every column with it the first time and then if you decide to allow NULLs, comment the row declaration to explain what it means in context. For example:

sale_start_date DATE NOT NULL,

sale_end_date DATE, – sale is still in progress

The CHECK (<predicate involving the column>) is the simplest row level constraint. Any valid predicate can be used, but some typical ones might be:

sex_code SMALLINT NOT NULL

 CHECK (sex_code IN (0,1,2,9)),

 

body_temperature DECIMAL (3,1) NOT NULL

 CHECK (body_temperature BETWEEN 0.0 AND 45.0),

airport_code CHAR(3) NOT NULL

 CHECK (airport_code = UPPER (airport_code)),

zip_code CHAR(5) NOT NULL

 CONSTRAINT Valid_ZIP_Code

 CHECK (zip_code LIKE '[0-9][0-9][0-9][0-9][0-9]'),

You can also name the constraint, as shown above. That is a good idea since it makes the error messages easier to read. You can get pretty fancy with a CASE expression and other predicates for complex rules with if-then logic, such as:

floob_score INTEGER CHAR(5) NOT NULL

CHECK (CASE WHEN floob_score NOT BETWEEN 1 AND 99 THEN 'F'

            WHEN floob_score = 42

             AND fuzz_nbr = 17

            THEN 'T' ELSE 'F' END = T'),

As an exercise, write a CASE expression to validate a check digit. It is long but not that hard.

Besides data integrity, constraints do two other nice things for you. They are available to the optimizer to improve your queries, inserts, updates and deletes. They save you a lot of front end coding; do it once here and not over and over in thousands of application programs, present and future. They guarantee that all the front end programs are using the same definitions for the data elements.

A CHECK() constraint can also be placed at the table level. For example:

sale_start_date DATE NOT NULL,

sale_end_date DATE, – sale is still in progress

CONSTRAINT Validate_Sale_Duration

CHECK (sale_start_date <= sale_end_date),

Table constraints involve two or more columns. In standard SQL, you can also have a CHECK() constraint that references other tables, but I will skip over that for now; it is not widely implemented or part of SQL Server.

A strangeness of SQL is that when we have a comparison with a NULL, the result is the logical value UNKNOWN, not TRUE or FALSE. But in a CHECK () constraint, the TRUE and UNKNOWN are treated alike. We give the benefit of the doubt to the UNKNOWN. “

The final option in a row declaration is the DEFAULT clause, It is not technically part of defining a domain, it is very useful. In Standard SQL, it appears immediately after the data type but most products are fairly liberal about its placement in the row declaration. It is a constant value or a call to system values of the appropriate data type. The most common examples are a default value in the encoding scheme for numeric and character data and CURRENT_TIMESTAMP and CURRENT_DATE for temporal data. For example the ISO sex codes use zero for “unknown, 1 for “male”, 2 for “female” and 9 for a lawful person (corporations and other organizations which are recognized as “legal persons” in the context. As an example, we might declare rows like this:

sex_code SMALLINT NOT NULL

 DEFAULT 0

 CHECK (sex_code IN (0,1,2,9)),

 

sale_start_date DATE DEFAULT CURRENT_DATE NOT NULL,

The purpose of the default is to provide a value in situations where no explicit value is given. This is usually done with INSERT INTO statements where you must construct an entire row but you do not want to expose all the columns to a user or you want to save some programming effort. It is not always possible to provide a default value, but when you have one then do so.

Now that we can build rows, we need to assemble the rows into tables in Level Three. That is where we will learn about other kinds of constraints that apply to columns rather than single rows.

The Series

Read about the rest of The Stairway to Database Design Series and view the other articles.

This article is part of the parent stairway Stairway to Database Design

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating