A clear understanding of SQL Data Types and domains is a fundamental requirement for the Database Developer, but it is not elementary. If you select the most appropriate data type, it can sidestep a variety of errors. Furthermore, if you then define the data domains as exactly as possible via constraints, you can catch a variety of those problems that would otherwise bedevil the work of the application programmer.
In level one, we
named data elements for what they are and classified them. So we now know that
we cannot ever have, say, a generic, magical, universal “id” as a data element.
It has to be an exact name with a precise meaning in the schema. Even better,
the name should be enterprise-wide, industry-wide or global in that order of
preference.
In level two we
will decide how to get that data into the computer by picking appropriate
domains. The idea of a domain goes back to Dr. Codd. Well, actually before the
Good Doctor, since it came from mathematics. A domain is like a data type in
programming, but involves a little bit more. A domain has legal operations that
can be done on its values in addition to what comes with a data type. For
example, I can use INTEGER to record temperature in my database; that is the
data type. When I see 100, that is a value. While I can add, subtract, multiply
and divide integers, it makes no sense to do math like that on temperatures. You
cannot put three people with a body temperature of 36 °C together to boil water.
The type of scale determines what operations are allowed. But I need to know if
the number is in degrees Fahrenheit (°F). degrees Celsius (°C) or degrees Kelvin
(°K); that is the unit of measurement. Put all of these things together and you
have a domain.
While Standard
SQL has a CREATE DOMAIN statement, the schema object it creates is not really a
full domain. It is just a shorthand for the things we will discuss next.
SQL has three
major data types:
-
Numeric
- String
- Temporal
Numeric types
are broken down into exact and approximate. The exact numerics are INTEGER,
SMALLINT, DECIMAL, NUMERIC and BIGINT. They hold exact numeric values and have
well-defined mathematical operations. Approximate numerics are REAL, FLOAT and
DOUBLE PRECISION. These are floating point numbers and they have some rounding
issues; these days the IEEE Floating Standard is universally used. There is no
great surprise in SQL; most other programming languages that can do math use all
or some of these types because they were built into computer hardware long
before SQL was invented. PC and mini-computer people are probably not aware of
BCD (Binary Coded Decimal) numbers, but they were part of commercial mainframe
computers for decades. If you don't understand any of these types, then Google
them.
String types are
broken down into fixed and varying length. The fixed length strings are CHAR (n)
and NCHAR(n), where (n) is their length. The NCHAR(n) is a shorthand for
“National Character”, which really means any character from any language that
Unicode has (www.unicode.org)
in actual implementations. Likewise, CHAR(n) is the local ASCII character set.
Varying length strings are not padded out with blanks like the fixed length
strings. Picking the length and the character set for a column is a constraint
that you really need to think about before you do it. A column that is too
narrow prevents true data while a column that is too wide allows false data. One
of my favorite tricks when I see a column declared as NVARCHAR(255) is to load
it with the Heart Sutra in Chinese. This is a classic text of Zen Buddhism. If I
cannot teach them SQL, I might lead them to enlightenment.
Temporal types
are broken down into datetime and interval types. The datetime types break down
into dates and times, which makes sense. It models a point in time. A date type
includes years, months and days. A time type includes hours, minutes, seconds
and decimal sub-seconds. Put together, they make a TIMESTAMP data type in
Standard SQL, which SQL server calls DATETIME. Interval data types are durations
of time like days, hours, minutes and seconds. SQL Server does not represent
them as a special type, but uses functions with integers to get similar results.
Rounding and
truncation for all data types are implementation defined. All data types allow
for a NULL. While there is a core set of functions for the data types in the SQL
Standards, vendors all have proprietary extensions and product particulars. For
example, SQL Server has a BIT data type which is an exact numeric with only the
values {0, 1, NULL} allowed. You will also find more data types, but these are
the “Big Three” that will do most of your work.
Use numeric
types when you have to do computations on the data. That means data elements
like quantities, counts, totals and so forth. If you need to do computations
other than simple addition and subtraction, declare some extra decimal places
for rounding and overflow.
You also want to
use numeric types for codes that have an ordering. One example would be the
number of stars a restaurant has (i.e. a 1-star restaurant is not as good as a
2-star restaurant, which is not as good as a 3-star restaurant, and so forth).
Do not use numeric types for scales that do not do computation or comparisons.
The common example of this mistake is ZIP codes; the leading zeroes have
meaning, you cannot do math on them and there is no linear ordering to them.
They are tag numbers that happen to be in a hierarchy.
Use strings for
text, names, and encoding schemes that can be expressed with a Regular
Expression. For example, ZIP codes should be declared as CHAR(5). Do not try to
do math them; that works fine in COBOL, but not in SQL.
Use temporal
data types for temporal data. Yes, that sounds so obvious that I should not have
to say it. But one of the most common design errors is to use strings for date
and time data. Of course people who do this never write constraints to prevent
dates like “2010-02-31” or functions to do simple temporal math. They have
committed the design error of putting display formatting into the database
instead of the front end.
Real numbers and
Time are examples of a continuum, while the other data types are discrete. A
discrete scale has a finite number (possibly zero) between between any two
distinct values. Think about the integers {4, 9}, which have {5, 6, 7, 8}
between them. There was an episode of the kid's television iCarly in which the
heroine convinces another kid that there is a new number named Dirf
between 5 and 6. The joke is funny because it is so obviously wrong.
A continuum is a
mathematical structure that has an infinite number of data values between any
two distinct values. You can always add more and more decimal places to a real
number or a time without any limit. Floating point numbers have built-in
functions to handle rounding and computational problems, but temporal data does
not. This often means that you will need to model events with a (start_time,
end_time) pair of values. If the event is still current, then use a NULL for the
end_time value. You can then use the COALESCE() function to convert the NULL to
the CURRENT_DATE or other meaninfule value in your application programs.
Constraints are
one reason that columns in a table are nothing like fields in a record. A
constraint is a declarative clause that restricts the values in a column. The
most important one is NOT NULL. Declare every column with it the first time and
then if you decide to allow NULLs, comment the row declaration to explain what
it means in context. For example:
sale_start_date DATE NOT NULL,
sale_end_date DATE, – sale is still in progress
The CHECK
(<predicate involving the column>) is the simplest row level constraint. Any
valid predicate can be used, but some typical ones might be:
sex_code
SMALLINT NOT NULL
CHECK
(sex_code IN (0,1,2,9)),
body_temperature DECIMAL (3,1) NOT NULL
CHECK
(body_temperature BETWEEN 0.0 AND 45.0),
airport_code CHAR(3) NOT NULL
CHECK (airport_code
= UPPER (airport_code)),
zip_code
CHAR(5) NOT NULL
CONSTRAINT Valid_ZIP_Code
CHECK (zip_code
LIKE '[0-9][0-9][0-9][0-9][0-9]'),
You can also
name the constraint, as shown above. That is a good idea since it makes the
error messages easier to read. You can get pretty fancy with a CASE expression
and other predicates for complex rules with if-then logic, such as:
floob_score INTEGER CHAR(5) NOT NULL
CHECK
(CASE WHEN floob_score NOT BETWEEN 1 AND 99 THEN 'F'
WHEN floob_score = 42
AND fuzz_nbr = 17
THEN 'T' ELSE 'F' END = T'),
As an exercise,
write a CASE expression to validate a check digit. It is long but not that hard.
Besides data
integrity, constraints do two other nice things for you. They are available to
the optimizer to improve your queries, inserts, updates and deletes. They save
you a lot of front end coding; do it once here and not over and over in
thousands of application programs, present and future. They guarantee that all
the front end programs are using the same definitions for the data elements.
A CHECK()
constraint can also be placed at the table level. For example:
sale_start_date DATE NOT NULL,
sale_end_date DATE, – sale is still in progress
CONSTRAINT Validate_Sale_Duration
CHECK
(sale_start_date <= sale_end_date),
Table
constraints involve two or more columns. In standard SQL, you can also have a
CHECK() constraint that references other tables, but I will skip over that for
now; it is not widely implemented or part of SQL Server.
A strangeness of
SQL is that when we have a comparison with a NULL, the result is the logical
value UNKNOWN, not TRUE or FALSE. But in a CHECK () constraint, the TRUE and
UNKNOWN are treated alike. We give the benefit of the doubt to the UNKNOWN. “
The final option
in a row declaration is the DEFAULT clause, It is not technically part of
defining a domain, it is very useful. In Standard SQL, it appears immediately
after the data type but most products are fairly liberal about its placement in
the row declaration. It is a constant value or a call to system values of the
appropriate data type. The most common examples are a default value in the
encoding scheme for numeric and character data and CURRENT_TIMESTAMP and
CURRENT_DATE for temporal data. For example the ISO sex codes use zero for
“unknown, 1 for “male”, 2 for “female” and 9 for a lawful person (corporations
and other organizations which are recognized as “legal persons” in the context.
As an example, we might declare rows like this:
sex_code
SMALLINT NOT NULL
DEFAULT
0
CHECK (sex_code
IN (0,1,2,9)),
sale_start_date DATE DEFAULT CURRENT_DATE NOT NULL,
The purpose of
the default is to provide a value in situations where no explicit value is
given. This is usually done with INSERT INTO statements where you must construct
an entire row but you do not want to expose all the columns to a user or you
want to save some programming effort. It is not always possible to provide a
default value, but when you have one then do so.
Now that we can
build rows, we need to assemble the rows into tables in Level Three. That is
where we will learn about other kinds of constraints that apply to columns
rather than single rows.
The Series
Read about the rest of The Stairway to Database Design Series and view the other articles.