First Normal Form (1NF) is the foundation of the relational model. It requires that all information is shown as scalar values in the columns of the rows of a table. That means no arrays, CSV lists, pointer chains or other fancy, frilly data structures. Scalar is not quite the same thing as atomic. The word “atom” comes from the Greek “atoma” which means “without parts” or “indivisible” in English. This is usually not much of a problem.
Consider the VIN number on your automobile; it has various codes in various places which tell you a lot about the vehicle (see How to decode your VIN number. VIN decoding information ) and has a check digit in the 9th position that helps assure that the data is correct. It is a CHAR(17)
column in your database. If you try to cut it into these sections, you destroy the information, the unique identifier of the vehicle. This column is both scalar and atomic.
Now look at the (longitude FLOAT NOT NULL
, latitude FLOAT NOT NULL
) pair. Each value is a scalar, but has no meaning in itself; the pair is atomic. Anther common example is a (start_time, end_time
) pair to model a duration.
But what about a group of columns that models a permutation from a set of elements? Is this a normalization problem, a repeating group in violation of First Normal Form? Or is it allowed? Let me use an example that near and dear to my heart, a poker game! Assume we have numbered the cards from 1 to 52 and let the presentation layers encode and display them. Let’s declare the table to look like this and dive into the problem.
1 2 3 4 5 6 7 |
CREATE TABLE Poker_Hand (player_nbr INTEGER NOT NULL PRIMARY KEY, card_1 SMALLINT NOT NULL CHECK (card_1 BETWEEN 1 AND 52), card_2 SMALLINT NOT NULL CHECK (card_2 BETWEEN 1 AND 52), card_3 SMALLINT NOT NULL CHECK (card_3 BETWEEN 1 AND 52), card_4 SMALLINT NOT NULL CHECK (card_4 BETWEEN 1 AND 52), card_5 SMALLINT NOT NULL CHECK (card_5 BETWEEN 1 AND 52)); |
Each hand is a permutation of 5 of the 52 cards in a deck. My columnar constraints make sure that each card is in the deck, but it does not protect me from having five aces or other repeater value problems. Each card has to be unique, and I can write a table constraint to assure that:
1 2 3 4 5 |
CONSTRAINT Proper_Permutation CHECK (card_1 NOT IN (card_2, card_3, card_4, card_5) AND card_2 NOT IN (card_3, card_4, card_5) AND card_3 NOT IN (card_4, card_5) AND card_4 NOT IN (card_5)) |
The last predicate looks funny, but it shows the pattern. I am going to argue this is not a repeating group; those columns are a permutation, which is atomic.
But what we would really like is a permutation that is in sorted order so we can figure out the value of the hand. That constraint is actually easier:
1 2 3 4 5 |
CONSTRAINT Sorted_hand CHECK (card_1 < card_2 AND card_2 < card_3 AND card_3 < card_4 AND card_4 < card_5) |
In SQL, we have to have this constraint to help the optimizer and protect data integrity, but it would be handy to have some way to sort these columns in case the data entry layers do not do it for us.
Everyone that its given this problem immediately envisioned a stored procedure that would take the five values, sort them and return them to their original row in the table. The only way to make this approach work for the whole table was to write an update cursor and loop thru all the rows of the table. Even Itzik Ben-Gan posted a simple procedure that loaded the values into a temporary table, then pulled them out in sorted order, starting with the minimum value, using a loop.
Swapping Columns in an UPDATE
SQL is a set-oriented language, so the UPDATE statement does not work the way it would in a procedural language. When you write this statement in SQL:
1 2 3 4 |
UPDATE Poker_Hand SET card_1 = card_2, card_2 = card_1 WHERE card_1 > card_2; |
It swaps the two cards if they are out of order. This is always a surprise to the first time SQL programmer. In his procedural language, he would write a swap like this using a working variable (or a bit-wise operation with an XOR in assembly language).
A swap(a,b) procedure is the basic tool for sorting algorithms.
All I need is to find a sequence of swaps that will sort my Poker hand. You are probably thinking that this method is a bit weak because the results are only good for sorting a fixed number of items. But a table only has a relatively small fixed number of columns, so that is not such a problem when sorting a row.
You can set up a sorting network that will sort my Poker hand, with the minimal number of exchanges, nine swaps, in this sequence:
1 2 3 4 5 6 7 8 9 |
Swap(c1, c2); Swap(c4, c5); Swap(c3, c5); Swap(c3, c4); Swap(c1, c4); Swap(c1, c3); Swap(c2, c5); Swap(c2, c4); Swap(c2, c3); |
You might want to deal yourself a hand of five playing cards in one suit to see how it works. Put the cards face down in a line on the table and pick up the pairs, swapping them if required, then turn over the row to see that it is in sorted order when you are done.
In theory, the minimum number of swaps needed to sort (n) items is CEILING (log2 (n!))
and as (n) increases, this approaches O(n*log2(n))
. The Computer Science majors will remember that “Big O” expression as the expected performance of the best sorting algorithms, such as Quicksort.
But how do I find these sequences of swaps for any number of items? There is an algorithm called the Bose-Nelson sort which is very good for small values of (n). If (n < 9)
then it is perfect, actually. But as things get bigger, Bose-Nelson approaches O(n ^ 1.585)
. In English, this method is good for a fixed size list of 16 or fewer items and goes to hell after that. Let’s be honest, the way you find the minimal swap sequence is to Google it.
Once you have the swap pairs, the obvious direct way is to change each call to swap() into a single UPDATE statement. But SQL UPDATE statements can handle any number of columns and their assignments happen in parallel. If two or more consecutive swap() calls are independent, then you can easily write a SET clause that combines them and makes fewer table accesses. Using the above swap chain, we first get this block of code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
BEGIN -- Swap(c1, c2); UPDATE Foobar SET c1 = c2, c2 = c1 WHERE c1 > c2; -- Swap(c4, c5); UPDATE Foobar SET c4 = c5, c5 = c4 WHERE c4 > c5; -- Swap(c3, c5); UPDATE Foobar SET c3 = c5, c5 = c3 WHERE c3 > c5; -- Swap(c3, c4); UPDATE Foobar SET c3 = c4, c4 = c3 WHERE c3 > c4; -- Swap(c1, c4); UPDATE Foobar SET c1 = c4, c4 = c1 WHERE c1 > c4; -- Swap(c1, c3); UPDATE Foobar SET c1 = c3, c3 = c1 WHERE c1 > c3; -- Swap(c2, c5); UPDATE Foobar SET c2 = c5, c5 = c2 WHERE c2 > c5; -- Swap(c2, c4); UPDATE Foobar SET c2 = c4, c4 = c2 WHERE c2 > c4; -- Swap(c2, c3); UPDATE Foobar SET c2 = c3, c3 = c2 WHERE c2 > c3; END; |
This fully portable, standard SQL code, but it makes nine updates. that parallelism is useful. So we combine some of the UPDATE statements, being careful not to change the effective sequence of the swap operations.
If you look at the first two UPDATE statements, you can see that they do not overlap. This means you could roll them into one statement like this:
1 2 3 4 5 6 7 |
-- Swap(c1, c2) AND Swap(c4, c5); UPDATE Foobar SET c1 = CASE WHEN c1 <= c2 THEN c1 ELSE c2 END, c2 = CASE WHEN c1 <= c2 THEN c2 ELSE c1 END, c4 = CASE WHEN c4 <= c5 THEN c4 ELSE c5 END, c5 = CASE WHEN c4 <= c5 THEN c5 ELSE c4 END WHERE c4 > c5 OR c1 > c2; |
The advantage of doing this is that you have to execute only one UPDATE statement and not two. If you could roll the statements into one single UPDATE, you would have the best of all possible worlds, but I doubt that the code would be easy to read.
We can see this same pattern in the pair of statements.
1 2 |
Swap(c1, c3); Swap(c2, c5); |
But there are other patterns, so you can write general templates
for them. Consider this one:
1 2 |
Swap(x, y); Swap(x, z); |
If you write out all possible triplets and apply these two operations on them, thus.
1 2 3 4 5 6 |
(x, y, z) => (x, y, z) (x, z, y) => (x, z, y) (y, x, z) => (x, y, z) (y, z, x) => (x, z, y) (z, x, y) => (x, y, z) (z, y, x) => (x, y, z) |
The result of this pattern is that x is lowest value of the three values, and y and z either stay in the same relative position to each other or get sorted properly. Getting them properly sorted would have the advantage of saving exchanges later and also reducing the set of the subset being operated upon by each UPDATE statement. With a little thought, we can write this symmetric piece of code.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
-- Swap(x, y) AND Swap(x, z); UPDATE Foobar SET x = CASE WHEN x BETWEEN y AND z THEN y WHEN z BETWEEN y AND x THEN y WHEN y BETWEEN z AND x THEN z WHEN x BETWEEN z AND y THEN z ELSE x END, y = CASE WHEN x BETWEEN y AND z THEN x WHEN x BETWEEN z AND y THEN x WHEN z BETWEEN x AND y THEN z WHEN z BETWEEN y AND x THEN z ELSE y END, z = CASE WHEN x BETWEEN z AND y THEN y WHEN z BETWEEN x AND y THEN y WHEN y BETWEEN z AND x THEN x WHEN z BETWEEN y AND x THEN x ELSE z END WHERE x > z OR x > y; |
While it is very tempting to write more and more of these pattern templates, it might be more trouble than it is worth because of increased maintenance and readability.
References:
- Bose, R. C. and Nelson, R. J.; “A Sorting Problem”; Journal of the ACM, vol. 9 pages 282-296. This is a recursive procedure that takes an integer and then generates swap pairs for a vector of that size.
- Knuth, D. E.; THE ART OF COMPUTER PROGRAMMING, vol 3, section 5.3.4 Networks for Sorting. There is a section of sorting networks, which model sequences of parallel swap pairs. The book gives many minimal networks.