• I'm consulting at a company with a developer-designed database which has a very large table (over 2 billion rows) that has a number of small char or varchar categorical columns. Currently, there are no check constraints or foreign keys pointing to look-up tables. I'm wondering what's the best approach to enforce integrity at the DB level rather than at the app level.

    Consider the case where the categorical column is a char(1) and appears only in the subject table. It seems ridiculous to create a lookup table if the only valid values are 'M' or 'F', and given that the meaning of these abbreviations are so well known, a check constraint certainly seems better than a two row lookup table and an FK. Going one step further, there can be a case where there are 15 valid char(1) codes. Here the case for a lookup table is a bit better, since the lookup table could include a description column for clarity. We could also change the PK/FK data type to tinyint and not increase the size of the subject table. I wonder if there is any performance benefit in replacing the char(1) with a tinyint when joining.

    Then there are marginal cases where the categorical char column is larger but not too large. In these cases, there would be a minor size benefit from replacing the char columns with tinyint FKs pointing to a lookup table. But if the number of alternatives was still small (e.g. Male and Female) so clarity was not an issue, I wonder whether a check constraint would perform better than the FK and lookup table.

    For the case where the categorical character variable is quite large relative to the size of the smallest integer-type key value, it seems clear that the FK and lookup table is best from a size perspective alone. And lookup tables are preferred when there are many valid values and their presence adds clarity or when they could be used by multiple tables. Check constraints have the benefit of reducing the number of tiny lookup tables. But it seems they're only good for a 'small' number of 'small' sized categorical alternatives that appear only in one table.

    Has anyone done any timing comparisons on this issue?