Every so often I find myself needing to build schema for something which has very little complexity in the "real world", but where the model of that part of the world becomes more and more difficult the more I think about it. Maybe this is a symptom of overthinking, but I like thinking.
To wit, phone numbers. I will provide two obvious possible schemas. Then I will discuss why I find it almost impossible to logically choose between them.
Schema 1: Each type of phone number gets its own column. I will refer to this as the "multi column" solution:
create table contacts (
-- some kind of contact key,
/* ... */,
home_phone varchar(20) null,
work_phone varchar(20) null,
mobile_phone varchar(20) null,
/* ... */
Schema 2: Phone numbers are all kept in a single column, with a second column indicating type. I will refer to this as the "single column" solution:
create table contact_phone_numbers (
-- some kind of contact key
phone_type char not null check (phone_type in ('H', 'W', 'M', /* .... */ ),
phone_number varchar(20) not null
Practical considerations, which I will ignore:
Clearly there are practical consequences of the choice. If, for example, I ask "Which contacts use this particular phone number?", the second schema is much easier to query.
On the other hand, I want a report that shows contact name, home phone and mobile phone, the first schema is much easier to query.
Based on years of experience, I do not believe it is possible to reliably predict the future requirements of the system. Maybe today most users are asking questions like the first one, but next week a report like the second example will be top priority.
Based on the premise that practical considerations cannot be reliably predicted, I generally try to base my designs on more theoretical considerations. Ie, which schema is more "relationally correct"? I think that this simple and common example demonstrates how hard it can be to answer that question.
One might look at the multi column solution and suggest that it seems, prima facie, to violate the "spirit" of first normal form. We have a data element of a single domain which is repeated in the tuple. Commonly, the "repeating groups" antipattern.
On the other hand, one might look at the single column solution and respond that it seems to look a lot like the dreaded EAV pattern.
Which point of view is correct? To me, that depends on whether we say that different kinds of phone numbers are actually members of the same domain. But this is a very difficult question to answer. Let's presume that they are the same domain, because they are all phone numbers. Then we should go with the single column solution. Easy, problem solved, right?
But wait! I can then apply similar reasoning to other parts of the model. My invoice lines, for example, might have both an "ex GST amount" and a "GST amount" column. But wait... those are both dollar amounts! So, per the reasoning of the previous paragraph, we should have a "dollar amounts" table, with each amount value accompanied by an additional value from another column indicating the type of dollar amount. Now the reasoning in the previous paragraph doesn't seem so convincing anymore, does it?
Besides this, different phone numbers can have different validation rules. A valid pattern for a mobile phone number is different from that for a landline number. So at the very least, it seems like landline numbers and mobile numbers are different domains. But work landline numbers and home landline numbers are both landline numbers, so those are the same domain. Now we have an even more unfortunate schema, where we need *two* phone number tables... landline phone numbers and mobile phone numbers. And we are stll left with the "different types of dollar amount" objection.
Can anyone provide a sound logical argument as to which of these schemas is relationally correct?