RE: Normalization – SQLServerCentral

SSC Guru

Points: 260824

March 2, 2010 at 10:14 am

The main problem with Nulls that I have in normalizing tables is that Null ends up being used for both unknown and no-value. I'd rather have two versions, one for unknown, one for no-value.

For strings, a zero-length string works for no-value. But not for dates, nor for numbers. 0 works for no-value for cardinal numbers, but not necessarily for ordinals. And there really isn't a no-value that I know of for dates. You can hack that one with Jeff's solution of an arbitrary date (usually either the top value or some value too low to be real), but that's only a valid solution if your date range doesn't include the value designated, and it reduces dates to arbitrary pointers to values, instead of atomic values on their own.

Beyond that, normalization by extending optional values into sub-tables works so long as you have a managable set of options.

I've seen some really "normalized" models for people in data that went too far and became significantly unmanageable. For example, names. I have 3: personal, middle, family. But I also have a nickname, and a "most commonly goes by on forms" name. Most databases model this with "first, middle, last", and miss out on potentially important information. My legal name is "Richard Augustin Gwynne", but nobody calls me "Richard", not even my parents. I go by "Gus". On forms that want a first name and a middle initial, it's "Gus A", which doesn't even make sense (since the "A" is the initial for "Augustin", which "Gus" is short for), but it's what appears on a large number of legally binding documents and is legally acceptable. So I ocassionally get two pieces of mail, one addressed to "Richard A", one to "Gus A", both at the same address. None of the usual methods of deduping names will catch that one, since "Gus" isn't normally synonymous with "Richard".

Some models would split the name vertically and use a sequence number to indicate their order. So, with my "ID" (surrogate key of whatever sort), the data would be sequence 1 = Richard, sequence 1 = Gus... oops, just broke that model, since I have two different valid first names.

So it has to be modeled as not only sequence, but valid combinations, so combo 1, sequence 1 = Richard; combo 1, sequence 2 = Augustin; combo 1, sequence 3 = Gwynne; combo 2, sequence 1 = Richard, combo 2, sequence 2 = A; combo 2, sequence 3 = Gwynne; combo 3, sequence 1 = Richard; etc. All the "valid" combinations add up to quite a few rows of data. Reconstructing those requires extensive code compared to other models, but it does achieve the greatest precision. At the cost of often having precision exceed accuracy, which is a waste of time and effort. It also makes for a very, very complex front-end application and a steep learning curve for end users. After all, you can't just split on spaces or some such, since "Mac Donald" is one name-segment in some cases, if "Mac" is a modifier for "Donald", and is two segments if "Mac" is a nickname or some such and stands on its own.

The main advantage to such a system is that it holds "John Smith" just as easily as "Cyrano II de Savignien II de Bergerac III" (I think I misspelled his name badly, but the point is still valid). The main disadvantage is it also makes it just as difficult to enter into the system and to reconstitute from the system.

Another flawed model is "First, Last, Full". I've seen that quite often. Again, has problems and advantages.

Allowing some null columns, makes it much easier to deal with all around.

Very few things in this world are as cut-and-dry in data modeling as many people would like to think. Even gender is complicated by things like California laws and genetic flaws. (What gender, for example, is someone who is diploidy X haploidy Y, with mixed expression on primary sexual characteristics? That's not a legal oddities question, that's a reproductive-function, biology question, and not necessarily easy to answer.)

Personally, I work for whatever achieves the greatest efficiency, scalability, speed, and is easiest to work with for devs and end-users. Always, that's been a compromise. It's never been "fully normalized by relationally pure definitions", but it's been as close as I can get within the real constraints of the data.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon