my favorite normalization case study

Question

my favorite normalization case study

patrickmcginnis59

SSCertifiable

Points: 6483
More actions
January 13, 2012 at 11:29 am

#251399

the problem is an attendance tracker for an educational institution.
The foreign keys would have student number, course number, and meeting date. The data (attendance) could be as small as a bit.
Long story short, I used an array of characters and a nice scrolly matrix gadget with dates across the top and students down the side, the whole matrix being a course section, and I allowed a few extra attendance types. Maximum size of the group per student was greater than the max number of meetings I anticipated.
On the surface, it looks like repeating groups, but according to database theory as I stubbornly choose to interpret it (http://www.dbdebunk.com/page/page/622318.htm "...Maximizing performance is a physical optimization issue..."), as long as the one table with repeating groups was presented to the querying user as two or more joined tables, I'm all set, as the repeating group is simply a physical storage issue which is outside the scope of the topic of normalization.
Its worth noting that since there was no sql server involved, I was essentially the storage engineer, and for reporting, I fed a stream of attendance rows which were normalized with the foreign keys to our reporting system.
Whadya think? Anybody else ever see some funny normalization cases?

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply

GSquared SSC Guru Points: 260824 More actions · Answer 1

I'm more used to "funny" denormalization cases. Busy fixing a bunch of performance problems caused by an EAV structure right now.

In your case, it sounds like your bitmapping the data instead of storing it "relationally". In that case, normalization doesn't even apply, so don't worry about it.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

TomThomson SSC Guru Points: 104773 More actions · Answer 2

I don't see this as repeating groups in the first place; "repeating groups" is a topic that's much misunderstood, with some people classifying far more things as repeating groups than others. When Ted Codd (in his paper "A Relational Model of Data for Large Shared Data Banks" published in CACM vol 13 no 6, June 1970) defined what is now called 1st normal form a repeating group was an attribute in a non-simple domain, that is in a domain where an individual attribute actually represented multiple values in underlying domains. A string of distinct attributes like "attendance at 1st meeting", "attendance at second meeting" and so on in a single row was not a repeating group, and would not be expected to be normalised out. I know it is currently regarded by a lot of people as heresy to rely on Codd's definition of first normal form, but I find it quite satisfactory since I can't think of any update anomalies, insertion amomalies, or deletion anomalies that arise from using that definition rather than one that would forbid me to store someone's home landline number and his mobile number in the same row because "two telephone numbers is a repeating group". There is of course the restriction that the number and domains of attributes in each row of a relation is defined by the relation, so is the same for every row in the relation, and that often makes it necessary to split things out to avoid simulating a variable number of columns by use of nulls, but I don't see how that applies here; and as a practical matter it doesn't make much sense to have a relation with so many attributes that the sheer number of them adds to the programming difficulty, but I don't see that as applying here either.

edit: incidentally, "repeating group" is pre-relational terminology, it's a pity that term isn't shunned and replaced for relational theory by Codd's "non-simple domain".

Tom

PaulB-TheOneAndOnly SSC Guru Points: 66772 More actions · Answer 3

patrickmcginnis59 (1/13/2012)
Long story short, I used an array of characters and a nice scrolly matrix gadget with dates across the top and students down the side, the whole matrix being a course section, and I allowed a few extra attendance types.

If the above is the description of your "course" table I can tell this is not normalized at all.

It would be easier for us to understand your scenario if a list of involved tables is provided including a summary of the data each one contains alongside PKs and FK relationships.

_____________________________________
Pablo (Paul) Berzukov

Author of Understanding Database Administration available at Amazon and other bookstores.

Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.

patrickmcginnis59 SSCertifiable Points: 6483 More actions · Answer 4

PaulB-TheOneAndOnly (1/14/2012)
patrickmcginnis59 (1/13/2012)
Long story short, I used an array of characters and a nice scrolly matrix gadget with dates across the top and students down the side, the whole matrix being a course section, and I allowed a few extra attendance types.
If the above is the description of your "course" table I can tell this is not normalized at all.
It would be easier for us to understand your scenario if a list of involved tables is provided including a summary of the data each one contains alongside PKs and FK relationships.

Well if it had been normalized, and I wanted to track each attendence day, I would have

COURSE_ID VARCHAR(20),

STUDENT_ID VARCHAR(10),

MEETING_ID DATETIME,

ATTENDANCE VARCHAR(1)

With everything but ATTENDANCE being a foreign key. As I actually did it, I had:

COURSE_ID VARCHAR(20),

STUDENT_ID VARCHAR(20),

ATTENDANCE VARCHAR(1) OCCURS MANY TIMES

and

COURSE_ID VARCHAR(20),

MEETING_ID, DATETIME OCCURS MANY TIMES.

and more importantly, the attendance and meeting_id occupy corresponding positions in their respective rows. I did it that way because everything worked nicer when it was "prejoined" so to speak.

Now this is according to my rather informal understanding of normalization. I could elaborate further if I took the time and read up, and of course I'm the new RDBMS person here, feel free to poke holes in anything I post. Its also worth noting that a valid optimization is to only store the absences or other abnormal stuff we want track and otherwise assume the student was present for that meeting.

PaulB-TheOneAndOnly SSC Guru Points: 66772 More actions · Answer 5

Would you mind in writting a query to answer the question... "Did Peter Bishop attend Particle Physics 501 on January 9th, 2012?"

Now... write a query to answer the same question on the 3NF structure.

_____________________________________
Pablo (Paul) Berzukov

Author of Understanding Database Administration available at Amazon and other bookstores.

Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.

patrickmcginnis59 SSCertifiable Points: 6483 More actions · Answer 6

PaulB-TheOneAndOnly (1/16/2012)
Would you mind in writting a query to answer the question... "Did Peter Bishop attend Particle Physics 501 on January 9th, 2012?"
Now... write a query to answer the same question on the 3NF structure.

Its worth noting that since there was no sql server involved, I was essentially the storage engineer, and for reporting, I fed a stream of attendance rows which were normalized with the foreign keys to our reporting system.

The obvious question is then, why did I not do all of it as an RDBMS? Because I preferred that data entry be easy, and I always knew that reporting would be possible, if perhaps requiring more effort.

I don't feel too bad about the design, indeed theres even some parallels to todays trends, that collection of data is often distanced a bit from reporting of that same data.

PaulB-TheOneAndOnly SSC Guru Points: 66772 More actions · Answer 7

Please do not take it the wrong way but, out of curiosity, why is the subject of this thread "my favorite normalization case study" if you never intended to normalize the data and you don't even care about the basics of data modeling?

I'm not judging, you can do it anyway you want 🙂

_____________________________________
Pablo (Paul) Berzukov

Author of Understanding Database Administration available at Amazon and other bookstores.

Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.

patrickmcginnis59 SSCertifiable Points: 6483 More actions · Answer 8

PaulB-TheOneAndOnly (1/16/2012)
Please do not take it the wrong way but, out of curiosity, why is the subject of this thread "my favorite normalization case study" if you never intended to normalize the data and you don't even care about the basics of data modeling?
I'm not judging, you can do it anyway you want 🙂

It wasn't actually meant to be all that detailed a thread, but I'm interested that you felt it important to mention that "you don't even care about the basics of data modeling" when I clearly posted both the normalized and denormalized views, and actually shipped the normalized view to our reporting system. I took advantage of the denormalized view to help me ship. Its not like I didn't consider the normalized view, like I said, I materialized it for reporting.

Why did you feel the need to say "you don't even care about the basics of data modeling"? I thought it was an interesting case, but now your reaction is the interesting item in this thread 😉

So tell me about why you believe I don't care or didn't care when I produced this system?

TomThomson SSC Guru Points: 104773 More actions · Answer 9

patrickmcginnis59 (1/16/2012)
PaulB-TheOneAndOnly (1/16/2012)
Please do not take it the wrong way but, out of curiosity, why is the subject of this thread "my favorite normalization case study" if you never intended to normalize the data and you don't even care about the basics of data modeling?
I'm not judging, you can do it anyway you want 🙂
It wasn't actually meant to be all that detailed a thread, but I'm interested that you felt it important to mention that "you don't even care about the basics of data modeling" when I clearly posted both the normalized and denormalized views, and actually shipped the normalized view to our reporting system. I took advantage of the denormalized view to help me ship. Its not like I didn't consider the normalized view, like I said, I materialized it for reporting.
Why did you feel the need to say "you don't even care about the basics of data modeling"? I thought it was an interesting case, but now your reaction is the interesting item in this thread 😉
So tell me about why you believe I don't care or didn't care when I produced this system?

Don't get too upset Patrick. Paul is (I think - I hope he'll correct me if I'm wrong) taking things from the point of view of someone who uses an RDBMS which sorts out all those storage issues for him, whereas you are taking the point of view of someone who has to design the optimal storage and then present a relational view on top of it. Obviously you will differ sometimes (probably almost always, given the two points of view), and this is one of those times. It doesn't help, of course, that none of our currently available RDBMSs can be told what sort of queries they will get with what frequency and attempt to optimise the storage mechanism to fit, but have instead a fixed set (usually two members to that set) of storage organisations within which they permit the user to attempt to optimise by choosing indices and clustering factors. It's not clear to me that the relational model prefers one of those two views over the other, but for most people it seems to be absolutely clear that the relational model implies Paul's view, not yours (needless to say, I disagree with them; that doesn't mean that I think your storage structure is relational, of course, it just means that I think it's fairly clear that it can be used to support a relational view of the data; I don't for a moment think that Microsoft's view storage view of pages, files, and filegroups is the slightest bit relational either, but it certainly can supprt something close to a relational view (maybe exactly a relational view) of the data. But I do think it's unreasonable to ask for an explanation of storage engine primitives in terms of relational primitives - that's confusing the cart and the horse.

Tom

patrickmcginnis59 SSCertifiable Points: 6483 More actions · Answer 10

L' Eomot Inversé (1/16/2012)
patrickmcginnis59 (1/16/2012)
PaulB-TheOneAndOnly (1/16/2012)
Please do not take it the wrong way but, out of curiosity, why is the subject of this thread "my favorite normalization case study" if you never intended to normalize the data and you don't even care about the basics of data modeling?
I'm not judging, you can do it anyway you want 🙂
It wasn't actually meant to be all that detailed a thread, but I'm interested that you felt it important to mention that "you don't even care about the basics of data modeling" when I clearly posted both the normalized and denormalized views, and actually shipped the normalized view to our reporting system. I took advantage of the denormalized view to help me ship. Its not like I didn't consider the normalized view, like I said, I materialized it for reporting.
Why did you feel the need to say "you don't even care about the basics of data modeling"? I thought it was an interesting case, but now your reaction is the interesting item in this thread 😉
So tell me about why you believe I don't care or didn't care when I produced this system?
Don't get too upset Patrick.

Its ok. Paul is welcome to his opinion.

Paul is (I think - I hope he'll correct me if I'm wrong) taking things from the point of view of someone who uses an RDBMS which sorts out all those storage issues for him, whereas you are taking the point of view of someone who has to design the optimal storage and then present a relational view on top of it.
Obviously you will differ sometimes (probably almost always, given the two points of view), and this is one of those times.

What I think is useful is to consider alternatives, and I think its useful to attempt to understand why folks do consider alternatives. As I have mentioned, there seems to be an orthodoxy in the RDBMS world which prevents this, while I think its less than useful, I'm ok that it exists as I know how strongly many folks feel about it. Its worth my consideration that I may have asked for such a dismissal from Paul by even posting this anecdote on an SQL forum.

It doesn't help, of course, that none of our currently available RDBMSs can be told what sort of queries they will get with what frequency and attempt to optimise the storage mechanism to fit, but have instead a fixed set (usually two members to that set) of storage organisations within which they permit the user to attempt to optimise by choosing indices and clustering factors. It's not clear to me that the relational model prefers one of those two views over the other, but for most people it seems to be absolutely clear that the relational model implies Paul's view, not yours (needless to say, I disagree with them; that doesn't mean that I think your storage structure is relational, of course, it just means that I think it's fairly clear that it can be used to support a relational view of the data; I don't for a moment think that Microsoft's view storage view of pages, files, and filegroups is the slightest bit relational either, but it certainly can supprt something close to a relational view (maybe exactly a relational view) of the data. But I do think it's unreasonable to ask for an explanation of storage engine primitives in terms of relational primitives - that's confusing the cart and the horse.

I believe that there is a continuum of cases with varying degrees of mappability to relational solutions so to speak. While there are clearly cases where relational mappings do no good, and there are cases where they do much good, I'm of the belief that there are some grey areas. To me, I saw clear benefit in "pre-joining" elements, and it looked much like Oracle's vararrays, and what I found interesting was the benefit I got in using this approach.

PaulB-TheOneAndOnly SSC Guru Points: 66772 More actions · Answer 11

patrickmcginnis59 (1/16/2012)
Why did you feel the need to say "you don't even care about the basics of data modeling"? I thought it was an interesting case, but now your reaction is the interesting item in this thread 😉
So tell me about why you believe I don't care or didn't care when I produced this system?

Let me start by appologizing for my harsh language - didn't mean to be offensive but I recognize it didn't come out that well.

Having said that, the process of data modeling is pretty straight forward - after getting your business requirements and collecting the data you need to work with what we usually do is:

1- Create a ER model where we identify the Entities that make up our system as well as the Relationships between such Entities.

2- After having a validated ER model we usually decide either to go with a Relational Physical Model or with a Dimensional Physical Model. First option which I prefer for OLTP systems requires to normalize data usually not below the 3NF; second option which I prefer for DSS a.k.a. Data Warehouse systems requires to develop a star-schema model.

In this particular - and certainly interesting case - I didnt' see a single trace of these expected steps therefore I assumed you decided to be adventurous and don't care about traditional data modeling techniques.

It also surprised me that the stated objective of the design was to favor data entry over reporting when each piece of data will enter the system just once but will be used for reporting forever after - this fact of life usually means that the underlying design favors "reporting" over "data entry" by creating a data structure that favors the process of retrieving the data.

Having said all of that, at the end of the day it don't matter if the cat is white or black, all that matters is the cat getting the mice.

Sorry again, no personal offense was meant.

_____________________________________
Pablo (Paul) Berzukov

Author of Understanding Database Administration available at Amazon and other bookstores.

Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.