Schema question (about conditional dependency)

  • Hello,

    I might be committing schema sin, or on to something new, but I have a question regarding conditional dependency when creating a schema.

    Simply, the database system models a college degree, where there is a representation of what courses or combination of courses that can be taken in order to receive a degree. Courses can require other courses, or combinations of courses, or combinations of combinations to be taken before that course becomes eligible to be taken. Administrators will configure this through a web page, and it will depend on the school. What really got me thinking, is how to model conditional dependency in the database system?

    How can I have a system where a course, can be dependent on

    A) a course

    B) multiple courses

    C) combinations of combinations of courses

    ?

    I've come with a rough idea of what might work, but wanted input if my idea is far-fetched, flat out wrong, or has some truth to it:

    Major

    ---

    id*

    name

    Group // contains a combination of courses

    ---

    id*

    majorid (fk)

    qualifier (fk)

    expression // (feels bad, holds expression of AND's and OR's of group ID's that are a dependency of this group)

    Qualifier // all of the filled in columns must pass true; such that the group is marked as "done" and the requirement is met

    ---

    id*

    pickX // pick x of the courses in this group

    pickXabove // pick x of the courses in this group above course level _

    repeatlimit // limit courses can be repeated

    etc.../

    Course

    ---

    id*

    name

    // other fields

    GroupCourse

    ---

    groupid (fk)

    courseid (fk)

    Please knock some sense into my brain, I feel like there is a way to put this in a different way, but I don't know if its the right way

  • Okay, I'm going to rough you up a little but to no ill intentions. Endure it and you will learn. (Hmmm, does this make me feel like Miyagi?)

    First: Why do you have a primary key called "id" in every entity? Surrogate keys should not be added until you are ready to move from logical design to physical design, and then only for tables that need it (based on the chosen RDBMS). Also (related), why have you not identified the proper candidate key(s) for each entity?

    Second: No idea what "PickX" and "PickXabove" in the Qualifier entity mean, but they sound very non-1NF.

    Based on your explanation, the first simple idea in my mind was that you are actually telling, in a complex way, that courses can depend on 0, 1, or more than 1 other courses. That is a standard optional one-to-many relationship from the course entity to itself. The "combinations of combinations of courses" just sounds like a lot of courses to me. Perhaps you can do some smart UI trick to make it easy for the people who enter the data to select a lot of courses at once, but you can still store them as individual dependencies. Also, perhaps you do not even need all those dependencies - if "data modeling 201" depends on "data modeling 101", and "data modeling 301" depends on "data modeling 201", then you no longer have to explicitly mention that "data modeling 301" also depends on "data modeling 101".

    However, if it is really a functional requirement to have the groups of courses exist as a seperate entity, then your data model becomes a bit more complex - though still not overly so. Just introduce "coursegroup" as an entity. Every coursegroup consists of 1 or more courses (mandatory one-to-many relationship). Every course still depends on 0, 1, or more other courses (optional one-to-many relationship to itself), but now every course also depends on 0, 1, or more coursegroups (another optional one-to-many relationship).

    Note that coding the logic to assess whether a student fulfills all the requirements for a course will be a lot more complex if you choose the second data model!


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo Kornelis (2/2/2016)


    Okay, I'm going to rough you up a little but to no ill intentions. Endure it and you will learn. (Hmmm, does this make me feel like Miyagi?)

    First: Why do you have a primary key called "id" in every entity? Surrogate keys should not be added until you are ready to move from logical design to physical design, and then only for tables that need it (based on the chosen RDBMS). Also (related), why have you not identified the proper candidate key(s) for each entity?

    Second: No idea what "PickX" and "PickXabove" in the Qualifier entity mean, but they sound very non-1NF.

    Based on your explanation, the first simple idea in my mind was that you are actually telling, in a complex way, that courses can depend on 0, 1, or more than 1 other courses. That is a standard optional one-to-many relationship from the course entity to itself. The "combinations of combinations of courses" just sounds like a lot of courses to me. Perhaps you can do some smart UI trick to make it easy for the people who enter the data to select a lot of courses at once, but you can still store them as individual dependencies. Also, perhaps you do not even need all those dependencies - if "data modeling 201" depends on "data modeling 101", and "data modeling 301" depends on "data modeling 201", then you no longer have to explicitly mention that "data modeling 301" also depends on "data modeling 101".

    However, if it is really a functional requirement to have the groups of courses exist as a seperate entity, then your data model becomes a bit more complex - though still not overly so. Just introduce "coursegroup" as an entity. Every coursegroup consists of 1 or more courses (mandatory one-to-many relationship). Every course still depends on 0, 1, or more other courses (optional one-to-many relationship to itself), but now every course also depends on 0, 1, or more coursegroups (another optional one-to-many relationship).

    Note that coding the logic to assess whether a student fulfills all the requirements for a course will be a lot more complex if you choose the second data model!

    Nothing against roughing up, it helps us learn.

    To reply,

    I have put a primary key of "id" in every entity because my intuition tells me that an auto-incrementing int will distinctly represent a unique row in each entity. I put this in as a placeholder, as I continue to understand what entities and relationships exist in my database schema. I realize that this isn't what we are supposed to do, as my schema is still growing and evolving as I understand my system better, so thank you for calling me out on this. When you say logical design, do you mean ER diagrams? I was not taught in my classes what a candidate key is until I looked it up just now. Are candidate keys usually denoted in ER diagrams?

    Second, the PickX and PickXAbove are attempts at representing course dependencies for a degree. Take two examples:

    Course A, depends on (Pick 3 courses from this list: course b, course c, course d and course e)

    Course Z, depends on (Pick 3 courses above course level 2000: course y, course x, ....)

    In addition to these, I've identified other general cases where one might assume courses have particular dependencies

    A) Pick x courses from a list, where courses can be repeated y times

    B) Pick x courses from a list, where courses above course level y can be repeated z times

    C) Pick x courses from a list, where y must come from z different departments

    D) Pick x courses from a list, OR (A, B, C, .... etc.)

    E) Pick x courses from a list, AND (A, B, C, .... etc.)

    F) Others

    I've designed SQL schemas in the past to accommodate A, B, C, but it is D and E that are causing me trouble. Ideally, to model the real world I need to accommodate any combination of AND's and OR's with course dependencies, which is why I've attempted to conjure a Qualifier entity that represents the requirements for a particular course

  • zacharywenta (2/2/2016)


    To reply,

    I have put a primary key of "id" in every entity because my intuition tells me that an auto-incrementing int will distinctly represent a unique row in each entity. I put this in as a placeholder, as I continue to understand what entities and relationships exist in my database schema. I realize that this isn't what we are supposed to do, as my schema is still growing and evolving as I understand my system better, so thank you for calling me out on this.

    The problem with assigning an auto-incrementing id as the primary key is that it does nothing. Imagine a data entry typist accidentally leaning on the keyboard for a while, causing the same data to submitted multiple times. This surrogate key will keep spawning new values while the same data is entered over and over again.

    I have not ever yet seen a case where people add an artificial key to the logical design and do not end up forgetting to find (and declare) the actual business keys. In theory it is possible to always add this id and still find all uniqueness rules, in practice it never happens. Plus, it will never be removed when the implementation starts, whereas in some tables it will not be needed. For instance, at least 95% of the tables implementing a many-to-many relationship I have seen have an artificial key that serves no purpose at all other than to waste disk (and memory) space.

    When you say logical design, do you mean ER diagrams? I was not taught in my classes what a candidate key is until I looked it up just now. Are candidate keys usually denoted in ER diagrams?

    ER diagrams is just one of many possible techniques to represent the logical design. In data modeling, there are three layers: conceptual design (which is mostly about functionality, and represent the way the business sees the data - this can be useful even when the data is then never entered in any computer system); the logical design (which is an abstraction of the conceptual design that is targeted towards a specific type of implementation, e.g. relational or object-oriented, but not towards a specific vendor), and the physical design (which can be seen as a blueprint for actually building the data store, optimized for a particular implementation). ER diagrams are probably one of the best and definitely the most popular tool for logical designs of a relational database. A physical design can also be captured in an ER diagram, allthough I think that a simple CREATE TABLE statement, though less visually appealing, works better for this.

    Candidate keys should be represented in any representation of the logical and physical data model, and should be implemented in the data store. For a relational database, this is usually done in the form of a UNIQUE constraint. (Or a PRIMARY KEY constraint for the candidate key that is chosen to be the primary key).

    >>Second, the PickX and PickXAbove are attempts at representing course dependencies for a degree. Take two examples:

    Course A, depends on (Pick 3 courses from this list: course b, course c, course d and course e)

    Course Z, depends on (Pick 3 courses above course level 2000: course y, course x, ....)<<

    Yeah, that's what I thought. "Pick 3" implies that you want to store three courses in a single column, That's a violation of one of the rules of 1NF (First Normal Form): every attribute in a relational model should represent exactly one value. This particular violation is called a "repeating group" - a single attribute that can represent multiple values of the same type.

    In addition to these, I've identified other general cases where one might assume courses have particular dependencies

    A) Pick x courses from a list, where courses can be repeated y times

    B) Pick x courses from a list, where courses above course level y can be repeated z times

    C) Pick x courses from a list, where y must come from z different departments

    D) Pick x courses from a list, OR (A, B, C, .... etc.)

    E) Pick x courses from a list, AND (A, B, C, .... etc.)

    F) Others

    I've designed SQL schemas in the past to accommodate A, B, C, but it is D and E that are causing me trouble. Ideally, to model the real world I need to accommodate any combination of AND's and OR's with course dependencies, which is why I've attempted to conjure a Qualifier entity that represents the requirements for a particular course

    Database design is far more complex than many people think. I have a course on this on Pluralsight that is about eight hours of recorded video, so I hope you understand that I cannot cover all of that on a forum.

    The way I was taught to do data modeling (and the method I still fall back on for complex cases) is to use concrete examples, transform then into readings to find the facts represented in those examples, and then follow a set of steps to arrive at a normalized data model. Ideally, you would have access to a domain expert during this process, but if your knowledge of the domain is sufficient you can sometimes act as a domain expert yourself as well.

    In this particular case, the requirements appear to be quite complex. So I started to think about how I would represent the requirements for the courses as a concrete example, and how I would then read that example. My thinking, based on that idea, resulted in this pattern.

    First, you obviously have courses.

    Next, courses may have zero or more dependencies. All of the dependencies have to be met before a course can be taken.

    Each of those dependencies can be of a different type. Some may be simple (a preliminary course), others may be more complex (two courses from a list of four), etc.

    I would either try to find (if the system already exists but needs to be automated) or make up (if this is for a new system) examples of all the different types of requirements that are possible, then read all those examples and then apply the techniques that I teach in my Pluralsight course to transform all those requirements into entities.

    I am sorry that I keep mentioning another site, and a paid one to boot (though there is a free trial option available). But the subject matter is simply too complex and too broad to fully explain here, and the problem area you describe is too broad to give you a fully worked out data model. (plus, the complexity exceeds what I am willing to provide for free).

    One final note, and this is something I do not cover in my course: sometimes, especially in cases like this, you need to push back. The business always asks for more complexity, more flexibility, etc. They are not aware of the cost that this incurs - more time to design the data model, more time to build the database, more time to build and implement the user interface that makes the complex design easy (or at least sort of easy) to use, and going forward more time to maintain the system. When domain experts or business people ask for incredibly complex options, it often helps to ask them how much they are willing to pay for that option. Tell them up front that it's going to blow their budget - often that triggers them to think about how often they will actually use that particular option, and then you will have one option less to worry about,

    I have seen applications that were incredibly complex, offering huge amount of options to the users to set up contracts in almost every imaginable way, and in reality only five to ten permutations were actually used. All the other options that had been built into the system were never used at all. Sometimes because the system had so many options that nobody really understood all of them, sometimes because they were simply not commercially vialble, sometimes both. Just think about all the effort invested in building that, and then it is never used - what a waste!


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • The way I was taught to start with database design is to look at things conceptually. You first establish the "entities." Then you solve how the entities are related to each other.

    Eg//

    Major

    Student

    Course

    CourseGroup...

    Writing this out ...

    A student can have one major but a major can be assigned to many students. (many to one)

    A student can be in more than one course and a course can have more than one student (many to many)

    ...so forth and so forth ...

    This then becomes your physical tables which you enforce with PKs

    Next you formalise the transactions tables, like when a student enrolles in a course.

    row 1) stdA, Course1, 'Fall2015' ... etc ...

    this represents and action. Assigning a course to a course group is itself an action.

    You enforce these with FK's (foreign keyes)--> (A existing student cannot be in a course that does not exist for example.)

    Take it from the top and work it out. A good design in the end is scalable and easy to understand as well.

    Dont forget to study up on a 3rd normal form design (3NF).

    Cheers.

    ----------------------------------------------------

  • Hugo,

    Thank you for taking your time to elaborate on your reply. It has been immensely helpful, and I felt the need to share this with you.

    In addition,

    The point you bring up about auto-incrementing the primary key is quite insightful! I would have never thought that its purpose would be lost if someone other than I or familiar with the system in general would or could create redundant data. It is a definite priority of mine to identify the candidate keys for my entities that more closely represent real-world data in uniqueness (names of courses, for example).

    It appears I was initially looking at the design in a logical design manner. Thank you for clarifying this up.

    And you make such a great point noting that it might not be cost-effective to solve every single corner-case, not to mention the complexity and ease of forgetting such information as time progresses. Initially, I had thought that a general system would satisfy one and all cases of requirements, only to find that building a system such as this is unfeasible and would require significant resources to do so. Not only that, but a system that I described earlier would likely invalidate one of the NF, or perhaps multiple of them.

    I think the above is a golden nugget of knowledge, that you necessarily aren't told when being taught about databases, entities, relationships. Working in the field, there are costs to undertaking such projects and at times it is necessary to push back at what the business describes as necessary to have. There is likely a workaround present. Thank you for your help Hugo, I did previously have a Pluralsight subscription but it has since expired. If I do choose to renew it in the future, I will remember your name

  • MMartin1 (2/3/2016)


    The way I was taught to start with database design is to look at things conceptually. You first establish the "entities." Then you solve how the entities are related to each other.

    Eg//

    Major

    Student

    Course

    CourseGroup...

    Writing this out ...

    A student can have one major but a major can be assigned to many students. (many to one)

    A student can be in more than one course and a course can have more than one student (many to many)

    ...so forth and so forth ...

    This then becomes your physical tables which you enforce with PKs

    Next you formalise the transactions tables, like when a student enrolles in a course.

    row 1) stdA, Course1, 'Fall2015' ... etc ...

    this represents and action. Assigning a course to a course group is itself an action.

    You enforce these with FK's (foreign keyes)--> (A existing student cannot be in a course that does not exist for example.)

    Take it from the top and work it out. A good design in the end is scalable and easy to understand as well.

    Dont forget to study up on a 3rd normal form design (3NF).

    Cheers.

    Thank you for the guidance MMartin1

  • The problem with assigning an auto-incrementing id as the primary key is that it does nothing. Imagine a data entry typist accidentally leaning on the keyboard for a while, causing the same data to submitted multiple times. This surrogate key will keep spawning new values while the same data is entered over and over again.

    The alternative here would be a natural key or a intelligent key? These may be subject to repeat or be accidentally duplicated as well more so. Ex// you get a customer in the 90210 area code last named Barkley. Key becomes like Bark90210. You get another customer in the same area code last named Barker --> Bark90210. Then your managers will really be barking at you.

    As for natural keys ... like a social security number. I would be cautious of using sensitive data like this as a key.

    I think operational keys, like surrogate keys, need not have intelligence or be built with strings. They join better to their related tables as integer indexes. Incrementing numbers are usually associated with a time of entry as well and can be a good indicator of sequence. They need not be a clustered index but they help keep the table size manageable and joins fast in a relational schema.

    My thoughts.

    ----------------------------------------------------

  • MMartin1 (2/4/2016)


    The problem with assigning an auto-incrementing id as the primary key is that it does nothing. Imagine a data entry typist accidentally leaning on the keyboard for a while, causing the same data to submitted multiple times. This surrogate key will keep spawning new values while the same data is entered over and over again.

    The alternative here would be a natural key or a intelligent key? These may be subject to repeat or be accidentally duplicated as well more so. Ex// you get a customer in the 90210 area code last named Barkley. Key becomes like Bark90210. You get another customer in the same area code last named Barker --> Bark90210. Then your managers will really be barking at you.

    As for natural keys ... like a social security number. I would be cautious of using sensitive data like this as a key.

    I think operational keys, like surrogate keys, need not have intelligence or be built with strings. They join better to their related tables as integer indexes. Incrementing numbers are usually associated with a time of entry as well and can be a good indicator of sequence. They need not be a clustered index but they help keep the table size manageable and joins fast in a relational schema.

    My thoughts.

    No, the alternative is to start by looking at how the business identifies instances of the object.

    The way I have been taught to do data modeling is to start by not thinking about IT implementation at all. And in fact, a data model can be useful even when the business then decides not to automate anything.

    Even without databases and computers, people have always had a need to identigy individual instances of an object type. Sometimes that is done by a truly natural (as in: exists as a part of the object itself) key - e.g. in chemistry, the number of atoms in a molecule can be used to identify an element. Much more often, a human-designed and human-assigned key is used, such as the element name in chemistry. Or a street name and house number for an address. Or even a customer number that is assigned when you place your first order. A lot of people like to argue on which of those are natural and which are artificial; I dislike that so I do not use those terms; instead I use the term "business key" for anything the business uses to identify instances, and "surrogate key" for additional keys created (and hopefully contained) in the database to serve as a surrogate for a business key.

    During the modeling phase, I am only interested in business keys. So I visit a warehouse when a truck is unloaded. I see someone using a handscanner to scan all incoming goods, so I look at what he is scanning - bam! there's my business key. I go to the helpdesk and hear that employees pick up the phone and their first question is "what is your customer number" - bam! another business key found. (And then I listen a while longer to see what other candidate keys are used for customers who do not have their customer number).

    Surrogate keys get added at implementation. In my conceptual and logical designs, I have no issue with a table that has a primary key that consists of three long character columns and is referenced by a dozen other tables; in my SQL Server implementation I will add a surrogate key (probably IDENTITY) to that table *while still retaining the uniqueness on the business key* and use that column to implement the foreign keys.

    You also mention "smart keys", such as Bark90210. I do not really like them, as they technically violate 1NF - combining two seperate values into a single attribute. But if the business actually uses that as the business key (as opposed to it being thrust upon them against their will by an IT team), then I will go with it. I always think that IT should try to minimize the changes the force upon the business.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply