Blog Post

Key Sets, or Inverted Keys - defining a key as a set of tuples

,

TL;DR:

I have been doing a lot of thinking about data models for the last couple of weeks. I have, just now, had some thoughts which evolved from what I was thinking about regarding phone numbers and contacts in a forum post here https://www.sqlservercentral.com/Forums/1911234/A-simple-example-which-after-15-years-of-relational-design-work-still-breaks-my-brain-phone-numbers

TL;DR: Currently when testing for key violation we think of a function f(x, y) where:

  • x is the value (or values) of the key column (or columns) for one tuple, and
  • y is the value (or values) of the key column (or columns) for another tuple, and
  • f is the equality function.

I discuss whether it should be possible that:

  • x is a set of the key columns of "key tuples" relating to one tuple, and
  • y is a set of the key columns of "key tuples" relating to another tuple, and
  • f is the intersection function - which reduces to equality where the key tuples have a single member (ie, this rule reduces to the typical rule described above)

If anyone has seen other discussions along these lines, please send me in that direction! If anyone thinks this hasn't been discussed before, please tell me where I should submit a paper! Wink

Example 1: Contacts

I have recently been wondering what it means to have a unique "contact" in a data model. I am first going to describe an idea that I first thought of in the context of this example, because that is how I came to this idea. I will then describe it using a more familiar example - the problem of overlapping intervals in time. Note that the idea here is about what it means to be a key. I'm not proposing a solution specific to contacts, or intervals of time. I'm just using them as examples to demonstrate the concept. I'm not sure what to call the idea. "Key sets" or "inverted keys" come to mind.

First I obviously need to define what a "contact" is in my example model. That is actually quite difficult, which is, in a way, the point!

A contact is not a phone number, nor an email address. A contact is something that can "own" any number of phone numbers, and/or any number of email addresses, and/or any number of URL's (twitter, website, etc), and/or any number of some other of this kind of thing, (which I will henceforth call a "protocol endpoint").

In general a contact is the owner of some set of these protocol endpoints. In order for a contact to exist we must have some way to communicate with them, so they must have at least one such endpoint.

A contact may of course have other attributes. For example, there is almost certainly some way to address the contact when communicating with them, which I will call the "alias" of the contact. A contact probably has some kind of "salutation" attribute as well. Most of the time the contact represents an interface to a real human being, but that is not always necessarily true. I could, conceivably, have a contact representing a whole department. For example, a contact with an alias of "ACME corp accounts receivable", and an email address of AR@acme.com.

What, then, are the candidate keys for a contact? Clearly the "alias" attribute cannot be the key for the contact entity, because I might have several contacts all with the alias of "Bob", but one is bob@acme.com whereas the other is bob@bobsburgers.com.

I am going to make a claim now that seems a bit strange, but here it is: I claim that the contact is uniquely identified by any single member of the set of all protocol endpoints we can use to communicate with it, and that these endpoints are rows, not columns. Of course these endpoints are of different types (phone vs email vs URL, etc). For the moment, please imagine that I have some parent "protocol endpoint" type, but in the following example I will use values from the set of inherited types, just for clarity.

So, for example, I might have the following:

Alias = "Bob Bobbington", 
Salutation = "Mr", 
protocol endpoints = { 
  email:BB@bobsburgers.com, 
  email:bobbobbington@gmail.com, 
  mobile:0123456789, 
  landline:12345678
}

To be clear, I don't mean that the set of protocol endpoints is stored as a "multivalued attribute" in a single contact tuple, nor as a set of sparse columns in the tuple. Rather, I suggest that the set of protocol endpoints is itself a set of tuples of arbitrary cardinality - ie, rows - related to the contact tuple.

Of course in the real world we know that sometimes people share phone numbers, like the landline number for a residence. I am going to assert that, for the sake of this model, that situation would result in an incoherent idea for a contact, because it doesn't make sense to initiate a communication through one of these protocol endpoints without knowing who it is you are addressing. My claim, then, is that the "key value" for the contact I've described here is the "key set":

{ email:BB@bobsburgers.com, email:bobbobbington@gmail.com, mobile:0123456789, landline:12345678 }. 

A Formal Definition

Specifically, I claim it is a key in the following sense: no member of this set can exist within the set defining the key for any other contact. Or, more formally: the intersection of the "key set" with any other "key set" must be the empty set.

I realise that this implies that the empty set can be the key for any/all tuples, since the intersection of the empty set with itself is the empty set. This is a problem. I can eliminate the problem with an arbitrary rule that the empty set cannot be a key. Perhaps there's a more elegant way to eliminate the problem, (maybe relying on the fact that the empty set it a subset of itself?)

A-fortiori, this implies that the key sets cannot be identical - if two key sets cannot intersect, then no two key sets can be the identical, since identical sets would obviously intersect. So the more traditional idea of a primary key as value which must be unique is maintained. In fact, the "key set" definition of a key contains the normal definition of a key: if the key set is a set containing a single member (ie, a single tuple), then intersection reduces to equality, and we can think of the :"key set tuple" as simply being additional attributes of our entity tuple.

"A single member" here doesn't imply a single scalar value. We could have a multiple values in the case of a compound (multi-column) key, but the identity simplification would still apply. For example, if I had an existing table with a key composed of two columns {A, B}, then using the familiar definition of a key we check to make sure that there is no pair of values of {A,B} in the table that is identical with any other pair of values {A,B}. But If I have a "set based key", then what we want to check is that the set of all pairs of values {A, B} in the key set for one entity does not intersect with the set of all pairs of values {A, B} in the key set for any other entity.

Example 2: Intervals of Time

OK, all of this might seem very strange when we approach it from the point of view of these "protocol endpoints", because I'm describing it in terms of an unfamiliar problem. It just so happens to be what motivated my thinking here. But I can relate this to a much more common problem with which many people will be familiar: The problem of unique intervals.

It is quite common to encounter the problem of needing multiple versions of a thing, which cannot overlap in time. For example in a data warehouse with a slowly changing dimension, the interval represented by a version cannot overlap with the interval representing any other version. Another example might be the case of an employment relationship, where one person can only be employed by a company at most once at any given instant in time.

The problem with this sort of rule is that it cannot be declaratively enforced. I can define the interval via its start date and end date, but I can put no kind of declarative constraint on the start date column, nor the end date column, nor the combination of the two columns, to ensure that two intervals in the table do not overlap.

But there's another way of thinking about the start date/end date definition, and that is to think of it as a set (of times, in this case) which has been intentionally defined (ie, defined by a rule). Given these two dates, you can check whether any point in time is a member of "the set of all times between the specified dates". An easier way to think about might be to think about what would happen if we changed this intentional definition to an extensional definition - ie, a set defined by listing every member. I could create a table containing a person and a date. Then I look at the start and end date of the employment agreement between the person and the company, and put a row in this table for every date >= the start and < the end date. Then we could say that every tuple in this table must be unique over the columns person and date, and thereby we can ensure that there are no overlaps. Or in other words, it would not be possible to create another employment agreement with the same person whose interval overlaps an existing employment interval.

But this is not simply a familiar "many to one" relationship. I am suggesting that the employment agreement is functionally dependent upon the set of rows in this extensional table. In other words, the set of all dates representing the interval of the relationship, plus the person in the relationship, is the key for the employment agreement.

Of course, the problem with that approach - in this example of intervals of time - is that I have turned my intentional definition into an extensional definition, and the members aren't the same. I had to pick some level of granularity for the extensional definition. I happened to pick whole days, but what happens when someone says "actually, they can be the same day, but not the same hour", and then later "actually, they can be the same hour, just not the same second", and so on. But the purpose of the example is to demonstrate that the problem of "non overlapping intervals of time" is in fact the *same kind of problem* as the contact problem above: We would like to define a key as being composed of a set, which in this case happens to be intentionally defined compared to the extensional definition in the case of contact protocol endpoints. In other words, if it were able to define set-based keys, and if it were possible to use intentional as well as extensional definitions for those sets, then the problem of creating a key constraint which disallows overlapping intervals in time is the same as creating a key constraint which disallows the intersection of two sets.

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating