Relational database without relations
The strength and holy grail of relational databases lies in
the very name – relations. That’s what they traditionally excel in. To find a
related record through several tables using proper indexes is not a
time-consuming problem even if a database is really big, containing millions of
records. Microsoft has put a good deal of intelligence and cunning into query
optimizations, caching, indexing and execution plans to make the process of
finding related records even smoother and faster.
This small article, however, will try to shatter the very
sacred notion of relational databases, i.e. of linking the tables and also will
attempt to show a slightly different mechanism of relating data.
The idea first came when I tried to write a simple interface
that would via ADO imitate an object database behavior with an underlying SQL
server database as data source. The task is by far not new, nor is there any
shortage of programmatical solutions. Still, most (if not all) of those I came
across required a very rigid database structure that had to be prepared
beforehand, including lots of service tables, triggers, keys, etc.
But what happens if you just create objects (like Person,
Address, Product, etc.) but do not know how they will be related? Or, if they
already are, there is a new relation coming in? Or, in a case when you have
one-to-many, the project manager comes and confesses it really should have been
many-to-many. So, what I mostly tended to achieve was complete freedom from any
constraints, indexes and foreign keys that usually result from building
relations – the goal was: build singular atomic objects and devise some
universal way of relating any object to any other object.
The solution is quite trivial and must have already occurred
to you – a single intermediate table, like the ones used in many-to-many
relationships. This table can serve all other tables in all possible kinds of
relationships, including one-table hierarchies.
Another vital issue is the use of GUIDs (Globally Unique
IDentifiers) that are generated by SQL server (or returned by a function
NewID()) and are guaranteed to be unique throughout the database (actually,
throughout the world). Every table has a clustered primary key on a GUID column
- in this case every record is uniquely referenced by its GUID.
Here is the intermediate table creation script:
CREATE TABLE [dbo].[Relations] (
[GUIDA] [uniqueidentifier] NOT NULL ,
[GUIDB] [uniqueidentifier] NOT NULL
) ON [PRIMARY]
Of course, one would need indexes to speed up searches. Here
they are:
ALTER TABLE [dbo].[Relations] ADD
CONSTRAINT [PK_Relations] PRIMARY KEY CLUSTERED
([GUIDA]) ON
[PRIMARY]
CREATE INDEX
[IX_Relations] ON [dbo].[Relations]([GUIDB]) ON [PRIMARY]
Making both fields the clustered primary key may be tempting
but yields much worse results with heavy scans happening all the time. But here
comes another problem. What if we need to query two tables, but do not know in
what field A or B the first GUID is used? The trigger will take care of
that:
CREATE TRIGGER Relations_Insert ON [dbo].[Relations]
FOR INSERT
AS
SET NOCOUNT ON
INSERT INTO Relations (GUIDA, GUIDB) SELECT GUIDB, GUIDA FROM
inserted
This trigger will effectively flip any new record and insert
the result as a new record – making the order of fields in relation absolutely
irrelevant, but, obviously, duplicating the overall number of records. (Speaking
of which, you may want to add ON DELETE trigger to clean the duplicate when a
record is deleted)
So, what happens now? Suppose we have a Person object (which
is really a table, but treated as an object to imitate object database) and an
Address object. In a conventional database with Address having a foreign key to
Person, the query will look like the following:
SELECT Person.Last_name, Address.Country
FROM Person INNER JOIN Address
ON Person.Person_GUID = Address.Person_GUID
WHERE Address.Country = 'Belgium'
In the case with the universal intermediate table, however,
it will be something like:
SELECT Person.First_name, Address.Country
FROM
Person INNER JOIN Relations ON Person.Person_GUID =
Relations.GUIDA
INNER JOIN Address ON Relations.GUIDB =
Address.Address_GUID
WHERE Address.Country = 'Belgium'
The execution plans look surprisingly the same – actually
there had been almost no drop in performance when both queries were tested on a
database with half a million records in each Person and Address tables and,
consequently a million records in Relations table. Which arouses the natural
question of performance.
This technique obviously is not aimed at increased
performance nor can it ever be made as performant as conventional foreign-key
queries (although with proper indexing and occasional INNER LOOP JOIN query hint
the results surprisingly may be quite acceptable). Indeed, a query including
five tables will, in this case, include nine of them, naturally adding an
overhead of index seeks/scans. The matter is, this technique’s objective is not
performance, but complete freedom of linking objects and creating hierarchies
with no additional structure changes. In the case of programmatical/data objects
this means that any object instance can be logically linked to any other object
instance (or even to itself), at any given moment, in any hierarchical structure
providing flexible data source for objects manipulation.