FK vs Check Constraint for Lookup/Reference Tables

  • I often see lookup tables used to hold some sort of description (status, mode, etc.) with an FK relationship to ensure data integrity. I've been thinking about replacing the usual design of linking IDs with storing the actual status description in the parent table, and using a UDF inside a check constraint to validate that data against a reference table. This would save the cost of joining tables unnecessarily while still enforcing data integrity. As long as the status is only used within that single table this seems like a good idea; obviously, if it were shared among multiple tables it wouldn't be a good fit.

    Pros, cons, suggestions? It "seems" like a good idea but is it? TIA

    Traditional Model:

    tEmployees (EmployeeID int,StatusID int)

    tEmployeesStatus (StatusID int,StatusDescription varchar(20))

    Proposed Model:

    tEmployees (EmployeeID int,StatusDescription varchar(20))

    tEmployeesStatus (StatusDescription varchar(20))

    Create a UDF that validates the data being entered into the StatusDescription field:

    CREATE FUNCTION [dbo].[fnEmployeeStatusList] (@EmployeeStatus varchar(20))
    RETURNS char(1)
    AS
    BEGIN
        DECLARE @Exists char(1)

        SELECT @Exists = [StatusDescription] FROM tEmployeeStatus WHERE [StatusDescription] = @EmployeeStatus

        IF @Exists > '' BEGIN SET @Exists = 'Y' END
        IF ISNULL(@Exists, '') = '' BEGIN SET @Exists = 'N' END

        -- Return the result of the function
        RETURN @Exists
    END
    GO

    ALTER TABLE [dbo].[tEmployees] WITH CHECK ADD CONSTRAINT [CK_tEmployees_Status] CHECK (([dbo].[fnEmployeeStatusList]([EmployeeStatus]) = 'Y'))
    GO

    ALTER TABLE [dbo].[tEmployees] CHECK CONSTRAINT [CK_tEmployees_Status]
    GO
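
    For comparison, the traditional model would just be wired up with a primary key and a foreign key, something like this (the constraint names here are only illustrative):

    -- Traditional model: key the lookup table and reference it from the parent
    ALTER TABLE [dbo].[tEmployeesStatus] ADD CONSTRAINT [PK_tEmployeesStatus] PRIMARY KEY (StatusID)
    GO

    ALTER TABLE [dbo].[tEmployees] WITH CHECK ADD CONSTRAINT [FK_tEmployees_tEmployeesStatus]
        FOREIGN KEY (StatusID) REFERENCES [dbo].[tEmployeesStatus] (StatusID)
    GO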

  • I imagine that calling that multi-statement UDF for each update/insert operation on the table will be a pretty horrifying overhead, but it's easy enough to rewrite it as a single-statement UDF (a sketch is below). Another issue is storage: if several employees share a status, the old model stores that status just once, while your new model stores it several times; that may not matter if the number of employees isn't large. In addition, there's going to be a search on status for each update/insert operation, which may need to be supported by an index if there are enough distinct statuses; the old model has an index on StatusID, which is an int, which suggests a lot of distinct statuses - otherwise why isn't it smallint (or even tinyint)? - so indexing a 20-byte status as required in the new model will cost some extra storage. But it could improve performance overall provided the employee table is read far more often than it is updated, usually has to be joined with the status table every time it's read, there aren't too many employees, not too many employees share the same status, and the size of the status isn't going to grow much in future.
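
    Something along these lines, just a sketch, keeping the table and column names from the proposed model:

    -- Single-statement version: the whole body is one RETURN of a CASE expression
    CREATE FUNCTION [dbo].[fnEmployeeStatusList] (@EmployeeStatus varchar(20))
    RETURNS char(1)
    AS
    BEGIN
        RETURN CASE WHEN EXISTS (SELECT 1 FROM dbo.tEmployeesStatus
                                 WHERE StatusDescription = @EmployeeStatus)
                    THEN 'Y' ELSE 'N' END
    END
    GO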

    But there are considerations other than the performance of employee-table manipulations. Is an individual status value something that applies to a class of employees rather than to a single employee - i.e. is an employee status a business-world entity in its own right, that may change while remaining applicable to the same group of employees - or are statuses just an attribute of an employee, i.e. something which, when it changes, normally affects just one employee? That may be a more important consideration than performance, because getting the table structure to reflect the business world is generally a good thing, and if status changes that affect many employees simultaneously are common, that will be a performance hit against the new model. If in fact a status change always affects only a single employee, the status is an attribute of an employee, so it's certainly better design to hold the status in the main table, and the only reason for splitting it out (which is something that people like me regard as dangerous, because someone may carelessly update something that affects many employees when they intended to affect only one) would be to improve performance. On the other hand, if it's a separate entity in its own right, it's certainly better design to put it into a separate table, and the only reason for pulling it into the employee table (which is what everyone calls denormalization) would be to improve performance.

    At the end of the day it comes back to one thing: what are the business considerations - all the business considerations, not just performance, which makes quite a long list. Those include flexibility, ease of development, support cost, and a lot of things which at first sight aren't schema design issues but on proper consideration are; performance is somewhere on that list, but it usually isn't top by any means.

    edit: your check constraint passes the UDF's parameter name ([EmployeeStatus]) where the table's column name should be, so it won't work unless that is fixed.

    Tom

  • That check constraint won't prevent rows in tEmployeesStatus from being deleted/updated if there's a matching row in the child table. A foreign key will (or will cascade the delete/update if required).

    The check constraint also doesn't give the optimiser extra information about the relationship; a foreign key will.
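
    To illustrate that optimiser point - a sketch, assuming a trusted foreign key on StatusID and the column declared NOT NULL:

    -- With a trusted FK on tEmployees.StatusID (and the column NOT NULL), the optimiser
    -- knows every row has exactly one matching status, so the join below can be
    -- eliminated from the plan because no columns from tEmployeesStatus are used.
    SELECT e.EmployeeID
    FROM dbo.tEmployees AS e
    INNER JOIN dbo.tEmployeesStatus AS s ON s.StatusID = e.StatusID;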

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • L' Eomot Inversé,

    Thank you for the reply.

    This was a generic example, but a few actual examples would be statuses applied to invoices and shipments, both of which will have a significant amount of data. The uniqueness of the status will be diluted as the table grows and, more importantly, as shipments and invoices move along through their respective processes. At some point 95% of shipments and invoices will move into a final status (Paid, Shipment Delivered) so I would think indexes wouldn't be useful. Queries based on these statuses will be made quite often, and updates to the statuses will be common.

    Yes, the statuses won't really grow and there aren't that many of them (half a dozen maybe). A tinyint would be a better fit; I copied from a working version but hastily renamed some of the code for this post (hence the non-working version I posted).

    Yes they will need to be joined almost always because it's vital info to know almost every time a request is made.

    Regarding duplicate data... if I choose to store a StatusID multiple times in the table instead of the actual name, it's still duplicate data (imo).

    Regarding the considerations you mentioned about a status applying to just that entity vs. a group: yes, these will be specific to just that entity and won't represent a group.

    Design woes.... 😀

  • AVB (8/25/2013)


    At some point 95% of shipments and invoices will move into a final status (Paid, Shipment Delivered) so I would think indexes wouldn't be useful.

    If you frequently query for non-final statuses, those indexes will potentially be very useful indeed.
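
    For example, a filtered index along these lines (the table name and final-status value are just placeholders) only contains the minority of rows that haven't reached the final status, so it stays small and lets queries for non-final statuses seek instead of scanning:

    CREATE NONCLUSTERED INDEX IX_tShipments_OpenStatus
        ON dbo.tShipments (StatusDescription)
        WHERE StatusDescription <> 'Shipment Delivered'
    GO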

    p.s. Have you tested and confirmed that the time required for the join is noticeable and unacceptable?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster,

    Thank you for the reply as well.

    The actual table uses a PK/FK relationship with the other associated tables based on the "EmployeeID" (really ShipmentID or InvoiceID), with cascading delete/update. I know the post was a little vague, but I was trying to convey the idea without getting too specific early on, in case it got a quick "You're doing WHAT?!" kind of response. 🙂

  • GilaMonster,

    No testing has been done yet as I'm in a "merged" logical and physical design phase.

    I'm probably hesitant about using the lookup table with an FK relationship because the previous system I used had a huge MUCK table. Of course I wouldn't design it like that, but it raised the question: if I can reduce the number of joins, why not? Joining 5+ tables to get a whole overview of a single shipment is cumbersome, but there is a lot of data to be displayed and I don't know that it's avoidable.

  • AVB (8/25/2013)


    No testing has been done yet as I'm in a "merged" logical and physical design phase.

    Beware premature optimisation (making non-standard, over-complex design decisions based on untested performance assumptions).

    You're looking at making the database bigger, removing any update/delete constraints for these lookups, and adding extra code and unusual check constraints - a non-standard way of doing things that could lead to unanticipated errors when changing data in the child table - all with no testing to see whether the foreign key/join design it'll be replacing is actually inefficient.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster, you make a good point as well.

  • If the status really is just an attribute of the main item, rather than a separate entity that is related to it, then the restriction on the set of values is a domain constraint. There are two standard ways of implementing domain constraints: (a) as an auxiliary table and a foreign key relationship, with the main table pointing to the auxiliary table; and (b) as a row-level check constraint with no auxiliary table (one which calls no UDFs that reference any data other than the row's attributes passed in as parameters and literal constants held in the UDF's single-statement definition). Method (a) is used only when it's needed because the attribute is bigger than a reference to a row in an auxiliary table, and putting the data in the main table would increase the row size enough to cause performance problems - method (a) is really aimed at expressing entity relationships, not at specifying attributes of an entity; method (b) is the norm. I think it's probably not a good idea to invent a hybrid, where you implement a domain constraint with an auxiliary table accessed through a check constraint.

    I wouldn't go as far as Gail and say definitely use method (a), because you've said that tinyint is more appropriate than varchar(20) and I suspect you probably have only half a dozen possible values, so it's going to be trivial to write an appropriate check constraint using no UDF at all and no auxiliary table (a sketch is below), and you can index the main table on that attribute too; it's quite likely that you'll need to for performance. But I agree completely when she says, in effect, that trying to design optimisations too early is not a good approach - implement one of the standard approaches; if it doesn't perform, try the other; and only when that also doesn't perform, try to optimise by doing something new. That wouldn't have been good advice 33 years ago, when there was very little knowledge of good structures and algorithms built from experience with RDBMSs, but a third of a century on there is a lot of history to learn from and well-established standard practice to follow when there's no good reason not to. Maybe it wouldn't be good advice in a research situation either, but I don't imagine that's where you are.
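
    Just to illustrate method (b) - the table name and status values here are invented, and the real list would come from the business:

    -- Row-level domain constraint: no UDF, no auxiliary table,
    -- and the column can still be indexed if queries need it
    ALTER TABLE dbo.tShipments WITH CHECK
        ADD CONSTRAINT CK_tShipments_Status
        CHECK (StatusDescription IN ('Created', 'Packed', 'Shipped', 'Shipment Delivered'))
    GO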

    Tom

  • Simplest suggestion I can make here is to let the relational system work like a relational system. Use the tools provided, because they are part of how SQL Server has been written and optimized. Joins across 5, 8, or even 20 tables are no big deal if the queries are written correctly, constraints are in place and enforced, you have indexes, and all the statistics are up to date. I've seen an 86-table join work just fine (well, apart from the 3-minute compile time). Don't sweat that stuff. Focus on core principles.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thank you for the insight. As always, this site is very informative - thanks to everyone who contributes.

  • I'm consulting at a company with a developer-designed database which has a very large table (over 2 billion rows) that has a number of small char or varchar categorical columns. Currently, there are no check constraints or foreign keys pointing to look-up tables. I'm wondering what's the best approach to enforce integrity at the DB level rather than at the app level.

    Consider the case where the categorical column is a char(1) and appears only in the subject table. It seems ridiculous to create a lookup table if the only valid values are 'M' or 'F', and given that the meaning of these abbreviations is so well known, a check constraint certainly seems better than a two-row lookup table and an FK. Going one step further, there can be a case where there are 15 valid char(1) codes. Here the case for a lookup table is a bit better, since the lookup table could include a description column for clarity. We could also change the PK/FK data type to tinyint and not increase the size of the subject table. I wonder if there is any performance benefit in replacing the char(1) with a tinyint when joining.

    Then there are marginal cases where the categorical char column is larger but not too large. In these cases, there would be a minor size benefit from replacing the char columns with tinyint FKs pointing to a lookup table. But if the number of alternatives was still small (e.g. Male and Female) so clarity was not an issue, I wonder whether a check constraint would perform better than the FK and lookup table.

    For the case where the categorical character variable is quite large relative to the size of the smallest integer-type key value, it seems clear that the FK and lookup table is best from a size perspective alone. And lookup tables are preferred when there are many valid values and their presence adds clarity or when they could be used by multiple tables. Check constraints have the benefit of reducing the number of tiny lookup tables. But it seems they're only good for a 'small' number of 'small' sized categorical alternatives that appear only in one table.
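
    To make the two alternatives concrete (the table and column names below are hypothetical):

    -- Option 1: check constraint on the char(1) column, no lookup table
    ALTER TABLE dbo.tBigTable WITH CHECK
        ADD CONSTRAINT CK_tBigTable_Sex CHECK (Sex IN ('M', 'F'))
    GO

    -- Option 2: tinyint key, lookup table, and FK
    -- (assumes the subject table carries a tinyint SexCodeID instead of the char(1))
    CREATE TABLE dbo.tSexCode
    (
        SexCodeID   tinyint     NOT NULL CONSTRAINT PK_tSexCode PRIMARY KEY,
        Description varchar(30) NOT NULL
    )
    GO

    ALTER TABLE dbo.tBigTable WITH CHECK
        ADD CONSTRAINT FK_tBigTable_SexCode
        FOREIGN KEY (SexCodeID) REFERENCES dbo.tSexCode (SexCodeID)
    GO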

    Has anyone done any timing comparisons on this issue?

  • This is a great discussion, and I share views similar to JRoughgarden's. But I wonder: even with a small number of values that never change, how do you achieve clarity when using a CHECK constraint? Take JRoughgarden's sex column with a check constraint - only the two values 'M' and 'F' are allowed, and it's pretty clear what sex a person is when someone looks at a row from this table. But then comes Joe Celko with his sex column: the ISO sex codes are {0 = unknown, 1 = male, 2 = female, 9 = lawful persons}. I think he always uses a CHECK constraint and not a lookup table for this case. On the front end you can use an enumeration with descriptions (e.g. a front end written in Java) to show what sex someone is. But if I go to the database and see the values 0, 1, 2, 9 in a row, I don't think I will know what they mean - how do you achieve clarity for cases like this? Do you say go look up the ISO sex codes on Google, or open the front-end application and search the enumeration to find the code descriptions?

    regards

  • Traditional Model:

    tEmployees (EmployeeID int,StatusID int)

    tEmployeesStatus (StatusID int,StatusDescription varchar(20))

    Proposed Model:

    tEmployees (EmployeeID int,StatusDescription varchar(20))

    tEmployeesStatus (StatusDescription varchar(20))

    I am not sure I understand this proposed model correctly. In the proposed model, I don't see any point in even creating the table tEmployeesStatus, as StatusDescription is the only column in that table and the StatusDescription has already been written to the table tEmployees.


    Sujeet Singh
