Multiple FK Tables or Single Table w/ Nullable/Empty Fields

  • I am struggling to decide on the best direction for one portion of a database schema and would appreciate some advice.

    I am essentially designing a table to store information about various locations. There are several fields that are applicable to all locations we will track -- e.g. name, category, various relevant dates and geographical information. However, there are several more fields that we would also like to track but that are only applicable to one specific category of location or another. The table(s) in question, after initial population, will be managed by a custom web-based application, and the database will support data analysis and visualization via Power BI as well as some automated processes in Alteryx. Changes to the data would only be made via the web tool and, in very limited circumstances, some mass data updates might need to be done via Alteryx.

    As best I can tell, there are two directions I can go:

    1. Set up one "Location" table that includes all fields for every category (Category A, B, C, and so forth). Then make the fields that are only relevant for certain categories nullable (or populate them with empty values... a separate discussion, I'm sure) and, at the application level, ensure only the fields relevant to the record's category are actually populated.
    2. Set up a "Location" table that includes only the fields relevant to all categories of locations. Then add a separate table for each category's additional fields, tie each back to the original table with a FK relationship, and populate each additional table only with data for locations of that category (rough sketches of both options follow below).
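    To illustrate, here is a rough sketch of both shapes in T-SQL. All table and column names here are just placeholders, not our real schema, and the tables are suffixed _Option1/_Option2 only so both sketches can coexist:

    -- Option 1: one wide table; category-specific columns are nullable.
    CREATE TABLE dbo.Location_Option1 (
        LocationID        INT IDENTITY(1,1) PRIMARY KEY,
        Name              NVARCHAR(200) NOT NULL,
        Category          CHAR(1)       NOT NULL,  -- 'A', 'B', 'C', ...
        OpenedDate        DATE          NOT NULL,
        Latitude          DECIMAL(9,6)  NOT NULL,
        Longitude         DECIMAL(9,6)  NOT NULL,
        WarehouseCapacity INT           NULL,      -- only used when Category = 'A'
        RetailSquareFeet  INT           NULL       -- only used when Category = 'B'
    );

    -- Option 2: a core table plus one extension table per category.
    CREATE TABLE dbo.Location_Option2 (
        LocationID INT IDENTITY(1,1) PRIMARY KEY,
        Name       NVARCHAR(200) NOT NULL,
        Category   CHAR(1)       NOT NULL,
        OpenedDate DATE          NOT NULL,
        Latitude   DECIMAL(9,6)  NOT NULL,
        Longitude  DECIMAL(9,6)  NOT NULL
    );

    CREATE TABLE dbo.LocationCategoryA (
        LocationID        INT NOT NULL PRIMARY KEY
            REFERENCES dbo.Location_Option2 (LocationID),
        WarehouseCapacity INT NOT NULL
    );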

    I can think of some benefits and costs of handling it each way, but I can't quite decide which way would be best. Any thoughts and advice are greatly appreciated.

  • Unfortunately, this is a very common question when modeling data, which means there are a few methodologies that various DBAs use to approach such a problem, without any real solid "yes" or "no" answer.

    I would say that if you intend to have an OLTP database that is maintained by an application, then moving towards a normalized approach is likely your best option. If you also plan to have an analysis component for BI purposes, then a separate OLAP database where the data is denormalized is likely the best approach there. On the OLAP side, that means instead of splitting the table up, you keep the table as one single entity. This is mainly because splitting the data up makes queries against that table more complicated and adds overhead, since every query now requires a JOIN. You will also likely find the secondary table more difficult to manage when you decide to mass update it in the future.
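    To illustrate the JOIN point with the placeholder tables sketched in the first post:

    -- Split design: category-specific data always needs a JOIN.
    SELECT l.Name, a.WarehouseCapacity
    FROM dbo.Location_Option2 AS l
    JOIN dbo.LocationCategoryA AS a
        ON a.LocationID = l.LocationID;

    -- Single-table design: one flat table, no JOIN.
    SELECT Name, WarehouseCapacity
    FROM dbo.Location_Option1
    WHERE Category = 'A';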

    When it comes to allowing or not allowing NULLs, regardless of the above approach, I should mention that while I personally have no real issue with them, the problem is that NULLs are too vague. Allowing NULLs means a NULL can mean anything to the end user, whether that user is human or not. You have no idea if the NULL is the result of the value being unknown, the application not yet having written the value, or an error having occurred.

    To give a good example, I left NULL values in our log-level data in the underlying fact table for BI purposes. The approach was to take another field on the same record and use a string splitter to parse out a string of data that was delimited by underscores. In one position of the string, end users on the front end would label it either _B_ or _NB_. In some cases the position of these values would change; the parsed value would then become NULL and give the illusion that the front-end users had not flagged the data, when in fact it could just mean the position of the flag was off, or there was an error in splitting the data correctly, or something else entirely. Retaining the NULL here was therefore not the best approach to ensure the end user understands the TRUE meaning of the value. I should have taken better care to translate the values in order to ensure the best insights for whoever finally uses the data.
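    As a rough sketch of what translating the values could look like (the table name and the underscore layout here are made up for the example):

    -- Derive an explicit label instead of leaving a bare NULL behind.
    -- In T-SQL, LIKE '[_]' matches a literal underscore.
    SELECT
        RawTag,
        CASE
            WHEN RawTag LIKE '%[_]NB[_]%' THEN 'NotFlagged'
            WHEN RawTag LIKE '%[_]B[_]%'  THEN 'Flagged'
            ELSE 'Unknown'  -- parse failed, position shifted, etc.
        END AS FlagStatus
    FROM dbo.LogFact;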

  • I like option 2 for a bunch of reasons. If there is flexibility in what is or is not included in a location, that means those values are not dependent on the primary key, and that pretty much means normalizing it. I'm not hung up on having NULL values in a column if they belong there. However, not having to deal with lots of NULLs is an advantage of breaking these columns out into another table. Joins are really not a big deal, and the optimizer deals well with them as long as the foreign keys are enforced using WITH CHECK.
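    For example (hypothetical names):

    -- WITH CHECK validates existing rows, so the constraint is trusted
    -- and the optimizer can rely on it (e.g. to eliminate a join).
    ALTER TABLE dbo.LocationDetail WITH CHECK
        ADD CONSTRAINT FK_LocationDetail_Location
        FOREIGN KEY (LocationID) REFERENCES dbo.Location (LocationID);

    -- Verify: is_not_trusted = 0 means the optimizer trusts it.
    SELECT name, is_not_trusted
    FROM sys.foreign_keys
    WHERE name = 'FK_LocationDetail_Location';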

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • xsevensinzx - Saturday, May 12, 2018 12:08 AM

    When it comes to allowing or not allowing NULLs, regardless of the above approach, I should mention that while I personally have no real issue with them, the problem is that NULLs are too vague. Allowing NULLs means a NULL can mean anything to the end user, whether that user is human or not. You have no idea if the NULL is the result of the value being unknown, the application not yet having written the value, or an error having occurred.

    The obvious response to that statement is something like: "Zero-length strings (what you called 'empty values') can mean anything to the end user, whether they are human or not. You have no idea if the empty value is the result of the column being inapplicable to this row, or unknown, or the application not yet having written the value, or an error having occurred." It's not a problem specific to NULLs; in fact, Date's "use a default value instead of NULL" idea is dangerous nonsense.

    However, I prefer to avoid having NULLs as data, and in particular I don't like rows with many NULLs or base tables with many nullable columns. So for the case described by Blake Treu I prefer to normalise and rely on the optimiser to do a good job on the joins this entails. The joins will be supported by foreign key constraints, with the references running from the tables holding data that is not used by all categories to the single table holding data used by all categories. These constraints should be kept valid automatically (using WITH CHECK; if this is not done then there is no guarantee of referential integrity, so only lip-service is paid to normalisation), and ideally should use ON DELETE CASCADE. ON UPDATE CASCADE will be needed too if the design allows the primary keys in these tables to be changed. Of course this sort of structure can have more than two levels, and letting it get too deep can lead to performance issues.
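    Something like this, for example (illustrative names only):

    -- Reference runs from the category-specific table to the common table.
    -- WITH CHECK keeps the constraint trusted; the CASCADE options keep
    -- the extension rows in step with their parent automatically.
    ALTER TABLE dbo.LocationCategoryB WITH CHECK
        ADD CONSTRAINT FK_LocationCategoryB_Location
        FOREIGN KEY (LocationID) REFERENCES dbo.Location (LocationID)
        ON DELETE CASCADE
        ON UPDATE CASCADE;  -- only needed if the parent key can change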

    Tom
