Auto-suggesting foreign keys and data model archaeology

  • I'm thinking about narrowing the tables with useful foreign keys, or at least ordering the analysis by row count.

    I've seen more than a few 500-table data models where, once I know where the data is, I only really care about 10-20 tables to start with.
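
    A quick way to triage a large model is to rank tables by row count before any deeper analysis. A minimal sketch for SQL Server (system views only; no table names assumed):

    ```sql
    -- Rank user tables by approximate row count so profiling effort
    -- can start with the largest tables first.
    SELECT s.name      AS schema_name,
           t.name      AS table_name,
           SUM(p.rows) AS row_count
    FROM sys.tables AS t
    JOIN sys.schemas AS s ON s.schema_id = t.schema_id
    JOIN sys.partitions AS p ON p.object_id = t.object_id
    WHERE p.index_id IN (0, 1)  -- 0 = heap, 1 = clustered index
    GROUP BY s.name, t.name
    ORDER BY row_count DESC;
    ```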

  • I assume the goal of inferring foreign key relationships is for documentation purposes, that is, when you're trying to puzzle out how to map from an unfamiliar source system into the data warehouse. Within the data warehouse model itself, at least for a star schema, it's common NOT to enforce foreign key constraints for performance reasons and instead to validate data in an interim staging model.

    Actually, if you're using Clustered ColumnStore for your table type in the DW, there is currently (as of 2016) a restriction whereby you can't define primary, unique, or foreign keys. That's a glaring technical limitation that I hope will be addressed in a future release.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • You didn't mention the Integration Services Data Profiling task (AFAICS). It does a lot of what you show out of the box, including candidate keys. It also gives tons of other information and great hints for CHECK constraints and the correct choices for data types and sizes. Additionally, you can spot issues with your data. Perhaps you think a column should be unique but discover that it is not? Perhaps an "always numeric" column occasionally has character data.

    And so much more!

    Gerald Britton, Pluralsight courses

  • Eric M Russell (1/27/2016)


    I assume the goal of inferring foreign key relationships is for documentation purposes...

    Actually FKs for the purpose of documentation only seems like a bad idea.

    The FK should be to define and enforce the data relationships.

    ATB

    Charles Kincaid

  • And Eric...how is it that foreign keys are a performance penalty, other than one time at insert? Don't they provide information to the optimizer to make queries faster? I would argue the most important task of a DBA is to guarantee good data. I've had to babysit other people's databases lacking constraints and it is a waste of a life.

  • Charles Kincaid (1/27/2016)


    Eric M Russell (1/27/2016)


    I assume the goal of inferring foreign key relationships is for documentation purposes...

    Actually FKs for the purpose of documentation only seems like a bad idea.

    The FK should be to define and enforce the data relationships.

    I'm talking about inferring (deducing based on evidence rather than explicit statements) logical foreign key relationships. You wouldn't want to go back and create declarative foreign keys on an unfamiliar or legacy source database, and for technical reasons you can't (or wouldn't want to) create them on the target data warehouse. For example, Clustered ColumnStore tables don't currently support key declarations of any type (other than the clustering key).

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
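
    That kind of static inference can be roughed out by matching column names and types against known primary-key columns in other tables. A heuristic sketch only; it will produce false positives that need human review:

    ```sql
    -- Suggest candidate foreign keys: any column whose name and base type
    -- match a primary-key column in a different table. Heuristic only.
    SELECT ft.name AS referencing_table,
           fc.name AS referencing_column,
           pt.name AS referenced_table
    FROM sys.columns AS fc
    JOIN sys.tables  AS ft ON ft.object_id = fc.object_id
    JOIN sys.indexes AS pi ON pi.is_primary_key = 1
    JOIN sys.index_columns AS ic
         ON ic.object_id = pi.object_id AND ic.index_id = pi.index_id
    JOIN sys.columns AS pc
         ON pc.object_id = ic.object_id AND pc.column_id = ic.column_id
    JOIN sys.tables AS pt ON pt.object_id = pc.object_id
    WHERE fc.name = pc.name                      -- same column name
      AND fc.system_type_id = pc.system_type_id  -- same base type
      AND ft.object_id <> pt.object_id           -- exclude self-references
    ORDER BY pt.name, ft.name;
    ```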

  • g.britton (1/27/2016)


    You didn't mention the Integration Services Data Profiling task (AFAICS).

    Good suggestion. Why not submit an article on the use of Data Profiler?

    I've not used the SSIS data profiler but have used the Ab Initio equivalent, and it is indeed a very powerful tool. What I have noted with data profilers per se is that they hit the databases particularly hard if you want an in-depth analysis.

    Ab Initio does a lot of cross-correlations, and even when using data in its own compressed multi-file storage it hits the resources hard. Given that Ab Initio is incredibly efficient, that tells you something about the inherent size of the task.

  • Bill Talada (1/27/2016)


    And Eric...how is it that foreign keys are a performance penalty, other than one time at insert? Don't they provide information to the optimizer to make queries faster? I would argue the most important task of a DBA is to guarantee good data. I've had to babysit other people's databases lacking constraints and it is a waste of a life.

    Foreign keys are essential for OLTP databases. If you have a normalized transactional database in production, then you want foreign keys to ensure data quality and optimal performance of queries.

    However, the article by David describes a scenario where he is migrating data from an unfamiliar (probably legacy and retired?) OLTP database into a data warehouse. The OLTP database doesn't have any FKs declared, which hinders the understandability of the source data model and the ETL mapping effort. At least that's my read of it.

    In a data warehouse (OLAP / star schema model), foreign keys (as well as primary keys and non-clustered indexes) can get in the way of minimally logged bulk inserts and usage of Clustered ColumnStore. Optimization of data loads is more important than optimization of joins.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
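
    The load-first, validate-later pattern described above can be sketched as follows. All table names here are hypothetical, and minimal logging also depends on recovery model and table structure:

    ```sql
    -- Sketch of a load-then-validate pattern (table names are hypothetical).
    -- 1. Bulk-load the fact table; TABLOCK is one prerequisite for
    --    minimal logging (recovery model and indexes also matter).
    INSERT INTO dbo.FactSales WITH (TABLOCK)
    SELECT * FROM staging.Sales;

    -- 2. Validate referential integrity afterwards, instead of relying
    --    on an FK constraint: find staging rows whose CustomerKey has
    --    no match in the dimension.
    SELECT s.*
    FROM staging.Sales AS s
    LEFT JOIN dbo.DimCustomer AS c
           ON c.CustomerKey = s.CustomerKey
    WHERE c.CustomerKey IS NULL;
    ```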

  • It's migration per se. It could be for the purpose of absorbing the data into the data warehouse, or of replacing the legacy system.

    The overarching idea is to gain knowledge of the system which may or may not be disused.

    Just so everyone is clear, I've identified one way of solving this problem. I am not claiming it is the best solution. There's definitely scope for someone to write a counter-article.

  • When inheriting a legacy database with little documentation, one thing I do is start creating views as I learn the purpose and usage of the tables. For example, I may create a view called vCustomersWithPendingOrders. These views are not used by any business application or for reporting; they're strictly for documentation purposes and my own ad-hoc querying, and I may keep them in a separate database with limited accessibility. This is especially useful when the database is modeled entity-attribute-value style or has cryptic naming conventions.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
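
    The documentation-view idea is easy to illustrate; the table, column, and status-code names below are invented for the example:

    ```sql
    -- A "documentation view": its name and join logic record what was
    -- learned about the legacy schema (all names are hypothetical).
    CREATE VIEW doc.vCustomersWithPendingOrders
    AS
    SELECT c.CustomerId, c.CustomerName, o.OrderId, o.OrderDate
    FROM dbo.Customer AS c
    JOIN dbo.[Order]  AS o ON o.CustomerId = c.CustomerId
    WHERE o.StatusCode = 'P';  -- 'P' assumed to mean "pending"
    ```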

  • This is an interesting related script:

    2012:

    http://www.sqlservercentral.com/scripts/integrity/136833/

    2008:

    http://www.sqlservercentral.com/Forums/Topic1754876-3745-1.aspx#bm1757302

    412-977-3526 call/text

  • Eric M Russell (1/26/2016)


    Given a database with no defined foreign key constraints, analyzing schema information is essentially inferring relationships by static analysis. Another approach is looking at join operators within stored procedures or even cached sql execution plans.

    That's exactly what we had to do when moving a software package and database to a whole new environment and structure.
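
    Mining module definitions for joins, as Eric suggests, can start from something as simple as a text search over sys.sql_modules; the join predicates inside the hits often reveal undeclared relationships:

    ```sql
    -- Find modules (procedures, functions, views) whose definitions
    -- contain JOINs. Crude text match -- it will also catch words
    -- like "JOINED", so treat the results as leads, not answers.
    SELECT OBJECT_SCHEMA_NAME(m.object_id) AS schema_name,
           OBJECT_NAME(m.object_id)        AS module_name
    FROM sys.sql_modules AS m
    WHERE m.definition LIKE '%JOIN%'
    ORDER BY schema_name, module_name;
    ```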

  • The purpose of constraints is to improve data integrity.  By "integrity" I mean accuracy, internal consistency, and faithful assertion of the business requirements which the database serves.

    So, I'm pleased you included the word "suggesting" in your title.  The best way (only way, perhaps) to ascertain business requirements is through dialog with the stakeholders.

    First of all, it's not sufficient to determine a primary key constraint--all (minimal) unique keys that are important to the business should be identified and enforced with constraints.  While choosing the best candidate for the primary key is perhaps a matter of technical preference, identifying the candidates certainly is not.

    Relying on data types, column names, and content patterns assumes the database was designed with best practices and already exhibits a high degree of integrity.  That's seldom the situation I encounter.

    FWIW, your scripts are interesting and useful.  But I would rather see a process that agrees upon appropriate constraints in logical dialog with the stakeholders, and scripts that subsequently expose missing database constraints and content violations.
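
    The "expose content violations" scripts could start with a check that an agreed-upon candidate key actually holds in the data. Table and column names below are placeholders for whatever the stakeholders agreed on:

    ```sql
    -- Check whether an agreed candidate key is actually unique in the data
    -- (table and column names are placeholders for the real ones).
    SELECT CustomerEmail, COUNT(*) AS duplicate_count
    FROM dbo.Customer
    GROUP BY CustomerEmail
    HAVING COUNT(*) > 1;  -- any rows returned are uniqueness violations
    ```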

Viewing 13 posts - 16 through 27 (of 27 total)
