Auto-suggesting foreign keys and data model archaeology

  • Robert Sterbal-482516

    SSCrazy

    Points: 2784

    I'm thinking about narrowing the tables with useful foreign keys, or at least ordering the analysis by row count.

    I've seen more than a few 500 table data models where I only really care about 10-20 tables to start once I know where the data is.

  • Eric M Russell

    SSC Guru

    Points: 125010

    I assume the goal of inferring foreign key relationships is for documentation purposes, that is when you're trying the puzzle out how to map from an unfamiliar source system into the data warehouse. Within the data warehouse model itself, at least for a star schema, it's common NOT to enforce foreign key constraints for performance reasons and instead to validate data in an interim staging model.

    Actually, if you're using Clustered ColumnStore for your table type in the DW, there is currently (as of 2016) a restriction whereby you can't define primary, unique, or foreign keys. That's a glaring technical limitation that I hope will be addressed in a future release.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • g.britton

    SSChampion

    Points: 13685

    You didn't mention the Integration Services Data Profiling task (AFAICS). It does a lot of what you show out of the box, including candidate keys. It also gives tons of other information and great hints for CHECK constraints and the correct choices for data types and sizes.Additionally, you can spot issues with your data. Perhaps you think a column should be unique but discover that it is not? Perhaps an "always numeric" column occasionally has character data.

    And so much more!

    Gerald Britton, MCSE-DP, MVPToronto PASS Chapter[/url]

  • Charles Kincaid

    SSChampion

    Points: 13593

    Eric M Russell (1/27/2016)


    I assume the goal of inferring foreign key relationships is for documentation purposes...

    Actually FKs for the purpose of documentation only seems like a bad idea.

    The FK should be to define and enforce the data relationships.

    ATBCharles Kincaid

  • Bill Talada

    SSChampion

    Points: 11956

    And Eric...how is it that foreign keys are a performance penalty, other than one time at insert? Don't they provide information to the optimizer to make queries faster? I would argue the most important task of a DBA is to guarantee good data. I've had to babysit other people's databases lacking constraints and it is a waste of a life.

  • Eric M Russell

    SSC Guru

    Points: 125010

    Charles Kincaid (1/27/2016)


    Eric M Russell (1/27/2016)


    I assume the goal of inferring foreign key relationships is for documentation purposes...

    Actually FKs for the purpose of documentation only seems like a bad idea.

    The FK should be to define and enforce the data relationships.

    I'm talking about inferring (deducing based on evidence rather than explicit statements) logical foreign key relationships. You wouldn't want to go back and create declarative foreign keys on an unfamiliar or legacy source database, and for technical reasons you can't (or wouldn't want to) create them on the target data warehouse. For example, Clustered ColumnStore tables don't currently support key declarations of any type (other than the clustering key).

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • David.Poole

    SSC Guru

    Points: 75183

    g.britton (1/27/2016)


    You didn't mention the Integration Services Data Profiling task (AFAICS).

    Good suggestion. Why not submit an article on the use of Data Profiler?

    I've not used the SSIS data profiler but have used the Ab Initio equivalent and it is indeed a very powerful tool. What I have noted with data profilers per se is that they hit the databases particularly hard if you want an in depth analysis.

    Ab Initio does a lot of cross correlations and even when using data in its own compressed multi-file storage it hits the resources hard. Given that Ab Initio is incredibly efficient that tells you something about the inherent size of the task

  • Eric M Russell

    SSC Guru

    Points: 125010

    Bill Talada (1/27/2016)


    And Eric...how is it that foreign keys are a performance penalty, other than one time at insert? Don't they provide information to the optimizer to make queries faster? I would argue the most important task of a DBA is to guarantee good data. I've had to babysit other people's databases lacking constraints and it is a waste of a life.

    Foreign keys are essential for OLTP databases. If you have a normalized transactional database in production, then you want foreign keys to insure data quality and optimal performance of

    However, the artical by David describes a scenario where he is migrating data from an unfamiliar (probably legacy and retired?) OLTP database into a data warehouse. The OLTP doesn't have any FK declared, which hinders the understandability of the source data model and ETL mapping effort. At least that my read of it.

    In a data warehouse (OLAP / star schema model), foreign keys (as well as primary keys and non-clustered indexes) can get in the way of minimally logged bulk inserts and usage of Clustered ColumnStore. Optimization of data loads is more important than optimization of joins.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • David.Poole

    SSC Guru

    Points: 75183

    It's migration per se. Could be for the purposes of absorption into the data warehouse or to replace the legacy system.

    The overarching idea is to gain knowledge of the system which may or may not be disused.

    Just so everyone is clear I've identified one way of solving this problem. I am not claiming it is the best solution. There's definitely scope for someone to write a counter article

  • Eric M Russell

    SSC Guru

    Points: 125010

    When inheriting a legacy database with little documentation, one thing I do is start creating views as I learn the purpose and usage of the tables. For example, I may create a view called vCustomersWithPendingOrders. These views are not used by any business application or for reporting; they're strictly for documentation purposes and my own ad-hoc querying, and I may contain them in a separate database with limited accessibility. This is especially useful when the database is modeled entity-attribute-value style or has cryptic naming conventions.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Robert Sterbal

    SSChampion

    Points: 10954

    This is an interesting related script:

    2012:

    http://www.sqlservercentral.com/scripts/integrity/136833/

    2008:

    http://www.sqlservercentral.com/Forums/Topic1754876-3745-1.aspx#bm1757302

    412-977-3526 call/text

  • akljfhnlaflkj

    SSC Guru

    Points: 76202

    Eric M Russell (1/26/2016)


    Given a database with no defined foreign key constraints, analyzing schema information is essentially inferring relationships by static analysis. Another approach is looking at join operators within stored procedures or even cached sql execution plans.

    That's exactly what we had to do when moving a software package and database to a whole new environment and structure.

  • Dennis Q Miller

    Old Hand

    Points: 366

    The purpose of constraints is to improve data integrity.  By "integrity" I mean accuracy, internal consistency, and faithful assertion of business requirements for which the database serves.

    So, I'm pleased you included the word "suggesting" in your title.  The best way (only way, perhaps) to ascertain business requirements is through dialog with the stake holders.

    First of all, it's not sufficient to determine a primary key constraint--all (minimal) unique keys that are important to the business should be identified and enforced with constraints.  While choosing which is the best candidate for the primary key, is perhaps a matter of technical preference, identifying the candidates certainly is not.

    Relying on data types, column names, and content patterns assumes the database was designed with best practices and already exhibits a high degree of integrity.  That's seldom the situation I encounter.

    FWIW, your scripts are interesting and useful.  But, I would rather see a process that agrees upon appropriate constraints in logical dialog with the stake holders and scripts that subsequently expose missing database constraints and content violations.

Viewing 13 posts - 16 through 28 (of 28 total)

You must be logged in to reply to this topic. Login to reply