Auto-suggesting foreign keys and data model archaeology

  • I'm thinking about narrowing the tables with useful foreign keys, or at least ordering the analysis by row count.

    I've seen more than a few 500-table data models where, once I know where the data is, I only really care about 10-20 tables to start with.
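
    A quick way to triage a large model is to rank tables by row count before any deeper analysis. A minimal sketch for SQL Server (system views only; no table names assumed):

    ```sql
    -- Rank user tables by approximate row count so profiling effort
    -- can start with the largest tables first.
    SELECT s.name      AS schema_name,
           t.name      AS table_name,
           SUM(p.rows) AS row_count
    FROM sys.tables AS t
    JOIN sys.schemas AS s ON s.schema_id = t.schema_id
    JOIN sys.partitions AS p ON p.object_id = t.object_id
    WHERE p.index_id IN (0, 1)  -- 0 = heap, 1 = clustered index
    GROUP BY s.name, t.name
    ORDER BY row_count DESC;
    ```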

  • I assume the goal of inferring foreign key relationships is for documentation purposes, that is, when you're trying to puzzle out how to map from an unfamiliar source system into the data warehouse. Within the data warehouse model itself, at least for a star schema, it's common NOT to enforce foreign key constraints for performance reasons and instead to validate data in an interim staging model.

    Actually, if you're using Clustered ColumnStore for your table type in the DW, there is currently (as of 2016) a restriction whereby you can't define primary, unique, or foreign keys. That's a glaring technical limitation that I hope will be addressed in a future release.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • You didn't mention the Integration Services Data Profiling task (AFAICS). It does a lot of what you show out of the box, including candidate keys. It also gives tons of other information and great hints for CHECK constraints and the correct choices for data types and sizes. Additionally, you can spot issues with your data. Perhaps you think a column should be unique but discover that it is not? Perhaps an "always numeric" column occasionally has character data.

    And so much more!

    Gerald Britton, Pluralsight courses

  • Eric M Russell (1/27/2016)


    I assume the goal of inferring foreign key relationships is for documentation purposes...

    Actually FKs for the purpose of documentation only seems like a bad idea.

    The FK should be to define and enforce the data relationships.

    ATB

    Charles Kincaid

  • And Eric...how is it that foreign keys are a performance penalty, other than one time at insert? Don't they provide information to the optimizer to make queries faster? I would argue the most important task of a DBA is to guarantee good data. I've had to babysit other people's databases lacking constraints and it is a waste of a life.

  • Charles Kincaid (1/27/2016)


    Eric M Russell (1/27/2016)


    I assume the goal of inferring foreign key relationships is for documentation purposes...

    Actually FKs for the purpose of documentation only seems like a bad idea.

    The FK should be to define and enforce the data relationships.

    I'm talking about inferring (deducing based on evidence rather than explicit statements) logical foreign key relationships. You wouldn't want to go back and create declarative foreign keys on an unfamiliar or legacy source database, and for technical reasons you can't (or wouldn't want to) create them on the target data warehouse. For example, Clustered ColumnStore tables don't currently support key declarations of any type (other than the clustering key).

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
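
    That kind of static inference can be roughed out by matching column names and types against known primary-key columns in other tables. A heuristic sketch only; it will produce false positives that need human review:

    ```sql
    -- Suggest candidate foreign keys: any column whose name and base type
    -- match a primary-key column in a different table. Heuristic only.
    SELECT ft.name AS referencing_table,
           fc.name AS referencing_column,
           pt.name AS referenced_table
    FROM sys.columns AS fc
    JOIN sys.tables  AS ft ON ft.object_id = fc.object_id
    JOIN sys.indexes AS pi ON pi.is_primary_key = 1
    JOIN sys.index_columns AS ic
         ON ic.object_id = pi.object_id AND ic.index_id = pi.index_id
    JOIN sys.columns AS pc
         ON pc.object_id = ic.object_id AND pc.column_id = ic.column_id
    JOIN sys.tables AS pt ON pt.object_id = pc.object_id
    WHERE fc.name = pc.name                      -- same column name
      AND fc.system_type_id = pc.system_type_id  -- same base type
      AND ft.object_id <> pt.object_id           -- exclude self-references
    ORDER BY pt.name, ft.name;
    ```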

  • g.britton (1/27/2016)


    You didn't mention the Integration Services Data Profiling task (AFAICS).

    Good suggestion. Why not submit an article on the use of Data Profiler?

    I've not used the SSIS data profiler but have used the Ab Initio equivalent, and it is indeed a very powerful tool. What I have noted with data profilers per se is that they hit the databases particularly hard if you want an in-depth analysis.

    Ab Initio does a lot of cross-correlations, and even when using data in its own compressed multi-file storage it hits the resources hard. Given that Ab Initio is incredibly efficient, that tells you something about the inherent size of the task.

  • Bill Talada (1/27/2016)


    And Eric...how is it that foreign keys are a performance penalty, other than one time at insert? Don't they provide information to the optimizer to make queries faster? I would argue the most important task of a DBA is to guarantee good data. I've had to babysit other people's databases lacking constraints and it is a waste of a life.

    Foreign keys are essential for OLTP databases. If you have a normalized transactional database in production, then you want foreign keys to ensure data quality and optimal performance of queries.

    However, the article by David describes a scenario where he is migrating data from an unfamiliar (probably legacy and retired?) OLTP database into a data warehouse. The OLTP database doesn't have any FKs declared, which hinders the understandability of the source data model and the ETL mapping effort. At least that's my read of it.

    In a data warehouse (OLAP / star schema model), foreign keys (as well as primary keys and non-clustered indexes) can get in the way of minimally logged bulk inserts and usage of Clustered ColumnStore. Optimization of data loads is more important than optimization of joins.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
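
    The load-first, validate-later pattern described above can be sketched as follows. All table names here are hypothetical, and minimal logging also depends on recovery model and table structure:

    ```sql
    -- Sketch of a load-then-validate pattern (table names are hypothetical).
    -- 1. Bulk-load the fact table; TABLOCK is one prerequisite for
    --    minimal logging (recovery model and indexes also matter).
    INSERT INTO dbo.FactSales WITH (TABLOCK)
    SELECT * FROM staging.Sales;

    -- 2. Validate referential integrity afterwards, instead of relying
    --    on an FK constraint: find staging rows whose CustomerKey has
    --    no match in the dimension.
    SELECT s.*
    FROM staging.Sales AS s
    LEFT JOIN dbo.DimCustomer AS c
           ON c.CustomerKey = s.CustomerKey
    WHERE c.CustomerKey IS NULL;
    ```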

  • It's migration per se. It could be for the purpose of absorbing the data into the data warehouse, or of replacing the legacy system.

    The overarching idea is to gain knowledge of the system which may or may not be disused.

    Just so everyone is clear, I've identified one way of solving this problem. I am not claiming it is the best solution. There's definitely scope for someone to write a counter-article.

  • When inheriting a legacy database with little documentation, one thing I do is start creating views as I learn the purpose and usage of the tables. For example, I may create a view called vCustomersWithPendingOrders. These views are not used by any business application or for reporting; they're strictly for documentation purposes and my own ad-hoc querying, and I may keep them in a separate database with limited accessibility. This is especially useful when the database is modeled entity-attribute-value style or has cryptic naming conventions.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
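
    The documentation-view idea is easy to illustrate; the table, column, and status-code names below are invented for the example:

    ```sql
    -- A "documentation view": its name and join logic record what was
    -- learned about the legacy schema (all names are hypothetical).
    CREATE VIEW doc.vCustomersWithPendingOrders
    AS
    SELECT c.CustomerId, c.CustomerName, o.OrderId, o.OrderDate
    FROM dbo.Customer AS c
    JOIN dbo.[Order]  AS o ON o.CustomerId = c.CustomerId
    WHERE o.StatusCode = 'P';  -- 'P' assumed to mean "pending"
    ```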

  • This is an interesting related script:

    2012:

    http://www.sqlservercentral.com/scripts/integrity/136833/

    2008:

    http://www.sqlservercentral.com/Forums/Topic1754876-3745-1.aspx#bm1757302

    412-977-3526 call/text

  • Eric M Russell (1/26/2016)


    Given a database with no defined foreign key constraints, analyzing schema information is essentially inferring relationships by static analysis. Another approach is looking at join operators within stored procedures or even cached sql execution plans.

    That's exactly what we had to do when moving a software package and database to a whole new environment and structure.
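
    Mining module definitions for joins, as Eric suggests, can start from something as simple as a text search over sys.sql_modules; the join predicates inside the hits often reveal undeclared relationships:

    ```sql
    -- Find modules (procedures, functions, views) whose definitions
    -- contain JOINs. Crude text match -- it will also catch words
    -- like "JOINED", so treat the results as leads, not answers.
    SELECT OBJECT_SCHEMA_NAME(m.object_id) AS schema_name,
           OBJECT_NAME(m.object_id)        AS module_name
    FROM sys.sql_modules AS m
    WHERE m.definition LIKE '%JOIN%'
    ORDER BY schema_name, module_name;
    ```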

  • The purpose of constraints is to improve data integrity.  By "integrity" I mean accuracy, internal consistency, and faithful assertion of the business requirements which the database serves.

    So, I'm pleased you included the word "suggesting" in your title.  The best way (only way, perhaps) to ascertain business requirements is through dialog with the stakeholders.

    First of all, it's not sufficient to determine a primary key constraint--all (minimal) unique keys that are important to the business should be identified and enforced with constraints.  While choosing the best candidate for the primary key is perhaps a matter of technical preference, identifying the candidates certainly is not.

    Relying on data types, column names, and content patterns assumes the database was designed with best practices and already exhibits a high degree of integrity.  That's seldom the situation I encounter.

    FWIW, your scripts are interesting and useful.  But I would rather see a process that agrees upon appropriate constraints in logical dialog with the stakeholders, and scripts that subsequently expose missing database constraints and content violations.
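
    The "expose content violations" scripts could start with a check that an agreed-upon candidate key actually holds in the data. Table and column names below are placeholders for whatever the stakeholders agreed on:

    ```sql
    -- Check whether an agreed candidate key is actually unique in the data
    -- (table and column names are placeholders for the real ones).
    SELECT CustomerEmail, COUNT(*) AS duplicate_count
    FROM dbo.Customer
    GROUP BY CustomerEmail
    HAVING COUNT(*) > 1;  -- any rows returned are uniqueness violations
    ```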

Viewing 13 posts - 16 through 27 (of 27 total)
