Auto-suggesting foreign keys and data model archaeology

  • Comments posted to this topic are about the item Auto-suggesting foreign keys and data model archaeology

  • Wow, lots of work to try to programmatically figure out keys!

    Thank heavens other companies don't use any constraints, unique indexes, or normalization; they eventually send customers my way.

    Having done hundreds of imports and conversions, I don't assume matching column names or matching data types in commercial software. I do assume orphaned data is present that will prevent me from applying suggested primary, alternate, and foreign keys. The process of figuring out a non-normalized database is even more painful. Being human, I can spot patterns that computers can't quite match yet. For now I'll stick to doing this manually.

  • Given a database with no defined foreign key constraints, analyzing schema information is essentially inferring relationships by static analysis. Another approach is looking at join operators within stored procedures or even cached SQL execution plans (a rough sketch of both ideas follows at the end of this post).

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • My worst conversion/import was from a database using a couple of generic tables; most other tables were defined as data within those generic tables. This metadata-style design had to have come from application developers. The performance must have been horrendous.

    Functions, views, stored procedures, and even printed reports sometimes need to be used to try to reverse-engineer a mess. It is a shame. A properly designed database should be self-evident in how it is to be used.

  • The suggestion of using query plans to determine the columns used in joins is definitely a worthwhile approach.

    This article was great and I wish developers especially would think more about foreign keys.

    This may have been hinted at in the article, but use caution when 'refactoring' an existing database that is used by a front-end application. This may be obvious to some, but often the data integrity responsibility (incorrectly) gets placed in the application itself. So keep that in mind and don't rush out to reinvent the wheel. That said, definitely try to reinvent it where necessary.

  • Rarely, if ever, will query plans be available for a database handed over to a data warehouse team or for importing. Otherwise, running Profiler or looking at DMVs might help (a sketch of the DMV approach follows at the end of this post).

    The worst case is where almost every table contains PK and FK columns called ID or similar that could potentially join anywhere.

    The best case is when GUIDs have been used. It is so much easier to trust a join on GUIDs than when identities are used and every table has the values 1 through 100 in several key columns.
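
    Where the source server is reachable, the DMV route can be as light as reading the plan cache. A minimal sketch (the table-name filter is just a placeholder):

        -- Cached plans whose text mentions a table of interest; the plan XML
        -- can then be inspected for the join columns.
        SELECT TOP (50)
                st.text      AS query_text,
                qp.query_plan,
                cp.usecounts
        FROM    sys.dm_exec_cached_plans AS cp
        CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle)   AS st
        CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
        WHERE   st.text LIKE '%Customer%'           -- placeholder table name
        ORDER BY cp.usecounts DESC;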

  • https://msdn.microsoft.com/en-us/library/ms186778.aspx

    The information schema views included in SQL Server comply with the ISO standard definition for the INFORMATION_SCHEMA.

    Seems like it might be worth the effort to understand this better.

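    As an example of what these views expose, a quick query that lists whatever keys have actually been declared (often the starting point before any guessing is needed):

        -- Declared primary, foreign and unique keys from the ISO-standard views.
        SELECT  tc.TABLE_SCHEMA,
                tc.TABLE_NAME,
                tc.CONSTRAINT_TYPE,
                kcu.COLUMN_NAME,
                kcu.ORDINAL_POSITION
        FROM    INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS tc
        JOIN    INFORMATION_SCHEMA.KEY_COLUMN_USAGE  AS kcu
                  ON  kcu.CONSTRAINT_SCHEMA = tc.CONSTRAINT_SCHEMA
                  AND kcu.CONSTRAINT_NAME   = tc.CONSTRAINT_NAME
        WHERE   tc.CONSTRAINT_TYPE IN ('PRIMARY KEY', 'FOREIGN KEY', 'UNIQUE')
        ORDER BY tc.TABLE_SCHEMA, tc.TABLE_NAME, kcu.ORDINAL_POSITION;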

  • The trick here was to try to encapsulate what a human would do but in code.

    If you do this sort of thing once in a blue moon then manual approaches are as good as any. As to using execution plans, what would be your recommendation for doing so with minimal impact on production systems?

  • Wouldn't it make more sense to capture a trace and replay it on a non-production instance?
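
    If capturing is an option, an Extended Events session is one low-overhead way to collect the statements for later analysis or for re-running against a restored copy. This is only a sketch; the session name, database filter and file name are placeholders:

        -- Capture ad hoc batches and RPC calls from one database to a file target.
        CREATE EVENT SESSION capture_workload ON SERVER
        ADD EVENT sqlserver.sql_batch_completed (
            ACTION (sqlserver.sql_text)
            WHERE (sqlserver.database_name = N'SourceDb')),
        ADD EVENT sqlserver.rpc_completed (
            ACTION (sqlserver.sql_text)
            WHERE (sqlserver.database_name = N'SourceDb'))
        ADD TARGET package0.event_file (SET filename = N'capture_workload.xel')
        WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);

        ALTER EVENT SESSION capture_workload ON SERVER STATE = START;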


  • Bill Talada (1/26/2016)


    Rarely, if ever, will query plans be available for a database handed over to a data warehouse team or for importing. Otherwise, running Profiler or looking at DMVs might help.

    The worst case is where almost every table contains PK and FK columns called ID or similar that could potentially join anywhere.

    The best case is when GUIDs have been used. It is so much easier to trust a join on GUIDs than when identities are used and every table has the values 1 through 100 in several key columns.

    I'm not sure I agree with the 'Rarely if ever' part of your statement.

    This may have been true in the past...

    However, since the Management Data Warehouse was added to SQL Server many years ago, it's far more common to have a pool of plans to draw upon. This will only become more true going forward with SQL Server 2016 and the Query Store (a sketch of querying it follows at the end of this post).

    While I understand not having plans for a database if you do not have access to the server with the production workload, I do not think that poor planning and administration practices are a reason to ignore a powerful approach to the problem.

    This is also one of the first times I have seen 'the best case is when GUIDs have been used'; usually, for performance reasons, I've found the opposite to be true. I do think you make a valid point here, though, for identifying uniqueness.
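
    For what it's worth, on a 2016+ instance with the Query Store enabled, pulling the retained plans back out is roughly this simple (a sketch against the Query Store catalog views):

        -- Plans retained by the Query Store; the plan XML survives plan-cache
        -- clears and restarts, so it can be mined for join columns later.
        SELECT  q.query_id,
                qt.query_sql_text,
                p.plan_id,
                TRY_CONVERT(xml, p.query_plan) AS query_plan
        FROM    sys.query_store_query      AS q
        JOIN    sys.query_store_query_text AS qt ON qt.query_text_id = q.query_text_id
        JOIN    sys.query_store_plan       AS p  ON p.query_id       = q.query_id;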

  • I certainly agree with using all available sources of help when deciphering a database's keys. My assumption, based on the article, is that you restore the database to a local, secure host before starting work on it. I don't know the Management Data Warehouse product at all, so I can't say whether it saves plans in a database backup or not.

    As for GUIDs, I switched all int keys to GUIDs on a large database and got a general 20% improvement in speed. I'd guess that a 64-bit OS and today's CPUs are faster with GUIDs than they are with 16- or 32-bit ints. It would be nice to see how a controlled test compares to my empirical results; a bare-bones test harness is sketched at the end of this post.

    I used to believe all the old wives' tales told on this site.
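
    For anyone who wants to run such a controlled test, something along these lines would do (table names and the one-million row count are arbitrary, and a serious test would also vary row width, fill factor and cache state):

        SET NOCOUNT ON;

        -- Identical parent/child data keyed once by INT and once by UNIQUEIDENTIFIER.
        CREATE TABLE dbo.ParentInt  (Id int              NOT NULL PRIMARY KEY);
        CREATE TABLE dbo.ParentGuid (Id uniqueidentifier NOT NULL PRIMARY KEY);
        CREATE TABLE dbo.ChildInt   (Id int IDENTITY PRIMARY KEY, ParentId int              NOT NULL);
        CREATE TABLE dbo.ChildGuid  (Id int IDENTITY PRIMARY KEY, ParentId uniqueidentifier NOT NULL);

        INSERT dbo.ParentInt (Id)
        SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM   (SELECT TOP (1000000) 1 AS x
                FROM sys.all_columns AS a CROSS JOIN sys.all_columns AS b) AS n;

        INSERT dbo.ParentGuid (Id)      SELECT NEWID() FROM dbo.ParentInt;
        INSERT dbo.ChildInt  (ParentId) SELECT Id      FROM dbo.ParentInt;
        INSERT dbo.ChildGuid (ParentId) SELECT Id      FROM dbo.ParentGuid;

        -- Compare elapsed/CPU time of the same join on each key type.
        SET STATISTICS TIME ON;
        SELECT COUNT(*) FROM dbo.ChildInt  AS c JOIN dbo.ParentInt  AS p ON p.Id = c.ParentId;
        SELECT COUNT(*) FROM dbo.ChildGuid AS c JOIN dbo.ParentGuid AS p ON p.Id = c.ParentId;
        SET STATISTICS TIME OFF;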

  • Bill Talada (1/26/2016)


    As for GUIDs, I switched all int keys to GUIDs on a large database and got a general 20% improvement in speed. I'd guess that a 64-bit OS and today's CPUs are faster with GUIDs than they are with 16- or 32-bit ints. It would be nice to see how a controlled test compares to my empirical results.

    Definitely sounds like a good candidate for an article.

    GUIDs are 128-bit and BIGINTs are 64-bit, so I would be curious to know whether you find GUIDs offer an advantage over BIGINTs.

  • Great article, David.

    It shows how difficult it can be to come up with ways to automate what a human should be doing.

    ATB, Charles Kincaid

  • Is there a useful way to include row counts in your process?


  • Row counts are in the tables created in the Validate schema.

    For approximate row counts on tables, the safety-check query uses the sys.partitions system view. This relies on statistics being up to date.
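
    For reference, the shape of that sys.partitions query is roughly this (a sketch, not the exact query from the article):

        -- Approximate row counts per table (index_id 0 = heap, 1 = clustered index).
        SELECT  s.name      AS schema_name,
                t.name      AS table_name,
                SUM(p.rows) AS approx_rows
        FROM    sys.tables     AS t
        JOIN    sys.schemas    AS s ON s.schema_id = t.schema_id
        JOIN    sys.partitions AS p ON p.object_id = t.object_id
        WHERE   p.index_id IN (0, 1)
        GROUP BY s.name, t.name
        ORDER BY approx_rows DESC;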

