Guessing PKs on existing tables

  • Hi all,

    I've inherited a database in which most tables don't have primary keys declared. The entire database is just tables with columns and non-clustered indexes. The problem is, we have an SSIS import that truncates the tables and then inserts new data, and it runs for a long time. I'm positive I could greatly reduce the database refresh time by replacing full table re-downloads with incrementals. For this, I need to figure out which columns to use for primary keys. The source tables don't have PKs declared either. I can sort of guess which columns might be good candidates for primary keys, but I'm looking for ideas. The two approaches I'm thinking about are select distinct and select/group by. The only problem is some of the tables are a good size (10-100 mil rows). Ideas?

    You could start by taking a random sample from each table into a temp table to use for initial screening:

    select top 10000 * into #t from MyTable order by newid()

    This will let you run queries against the small sample table to look for potential single-column primary keys:

    select Col1, count(*) from #t group by Col1 having count(*) > 1

    select Col2, count(*) from #t group by Col2 having count(*) > 1

    Once you have identified potential primary key columns, you can run the same queries against the main table to verify they are unique.
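
    For that verification step, a minimal sketch of the full-table check, with MyTable and Col1 standing in for the real names: if the two counts match, the column is unique across the whole table (note that count(distinct ...) ignores NULLs, so check for those separately).

    select count(*) as TotalRows, count(distinct Col1) as DistinctCol1 from MyTable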

  • If you're looking for natural PKs (which is what it seems you are doing), the definition will depend entirely on the nature of the data itself.

    Are there columns that the other columns depend on for part of their meaning? Those are your candidate keys. One or more of those will be the best PK, depending on uniqueness and completeness.

    For example, in a table of addresses, the city depends on the state. The city doesn't depend on the Zip code (there can be more than one city in a Zip and more than one Zip in a city), but the Zip is limited to one state by postal regulations, even for cities that cross state lines. The street is also independent of the city and the Zip, and in some cases the street can even be in multiple states, because streets can cross those boundaries. Those are your dependencies.

    Street address and Zip code is the minimum set that defines the rest of it, and is unique. Some cities have more than one mailable name, and many city names appear in more than one state, so anything that includes city name isn't really a good key unless it also includes the state and the Zip code. Far simpler to define the key as street address and Zip code.

    That's how you analyze data for candidate keys. Without knowing your data, I can't even begin to suggest how you define your keys.

    For your particular situation, I'd look at how many rows would be updated each load if you weren't doing a full refresh. Say, for example, 90% of the data stays the same from refresh to refresh, and only 10% changes. In that case, an upsert would almost certainly be much more efficient. On the other hand, if 90% of the data changes, then a reload will almost certainly be more efficient. You need to measure that to know which will work better.

    Even if you stay with a reload, there are often ways to make those more efficient than the textbook examples in BOL.
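
    If the upsert route wins, here is a minimal MERGE sketch of what the incremental load might look like; TargetTable, StagingTable, KeyCol, Col1 and Col2 are hypothetical names, and the change test would need NULL handling for nullable columns:

    merge dbo.TargetTable as t
    using dbo.StagingTable as s
        on t.KeyCol = s.KeyCol
    when matched and (t.Col1 <> s.Col1 or t.Col2 <> s.Col2) then
        update set Col1 = s.Col1, Col2 = s.Col2
    when not matched by target then
        insert (KeyCol, Col1, Col2) values (s.KeyCol, s.Col1, s.Col2)
    when not matched by source then
        delete;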

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Michael Valentine Jones (9/16/2009)


    You could start by taking a random sample from each table into a temp table to use for initial screening:

    select top 10000 * into #t from MyTable order by newid()

    This will let you run queries against the small sample table to look for potential single-column primary keys:

    select Col1, count(*) from #t group by Col1 having count(*) > 1

    select Col2, count(*) from #t group by Col2 having count(*) > 1

    Once you have identified potential primary key columns, you can run the same queries against the main table to verify they are unique.

    One of those "duh!" moments for me.

    Thanks! This is a good idea. It solves my problem of "many columns, not enough indexes". I'll update the thread with the results once I have them.

  • GSquared (9/16/2009)


    If you're looking for natural PKs (which is what it seems you are doing), the definition will depend entirely on the nature of the data itself.

    Are there columns that the other columns depend on for part of their meaning? Those are your candidate keys. One or more of those will be the best PK, depending on uniqueness and completeness.

    For example, in a table of addresses, the city depends on the state. The city doesn't depend on the Zip code (there can be more than one city in a Zip and more than one Zip in a city), but the Zip is limited to one state by postal regulations, even for cities that cross state lines. The street is also independent of the city and the Zip, and in some cases the street can even be in multiple states, because streets can cross those boundaries. Those are your dependencies.

    Street address and Zip code is the minimum set that defines the rest of it, and is unique. Some cities have more than one mailable name, and many city names appear in more than one state, so anything that includes city name isn't really a good key unless it also includes the state and the Zip code. Far simpler to define the key as street address and Zip code.

    That's how you analyze data for candidate keys. Without knowing your data, I can't even begin to suggest how you define your keys.

    This works well when columns have meaningful names. This is a vendor-designed database, and the table and column names look like the developers used a random password generator to come up with them. Don't you just love the great products some vendors have? The data inside gives certain clues as to what those primary keys might be; I just needed a way to verify their uniqueness without waiting days for group by/select distinct queries to finish and without creating extra indexes.

    For your particular situation, I'd look at how many rows would be updated each load if you weren't doing a full refresh. Say, for example, 90% of the data stays the same from refresh to refresh, and only 10% changes. In that case, an upsert would almost certainly be much more efficient. On the other hand, if 90% of the data changes, then a reload will almost certainly be more efficient. You need to measure that to know which will work better.

    Even if you stay with a reload, there are often ways to make those more efficient than the textbook examples in BOL.

    I think 200,000 changed rows in a 20,000,000-row table qualifies that table for an update instead of a reload. Even a reload of a 20 mil row table (there are many tables) used to be fine. Then one day the network configuration changed and download times skyrocketed. That's when the previous database owners decided they didn't want to deal with it anymore and handed it over to the DBAs. You know, the usual "it's business critical, etc."
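
    One way to measure that churn, as a rough sketch: assuming the new extract is landed in a staging table first (StagingTable, TargetTable, KeyCol, Col1 and Col2 are hypothetical names here), count the staged rows that are new or different from what is already stored:

    -- rows in the new extract that do not already exist, value-for-value, in the current table
    select count(*) as NewOrChangedRows
    from (
        select KeyCol, Col1, Col2 from dbo.StagingTable
        except
        select KeyCol, Col1, Col2 from dbo.TargetTable
    ) as d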

  • Here is a completely different solution.

    With as many rows as you describe, it seems certain that the source data acquisition application has performance problems.

    If there are performance problems or some other justification, you can ask the vendor to add primary keys to their database.

    One advantage to this method is that when the vendor sends out new releases, and the new release includes changes to the database table structure, their upgrade procedure will not wipe out all of your added primary keys or unique indexes.

    David

  • I'm just thinking...

    Could you possibly pick out a column you suspect of being a PK candidate and then look for duplicates on that column using a group by clause?

    See below:

    SELECT email, COUNT(email) AS NumOccurrences
    FROM users
    GROUP BY email
    HAVING COUNT(email) > 1

    Not sure if this would work. I've done this, but not on anything close to 10-100 mil rows!

    Thanks,

    S

    --
    :hehe:

  • In addition to the other suggestions, why not ask the business owner of the data for their thoughts? Some of them are intelligent enough to know what a primary key is and could point it out.

  • David. (9/22/2009)


    Here is a completely different solution.

    With as many rows as you describe, it seems certain that the source data acquisition application has performance problems.

    If there are performance problems or some other justification, you can ask the vendor to add primary keys to their database.

    One advantage to this method is that when the vendor sends out new releases, and the new release includes changes to the database table structure, their upgrade procedure will not wipe out all of your added primary keys or unique indexes.

    David

    It does. The only issue is that the vendor is sitting on a very intelligently written contract (on their part), so they wouldn't care. On the upgrade side, I don't see it being a problem in the near future, given that the app hasn't changed in a very loooong time.

  • Mark Pratt-396155 (9/23/2009)


    In addition to the other suggestions, why not ask the business owner of the data for their thoughts? Some of them are intelligent enough to know what a primary key is and could point it out.

    Yep, you are right. They pointed me in the right direction, but I still needed to verify the couple of possible candidates they pointed out.

  • Slick84 (9/22/2009)


    I'm just thinking...

    Could you possibly pick out a column you suspect of being a PK candidate and then look for duplicates on that column using a group by clause?

    See below:

    SELECT email, COUNT(email) AS NumOccurrences
    FROM users
    GROUP BY email
    HAVING COUNT(email) > 1

    Not sure if this would work. I've done this, but not on anything close to 10-100 mil rows!

    Thanks,

    S

    I had a couple of possibilities after talking to stakeholders. I used the random sampling method above to weed out all of the candidates but one, and then used something similar to the code you posted to verify that the last one was indeed unique.
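
    For anyone trying the same thing with a composite candidate (like the street address plus Zip example earlier in the thread), the same duplicate check extends to multiple columns; ColA and ColB are placeholder names here:

    SELECT ColA, ColB, COUNT(*) AS NumOccurrences
    FROM MyTable
    GROUP BY ColA, ColB
    HAVING COUNT(*) > 1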
