Maintaining the Value of My Primary Key Question

  • I find the hardest rule to follow for Primary Keys is that once a record has its ID it can never change. It's all easy for me if the table is loaded incrementally, or is static. It is when I have to do a truncate and repopulate process that things get complicated for me. Most of the time it isn't an issue because I have some ID field I can use, but once in a while I come across a table where I need to create a surrogate key. I need to make sure that surrogate key never changes for that record.

    I have a table of 200,000 records. The natural primary key for this table is the combination of columns A1, B2, C3 and D4. I want to create a surrogate key so that I can have one column for my primary key. Right now to do this, I concatenate the four columns that make up the natural primary keys. If those columns are char(20), then my surrogate key is now char(80).

    The big question is: Is there another way to do this so that my surrogate key is smaller?

    I know I could use the Identity option, but over time some records are added and some are deleted for whatever reason. I need to be able to recreate the table. If I recreate the table, I need to be able to ensure that the surrogate key is still assigned to the correct record and I can't do that if I use the Identity option.

    Are there any functions out there that can take the values of those four columns and assign them a hexadecimal value or something?

    Any suggestions?

    Thanks

  • adams.squared (11/25/2013)


    I have a table of 200,000 records. The natural primary key for this table is the combination of columns A1, B2, C3 and D4. I want to create a surrogate key so that I can have one column for my primary key. Right now to do this, I concatenate the four columns that make up the natural primary keys. If those columns are char(20), then my surrogate key is now char(80).

    Why do you want to create a surrogate key from these four columns instead of using them as they are and saving space?

    adams.squared (11/25/2013)


    I know I could use the Identity option, but over time some records are added and some are deleted for whatever reason. I need to be able to recreate the table. If I recreate the table, I need to be able to ensure that the surrogate key is still assigned to the correct record and I can't do that if I use the Identity option.

    You don't have to rebuild the table and loose the primary key if its an identity. You can use IDENTITY_INSERT.

  • The reason I am creating a surrogate is because I need a KEY field.

    I can't use the identity because now and then records are added or removed. Using identity will not guarantee that the ID field is the same after the rebuild.

  • Let me add a note here. This has to do with an ETL process. I'm taking the data from the source and creating a table on the destination. I want my key on the destination. I also do not have any control over the source.

  • Use ROW_NUMBER() for surrogate key assignment and order on the natural key. As long as you receive the same data set each time the surrogates will be the same.

    Better still, don't truncate your target table. Why would you want to do that? E.g.:

    1. Populate a "landing" table with your snapshot of source data.

    2. Join to the target table to identify only those rows that are newly arriving in the data set. Assign surrogates to the new rows.

    3. Insert new rows to the target table.

  • sqlvogel (11/25/2013)


    Use ROW_NUMBER() for surrogate key assignment and order on the natural key. As long as you receive the same data set each time the surrogates will be the same.

    Better still, don't truncate your target table. Why would you want to do that? E.g.:

    1. Populate a "landing" table with your snapshot of source data.

    2. Join to the target table to identify only those rows that are newly arriving in the data set. Assign surrogates to the new rows.

    3. Insert new rows to the target table.

    I'm not always receiving the same data set, so using row number does not work. For the second part, sometimes we have to rebuild the table.

    The question is:

    Is there a way to get a single value for the four columns, aside from concatenation and not using identity?

  • adams.squared (11/25/2013)


    The question is:

    Is there a way to get a single value for the four columns, aside from concatenation and not using identity?

    Sorry but if you care about what the numbers are then you clearly can't/shouldn't be using IDENTITY to generate them. If you want to rebuild the table while retaining the same numbers then for most practical purposes it doesn't make sense to use IDENTITY. IDENTITY is best avoided in data warehouse / data integration situations - at least as a way of generating keys for a target table. A SEQUENCE is somewhat better because you can generate the values independently of the table.

    SEQUENCE or not, this is a common situation and it really isn't a difficult problem to solve but I'm not sure why you think you have to use an IDENTITY column to do it.

  • sqlvogel (11/25/2013)


    adams.squared (11/25/2013)


    The question is:

    Is there a way to get a single value for the four columns, aside from concatenation and not using identity?

    Sorry but if you care about what the numbers are then you clearly can't/shouldn't be using IDENTITY to generate them. If you want to rebuild the table while retaining the same numbers then for most practical purposes it doesn't make sense to use IDENTITY. IDENTITY is best avoided in data warehouse / data integration situations - at least as a way of generating keys for a target table. A SEQUENCE is somewhat better because you can generate the values independently of the table.

    SEQUENCE or not, this is a common situation and it really isn't a difficult problem to solve but I'm not sure why you think you have to use an IDENTITY column to do it.

    I do not want to use Identity.

  • You could create a separate "key master" table, containing only an identity column and the four natural key columns, clustered on the natural key columns, NOT the identity.

    Then, the first time a given combination of

    A1, B2, C3 and D4

    values are to be inserted to the main table, you first insert a row into the "key master" table; then, when inserting to the main table, you lookup the identity key value to be used from the "key master" table.

    The main table can also be clustered on the natural key columns, if that works best for the queries, while the PK could still be the single identity value.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • What Scott suggested is the way I would do it.

    "Intelligent" numeric keys derived directly from other data (without any lookups) can only work in very limited circumstances I think - mainly those where the full set of data values is predictable and is within the limits of what can be enumerated by a numeric type. Calendar tables with numeric keys based on dates are an exceptional example.

    Keys based on hashes of data are possible and maybe that is an option for you if you want another alternative.

  • Scott's suggestion looks good.

    What I was looking for, and I am probably not drawing a decent picture, is a way to get a value out of multiple columns so that I will have a distinct value for each field and if I have to reload the data, the values will not change.

    What I started playing with is concatenating the four columns and converting them to varbinary and then converting that to varchar(36) with style 2.

    I'm at home now so I don't have the script I was working with. I'll update the post tomorrow with what I wrote.

    Anways, That gives me a value that is based off of the four columns. I was able to do it on the 1.8 million record table I was working with and then assigned that column as the PK. I did not have any PK violations, so I am getting a unique value. I still have some more playing around with it to see if this method is of any value.

  • I think Scott's suggestion is the best. It's similar in concept to what we do in our accounting software. Another option would be some sort of hashing routine but with up to 80 chars seems like it would be problematic at best.

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply