Questions about adding primary key to an existing table

  • I'm working on a table that did not have a primary key when it was created. Here is the original table:

    Create Table tbProduct(ProductID int, ProductName varchar (50), ProductStatus varchar (50), LastUpdated smalldatetime)
    Insert Into tbProduct (ProductID, ProductName, ProductStatus, LastUpdated)
    Values (1001, 'Toy', 'InStore', '1/1/2017')
     , (1001, 'Toy', 'Sold', '1/2/2017')
     , (1002, 'Book', 'InStore', '1/3/2017')
     , (1002, 'Book', 'Sold', '1/4/2017')

    Then I added a primary key field, Primary_key, in Design view. Everything seemed to be working OK until we found one error in Product View, which retrieves the latest records only. The old view was using Max(LastUpdated), and the new view is using Max(Primary_key). In the sample table above, the second record for product Book has a smaller primary key value. It looks like this:

    Primary_key  ProductID  ProductName  ProductStatus  LastUpdated
    62           1002       Book         InStore        1/3/2017
    35           1002       Book         Sold           1/4/2017

    So the new view retrieved the ProductStatus as 'InStore', when it really should be 'Sold'. Here are my questions:

    1. How did this happen? If we want to modify the primary key again, what is the proper procedure to prevent it from happening again?
    2. How can we find out whether we have more records with the same type of error? (See the sketch after this list.)
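
    A minimal sketch for question 2, assuming the table and column names from the post (tbProduct, Primary_key, LastUpdated): for each product, compare the status of the newest row ranked by Primary_key with the status of the newest row ranked by LastUpdated, and list the products where the two disagree.

    -- Hypothetical check; names are taken from the post, not from the real schema.
    SELECT ProductID,
           MAX(CASE WHEN rn_by_key  = 1 THEN ProductStatus END) AS StatusByKey,
           MAX(CASE WHEN rn_by_date = 1 THEN ProductStatus END) AS StatusByDate
    FROM (
        SELECT ProductID, ProductStatus,
               ROW_NUMBER() OVER (PARTITION BY ProductID
                                  ORDER BY Primary_key DESC)                   AS rn_by_key,
               ROW_NUMBER() OVER (PARTITION BY ProductID
                                  ORDER BY LastUpdated DESC, Primary_key DESC) AS rn_by_date
        FROM dbo.tbProduct
    ) AS ordered_rows
    GROUP BY ProductID
    HAVING MAX(CASE WHEN rn_by_key  = 1 THEN ProductStatus END)
        <> MAX(CASE WHEN rn_by_date = 1 THEN ProductStatus END);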

  • jay-125866 - Tuesday, January 17, 2017 2:15 PM

    Why did you change the view?  The result you need appears to be what you had before you made the change...

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • What I suspect you need is simply an index on the date field. Just having a primary key doesn't necessarily improve things all by itself, unless you had some need to make each record unique that you couldn't meet without one. Even then, changing the view doesn't do anything but change your results, in a way that you found out was a problem. A sketch of such an index follows.
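
    A minimal sketch of the index suggested above, assuming the table and column names from the post; the key and included columns are a guess at what the view actually touches.

    -- Hypothetical index to support "latest row per product" lookups on LastUpdated.
    -- Keyed on ProductID then LastUpdated (descending), with the remaining columns
    -- included so the view can be answered from the index alone.
    CREATE NONCLUSTERED INDEX IX_tbProduct_ProductID_LastUpdated
        ON dbo.tbProduct (ProductID, LastUpdated DESC)
        INCLUDE (ProductName, ProductStatus);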

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson - Tuesday, January 17, 2017 2:24 PM

    Why did you change the view?  The result you need appears to be what you had before you made the change...

    Hi Steve,
    Thanks for the reply.
    One reason we made the change was that some records have exactly the same time stamp. We'd have to add DISTINCT, which in turn slows performance a little. The current debate within our team is whether to switch back to the Max(LastUpdated) script or to modify the primary key.

  • jay-125866 - Tuesday, January 17, 2017 2:31 PM

    Okay, but I still haven't heard why the view was changed... Not sure why you expected a primary key that has no relationship to the datetime value to solve that problem for you. Also, what were you hoping to gain that you apparently did not? The reason I ask is that I don't really know much about your objective with the query involved, so it's kind of hard to know what to recommend... If performance is the issue, then an index on the datetime field might help. Let me know what your objective is.

    As to changing back to MAX(LastUpdated), you're going to HAVE to do that anyway, and if you only want one record, you can either use TOP (1) in your SELECT along with an ORDER BY, or use DISTINCT. A primary key isn't going to have any effect on this unless there's information you haven't supplied...
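
    A minimal sketch of the "latest row per product" version of that idea, assuming the table from the post: TOP (1) WITH TIES combined with ROW_NUMBER() keeps exactly one (the newest) row per ProductID without needing DISTINCT.

    -- Hypothetical query: newest row per ProductID, ordered by LastUpdated.
    SELECT TOP (1) WITH TIES
           ProductID, ProductName, ProductStatus, LastUpdated
    FROM dbo.tbProduct
    ORDER BY ROW_NUMBER() OVER (PARTITION BY ProductID
                                ORDER BY LastUpdated DESC);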

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson - Tuesday, January 17, 2017 2:38 PM

    Thank you for being so patient with me.
    My objective is to select the latest record for each product. Since there are duplicates in the LastUpdated field, we evaluated the following options: using DISTINCT, using TOP (1) followed by ORDER BY, and using the primary key. It turned out that using the primary key gave us the best performance. That's the reason we'd like to continue using the primary key, if possible.
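
    A minimal sketch of a compromise, assuming Primary_key is the identity column that was added later: keep LastUpdated as the primary ordering and use Primary_key only as a tie-breaker, so duplicate time stamps still resolve to a single, deterministic row instead of relying on the key alone.

    -- Hypothetical query: latest row per product, with Primary_key breaking ties
    -- between rows that share the same LastUpdated value.
    WITH Latest AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY ProductID
                                  ORDER BY LastUpdated DESC, Primary_key DESC) AS rn
        FROM dbo.tbProduct
    )
    SELECT ProductID, ProductName, ProductStatus, LastUpdated
    FROM Latest
    WHERE rn = 1;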

  • The primary key is fine.
    An index on LastUpdated is appropriate, as suggested. But wouldn't a RANK with ties be more accurate than a TOP 1?
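
    A minimal sketch of the RANK-based alternative mentioned above, assuming the post's table: RANK() keeps every row that ties on the latest LastUpdated for a product, instead of picking one of them arbitrarily.

    -- Hypothetical query: all rows that share the latest LastUpdated per product.
    SELECT ProductID, ProductName, ProductStatus, LastUpdated
    FROM (
        SELECT *,
               RANK() OVER (PARTITION BY ProductID ORDER BY LastUpdated DESC) AS rk
        FROM dbo.tbProduct
    ) AS ranked
    WHERE rk = 1;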

  • We're confusing two things here, a primary key, which is a column or columns that uniquely identify a row, and an identity column, which is a generated, monotonically increasing (although you can mess with that) value, frequently used as a primary key. They are independent of each other. Don't confuse the two.

    What it sounds like is happening is that you're expecting the data to be in a particular order after you add the identity column (whether it's a primary key or not). That is only guaranteed if you order the data as you insert it into the table when you're migrating the table from the old structure to the new structure. Otherwise, it'll be moved in whatever order it's being stored in currently.

    Not mentioned is how your data is stored. Do you have a clustered index or is this a heap table?
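
    A minimal sketch of the ordered migration described above, with hypothetical object names: for an INSERT ... SELECT ... ORDER BY, identity values are assigned in the ORDER BY order, so loading the new table ordered by LastUpdated keeps the new key in date order.

    -- Hypothetical new table with an identity column used as the primary key.
    CREATE TABLE dbo.tbProduct_new
    (Primary_key   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
     ProductID     int           NOT NULL,
     ProductName   varchar(50)   NOT NULL,
     ProductStatus varchar(50)   NOT NULL,
     LastUpdated   smalldatetime NOT NULL);

    -- Identity values follow the ORDER BY, so key order matches date order.
    INSERT INTO dbo.tbProduct_new (ProductID, ProductName, ProductStatus, LastUpdated)
    SELECT ProductID, ProductName, ProductStatus, LastUpdated
    FROM dbo.tbProduct
    ORDER BY LastUpdated;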

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • jay-125866 - Tuesday, January 17, 2017 2:15 PM

    >> I'm working on a table that did not have a primary key when it was created. <<

    Then, by definition, this was never really a table. You need to review your schema and see whether the person who did this to you screwed up other things as well.

    What you did post is, of course, full of all kinds of problems. Your "tb" prefix is a design flaw called a tibble, and the table name is singular (do you really have only one thing called Product? Tables model sets; therefore their names are collective or plural nouns by definition). We have had a DATE data type in SQL for several years now. Identifiers are never numeric (what math were you going to do on the product_id?). And since the columns are all NULL-able, there is no way you can have any kind of key; again, this is by definition.

    First, you need to alter the table to make the columns NOT NULL. My bet is that you are going to find NULLs where you did not think you had them, so plan on spending some time scrubbing the data. This will get the table up to merely bad 🙁. After that, look at your choice of data types. Depending on how much scrubbing you have to do, it might be easier to create a properly designed and named table, then move the data over from this imitation deck of punch cards you got stuck with.

    Do you really have a status that is 50 varying characters long? No, probably not. You just grabbed a convenient default size without thinking about it.

    But if you do not protect your data, you are going to fill this table up with garbage; someone will put in a status code that is a narrative or a tweet. Do not ask how I know this happens 🙁
    You also do not seem to know that there is one and only one display format for dates in ANSI/ISO Standard SQL, so you picked an ambiguous local dialect for your data. Since the ISO 8601 format is the second most popular ISO standard on earth after the metric system, this is not very professional.

    CREATE TABLE Products
    (product_gtin CHAR(15) NOT NULL,
     product_name VARCHAR(50) NOT NULL, -- careful research or a wild stab in the dark?
     product_status CHAR(7) DEFAULT 'InStore' NOT NULL
       CHECK (product_status IN ('InStore', 'Sold')), -- more codes?
     posting_date DATE DEFAULT CURRENT_TIMESTAMP NOT NULL,
     PRIMARY KEY (product_gtin, posting_date));

    See how a CHECK() constraint and a DEFAULT save you a lot of programming and guarantee data integrity?

    INSERT INTO Products
    VALUES
    ('1001', 'Toy', 'InStore' , '2017-01-01'),
    ('1001', 'Toy', 'Sold', '2017-01-02'),
    ('1002', 'Book', 'InStore', '2017-01-03'),
    ('1002', 'Book', 'Sold', '2017-01-04');

    I will assume you know what a GTIN is, or that you are using some other industry standard in your actual application. My choice would be to have a pair of (start_date, end_date) columns that show the time interval during which the product was in a particular status; a sketch of that follows.
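
    A minimal sketch of that (start_date, end_date) idea, with hypothetical names, keyed off the product_gtin column defined above; a NULL end_date marks the current status.

    -- Hypothetical status-history table: one row per status interval.
    CREATE TABLE Product_Status_History
    (product_gtin   CHAR(15) NOT NULL,
     product_status CHAR(7)  NOT NULL,
     start_date     DATE     NOT NULL,
     end_date       DATE,             -- NULL = current status
     PRIMARY KEY (product_gtin, start_date),
     CHECK (end_date IS NULL OR end_date >= start_date));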

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • jcelko212 32090 - Thursday, January 19, 2017 12:05 PM

    Your "TB_" is a design flaw called a Tibble

    No, this is not a design flaw, it's a violation of naming conventions.  As far as the processor is concerned, there is no difference between a table with a tbl_ prefix and one without.

    Tables model sets; therefore their names are collective or plural nouns by definition.

    Again, this is a convention.  It is not part of the definition of a table.

    Identifiers are never numeric (what math were you going to do on the product_id?)

    Surrogate keys are overwhelmingly numeric, because they use the arithmetic function of autoincrement that is part of the IDENTITY definition.

    You also missed an egregious problem with this table definition, that is, it's not properly normalized.  The name only applies to the product_id, but the date applies to the product_id and status.  This should be two separate tables: Products and Product_Statuses.
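
    A minimal sketch of the split suggested above, with hypothetical names and types: product attributes live in Products, and each status change becomes one row in Product_Statuses.

    -- Hypothetical normalized design: one row per product, one row per status change.
    CREATE TABLE Products
    (product_id   int         NOT NULL PRIMARY KEY,
     product_name varchar(50) NOT NULL);

    CREATE TABLE Product_Statuses
    (product_id     int         NOT NULL REFERENCES Products (product_id),
     product_status varchar(10) NOT NULL,
     status_date    date        NOT NULL,
     PRIMARY KEY (product_id, status_date));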

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen - Thursday, January 19, 2017 12:38 PM

    These rules are a little bit more than just conventions. They are part of the ISO 11179 standards from the L8 metadata committee. These are the people who brought us the metric system. I really consider it a flaw if somebody uses the wrong measurement system (have you ever used the Potrzebie system of weights and measures? Why not, if it is just a convention?).

    Let me make another appeal to authority. In "ACM Transactions on Database Systems," Dr. Codd wrote that "…database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them..."

    This means that a surrogate ought to act like an index, hash table, bit vector or whatever; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does. There is no specific data type; in fact, you should not even be able to tell what any particular implementation is. A simple increment can have a lot of problems with data distribution in the hardware, but that is another issue.

    Codd also wrote the following:

    "There are three difficulties in employing user-controlled keys as permanent surrogates for entities:

    1. The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.).

    2. Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same.

    3. It may be necessary to carry information about an Entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. an applicant for a job and a retiree)."

    These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution is to introduce entity domains that contain system-assigned surrogates.

    Yes, I agree about the normalization problem.

    Please post DDL and follow ANSI/ISO standards when asking for help. 
