Table Without PK

  • xsevensinzx (5/8/2015)


    GilaMonster (5/8/2015)


    xsevensinzx (5/8/2015)


    What if you are allowed to have duplicates in transactional records that carry customer IDs, which form the relationship? What purpose is the PK giving you, other than deduplicating records that the business requirements explicitly say we must keep?

    If you are allowed multiple transactions per customer, then customer is not the primary key. What uniquely identifies a transaction? Once you answer that question (and from your last paragraph it looks like maybe Customer ID and the Timestamp), then that becomes the primary key (or a unique constraint if you're using artificial primary keys).

    Yep, the answer I came to was count(), which is the approach I took to identifying that primary key in this instance.

    But you still have to answer the question of why the transaction has to be uniquely identified in a world where not being unique is allowed. What use is a primary key that is not used anywhere in the relational model? It's uniquely identifying a record in a table where we do not care that it's uniquely identified. 😉

    But each transaction has to be unique in some fashion. Otherwise, through error, someone could enter an identical sale multiple times. Customers might get upset to find they've been billed for 50 of something that they only purchased 1 of. And I understand that a given customer can purchase the same item multiple times, but either it's a multiple count on one transaction, or it's from a different register, a different sales person, or a different time, all factors making the transaction unique. If you can't uniquely identify a given transaction, what happens on a return or a cancellation? If the customer complains, "Yeah, I bought one of those on Tuesday, but not Wednesday," how do you prove that there was a Wednesday purchase if the Tuesday purchase is identical in every possible way? In short, you do have a mechanism for uniquely identifying these transactions, which means that there is a set of columns that identify a candidate key.
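
    As a rough illustration (the table and column names below are invented for the example, not taken from the actual system), a candidate key like that might be enforced with a unique constraint alongside an artificial key:

        CREATE TABLE dbo.SalesTransaction
        (
            TransactionID   int IDENTITY(1,1) NOT NULL
                CONSTRAINT PK_SalesTransaction PRIMARY KEY,  -- artificial key
            CustomerID      int          NOT NULL,
            RegisterID      int          NOT NULL,
            SalesPersonID   int          NOT NULL,
            TransactionTime datetime2(3) NOT NULL,
            -- the natural candidate key: the combination of factors that
            -- distinguishes one transaction from any other
            CONSTRAINT UQ_SalesTransaction
                UNIQUE (CustomerID, RegisterID, SalesPersonID, TransactionTime)
        );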

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • xsevensinzx (5/8/2015)


    But you still have to answer the question of why the transaction has to be uniquely identified in a world where not being unique is allowed. What use is a primary key that is not used anywhere in the relational model? It's uniquely identifying a record in a table where we do not care that it's uniquely identified. 😉

    How do you tell if a record is legit or a 'duplicate' entered by mistake or a bug?

    If duplicates are fine, then having 1 million absolutely identical rows won't mess up any reports/processes? How about if a bug puts 5 million identical ones in there instead? How will you tell?

    Uniqueness is about data quality. I can probably buy two items from Amazon almost simultaneously, so they can't use CustomerID and Timestamp for uniqueness, but that doesn't mean 50 000 concurrent purchases is correct.
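
    For what it's worth, a quick way to surface that kind of flood is a simple aggregate grouped on every column of the table (just two placeholder columns shown here):

        -- count how many times each identical row repeats;
        -- without a key, there is no way to tell which copies are legitimate
        SELECT CustomerID, TransactionTime, COUNT(*) AS Copies
        FROM dbo.Transactions
        GROUP BY CustomerID, TransactionTime
        HAVING COUNT(*) > 1
        ORDER BY Copies DESC;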

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Grant Fritchey (5/8/2015)


    xsevensinzx (5/8/2015)


    GilaMonster (5/8/2015)


    xsevensinzx (5/8/2015)


    What if you are allowed to have duplicates in transactional records that carry customer IDs, which form the relationship? What purpose is the PK giving you, other than deduplicating records that the business requirements explicitly say we must keep?

    If you are allowed multiple transactions per customer, then customer is not the primary key. What uniquely identifies a transaction? Once you answer that question (and from your last paragraph it looks like maybe Customer ID and the Timestamp), then that becomes the primary key (or a unique constraint if you're using artificial primary keys).

    Yep, the answer I came to was count(), which is the approach I took to identifying that primary key in this instance.

    But you still have to answer the question of why the transaction has to be uniquely identified in a world where not being unique is allowed. What use is a primary key that is not used anywhere in the relational model? It's uniquely identifying a record in a table where we do not care that it's uniquely identified. 😉

    But each transaction has to be unique in some fashion. Otherwise, through error, someone could enter an identical sale multiple times. Customers might get upset to find they've been billed for 50 of something that they only purchased 1 of. And I understand that a given customer can purchase the same item multiple times, but either it's a multiple count on one transaction, or it's from a different register, a different sales person, or a different time, all factors making the transaction unique. If you can't uniquely identify a given transaction, what happens on a return or a cancellation? If the customer complains, "Yeah, I bought one of those on Tuesday, but not Wednesday," how do you prove that there was a Wednesday purchase if the Tuesday purchase is identical in every possible way? In short, you do have a mechanism for uniquely identifying these transactions, which means that there is a set of columns that identify a candidate key.

    Indeed, but you're assuming transaction refers to a sale. In this scenario, a transaction is not a sale. Transaction refers to a unit of work or interaction (a handshake), much like a sale would be, but of course with no metric such as currency.

    The transaction also has no metrics like you would find on a sales transaction, such as costs. Therefore, you can still pull the data without worrying about double-counting costs the way you would with duplicates on, say, a sales table.

    But, that's my fault for not clarifying previously what type of transaction we are talking about here.

    GilaMonster (5/8/2015)


    xsevensinzx (5/8/2015)


    But you still have to answer the question of why the transaction has to be uniquely identified in a world where not being unique is allowed. What use is a primary key that is not used anywhere in the relational model? It's uniquely identifying a record in a table where we do not care that it's uniquely identified. 😉

    How do you tell if a record is legit or a 'duplicate' entered by mistake or a bug?

    If duplicates are fine, then having 1 million absolutely identical rows won't mess up any reports/processes? How about if a bug puts 5 million identical ones in there instead? How will you tell?

    Uniqueness is about data quality. I can probably buy two items from Amazon almost simultaneously, so they can't use CustomerID and Timestamp for uniqueness, but that doesn't mean 50 000 concurrent purchases is correct.

    This is the uphill battle I face though. I have no way to verify whether 1 million or 5 million identical rows are correct or incorrect. Thus, all are assumed correct until proven otherwise by the data source that provides them. That's the catch.

    In this scenario, if duplicates are allowed and you have no way to verify what I refer to as "Duplicates on top of Duplicates", then a primary key may not be all that important to the table. At least, that is the general consensus when I talk to more experienced DBAs.

    Yet, not having the primary key would make seasoned DBAs like Grant start to twitch and reach for that bottle of Jack if they ever audited the table. So, what to do?

    So, how to approach that? I have one approach that works well: using count() to uniquely identify the record while still indicating on the record how many duplicates it has, so the end user can still be satisfied.
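
    Something like this (column names are hypothetical) shows the idea: number the duplicates so each row becomes addressable, while the duplicate count stays visible on every row:

        -- number each duplicate within its group and expose the group size;
        -- (CustomerID, TransactionTime, DupeSeq) is then unique per row
        SELECT CustomerID,
               TransactionTime,
               ROW_NUMBER() OVER (PARTITION BY CustomerID, TransactionTime
                                  ORDER BY (SELECT NULL)) AS DupeSeq,
               COUNT(*)     OVER (PARTITION BY CustomerID, TransactionTime) AS DupeCount
        FROM dbo.Transactions;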

  • xsevensinzx (5/8/2015)


    This is the uphill battle I face though. I have no way to verify whether 1 million or 5 million identical rows are correct or incorrect. Thus, all are assumed correct until proven otherwise by the data source that provides them. That's the catch.

    In this scenario, if duplicates are allowed and you have no way to verify what I refer to as "Duplicates on top of Duplicates", then a primary key may not be all that important to the table. At least, that is the general consensus when I talk to more experienced DBAs.

    Yet, not having the primary key would make seasoned DBAs like Grant start to twitch and reach for that bottle of Jack if they ever audited the table. So, what to do?

    So, how to approach that? I have one approach that works well: using count() to uniquely identify the record while still indicating on the record how many duplicates it has, so the end user can still be satisfied.

    If you're dealing with a system that is, to all intents and purposes, intentionally collecting dirty data, then it's collecting dirty data. In fact, this is probably a great place to talk about just throwing the information into a NoSQL storage system. Since the data itself is unstructured, I'd say put it in unstructured storage. No sense in trying to force structured storage rules on data that is inherently not.

    One of the problems you'll see from hard-core structuralists and from hard-core nonstructuralists is that they have the best hammer on earth, so every problem is a nail. I'd say this problem is not a nail. It's a bolt. Using the hammer is just going to be frustrating and ineffective.

    Plus, I don't like Jack. Now, Woodford Reserve...

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (5/8/2015)


    xsevensinzx (5/8/2015)


    This is the uphill battle I face though. I have no way to verify whether 1 million or 5 million identical rows are correct or incorrect. Thus, all are assumed correct until proven otherwise by the data source that provides them. That's the catch.

    In this scenario, if duplicates are allowed and you have no way to verify what I refer to as "Duplicates on top of Duplicates", then a primary key may not be all that important to the table. At least, that is the general consensus when I talk to more experienced DBAs.

    Yet, not having the primary key would make seasoned DBAs like Grant start to twitch and reach for that bottle of Jack if they ever audited the table. So, what to do?

    So, how to approach that? I have one approach that works well: using count() to uniquely identify the record while still indicating on the record how many duplicates it has, so the end user can still be satisfied.

    If you're dealing with a system that is, to all intents and purposes, intentionally collecting dirty data, then it's collecting dirty data. In fact, this is probably a great place to talk about just throwing the information into a NoSQL storage system. Since the data itself is unstructured, I'd say put it in unstructured storage. No sense in trying to force structured storage rules on data that is inherently not.

    One of the problems you'll see from hard-core structuralists and from hard-core nonstructuralists is that they have the best hammer on earth, so every problem is a nail. I'd say this problem is not a nail. It's a bolt. Using the hammer is just going to be frustrating and ineffective.

    Plus, I don't like Jack. Now, Woodford Reserve...

    Would you define data as dirty if the records were multi-dimensional and well structured, but included duplicates? I wouldn't.

    These are transactions, handshakes, cleaned data that do not have primary keys. You can still store the data without a primary key because it arrives in a daily batch from a system you don't control. There are no retroactive updates to previous data. It's entirely in your hands to do what you will with the data.

    In this context, I feel you can keep the data without a primary key. But it breaks the rules of everything else. So, in my case, it depended on how the end user is using the data. Here, they are formulating paths from start to end where each transaction is a step. Duplicates represent multiple steps at the same point in time.

    If there are never any retroactive updates, you can roll up the data based on user + time + count(duplicates) to effectively formulate a primary key, while also indicating to the end user how many steps (including duplicates) the record represents, and reducing the data footprint you'd otherwise carry.
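
    A minimal sketch of that rollup (UserID and EventTime are made-up names for the user and time columns):

        -- collapse identical rows into one, carrying the duplicate count;
        -- (UserID, EventTime) then uniquely identifies each rolled-up row
        SELECT UserID,
               EventTime,
               COUNT(*) AS StepCount
        INTO dbo.TransactionsRolledUp
        FROM dbo.TransactionsStaging
        GROUP BY UserID, EventTime;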

    Everyone is happy I think? :hehe:

  • Don't get overly focused on having a primary key per se. Some tables require them, some don't. Data integrity is vital, but a primary key doesn't guarantee it, nor necessarily even help it much. Foreign keys, check constraints and other validations are much more important for that.

    On the other hand, determining and creating the best clustered index is absolutely critical for best performance. And that is most often not an identity column (identity columns are the most over-used clustering keys in the world). There are cases where a clustering identity column is appropriate, but that is the minority of the time.
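
    To make that concrete, a sketch of clustering on the column queries actually filter by (names invented here), rather than defaulting to an identity column:

        -- cluster on the time column used for range filtering;
        -- non-unique is fine: SQL Server adds a hidden uniquifier to duplicates
        CREATE CLUSTERED INDEX CIX_Transactions_TransactionTime
            ON dbo.Transactions (TransactionTime);

    Since the batches presumably arrive in time order, new rows land at the logical end of the index, which is also what keeps fragmentation down.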

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher (5/8/2015)


    Don't get overly focused on having a primary key per se. Some tables require them, some don't. Data integrity is vital, but a primary key doesn't guarantee it, nor necessarily even help it much. Foreign keys, check constraints and other validations are much more important for that.

    On the other hand, determining and creating the best clustered index is absolutely critical for best performance. And that is most often not an identity column (identity columns are the most over-used clustering keys in the world). There are cases where a clustering identity column is appropriate, but that is the minority of the time.

    That was my take on it as well. At the end of the day, a primary key would just be taking up space in this particular instance.

    I clustered on time because it's how we primarily filter the data, and, given how the batches are processed, it also keeps fragmentation levels way down.
