Primary Key

  • PaulB (10/11/2008)


    A delete on a staging table is like saying "I don't care about this row coming from the source application", if that actually happens either the Data Warehouse is not a reflection of the business or the design is flawed.

    Not all staging tables are used to support data warehouses and sometimes the design of the data is, in fact, flawed because it comes from a 3rd party. For example, telephone companies are absolutely notorious for delivering duplicated CDRs in "adjacent" or sometimes the same file. Yeah, I know, depending on the situation, I sometimes just mark them as a dupe. Other times, it's just a lot easier to delete the dupes so you don't need the extra criteria to ignore them during multiple processes.

    Other than that, I pretty much agree... select the right stuff to begin with. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
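The dedup-by-delete case Jeff describes for CDR staging can be sketched in T-SQL. The table and column names below are purely hypothetical; the point is the ROW_NUMBER pattern that keeps one copy of each duplicated row and deletes the rest.

```sql
-- Hypothetical staging table and columns, for illustration only.
-- Keeps one copy of each CDR and deletes the rest.
WITH Dupes AS
(
    SELECT ROW_NUMBER() OVER
           (
               PARTITION BY CallingNumber, CalledNumber, CallStartTime, CallDuration
               ORDER BY (SELECT NULL)   -- any one copy may survive
           ) AS RowNum
      FROM dbo.StagingCDR
)
DELETE FROM Dupes
 WHERE RowNum > 1;
```

Deleting through the CTE removes only the rows whose RowNum exceeds 1, so no extra "ignore the dupes" criteria are needed in the later processes.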

  • Heh... and by the way... this...

    T - you do whatever transformation or massaging is needed...

    ... sometimes means deleting duplicated data. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I would agree that not every table needs a Primary Key. Paul has given a great example, and if you do not need to enforce one, then you might not need one, especially if you are sure of your source data.

    I'd argue you might have a primary key on the data so you can match it up in the destination and determine what to update/insert, but you don't necessarily need to enforce one with RI.
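The unenforced-key approach described above might look like the following sketch. All object names here are made up; CustomerID acts as a logical key that drives the update/insert decision, but nothing on the staging table enforces it.

```sql
-- Hypothetical names throughout. CustomerID is a *logical* key:
-- no constraint enforces it on staging, but it still drives the load.
UPDATE d
   SET d.CustomerName = s.CustomerName
  FROM dbo.DimCustomer     d
  JOIN dbo.StagingCustomer s ON s.CustomerID = d.CustomerID;

INSERT INTO dbo.DimCustomer (CustomerID, CustomerName)
SELECT s.CustomerID, s.CustomerName
  FROM dbo.StagingCustomer s
 WHERE NOT EXISTS
       (SELECT 1 FROM dbo.DimCustomer d WHERE d.CustomerID = s.CustomerID);
```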

  • And I disagree because RI isn't the only reason to have a PK. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/11/2008)


    Heh... and by the way... this...

    T - you do whatever transformation or massaging is needed...

    ... sometimes means deleting duplicated data. 😉

    Nope. We do not delete on Staging. Period. 🙂

    It looks like our auditors are a little more ... how should I say it ... picky than yours, guys.

    We do not have a single "delete" statement in the whole ETL process, which comes in pretty handy when the Business panics and starts pointing fingers regarding some "missing" data; in such situations we can say, "That's not possible; ask the auditors, we do not have a single delete statement."

    E: We move raw data from the source to STG tables.

    T: We massage the data according to business rules.

    L: We attempt to load STG into FACT/DIM where we reject whatever can't be loaded -like duplicates - and notify accordingly.

    Believe me, that's the only way you can cope with 14 ETL streams loading 6 Gig a day during the tiny ETL windows we get, depending on Business activity in Asia, Europe, and the U.S. East and West coasts.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a test, non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • By the way... may I point to another, less sensitive, kind of table for which it doesn't make sense at all to have a PK?

    How about Monitoring, Auditing and Error logs?

    Whether you have a job looking at I/O every five minutes... a logon trigger writing down login information... or a simple error log, like the rejected duplicate rows from the last stage of your ETL 😀 ... why in the world would you want a PK? Just to make data modeling more complex? To slow your process down? To allocate more space?

    The answer is... no reason. You are just dumping events as they happen; there is neither RI nor uniqueness to be enforced, you just want to write down something that happened. 😉

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a test, non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
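A log table of the kind Paul describes could be as simple as the following heap. The DDL is illustrative only; the names are invented. With no PK and no indexes, there is nothing for an INSERT to validate or maintain.

```sql
-- Illustrative only: a bare heap for event logging.
-- No PK, no indexes, nothing for an INSERT to validate or maintain.
CREATE TABLE dbo.ErrorLog
(
    EventTime  datetime       NOT NULL DEFAULT GETDATE(),
    Severity   tinyint        NOT NULL,
    Message    nvarchar(4000) NULL
);
```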
  • Audit tables I probably would put a PK onto. Probably just an int identity, but I would want something there to uniquely identify a row, especially if auditors want reports and reasons and follow-ups and the like, and want them all stored in the DB (as the last auditors I worked with did).

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (10/12/2008)


    Audit tables I probably would put a PK onto. Probably just an int identity, but I would want something there to uniquely identify a row, especially if auditors want reports and reasons and follow-ups and the like, and want them all stored in the DB (as the last auditors I worked with did).

    Oh... I see.

    Well, if the log becomes part of a more complex tracking application - as opposed to a simple log - I would certainly do the same.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a test, non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • PaulB (10/12/2008)


    Jeff Moden (10/11/2008)


    Heh... and by the way... this...

    T - you do whatever transformation or massaging is needed...

    ... sometimes means deleting duplicated data. 😉

    Nope. We do not delete on Staging. Period. 🙂

    Understood... different type of business we're in with different requirements.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • PaulB (10/11/2008)


    GilaMonster (10/11/2008)

    I don't see why I would want to fail the entire load if I know there's dups in the source data that, for whatever reason, cannot be removed from the source.

    2- You do not keep loading the same data every single day; after the initial load you load what happened since your previous ETL process, so you wouldn't attempt to load your 57 instances of the same client on a daily basis.

    Of course - one way of doing this would be to remove previously processed records from staging ....

  • I've followed this debate with interest, but I'm afraid that Paul has not convinced me. I want to be able to uniquely identify every row in my database for a variety of reasons, including auditability, security, and simply because I believe it is a "best practice" to give every single table a primary key, even if the table only has one row. The nature of the data or its application (e.g., a staging table) really doesn't change that fundamental principle. After all, relational theory doesn't distinguish between production tables and staging tables.

    It may not seem necessary but I've seen people lose their jobs over things that didn't seem necessary including this issue. One of the comforting things about "best practices" is that they can protect us from ourselves. There have been a number of times in my career where I have grudgingly followed a best practice even though I didn't really agree with it only to have some event occur that demonstrated the value of the practice.

    "Beliefs" get in the way of learning.

  • That's the beauty of it Robert, that's why ice cream comes in dozens of flavors 😀

    In regards to your position about relational theory, best practices, and staging tables being just as relational as FACT or DIM tables ... I think you are ... mmmhhh ... how to put it nicely ... yes, in my humble opinion, dead wrong. Sorry 😎

    Let me put it this way, a football coach resorts to specific drills to coach specific skills; top coaches do coach the skill during the drill, the rest of them end up coaching the drill, not the skill. Did you get it? 😉

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a test, non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • I have to say, I've had a few places where a heap table (no PK, no clustered index) was useful to me, mainly in places where I needed to do a ton of inserts, but no real querying.

    For example, I've done audit logs that way. Dump the data fast into a heap, then archive it periodically into a table that has a PK, clustered index, and a few indexes to help common queries. That makes the audit inserts that much faster.

    But any place where I'm not trying to shave microseconds off of inserts, I'd go with having a PK.

    Even the ETL thing, I keep a PK on staging tables, because it can massively speed up the T part when the query optimizer has some vague clue what rows it's going to be updating. Even a simple clustered index can help with that. If all I'm doing is some very simple aggregates, I might not want a PK, but I find that a warehouse that's just simple aggregates is usually overkill compared to an indexed view.

    But I do have to agree that, in some cases, a heap might be better. I can imagine some. Haven't run into them in real databases, but I have to admit it's possible.

    As a general rule of thumb, though, I'd tend towards having a PK on every table. Generally, it's going to be better to have one you don't need than to miss one you do need.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
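The heap-then-archive pattern described above might be sketched as follows. The table and column names are invented; DELETE ... OUTPUT is used so the move from the fast, index-free landing heap into the indexed history table is a single atomic statement.

```sql
-- Invented names. Inserts land in dbo.AuditLog (a bare heap);
-- a periodic job moves the rows into the indexed history table.
DELETE dbo.AuditLog
OUTPUT deleted.EventTime, deleted.LoginName, deleted.Detail
  INTO dbo.AuditLogHistory (EventTime, LoginName, Detail);
```

Because the OUTPUT clause captures exactly the rows that were deleted, no rows can slip through between a separate INSERT and DELETE.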

  • Don't mean to be fussy, but a table can have a PK and still be a heap. A table without a clustered index is a heap; it doesn't matter whether the PK is the clustered index. In fact, I often see PKs on identity attributes and clustered indexes on some other natural key that supports range queries more effectively.

    "Beliefs" get in the way of learning.
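The PK-versus-clustered-index distinction above can be shown directly in DDL. This example is illustrative only; the names are made up.

```sql
-- The PK lives on an identity column but is declared NONCLUSTERED,
-- while the clustered index goes on a natural key that supports
-- range queries. PK and clustered index are independent choices.
CREATE TABLE dbo.Trade
(
    TradeID   int IDENTITY(1,1) NOT NULL
              CONSTRAINT PK_Trade PRIMARY KEY NONCLUSTERED,
    TradeDate date          NOT NULL,
    Symbol    varchar(12)   NOT NULL,
    Price     decimal(18,4) NOT NULL
);

CREATE CLUSTERED INDEX CIX_Trade_TradeDate
    ON dbo.Trade (TradeDate, Symbol);
```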

  • To satisfy the normalization rules, there must be a primary key in a table. If there is no PK, how can you prevent data redundancy in a table? A PK also helps ensure that each and every record is unique and that the required field is actually filled in by the user.

    Another thing is performance. It is easy to put the clustered index on the PK, because a PK already meets the requirements for a unique clustered index (unique and NOT NULL).

    Thanks

    AKP
