flatfiles without candidate keys to be used as unique key

  • We receive data as flatfiles from branch areas containing guidelines, policy/procedures.

    We notice that most of the columns are 'N/A' value typically candidate keys for identity (like policy no, guide no, resolution no). The ff. data will be used as a central repositories for policies and procedures. Should this be resolve under transformation process designating a unique key? If yes, how about during update on a specific procedure that would match from the previous to the current ones (the current ones still doesn't have a candidate key)?

    Appreciate your kind help. Thank you

  • If the columns are MOSTLY 'N/A', to me that sounds like a very poor key.  Now, if you are going to overwrite that value with a SEQUENCE or IDENTITY value on the column, you run the risk of losing the data that is in that column in the source when it is not 'N/A'. If you pull in the data that exists when it is not 'N/A', you run the risk of duplicate data.

    In the end, I think this REALLY depends on your clients.  Are they OK with a column like Policy No being overwritten with an auto-generated value?  If not, I would recommend a new column be created for the uniquifier which could be a GUID, INT, BIGINT, VARCHAR, whatever you like.  Best performance will be an ever increasing numeric type, but a GUID will give you a lot more values if you are expecting to have TONS of rows.

    If they are OK with those values being overwritten and lost, then your approach sounds OK for that being a unique key, but you may not want that as your ONLY clustered index key or nonclustered index key.  I expect a compound key, likely compounded  on the policy name and policy number, would make more sense based on what users would likely be searching for.

    But it does depend on how you are using the key.  Policy Number would make a decent clustered index key if it was used by a foreign key, but you also posted this in "Data Warehousing", so I am expecting the tables are not normalized as my understanding is of data warehousing (note it is VERY limited) is that you usually have denormalized tables with good indexes to make the SEEKS fast without needing to do a bunch of joins to get your data.  If this is accurate, then I would say that a policy number that is a sequence or identity value would not be the best key unless it is only part of key used to ensure uniqueness of the row.

    The above is just my opinion though.  Others may have better idea on this.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!

  • This is what i am worrying about. If the identity key will be automated, the client would not be very concern to it and data lost will be common unless it will be return back to them telling that this is the new identity which they might turn down. Actually i'm thinking that i might exclude all records without candidate keys to avoid incoming problems and tag them as data to be resolved by the client. Would that be a good practice?

    • This reply was modified 6 months ago by  zysirhc.
  • One approach you can take is create your own surrogate key for all record and keep the customer key as its own column.

    "Policy Number would make a decent clustered index key if it was used by a foreign key"  -- That would depend. IF you are using a unique Identifier type then I would not use this to cluster my table on. If the value of policy # is incrementing then it would be a candidate. For indexing you do have to look at how your queries (thourgh execution plans) are using indexes over time to know if you made the right choice or not. That is, if the execution plans even use your index.

    ----------------------------------------------------

Viewing 4 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply