Why is data integrity important?

  • alen teplitsky (8/30/2010)


    i've seen this no NULL's allowed OCD thing being taken to extremes

    say you sign up a customer today and promise some service that will be done in the near future but you're not sure of the date. since the column is not null you have to put in a crazy date like 1/1/2049. or if something was done in the past then 1/1/1901

    so if a customer calls in a few days the CSR will happily tell them they can expect to see someone come to their site in 2049

    Hehe, I've worked for them :w00t:.

  • RalphWilson (8/30/2010)


    Steve Jones - Editor (8/30/2010)


    I've always seen "N/A" as "Not applicable", but you give some good other responses that might be used. Hence another point for the "non-nulls" argument.

    I have dealt with DBA's who insisted on all columns being non-NULL and I have engaged in "disucssions" about such (both in person and in cyberspace) and I have yet to find a really good argument for changing my position: NULLs should be used when appropriate and avoided when not apporpriate.

    If there is a referential integrity rule that, in effect, says that every row in TableA has to have a FK/PK link to TableB, then it is inappropriate for that column in TableA to be NULL. However, if there is a column, for instance, in a Hospital Information System's PatientDetails table, that is supposed to contain a Date of Birth and it is possible that a) that information may notbe readily available (e.g. John Doe is unconscious when admitted) and b) it is not a show stopper if it either is incorrect or unavailable, I would rather have that column in that row left NULL than to assign an arbitrary DOB. If the column is not essential to the overall integrity of that row, when I am faced with the choice of using an arbitrary default value or leaving the column NULL, I choose the NULL.

    One argument that can hardly be touched : no matter how you deal with it in the db, you'll have to do work in the application to handle cases... or waste time training people so that 2049 is not the date we are planning to visit to do the work.

    If you decide to somehow ban all nulls in te db by adding tables, then you'll get all those nulls back as soon as you query anything because of all the left joins you'll have to perform.

  • So, Steve, what is the definition of "N/A"? Is it "Not Applicable" or "Not Available" or "No Answer"?

    Just because there are multiple possibilities does not invalidate the use of Nulls. Sometimes all of the answers are possible, hence the reason some people advocate the creation of multiple types of NULLs. If N/A could be put in an integer or a date field, then someone could make a better case against nulls. Until that time, however...

    In OLAP designs, there are no NULL FKs because OLTP Nulls can be accounted for with a dimension row. These have two possible rows, however, one for N/A (Not Applicable) and one for Unknown, because there should be an answer. An incomplete order will not have a completion date, and will appear in the N/A column for Completed Date. A completed order that doesn't have a complete date, however, will be in the Unknown column, because it should have a complete date. A missing Create Date will always be Unkown, because a created item should have one of these.

    This is possible in an OLAP design the date field can (and should) be masked with a text field, however, so it's possible use 12/31/69 for N/A and 1/1/70 for unknown, because the user will never see these. This would be awkward in an OLTP design.

    While we are on the subject of OLAP designs, there are average calculations that would not work w/o the use of Nulls because that's the best way to omit from a calculation rows that have nothing to do with that particular calculation.

  • I actually do the same thing with my data warehouse, having seperate keys for N/A and Unknown. If there really wasn't a part sold on an order line, then the part is N/A. If I have a customer without a customer type, it is Unknown. The difference is that there should be a customer type, but isn't. There doesn't have to be a part.

    However, since CustomerType is an attribute and not a dimension itself, and the source system allows NULLs, I originally had the field as a nullable column. This caused a lot of problems with groupings and filters, most of which were due to the old NULL <> NULL rule. It was then that all the NULLs were replaced with Unknown.

  • RonKyle (8/30/2010)


    Just because there are multiple possibilities does not invalidate the use of Nulls.

    I would agree, but it does make an argument that NULLS can be problematic. I think NULLs can work well in an application.

    Great point on the OLAP, ignore rows in a calculation, comment.

  • Now that we appear to be including DW in the mix, I have a question for y'all.

    Suppose I have a database in which I am storing the answers that are given during the various exams that a students might take for a course. The design of the exam is such that it is anticipated that no student will complete the exam (thus, there are NULLs expected for some number of answers to questions for each student for each exam). The scoring on the exam is such that 1 point is added for correct answers, 3 points are subtracted for wrong answers and any _un_answered_ (ie. skipped) questions result in a 1 point subtraction. (Thus, the test is designed to try to suppress pure guesses but to also encourage students to answer all questions before skipping them.)

    How do you differentuiate between the NULL of a skipped question (i.e. "No Answer Given") vs the NULL of the question that has not yet been reached (i.e. the "Not Applicable")?

    This has occassionally come up in my experience and I have developed sevral diffent qork-arounds. I am just seeking advise here. 😉

    Ralph D. Wilson II
    Development DBA

    "Give me 6 hours to chop down a tree and I will spend the first 4 sharpening the ax."
    A. Lincoln

  • alen teplitsky (8/30/2010)


    i've seen this no NULL's allowed OCD thing being taken to extremes

    say you sign up a customer today and promise some service that will be done in the near future but you're not sure of the date. since the column is not null you have to put in a crazy date like 1/1/2049. or if something was done in the past then 1/1/1901

    so if a customer calls in a few days the CSR will happily tell them they can expect to see someone come to their site in 2049

    In a scenario like this, when there can be multiple reasons for a date to be NULL, I typically allow the column to be NULL, but I also have a column indicating the status of the record. For example, a sales order shipment can have a status of 'Pending' or 'Cancelled', which would explain the non-existing schedule date. I will also have a check constraint, enforcing not-NULL conditionally for specific status values. That way the application or report builders don't have to make assumptions based on the absence of information.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Not sure I understand? Wouldn't a skipped question have a -1?

    For the answer, is a blank because it was skipped different than a blank because they didn't know the answer?

  • I don't understand either. How do you differentiate between a skipped an unanswered? Is it all questions above the last answered?

  • A "skipped" question is one that has not been answered but has a Question Number that is lower than the highest Question number that has been answered.

    An "Unanswered Question" is a question with a higher Question Number than the highest Question Number that has an answwer.

    Of course, you may have already thought of the question that I came up with when presented with this problem, "How do I know whether Question N+1, where N is the highest Question Number with an answer, is an unanswered vs a skipped question?"

    The Skipped Questions will have a NULL because there is no answer marked (and, therefore, no value to record). They will not have a value of -1 because a) they have no answer and b) there is nothing to preclude their being True/False, Multiple Choice (with alpha choices) or even essay questions.

    Ralph D. Wilson II
    Development DBA

    "Give me 6 hours to chop down a tree and I will spend the first 4 sharpening the ax."
    A. Lincoln

  • Steve Jones - Editor (8/30/2010)


    Not sure I understand? Wouldn't a skipped question have a -1?

    For the answer, is a blank because it was skipped different than a blank because they didn't know the answer?

    What if it was skipped because they did not know the answer?

    As Shakespeare once said: "That which we call a rose by any other name would smell as sweet."



    Alvin Ramard
    Memphis PASS Chapter[/url]

    All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

    For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • I don't remember who wrote this comment, but it's the most profound one I've ever come across regarding data quality. I'll paraphrase because I don't remember the exact quote:

    "The best quality information in a company is who owes them money. Everything else goes downhill very rapidly."

  • One argument that can hardly be touched : no matter how you deal with it in the db, you'll have to do work in the application to handle cases... or waste time training people so that 2049 is not the date we are planning to visit to do the work.

    And, if you choose a date in the past, you have to preclude it coming up as a "Late Service Call while, if you choose one in the future you risk it becoming a "reasonable" date (e.g. if you choose '2049-01-01' and it's some time in December of 2048, that date might look reasonable).

    It just seems like nothing is as easy as it should be, right? 😉 After all, in about 1976, some dang fool professor was chastising me and about 243 other students in an SMU Programming Languages course because we were wasting our time because "In another 10 or 15 years, there will be no need for programmers." 😉

    In fact, as I remember it, he claimed that computers would be able to interpet verbalized needs in order to create their own tables, indexes, etc. 😀

    Ralph D. Wilson II
    Development DBA

    "Give me 6 hours to chop down a tree and I will spend the first 4 sharpening the ax."
    A. Lincoln

  • Eric Russell 13013 (8/30/2010)


    Data integrity and constraint begins with choosing the appropriate data type. Obviously (to most of us) integers should be contained in an Integer data type and monetary values in a decimal with 2 decimal places. However, what I see happen all the time, even in major ISV applications, are date/time values contained in a varchar instead of a proper date data type. Not only does the reporting process take a performance hit with the data conversion, but inevitably there will be the occasional Feb 30. Even worse is when different users or applications use a different coding scheme and you have to sort out (in a where clause at runtime) the context of '3/11/09 6:22'.

    Amen,

    I've seen so many of these in older databases where I am... we work to squish them... but in many cases the business won't prioritize developer time to do so.

    Even better is when you find tables that violate 1NF. And have something like a column called groups with values like 'ADMIN;USER;MANAGER' in it.

    Come on... :'(



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link[/url]
    For tips on how to post your problems[/url]

  • RalphWilson (8/30/2010)


    A "skipped" question is one that has not been answered but has a Question Number that is lower than the highest Question number that has been answered.

    Why not just store the results as the questions are asked?

    Table_Answers(

    Answerid int identity(1,1) PRIMARY KEY,

    Testid int NOT NULL,

    questionid int NOT NULL,

    TestAnswer int,

    TestScore int NOT NULL)

    As the questions are asked and answered/skipped just store either

    1 in testscore if it's correct

    -3 in testscore if it's wrong

    -1 in testscore if it's skipped

    Then left join Table_Answers to Table_questions if you care about the unanswered questions.

    If Testanswer can be more than multiple choices throw in a varchar(max) or nvarchar(max) column for the free form answers... but those will be harder to grade via SQL anyway.



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link[/url]
    For tips on how to post your problems[/url]

Viewing 15 posts - 16 through 30 (of 43 total)

You must be logged in to reply to this topic. Login to reply