I've suffered through my share of data integrity problems in my career, and I thought I'd share some of these with you as well as what I've learned as a result. It all started with a forum post "is data integrity important?", and my first thought was I could write a book on the subject. I'll stick to just an article for now.
Is data integrity important? At face value, the answer is obvious. Of course it's important; of course we want consistent, complete data. If our data is completely invalid, why collect it at all? But the question is really why do we force data integrity, and specifically, why do we impose it at the database level. We all trust our users, right?
I don't, and I'm not ashamed to openly admit it. Every time I hear "human interaction", I assume an approximate 5% chance of error with every transaction. That means about 1 in every 20 transactions will contain erroneous data of some sort. Why? Because they're human. As humans, we forget things, we rush things, we neglect details, and we introduce typos. I'm not even talking malicious intent here. My boss used to get on me all the time for forgetting to fill out a field in our database, and yet I continued to forget. I meant well, I just forgot. I later introduced a program to fill in the data, which never neglected to fill in that field, and also saved me a lot of time. Computers do the same thing every time, humans don't. When I say I don't trust my users, I'm including myself.
As a DBA, if I were to have an arch-nemesis, it would be the nullable field that shouldn't be. I couldn't count on my hands and feet the number of times this ugly monster has reared its head. The above mentioned field is a perfect example. Were it a NOT NULL field, I would have been compelled to fill it out every time. A dialog box would have popped up, prompting me to fill in that required field. But it wasn't, and I was content to go about my job without even considering whether or not that field was blank.
The problem with NULL is that you don't know what it means. NULL doesn't really mean anything, but at the same time it could mean lots of things. If your users don't know what something means, or don't think it's applicable, there's a good chance they'll leave it blank if you let them. So NULL could mean the default value, it could mean not applicable, it could mean the user forgot, or it could mean the user didn't know what to put in the field. I recently ran into a case where NULL meant zero. We were fortunate that it seemingly consistently meant zero, but we spent hours digging into the data to find that out. Had it been zero in those fields, and not NULL, we could have saved a lot of time. Because there were so many NULLs in that particular field, I immediately questioned that data's accuracy.
Recently, I created a program that scheduled our orders automatically. To support this program, we needed more data than our ERP system supported, so we used a user defined (nullable) field to support this additional data. The users decided it would be easiest to just use NULL (blank) as the default code for this field. I objected. They fought back, but I was eventually able to state a good enough case that they let me have my way. NULL just meant too much, and I wasn't prepared to schedule orders based on the assumption that NULL meant default. Further, if NULL was the default, there was no way for me to report on parts that didn't have a proper code. Neglect is an inevitable factor in human interaction. When (not if) it happened, my program would be making the wrong decisions, and these erroneous decisions would be very difficult to discover. Although I cannot make this field a NOT NULL field, I can build a report listing all the parts with a NULL in that field. Further, my program throws an error when it sees a NULL, and does not schedule that order.
I could go on for hours about the evils of NULLs in critical data, but I won't. Referential integrity is equally important and deserves some time as well.
If you give your users a free-form field, how do you control what goes into that field? Simple - you don't. More correctly, you can't. Even if you give them guidelines about what should go in there, it's safe to bet money that something that doesn't belong will end up in that field, and soon. Typos, inconstancies between abbreviations, and just plain garbage data can introduce itself remarkably quickly. Free-form fields work great for comment fields, but are lousy for just about everything else. Anyone who builds reports will tell you: you can't report on a free-form field.
Most of us are familiar with referential integrity enforced through drop-down lists or lookup fields. You don't type the full name of the customer every time you enter an order for them. You look for them in a drop-down list or lookup dialog. Here referential integrity is being enforced by the application (and possibly the database as well), and you are not permitted to create an order for a customer that doesn't exist. What kind of mess would it be if the billing department sent several invoices to the same company, each with a slight variation of their name? They'd be outraged, and likely wouldn't pay until the problem was fixed.
What about when referential integrity isn't enforced? Our ERP system actually uses a free-form field for the ship state. Not only does that mean we could potentially get a shipment sent back to us, it makes it real hard to prepare a sales report by state. A great deal of time was spent mapping invalid state values into correct ones, and nothing's there to stop them from doing it again. Instead of consistently picking "FL" from a drop-down list, we have "FL", "Fla", "Florida", "Flordia", several Florida zip codes, and even NULL, all representing sales to Florida. That's just one state. It's a mess. It's further complicated by the fact that our commissions are determined by state.
Last but not least are check constraints. If you expect a positive integer between 1 and 100, don't assume that's all you'll get. Enforce it. What does -1 mean in a commission percent field? Will the system actually negate the commission from the sales rep? How long would it be before you noticed that?
Application Level vs. Database Level
So I hope I've stated a good case for why data integrity is not only important, but critical. It's also a huge time saver. Although every one of these problems was solvable, I'd have a lot more time on my hands if they never occurred in the first place. I could spend that time on strategic initiatives, analyzing systems, and saving user's time. So now that we've establish its importance, why do we want to enforce it at the database level? The application can enforce it quite well; that's good enough, right?
Not exactly. Application constraints work well if everything goes through the application (and the application never changes), but even the best applications don't meet all of the needs of every user. Invariably, developers end up creating automated systems, data loads and any number of other programs that go around the application and communicate directly with the database. Of course these should run against the test database first, but without constraints the developer can never be 100% sure that the data they are loading is complete. Constraints actually make their jobs easier, because if it's not complete, the system won't accept it. It may be a little frustrating at first, but in the long run it pays for itself with consistent data.
What about the application developers themselves? Can you really assume that the same developers will be developing the application all the time? What if a new developer creates a module that doesn't follow constraints? Consider also that data may be inserted from several different modules. It's doesn't make sense to add constraint checking code to each one. In this case, it's just plain good programming to push the constraint checking down to the lowest level.
Data Integrity is Your Friend
Referential integrity means piece of mind for the DBA. It also means an easier job for the report builder, and better (more accurate) reports as a result. It's when you want to retrieve data from the database that data consistency errors really rear their ugly head. Either someone has to spend valuable time scrubbing the data, or worse - bad data is sent out, and the company makes bad decisions based on it.
In the end, it really does pay off to enforce data integrity. Even if none, or very few, of these points apply to you now, they eventually will. Business applications need to scale because ultimately every business wants to scale. You can't assume that your application will only be used by a handful of people forever. In a short while, it could be the whole company using your application, and hopefully that company is growing. In the end, you'll thank yourself for getting it right the first time.