This editorial was originally published on Oct 13, 2008. It is being re-run as Steve is out today, traveling to a SQL in the City event and SQL Saturday in Sacramento.
It's a fictional scenario, or at least I hope it is: Bruce Schneier has a great piece on identity farming, a long-term way to create false identities that would fit well in the spy world. Mr. Schneier doesn't see a practical point in doing it, but it's interesting from a data standpoint because of the point it raises: all of this could be done without a real person existing to back up the data that's created.
The part that strikes me about this piece is that all too often we make assumptions about the people or entities behind the data we use. One bad foreign key, one orphaned child row, or one incorrectly transformed value can snowball at a tremendous rate, and it can be hard to determine what went wrong.
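To make the orphaned-row problem concrete, here is a minimal sketch using an in-memory SQLite database and a hypothetical customers/orders schema (the table and column names are illustrative, not from the editorial). SQLite does not enforce foreign keys by default, so a child row pointing at a nonexistent parent slips in silently; a simple anti-join then surfaces it.

```python
import sqlite3

# Hypothetical schema: a customers parent table and an orders child table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Steve Jones');
    INSERT INTO orders VALUES (10, 1, 25.00);   -- valid child row
    INSERT INTO orders VALUES (11, 99, 40.00);  -- orphan: no customer 99
""")

# With foreign keys unenforced (SQLite's default), the orphan got in.
# An anti-join finds child rows whose parent does not exist.
orphans = conn.execute("""
    SELECT o.order_id, o.customer_id
    FROM orders o
    LEFT JOIN customers c ON c.customer_id = o.customer_id
    WHERE c.customer_id IS NULL
""").fetchall()
print(orphans)  # [(11, 99)]
```

Every query or report that joins through that bad key from here on inherits the error, which is exactly how one bad row snowballs.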
The blog entry talks about our data shadows, which grow larger all the time. Unless you are actively trying to limit yours, every day there are likely new entries in some database about your life. And more and more, companies and institutions interact with our data shadows instead of with us. Credit checks, marketing efforts, boarding an airplane, making a purchase: all of these trigger checks against the shadow of data in our lives, without necessarily ensuring that the shadow is tightly linked to each of us.
I've had more than my share of confusion because of my name; it's common, it appears in almost every database, and it's shared by thousands, if not millions, of people. On one hand, that means I'm a little lost in the flood of "Steve Joneses" out there. On the other hand, it makes mistakes hard to correct. If there are six people with the same name and you have an orphaned record, who do you link it to? Do you guess? Infer it from the other data? I'd like to think you would need to research it somehow, contact me, and note that the quality of this data could be suspect.
People working with information try to be accurate, but they get busy, and mistakes happen. I'm sure I'll find more over time, and I don't have a great solution for what might work better. I'd like to think we would implement better data quality checks, fuzzy searches, and some way to assign "risk" values to data: something to let people know there might have been an issue.
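The fuzzy-search-plus-risk-value idea could be sketched like this, using Python's standard-library `SequenceMatcher` as a stand-in for a real similarity measure. The `match_risk` function and its risk formula are my assumptions for illustration; real entity resolution would weigh addresses, dates of birth, and other attributes, not just names.

```python
from difflib import SequenceMatcher

def match_risk(candidate, existing):
    """Score each existing name against a candidate record.
    Risk is defined here (an assumption) as 1 - similarity, so a
    lower score means we are more confident the link is correct."""
    scored = []
    for name in existing:
        similarity = SequenceMatcher(None, candidate.lower(), name.lower()).ratio()
        scored.append((name, round(1.0 - similarity, 2)))
    return sorted(scored, key=lambda pair: pair[1])  # lowest risk first

records = ["Steve Jones", "Steven Jones", "Stephen Johns", "S. Jones"]
ranked = match_risk("Steve Jones", records)
print(ranked[0])  # an exact match carries zero risk: ('Steve Jones', 0.0)
```

Attaching a score like this to a linked record, rather than silently picking the closest name, is one way to flag for downstream users that the data might be suspect.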
It's a thorny problem, one that's not going away, and it's likely to become more troublesome in the future.