Data with Provenance

Relational databases are the good guys of the whole privacy debate. I ought to qualify that statement: a properly normalised relational database, with its full quota of CHECK constraints, is the Goody-Two-Shoes of the privacy debate. OK, maybe just one shoe. Even so, if you are trying to maintain the viewpoint that compliance is mainly a relational database problem, then you are deluding yourself.

A privacy researcher in the USA once demonstrated the dangers of open data and denormalisation by unmasking a pseudonymised database. It was one of the many databases sent by healthcare organisations for epidemiological research. Every piece of data in the medical histories that could identify an individual had been conscientiously masked out. Unfortunately, nobody had spotted an insignificant-looking XML column in an innocent-looking table. The XML fragments in this column recorded contacts between the individual patient and the healthcare staff. It was a miniature, schema-less database full of identifying nuggets that allowed the researcher, and presumably any villain who might come across it, to identify many of the patients.
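To make the point concrete, here is a minimal sketch of the sort of check that might have caught that XML column before the data left the building. It walks every element and attribute in a fragment and flags anything that looks like an email address or a phone number. The element names, the sample fragment and the two deliberately crude patterns are all my own assumptions for illustration; a real masking review would need a far wider net.

```python
import re
import xml.etree.ElementTree as ET

# Deliberately crude, hypothetical patterns: illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{5} \d{6}\b")


def find_identifiers(xml_fragment):
    """Return (tag, kind, text) for each suspicious text node or attribute value."""
    hits = []
    root = ET.fromstring(xml_fragment)
    for element in root.iter():
        # Check the element's own text and every attribute value.
        values = [element.text or ""] + list(element.attrib.values())
        for value in values:
            for kind, pattern in (("email", EMAIL), ("phone", PHONE)):
                for match in pattern.findall(value):
                    hits.append((element.tag, kind, match))
    return hits


# A made-up fragment of the kind that hides in a 'notes' column.
fragment = """
<contacts>
  <contact date="2019-03-02">
    <note>Patient asked to be called back on 01632 960123</note>
    <note>Results sent to pat.example@example.com</note>
  </contact>
</contacts>
"""

for tag, kind, text in find_identifiers(fragment):
    print(f"<{tag}> contains a possible {kind}: {text}")
```

Pattern matching of this sort only finds the identifiers you thought to look for, which is rather the point of the story.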

However, this sort of problem is nothing compared to an entire document-based database that prides itself on being 'schema-less'. We of the old school always scratched our heads over how this could allow the management of any organisation to handle personal or transactional data responsibly. For example, now that you are obliged to be able to remove personal data when appropriate, such as when a customer leaves, or a member of a society quits, how do you do so?

Schema-less document databases have many uses; even I use them enthusiastically for particular purposes, but never when the data could be required for legal reasons. Why not? Because when a schema isn't enforced, people will, and do, slip data into inappropriate places, and the organisation will be unable to demonstrate the authenticity and provenance of the data. It may not even know that the data is there and will have no idea how to comply with a legal request to remove it.
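A short sketch shows why removal is so awkward. Without a schema, the only way to find where a customer's details have ended up is to search every value in every document. The function below does that for a single nested document, reporting the path of each string that contains the search text. The document shape, the field names and the function name `find_value_paths` are hypothetical, invented purely for this example.

```python
def find_value_paths(document, needle, path="$"):
    """Yield the path of every string in a nested dict/list that contains needle."""
    if isinstance(document, dict):
        for key, value in document.items():
            yield from find_value_paths(value, needle, f"{path}.{key}")
    elif isinstance(document, list):
        for index, value in enumerate(document):
            yield from find_value_paths(value, needle, f"{path}[{index}]")
    elif isinstance(document, str) and needle.lower() in document.lower():
        yield path


# A made-up order document: the email address belongs in one place only,
# but somebody has also typed it into a free-text audit comment.
order = {
    "customer": {"name": "Ada Example", "email": "ada@example.com"},
    "items": [{"sku": "X1", "gift_note": "Happy birthday from Ada Example"}],
    "audit": {"comment": "Refund agreed with ada@example.com by phone"},
}

for found in find_value_paths(order, "ada@example.com"):
    print(found)
```

Even this assumes that you know exactly what text to search for. A misspelt name or a reformatted phone number would slip straight past it.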

However, even those document databases that are entirely innocent of any enforced schema are relatively virtuous. Personal or financial data held in files, especially password-protected files, backups or old email accounts, are where the scariest leaks can happen, and where it becomes impossible to redact false data.

You need to give provenance to the data in your document databases, and to do that the documents must be validated against a schema that is enforced. JSON Schema is improving greatly and is beginning to be used effectively in databases that store JSON data. If you use JSON in SQL Server, then you should be planning how to enforce its schema, as you do (gulp) with XML documents. Hopefully, one day JSON schemas will be built into SQL Server like XML Schema, but I can't see it happening soon.
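As a rough illustration of what enforcement buys you, here is a hand-rolled check, written before the document is stored, that rejects missing fields, wrong types and, most importantly, any property that the schema does not mention. It is not a JSON Schema implementation: it merely mimics the effect of the standard's `required`, `properties` and `additionalProperties: false` keywords, and the schema layout and the `validate` function are my own assumptions. In practice you would reach for a proper JSON Schema validator.

```python
def validate(document, schema):
    """Return a list of error strings; an empty list means the document conforms."""
    errors = []
    properties = schema["properties"]
    for name in schema.get("required", []):
        if name not in document:
            errors.append(f"missing required property: {name}")
    for name, value in document.items():
        if name not in properties:
            # The equivalent of additionalProperties: false.
            errors.append(f"unexpected property: {name}")
        elif not isinstance(value, properties[name]):
            errors.append(f"wrong type for property: {name}")
    return errors


# Hypothetical schema: property names mapped to the Python type expected.
CUSTOMER_SCHEMA = {
    "required": ["customer_id", "name"],
    "properties": {"customer_id": int, "name": str, "email": str},
}

good = {"customer_id": 1, "name": "Ada Example"}
bad = {"customer_id": 1, "name": "Ada Example", "notes": "rang from 01632 960123"}

print(validate(good, CUSTOMER_SCHEMA))
print(validate(bad, CUSTOMER_SCHEMA))
```

Refusing unknown properties is the part that matters for provenance: if data cannot be slipped into a place the schema does not describe, you at least know where to look when asked to remove it.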

Phil Factor.