RE: The Challenges of Being Safe

Say Hey Kid

Points: 668

March 10, 2009 at 6:34 pm

As pointed out randomizing the whole lot is rather pointless because of data-consistency and inter-relationships that need to be real(istic). Especially when trouble-shooting it is essential to have the exact data that causes the issue (you don't know before-hand which piece of data is the culprit, so by obfuscation you run a real risk that you also obfuscate the very problem you're trying to track down).

To tell you the truth: I've never actually done obfuscation of test-data and always used a mirror of the production-database in development. However I have been thinking about it on several occasions and there are a few distinct situations for use of data that each have different needs for levels of realism:

- Initial development of new/modified functionality -> can be done with a small set of data created on the fly by the developer in his private development-database (no need for mirroring production-data).

- Automated unit-testing -> has to be done with a known set of data with known characteristics (covering all cases -> is code-paths), and should not be changed over time other than in conjuction with new test-cases (no need for mirroring production-data).

- Performance-testing -> has to be done with a realistic amount of data with realistic relationships/value-distribution (only feasible solution is a mirror of production-data, some obfuscation possible).

- Acceptance-testing -> real users are going to use the data and they need to see real(istic) data (only feasible solution is a mirror of production-data, some obfuscation possible).

- Troubleshooting an issue in the production-system -> only a true mirror of the production-data is useful (no obfuscation because you run the risk of obfuscating the actual issue).

By limiting obfuscation only to the "hard" personal identifyable data: names, addresses and public identification numbers (such as social security number, tax-number and credit card numbers) you can still use it for testing with realistic outcomes. No business-rules will break when a name is actually a random string (within certain size-constraints), since your application should accept pretty much anything in a name-field anyway. In an address only the street-name needs to be obfuscated (together with the randomisation of the person/company name, a malicious user has no hope in using a random name in a random street of a known city).

Credit-card numbers without a name becomes mostly unusable, but for extra measure you could replace those with known test-numbers banks will be able to provide intended specifically for this purpose (they pass internal validation-checks but won't be usable to make any purchases).

Email-addresses are often used explicitly by applications to actually send messages (which you DO NOT want to happen from a non-production environment, ever) and could also be used to identify a real person. So these must be obfuscated but cannot be merely randomised, the best solution for this is to have a set of addresses on a dedicated (test-)domain and distribute those addresses (randomly) in the database.

Now about using external ids for internal PK/FK: BAD IDEA!!!

It is my almost religious belief that a database MUST provide and totally control the mechanisms for referential integrity, any external identification key should be a mere piece of data rather than structural to the database. In my own designs this goes even as far as product-codes and employee-numbers, if any human-entered key is required, I will NOT use it as Foreign Keys, but will provide an internal (hidden) auto-number (or even a GUID in my latest designs). Any information entered by a human is always prone to have mistakes which must be easily corrected -> a single field on the primary object, which is then joined and queried for output. The situation is even worse when the external party that created the key (tax-office, internet provider) chooses to update/refactor their system and changes the keys of existing data........

So if your database has any external keys (especially those that could be used for personal identification if known to a malicious person) each should be considered to be replaced by either a pure random number, or if the application tests validity in some way, the data should be replaced by a set of test-numbers that satisfy the validity-test (possibly even hardcoded bypassing any validity-test if the key is recognisable as a test-number by a special prefix or certain checksum).

So a script/tool only needs to take the following steps to obfuscate a production-database:

- randomize name-fields: people/company names, street-names

- replace external identification numbers (SSN, CC, etc) with valid test-numbers

- replace all emails with known test-email accounts on the internal mail-server

A good idea for randomizing names is to use an algorithm that still renders human-readable names. I would use a (fairly large) dictionary of common names as a source.

And while you're at it: use an algorithm that uses the primary key (autonumber or what-ever you use) to render into the same pseudo-random name. This way in next iterations your testers will find the same "name" to be the same record again.

One caveat I experienced and do not know of a satisfactory solution:

Comment text-blobs. Often those comments contain crucial info about an account that is required to understand the data (for example explainations why this particular data is expected to violate a certain rule, therefore important to keep for testers). But staff often disclose names and even cc-info in those comments, which is of course not a good practice, but a real-world fact. It is impossible (eh, correction: unfeasible) to automatically crawl this data and obfuscate the unwanted info.

But in my experience just plainly trusting the developers/testers is the easiest solution (I've always worked in small companies).