This editorial was originally published on Mar 19, 2009. It is being re-run today as Steve is on vacation.
One thing we recommend is to make sure that production data is always kept safe, which means that you keep control of where it lives and how it's used. I saw a post recently where someone needed to send data to a business partner and wanted to obfuscate it. I think that is a great thing to do and was glad to see someone asking for advice. Sending data to business partners is required, but you want to be sure that you are not necessarily giving away too much information, especially identity, privacy, or financial information that you shouldn't.
However what about your test and development environments? I've seen people obfuscate data here, but not always. In fact, not usually. This is despite the fact that you might turn over developers often, expose that data to testers, or other people that might not normally have access, and the fact that these environments almost never have auditing enabled.
We need to mimic production environments and data, but there are quite a few challenges with doing this in a safe and secure manner. Just having scripts to obfuscate data is a challenge in and of itself. I'd love to see a tool built that would do this automatically, though I wonder how many companies would actually buy it. I'm not sure there's a great market for tools here until insurance companies start require it for your "data loss insurance."
But building those scripts is both hard and time consuming. How do you decide what to obfuscate? What values do you use? How do handle PK/FKs to ensure that things match up correctly if you've used personal information, like a SSN, as a key field?
Beyond that, there's much more. What about ensuring that data matches up correctly? Can you really determine if there is an issue with some calculation or relationship if you have random data. After all people many times will have favorite accounts that they know well and understand what the data should look like. A developer may expect certain order details or address information, and use that as a benchmark when developing new code. If the data is random every time his environment is refreshed, does that slow his productivity? How do you test things like URLs and emails if data is randomized?
And what about when you alter your schemas? How much work is there to update your scripts?
And what about mistakes? How often has someone pointed development code at a production server? What happens if they accidently run an obfuscation script against production? If you've scrambled data around, would you catch it quickly enough to restore before too many transactions were sent through?
Using some type of data obfuscation or randomization is a great way to help ensure that your production data is kept safe, but it definitely makes for a much more complex environment, and likely, more headaches for DBAs and developers.
The Voice of the DBA Podcasts
The podcast feeds are now available at sqlservercentral.mevio.com to get better bandwidth and maybe a little more exposure :). Comments are definitely appreciated and wanted, and you can get feeds from there.
Today's podcast features music by Everyday Jones. No relation, but I stumbled on to them and really like the music. Support this great duo at www.everydayjones.com.
I really appreciate and value feedback on the podcasts. Let us know what you like, don't like, or even send in ideas for the show. If you'd like to comment, post something here. The boss will be sure to read it.