The Challenges of Being Safe

  • I am hoping that data scrambling would be a feature in the next version of Red-Gate's SQL Data Generator. This is currently a great tool for generating sample data. Adding this functionality to this tool would not be a far stretch.

  • In my experience data is given without being obfuscated or not given at all. There is nothing in between. But I think it is something that the market will require nowadays (or in the near future) with the offshoring of development etc.

    ** Don't mistake the ‘stupidity of the crowd’ for the ‘wisdom of the group’! **
  • In theory test data is nice.

    In the real world, though, you need to compare the outpu test data against a known system. You need to see that over history and your test period, sales, payroll, expense figures from the test machine match comparable figures from known production sources. You have to know that reports on customers, employees or suppliers consistently match exactly with known correct sources.

    Hard to do with randomized data.


    -- FORTRAN manual for Xerox Computers --

  • I have to agree with Jay about being able to test against real data to ensure outputs are consistent.

    At the same time, things like SSNs can be randomized without affecting that, so long as they aren't a key value. Considering the nature of SSNs and how poor a key they are, I've not yet had that problem. Not sure exactly how I'll deal with it if I do. Perhaps a cascading update and randomize the SSNs.

    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I often get sample files from clients to use in building import routines; about 90% of the time i then have to fix the routine because the actual run-time files are different from the samples provided. Some of it is due to formatting issues, but often i find that requested business rules cannot be applied because the data is unsuitable.

    For instance, setting a call date on newly imported records to 2 business days after the Ship date listed on the file; but the ship date field on the file is always blank.

    This is only one example of the kind of data inconsistencies i see every day. If i had to work in a data-obfuscated environment, it would be impossible to get things right the first time (up to 100% from 90). I, for one, don't find the idea of doing extra work to make my job even harder particularly appealing.

  • We have many legacy systems that span databases with SSN or other unique sensitive info about someone. Developers and testers needed a way to compare a paticular person and test across systems so the values needed to be identical. Each person has a unique employee or person id, then I run a script against all the databases and tables to update the SSN, and other numeric unique sensitive fields based on SET [SSN] = '100000000'+[EmployeeID]

    We don't have a massive amount of data so this worked very well for us and very simple to

    manage since everyone has a unique id the math above will generate the same ssn or other unique value everytime unless their id changes, which it never does for us.

  • I find this issue to be very challenging and time consuming when trying to complete my development work. I'm sure we would all love to be able to have test data and/or data obfuscated to protect from any possibility of data breaches. Unfortunately, some of the problems which arise with the data almost require you to have the actual data causing the problem in order to troubleshoot and resolve the problem which has occurred. I personally wish it wasn't so.

    I remember when I first learned how to program. During our coursework, we were required before producing any code to create our expected output (test data). In our test data, it was necessary for us to create valid and invalid records to test the validity of our logic within the applications. Unfortunately, the majority of the development positions I have worked have not made this step a an integral part of the development process.

    Incidentally, the company I currently work for does have several test accounts which are created by the quality assurance team. In fact, working in the financial industry, the company has funded some of the test accounts with money to further provide the ability to test. I found this to be an interesting and welcome change from other jobs I have worked. Though, there never seems to be enough of this accounts available to perform your tests because many development efforts are using them and changing them could hamper their efforts. But, I guess having them is better then not having them.

    On a final note, we have recently purchased Visual Studio Team System Database Edition from Microsoft for our Database Development team. Though, we are still working our way through all the different functionality the software provides, an interesting part of the software related to this discussion are the data generation plans. I have only played around with them through one of the "walkthroughs" but it provides the ability to create, as the name implies, data and save the plan for future use. The data generation plans can be modified and appear to provide the ability to create PK/FK relationships. You also can create your plan to grab the real data, for example a state table, or you can have it generate the data itself. I have not had a chance to experiment any further than the initial walkthrough so I still have to evaluate the pros and cons of using this in our development environment. But, it may prove to be what we need to make our development easier.

    Thanks for the great topic Steve.

  • I agree with the points about totals. I used to work with a financial firm, and when we changed around data for test systems, it caused lots of headaches. We couldn't easily determine if calculations in the application were correct. I think in those cases, you need tight control of a development environment, and limited access for people, and probably auditing as well.

    For other data, we used to scramble SSNs with sequential numbers (111-11-1111, 111-11-1112, etc.) and that worked. CCs were moved to known "test" numbers. banks usually can give you valid numbers from a check standpoint, but not valid for purposes. Emails became email1@mycompany, etc. We'd set up accounts for testing to get limited emails out and in.

    It's a tough battle. Red Gate probably could add some functionality here. Not sure it's simple, but they could do it. I also heard about Data Masker (, which could help.

  • My advice is to use legal Nondisclosure Agreements

    Regards,Yelena Varsha

  • Yelena Varshal (3/10/2009)

    My advice is to use legal Nondisclosure Agreements

    I tend to agree with Yelena. If you don't trust them, they shouldn't be there anyway. Company data or intellectual property would fit this area. However, with customer personal data, like SSN, DOB, CCN, and other similar personal and credit data, I would make the effort to obfuscate/random generation for testing, as only those who actually need to know to conduct business should ever see these, and then only when they absolutely must. If there is a way to automate it so they do not see it, I go for those methods.

  • Steve

    Good article.

    Disclosure up-front, I work with one of the companies you mentioned above who are authors of data masking software. You make many good points and I thought that I might add a few comments to some topics.

    One thing we recommend is to make sure that production data is always kept safe, which means that you keep control of where it lives and how it's used. I saw a post recently where someone needed to send data to a business partner and wanted to obfuscate it. I think that is a great thing to do and was glad to see someone asking for advice.

    In many cases, this seems more of a moral responsibility to protect your clients than it is a legal requirement. In times of tight budgets this is sometimes pointed out. However, if one is trying to assess the cost/benefits of masking test data then it is justifiable to consider the public relations damage a data escape would entail.

    However what about your test and development environments? I've seen people obfuscate data here, but not always. In fact, not usually. This is despite the fact that you might turn over developers often, expose that data to testers, or other people that might not normally have access, and the fact that these environments almost never have auditing enabled.

    The case can be made that if the data visibility is restricted in a production database then the visibility of that same data must be similarly protected in test databases where (usually) far more people have access. Realistically, the only way to do this and still have the wider access required by dev and test teams is to mask the data.

    We need to mimic production environments and data, but there are quite a few challenges with doing this in a safe and secure manner. Just having scripts to obfuscate data is a challenge in and of itself.

    As you say, the mechanics of the process are one issue. Others issues, such as finding reasonable substitution data and managing the transfer of this data into the database are also present. One technique, which avoids the need to find substitution data, is to shuffle. But this is not universally applicable - what if the table is small or contains information such as email addresses which are useful by themselves.

    But building those scripts is both hard and time consuming. How do you decide what to obfuscate? What values do you use?

    In our experience this analysis phase is what takes the time and needs the buy-in of the application owners. When deciding what fields to mask, trade-offs often need to be made. For example, it might be decided that it is not necessary to mask salary amounts since that would have serious “knock on” effects on other data items such as calculated departmental totals. The justification for this may be that since all other personally identifying information associated with this value has been masked, then there is minimal gain and much extra complexity involved with masking that particular item.

    What about ensuring that data matches up correctly? Can you really determine if there is an issue with some calculation or relationship if you have random data.

    This sort of data synchronization issue is common. In fact it is rare to find a database that does not require it. There are really three types of synchronization: internally within the same row (Row-Internal), between rows in the same table (Table-Internal) and between rows in different tables (Table-To-Table). These are all handled differently and sometimes you have to use all three techniques on a single table.

    After all people many times will have favorite accounts that they know well and understand what the data should look like. A developer may expect certain order details or address information, and use that as a benchmark when developing new code. If the data is random every time his environment is refreshed, does that slow his productivity? How do you test things like URLs and emails if data is randomized?

    There are a variety of approaches to this sort of “consistent masking”. We have found that the easiest technique to implement is just to mask the data with random values and then have a set of rules (or scripts) which go back through the database afterward setting to constant known values the specific cases the dev, test and training teams use.

    Using some type of data obfuscation or randomization is a great way to help ensure that your production data is kept safe, but it definitely makes for a much more complex environment, and likely, more headaches for DBAs and developers.

    Very good point. Whether the masking is done with a set of scripts or a purchased tool it is well worth implementing an automated process. You really don’t want the masking of the database to be an “all hands to the pump” type operation every time a test database is masked. This will cost much more in the long term!

  • My advice is to use legal Nondisclosure Agreements

    Nondisclosure is not enough there are laws in place for Banks and Pharma I think these laws needs to be implemented in any place personal data in at risk because a developer did not leave VA with 26 million people data it was a brain dead unskilled data analyst. Personal data needs to be accessed only as needed. These are the reason data theft at banks are not skilled employees but low level employees doing transport or tellers.

    Personal data should not leave any facility without escort and when developing SET algebra must be required because if you know SET algebra teen agers will not be dead and dead people will not be alive because these two are inter dependent, safety and integrity.

    Kind regards,
    Gift Peddie

  • As pointed out randomizing the whole lot is rather pointless because of data-consistency and inter-relationships that need to be real(istic). Especially when trouble-shooting it is essential to have the exact data that causes the issue (you don't know before-hand which piece of data is the culprit, so by obfuscation you run a real risk that you also obfuscate the very problem you're trying to track down).

    To tell you the truth: I've never actually done obfuscation of test-data and always used a mirror of the production-database in development. However I have been thinking about it on several occasions and there are a few distinct situations for use of data that each have different needs for levels of realism:

    - Initial development of new/modified functionality -> can be done with a small set of data created on the fly by the developer in his private development-database (no need for mirroring production-data).

    - Automated unit-testing -> has to be done with a known set of data with known characteristics (covering all cases -> is code-paths), and should not be changed over time other than in conjuction with new test-cases (no need for mirroring production-data).

    - Performance-testing -> has to be done with a realistic amount of data with realistic relationships/value-distribution (only feasible solution is a mirror of production-data, some obfuscation possible).

    - Acceptance-testing -> real users are going to use the data and they need to see real(istic) data (only feasible solution is a mirror of production-data, some obfuscation possible).

    - Troubleshooting an issue in the production-system -> only a true mirror of the production-data is useful (no obfuscation because you run the risk of obfuscating the actual issue).

    By limiting obfuscation only to the "hard" personal identifyable data: names, addresses and public identification numbers (such as social security number, tax-number and credit card numbers) you can still use it for testing with realistic outcomes. No business-rules will break when a name is actually a random string (within certain size-constraints), since your application should accept pretty much anything in a name-field anyway. In an address only the street-name needs to be obfuscated (together with the randomisation of the person/company name, a malicious user has no hope in using a random name in a random street of a known city).

    Credit-card numbers without a name becomes mostly unusable, but for extra measure you could replace those with known test-numbers banks will be able to provide intended specifically for this purpose (they pass internal validation-checks but won't be usable to make any purchases).

    Email-addresses are often used explicitly by applications to actually send messages (which you DO NOT want to happen from a non-production environment, ever) and could also be used to identify a real person. So these must be obfuscated but cannot be merely randomised, the best solution for this is to have a set of addresses on a dedicated (test-)domain and distribute those addresses (randomly) in the database.

    Now about using external ids for internal PK/FK: BAD IDEA!!!

    It is my almost religious belief that a database MUST provide and totally control the mechanisms for referential integrity, any external identification key should be a mere piece of data rather than structural to the database. In my own designs this goes even as far as product-codes and employee-numbers, if any human-entered key is required, I will NOT use it as Foreign Keys, but will provide an internal (hidden) auto-number (or even a GUID in my latest designs). Any information entered by a human is always prone to have mistakes which must be easily corrected -> a single field on the primary object, which is then joined and queried for output. The situation is even worse when the external party that created the key (tax-office, internet provider) chooses to update/refactor their system and changes the keys of existing data........

    So if your database has any external keys (especially those that could be used for personal identification if known to a malicious person) each should be considered to be replaced by either a pure random number, or if the application tests validity in some way, the data should be replaced by a set of test-numbers that satisfy the validity-test (possibly even hardcoded bypassing any validity-test if the key is recognisable as a test-number by a special prefix or certain checksum).

    So a script/tool only needs to take the following steps to obfuscate a production-database:

    - randomize name-fields: people/company names, street-names

    - replace external identification numbers (SSN, CC, etc) with valid test-numbers

    - replace all emails with known test-email accounts on the internal mail-server

    A good idea for randomizing names is to use an algorithm that still renders human-readable names. I would use a (fairly large) dictionary of common names as a source.

    And while you're at it: use an algorithm that uses the primary key (autonumber or what-ever you use) to render into the same pseudo-random name. This way in next iterations your testers will find the same "name" to be the same record again.

    One caveat I experienced and do not know of a satisfactory solution:

    Comment text-blobs. Often those comments contain crucial info about an account that is required to understand the data (for example explainations why this particular data is expected to violate a certain rule, therefore important to keep for testers). But staff often disclose names and even cc-info in those comments, which is of course not a good practice, but a real-world fact. It is impossible (eh, correction: unfeasible) to automatically crawl this data and obfuscate the unwanted info.

    But in my experience just plainly trusting the developers/testers is the easiest solution (I've always worked in small companies).

  • I believe the tool you are looking for is FileAid by CompuWare.

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply