Masking Data

  • Comments posted to this topic are about the item Masking Data

  • Hi,

    Looks like the discussion here is about encryption of data 🙂. If so, I don't think it's a good idea to lose information and then have to chase around for what the data originally was 😉 when sequential numbers are used instead of the real data. I only have a little knowledge here, but you could write your own encryption in a stored procedure, or mask the important information before displaying the report; a quick sketch of the display-time approach follows.
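
    A minimal sketch of display-time masking, assuming a hypothetical dbo.Customer table with a CardNumber column of at least four characters; only the last four digits survive into the report:

    -- Mask everything but the last four digits at query time; the stored
    -- value is untouched, so no information is actually lost.
    SELECT
        CustomerID,
        REPLICATE('X', LEN(CardNumber) - 4) + RIGHT(CardNumber, 4) AS MaskedCardNumber
    FROM dbo.Customer;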

  • Generally the British Government puts the data on a CD and posts it to a random address. πŸ˜€

    Seriously though, I used to have a utility which would take a real data set but mix the data up randomly so that none of the names or details matched each other. For fields like bank accounts, you had the option to scramble the contents of the field or replace it with a random number of the same length. The algorithm also checked that it did not randomly put the data back in the same record (a sketch of that check follows below).

    David
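
    A hedged sketch of that kind of no-fixed-point shuffle, assuming a hypothetical dbo.Account table keyed by AccountID; it reshuffles until no row keeps its original AccountNumber. (This is the simplest possible loop for a sketch; a production version would swap only the offending rows, and a column dominated by duplicate values can never fully pass the check.)

    -- Snapshot the originals so we can verify no row keeps its old value.
    SELECT AccountID, AccountNumber AS OriginalNumber
    INTO #orig
    FROM dbo.Account;

    DECLARE @fixed INT = 1;
    WHILE @fixed > 0
    BEGIN
        -- Pair each row (in key order) with a value drawn in random order.
        ;WITH ordered AS (
            SELECT AccountID,
                   ROW_NUMBER() OVER (ORDER BY AccountID) AS rn
            FROM dbo.Account
        ), randomized AS (
            SELECT AccountNumber,
                   ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
            FROM dbo.Account
        )
        UPDATE a
        SET a.AccountNumber = r.AccountNumber
        FROM dbo.Account a
        JOIN ordered o    ON o.AccountID = a.AccountID
        JOIN randomized r ON r.rn = o.rn;

        -- Count rows that landed back on their original value; loop if any did.
        SELECT @fixed = COUNT(*)
        FROM dbo.Account a
        JOIN #orig o ON o.AccountID = a.AccountID
        WHERE a.AccountNumber = o.OriginalNumber;
    END;

    DROP TABLE #orig;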

  • At my company, we do clean up & hide the data from production, but not nearly enough and, unfortunately, not across all teams. The biggest problem we have is getting people to buy into the idea that, while it doesn't LOOK exactly like the production data, it's distributed in the same way inside of the same structure, so it will BEHAVE exactly like the production data.

    For all the good this argument does, I might as well be speaking Greek.

    Molon Labe!

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Does anyone know of any products like this that are affordable for a small software developer? I don't need to worry about bank accounts or social security numbers, just masking the data so it is not recognizable.

  • Funny you should ask. I just recently installed the beta of Red Gate's forthcoming SQL Data Generator. While it had some drawbacks and limitations, which I'm sure the fine folks at Red Gate are working on, it did seem to give a fairly complete set of sample data that I was able to use to build a project around.

  • I can see one advantage and one disadvantage to add to the reasons already listed for dummying up data in the test environment:

    - dummy data is better for obscuring information in user manuals

    - dummy data could hide problems that real data would expose in test (how often would you think to put an apostrophe in dummy data?)

    Steve

  • We work with medical data, so we CANNOT have copies on our local machines for development work.

    I wrote a series of scripts to obscure our data. It took about a week to write and test; a sketch of the name and birthdate steps follows this post.

    - Names: I pull the couple hundred most common first and last names, join them together, then overwrite the real first and last names. (Did it twice, once for males and once for females. It looks silly when Jane has testicular cancer or John has an appointment in the OB clinic.)

    - Cities/zips: I replaced all the cities and zips having fewer than 100 entries with the most common city/zip.

    - Addresses: I pulled the most common street names, joined them with a number table to give a sequence of addresses, and overwrote the addresses with the results.

    - Birthdates: I added a random number between 1 and 30 to each birthdate, so we'd still have pretty much the same distribution of ages but none of them would actually be the same. (And we wouldn't end up with newborns diagnosed as pregnant, which could happen if we randomly assigned birthdates!)

    - Phones: I wimped out here. All home phones are 555-1234 and all work phones are 555-6789.

    - Patient numbers and SSNs: all obscured too.

    - Audit logs: most were just too much to deal with, so a lot of them simply say something like "Patient data here".

    We take a backup of the PROD database and restore it to a database on the departmental DEV server. (It's in a secured server room, so it's allowed to have real data.) We then manually run scripts to delete all but the last two years of records, then run the scripts to obscure what's left. We back up the obscured database and restore that backup on our development machines.

    It's a pain, but we don't run the risk of losing a developer's machine (sitting in a cubicle while someone is at lunch) with real patient data on it. HIPAA fines are pretty stiff, even for unintentional release of patient records.
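
    A hedged sketch of the name and birthdate steps, with a hypothetical dbo.Patient table and a #CommonNames work table (the few hundred most common names, with a Gender column) standing in for the real schema:

    -- Overwrite names with a random pick from the common-names list for the
    -- same gender; the correlated CROSS APPLY forces a fresh pick per row.
    UPDATE p
    SET p.FirstName = pick.FirstName,
        p.LastName  = pick.LastName
    FROM dbo.Patient p
    CROSS APPLY (
        SELECT TOP (1) n.FirstName, n.LastName
        FROM #CommonNames n
        WHERE n.Gender = p.Gender
        ORDER BY NEWID()
    ) pick;

    -- Shift every birthdate forward by a random 1-30 days: the age
    -- distribution stays roughly intact, but no real birthdate survives.
    UPDATE dbo.Patient
    SET BirthDate = DATEADD(DAY, ABS(CHECKSUM(NEWID()) % 30) + 1, BirthDate);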

  • During my search for obfuscation methods, I ran across this article written by an Oracle person; you might find it interesting:

    http://www.jaredstill.com/downloads/Data_Obfuscation_and_Encryption.doc

    One of the topics covered is how to present data to an external auditor without revealing the real data, but without losing your ability to get back to the real data if required for auditing purposes.

  • Yes, oftentimes developers giving a demo or training want realistic data. Also, when doing any performance testing, you cannot get good statistics and index usage if the statistical scattering of your data is not representative of what you have in production.

    In these situations I would write scripts for every column of every table which held any identifiable data. Let's take city, for instance. I would create a temporary table with three columns: id int, cityname varchar(50), used bit. I would insert all the cities (not distinct cities, but every instance of a city name) into this table. I might update all cities with the name of, say, Chicago to Boston, or Denver to San Francisco, then I might add a number of cities, real or fictional, to the table. I would then use a cursor to go through the original data, randomly selecting a city from this temporary city table where used = 0 and cityname <> city (see the sketch after this post). For the seed of the random function I would take whatever wild hair I had that day and plug it in.

    Now for numbers (social security numbers, zip codes, phone numbers, street numbers) I would randomly change every digit and try to keep the data statistically proportional to real data.

    For text or image data, we would often just update these with common text like "this is test data", or replace the images with a stock image, since these fields were not used for search criteria. That also gave people a greater sense of ease, knowing that even though the data looked fairly real, it was not.

    It is a pain, and it takes quite a bit of time to identify all the fields that must be modified, as well as to write the scripts to do so. However, once you've written the scripts for a certain type of data, you can easily reuse or modify them. Fortunately I didn't need to do this often.
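
    A hedged sketch of the cursor pass described above, with a hypothetical dbo.Customer table standing in for the real one. (A production version would need a fallback for the last few rows, which may have no unused, non-matching city left.)

    -- Build the pool from every instance of a city name (not DISTINCT),
    -- so the replacement distribution matches the original one.
    CREATE TABLE #City (id INT IDENTITY(1,1), cityname VARCHAR(50), used BIT NOT NULL DEFAULT 0);

    INSERT INTO #City (cityname)
    SELECT City FROM dbo.Customer;

    DECLARE @CustomerID INT, @City VARCHAR(50), @PickID INT, @PickCity VARCHAR(50);

    DECLARE cur CURSOR LOCAL STATIC FOR
        SELECT CustomerID, City FROM dbo.Customer;

    OPEN cur;
    FETCH NEXT FROM cur INTO @CustomerID, @City;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Random unused city that differs from the row's current value.
        SELECT TOP (1) @PickID = id, @PickCity = cityname
        FROM #City
        WHERE used = 0 AND cityname <> @City
        ORDER BY NEWID();

        UPDATE dbo.Customer SET City = @PickCity WHERE CustomerID = @CustomerID;
        UPDATE #City SET used = 1 WHERE id = @PickID;

        FETCH NEXT FROM cur INTO @CustomerID, @City;
    END;

    CLOSE cur;
    DEALLOCATE cur;
    DROP TABLE #City;

    In the same spirit for the digit columns, a one-statement version that swaps a five-digit zip for a random five-digit string, keeping length and format (the Zip column is hypothetical):

    UPDATE dbo.Customer
    SET Zip = RIGHT('00000' + CAST(ABS(CHECKSUM(NEWID()) % 100000) AS VARCHAR(5)), 5);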

  • The statistical distribution is a good point, and one I hadn't thought of. Any obfuscation technique should be aware of how that is set up and not disturb it.

  • When the data is public domain there is no problem with using the production data.

    When the data is sensitive or 'protected' it can be dummied or 'masked'.

    One interesting idea: in the development cycle for new applications or enhancements, there are numerous tests run to validate each of the functions as they are developed. This process leaves a series of data artifacts in the development database that might be useful for demonstration, IF YOU TELL THE DEVELOPERS OR TESTERS UP FRONT. If you do not tell the developers that the data they are entering will be used for demonstration purposes, you might get the shock of your life searching the data or displaying some detail data in the development database.

    One of the other interesting things about this type of data is that, in development, parts of transactions work and other parts fail, and the resulting data makes a very good test case for the rest of the software, as well as for process analysis of future difficulties.

    But you really have to be careful.

    Have a great weekend!

    Not all gray hairs are Dinosaurs!

  • All, just wanted everyone to know that the Camouflage product mentioned in the SQL Server Central blurb for 3/7/08 is terribly expensive, or at least they tailor their quotes to the size of the company. We are a large company, and when we asked for contact from a Camouflage rep, we got a call from a lady who had obviously done her research; she knew quite a bit about our company and quoted us $50,000 for a license to use their product. Too bad; we would be better off taking an intermediate-level programmer and letting that person code for a year to develop something that would do the job. It isn't rocket science, after all...

  • We use a "demo" version of a database to show potential clients.

    I manually changed the most sensitive data in the demo copy, which is isolated to a couple of tables. Then I use RedGate's Data Compare to synchronize everything but the sensitive tables. For the sensitive tables, I don't synchronize differences, but I do sync new entries and then manually scramble them in the demo database.

    This works in our case because the amount of data I have to manipulate manually is rather small. The result is a database that has no real personal information, but all of the numeric data and accompanying calculations match the production database.

  • We rolled our own data obfuscator program. Whenever the prod DB was restored to a new environment, we used a function to alter letters and numbers of sensitive data fields to something else; think back to when we were kids and created "secret codes". We thought we were pretty clever! 😎 (A sketch of that kind of substitution follows this post.) The flaw in our process was that the reporting DB (log shipped from prod) was located on the DEV server, where all the developers had sysadmin rights. Grrr! In a perfect world we would have kept all production data (reporting data included) in the ultra-secure zone.
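
    A hedged sketch of that kind of fixed substitution, assuming SQL Server 2017 or later for TRANSLATE; the function name and scrambled alphabet are made up, and this is deliberately weak protection, exactly like a kid's secret code:

    CREATE FUNCTION dbo.ObfuscateText (@value VARCHAR(200))
    RETURNS VARCHAR(200)
    AS
    BEGIN
        -- Fixed letter/digit substitution. Deterministic, so the same input
        -- always maps to the same output and joins still line up, but anyone
        -- with the mapping (or a frequency count) can reverse it.
        RETURN TRANSLATE(
            UPPER(@value),
            'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
            'QWERTYUIOPASDFGHJKLZXCVBNM9876543210');
    END;

    -- SELECT dbo.ObfuscateText('John Smith');  -- returns 'PGIF LDOZI'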
