Masking Data

  • Honestly, I hate healthcare IT, specifically healthcare data. The problem is that you have a lot of disparate vendors with proprietary and inconsistent data, who out of necessity must exchange data in proprietary and inconsistent ways. I know most Americans don't like the idea of national healthcare IDs, but they have no idea what really happens to their healthcare records as a result of mismatching demographics; it's essentially a sausage factory the way it all works today.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • You are not allowed to use production data if the data resides within the EU. There are exceptions to this but they are very limited in scope.

Obfuscated or generated data tends to be employed.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • Eric M Russell (10/28/2015)


    Honestly, I hate healthcare IT, specifically healthcare data. The problem is that you have a lot of disparate vendors with proprietary and inconsistent data, who out of necessity must exchange data in proprietary and inconsistent ways. I know most Americans don't like the idea of national healthcare IDs, but they have no idea what really happens to their healthcare records as a result of mismatching demographics; it's essentially a sausage factory the way it all works today.

    Yeah, and the standards that do exist (ask anybody who works with healthcare data IT about 835/837 data exchange standards) are vague and inconsistently implemented. Protecting confidentiality within the confines of the law and the tools we have takes a lot of time, and the cost for all those hours eventually ends up with the consumer (which is all of us). I don't pretend to have the one "right" answer but there are always tradeoffs. I'm probably preaching to the choir here, but anybody who tells you that you can have privacy and security and convenience with negligible cost is trying to sell something.

  • Here's some 'light reading' for anyone who thinks that simply obfuscating the data is the solution to the problem: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006

    We obfuscate production datasets for use by developers but also remove edge cases which would be easy to reverse-engineer, trading some loss of accuracy for better privacy protection.

    Like some earlier posters, we looked at commercial products which cost an arm and a leg, and settled on this one, which we found to be cost effective (free for small amounts of data but not unreasonably expensive for large) http://www.dataveil.com/

  • I work in Public Safety (i.e., law enforcement) and we needed to be able to do support work for some large customers where the actual size of the data AND the variety of data values (particularly bizarre typos, unexpected data values, and copied/pasted control characters, etc.) would often cause problems our support team couldn't reproduce on the "test system".

    The customers are extremely concerned about allowing any of their data out of their control. Any person who can even look at the data must be deep backgrounded and vetted. Thus, even sending an encrypted copy was verboten - the data had to be manipulated in some way to prevent even trusted vendor personnel from seeing (or reconstructing) "real" data.

    To solve this problem, I wrote an "anonymizer", which works as follows:

    1. It creates multiple Catalogs - one for each Production Catalog, where the new Catalogs are named "Anon_<prodcatname>", plus one for the anonymizer itself.

    2. It generates a set of random transform tables (new set of values every time it runs). It creates one set for each type of data (business types - not SQL data types - i.e., names, addresses, phones, id numbers, ages, DOBs, etc. - there are about 65 transform tables). For some data types (IDs, SSNs, DLNs, for example), it just generates a random or constant value or string of the same length as the source value it replaces. However, for many data types (like names), it "shuffles" the field values. This keeps those odd and unexpected values in many of those fields, although usually in a different record than in the "real" data.

    One of the key things was that the transforms maintain logical consistency. For example, if first name "John" became "Frank" in one record, "John" would also become "Frank" every other place it occurred. This meant that the application would appear consistent - if you searched for "Frank" in one place, you would see the same result set if you searched or viewed elsewhere.
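The consistent-mapping idea described above can be sketched roughly as follows (a hypothetical illustration, not the author's actual code): shuffle the distinct values of a column once into a transform table, then apply that one table everywhere the value appears, so "John" becomes the same replacement in every record.

```python
import random

def build_shuffle_transform(values):
    """Build one consistent value -> value mapping by shuffling the
    distinct values of a column, so every occurrence of a source
    value gets the same replacement wherever it appears."""
    distinct = sorted(set(values))
    shuffled = distinct[:]
    random.shuffle(shuffled)
    return dict(zip(distinct, shuffled))

# Hypothetical first-name column; odd or misspelled values survive
# the shuffle, just usually attached to a different record.
first_names = ["John", "Mary", "John", "Ann"]
transform = build_shuffle_transform(first_names)
anonymized = [transform[n] for n in first_names]
```

Because the transform is a single dictionary built up front, a search for the replacement name returns the same result set everywhere, which is exactly the application-consistency property described above.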

    Ages and DOB were particularly tricky, as "juvenile" and "adult" records have associated application business logic, so these transforms did some magic ensuring juveniles remained juveniles and adults remained adults, but with randomized ages and DOBs within ranges.
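The juvenile/adult constraint might be sketched like this, assuming a hypothetical 18-year cutoff and a simplified 365-day year (both assumptions are mine, not stated in the post):

```python
import random
from datetime import date, timedelta

ADULT_AGE = 18  # assumed juvenile/adult cutoff, for illustration only

def randomize_dob(dob, today=None):
    """Replace a date of birth with a random one that keeps the
    subject on the same side of the juvenile/adult boundary,
    using a simple 365-day year consistently throughout."""
    today = today or date.today()
    age = (today - dob).days // 365
    if age < ADULT_AGE:
        lo, hi = 0, ADULT_AGE - 1   # must remain a juvenile
    else:
        lo, hi = ADULT_AGE, 90      # must remain an adult
    new_age = random.randint(lo, hi)
    # Random day within the chosen age year.
    return today - timedelta(days=new_age * 365 + random.randint(0, 364))
```

The real logic would also have to respect any age-dependent business rules beyond the single cutoff, but the shape is the same: classify first, then randomize within the class.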

    3. I built a "Table Manifest" that explicitly identified the Tables to be processed - or excluded, and a "Column Manifest" that identified the particular transform to be applied to each column. These were built with a LOT of manual analysis and review.

    4. When the process runs, it dynamically determines the appropriate order of the tables based on all the foreign key constraints, then "copies" each table from the Production catalog into the corresponding anonymizer catalog and applies the designated transforms.
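Determining a safe table order from the foreign keys is a topological sort: every parent table must be copied before the children that reference it. A minimal sketch, assuming the FK constraints are available as (child, parent) pairs:

```python
from collections import defaultdict, deque

def fk_order(tables, fks):
    """Topologically sort tables so every parent precedes its
    children; fks is a list of (child_table, parent_table) pairs."""
    deps = defaultdict(set)      # table -> parents it still waits on
    children = defaultdict(set)  # parent -> tables referencing it
    for child, parent in fks:
        deps[child].add(parent)
        children[parent].add(child)
    ready = deque(t for t in tables if not deps[t])
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for c in children[t]:
            deps[c].discard(t)
            if not deps[c]:
                ready.append(c)
    if len(order) != len(tables):
        raise ValueError("circular foreign key dependency")
    return order
```

In SQL Server the (child, parent) pairs could be read from the `sys.foreign_keys` catalog view; self-referencing and circular FKs would need special handling that this sketch just rejects.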

    It has to run all the way through in a single pass so that the transforms are consistent throughout all the tables, although it can be stopped and resumed at any point. There is also a "restart" mode, which drops every table in all the Anon catalogs and then begins after generating a new set of transforms.

    5. At the end, I wrote some analysis routines that did comparisons between the source tables and the anonymized tables, looking for any anomalies (too many values still matching, or too many nulls in output tables, etc.).

    This included reviews by customer personnel to make sure no data was at risk.
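A post-run anomaly check of the kind described might compare each column before and after, flagging transforms that left too many values unchanged or produced too many NULLs (hypothetical helper names; thresholds would be tuned per column type):

```python
def residual_match_ratio(source_col, anon_col):
    """Fraction of rows where the anonymized value still equals the
    source value; a high ratio flags a transform that did too little."""
    same = sum(1 for s, a in zip(source_col, anon_col) if s == a)
    return same / len(source_col)

def null_ratio(col):
    """Fraction of NULLs in a column; a spike flags a broken transform."""
    return sum(1 for v in col if v is None) / len(col)
```

Run over every (source, anonymized) column pair in the manifest, these two ratios alone catch most of the "transform silently did nothing" failure modes.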

    6. Finally, the Anon... catalogs were backed up and shipped to our support group - minus the catalog that contained the transforms that had been used, so the anonymization could not be reverse-engineered.

    (Yes - theoretically if you could figure out one case where "Frank" was "John", then you would know that *some* other "Frank" records were also "John", but I also used lossy transforms, so "Joe" might also have transformed to "Frank", so you couldn't actually be sure that all the other "Frank" records were really "John".)
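The lossy, many-to-one idea can be sketched as drawing each replacement independently (with repetition) from a fixed pool, so the mapping has no unique inverse (hypothetical illustration):

```python
import random

def lossy_transform(distinct_values, replacement_pool):
    """Draw each replacement independently from a fixed pool: two
    different source values can land on the same replacement, so
    knowing one (source, replacement) pair does not let you invert
    the mapping for the other records."""
    return {v: random.choice(replacement_pool) for v in distinct_values}

# Three sources into a pool of two guarantees at least one collision.
mapping = lossy_transform({"John", "Joe", "Jim"}, ["Frank", "Bill"])
```

This is the property exploited above: even if one "Frank" is known to have been "John", other "Frank" records may originally have been "Joe".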

    This seems to have succeeded well. The customer is confident we haven't exposed any confidential data and our test team is now able to run in-depth testing scenarios on millions of rows of "real" data.

    The entire process only takes about 2-1/2 hours on around 200 million rows of data, and can be run without taking the production server offline, and we can now do a new run and extract whenever needed.

    Some tech notes:

    * The table "copy" copies everything except "code" - i.e., no Triggers. It creates an exact copy, including all constraints, indexes, keys, etc.

    * The real challenge wasn't the coding of the scripts/procedures, it was coming up with the Table and Column manifests and making sure that all confidential data was properly identified and configured for transformations.

    The Table Manifest explicitly listed 920 tables and whether each table was to be excluded (not copied), truncated (copy, but only as an empty shell), or processed (per the Column Manifest for that Table).

    The Column Manifest contains 8494 explicit transforms - specifying for each column whether it should be cleared (set null), set to a constant value, left intact, or transformed per a specific transform type.

    These were built with a lot of hand-coded queries and then a lot of Excel spreadsheet manipulations to filter on types, names, etc. This took the most effort - probably around 50 hours to get the manifests correct. If you had a really good on-line data dictionary, that should be a lot easier, but this application has NO data dictionary on line, so this required way too much manual work and review.

    * When the anonymizer ran, the first thing it did was completely analyze the consistency between the manifests and the real catalogs/tables. If there were any Production tables that weren't listed in the Table Manifest, those had to be fixed first. If there were any discrepancies between the Table and Column Manifests, those prevented further processing, and so on...
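That pre-run consistency check can be sketched as simple set differences between the manifests and the real schema (hypothetical function and data shapes):

```python
def check_manifests(prod_tables, table_manifest, column_manifest):
    """Cross-check the manifests against the real schema: every
    production table must be listed in the Table Manifest, and every
    Column Manifest entry must reference a listed table.  Any
    discrepancy returned here would block the run."""
    listed = set(table_manifest)
    unlisted_tables = set(prod_tables) - listed
    orphan_columns = {t for t, _col in column_manifest if t not in listed}
    return unlisted_tables, orphan_columns
```

Failing fast here is what makes the manifests a safety net: a newly added production table cannot silently slip through untransformed.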

    * Using the "Anon_..." catalogs avoided a lot of problems usually encountered in these kinds of jobs. We could run and re-run and re-run everything without impacting the Production catalogs. All the code/transform tables, etc., resided in one Anon_... catalog.

    * The explicit Table and Column manifests, plus the Anon... catalogs meant that we didn't risk leaving something behind - only those tables explicitly specified to be copied and transformed were brought over.

    * I created some simple scripts that the test team could run that took the exported Anon... catalogs and restored them with the original names, so they could easily restore these in their test system.

    * I also wrote some scripts that created dummy "attachments" (i.e., blob data) as the application used external file storage for these objects (photos, PDFs, documents, etc.). After restoring the catalogs in the test system, they just ran these to create the missing linked objects.

    (Before anyone asks for a copy of the utility, unfortunately it can only be applied on databases for this specific application. The concepts above could be applied anywhere, but not the actual code.)

  • Gary Varga (10/28/2015)


    You are not allowed to use production data if the data resides within the EU. There are exceptions to this but they are very limited in scope.

    Obfuscated or generated data tends to be employed.

    It doesn't matter where the data resides. If you collect personally identifiable data in the EU (for example, if it is typed onto screens on laptops, pads, or desktops in the EU, even if the website collecting the data is elsewhere), that data is covered. If the data resides somewhere where it isn't protected, you are breaking the law if you collect it in the EU.

    There are two fundamental exceptions. The first is that data can be used if it is not possible to identify the person it refers to from that data, even when combined with any other data that might, with reasonable certainty, come into the possession of whoever has access to the protected data. The second is that the person who is the subject of the data has given his informed consent to the data being made available to the particular person who gets it - and "informed consent" is definitely NOT provided by an explanation buried somewhere in contractual small print, and NOT by saying "the rules are on our website and we may change them and the only way you'll find out is by going and looking for them" (both those attempts at bypassing the law have been blown out of the water by the courts).

    It makes things difficult sometimes: if a system crashes and someone has to study core dumps and/or traces and logs to discover and fix the problem, that person has access to the data, so they had better be covered by the rules - the data subject must have given his informed consent for the data to be made available under those circumstances. When I was responsible for ensuring conformance with this legislation at Neos, I used to worry a lot about the situation with our European customers.

    There are some exceptions also for use for national security purposes and for the prevention or detection of serious crime and the way European national governments have interpreted the EU level exceptions varies widely from country to country. In the UK and some other countries those exceptions are far from limited in scope; in some other countries they are rather narrower. But even in the UK, the standard of protection of personally identifiable data is far higher than in the USA.

    Tom

  • TomThomson (12/1/2015)


    Gary Varga (10/28/2015)


    You are not allowed to use production data if the data resides within the EU. There are exceptions to this but they are very limited in scope.

    Obfuscated or generated data tends to be employed.

    It doesn't matter where the data resides. If you collect personally identifiable data in the EU (for example, if it is typed onto screens on laptops, pads, or desktops in the EU, even if the website collecting the data is elsewhere), that data is covered. If the data resides somewhere where it isn't protected, you are breaking the law if you collect it in the EU.

    I stand corrected (I knew this but didn't say it).

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • I've used a software tool from Grid Tools called Datamaker that worked quite well.

    I've used this product on data sets between 20,000 and 850,000,000 rows.

    I've also seen a demo of a data masking tool from Oracle that appeared to work well, but I don't have any hands-on experience with that product.
