
The Challenges of Being Safe
Posted Tuesday, March 10, 2009 10:35 AM
Ten Centuries


Group: General Forum Members
Last Login: Friday, August 10, 2012 6:08 PM
Points: 1,156, Visits: 801
Yelena Varshal (3/10/2009)
My advice is to use legal Nondisclosure Agreements


I tend to agree with Yelena. If you don't trust them, they shouldn't be there anyway. Company data or intellectual property would fit this area. However, with customer personal data, like SSN, DOB, CCN, and other similar personal and credit data, I would make the effort to obfuscate it or generate random substitutes for testing, as only those who actually need to know to conduct business should ever see these, and then only when they absolutely must. If there is a way to automate it so they do not see it, I go for those methods.
Post #672606
Posted Tuesday, March 10, 2009 1:39 PM
Forum Newbie


Group: General Forum Members
Last Login: Friday, March 20, 2009 7:12 AM
Points: 2, Visits: 19
Steve
Good article.
Disclosure up-front: I work for one of the companies you mentioned above, which makes data masking software. You make many good points and I thought that I might add a few comments on some topics.

One thing we recommend is to make sure that production data is always kept safe, which means that you keep control of where it lives and how it's used. I saw a post recently where someone needed to send data to a business partner and wanted to obfuscate it. I think that is a great thing to do and was glad to see someone asking for advice.

In many cases, this seems more a moral responsibility to protect your clients than a legal requirement. In times of tight budgets this is sometimes pointed out. However, anyone trying to assess the costs and benefits of masking test data can justifiably weigh in the public relations damage a data escape would entail.

However, what about your test and development environments? I've seen people obfuscate data here, but not always. In fact, not usually. This is despite the fact that you might turn over developers often, expose that data to testers or other people who might not normally have access, and the fact that these environments almost never have auditing enabled.

The case can be made that if the data visibility is restricted in a production database then the visibility of that same data must be similarly protected in test databases where (usually) far more people have access. Realistically, the only way to do this and still have the wider access required by dev and test teams is to mask the data.

We need to mimic production environments and data, but there are quite a few challenges with doing this in a safe and secure manner. Just having scripts to obfuscate data is a challenge in and of itself.

As you say, the mechanics of the process are one issue. Other issues, such as finding reasonable substitution data and managing the transfer of this data into the database, are also present. One technique, which avoids the need to find substitution data, is to shuffle. But this is not universally applicable - what if the table is small, or contains information such as email addresses which are useful by themselves?
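Shuffling can be sketched in a few lines. This is a minimal, generic illustration in Python (not any particular masking product's implementation), operating on rows held as dictionaries:

```python
import random

def shuffle_column(rows, column):
    """Mask a column by shuffling its values among rows.

    Values stay realistic (they come from real data) but are
    decoupled from the rows they originally belonged to. As noted
    above, this is unsuitable for small tables or for values that
    identify someone by themselves, such as email addresses.
    """
    values = [row[column] for row in rows]
    random.shuffle(values)
    for row, value in zip(rows, values):
        row[column] = value
    return rows
```

In a real database this would be an UPDATE driven by a reordered key mapping rather than an in-memory list, but the principle is the same.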

But building those scripts is both hard and time consuming. How do you decide what to obfuscate? What values do you use?

In our experience this analysis phase is what takes the time and needs the buy-in of the application owners. When deciding what fields to mask, trade-offs often need to be made. For example, it might be decided that it is not necessary to mask salary amounts since that would have serious “knock on” effects on other data items such as calculated departmental totals. The justification for this may be that since all other personally identifying information associated with this value has been masked, then there is minimal gain and much extra complexity involved with masking that particular item.

What about ensuring that data matches up correctly? Can you really determine if there is an issue with some calculation or relationship if you have random data?

This sort of data synchronization issue is common. In fact it is rare to find a database that does not require it. There are really three types of synchronization: internally within the same row (Row-Internal), between rows in the same table (Table-Internal) and between rows in different tables (Table-To-Table). These are all handled differently and sometimes you have to use all three techniques on a single table.
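The three synchronization types might look like this in a toy example. The customers/orders tables, the field names and the derived-email rule are all hypothetical; the point is only how the three repairs differ:

```python
def synchronize(customers, orders):
    """Sketch of the three synchronization types after masking.

    Row-Internal:   email is rebuilt from the masked name in the same row.
    Table-Internal: every row for the same customer id gets the same name.
    Table-To-Table: the orders table copies the masked name from customers.
    """
    seen = {}
    for c in customers:
        # Table-Internal: reuse the first masked name seen for this id.
        c["name"] = seen.setdefault(c["id"], c["name"])
        # Row-Internal: derive email from the (masked) name in this row.
        c["email"] = c["name"].replace(" ", ".").lower() + "@test.example.com"
    names = {c["id"]: c["name"] for c in customers}
    for o in orders:
        # Table-To-Table: propagate the masked customer name.
        o["customer_name"] = names[o["customer_id"]]
    return customers, orders
```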

After all, people will often have favorite accounts that they know well, and understand what the data should look like. A developer may expect certain order details or address information, and use that as a benchmark when developing new code. If the data is random every time his environment is refreshed, does that slow his productivity? How do you test things like URLs and emails if data is randomized?

There are a variety of approaches to this sort of "consistent masking". We have found that the easiest technique to implement is to mask the data with random values and then have a set of rules (or scripts) which go back through the database afterward, resetting the specific cases the dev, test and training teams use to constant, known values.
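A two-pass sketch of that idea, with a hypothetical KNOWN_ACCOUNTS rule table standing in for the post-masking rules (an illustration, not any vendor's actual rule engine):

```python
import random
import string

# Hypothetical "pinned" accounts the dev/test/training teams rely on;
# after the random pass, these rows are reset to fixed, known values.
KNOWN_ACCOUNTS = {1001: {"name": "Test Customer One"}}

def random_string(n):
    return "".join(random.choice(string.ascii_uppercase) for _ in range(n))

def mask_then_pin(rows, key_field, fields):
    """First pass: mask every row with random values.
    Second pass (the 'rules'): pin well-known accounts back to
    constant values so testers always find the data they expect."""
    for row in rows:
        for f in fields:
            row[f] = random_string(len(str(row[f])))
    for row in rows:
        pinned = KNOWN_ACCOUNTS.get(row[key_field])
        if pinned:
            row.update(pinned)
    return rows
```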

Using some type of data obfuscation or randomization is a great way to help ensure that your production data is kept safe, but it definitely makes for a much more complex environment, and likely, more headaches for DBAs and developers.

Very good point. Whether the masking is done with a set of scripts or a purchased tool it is well worth implementing an automated process. You really don’t want the masking of the database to be an “all hands to the pump” type operation every time a test database is masked. This will cost much more in the long term!
Post #672740
Posted Tuesday, March 10, 2009 3:32 PM
Hall of Fame


Group: General Forum Members
Last Login: Saturday, August 9, 2014 9:17 AM
Points: 3,433, Visits: 14,427
My advice is to use legal Nondisclosure Agreements


Nondisclosure is not enough. There are laws in place for banks and pharma, and I think these laws need to be implemented in any place personal data is at risk, because it was not a developer who left the VA with 26 million people's data; it was a brain-dead, unskilled data analyst. Personal data needs to be accessed only as needed. That is why data theft at banks comes not from skilled employees but from low-level employees doing transport, or from tellers.

Personal data should not leave any facility without escort, and knowledge of set algebra must be required for development, because if you know set algebra, teenagers will not turn up dead and dead people will not turn up alive; safety and integrity are interdependent.


Kind regards,
Gift Peddie
Post #672823
Posted Tuesday, March 10, 2009 6:34 PM
SSC-Enthusiastic


Group: General Forum Members
Last Login: Friday, October 5, 2012 3:30 AM
Points: 138, Visits: 351
As pointed out, randomizing the whole lot is rather pointless because of data consistency and inter-relationships that need to be real(istic). Especially when troubleshooting, it is essential to have the exact data that causes the issue (you don't know beforehand which piece of data is the culprit, so by obfuscation you run a real risk that you also obfuscate the very problem you're trying to track down).

To tell you the truth: I've never actually done obfuscation of test-data and always used a mirror of the production-database in development. However I have been thinking about it on several occasions and there are a few distinct situations for use of data that each have different needs for levels of realism:
- Initial development of new/modified functionality -> can be done with a small set of data created on the fly by the developer in his private development-database (no need for mirroring production-data).
- Automated unit-testing -> has to be done with a known set of data with known characteristics (covering all cases -> i.e. code-paths), and should not be changed over time other than in conjunction with new test-cases (no need for mirroring production-data).
- Performance-testing -> has to be done with a realistic amount of data with realistic relationships/value-distribution (only feasible solution is a mirror of production-data, some obfuscation possible).
- Acceptance-testing -> real users are going to use the data and they need to see real(istic) data (only feasible solution is a mirror of production-data, some obfuscation possible).
- Troubleshooting an issue in the production-system -> only a true mirror of the production-data is useful (no obfuscation because you run the risk of obfuscating the actual issue).

By limiting obfuscation to the "hard" personally identifiable data (names, addresses and public identification numbers such as social security numbers, tax numbers and credit card numbers) you can still use it for testing with realistic outcomes. No business-rules will break when a name is actually a random string (within certain size-constraints), since your application should accept pretty much anything in a name-field anyway. In an address only the street-name needs to be obfuscated (together with the randomisation of the person/company name, a malicious user has no hope of using a random name in a random street of a known city).

Credit-card numbers without a name become mostly unusable, but for extra measure you could replace them with known test-numbers, which banks can provide specifically for this purpose (they pass internal validation-checks but won't be usable to make any purchases).
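Where bank-issued test numbers are not available, a stand-in generator can at least produce numbers that pass the usual Luhn checksum while carrying a recognisable test prefix. A sketch (the 999900 prefix is made up for illustration; real test ranges should come from the banks, as noted above):

```python
import random

def luhn_check_digit(partial):
    """Compute the Luhn check digit for a partial card number."""
    total = 0
    for i, d in enumerate(int(c) for c in reversed(partial)):
        if i % 2 == 0:          # doubled positions, counting from the
            d *= 2              # slot just left of the check digit
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def luhn_valid(number):
    """Standard Luhn validation of a complete number."""
    total = 0
    for i, d in enumerate(int(c) for c in reversed(number)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def test_card_number(prefix="999900", length=16):
    """Generate a Luhn-valid number with a recognisable test prefix."""
    body = prefix + "".join(random.choice("0123456789")
                            for _ in range(length - len(prefix) - 1))
    return body + luhn_check_digit(body)
```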

Email-addresses are often used explicitly by applications to actually send messages (which you DO NOT want to happen from a non-production environment, ever) and could also be used to identify a real person. So these must be obfuscated, but cannot be merely randomised; the best solution for this is to have a set of addresses on a dedicated (test-)domain and distribute those addresses (randomly) in the database.
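One way to distribute such addresses is to hash the original address onto a fixed pool of mailboxes on the test domain, so the same source address always maps to the same test mailbox across refreshes. The domain name and pool size below are assumptions for illustration:

```python
import hashlib

TEST_DOMAIN = "test.example.com"  # assumed dedicated test domain
POOL_SIZE = 50                    # mailboxes provisioned on that domain

def mask_email(original):
    """Deterministically map an email address to one of POOL_SIZE
    known test mailboxes. Case-insensitive, so 'A@B.com' and
    'a@b.com' land on the same test address."""
    h = int(hashlib.sha256(original.lower().encode()).hexdigest(), 16)
    return "testuser{:02d}@{}".format(h % POOL_SIZE, TEST_DOMAIN)
```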

Now about using external ids for internal PK/FK: BAD IDEA!!!
It is my almost religious belief that a database MUST provide and totally control the mechanisms for referential integrity; any external identification key should be a mere piece of data rather than structural to the database. In my own designs this goes even as far as product-codes and employee-numbers: if any human-entered key is required, I will NOT use it as a Foreign Key, but will provide an internal (hidden) auto-number (or even a GUID in my latest designs). Any information entered by a human is always prone to mistakes, which must be easily corrected -> a single field on the primary object, which is then joined and queried for output. The situation is even worse when the external party that created the key (tax-office, internet provider) chooses to update/refactor their system and changes the keys of existing data...
So if your database has any external keys (especially those that could be used for personal identification if known to a malicious person), each should be considered for replacement by either a pure random number or, if the application tests validity in some way, a set of test-numbers that satisfy the validity-test (possibly even hardcoded to bypass any validity-test if the key is recognisable as a test-number by a special prefix or certain checksum).

So a script/tool only needs to take the following steps to obfuscate a production-database:
- randomize name-fields: people/company names, street-names
- replace external identification numbers (SSN, CC, etc) with valid test-numbers
- replace all emails with known test-email accounts on the internal mail-server

A good idea for randomizing names is to use an algorithm that still renders human-readable names. I would use a (fairly large) dictionary of common names as a source.
And while you're at it: use an algorithm that feeds the primary key (autonumber or whatever you use) into the rendering, so the same key always produces the same pseudo-random name. This way, in the next iterations your testers will find the same "name" on the same record again.
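That deterministic rendering is easy to get by seeding a random generator with the primary key. A sketch with toy dictionaries (a real run would want far larger name lists, as suggested above):

```python
import random

# Small illustrative dictionaries; a real run would use a much
# larger set of common names.
FIRST = ["Alice", "Bob", "Carol", "David", "Erin", "Frank"]
LAST = ["Smith", "Jones", "Taylor", "Brown", "Wilson", "Clark"]

def pseudo_name(primary_key):
    """Render a primary key into a stable, human-readable name.

    The same key always yields the same name, so testers find the
    same 'person' on the same record after every refresh."""
    rng = random.Random(primary_key)
    return "{} {}".format(rng.choice(FIRST), rng.choice(LAST))
```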

One caveat I experienced and do not know of a satisfactory solution:
Comment text-blobs. Often those comments contain crucial info about an account that is required to understand the data (for example, explanations of why this particular data is expected to violate a certain rule, and therefore important to keep for testers). But staff often disclose names and even cc-info in those comments, which is of course not good practice, but a real-world fact. It is impossible (eh, correction: unfeasible) to automatically crawl this data and obfuscate the unwanted info.

But in my experience just plainly trusting the developers/testers is the easiest solution (I've always worked in small companies).
Post #672908
Posted Sunday, March 15, 2009 12:30 PM
Grasshopper


Group: General Forum Members
Last Login: Monday, July 14, 2014 12:03 PM
Points: 13, Visits: 76
I believe the tool you are looking for is FileAid by CompuWare.
Post #676153
Posted Sunday, March 15, 2009 1:09 PM
Hall of Fame


Group: General Forum Members
Last Login: Saturday, August 9, 2014 9:17 AM
Points: 3,433, Visits: 14,427
Statements like this are the reason we have to start a movement requiring set algebra for personal data, so that only people who know SQL and what is needed to resolve very complex issues are approved to handle such data.

If you believe a program can do this job there is a bridge in Brooklyn with your name on it.

http://www.compuware.com/products/fileaid/datasolutions.htm



Kind regards,
Gift Peddie
Post #676161
Posted Monday, March 16, 2009 9:16 AM


SSC-Dedicated


Group: Administrators
Last Login: Today @ 4:39 PM
Points: 33,155, Visits: 15,291
I think a program could do the job, but not by just installing it and clicking "next". A person who understands the data and relationships should be able to map fields and specify rules, and then have it automated from that point on.






Follow me on Twitter: @way0utwest

Forum Etiquette: How to post data/code on a forum to get the best help
Post #676613
Posted Monday, March 16, 2009 9:32 AM
Hall of Fame


Group: General Forum Members
Last Login: Saturday, August 9, 2014 9:17 AM
Points: 3,433, Visits: 14,427
I think a program could do the job, but not by just installing it and clicking "next". A person who understands the data and relationships should be able to map fields and specify rules, and then have it automated from that point on.


That is ETL, which takes skills on both ends of the relational model and sometimes includes expensive tools like Informatica on non-SQL Server based databases, or a team of developers using SSIS. So it still requires skills and development.



Kind regards,
Gift Peddie
Post #676643
Posted Monday, March 16, 2009 9:52 AM
Forum Newbie


Group: General Forum Members
Last Login: Friday, March 20, 2009 7:12 AM
Points: 2, Visits: 19
Masking is not an ETL operation, it is an in-situ update process and does not need other relational databases to succeed. The knowledge required is an understanding of the relationships in the entity.
Post #676675
Posted Monday, March 16, 2009 10:07 AM
Hall of Fame


Group: General Forum Members
Last Login: Saturday, August 9, 2014 9:17 AM
Points: 3,433, Visits: 14,427
I am not talking about masking, which you could also do and currently do in most cases with encryption. I am talking about moving large amounts of personal data from different sources and making sure the final data is actually valid: teenagers should be alive and dead people should remain dead. That is ETL, and if your employees are not skilled you need Informatica. In most enterprise implementations Informatica costs a million dollars.

Then there is making sure that data cannot be moved without escort. And in the VA case, only encryption would have limited the loss, not masking.

it is an in-situ update process and does not need other relational databases to succeed.


The tool in the link below only supports some versions of SQL Server and some versions of Oracle, with a DB2 edition in development, so it is RDBMS-version dependent.

http://www.datamasker.com/


Kind regards,
Gift Peddie
Post #676684