Scrubbing Personally Identifiable Information

  • Comments posted to this topic are about the item Scrubbing Personally Identifiable Information

  • It's better to insert PII data as encrypted using whatever component or mechanism, regardless of where it is stored. Secured server or not. These days, there definitely isn't any good reason PII should be readable by those with direct access to the database. It should only be readable by the business users through the software they use to interact with those records.

    Now if the situation is that the users are using 3rd party software that your organization cannot alter, you need to contact the developers to provide the configurable means to choose which data can be encrypted. If that is not feasible on their part, then perhaps your organization needs to build their own system. If none of that is feasible, then perhaps they shouldn't be in business. Lawsuits are hell.

    All it takes is one person having a bad day to purposely sell PII to someone else. The goal is to make it hard for anyone to do that. Keeping unencrypted data in a secure environment is not good enough. People often wonder how hackers break into systems, but they don't realize that hackers don't have to break in. They are already there.

  • Very interesting and helpful approach. I have one suggestion: The Social Security Administration has set aside specific number ranges that will never be used as real SSNs (in general, starting in the 900s). It may be wise to restrict the dummy values to that range only. This case assumes that all the data is isolated from production data. If there is ever a possibility (not good practice, but may occur) that production data may be mixed in with obfuscated data - say test records for a master person index - SSNs that are not in the dummy range can cause trouble.
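
    A minimal Python sketch of that suggestion, assuming (as the post states) that area numbers in the 900s are never issued as real SSNs. The function name `dummy_ssn` and the fixed seed are hypothetical and exist only for illustration.

```python
import random

def dummy_ssn(rng: random.Random) -> str:
    """Return a fake SSN whose area number (first three digits) is 900-999."""
    area = rng.randint(900, 999)    # range the post says is never issued
    group = rng.randint(1, 99)      # avoid the all-zero group
    serial = rng.randint(1, 9999)   # avoid the all-zero serial
    return f"{area:03d}-{group:02d}-{serial:04d}"

rng = random.Random(42)  # fixed seed only so the example is repeatable
samples = [dummy_ssn(rng) for _ in range(5)]
print(samples)
```

    Keeping every dummy value inside one reserved range also makes it easy to spot a real SSN that has leaked into a scrubbed data set.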

  • If there is a number range that will never be used for SSNs, it may make a lot of sense to use it for scrubbed data. I wasn't suggesting that the sample views used in the demo should be used in production, so I wasn't overly particular with what they were returning. I was just trying to illustrate the concept, and I hope it's easy enough to apply that to whatever your particular situation is.

  • I'm in favor of encryption, but it doesn't replace obfuscation. A common scenario is needing to test/investigate against as close as possible to a production copy, without the risk of jobs sending emails to customers, third tier security allowing compromise of PII (or more), and all the rest.

    However you do obfuscation, from a security perspective the important part is to do it before you move the data, rather than after you land it in dev. From a PCI perspective, if you restore a production backup on a dev server, that server is now "in scope", even if you then immediately obfuscate and regardless of whether you're using encryption. To do it correctly you do it on a "secure" server, or you transform it in flight, but it has to be obfuscated when it hits the target or you've just made the mess bigger.

  • Scrubbing the data while it is still secure is very important and I addressed that in the overview, but I failed to say why. That was an oversight on my part, so thanks for putting the explanation in.

  • I think that while this process shows what can be done with SQL, I would question whether this should be done in SQL: String manipulation is not SQL's forte (neither from a performance nor a debugging perspective) and storing code in database tables is a no-no for me (Source control? Yes you can do it, but bcp'ing out the data or managing the data using a data-compare tool adds unnecessary complexity - after all, we want to control source, not data as well if we can avoid it). Using control tables to tie the whole lot together is just asking for something to break - for me, that's a real anti-pattern.

    I believe in simplicity: I would suggest a superior solution would be to create stored procedures for scrubbing, using a one-way hashing algorithm to scrub the data in each column to be obfuscated. Using this method guarantees that the scrubbed values in separate tables are always identical without incurring the overhead of looking up values in other tables.

    You can then call the stored procedures in the correct order from a single ScrubAll stored procedure if required.
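
    The idea of hashing each column so that the same input always yields the same scrubbed value can be sketched in Python as follows. This is an illustration rather than the poster's stored procedures: it uses a keyed hash (HMAC-SHA256) instead of a bare hash, which is an assumption of this sketch, because low-entropy values such as SSNs can otherwise be recovered by hashing every possible input. The names `SCRUB_KEY` and `scrub` are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; it would need to be kept away from dev/test users.
SCRUB_KEY = b"replace-with-a-secret-key"

def scrub(value: str, length: int = 12) -> str:
    """Map a value to a repeatable token using HMAC-SHA256."""
    digest = hmac.new(SCRUB_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]

# The same input yields the same token wherever it appears,
# so joins between scrubbed tables still line up without a lookup table.
customers = [{"ssn": "900-12-3456", "name": "A"}]
orders = [{"customer_ssn": "900-12-3456", "total": 10}]

scrubbed_customers = [{**c, "ssn": scrub(c["ssn"])} for c in customers]
scrubbed_orders = [
    {**o, "customer_ssn": scrub(o["customer_ssn"])} for o in orders
]
print(scrubbed_customers[0]["ssn"] == scrubbed_orders[0]["customer_ssn"])
```

    Truncating the digest keeps the tokens short, at the cost of a small chance of collisions on very large tables.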

  • DragonGod - Tuesday, January 6, 2015 5:50 AM

    It's better to insert PII data as encrypted using whatever component or mechanism, regardless of where it is stored. Secured server or not. These days, there definitely isn't any good reason PII should be readable by those with direct access to the database. It should only be readable by the business users through the software they use to interact with those records.

    Now if the situation is that the users are using 3rd party software that your organization cannot alter, you need to contact the developers to provide the configurable means to choose which data can be encrypted. If that is not feasible on their part, then perhaps your organization needs to build their own system. If none of that is feasible, then perhaps they shouldn't be in business. Lawsuits are hell.

    All it takes is one person having a bad day to purposely sell PII to someone else. The goal is to make it hard for anyone to do that. Keeping unencrypted data in a secure environment is not good enough. People often wonder how do hackers break into systems, but they don't realize that they don't have to break in. They are already there.

    However, a couple of things. First, you might not have a choice in the solution design - it might be a 3rd party app for instance. Second, obfuscating the data would potentially still need to be done, if you don't want or aren't allowed to see the PII in a dev/test environment - and that's a lot more of a problem if it's encrypted at the database level.

  • This is a good solution if you can have 'nonsense' data. Our testers/UAT object to that on the whole. We've gone in for randomly combining realistic data - having a set list of first names and surnames, towns and street types ('rd', 'st', 'ave', 'place') etc, and using cross joins to generate names/street names. Emails, in any system we always replace with an email at our own domain, just in case they get accidentally sent.

    The downside of both approaches is that they're not consistent - if you get a data update, then in the next run you get brand new PII for the same clients, and testers sometimes need to compare like-for-like. However, any consistency does leave you with the potential to get back to the original. Of course, you can keep any code or mapping data from the end users - testers or UAT business users, but not from the DEV team, which is fine if you're the owner of the data and providing it to a third party.
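
    The cross-join approach described in this post can be sketched in Python with `itertools.product`. The name lists, the `fake_person` helper, and the `example.com` domain are hypothetical stand-ins for the poster's own lists and domain.

```python
import itertools
import random

FIRST_NAMES = ["Alice", "Brian", "Chloe"]
SURNAMES = ["Adams", "Baker", "Clark"]
STREET_NAMES = ["Oak", "Mill", "Church"]
STREET_TYPES = ["Rd", "St", "Ave", "Place"]
TEST_DOMAIN = "example.com"  # hypothetical stand-in for "our own domain"

# Cross-join equivalents: every first name with every surname, and so on.
full_names = [f"{f} {s}" for f, s in itertools.product(FIRST_NAMES, SURNAMES)]
streets = [f"{n} {t}" for n, t in itertools.product(STREET_NAMES, STREET_TYPES)]

def fake_person(client_id: int, rng: random.Random) -> dict:
    """Build a realistic-looking but made-up record for one client."""
    return {
        "client_id": client_id,
        "name": rng.choice(full_names),
        "street": f"{rng.randint(1, 199)} {rng.choice(streets)}",
        # Mail goes to a domain we control, in case it is sent by accident.
        "email": f"client{client_id}@{TEST_DOMAIN}",
    }

rng = random.Random()  # unseeded: every run produces different values
people = [fake_person(cid, rng) for cid in (101, 102, 103)]
for person in people:
    print(person)
```

    Because the generator is unseeded, each run assigns new names to the same clients, which is the inconsistency the post describes. Seeding it per client would make runs repeatable, with the reversibility trade-off the post also mentions.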

  • I think we are safer just encrypting the data in production.

  • One area where this can be complicated is where the business allocates resources based on geographical location (or indeed on any other sensitive or personal information).
    Typically this would cover such examples as allocating a specific salesperson, engineer or team to any customer in a postcode range. The problem here is that we need to ensure that after obfuscation each salesperson is still assigned correctly. To do this we could limit the obfuscation to the second part of the postcode, but we would then also need to leave town and county unobfuscated. It gets even worse if we cannot be guaranteed to have those held in specific columns, and addresses are very prone to that, particularly in legacy systems.
    Obviously a decision needs to be made as to the level of obfuscation required. I suspect that in most cases the postal district is probably acceptable (this typically would cover around 10000 people). I'm not aware as to whether US zipcodes or other countries' postal codes are hierarchical in the same manner as UK postcodes, so this may not be workable elsewhere.

    A further thing to be careful of is to ensure that we obfuscate any column where comments are added by users (including ones where that wasn't the designer's intent) as these could contain personal information, e.g. a comment could include such things as ..... please contact Mrs Jones on 0123 456 7890 to arrange access.
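
    Limiting obfuscation to the second part of a UK postcode could look roughly like the Python sketch below. It relies on the inward code always being the last three characters (a digit followed by two letters); the function name `obfuscate_postcode` is hypothetical, and the generated inward codes are not checked against real postcodes.

```python
import random
import string

def obfuscate_postcode(postcode: str, rng: random.Random) -> str:
    """Keep the outward code (e.g. 'SW1A') and replace the inward code."""
    compact = postcode.replace(" ", "").upper()
    outward = compact[:-3]  # the inward code is the last 3 characters
    inward = (
        f"{rng.randint(0, 9)}"
        f"{rng.choice(string.ascii_uppercase)}"
        f"{rng.choice(string.ascii_uppercase)}"
    )
    return f"{outward} {inward}"

rng = random.Random()
print(obfuscate_postcode("SW1A 1AA", rng))
print(obfuscate_postcode("m1 1ae", rng))
```

    Any allocation rule keyed on the postal district still resolves to the same salesperson or team, because the outward code is untouched.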

  • Iwas Bornready - Friday, January 13, 2017 9:26 AM

    I think we are safer just encrypting the data in production.

    Encrypting data in production and obfuscating sensitive data solve two different problems. Encrypting data in production is good practice and it keeps your data safe while it's in production. However, in some situations you can't bring production data down to a lower environment, even when it's encrypted: for example, personally identifiable info, credit card info, health care info, etc. Obfuscation removes/replaces the sensitive data that you can't bring down to lower environments.

  • An interesting solution and one I'll look into further. Here I have built something similar, but in SSIS, and it's controlled by a masking database that stores the lists of everything we mask and which method to use for each column being masked.

    We only restore production backups to a secure server, where we mask (scrub) the data using the package I developed. The package is entirely metadata driven, so in effect it is a generic process. Once masked, we back up those databases, and those backups are what we use to restore our non-production environments.

    It does take a long time to update all the rows in all the databases and tables, due to the nature of the data we hold and the sheer volume of it, but it's been effective. The hardest part is convincing the teams that use the data in non-production that they don't require real PII data to perform their own work.

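
    A metadata-driven masking process of the kind described in this post might be sketched in Python as below. This is not the poster's SSIS package: the rule table, the two masking functions, and the table and column names are all hypothetical, standing in for whatever the masking database holds.

```python
from typing import Callable, Dict, List, Tuple

def mask_fixed(_: str) -> str:
    """Replace the whole value with a constant."""
    return "XXXX"

def mask_keep_last4(value: str) -> str:
    """Star out everything except the last four characters."""
    return "*" * max(len(value) - 4, 0) + value[-4:]

# Hypothetical "masking database": (table, column) -> masking method.
MASKING_RULES: Dict[Tuple[str, str], Callable[[str], str]] = {
    ("customer", "surname"): mask_fixed,
    ("customer", "phone"): mask_keep_last4,
}

def mask_rows(table: str, rows: List[dict]) -> List[dict]:
    """Apply whichever rule is registered for each column; leave the rest."""
    masked = []
    for row in rows:
        new_row = dict(row)
        for column, value in row.items():
            rule = MASKING_RULES.get((table, column))
            if rule is not None and value is not None:
                new_row[column] = rule(str(value))
        masked.append(new_row)
    return masked

rows = [{"id": 1, "surname": "Jones", "phone": "01234567890"}]
masked = mask_rows("customer", rows)
print(masked)
```

    Adding a new column to mask then means adding a rule rather than changing the process, which is what makes the approach generic.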

  • crmitchell - Friday, January 13, 2017 10:16 AM

    One area where this can be complicated is where the business allocates resources based on geographical location (or indeed on any other sensitive or personal information).
    Typically this would cover such examples as allocating a specific salesperson, engineer or team to any customer in a postcode range. The problem here is that we need to ensure that after obfuscation each salesperson is still assigned correctly. To do this we could limit the obfuscation to the second part of the postcode, but we would then also need to leave town and county unobfuscated. It gets even worse if we cannot be guaranteed to have those held in specific columns, and addresses are very prone to that, particularly in legacy systems.
    Obviously a decision needs to be made as to the level of obfuscation required. I suspect that in most cases the postal district is probably acceptable (this typically would cover around 10000 people). I'm not aware as to whether US zipcodes or other countries' postal codes are hierarchical in the same manner as UK postcodes, so this may not be workable elsewhere.

    A further thing to be careful of is to ensure that we obfuscate any column where comments are added by users (including ones where that wasn't the designer's intent) as these could contain personal information, e.g. a comment could include such things as ..... please contact Mrs Jones on 0123 456 7890 to arrange access.

    You know, I hadn't even considered note fields as PII - thanks.
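
    As a rough illustration of why note fields are hard to scrub, the Python sketch below redacts phone numbers shaped like the one in the quoted example. The pattern and the `redact_phones` name are assumptions made for this sketch; it only catches that one number format and leaves the name in the note untouched, so replacing the whole column, as the quoted post suggests, is the safer choice.

```python
import re

# Matches digit runs shaped like the example "0123 456 7890":
# a leading 0, then 9 or 10 more digits, with optional spaces or hyphens.
PHONE_PATTERN = re.compile(r"\b0\d(?:[ -]?\d){8,9}\b")

def redact_phones(text: str) -> str:
    """Replace anything that looks like a phone number with a placeholder."""
    return PHONE_PATTERN.sub("[PHONE]", text)

note = "please contact Mrs Jones on 0123 456 7890 to arrange access"
print(redact_phones(note))
```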
