De-Identification of PHI Data

Question

De-Identification of PHI Data

ericwenger1

Hall of Fame

Points: 3436
More actions
July 2, 2017 at 7:59 am

#337913

The company that I work for deal with PHI Data and wanted to get other opinions on De-Indentifying this kind of data. With SQL Server 2016 Data Masking is available. We've come up with a home grown way of De-Identification but it involved change the PHI Data, not masking it which takes awhile to do considering the volume of data that we have. If I have to create another copy of our database, it takes me about two days to make this happen. Any other techniques is greatly appreciated. My opinion is that going forward we should utilize the Data Masking Functionality of SQL Server 2016.
November 30, 2021 at 3:53 pm

This was removed by the editor as SPAM

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1003873 More actions · Answer 1

Let me ask you the same question I've asked many others.... if the data contained information about your or your family, would you actually be happy that someone used datamasking instead of refactoring the actual data?

I wouldn't.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ericwenger1 Hall of Fame Points: 3436 More actions · Answer 2

I'm not sure it really makes alot of difference so long as you can't identify the individual. My main concern is the time it takes to refactor this data. Most likely we will stick with refactoring the data as long as the powers to be know that its not going to happen within an hour every time they want to refresh.

Ed Wagner SSC Guru Points: 287024 More actions · Answer 3

Jeff Moden - Sunday, July 2, 2017 12:49 PM

Let me ask you the same question I've asked many others.... if the data contained information about your or your family, would you actually be happy that someone used datamasking instead of refactoring the actual data?

I wouldn't.

[/quote]

I agree with Jeff and I'm sure most others would too - including you and the people who's data you have.

ericwenger1 - Sunday, July 2, 2017 7:59 AM
The company that I work for deal with PHI Data and wanted to get other opinions on De-Indentifying this kind of data. With SQL Server 2016 Data Masking is available. We've come up with a home grown way of De-Identification but it involved change the PHI Data, not masking it which takes awhile to do considering the volume of data that we have. If I have to create another copy of our database, it takes me about two days to make this happen. Any other techniques is greatly appreciated. My opinion is that going forward we should utilize the Data Masking Functionality of SQL Server 2016.

Since management probably has an idea of how much data is involved, I don't think it would be a stretch for them to accept that a rebuild can't happen in an hour.

You said that the volume of data that you have is making the anonymization take two days. Maybe we could help by coming up with more efficient ways of anonymizing it. What's the goal of anonymizing it? Is it to recreate test from production? Or something else?

We'd need is the DDL for the tables involved and the code you're using to anonymize it. An idea of the volume would help, as would any rules you must adhere to after it's anonymized. For example, someone can't have a blood type of W+ or 5.3.

We don't need any real data. I'm sure we can come up with a million-row data generation process if we know the rules for the data.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

Jeff Moden SSC Guru Points: 1003873 More actions · Answer 4

ericwenger1 - Sunday, July 2, 2017 1:08 PM
I'm not sure it really makes alot of difference so long as you can't identify the individual. My main concern is the time it takes to refactor this data. Most likely we will stick with refactoring the data as long as the powers to be know that its not going to happen within an hour every time they want to refresh.

You're making it sound like someone won't be able to "identify the individual" solely because of datamasking. Data masking does NOT make it so that no one can see the original data. You have to make the original data change, like you're doing.

You can probably ease the pain of what you're doing. There will be a lot that can be "auto-discovered" thanks to FKs and even just column names. With the FKs, you can also lead code into making better decisions about what to mask and how and in which order so that you don't get error messages about something being used as a key. I've been through similar as what you're doing with bank loans when banks merge and change loan numbers which, unfortunately, they used as part of the main PKs in most of our tables. Such auto-discovery has helped a whole lot.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ericwenger1 Hall of Fame Points: 3436 More actions · Answer 5

ericwenger1

Hall of Fame

Points: 3436

July 2, 2017 at 1:47 pm

#1949079

OK. Thanks!

Lowell SSC Guru Points: 323504 More actions · Answer 6

I have a DBA_Utilities database that I keep tables in place for randomization. So for example i have a MovieStars database that is a cross join of Real Movie Starts first names and last names. I use that to replace patient names, for example.
I have tons of saved lists i grabbed and put into tables like that that i keep just for updates so i can randomize data., whethere it is replacing comments with famous quotes, names, addresses, etc
then when i'm tasked with anonymizing the data, it's relatively simple. With something that occurs regularly, like anonymizing development, i create a procedure that would update multiple tables in a target database, after investigating what to replace.
With MYFirstNames(FName) AS ( Select 'Leonardo' UNION ALL Select 'Brad' UNION ALL Select 'Arnold' UNION ALL Select 'Mark' UNION ALL Select 'Matt' UNION ALL Select 'Bruce' ), MyLastNames(LName) AS ( Select 'DeCaprio' UNION ALL Select 'Pitt' UNION ALL Select 'Schwarzenegger' UNION ALL Select 'Wahlberg' UNION ALL Select 'Damon' UNION ALL Select 'Willis' )

SELECT TOP 6 A.FName,B.LName FROM MYFirstNames A CROSS JOIN MyLastNames B ORDER BY NEWID()

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

ericwenger1 Hall of Fame Points: 3436 More actions · Answer 7

ericwenger1

Hall of Fame

Points: 3436

July 2, 2017 at 2:48 pm

#1949082

Thanks!

Steve Jones - SSC Editor SSC Guru Points: 734489 More actions · Answer 8

So many security issues in technology start with "no one will ever ..."

You would be amazed how easy it can be to de-anonymize data. When there is value in doing so, or people are just bored, they'll make something happen. You just never know.

Data masking is not security. Not in any way. Do not use this to hide sensitive data from anyone other than extremely low privileged users that are limited to connecting to the database from an app. For anything else, work on speeding up the anonymizing procedure.

Ed Wagner SSC Guru Points: 287024 More actions · Answer 9

Lowell - Sunday, July 2, 2017 2:41 PM
I have a DBA_Utilities database that I keep tables in place for randomization. So for example i have a MovieStars database that is a cross join of Real Movie Starts first names and last names. I use that to replace patient names, for example.
I have tons of saved lists i grabbed and put into tables like that that i keep just for updates so i can randomize data., whethere it is replacing comments with famous quotes, names, addresses, etc
then when i'm tasked with anonymizing the data, it's relatively simple. With something that occurs regularly, like anonymizing development, i create a procedure that would update multiple tables in a target database, after investigating what to replace.
With MYFirstNames(FName) AS ( Select 'Leonardo' UNION ALL Select 'Brad' UNION ALL Select 'Arnold' UNION ALL Select 'Mark' UNION ALL Select 'Matt' UNION ALL Select 'Bruce' ), MyLastNames(LName) AS ( Select 'DeCaprio' UNION ALL Select 'Pitt' UNION ALL Select 'Schwarzenegger' UNION ALL Select 'Wahlberg' UNION ALL Select 'Damon' UNION ALL Select 'Willis' )
SELECT TOP 6 A.FName,B.LName FROM MYFirstNames A CROSS JOIN MyLastNames B ORDER BY NEWID()

I have something similar. For names, I have 20K first and 20K last (random) names and a cross join handles pretty-much anything. I also downloaded a free zip code database that contained city, state, zip and county. For SSNs and phone numbers, I have ITVFs to generate all possibilities and I can use whatever I need.

I must admit that the famous quotes table sounds interesting to replace comments. Great idea! I'd always just used random strings of garbled nonsense, but the quotations table sounds better.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions