De-Identification of PHI Data

  • The company that I work for deal with PHI Data and wanted to get other opinions on De-Indentifying this kind of data. With SQL Server 2016 Data Masking is available. We've come up with a home grown way of De-Identification but it involved change the PHI Data, not masking it which takes awhile to do considering the volume of data that we have. If I have to create another copy of our database, it takes me about two days to make this happen. Any other techniques is greatly appreciated. My opinion is that going forward we should utilize the Data Masking Functionality of SQL Server 2016.

  • Let me ask you the same question I've asked many others.... if the data contained information about your or your family, would you actually be happy that someone used datamasking instead of refactoring the actual data?

    I wouldn't.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I'm not sure it really makes alot of difference so long as you can't identify the individual. My main concern is the time it takes to refactor this data. Most likely we will stick with refactoring the data as long as the powers to be know that its not going to happen within an hour every time they want to refresh.

  • Jeff Moden - Sunday, July 2, 2017 12:49 PM

    Let me ask you the same question I've asked many others.... if the data contained information about your or your family, would you actually be happy that someone used datamasking instead of refactoring the actual data?

    I wouldn't.

    [/quote]

    I agree with Jeff and I'm sure most others would too - including you and the people who's data you have.

    ericwenger1 - Sunday, July 2, 2017 7:59 AM

    The company that I work for deal with PHI Data and wanted to get other opinions on De-Indentifying this kind of data. With SQL Server 2016 Data Masking is available. We've come up with a home grown way of De-Identification but it involved change the PHI Data, not masking it which takes awhile to do considering the volume of data that we have. If I have to create another copy of our database, it takes me about two days to make this happen. Any other techniques is greatly appreciated. My opinion is that going forward we should utilize the Data Masking Functionality of SQL Server 2016.

    Since management probably has an idea of how much data is involved, I don't think it would be a stretch for them to accept that a rebuild can't happen in an hour.

    You said that the volume of data that you have is making the anonymization take two days.  Maybe we could help by coming up with more efficient ways of anonymizing it.  What's the goal of anonymizing it?  Is it to recreate test from production?  Or something else? 

    We'd need is the DDL for the tables involved and the code you're using to anonymize it.  An idea of the volume would help, as would any rules you must adhere to after it's anonymized.  For example, someone can't have a blood type of W+ or 5.3. 

    We don't need any real data.  I'm sure we can come up with a million-row data generation process if we know the rules for the data.

  • ericwenger1 - Sunday, July 2, 2017 1:08 PM

    I'm not sure it really makes alot of difference so long as you can't identify the individual. My main concern is the time it takes to refactor this data. Most likely we will stick with refactoring the data as long as the powers to be know that its not going to happen within an hour every time they want to refresh.

    You're making it sound like someone won't be able to "identify the individual" solely because of datamasking.  Data masking does NOT make it so that no one can see the original data.  You have to make the original data change, like you're doing.

    You can probably ease the pain of what you're doing.  There will be a lot that can be "auto-discovered" thanks to FKs and even just column names.  With the FKs, you can also lead code into making better decisions about what to mask and how and in which order so that you don't get error messages about something being used as a key.  I've been through similar as what you're doing with bank loans when banks merge and change loan numbers which, unfortunately, they used as part of the main PKs in most of our tables.  Such auto-discovery has helped a whole lot.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • OK. Thanks!

  • I have a DBA_Utilities database that I keep tables  in place for randomization. So for example i have a MovieStars database that is a cross join of Real Movie Starts first names and last names. I use that to replace patient names, for example.
    I have tons of saved lists i grabbed and put into tables like that that i keep just for updates so i can randomize data., whethere it is replacing comments with famous quotes, names, addresses, etc
    then when i'm tasked with anonymizing the data, it's relatively simple. With something that occurs regularly, like anonymizing development, i create a procedure that would update multiple tables in a target database, after investigating what to replace.
    With MYFirstNames(FName) AS
    (
    Select 'Leonardo' UNION ALL
    Select 'Brad' UNION ALL
    Select 'Arnold' UNION ALL
    Select 'Mark' UNION ALL
    Select 'Matt' UNION ALL
    Select 'Bruce'
    ),
    MyLastNames(LName) AS
    (
    Select 'DeCaprio' UNION ALL
    Select 'Pitt' UNION ALL
    Select 'Schwarzenegger' UNION ALL
    Select 'Wahlberg' UNION ALL
    Select 'Damon' UNION ALL
    Select 'Willis'
    )

    SELECT TOP 6
    A.FName,B.LName
    FROM MYFirstNames A
    CROSS JOIN MyLastNames B
    ORDER BY NEWID()

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Thanks!

  • So many security issues in technology start with "no one will ever ..."

    You would be amazed how easy it can be to de-anonymize data. When there is value in doing so, or people are just bored, they'll make something happen. You just never know.

    Data masking is  not security.  Not in any way. Do not use this to hide sensitive data from anyone other than extremely low privileged users that are limited to connecting to the database from an app. For anything else, work on speeding up the anonymizing procedure.

  • Lowell - Sunday, July 2, 2017 2:41 PM

    I have a DBA_Utilities database that I keep tables  in place for randomization. So for example i have a MovieStars database that is a cross join of Real Movie Starts first names and last names. I use that to replace patient names, for example.
    I have tons of saved lists i grabbed and put into tables like that that i keep just for updates so i can randomize data., whethere it is replacing comments with famous quotes, names, addresses, etc
    then when i'm tasked with anonymizing the data, it's relatively simple. With something that occurs regularly, like anonymizing development, i create a procedure that would update multiple tables in a target database, after investigating what to replace.
    With MYFirstNames(FName) AS
    (
    Select 'Leonardo' UNION ALL
    Select 'Brad' UNION ALL
    Select 'Arnold' UNION ALL
    Select 'Mark' UNION ALL
    Select 'Matt' UNION ALL
    Select 'Bruce'
    ),
    MyLastNames(LName) AS
    (
    Select 'DeCaprio' UNION ALL
    Select 'Pitt' UNION ALL
    Select 'Schwarzenegger' UNION ALL
    Select 'Wahlberg' UNION ALL
    Select 'Damon' UNION ALL
    Select 'Willis'
    )

    SELECT TOP 6
    A.FName,B.LName
    FROM MYFirstNames A
    CROSS JOIN MyLastNames B
    ORDER BY NEWID()

    I have something similar.  For names, I have 20K first and 20K last (random) names and a cross join handles pretty-much anything.  I also downloaded a free zip code database that contained city, state, zip and county.  For SSNs and phone numbers, I have ITVFs to generate all possibilities and I can use whatever I need.

    I must admit that the famous quotes table sounds interesting to replace comments.  Great idea!  I'd always just used random strings of garbled nonsense, but the quotations table sounds better.

  • This was removed by the editor as SPAM

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply