Data Masking with a large database - How are others doing this efficiently?

  • Hi,

    I'm looking to take a copy of a 100+ GB production database with over 1,000 tables and mask the data using a tool (in this case, RedGate Data Masker). I see that I can go table by table, column by column, and set up rules for shuffling, substitutions and the like. What I'm hoping to learn from the users here is how large a project like this would typically be for you, so that I can manage my expectations appropriately and pick up tips on how it can be done faster.

    A few questions:

    1. When doing data masking for an entire database, do you typically just focus on a small percentage of columns that need to be transformed in some way (customer names, phone numbers, SSNs, etc.)?  Or are you changing bits, PK integers, dollar amounts, category designations, and things like that?
    2. How long does it typically take for you to achieve something like this with a database that size?
    3. Are you making use of a tool that can intelligently assign default suggested data masking for the columns based on data types and column names?  Or does something like that typically not exist?

    Thank you for any thoughts or wisdom you can provide.
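To make question 3 concrete, here is a minimal sketch of what "default suggestions from column names and data types" could look like if you had to script it yourself. Everything in it is an assumption for illustration: the rule names, the patterns, and the `suggest_rule` helper are hypothetical and are not part of Data Masker or any other product. The input is assumed to be a list of (table, column, data type) tuples, such as you might pull from the database's catalog views.

```python
import re

# Hypothetical rule names for illustration; not identifiers from any masking tool.
NAME_PATTERNS = [
    (re.compile(r"ssn|social", re.I), "substitute_ssn"),
    (re.compile(r"e[-_]?mail", re.I), "substitute_email"),
    (re.compile(r"phone|fax|mobile", re.I), "substitute_phone"),
    (re.compile(r"(first|last|full|middle)[-_]?name|surname", re.I), "substitute_name"),
    (re.compile(r"address|street|zip|postal", re.I), "substitute_address"),
    (re.compile(r"birth|dob", re.I), "shift_date"),
]

# Character types that may hold free text and deserve a manual look.
TEXT_TYPES = {"char", "varchar", "nchar", "nvarchar", "text", "ntext"}


def suggest_rule(column_name, data_type):
    """Return a suggested rule name, 'review' for unmatched text columns, or None."""
    for pattern, rule in NAME_PATTERNS:
        if pattern.search(column_name):
            return rule  # first matching pattern wins
    if data_type.lower() in TEXT_TYPES:
        return "review"
    return None


def suggest_for_schema(columns):
    """Map (table, column) to a suggestion for every column that gets one."""
    suggestions = {}
    for table, column, data_type in columns:
        rule = suggest_rule(column, data_type)
        if rule is not None:
            suggestions[(table, column)] = rule
    return suggestions
```

Name matching like this produces false positives (a column called `ClassName` contains "ssn", for example), so the output is best treated as a shortlist for a person to review rather than a finished rule set.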

  • tarr94 wrote:

    (100+ gb, over 1,000 tables)

    100 GB is not large.

    tarr94 wrote:

    When doing data masking for an entire database, do you typically just focus on a small percentage of columns that need to be transformed or encrypted in some way

    Masking is not encryption.  Masking will not pass most audits.

    As for what you need to mask, that would be up to you. "It depends" certainly applies. In one system, we have encrypted only the PII columns: emails, names, and so forth.

    In another system, we have simply masked the same data, as well as the columns that hold financial data.

    Which columns to mask and encrypt was not determined by the DBAs or developers. This came from the business units and auditors.

    As for how long this takes, data masking does not modify the data. You simply "turn it on". Encryption varies greatly, depending on how many columns are encrypted and the volume of encrypted data. As an example, the system where we have encrypted the users' PII data is about 120 GB, and the users table is about 45 GB. There are approximately 10 columns that are encrypted. When this was implemented on our development environment, it took hours. The resources allocated to the dev box are minimal. On production, which is far more robust hardware, it took about an hour.

    I guess the real question I have is what are you trying to do?  What directive at your company have you been given?

    And test, test, test. This is not trivial.

     

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Thanks for the reply, Michael, some very helpful information there.

    At this point, we are exploring our Data Masker tool to get a feel for how much effort is involved for masking different things.  The more immediate need we'll have is that one of our vendors will need a copy of our database for testing out some development they're doing with the app, and rather than simply hand it over as is, we'd like for them to receive a version with fake data.  I don't think we have any immediate need for auditing purposes.  Ideally, the more fake data, the better, if this is something that can be achieved efficiently.  But my experience with Data Masker so far is that the tool doesn't appear able to make intelligent suggestions about how to treat various columns; rather, any masking that needs to occur has required me to go column by column to set up these rules.
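For readers new to the terms, the two rule types mentioned in this thread (shuffling and substitution) can be sketched in a few lines. This is an illustration of the general techniques only, under the assumption that rows are plain dictionaries already loaded in memory; it is not how Data Masker implements its rules, and the function names and the `FAKE_FIRST_NAMES` list are made up for the example.

```python
import hashlib
import random

# Made-up replacement values for the example.
FAKE_FIRST_NAMES = ["Alex", "Blake", "Casey", "Devon", "Emery", "Finley"]


def shuffle_column(rows, column, seed=None):
    """Return new rows with one column's values randomly permuted across rows."""
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def substitute_column(rows, column, replacements, salt=""):
    """Replace each value with an entry from `replacements`.

    The entry is chosen by a salted hash of the original value, so the same
    input always maps to the same fake value within one run configuration.
    """
    def pick(value):
        digest = hashlib.sha256((salt + str(value)).encode("utf-8")).digest()
        return replacements[int.from_bytes(digest[:8], "big") % len(replacements)]

    return [{**row, column: pick(row[column])} for row in rows]
```

One design point worth knowing before handing data to a vendor: shuffling leaves every real value in the table and only breaks its link to the row, whereas substitution removes the real values altogether.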

    As someone who is completely new to the concept of data masking, I'm just trying to get a sense for what is normal, and how much of the data people typically expect to have masked.  Since I'll be the one using the tool at my company, I want to have some sense for what is typical at other companies, how much data is normally getting masked, and how long this typically takes to set up.  It makes it easier for me to be able to form a reasonable estimate for how much time I'll need to complete the work.  That way, if the business unit wants us to mask X amount of data, I can say it might take Y amount of time.  Much of this is a matter of me exploring on my own with the tool we're using, but I figured any other information from the folks here would be greatly helpful for me as well.

    I'll make a few updates to my original post to remove any confusion with the references to "encryption" and "large" databases.

     

    • This reply was modified 2 years ago by  tarr94.
  • I think the only way you are going to get an actual representation of how long it takes would be to try it.

    Can you make a copy of your database(s) to another server and actually let the tool run?
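If a full run on a copy is not practical right away, one rough way to "try it" is to time the masking of a sample and scale up. The sketch below assumes the cost per row stays roughly constant at full scale, which may not hold once indexes, constraints, and logging come into play, so treat the result as a first guess. The helper names are hypothetical.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def extrapolate_seconds(sample_rows, sample_seconds, total_rows):
    """Linear estimate of the full run time from a timed sample."""
    if sample_rows <= 0:
        raise ValueError("sample_rows must be positive")
    return sample_seconds * (total_rows / sample_rows)
```

For example, if masking a 1,000-row sample takes 2 seconds, `extrapolate_seconds(1000, 2.0, 100000)` gives 200 seconds for 100,000 rows under that linear assumption.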

    Michael L John

