Most efficient/least painful way to anonymise data in a large table

  • We have a production database which is being migrated to a new environment, and the customer requires the data in certain columns of certain tables to be anonymised while the project is in the development phase.

    The supplier has provided a script which replaces the data - for example:

    UPDATE ThisTable SET Description = 'Anonymised ' + TableKey

    Now the problem is that several of the tables have millions of rows. The biggest is 284,000,000 rows.

    The above statement will, of course, never work for such a table: locks, TempDB and row versions, log file growth, and so on.

    I have a script which I've used before which in essence does the following:

    1. Creates a temp table of the source table's PKs (and creates an index on the PK).

    2. Selects the top n PKs from the temp table and processes the corresponding rows in the source table.

    3. Deletes those top n PKs from the temp table.

    4. Repeats from step 2.

    This works well - it gives reasonable performance (and collects some metrics so the end time can be predicted). However, running it on the large table gives a predicted run time of 4 days!
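
    For illustration, here's a minimal sketch of that pattern - the table, key and column names are the hypothetical ones from the supplier example above, it assumes TableKey is an integer primary key, and the 50,000-row batch size is only a starting point:

    -- 1. Capture the source table's PK values into a work list.
    SELECT TableKey
    INTO   #Keys
    FROM   dbo.ThisTable;

    CREATE UNIQUE CLUSTERED INDEX IX_Keys ON #Keys (TableKey);

    DECLARE @BatchSize int = 50000;
    DECLARE @Batch TABLE (TableKey int PRIMARY KEY);

    WHILE 1 = 1
    BEGIN
        -- 2./3. Take a batch of keys and remove them from the work list in one go.
        DELETE TOP (@BatchSize)
        FROM   #Keys
        OUTPUT deleted.TableKey INTO @Batch;

        IF @@ROWCOUNT = 0 BREAK;   -- 4. Nothing left to process.

        -- Anonymise only the rows belonging to this batch.
        UPDATE t
        SET    t.Description = 'Anonymised ' + CAST(t.TableKey AS varchar(20))
        FROM   dbo.ThisTable AS t
               INNER JOIN @Batch AS b
                       ON b.TableKey = t.TableKey;

        DELETE FROM @Batch;        -- Reset for the next iteration.
    END;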

    Another measure I've taken is to switch the database to the SIMPLE recovery model.
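
    (For reference, that's just the following, assuming the restored copy is called ProdCopy:

    ALTER DATABASE ProdCopy SET RECOVERY SIMPLE;

    which at least lets the log truncate at checkpoints instead of waiting for log backups while the anonymisation runs.)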

    We have exclusive access to the server, and can 'do what we want' with it.

    The core problem is that we're talking about very large numbers of rows. One thought is to BCP out to text file(s), process them offline, and BCP back in. However, that still leaves us processing a text file with 284,000,000 lines!

    So - any other thoughts on how to achieve the above?

  • There's no magic way to do this. Updating and anonymizing data to get a good set of useful information takes time. It's especially problematic when you want to do this regularly for dev refreshes. I've had this discussion many times over the years, and tried to get a product from Redgate that would "shrink" a database to xx% of the original, but it's a very hard engineering effort.

    I've bitten the bullet and written scripts to transform things, delete some data, etc. They are customized, they have to be maintained, and they take time for every refresh. After working on this with some customers, here's what I've come to.

    First, take production and actually go through and remove data that you don't need for development. Maybe this is 50%, maybe 90%, but you'll need a way to actually clear out data, understanding FKs and relationships.

    Next, run scripts that anonymize data. Again, custom work.

    Now, take a backup and save this. This is a curated set of test data for all dev/test/qa environments. Use this in restores, in CI, in QA, etc.

    As you modify the schema, you'll need to deploy those changes here and gradually produce new backups. I'd make sure I version these, in line with how you version production (if you don't do that, start). I'd also have a quick process to add new data as you discover dev/test cases that aren't covered; evolve the test data over time as well. I'd major-version this with the production schema (1.2, 2.3, 2.4, etc.) and minor-version it with newer data sets (1.2.1, 1.2.2). The minor versions usually don't matter much, though - when you build 1.2.2 with new data, you can likely throw 1.2.1 away.
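
    As a rough illustration of that versioning idea (the database, share and file names here are made up):

    -- Curated, anonymised seed database, versioned alongside the production schema.
    BACKUP DATABASE DevSeed
        TO DISK = N'\\backups\DevSeed_1.2.1.bak'
        WITH INIT, COMPRESSION;

    -- Refreshing any dev/test/QA environment is then just a restore of the current version
    -- (add WITH MOVE if the file paths differ on the target server).
    RESTORE DATABASE DevSeed
        FROM DISK = N'\\backups\DevSeed_1.2.1.bak'
        WITH REPLACE;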

    Same overhead, but once. Maintenance is similar to any other situation I've encountered.

    You won't get this right the first time, which is fine. Try it, adjust the data and make new backups as needed, keep the most current one handy as a way to refresh any environment.

  • Hi Steve,

    Thanks - that's what I thought.

    The problem is that the customer wants representative data volumes on the new system, but wants the data anonymised. If it were just test/dev data (i.e. representative in terms of structure and content but not volume), then all your very good suggestions would apply.

    As you say - there's no easy way. If the customer wants his very large database to be anonymised, he's going to have to wait for it to chug through!

  • James Lavery (8/5/2016)


    As you say - there's no easy way. If the customer wants his very large database to be anonymised, he's going to have to wait for it to chug through!

    No... You don't have to wait. Every system has a "tipping point" where things suddenly take forever but operate very quickly if you don't exceed the tipping point. It's not easy to determine what the tipping point is, either.

    An example of the tipping point is that it may take 2 seconds to update 1 million rows, a very predictable 4 seconds to update 2 million rows, 6 seconds to update 3 million rows, and 6 HOURS to update 4 million rows.

    You would likely be able to do the updates on your tables 2 or 3 million rows at a time, using a well-written loop to identify the rows to update in chunks.
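
    A minimal sketch of that kind of loop - the names are the hypothetical ones from the original post, it assumes a reasonably dense integer clustered key, and the 3-million-row batch is just a starting point for finding the tipping point:

    DECLARE @BatchSize int = 3000000,
            @MinKey    int,
            @MaxKey    int;

    SELECT @MinKey = MIN(TableKey),
           @MaxKey = MAX(TableKey)
    FROM   dbo.ThisTable;

    WHILE @MinKey <= @MaxKey
    BEGIN
        -- Update one contiguous slice of the clustered key at a time.
        UPDATE dbo.ThisTable
        SET    Description = 'Anonymised ' + CAST(TableKey AS varchar(20))
        WHERE  TableKey >= @MinKey
          AND  TableKey <  @MinKey + @BatchSize;

        -- With SIMPLE recovery, a checkpoint between batches lets the log space be reused.
        CHECKPOINT;

        SET @MinKey = @MinKey + @BatchSize;
    END;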

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
