RE: Most efficient/least painful way to anonymise data in a large table

  • James Lavery (8/5/2016)


    Hi Steve,

    Thanks - that's what I thought.

    The problem is the customer wants to put representative data in terms of volumes on the new system, but wants it anonymised. If it was just test/dev data (i.e. representative in terms of structure and content but not volume), then all your very good suggestions are relevant.

    As you say - there's no easy way. If the customer wants his very large database to be anonymised, he's going to have to wait for it to chug through!

    No... You don't have to wait. Every system has a "tipping point": stay under it and things run very quickly; exceed it and they suddenly take forever. It's not easy to determine where that tipping point is, either.

    An example of the tipping point is that it may take 2 seconds to update 1 million rows, a very predictable 4 seconds to update 2 million rows, 6 seconds to update 3 million rows, and 6 HOURS to update 4 million rows.

    You would likely be able to do the updates on your tables 2 or 3 million rows at a time using a well-written loop that identifies and updates the rows in chunks.
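    A minimal sketch of that kind of chunked loop, assuming a hypothetical dbo.Customer table with CustomerID, CustomerName, and Email columns (the names and the anonymisation rule are illustrations, not from the thread):

    ```sql
    -- Sketch: anonymise in batches to stay under the "tipping point".
    -- Table, columns, and batch size are assumptions for illustration.
    DECLARE @BatchSize    INT = 2000000,  -- 2 million rows per pass
            @RowsAffected INT = 1;

    WHILE @RowsAffected > 0
    BEGIN
        UPDATE TOP (@BatchSize) dbo.Customer
           SET CustomerName = 'Customer' + CAST(CustomerID AS VARCHAR(10)),
               Email        = 'user' + CAST(CustomerID AS VARCHAR(10)) + '@example.com'
         WHERE CustomerName NOT LIKE 'Customer%';   -- only rows not yet anonymised

        SET @RowsAffected = @@ROWCOUNT;  -- 0 when nothing is left to update

        CHECKPOINT;  -- in SIMPLE recovery, allows log space to be reused between batches
    END;
    ```

    Each pass is its own implicit transaction, so the transaction log only has to hold one batch at a time instead of the whole table; the WHERE clause makes the loop restartable if it's interrupted part-way through.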

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)