SQLServerCentral Editorial

Pseudonymisation

Today we have a guest editorial from Phil Factor as Steve is out of the office.

It is an awful word, but ‘pseudonymisation’ is the process of partial ‘data masking’ that transforms personal data in such a way that the resulting data cannot easily be attributed to a specific data subject without the use of additional information. It is used mostly in medical research, but also in reporting, training and testing. Unlike full ‘anonymisation’, a record that has ‘pseudonymised’ data can still be linked back to the original record via a key. It is one of the ‘appropriate technical and organisational measures’ that are required to comply with the EU General Data Protection Regulation 2016/679 (GDPR) where data needs to be used for research or testing purposes.
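As a rough illustration of what 'linked back via a key' can mean in practice, here is a minimal Python sketch that replaces a name with a keyed hash (HMAC-SHA256) and keeps the lookup separately. The function name, the key, the sample names and the 16-character truncation are all assumptions made up for this example; it is one possible approach, not a recommended or compliant implementation.

```python
import hashlib
import hmac


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for value, derived with a keyed hash."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    # Truncated only to keep the example readable (an arbitrary choice).
    return digest.hexdigest()[:16]


# Hypothetical secret: it must be stored away from the pseudonymised copy.
SECRET_KEY = b"keep-this-outside-the-research-copy"

# Invented sample data.
patients = ["Mary O'Brien", "John Null", "Mary O'Brien"]

# The 'additional information': a lookup held separately from the research copy.
lookup = {pseudonymise(name, SECRET_KEY): name for name in patients}

# What the researchers or testers actually receive.
research_copy = [pseudonymise(name, SECRET_KEY) for name in patients]
```

The same person always gets the same pseudonym, so records can still be joined and counted, but going back from pseudonym to name needs either the lookup or the key. Anyone who holds neither is left with inference, which is a separate problem.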

Pseudonymisation isn’t a good solution to the problem of protecting privacy and preventing unauthorised access. Even encryption is a very poor substitute for effective access control: there must be no unauthorised access even to the encrypted data, and encryption is best considered as an extreme delaying tactic.

Partial data masking and other pseudonymisation techniques tend to fail where they can be subjected to inference attacks. It is often easy to take such data and work out, to a small but useful extent, who it belongs to: you can soon work out who is using that dating site unless all the data is anonymised. Completely anonymised data is fine, especially for training, but if you are using it for database testing you need to check that the 'statistics' distributions of the data haven't changed.
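To make the inference risk concrete, the following Python sketch counts how many pseudonymised rows share each combination of the remaining columns. Any row whose combination is rare is a candidate for re-identification by anyone who knows those facts about a person. The rows, the column choice and the threshold k are invented for illustration; this is a simple group-size check, not a full privacy analysis.

```python
from collections import Counter

# Invented pseudonymised rows: (pseudonym, postcode_district, birth_year, sex)
rows = [
    ("a91f", "CB1", 1961, "F"),
    ("07c2", "CB1", 1961, "F"),
    ("5be0", "CB2", 1948, "M"),
    ("d313", "CB4", 1990, "F"),
    ("66aa", "CB4", 1990, "F"),
    ("e1f9", "CB4", 1975, "M"),
]


def group_sizes(rows):
    """Count rows per combination of the non-pseudonym columns."""
    return Counter(row[1:] for row in rows)


def exposed(rows, k=2):
    """Return pseudonyms whose column combination is shared by fewer than k rows."""
    sizes = group_sizes(rows)
    return [row[0] for row in rows if sizes[row[1:]] < k]


print(exposed(rows))  # ['5be0', 'e1f9']
```

Here two of the six rows are the only ones with their particular district, birth year and sex, so masking the name alone does little to hide them.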

Pseudonymisation is a variety of data masking. The task of masking sensitive data within a database is always fraught. RDBMSs are designed to make it pretty easy to work out what the data was before it was obfuscated. Even if you aren’t always sure what goes on in the log, you can be confident that the villains know. In fact, there are plenty of other artefacts around within a SQL Server database to guide the curious. You are likely to need to shut down all devices and traces that track changes in the data before you obfuscate data, or export all the data to a different copy of the database.

I suspect that we want to use real production data where we have a bug that can only be repeated reliably on production data. Such a bug tends to be triggered by someone like Mr Null or Mr O’Brien, who between them have caused many a NAD system to fail. (Test data tends not to reflect the full variety that is met in the real world.) With database systems, we also need to understand why certain queries fail when data conforms to particular distributions, and these distributions of data are difficult to achieve by generating data.
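The sort of failure that Mr O’Brien provokes is easy to reproduce. The Python sketch below uses the standard sqlite3 module purely as a stand-in for a real database; the table, the list of awkward surnames and the naive_insert_sql helper are all hypothetical. It shows a statement built by string concatenation failing on the apostrophe, while a parameterised statement stores the same surnames without trouble.

```python
import sqlite3

# Invented surnames of the kind that generated test data often lacks.
awkward_surnames = ["O'Brien", "Null", "NULL", ""]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (surname TEXT)")

# Parameterised insert: each value is passed as data, never spliced into the SQL.
conn.executemany(
    "INSERT INTO person (surname) VALUES (?)",
    [(surname,) for surname in awkward_surnames],
)
stored = [row[0] for row in conn.execute("SELECT surname FROM person ORDER BY rowid")]


def naive_insert_sql(surname):
    """Hypothetical buggy helper: builds the statement by string concatenation."""
    return "INSERT INTO person (surname) VALUES ('" + surname + "')"


try:
    conn.execute(naive_insert_sql("O'Brien"))
    naive_failed = False
except sqlite3.OperationalError:
    # The apostrophe ends the string literal early, so the statement is invalid SQL.
    naive_failed = True
```

A small set of awkward values like this, added to generated test data, catches some of these bugs without touching production data, although it does nothing to reproduce real data distributions.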

We are getting close to the point where we can no longer use live personal data to maintain and enhance databases. It will require considerable skill to find effective ways of partially or fully anonymising the data to ensure that the organisations we work for comply with the law, because it may not be as easy as it first seems.
