The problem of data masking comes up surprisingly often in the world of IT. Any time you need to share some potentially sensitive data, you may need to hide, obfuscate, randomize or otherwise dissimulate some of that data -- we'll call that the secret data.
In this article, we're going to focus on the mechanics of data masking, and gloss over a massive issue, which is data classification -- knowing who can access what data. Data classification is a whole different problem, especially in organizations that have huge amounts of sensitive data. I'll refer you to a different article that touches on this topic. For the rest of this article, we'll assume that this problem has been solved, and that we do in fact know who can access what data. The question is -- how do we hide the secret data?
Data masking is not just for databases -- it can be applied to documents, spreadsheets and so on, but here we'll focus on databases. There are many ways to do data masking, but in general they can be divided into two categories, each one with its own upsides and downsides.
The first category, static masking, is the simplest solution. Given a database that contains some secret data, you copy that database and edit the copy to mask whatever data needs to be masked. You can then give the copy to the client, and they can do whatever they want with it.
Of course, for a large data set, this may not be a trivial process. Imagine a relational database with thousands of tables and billions of rows (or more). But there are some (expensive) tools that will help you with that task.
Static masking is a very clean concept. It's the same idea as taking a pair of scissors and cutting out parts of a document. The secret data is not present, or at least not readable, in the copy, so there is no risk of leakage. The end user simply does not have the secret data.
For simple databases, you may not even need any tools: a few simple SQL scripts (or whatever language your database uses) might be enough. Because the secret data is not present, you can give a physical copy of the masked database to the client and let them run it on their own machines.
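As a minimal sketch of that "copy, then edit the copy" workflow -- using SQLite and a made-up employees table, purely for illustration:

```python
import os
import shutil
import sqlite3
import tempfile

# Build a toy "production" database with a hypothetical employees table.
src = os.path.join(tempfile.mkdtemp(), "prod.db")
con = sqlite3.connect(src)
con.execute("CREATE TABLE employees (id INTEGER, name TEXT, ssn TEXT, email TEXT)")
con.execute("INSERT INTO employees VALUES (1, 'Ada', '123-45-6789', 'ada@example.com')")
con.commit()
con.close()

# Static masking: copy the database file, then edit the copy in place.
dst = src.replace("prod.db", "masked.db")
shutil.copyfile(src, dst)

masked = sqlite3.connect(dst)
# Overwrite the secret columns; the original values are simply gone from the copy.
masked.execute("UPDATE employees SET ssn = 'XXX-XX-' || substr(ssn, -4)")
masked.execute("UPDATE employees SET email = 'user' || id || '@masked.invalid'")
masked.commit()

print(masked.execute("SELECT ssn, email FROM employees").fetchone())
# ('XXX-XX-6789', 'user1@masked.invalid')
```

The copy can now be handed to the client; the source database is untouched, and nothing the client does to the copy can recover the original values.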
The duplication of the data can be a problem. It requires more storage, and it's one more copy of the database floating around. This is not usually an issue if, for instance, you are releasing a database to the general public, in which case there is only one version of the masked database.
But if different clients have different requirements, you may need to make many copies of the database, each one with a potentially different set of rules about which data is masked and how. And of course, if you have different rules for different clients, you now have to worry about each client getting access only to their own custom version of the database, and not anyone else's. It can get challenging to track all that.
Another problem is that the copies are snapshots of the database, so they may need to be refreshed at regular intervals -- and each refresh is another opportunity for a mistake.
Finally, we live in the era of big data. Some data sets are truly enormous, and making and distributing a copy of such data sets can be a daunting proposition.
The second category, dynamic masking, takes a different approach. Instead of making a copy of the data and changing the copy, the data is modified on the fly, as it is accessed, before it reaches the user, so each user of the same database gets a potentially different view of the data. Note that this does not affect the database itself -- it only affects how the user sees the data in it.
This obviously assumes that you control the database and that clients access it over some sort of network. If users controlled the database themselves, they could easily bypass the masking.
Generally speaking, dynamic masking can be done either by the database itself, or by a layer between the database server and the database client.
For instance, Microsoft SQL Server offers some dynamic data masking capabilities, which may be sufficient for many scenarios. PostgreSQL has the Anonymizer extension. I've gone over data masking in SQL Server in a previous article: it's a powerful feature, but it does have some limitations.
There are some third-party solutions that provide data masking outside of the database, but they typically rely on special drivers or special clients. A more generalized approach is based on proxy filtering, which relies on deep packet inspection and modification to mask data before it reaches the client.
The biggest advantage of dynamic masking is that, in theory at least, it allows you to use just one database for everyone. This avoids most of the issues we identified earlier with static masking.
Dynamic data masking also means that you can update the data masking rules, typically on the fly, and restrict or broaden access to certain data for certain clients at any time. And masking can be dependent on more than just who the user is: it can also depend on their IP address, or the time of day, or what DEFCON level we're at -- you get the picture.
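To make that concrete, here is a sketch of a context-dependent rule set -- the roles, thresholds, and hours are all hypothetical, invented for illustration:

```python
# Hypothetical masking rules: what a user sees depends not just on
# who they are, but also on the time of day.
def mask_salary(value, role, hour):
    if role == "payroll" and 9 <= hour < 17:
        return value                 # full access, business hours only
    if role == "analyst":
        return round(value, -4)      # bucketed to the nearest 10,000
    return "****"                    # everyone else sees nothing

print(mask_salary(123456, "payroll", 10))   # 123456
print(mask_salary(123456, "analyst", 22))   # 120000
print(mask_salary(123456, "intern", 10))    # ****
```

Because the rules live in one place and are evaluated per request, tightening or loosening access is a matter of editing the rules, not rebuilding a database copy.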
Obviously, clients get access to new and updated data immediately (subject to masking rules), so the problem of data currency disappears.
Dynamic data masking implies that you control the database. You can (and probably should) monitor what the clients are doing. This is critical for forensic analysis if there is a problem later on (think Cambridge Analytica).
Dynamic masking is potentially less secure, since users are in fact connecting to a database that contains the secret data. It turns out to be non-trivial to mask data reliably if the client accesses it using a sophisticated query language such as SQL. For instance, Microsoft specifically warns about this issue in their SQL Server data masking documentation. This can be managed by using query control, if that's an option.
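The classic failure mode is a masking layer that blanks out values in the result set while still letting the WHERE clause evaluate against the real data. This toy simulation of such a naive "mask on output" layer (not any particular product) shows the leak:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
con.executemany("INSERT INTO employees VALUES (?, ?)",
                [("Ada", 150000), ("Bob", 60000)])

# Naive dynamic masking: run the client's query as-is, then blank out
# the salary column in the results. The WHERE clause still evaluates
# against the real values.
def run_masked(query):
    cur = con.execute(query)
    cols = [d[0] for d in cur.description]
    return [tuple("****" if c == "salary" else v for c, v in zip(cols, row))
            for row in cur]

# The returned salaries are masked...
print(run_masked("SELECT name, salary FROM employees"))
# ...but a filtering query still reveals who earns more than 100,000:
print(run_masked("SELECT name, salary FROM employees WHERE salary > 100000"))
# [('Ada', '****')]
```

Even though the client never sees a salary, they can binary-search the threshold and recover every value to arbitrary precision -- which is why query control, or restricting what predicates clients may use, matters.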
Dynamic masking can also be a more complex solution overall, with more moving parts. The more complex the solution, the more likely it is that something will go wrong.
As is so often the case, there is no perfect solution: there is only a series of trade-offs that need to be weighed against the requirements.
If your data set is of manageable size (and that is very much a relative concept here), it may be practical for you to make a copy of your database and do the masking on the copy. If you're OK with the disadvantages we have outlined, that's a great way to do it. Simple solutions are often the most secure.
But if it's impractical or undesirable to duplicate the data set, especially if you have multiple clients with multiple masking requirements, then dynamic masking may be your only realistic option. In that case, you'll have to consider whether the database can satisfy your requirements, or whether a third-party solution is required. Even if you end up using the data masking capabilities provided by your database, you may still benefit from using a third-party tool to manage permissions and data classifications.