Encrypting SQL Server: Dynamic Data Masking

Dynamic Data Masking is a good way of rendering data unreadable for such purposes as user-acceptance testing, or demonstrating an application. It doesn't encrypt the data, and a knowledgeable SQL user can defeat it. However it provides a simple way to administer from the database what data the various users of a database application can and can not see, making it a useful tool for the developer.

SQL Server 2016 introduced dynamic data masking (DDM) as a way to prevent unauthorized users from viewing certain types of sensitive information. The database engine masks the data when it is retrieved from the database, based on masking rules defined on the column schema. The data stored within the database remains unchanged.

When a user queries the database, the database engine determines whether that user account has the permissions necessary to access the data in its unmasked state. If the account does not, the engine applies the masking rules to the data when it is returned as part of the query. In this way, you can mask all or part of sensitive data such as national identification numbers, credit card numbers, birth dates, phone numbers, or other types of information.

To implement DDM, you define masking rules on the columns that contain the data you want to protect. For each column, you add the MASKED WITH clause to the column definition, using the following syntax:

You start by replacing the <function> placeholder with one of four functions: default, email, random, or partial. For the random and partial functions, you must also provide parameter values, as specified by the <arguments> placeholder. We’ll get into the specifics of how all this works as we go through the examples.

You can add the MASKED WITH clause to your column definition when you create a table or afterwards by using an ALTER TABLE statement. It does not matter whether the column already contains data. There are some limitations, however. For example, you cannot apply DDM to computed columns or Always Encrypted columns.

While we’re on the topic of Always Encrypted, it’s worth noting that this is the fourth article in a series related to SQL Server encryption. The following links point to the first three articles.

I’ve included DDM in this series because Microsoft documentation (SQL Server Encryption) implies that DDM is a type of SQL Server encryption. It is not. The feature simply masks data for non-privileged users upon querying a protected column. No data is being encrypted at rest or in motion. The database engine merely replaces the sensitive data with non-identifying characters. Even Microsoft admits that “unprivileged users with ad-hoc query permissions can apply techniques to gain access to the actual data.”

Although DDM is not really encryption, it still seems worth including it in this series, if for no other reason than to be complete. The DDM feature is easy to implement and requires no changes to the queries themselves. You need only update your column definitions and perhaps tweak the permissions on certain accounts. But keep in mind that DDM should be used only as part of a much larger strategy for protecting data, a strategy that will likely include real encryption.

Applying a default mask

The first DDM function we’ll tackle is default. When you use this function to mask data, and a user with read-only privileges queries that data, the database engine masks the entire value and returns a replacement value. The exact nature of the replacement value depends on the column’s data type.

It is important to note that the default function does not take any arguments. Microsoft’s documentation is a bit confusing in this regard because it suggests otherwise, but you need only specify the function name and an empty set of parentheses.

To demonstrate how this works, we’ll start by creating a table and populating it with data from the AdventureWorks2014 database. When creating the table, we’ll use the default function to mask four of the columns, as shown in the following T-SQL script:

The table includes a number of extra columns, which we’ll be using in subsequent examples. For now, let’s focus on the four DDM columns.

The definition for each of these columns includes the MASKED WITH clause, which must come before the NULL or NOT NULL option. Within the clause, we specify the DDM function, in this case, default, enclosed in single quotes.

With our table in place and populated, let’s query the masked columns, using the same privileged account we used to create the table:

Because we’re running the query under a privileged account, we have full access to the data, despite having implemented the masking rules. For this reason, the SELECT statement will return the same results (shown in the following table) that we would get if the masking rules had not been added to the column definitions.

EmpID FirstName LastName Birthdate CurrentFlag SalesLastYear
274 Stephen Jiang 1951-10-17 1 0.00
275 Michael Blythe 1968-12-25 1 1750406.4785
276 Linda Mitchell 1980-02-27 1 1439156.0291
277 Jillian Carson 1962-08-29 1 1997186.2037
279 Tsvi Reiter 1974-01-18 1 1849640.9418

Notice that we did not have to change our query in any way and that the data is unchanged. When the database engine saw that we had access rights to that data, it returned the values just like they’re stored in the database.

Controlling access to masked data

The key to controlling access to masked data is in the privileges granted to the user accounts. When I created the table and then queried the data, I was logged in with an administrative account. However, an account that has been granted only SELECT permission on the table will see the data as masked.

To test this out, let’s start by creating a local user account on the EmpData4 database and granting that account the SELECT permission:

This should all be fairly straightforward. We’ve created an account in the database that is not linked to a login (to keep things simple for our testing), and then we granted the SELECT permission to that account to allow read access to the EmpInfo table.

Now let’s run our SELECT statement again, only this time, we’ll do it within the context of the new user:

We’re simply using the EXECUTE AS and REVERT statements to get the data that the new user would see. As the following table shows, the values in the four columns are now very different from how they’re stored in the database.

EmpID FirstName LastName Birthdate CurrentFlag SalesLastYear
274 Stephen xxxx 1900-01-01 0 0.00
275 Michael xxxx 1900-01-01 0 0.00
276 Linda xxxx 1900-01-01 0 0.00
277 Jillian xxxx 1900-01-01 0 0.00
279 Tsvi xxxx 1900-01-01 0 0.00

To begin with, the database engine returns each LastName value as xxxx. The LastName column is defined with the NVARCHAR(50) data type. The database engine returns the xxxx value for all masked character data, unless the column is defined at a smaller size. For example, a CHAR(2) column will return a masked value of xx.

The database engine handles date/time values differently. Consider the Birthdate column, which is configured with the DATE data type. Each value is returned as 1900-01-01 (or whatever format is consistent with your regional settings). If this had been some variation of the DATETIME data type, such as DATETIME2, the returned values would be 1900-01-01 00:00:00:0000000.

For numeric columns, the database engine returns 0 or 0.00, if decimals are involved. In this case, the CurrentFlag column is configured with the BIT data type, which qualifies as numeric, and the SalesLastYear is configured with the MONEY data type.

Regardless of the data type, the default function results in the entire value being somehow masked, which means an unauthorized user would not be able to infer the actual value other than to know the type of data, such as character or date.

In some cases, you might want to allow a user to view the data as unmasked. To do so, you would grant the UNMASK permission to the account:

Now when you run the SELECT statement as User1, the results will show the unmasked data. However, if you then decide that the account should not be able to access the data as unmasked, you can revoke the permission:

After you run this statement, the user will again see the data as masked, as defined by the masking rules in the column definitions.

Verifying column masks

If you want to view which columns are masked in your database and how that masking is implemented, you can use the sys.masked_columns system view, as shown in the following SELECT statement:

If you run the statement against our example database, you should see the following results.

TableName ColumnName MaskFunction
EmpInfo Birthdate default()
EmpInfo CurrentFlag default()
EmpInfo LastName default()
EmpInfo SalesLastYear default()

In this case, we’ve defined masking rules on only four columns in the EmpInfo table, with all rules using the default function. If we had used functions that require arguments, those arguments would be shown here as well.

Applying an email mask

The next function we’ll look at is email, which is used specifically to mask email addresses. The email function is as simple to implement as the default function, except that it returns somewhat different results. The default function masks the entire value, whereas the email function masks all but the first letter and, in some cases, the Internet domain (e.g., .com, .org, .edu).

The behavior of the email function might seem a bit odd because it returns the .com domain for all email addresses. For example, if the email address ends in .org or .edu, the masked value will read .com, as it would if the domain actually were .com. Another way to look at this is that the function masks the domain for all email addresses except those that end with .com.

Regardless of the domain issue, the process of defining the masking rule is quite straightforward:

In this case, we’re simply modifying the EmailAddress column to include the masking rule, with the email function specified. Now let’s query the table as our privileged user:

As expected, the SELECT statement returns the results shown in the following table.

EmpID EmailAddress
274 stephen0@adventure-works.com
275 michael9@adventure-works.com
276 linda3@adventure-works.com
277 jillian0@adventure-works.com
279 tsvi0@adventure-works.com

This time, we’ll run the SELECT statement as user1:

Now our results now look quite different.

EmpID EmailAddress
274 sXXX@XXXX.com
275 mXXX@XXXX.com
276 lXXX@XXXX.com
277 jXXX@XXXX.com
279 tXXX@XXXX.com

That’s all there is to masking email addresses. You might decide it’s just as easy to go with the default function to mask the data. In most cases, the value xxxx will work as effectively as tXXX@XXXX.com. The only advantage to the latter is that it telegraphs to the user that this value is a masked email address, and perhaps that’s exactly what you want to do.

Applying a random mask

The next function available to the masking rules is random, which replaces numerical values with a random value within a specified range. For example, we can apply the function to the SickLeave column, which is configured with the INT data type:

In this case, we want the returned random value to fall between 1 and 5. We define the range by adding the range limits as arguments to the function. We can do something similar with the SalesYTD column, which is configured with the MONEY data type:

This time we want the returned values to fall between 101 and 999. Now let’s run a query against these columns using our privileged account:

Not surprisingly, we get the results shown in the following table.

EmpID SickLeave SalesYTD
274 27 559697.5639
275 39 3763178.1787
276 33 4251368.5497
277 32 3189418.3662
279 34 2315185.611

Next, we’ll query the table as user1:

Now the returned values fall within the specified ranges, as shown in the following table.

EmpID SickLeave SalesYTD
274 2 335.2532
275 1 814.2846
276 3 142.2145
277 2 428.1147
279 5 290.1132

Notice that for the SalesYTD column, the database engine preserves the decimal, but still returns values within the specified range.

Be careful when using the random function so you don’t create a situation in which users that need accurate data assume that the information they’re seeing is correct. As with the email function, you might be better off using the default function. At least that way, users are seeing all zeroes and will likely realize that the data is inaccurate. On the other hand, if your intention is to deceive non-privileged users, then by all means, use the random function.

Applying a partial mask

The final function available to the masking rules is partial, which allows you to specify exactly which values are masked. This function works best for column data that is consistently formatted, such as credit card numbers or national IDs.

To demonstrate how the partial function works, we’ll start by applying it to the NatID column, which stores social security numbers, without the hyphens. For these values, we want to mask all but the last four digits, as shown in the following ALTER TABLE statement:

To implement a partial masking rule, we must specify three arguments: how many opening characters to reveal, how many middle characters to mask and how to mask them, and how many trailing characters to reveal. In this case, we are revealing no opening characters (0), masking the first five characters with x values (xxxxx), and revealing the last four characters (4).

Now we’ll do something similar for the PhoneNumber column, only this time, we’ll reveal the first four characters, apply the value xxx-xxxx to the next seven characters, and reveal no trailing characters:

To see what the data looks like, we’ll query the table with our privileged account:

As the following table shows, the results include the same data that’s stored in the database.

EmpID NatID PhoneNumber
274 502097814 238-555-0197
275 841560125 257-555-0154
276 191644724 883-555-0116
277 615389812 517-555-0117
279 716374314 664-555-0112

Next, let’s query the table as user1:

This time, the NatID values and PhoneNumber values are masked according to our masking rules.

EmpID NatID PhoneNumber
274 xxxxx7814 238-xxx-xxxx
275 xxxxx0125 257-xxx-xxxx
276 xxxxx4724 883-xxx-xxxx
277 xxxxx9812 517-xxx-xxxx
279 xxxxx4314 664-xxx-xxxx

The partial function can be a particularly useful tool, as long as the column values are in a consistent format. If they’re not, you might need to separate those values into different columns or come up with another strategy.

Verifying column masks, again

Now that we have all our masking rules in place, let’s query the sys.masked_columns system view once more:

Our results should now look similar to those shown in the following table.

TableName ColumnName MaskFunction
EmpInfo Birthdate default()
EmpInfo CurrentFlag default()
EmpInfo EmailAddress email()
EmpInfo LastName default()
EmpInfo NatID partial(0, “xxxxx”, 4)
EmpInfo PhoneNumber partial(4, “xxx-xxxx”, 0)
EmpInfo SalesLastYear default()
EmpInfo SalesYTD random(101, 999)
EmpInfo SickLeave random(1, 5)

As pointed out earlier, the results include the function arguments, where applicable, making it easier to pinpoint where you might be having issues should your masked data not be exactly what you expect.

Masking columns data

The DDM feature in SQL Server 2016 is a handy way to mask data without needing to modify your queries or applications. The four functions supported by the masking rules—default, email, random, and partial—are simple to implement within your column definitions. And because masking is controlled through SQL Server permissions, you can determine exactly who gets to see the unmasked data and who does not. Again, just be sure that masking is part of a larger security strategy and not your sole solution for protecting sensitive data.