Dynamic Data Unmasking

The SQL Server 2016 dynamic data masking feature may seem like a great way to obfuscate data for downstream systems like dev and QA. Joe Obbish shows us that the data can be “unmasked” with T-SQL statements, so it’s not secure against anyone who can write their own queries.

Dynamic data masking is a SQL Server 2016 feature to mask sensitive data at the column level from non-privileged users. Hiding SSNs is a common example in the documentation. However, the documentation also gives the following warning:

The purpose of dynamic data masking is to limit exposure of sensitive data, preventing users who should not have access to the data from viewing it. Dynamic data masking does not aim to prevent database users from connecting directly to the database and running exhaustive queries that expose pieces of the sensitive data.

How bad can it be? This post explores how quickly a table of SSNs can be unmasked by a non-privileged user.

Simple Demo

Let’s use a table structure very similar to the example in the documentation:

Here’s what the data looks like for a privileged user, such as the sa login:

[Image: a12_sa_results]

However, if I log in with my lowly erik SQL Server login, I can no longer see Pablo Blanco’s SSN:

[Image: a12_erik_results]

Test Data

To make things more interesting let’s load a million rows into the table. SSNs will be randomized but I didn’t bother randomizing the first and last names.

How quickly can the malicious end user erik decode all of the data? Does he really require a set of exhaustive queries? To make things somewhat realistic, setting trace flags and creating permanent objects are off limits. Only temp tables can be created, since all users can do that.

Decoding the SSN Format

The WHERE clause of queries can be used to infer information about the data. For example, the following query is protected by data masking because all of the action is in the SELECT clause:

However, the following query will only return the subset of rows with 1 as the first digit in their SSNs:
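As a rough illustration of why the WHERE clause leaks, here is a small Python model of the idea. It is not T-SQL and not the query from this post: the table contents, the `MASK` value, and the `select_masked` helper are all made up for the sketch. The point is only that the filter runs against the real value while the output shows the mask.

```python
# Hypothetical stand-in for the masked table: rows are (first_name, ssn).
ROWS = [
    ("Pablo", "123-45-6789"),
    ("Ana", "987-65-4321"),
    ("Luis", "155-12-0000"),
]

MASK = "xxxx"  # assumed placeholder for whatever the mask displays


def select_masked(predicate=lambda ssn: True):
    """Model a query run by a non-privileged user.

    The predicate is evaluated against the real SSN, but the SSN that
    comes back in the result set is always the mask.
    """
    return [(name, MASK) for name, ssn in ROWS if predicate(ssn)]


# No filter on the SSN: every row comes back and nothing is learned.
everyone = select_masked()

# Filter on the first digit: the values are still masked, but the set of
# rows returned tells us exactly whose SSN starts with 1.
starts_with_1 = select_masked(lambda ssn: ssn.startswith("1"))
print(starts_with_1)
```

Repeating the filtered call for each candidate digit sorts every row into one of ten buckets, which is the leak the rest of the post exploits.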

With 90 queries (ten candidate digits for each of the nine digit positions) we could get all of the information that we need, but that’s too much work. First we need to verify the format of the SSN in the column. Perhaps it has dashes and perhaps it doesn’t. Let’s say that our malicious end user gets lucky and both of the following queries return a count of one million rows:

It’s a reasonable assumption that the SSN is in a XXX-XX-XXXX format, even though the data mask doesn’t tell us that directly.
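The two format checks are not shown here, so the following is only a guess at the kind of test involved, written as a Python model rather than T-SQL. The data set, the `count_where` helper, and the specific predicates (length 11, dashes at the fourth and seventh characters) are assumptions for the sketch.

```python
import random

random.seed(0)


def random_ssn():
    """Build one random value in XXX-XX-XXXX form (hypothetical test data)."""
    area = random.randrange(1000)
    group = random.randrange(100)
    serial = random.randrange(10000)
    return f"{area:03d}-{group:02d}-{serial:04d}"


SSNS = [random_ssn() for _ in range(1000)]


def count_where(predicate):
    """Model SELECT COUNT(*) ... WHERE <predicate on the hidden SSN>."""
    return sum(1 for ssn in SSNS if predicate(ssn))


total = count_where(lambda s: True)
eleven_chars = count_where(lambda s: len(s) == 11)
dashes_in_place = count_where(
    lambda s: len(s) == 11 and s[3] == "-" and s[6] == "-"
)

# If all three counts agree, every row has the dashed 11-character layout.
print(total, eleven_chars, dashes_in_place)
```

Only counts are returned, so no masked value is ever displayed, yet the layout of the column is confirmed.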

Looping to Victory

Armed with our new knowledge, we can create a single SQL query that decodes all of the SSNs. The strategy is to define a single CTE with all ten digits and to use one CROSS APPLY for each digit in the SSN. Each CROSS APPLY only references the SSN column in the WHERE clause and returns the matching prefix of the SSN that we’ve found so far. Here’s a snippet of the code:

In the d1 derived table the first digit is found. That digit is passed to the d2 derived table and the first two digits are returned from d2. This continues all the way to d9 which has the full SSN. The full query is below:
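For readers who want the prefix-extension idea in isolation, here is a Python model of it. This is not the T-SQL query: `make_oracle` and `decode_by_prefix` are hypothetical helpers, and the oracle stands in for a `LIKE 'prefix%'` predicate that answers only yes or no.

```python
DIGITS = "0123456789"


def make_oracle(secret):
    """Return a yes/no function: does the hidden value start with prefix?"""
    def starts_with(prefix):
        return secret.startswith(prefix)
    return starts_with


def decode_by_prefix(starts_with, layout="ddd-dd-dddd"):
    """Recover the hidden value one character at a time.

    Each step plays the role of one derived table (d1, d2, ...): it takes
    the prefix found so far and extends it by the one digit that matches.
    """
    prefix = ""
    for slot in layout:
        if slot == "-":
            prefix += "-"  # dashes sit at known positions, no probing needed
            continue
        for digit in DIGITS:
            if starts_with(prefix + digit):
                prefix += digit
                break
        else:
            raise ValueError("no digit matched; layout assumption is wrong")
    return prefix


print(decode_by_prefix(make_oracle("123-45-6789")))
```

The oracle is never asked for the value itself, only whether a guess matches, which is exactly what a predicate on a masked column allows.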

On my machine, this query takes an average of 5952 ms to finish. Here’s a sample of the results:

[Image: a12_sample_results]

Not bad to unmask one million SSNs.

Looping Even Faster to Victory

The LIKE operator is a bit heavy for what we’re doing. Another way to approach the problem is to have each derived table just focus on a single digit and to concatenate them all together at the end. I found SUBSTRING to be the fastest way to do this. The full query is below:
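What follows is not the T-SQL query itself, only a Python model of the per-position idea. The `char_equals` oracle stands in for a comparison on one character of the hidden column; the helper names and the fixed XXX-XX-XXXX layout are assumptions. Note that Python indexes from 0, whereas SUBSTRING positions in T-SQL start at 1.

```python
DIGITS = "0123456789"

# 0-based positions of the nine digits in XXX-XX-XXXX (dashes at 3 and 6).
DIGIT_POSITIONS = [0, 1, 2, 4, 5, 7, 8, 9, 10]


def make_oracle(secret):
    """Return a yes/no function: is the character at pos equal to digit?"""
    def char_equals(pos, digit):
        return secret[pos] == digit
    return char_equals


def decode_by_position(char_equals):
    """Find each digit independently, then concatenate at the end."""
    found = []
    for pos in DIGIT_POSITIONS:
        # Try 0..9 and keep the first digit that matches at this position.
        digit = next(d for d in DIGITS if char_equals(pos, d))
        found.append(digit)
    joined = "".join(found)
    return f"{joined[0:3]}-{joined[3:5]}-{joined[5:9]}"


print(decode_by_position(make_oracle("123-45-6789")))
```

Unlike the prefix approach, no step depends on the result of an earlier one, so each comparison is a cheap single-character test.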

This query runs in an average of 1833 ms on my machine. The query plan looks as you might expect. Each cross apply is implemented as a parallel nested loop join against a constant scan of 10 values. On average each constant scan operator produces roughly 5.5 million rows. This makes sense, since for each loop we’ll need to check an average of 5.5 values before finding a match, assuming perfectly distributed random digits. Here’s a representative part of the plan:

[Image: a12_query1]
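The 5.5 figure can be checked with a few lines of arithmetic. This sketch assumes, as the text does, that each digit is uniformly random and that candidates are tried in a fixed order until one matches.

```python
from fractions import Fraction

# If the matching digit is equally likely to be the 1st, 2nd, ..., 10th
# candidate tried, the expected number of candidates checked is the mean
# of 1..10.
expected_checks = Fraction(sum(range(1, 11)), 10)  # 11/2, i.e. 5.5

rows = 1_000_000
digit_positions = 9

# Rows produced by one constant scan across the whole table.
per_scan = expected_checks * rows  # 5,500,000

# Rows produced by all nine constant scans together.
all_scans = per_scan * digit_positions  # 49,500,000

print(float(expected_checks), int(per_scan), int(all_scans))
```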

Letting SQL Server do the Work

With nine digits we end up reading almost 50 million values from the constant scan operators. That’s a lot of work. Can we write a simpler query and let SQL Server do the work for us? We know that SSNs are always numeric, so if we had a table full of all billion possible SSNs then we could join to that and just keep the value from the table. Populating a temp table with a billion rows will take too long, but we can simply split up the SSN into its natural three parts and join to those tables. One way to do this is below:
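The following is a Python model of the lookup-table idea, not the T-SQL from this post. The three lists play the role of the temp tables, and `hash_join` is a hypothetical helper: the join condition reads the hidden column, but the output is taken only from the lookup side, which is not masked. The names `AREA`, `GROUP`, and `SERIAL` follow the usual names for the three parts of an SSN.

```python
# Every possible value for each of the three SSN parts: 11,100 rows in
# total instead of one billion.
AREA = [f"{n:03d}" for n in range(1000)]
GROUP = [f"{n:02d}" for n in range(100)]
SERIAL = [f"{n:04d}" for n in range(10000)]


def hash_join(hidden_values, key_of, lookup):
    """Model a hash join against a lookup table.

    Build a hash table on the lookup values, probe it with an expression
    over the hidden column, and emit the lookup table's own value.
    """
    build = {value: value for value in lookup}
    return [build[key_of(hidden)] for hidden in hidden_values]


def unmask(hidden_ssns):
    """Join to each lookup table and stitch the three parts back together."""
    areas = hash_join(hidden_ssns, lambda s: s[0:3], AREA)
    groups = hash_join(hidden_ssns, lambda s: s[4:6], GROUP)
    serials = hash_join(hidden_ssns, lambda s: s[7:11], SERIAL)
    return [f"{a}-{g}-{s}" for a, g, s in zip(areas, groups, serials)]


print(unmask(["123-45-6789", "000-00-0000"]))
```

Each row now costs three hash probes rather than an average of 5.5 comparisons for each of nine digits.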

The query now runs in an average of 822 ms. Note that I didn’t try very hard to optimize the inserts into the temp tables because they finish almost instantly. Taking a look at the plan, we see a lot of repartition stream operators because each hash join uses a different join column:

[Image: a12_repartition]

Can we go faster?

Batch Mode to the Rescue

With parallel batch mode hash joins we don’t need to repartition the streams of the larger outer result set. I changed the query to only look at the table with 10000 rows to get more consistent and even parallel row distribution on the temp tables. I also added a clustered index on the temp table for the same reason. In addition to that, maybe we can expect joins to be faster with INT join columns as opposed to VARCHAR. With the canonical #BATCH_MODE_PLZ temp table to make the query eligible for batch mode, the query now looks like this:

The query now runs in an average of 330 ms. The repartition stream operators are no longer present:

[Image: a12_no_repart]

It wasn’t clear to me how to speed this query up further. The probe residuals in the hash joins are one target:

[Image: a12_probe]

These appear because SQL Server cannot guarantee that hash collisions won’t occur. Paul White points out the following:

If the join is on a single column typed as TINYINT, SMALLINT or INTEGER and if both columns are constrained to be NOT NULL, the hash function is ‘perfect’ – meaning there is no chance of a hash collision, and the query processor does not have to check the values again to ensure they really match.

Unfortunately, the probe residual remains even with the right temp table definition and adding explicit casts and non-null guarantees to the SUBSTRING expression. Perhaps the type information is lost in the plan and cannot be taken advantage of.

Final Thoughts

I don’t think that there’s really anything new here. This was mostly done for fun. Decoding a million SSNs in half a second is a good trick and a good reminder to be very careful with expectations around how much security data masking really gives you. Thanks for reading!

Editor’s note: This article was originally published at erikdarlingdata.com.