Data Masker for SQL Server is pretty cool. It can reduce your attack surface area and exposure under the GDPR or any other regulation. There are plenty or rules and templates to help you anonymize data, in ways you might not have expected. Data Masker is a part of SQL Provision and the SQL Data Privacy Suite.
In this post, I want to show you a common technique that can be used to anonymize names. That’s a fairly common requirement for many datasets, and Data Masker makes it easy.
Last Names
Since Redgate is a UK company, we call last names, surnames. These might also be family names in your culture, but these are the common names shared by everyone in a family.
In any case, there is a series of datasets for last names/surnames. If I look through the list, I see these:
- Names, Surname Suffixes
- Names, Surname Titles
- Names, Surname Titles (Short List)
- Names, Surnames, Random (DE)
- Names, Surnames, Random (PT)
- Names, Surnames, Random (ES)
- Names, Surnames, Random (FR)
- Names, Surnames, Random (Large List)
- Names, Surnames, Random (NL)
- Names, Surnames, Random (Short List)
Each of these is a file with a list of names. The encoding is binary, but if you choose one, you can click the “Sample” button to see some of the entries. For example, here are the Last Names (Large List) sample:
There are country oriented sets and then a large and small list of English names. In this case, I’ll map the large list to my customer_lastname column. I could make these all upper case or choose unique values if I choose. I won’t here.
I could also use the WHERE Clause and Sampling tab to limit the replacements. The default it to replace rows where the value is NOT NULL or empty. I won’t bother and I’ll just create this rule.
First Names
Replacing first names is a two stage process, or it can be if you care. Gender can matter with first names, so the process can be one of two methods:
- replace gender = ‘F’ with female names
- replace gender=’M’ with male names.
This is slower as it requires checking in both passes. A better solution might be:
- replace all rows with female names
- replace gender=’M’ with male names.
In this case only one set of checks is made in the second step. I’ll do this.
First, I’ll modify the existing substitution rule to add customer_firstname to the list of replacements. In this case, I’ll choose the Names, First Names, Female as the data set.
I’ll leave the WHERE clause alone and save this rule.
Once this is done, I’ll create a new substitution rule. For the dataset, I’ll choose male first names.
Once this is done, I want to limit replacements, so I’ll go to the WHERE clause tab and enter a valid T-SQL clause:
I’ll save this rule and then I see this:
Both rules are listed in order, but there is no linkage. My data masking can potentially run these two rules in parallel. I don’t want that. The second rule depends on the first one having run.
I can add a dependency by dragging the lower rule onto the type of the upper rule. The object behavior is a little funny here, so experiment with “grabbing” the lower rule by clicking on it and holding, them moving it up. You should end up with an indent, signifying dependency.
Now I’ll run the masking set, after querying the table in a window. Once the masking is run, I’ll open a second window in a vertical tab and re-run the same query. You can see the before and after below:
This is a basic look at a multi-stage process to anonymize data in a way that makes sense to your application. Using this technique, I can get production like data in lower environments, without using production data.
Data Masker does some really interesting things as far as cleaning and masking data, so I’d suggest you give this a try as a way to process your lower environments.
If you have questions or want to see other masks, let me know. I also have other articles on Data Masker if you’re interested.