Randomizing some data

Question

Post reply

Randomizing some data

Luke L

SSC-Dedicated

Points: 35547
More actions
June 4, 2010 at 9:50 am

#219679

So I've been working to randomize/anonymize some data so that I can use production volumes of my data to do some user training with a new reporting solution that we'll be rolling out without directly querying the production data. Having the training data will allow us to setup certain situations where we can identify outliers without them changing etc, also there are some privacy issues I'm attempting to work around.
Attached is a script to create some sample data, I didnt' want this post to get unnecessarily long (though I think it's too late for that). What I've got to work with... a table that contains names(firstname, lastname and other information) I've represented it with some sample data called #GoodNames. Then I've got 2 additional tables one with first names and one with last names (thank you Data.gov) in the sample script they are #FirstNames and #LastNames
Business Requirements:
I'd like to replace each name with a random name from the list of names, So that Bob Smith might become Jane Doe, but ideally all of the Bob Smiths need to become Jane Doe. I'd love to just generate some random text in the name columns but apparently people can't make the leap from 'xyz' to 'Bob'. I'd love to use the redgate data generator, but I've got no funding again this year :(...
What I've tried so far.
--Get Max range for random value
DECLARE @lNameCount INT,
@fNameCount INT
SELECT @fNameCount = COUNT(*)
FROM [#FirstNames];
SELECT @lNameCount = COUNT(*)
FROM [#LastNames];
--Execute my Query
SELECT gn.[ID],
[FirstName],
CAST(((@fNameCount + 1) - 1) * RAND(CHECKSUM([FirstName])) + 1 AS INT) AS randomFID,
fn.[Name] AS RandFName,
[LastName],
CAST(((@lNameCount + 1) - 1) * RAND(CHECKSUM([LastName])) + 1 AS INT) AS randomLID,
ln.[Name] AS RandLName
FROM [#GoodNames] gn
INNER JOIN [#FirstNames] fn
ON CAST(((@fNameCount + 1) - 1) * RAND(CHECKSUM([FirstName])) + 1 AS INT) = fn.[id]
INNER JOIN [#LastNames] ln
ON CAST(((@lNameCount + 1) - 1) * RAND(CHECKSUM([LastName])) + 1 AS INT) = ln.[Id]
I've been thinking about this problem for a while and just can't get my head wrapped around how to approach it. I've thought about trying to take a checksum of the firstname and of the lastname and somehow using the modulo or rand functions equate Bob to Jane and Smith to Doe based on the ID columns in the #Firstname and #LastName Tables but I've been unsuccessful in generating a decent random population. It seems that the checksum is forcing too many semi-random values to be the same or at least not random enough. That and performance on this is rather a bit of a dog.
I was hoping someone out there may have done something similar and be able to offer to additional thoughts or techniques.
Thanks in advance,
-Luke.
Edit: Fixed some spellign and such.
To help us help you read this[/url]For better help with performance problems please read this[/url]

Viewing 3 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

LutzM SSC Guru Points: 107047 More actions · Answer 1

Disclaimer: I never had to do it, so my attempt might not really be a good solution...

Here's how I'd do it:

Assuming sample data like you provided I would generate two tables holding the distinct FirstName's and LastName's from your #GoodNames table.

Then I would assign a random number using NEWID() to each of those values. Next step would be to renumber those names using rownumber() order by newid_col.

Based on those numbers you could match the corresponding tables and replace the names. If you have different real names than "fake" names you could calculate the max from both and use it to assign the values. If you'd have twice as much distinct good first names you might end up with a replacement of "Jake" and "Alex" with "Barney".

Within one replacement function the result will be consistent, meaning all Jake's and Alex' will be renamed to Barney.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Luke L SSC-Dedicated Points: 35547 More actions · Answer 2

Thanks for the thoughts Lutz. I think that's basically what I'm doing or at least attempting to achieve. At first I tried grabbing a distinct set of firstnames and then lastnames, but I wasn't entirely happy with those lists, so I downloaded the most popular names from Data.gov (more specifically the 1990 US census data) and added those to the mix. It helped to increase the size of my first and last name lists.

I'm grabbing the max number of rows from the firstnames and lastnames lists so that I can use rand(checksum(firstname)) and generating a semi-random integer based on the data in the name column then using that to join to my firstnames and lastnames lists so that I can update them all in one set instead of a bunch of different loops.

This seems to be working for the moment, it just seems a bit inefficient, particularly for something that I may want to do more than just once. I was hoping someone out there had done it before and had a slick way to handle it

Thanks again for the input. It validates that I wasn't completely off my rocker last week when I was investigating this.

-Luke.

To help us help you read this[/url]For better help with performance problems please read this[/url]