Obfuscated but readable data

  • I'm looking for ideas from the group, not sure where this post belongs but here goes...

    As a consultant, I would like to demonstrate some of my existing apps to prospective customers, but I don't want to show them real customer data. The goal is basically to obfuscate people's last names and company names, but if you simply replace letters randomly you get gibberish, for example by moving each letter forward by one letter of the alphabet

    MITCHELL becomes NJUDIFNN

    My second thought was to replace consonants with consonants, and vowels with vowels, like this

    MITCHELL becomes NOVDJIMM

    The problem is that the English language has "digrams" and "trigrams" which are multi-letter combinations like th, sh, nd, ll, rr, tch, and so forth. If there was a way to recognize those digrams and replace them in the same manner as my 2nd idea above, that "should" produce obfuscated but readable names.

    I've Google'd obfuscation but most articles relate to protecting the program code, not the data.

    Any ideas would be most appreciated.

    TIA,

    Bill Mitchell

  • If you're worried about people seeing the data - don't use a pattern. Especially if you're handing that stuff out on "demo CD's". Patterns are altogether easy to pick up, and some people do substitution codes naturally, to a degree that they'd be able to pick out the pattern pretty much naturally.

    Personally - i would NOT start with existing data. Period. I don't want the exposure of being accused of not "obscuring" the data enough.

    If you're hell-bent on starting from "real" data:

    - For names you'd want to look at a names table, and randomly assign last and first names on top of the "real names". There are publicly available tables of first and last names (one I remember seeing somewhere on the Census Bureau's web site) you could use for that.

    - random 9-digit generator to replace SSN/phone numbers/TIN/EIN/ etc....

    - again - some random generator to come up with address lines.

    - most anything else - replace with a solid set of XXXXXXX or 111111.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • I agree with Matt. Stay away from simply trying to mask real data. If you need to get the precise types of data distribution that real data gives you, then look at replacing things using generated data of some sort. Several third party tools will do the job. I've been beta testing a nice one from RedGate that will be out soon.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • The attached files contain the setup of tables of the 100 most common digrams and the most common 98 trigrams from the English. There is also the function fnObfuscate which will use these tables to blur the data.

    All trigrams in the table are replaced followed by any digrams. Any remaining single letters are then replaced.

    If your recipient doesn't know how you've obfuscated your data it'll take one hell of a brute force attack to decode it.

    HTH,

    Ade



    Ade

    A Freudian Slip is when you say one thing and mean your mother.
    For detail-enriched answers, ask detail-enriched questions...[/url]

  • I've usually just randomized names, take 100 last names, randomly replace existing data in there.

    Phone numbers can be sequenced, all 555-0000, 555-0001, etc.

    Other stuff, just need to move it around. It should look real otherwise people get caught up on William MTSHOD-who?. If it says John Smith, Steve Jones, etc., no one cares.

    Red Gate also has a new test-data generator. I haven't used it, but need to and see how well it works.

  • Thanks to everybody who responded, especially Adrian - very clever idea using the placeholders. I had no idea how to approach selectively replacing 1, 2 or 3 letter combinations without over-writing what had already been replaced.

    I guess could have avoided some flack by being more specific in how I would use this - I would never send one customer's app or data to another. The reason I posed the question is that I have a number of apps installed on my laptop that I would like to use to demonstrate the types of programming that I had done for other companies - user interfaces, reporting, and things like that.

    In the past I've pulled names out of Northwind or other sample databases, but there's not enough volume of data to really make a good impression.

    Thanks again to all.

  • Check if the script in the link helps.

    http://www.sql-articles.com/index.php?page=Scripts/encrypt_decrypt.html

    Cheers,
    Sugeshkumar Rajendran
    SQL Server MVP
    http://sugeshkr.blogspot.com

  • Never, never, never use customer confidential data in a demo or presentation. You probably signed, or your clients should have had sign, some type of non-disclosure. You won't be a trusted (i.e. employable) consultant for long if confidential data is exposed and you may end up in court. Read Grahm-Leach-Bliley.

    You've probably spent time researching how to obfuscate data and still have it work. It'll be worth it to take the time to build a dataset from scratch.

  • Just to clarify, I said that I never give or display customer data to another customer. But re-creating a relational dataset from scratch is not a trivial task for a real-world database.

    In SQL terms, giving out customer data is like

    WHERE Possible_Outcome IN ('fired', 'sued', 'arrested', 'fined', 'lost reputation')

  • Good one. Thanks for the laugh.

    I know it's not trivial. Only knowing your tables in general terms

  • Oops! Little quick on the draw.

    Good one. Thanks for the laugh.

    I know it's not trivial. Only knowing your tables in general terms I'd make some external tables:

    FirstNameTable

    LastNameTable

    AccountNumber

    SocSecNumber

    etc

    and read from them and insert to actual table.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply