Who is that masked man anyway?

  • Comments posted to this topic are about the item Who is that masked man anyway?

  • I'm primarily a BI developer and the reason for using production data is to catch the edge cases the client forgot to mention. I once got to 98% of transactions classified within a week then spent three months trying to get the 2% that dropped out of the simulated data. That was still 1m+ records so very important for making the DW sum correctly.

  • I'm a bit puzzled by the "if you really, really need to performance test" part. Who doesn't do performance tests before releasing something to production? It's a part of the development process.

  • There is a third option than masking or simulation.

    It is swapping.

    Suppose you have customers A B C D E F G

    and you swap them with G F E D C B A

    Then the data doesn't represent the real data any more but the distribution is the same.

  • Following on from the above approach, I am sure we could do effective masking by SQL scripting:

    - Names, if we had a table of equivalent names we could use these as replacements, eg BERT JAMES SMITH --> BILL JIMMY SMITH

    - Same thing for addresses, we could have tables of common town names and replace a valid name with another random valid name

    - When replacing names and addresses, we should keep the replacement text the same length to avoid possible overflows

    - Phone numbers and email addresses could be replaced by a random string of equivalent digits

    - Account numbers could also be replaced by an equivalent random string.

    I guess the problem is, some of these fields participate in PK/FK relationships so they would have to be kept consistent. Also from experience I know account numbers can end up in lots of different fields and it may not be easy to track these down.

    But if done effectively, we should have a database that 'looks like' the production data, but the production data could not be recovered from it.

  • Ed Wagner (10/22/2016)


    ...Who doesn't do performance tests before releasing something to production?...

    Surprisingly many. Too many.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • William Rayer (10/24/2016)


    Following on from the above approach, I am sure we could do effective masking by SQL scripting:

    - Names, if we had a table of equivalent names we could use these as replacements, eg BERT JAMES SMITH --> BILL JIMMY SMITH

    - Same thing for addresses, we could have tables of common town names and replace a valid name with another random valid name

    - When replacing names and addresses, we should keep the replacement text the same length to avoid possible overflows

    - Phone numbers and email addresses could be replaced by a random string of equivalent digits

    - Account numbers could also be replaced by an equivalent random string.

    I guess the problem is, some of these fields participate in PK/FK relationships so they would have to be kept consistent. Also from experience I know account numbers can end up in lots of different fields and it may not be easy to track these down.

    But if done effectively, we should have a database that 'looks like' the production data, but the production data could not be recovered from it.

    I have seen this being done successfully. The data maintains it's integrity and shape. The information is hidden.

    This is particularly worthwhile for long living data e.g. systems expected to run multi-year or even multi-decade.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • I worked at one place that had an "AS-LIVE" environment. It was an exact copy of the LIVE environment as it was two weeks ago. It was accessible to a very restricted group of people and was explicitly for testing performance and how releases would affect the production environment.

    Data science poses a range of problems. For normal BI work I've found that when people do have access to stuff they shouldn't it is because they were tasked with achieving a business objective and found a way to deliver it. Effectively they had been granted defacto permissions to circumvent security and compliance.

    When the time came for security and compliance to come into ascension it was possible to restrict privileges as there was no legitimate need for BI people to have access to highly confidential data.

    Data science is a bit different. Ultimately they still have to operated within the law. The scenarios are when they spot a correlation between women called Gail, Pisces and purchase of airline tickets.

    DOB and names should be confidential but are necessary for that correlation to be spotted. Sometimes the correlations aren't for marketing purposes, they are for fraud detection.

  • Gary Varga (10/24/2016)


    Ed Wagner (10/22/2016)


    ...Who doesn't do performance tests before releasing something to production?...

    Surprisingly many. Too many.

    Sadly, you're probably right.

  • Ed Wagner (10/24/2016)


    Gary Varga (10/24/2016)


    Ed Wagner (10/22/2016)


    ...Who doesn't do performance tests before releasing something to production?...

    Surprisingly many. Too many.

    Sadly, you're probably right.

    And then there are those that buy in third-party software and then grow way beyond what the third party was expecting when they designed their systems...

    Thomas Rushton
    blog: https://thelonedba.wordpress.com

  • Ed Wagner (10/24/2016)


    Gary Varga (10/24/2016)


    Ed Wagner (10/22/2016)


    ...Who doesn't do performance tests before releasing something to production?...

    Surprisingly many. Too many.

    Sadly, you're probably right.

    Where I'm at now, as the DBA I've had to start a process where I do a code review of the developers stored procedures, because there were too many bad practices getting into production database code. It's been an uphill struggle to clean up all the existing problems but we are making progress, and I'm ensuring new bad database queries are few and far between.

    I'm not convinced that people use production data in testing primarily for performance purposes, At the many places I've worked at it seems to be more for all the business cases that need testing. Its very easy to generate large volumes of data, but if it doesn't model the way the real business works it's worthless.

  • Ed Wagner (10/22/2016)


    I'm a bit puzzled by the "if you really, really need to performance test" part. Who doesn't do performance tests before releasing something to production? It's a part of the development process.

    You may believe that. I certainly believe that.

    There are many producing software nowadays who absolutely will not accept it. "If it works Ok on 20 rows in dev, it must be a server problem". I've worked with, and implemented software by, shedloads of them. Where a logo goes, or how big it is though ... that's a BIG ISSUE.

    I'm a DBA.
    I'm not paid to solve problems. I'm paid to prevent them.

  • @andrew gothard, @Ed Wagner etc.

    I was not trying to imply that performance tests were unnecessary. On the contrary, I think that they are essential. What does slightly irritate me is the idea amongst some developers that it is impossible to do any development work without using production data. That was what was behind the 'really, really' phrase. Yes, I put a lot of effort to get performance right and the emphasis here is on accurate measurement, familiarity with the query plans, getting the indexing strategy right and having a properly normalised design. Where production data is particularly useful is in making sure that your database performs well with the plans generated with the volume and distribution of production data. If there are good practical reasons for restricting access to production data - and there is nothing new to this, there are established ways to get around this for performance tests without having a huge impact on development. There are other good reasons for having access production data during deployment such as testing your migration scripts. I'd be interested to hear how other developers have got around that where access to production data is restricted.

  • Can you be more detailed about the legal cases you have seen? A corporation is considered a person under the law, so I'm trying to figure out what specifically, and in what fields I can't manipulate my own data, as long as I don't report it. Or just leave it on the customer's test server.

    412-977-3526 call/text

  • I'm not too concerned about how masking or hashing of demographic data in development skews data statistics and performance testing. When coding SQL or designing indexes, it's best not to make strong assumptions about the distribution of data, because that can drift over time in production anyhow. When writing new stored procedures, I'll typically unit test some of the more complex SQL queries in production and compare it's execution plan against development as a final reality check before committing the final version of the code.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply