Creating Test Data

  • Comments posted to this topic are about the item Creating Test Data

  • The more you dig into data generation, the more complicated it gets.  Even if you have a good data generator, configuring it to produce data that looks like your production data can be time consuming, let alone running it.

    One of the challenges is that generating a particular type of data, say postal codes, is simple. Generating the sort of data that comes out of real systems, complete with unexplained and theoretically impossible errors, is a different matter.

    I had a lot of fun experimenting with the Redgate Data Generator and wrote up my experiments.  One of my biggest challenges is generating test data that is realistic and might not be destined for an RDBMS.
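
    For a feel of what generating a pile of plausible-but-dirty rows looks like, here's a minimal sketch in T-SQL (assuming SQL Server; the table and column names are hypothetical) that builds pseudo-realistic rows and deliberately salts in a few "impossible" values:

        -- 100,000 rows of hypothetical customer data from a cross-joined row source.
        WITH Tally AS
        (
            SELECT TOP (100000)
                   N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
              FROM sys.all_columns ac1
             CROSS JOIN sys.all_columns ac2
        )
        SELECT  CustomerID = N
               ,CreatedOn  = DATEADD(dd, -ABS(CHECKSUM(NEWID())) % 3650, GETDATE()) -- random date in the last ~10 years
               ,Amount     = CAST(CHECKSUM(NEWID()) % 1000000 AS MONEY) / 100.0     -- random amount, occasionally negative
               ,PostalCode = CASE WHEN N % 1000 = 0
                                  THEN '00000'                                      -- 1 row in 1,000 gets a theoretically impossible code
                                  ELSE RIGHT('0000' + CAST(ABS(CHECKSUM(NEWID())) % 100000 AS VARCHAR(5)), 5)
                             END
          INTO  #TestData
          FROM  Tally;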


  • In my org, IT doesn't work with confidential data such as payroll or sensitive company secrets, so we are allowed to see the production data. Our DBAs copy production data to our DEV and TEST environments for us to work with. This way we have a real-world data set. It seems like it would be a lot of effort to generate our own data, and it may lead to the temptation one of my professors warned against: "Don't program around the data set."

  • richard.cottave wrote:

    In my org, IT doesn't work with confidential data such as payroll or sensitive company secrets, so we are allowed to see the production data. Our DBAs copy production data to our DEV and TEST environments for us to work with. This way we have a real-world data set. It seems like it would be a lot of effort to generate our own data, and it may lead to the temptation one of my professors warned against: "Don't program around the data set."

    I don't know. If you use your real-world experience to identify scenarios of bad data, and then want to check for them in the future, you need positive/negative test cases for them. If you can't leave that problem in your production data because someone has cleaned it up, then you no longer know whether your software is handling that case correctly.

    So, maybe "don't program [solely] around the data set", but augment the data set to include each new scenario you find and fix in production.

    -------------------------------------------------------------------------------------------------------------------------------------
    Please follow Best Practices For Posting On Forums to receive quicker and higher quality responses

  • If you don't have sensitive data in production, that does make things easier, but as data sets grow, dev environments lose agility if you are using production data. In most cases, like developing a solution or running unit tests, you don't want the full set of production data because it's too large. Hydrating new environments or resetting them becomes slower than it needs to be. Things like CI need to be quick, so we want a limited data set.

    It's not hard, but it is work that few people want to do. The same could be said of unit testing, until it became something useful.


  • It is a unique challenge. The best approach I've ever seen was an organisation that applied a behaviour-driven development practice to their data layer. Their ETL, databases and reporting layers were subject to behavioural tests and had data generated to exercise those tests. Rather than relying on current data to inform the behaviour of the data layer, they took a long, hard look at it and determined what the real constraints, rules, etc. around it were.

    For example (mocked; a sketch of the first rule as a data test follows the list):

    • this date field must be in the range of 1 January 1990 to the end of the previous calendar year. It should not be more than 3 months after the creation date of the record. If it is more than 3 months after the creation date of the record, there must be an exemption recorded by the supervisor no later than 1 week after the 3 month deadline.
    • when you sign up a new customer with the name "John Smith" they are not allowed to overwrite an existing "John Smith" and should appear in the tabular model as a separate customer.
    • when you run the report ABC and the selected customer has a negative account balance, their name must be highlighted in red (CodedUI at the time).
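
    Purely as an illustration, the first rule above might be expressed as a data test along these lines (T-SQL; every table and column name here is hypothetical):

        -- Rows violating the date rule should come back for investigation.
        SELECT  r.RecordID, r.EventDate, r.CreatedOn
          FROM  dbo.Records r
          LEFT  JOIN dbo.Exemptions e
                  ON  e.RecordID   = r.RecordID
                  AND e.RecordedOn <= DATEADD(wk, 1, DATEADD(mm, 3, r.CreatedOn)) -- exemption within 1 week of the 3-month deadline
         WHERE  r.EventDate <  '19900101'                               -- before 1 January 1990
            OR  r.EventDate >= DATEFROMPARTS(YEAR(GETDATE()), 1, 1)     -- past the end of the previous calendar year
            OR (r.EventDate >  DATEADD(mm, 3, r.CreatedOn)              -- more than 3 months after creation...
                AND e.RecordID IS NULL);                                -- ...with no timely supervisor exemption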

    They started off at a high level and drilled down where there was real significance in the business rules. It really changed the way you look at the data, because you're dealing with the rules around your data and the way it should behave, rather than what the previous, systemic behaviour was.

    It took us around 4 years to get the whole system covered. A long time, and a lot of technical debt, but we were building a scrum team, building automated testing around the solutions, etc., and it was just absorbed alongside.


  • Steve Jones - SSC Editor wrote:

    Things like CI need to be quick, so we want a limited data set.

    Gosh... I absolutely cringe when I hear things like that.  That's one of the reasons why I frequently say that CI has become one of the fastest ways to deploy errors in code.

    Why do you state that "we want a limited data set" (presumably talking about test data) just because it's CI? Since you cited Andy's code, what would the difference be in testing against 10 million rows instead of the 75,000 rows he generated? The answer is about 1 second to test and about 1 second to generate the 10 million rows on a decent server... and tons of saved apologies to the customer, because Andy's code actually has a hidden flaw in it that his test data doesn't pick up on.

    The bottom line here is that creating test data for a whole lot of different things just isn't that hard (and you very quickly get better at it the more you do it). If you're actually writing code, you should be prepared to quickly test it for both accuracy and scalability. Using limited data sets is a big mistake, especially the more quickly you want to do CI.

    In fact, Andy's code actually fails to generate the report correctly when you use it against a quick 10 million row data set. He ran it against a small heap that was created in sorted order. Run it against something like a heap created in a different order, or even something that might have a unique index in a different order, and his code doesn't produce the output in the correct sorted order, because he forgot to write that into his code and his "limited data set" didn't produce the otherwise imminent error.

    Also, despite Andy's justification of using longer code to be more readable, testing shows that it's pretty easy to write some shorter, easier-to-understand code that runs about 30% faster and is even easier to read, especially if you add a comment or two. The thing is, you need to learn how to make a shedload of test data, and it's just not hard, especially for this type of thing. It's the gift that keeps on giving, as well, because the more times you test for accuracy and performance and then look for an alternate solution, the better your code will auto-magically become, so that you don't actually have to test for alternates in the future. You'll just know.
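
    If you've never built one, the general shape of that kind of quick row generator looks something like this (a cascading-CTE row source in the same spirit as the fnTally function linked below; the column names are just placeholders):

        WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(N)) -- 10 rows
            ,E4(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c, E1 d)                                -- 10,000 rows
            ,E8(N) AS (SELECT 1 FROM E4 a, E4 b)                                            -- 100,000,000 rows available
        SELECT TOP (10000000)
                RowID    = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
               ,SomeDate = DATEADD(dd, -ABS(CHECKSUM(NEWID())) % 3650, GETDATE()) -- random date per row
               ,SomeAmt  = ABS(CHECKSUM(NEWID())) % 10000                         -- random integer amount
          INTO  #TenMillionRows
          FROM  E8;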

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • One of the hardest things to do when generating test data, as opposed to obfuscating production data, is the non-normal situations. I've found in my years in IT that users will come up with many creative ways to generate bad data that an IT person would never think of, especially in the case where they do one thing wrong by accident, then do 3 more things wrong trying to fix that first thing.

  • Jeff, you want CI to be fast, because you want to get feedback quickly. This runs constantly, so it needs to be fast.

    This IS NOT THE ONLY TESTING done with your app. It's a part of it, and a good CI process will test in a few different runs, each with slightly more complex tests, but you'll also have additional automated and manual tests later before you release software.

    Generating lots of test data can help you see the performance of the optimizer, but it doesn't always cover the cases in your app. When you add in dates with names, addresses, orders, locations, etc., it's not a few seconds to generate test data that covers your business; it's an ongoing process. What Pascal described is what some of my very successful customers do. They absorb this into the dev process and slowly and constantly add to their test data set, getting it to resemble their production system and all the crazy things that clients do.

  • Chris Harshman wrote:

    One of the hardest things to do when generating test data, as opposed to obfuscating production data, is the non-normal situations. I've found in my years in IT that users will come up with many creative ways to generate bad data that an IT person would never think of, especially in the case where they do one thing wrong by accident, then do 3 more things wrong trying to fix that first thing.

    Amen to that.  I'm currently working on a performance issue where the users insist that a mid-string search is absolutely necessary to "not miss any possibilities", and it's been quite the fight over the last couple of years since I first identified it as an up-and-coming performance issue.

    What I did was look at a very active training system that's in constant use, get permission to make a change to the stored procedure that does the offending mid-string search, and realize that a trailing-wildcard search (which is SARGable and will do a 1-row Seek instead of a forced SCAN of millions of rows) will "always" find the same data as the mid-string search.
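
    To illustrate the shape of the two predicates (with hypothetical object names), only the first is SARGable:

        DECLARE @SearchTerm VARCHAR(50) = 'ACME';

        SELECT  AccountID                                    -- trailing wildcard: can use an index Seek
          FROM  dbo.Accounts
         WHERE  AccountName LIKE @SearchTerm + '%';

        SELECT  AccountID                                    -- mid-string: forces a full scan
          FROM  dbo.Accounts
         WHERE  AccountName LIKE '%' + @SearchTerm + '%';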

    None of the Managers, Project Managers, Business Analysts, or the good folks who wrote the code were buying it.  I even offered that I could make the search conditional, and I tested it.  It turns out that (for the limited data I had) if the trailing-wildcard search doesn't find something, the mid-string search doesn't either.  They're still not buying it, saying that even if the trailing-wildcard search doesn't find something, the mid-string search still might.  Ordinarily, I'd agree with them but, even without doing any testing, neither the prep code nor the underlying data would support such a thing ever happening.  It's just not possible.

    So, my next step is to prove, over a week of usage, that it never happens and that if it does hit the proverbial lottery (which, again, I think isn't possible), it's such a rarity and so benign (it just returns "suggestions" to the end user's screen) that it's just not going to matter, and that the savings in human time and computer resources are well worth it.

    The only way to do that is "in situ".  Test data, in this case, just isn't going to hack it.  It has to be "live action" and it has to be for all clients, not just the one the training system is using.

    There's no CI throw-it-against-the-wall-and-see-what-sticks method that's going to work here.  To be clear, this one is going to have to be a "get it right the first time".

    My point here is that sometimes test data, no matter how cleverly contrived, just isn't going to do the job.  And the code that Steve pointed out in the link to Andy's fine demonstration suffers the same problem.  In fact, when it comes to performance testing, even "Holy Grail" articles that seemingly prove something with test data must be looked at with the proverbial "Evil Eye", because the "Devil's in the Data".  That very thing has led to many people (seriously incorrectly) claiming that the use of concatenated XML is faster than DelimitedSplit8k.  They just don't understand that even millions of rows of "striped data" (repeated identical rows) don't make a valid test harness.

    To that end, making a shedload of data usually isn't enough when it comes to testing the oddest cases of them all, and those are use cases dealing with performance.


    --Jeff Moden

  • @Jeff, how do they know the mid-string search will always work? How is that more provable than your method?

    There's a bit more to CI than throwing it against the wall and seeing if it sticks.  There is no way we could run all the tests we need for a version upgrade if we had to do it manually.  Automated testing isn't susceptible to human frailties: if there are 1500 tests, the system won't overlook some; it'll always run all 1500.  The trick is to make sure the 1500 are all different and useful.

  • David.Poole wrote:

    @Jeff, how do they know the mid-string search will always work? How is that more provable than your method?

    They don't, actually. 😀  It's more of a visceral fear on their part.  They've never measured to see whether a mid-string search will actually ever cover for a trailing-wildcard search that returns nothing.  That's the test I'm setting up.  They've also never quantified what the impact is if the trailing-wildcard search returns a row or rows and the mid-string search returns more rows than that, nor even whether that's possible according to the data and how it's being used.

    My contention is that the trailing-wildcard search will always find what the mid-string search does (in this case) and that executing the mid-string search, either alone or as a result of the trailing-wildcard search returning nothing, is a total waste of time, memory, and other resources.  I just need to prove it to them and, possibly, if it does ever return something else, demonstrate that any extra rows returned by the mid-string search don't actually provide any value (or might even be misleading).
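
    The shape of the test is simple enough (again, all object names are hypothetical): run both searches side by side and log any case where the mid-string search finds rows that the trailing-wildcard search doesn't.

        DECLARE  @SearchTerm    VARCHAR(50) = 'ACME'
                ,@TrailingHits  INT
                ,@MidStringHits INT;

        SELECT @TrailingHits  = COUNT(*) FROM dbo.Accounts WHERE AccountName LIKE @SearchTerm + '%';
        SELECT @MidStringHits = COUNT(*) FROM dbo.Accounts WHERE AccountName LIKE '%' + @SearchTerm + '%';

        IF @MidStringHits > @TrailingHits -- the case they claim can happen and I contend can't
            INSERT INTO dbo.SearchDiscrepancyLog (SearchTerm, TrailingHits, MidStringHits, LoggedAt)
            VALUES (@SearchTerm, @TrailingHits, @MidStringHits, SYSDATETIME());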

    David.Poole wrote:

    There's a bit more to CI than throwing it against the wall and seeing if it sticks.  There is no way we could run all the tests we need for a version upgrade if we had to do it manually.  Automated testing isn't susceptible to human frailties: if there are 1500 tests, the system won't overlook some; it'll always run all 1500.  The trick is to make sure the 1500 are all different and useful.

    Understood and totally appreciated.  That's the way it should work, and I have no bones to pick when it works as you've described (and I'm avoiding the "who checks the people that wrote the test code" problem for this discussion).

    My problem with it is the same as with things like Agile, Knuth's parable, DevOps, and so on: people frequently end up warping such things into something they are not or should not be used for.  One of my favorite examples of this is Knuth's parable, which is generally summarized as "premature optimization is the root of all evil".  While that's incredibly true, I've had people use it as an answer to my question of things like "Why did you design the telephone number column as an NVARCHAR(255)?"  That's not what Knuth intended, nor what I'll accept.  That's just thoughtless, very poor programming, and I've seen and heard it from a whole lot of programmers.

    Your good team uses CI the way it should be used: Continuous Integration and related testing of code written with the intent of success on the first iteration.  I have too often seen programmers use it just like I said... as a wall to throw spaghetti at to see what sticks, instead of just doing it right the first time, and that's what the crux of my previous post on the subject was actually all about.  And that's especially true when it comes to queries against the database(s)... even simple C.R.U.D. or (in particular) ORM-generated queries, never mind when they use the ORM to generate something complex for the sake of mythical portability.

    --Jeff Moden

  • Steve Jones - SSC Editor wrote:

    Jeff, you want CI to be fast, because you want to get feedback quickly. This runs constantly, so it needs to be fast.

    This IS NOT THE ONLY TESTING done with your app. It's a part of it, and a good CI process will test in a few different runs, each with slightly more complex tests, but you'll also have additional automated and manual tests later before you release software.

    Generating lots of test data can help you see the performance of the optimizer, but it doesn't always cover the cases in your app. When you add in dates with names, addresses, orders, locations, etc., it's not a few seconds to generate test data that covers your business; it's an ongoing process. What Pascal described is what some of my very successful customers do. They absorb this into the dev process and slowly and constantly add to their test data set, getting it to resemble their production system and all the crazy things that clients do.

    Absolutely understood.  See my previous response immediately above to Mr. Poole.

    --Jeff Moden

  • Thank you for this @steve-jones-ssc-editor.

    This is one of the important factors that need to be addressed when developers start building any new data application. In my organization, we often use production data from a customer, with their due permission; we mask all the sensitive information like email and phone numbers, and also randomize the nominal fields like First Name and Last Name. This data is then used to demonstrate new features to the customers as and when required.
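
    As a minimal sketch of that kind of masking pass (T-SQL; all object names are hypothetical, and dbo.SampleFirstNames / dbo.SampleLastNames are assumed to hold fictitious names keyed 1 through 1000):

        UPDATE  c
           SET  Email     = CONCAT('user', c.CustomerID, '@example.com') -- deterministic, obviously fake address
               ,Phone     = '555-0100'                                   -- fictional exchange
               ,FirstName = fn.Name                                      -- randomized nominal fields
               ,LastName  = ln.Name
          FROM  dbo.Customers c
         CROSS APPLY (SELECT Name FROM dbo.SampleFirstNames
                       WHERE NameID = ABS(CHECKSUM(NEWID(), c.CustomerID)) % 1000 + 1) fn
         CROSS APPLY (SELECT Name FROM dbo.SampleLastNames
                       WHERE NameID = ABS(CHECKSUM(NEWID(), c.CustomerID)) % 1000 + 1) ln;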

    Aveek has been working as an Analytics Engineer for the past couple of years. His main areas of interest are SQL Server, SSIS, and SSAS. In his leisure time, he likes to travel and is also an amateur photographer.
    https://twitter.com/avikoleum
    https://www.linkedin.com/in/aveekd22/
