Chris Harshman wrote:
One of the hardest things to do when generating test data as opposed to obfuscating production data is the non-normal situations. I've found in my years in IT that users will come up with many creative ways to generate bad data that an IT person would never think of. Especially in the case where they do one thing wrong on accident, then they do 3 more things wrong trying to fix that first thing they did wrong.
Amen to that. I'm currently working on a performance issue where the users insist that a mid-string search is absolutely necessary to "not miss any possibilities" and it's been quite the fight over the last couple of years when I first identified it as an up and coming performance issue.
What I did was look at a very active training system that's in constant use and got permission to make a change to the stored procedure that does the offending midstring search and realized that a trailing string search (which is SARGable and will do a 1 row Seek instead of a forced SCAN of millions of rows) will "always" find the same data as the midstring search.
None of the Managers, Project Managers, Business Analysts, or the good folks that wrote the code were buying it. I even offered that I could make the search conditional and tested it. It turns out that (for the limited data I had) that if the trailing wildcard search doesn't find something, the midstring search doesn't either. They're still not buying it saying that even if the trailing string search doesn't find something that the midstring search still might. Ordinarily, I'd agree with them but, even without doing any testing, neither the prep code nor the underlying data would support such a thing ever happening. It's just not possible.
So, my next step is to prove over a week of usage, that it never happens and that if it does hit the proverbial lottery (which, again, I think isn't possible), that it's such a rarity and so benign (it just returns "suggestions" to the end screen user) that it's just not going to matter and that the savings in human time and computer resources is well worth it.
The only way to do that is "insitu". Test data, in this case, just isn't going to hack it. It has to be "live action" and it has to be for all clients and not just the one the training system is using.
There's no CI throw-it-against-the-wall-and-see-what-sticks method that's going to work here. To be clear, this one is going to have to be a "get it right the first time".
My point here is that sometimes test data, no matter how cleverly contrived, is actually going to do the job. And the code that Steve pointed out in the link to Andy's fine demonstration suffers the same problem. In fact, when it comes to performance testing, even "Holy Grail" articles that seeming prove something with test data, must be looked at with the proverbial "Evil Eye" because the "Devil's in the Data". That very thing has led to many people (seriously incorrectly) claiming that the use of concatenated XML is faster than DelimitedSplit8k. They just don't understand that even millions of rows of "striped data" (repeated identical rows) don't make a valid test harness.
To that end, making a shedload of data isn't (usually isn't) enough when it comes to testing the oddest cases of them all and those are using cases dealing with performance.