Production Subsets

Steve Jones, 2022-06-01 (first published: 2014-08-22)

This editorial was originally published on Aug 22, 2014. It is being republished as Steve is traveling.

Continuous delivery recommends developers never use production data. It's too big, too cumbersome, and slows the process too much. Developers should have enough data to determine whether their solutions work as they build them. Testing should have enough data to do some tuning, but unless you plan on full performance/load tests (which you should), you don't need the full set of production data.

It's an interesting idea, and overall I agree. A subset of data, perhaps hundreds of rows, can usually tell you whether your code works, provided you profile the code and look for inefficiencies. Note that profiling code doesn't mean using Profiler. It means examining the resources used by your code in terms of CPU, I/O, memory, etc. There are tools to help you, and at some point in your development process, you should be using them.

However, it can be time-consuming and cumbersome to build small development data sets. There are lots of choices in how you might do this, and I thought this would make an interesting poll. For those of you who deal with development, whether that's T-SQL, .NET, or something else, what do you think?

Should we have a subset of production data, a custom data set, or perhaps deal with complete production data?

Some of this depends on the size of your production data and, I hope, its contents. I would not want any PII, PCI, medical, etc. data in any development area. However, if that's not a concern, then what do you prefer?

Whether you have a custom data set or a subset of production, it can be cumbersome to keep it up to date. Your data may evolve over time, and there's overhead in maintaining the scripts that produce the data you need. Perhaps that's the cost of writing good software, but I'm curious how you feel.
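
For a sense of what such a script involves, here is a minimal sketch in Python using the standard sqlite3 module. It is only an illustration, not a recommended tool: the customers and orders tables, the build_subset and make_demo_source functions, and the example.invalid addresses are all hypothetical names made up for this example. It picks a handful of parent rows, copies only the child rows that belong to them so foreign keys still resolve, and swaps real email addresses for synthetic ones so no PII reaches development.

```python
import sqlite3

# Hypothetical two-table schema, used for both the source and the dev copy.
SCHEMA = """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
"""


def make_demo_source(customers, orders_each):
    """Build an in-memory stand-in for a production database."""
    prod = sqlite3.connect(":memory:")
    prod.executescript(SCHEMA)
    for c in range(1, customers + 1):
        prod.execute(
            "INSERT INTO customers (id, email) VALUES (?, ?)",
            (c, f"person{c}@realmail.test"),
        )
        for _ in range(orders_each):
            prod.execute(
                "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                (c, 10.0 * c),
            )
    prod.commit()
    return prod


def build_subset(prod, dev, customer_limit):
    """Copy a small, referentially consistent slice of prod into dev."""
    dev.executescript(SCHEMA)
    # Pick parent rows first; ordering by id keeps the result repeatable.
    ids = [
        row[0]
        for row in prod.execute(
            "SELECT id FROM customers ORDER BY id LIMIT ?", (customer_limit,)
        )
    ]
    # Replace real emails with synthetic ones so no PII is copied.
    dev.executemany(
        "INSERT INTO customers (id, email) VALUES (?, ?)",
        [(i, f"user{i}@example.invalid") for i in ids],
    )
    # Copy only child rows whose parent was selected.
    # Fine for a few hundred ids; larger sets would need a temp table or join.
    if ids:
        placeholders = ",".join("?" * len(ids))
        orders = prod.execute(
            "SELECT id, customer_id, total FROM orders "
            f"WHERE customer_id IN ({placeholders})",
            ids,
        ).fetchall()
        dev.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    dev.commit()
```

Even in this toy form, the maintenance cost is visible: every new table, column, or relationship in production means another change to the script, and the masking rules have to keep pace with whatever sensitive data appears.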