Using a Staging Environment for Tracking Down Bugs


A discussion why having a staging environment with data that mirrors production is vital to tracking down issues.

Being able to replicate production bugs is highly important. A key aspect to this is having data that is similar or [preferably] identical in a non-production environment where the issue can be tracked down. Maybe somehow a NULL value got into a database field. Or maybe there’s an issue with the new business logic that just got rolled out. Whatever it is, if the issue can’t be reproduced, fixing it is much more difficult.

At Boston Public Schools, we have a staging environment for tracking down issues. There’s both a staging web server with a clone of the production website as well as a staging database server with data that is refreshed on a nightly basis. In almost all cases this setup has proven sufficient for tracking down issues and replicating user-reported bugs. It helps us get the issue resolved without the risk of modifying real production data.

At this point the staging refresh process is fully automated. A SQL Server backup on production is run early in the overnight period, and at some point later a SQL job is run that copies the database backups to the staging database server and refreshes the databases. Because it takes over an hour to run, it is set up to run nightly, not more frequently than that.

Can the process be improved? Certainly. We recently hired a DBA, and he’s introduced us to the benefits of using database replication. It’s quite possible that the staging data refresh could be changed to use replication instead of database copies and restores. At first, we’d likely use snapshot replication and at some point transition to using transactional replication [which would allow the data to be kept up to date and in synch with production, thus making it even easier to track down bugs]. We’re not there yet, though.

Truth be told, the setup we have now is significantly better than what we had a few years ago – no nightly data refresh at all. In other words, to track down an issue, we had 3 options:

    1) We could log in to the production environment and change real data to replicate the issue.

    2) We could hope that the bug would show up on the development server [which would likely have out of date data].

    3) We could wait an hour or more to copy the latest production backup over and do a restore on the development environment.

None of these options are good solutions – we’ve advanced quite a bit from where we were before.