Managing Test Data as a Database CI Component – Part 2

Once you're clear about what data the developers and testers need to be effective, you have several alternative approaches to providing exactly the data that will give you rock-solid database tests and allow deployments with far less risk. Tim goes through some practical demos to show how you can automatically keep your testing environments up to date, and thereby ship more robust databases.


In Managing Test Data as a Database CI Component – Part 1, we discussed the importance of giving your developers as realistic a test environment as possible, and how that requires that your test data be constantly refreshed. We also looked at ways to generate and scrub the data, as well as several methods for managing test data as a component of continuous integration. In part 2, we’ll be going through some practical demos to show you how you can start automatically keeping your testing environments up to date, and shipping more robust databases as a result.

Automated restores

Our first example uses the automated restore technique. As a DBA I am particularly fond of this approach because it also allows us to test our restores! You would be surprised at how many DBAs are diligent and meticulous about taking backups, but then fail to test restoring them until the need actually arises. I don't know about you, but I would hate to find out that my backups didn't work at the moment they were most needed. This is mostly because I hate getting fired and, as a DBA, that is exactly the kind of discovery that would seriously increase the odds of it happening!

For this method, I have two examples. The first assumes that you are using Ola Hallengren's backup scripts – if you aren't, I would highly suggest you take a peek at what he has produced and give it a try. Once this requirement is met and you've gone through this process, you'll have a stored procedure that originated with Greg Robidoux, was enhanced by Jason Carter, and was finally modified by me. My modifications were: turning the script into a stored procedure, giving the stored procedure the ability to execute the commands it builds, and allowing it to perform point-in-time restores.

Restoring databases can be a complex task, especially if you take log backups at very frequent intervals. This stored procedure will take care of almost all of the heavy lifting, allowing you to restore specific databases using a single command. I also have versions of this stored procedure that will restore all databases on a single server, and I am currently working on introducing new features such as DBCC checks and using Service Broker to run multiple restores asynchronously.
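To give you a feel for what the procedure automates, here is a minimal sketch of the kind of restore chain it builds for a point-in-time restore. The database names, file paths, and stop time are placeholders, and the real procedure works out the correct sequence of full, differential, and log backups for you.

    -- Illustrative only: the stored procedure generates this chain automatically
    -- from the backup history; all names, paths, and times here are placeholders.
    RESTORE DATABASE StagingDB
        FROM DISK = N'\\BackupShare\Prod\ProdDB_FULL.bak'
        WITH MOVE N'ProdDB'     TO N'D:\Data\StagingDB.mdf',
             MOVE N'ProdDB_log' TO N'L:\Logs\StagingDB_log.ldf',
             NORECOVERY, REPLACE;

    RESTORE DATABASE StagingDB
        FROM DISK = N'\\BackupShare\Prod\ProdDB_DIFF.bak'
        WITH NORECOVERY;

    RESTORE LOG StagingDB
        FROM DISK = N'\\BackupShare\Prod\ProdDB_LOG_1.trn'
        WITH STOPAT = N'2016-06-01T06:00:00',   -- point-in-time target
             NORECOVERY;

    -- Bring the database online once the chain is complete.
    RESTORE DATABASE StagingDB WITH RECOVERY;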

If rolling your own solution isn’t something you feel comfortable with, then Redgate SQL Backup is a fantastic tool for you to use. You can use it to configure and schedule the restore, as well as run a plethora of checks afterwards.

Here is a video created by Grant Fritchey that goes over the features of SQL Backup Pro and also shows how to schedule a restore.

A couple of things to note:

  • If you are restoring backups across different domains, you will need to establish a trust and give users folder permissions as needed.
  • Another problem I ran into was trying to create a new database as part of the restore job; everything worked fine once I created the database separately and just scheduled the restore to overwrite it.

Other considerations include the time the restore takes, which depends on the size of the database and the speed of the system, as well as any post-restore scripts needed to adjust security settings and cleanse data.
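As a simple, hypothetical example of such a post-restore step, the snippet below re-maps an application user to the staging server's login and overwrites real email addresses with safe dummy values. Every name in it is a placeholder for whatever your environment uses.

    -- Example post-restore cleanup (placeholder names throughout).
    USE StagingDB;

    -- Re-map the database user to the staging server's login after the restore.
    ALTER USER AppUser WITH LOGIN = AppUser;

    -- Simple data cleansing: overwrite real email addresses with safe dummies.
    UPDATE dbo.Customers
    SET    Email = CONCAT('user', CustomerID, '@example.com');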

 

Replication

We won't be covering replication in these demos; I think it's fairly straightforward, and many others have already written great tutorials on getting replication started. For those who are curious and would like a primer on replication, Redgate has published a free book by Sebastian Meine which you can download. I recommend this book; my copy is actually sitting in my office bookcase right now.

SSIS Incremental Load

The SSIS incremental load process we can show you is, in my opinion, one of the more elegant ways to keep staging updated and in sync with production. I call this solution elegant for a few reasons: SSIS has a visually natural flow, it can grow into something as complex as you like (such as incorporating a data-cleansing step if you're dealing with sensitive details), and for the most part any DBA can install the tools and learn it. Before we start, I'd like to thank Connor Francis for developing and documenting the SSIS packages we're going to cover here. There is a video walk-through of the SSIS project on YouTube, and you can also grab the code from GitHub.

The incremental load has some requirements. First, your tables will need some sort of timestamp or last-modified date column. Without this column things get tricky, so in most cases you will just have to buck up and add the column. If you run into a case where you absolutely cannot add the column, such as a third-party vendor database, then I would suggest looking into Change Data Capture as a way to get your hands on ordered change data. Alternatively, if this is starting to sound like more work than you had in mind, you might want to consider replication or automated restores.
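If a table doesn't yet have such a column, one hypothetical way to add it is shown below, with a default for new rows and a trigger to keep it current on updates. The table, column, and constraint names are illustrative, and it assumes AccountID is the primary key.

    -- Add a last-modified column with a default for new rows.
    ALTER TABLE dbo.Account
        ADD LastModified DATETIME2 NOT NULL
            CONSTRAINT DF_Account_LastModified DEFAULT (SYSUTCDATETIME());
    GO

    -- Keep the column current whenever a row is updated.
    CREATE TRIGGER dbo.trg_Account_LastModified
    ON dbo.Account
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        UPDATE a
        SET    a.LastModified = SYSUTCDATETIME()
        FROM   dbo.Account AS a
        JOIN   inserted    AS i ON i.AccountID = a.AccountID;
    END;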

Without further ado, here is a demo and explanation that was put together by my colleague Connor Francis:

The Control Flow of the package begins by deleting any existing updates, using an Execute SQL Task that truncates the updates tables. It then splits into two Data Flows, which separately handle the insertion of new data into the account table and the user table. Each Data Flow gathers any new data added to the production tables that postdates the latest timestamp in the staging tables, using an OLE DB source object with a SQL command as its data access method. This is implemented so that, although the job is scheduled to run every 24 hours, if it fails one night it will not miss any data when it next runs. All data is timestamped, whether the timestamp marks a modification to existing data or the addition of new data, and these timestamps determine what is pulled into the Data Flow.
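To make that concrete, here is a rough sketch of the kind of SQL command an OLE DB source might use for the user table. The database, table, and column names are placeholders rather than the exact query from Connor's package, and it assumes both databases are reachable from one connection.

    -- Pull only production rows newer than the most recent row already staged.
    SELECT  u.UserID,
            u.Email,
            u.Password,
            u.LastModified
    FROM    Production.dbo.Users AS u
    WHERE   u.LastModified >
            (SELECT ISNULL(MAX(s.LastModified), '19000101')
             FROM   Staging.dbo.Users AS s);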

Bearing in mind the potential importance of data cleansing we touched upon at the start, the data then flows into a translation script, where it is transformed using C# and data-masking techniques to jumble the user's email address and hide their password. The email is jumbled by breaking the string into a character array, using a C# Random object to assign each letter a random key, and then using a SQL command to pull those keys in order and add the letters into a new array in their new positions, before converting it back into a string. The algorithm for hiding the password is not quite as extravagant: we simply change the letters, numbers, and symbols of the password into asterisks.
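The package does all of this inside a C# script component. Purely as an illustration of the same masking idea, here is a much simpler T-SQL stand-in that replaces the email with a hash-derived dummy value instead of jumbling its characters, and blanks the password with asterisks; the table and column names are assumptions.

    -- Simplified stand-in for the package's C# masking step (illustration only),
    -- run against freshly loaded staging rows.
    UPDATE Staging.dbo.Users
    SET    Email    = CONCAT(LEFT(CONVERT(VARCHAR(64),
                                 HASHBYTES('SHA2_256', Email), 2), 12),
                             '@example.com'),
           Password = REPLICATE('*', LEN(Password));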

Finally, as the data exits the translation, it is inserted into the staging tables through an OLE DB Destination object. If a row's ID is already present in the staging table, meaning it is an update to a row that has already been inserted, the insert fails and the row is redirected to a separate temporary update table. We then exit the Data Flow and return to the Control Flow, where the final step is to run a SQL script that takes the entries in the temporary update table and updates the staging table entries with matching IDs.
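That final SQL script boils down to an UPDATE joined on the ID column, roughly along these lines (again with placeholder table and column names):

    -- Apply the rows captured in the temporary update table to staging.
    UPDATE s
    SET    s.Email        = u.Email,
           s.Password     = u.Password,
           s.LastModified = u.LastModified
    FROM   Staging.dbo.Users         AS s
    JOIN   Staging.dbo.Users_Updates AS u
           ON u.UserID = s.UserID;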

Generating Data

As we discussed in part 1, generated data can play an important part in the continuous integration process. When you're trying to test for unpredictable behaviour, randomly generated data can be more useful than structured customer data. Likewise, as I mentioned earlier, if you're trying to test a new feature, or to load-test a system for which only small amounts of data exist, the ability to create arbitrary amounts of artificial data (structured or random) can be very handy. That said, however useful random data can be, you need to be wary of inconsistencies between each sample of generated data. It's possible that one generated sample might allow a set of tests to pass, while the next – generated from the same process – would cause them to fail.

The data for this project was generated by scheduling an SSIS package designed to create data for any given table. The package reads the table it is generating data for and then, according to the data types in the table, produces randomized data for each column. Given a table, the C# code downloads a data table of a few records, loops through the number of columns in that data table, and creates a new data table with data columns matching the data types in the database table. Using C#'s random generator and some simple algorithms to generate strings, numbers, dates, and Booleans, rows of data are created and added to the new table.

There is also a query at the beginning of the data generator to find the largest primary key ID, so that new primary keys are added sequentially after the largest already in the table. Because of the foreign key dependency, data generation for accounts is slower than for users: for each new row added, it must run a SQL command and execute a scalar to find a random, already-existing userID in the user table. Once the appropriate data rows have been generated and the table hasn't broken any of the key dependencies, the new data table is appended to the database table within C# using statements.
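Connor's generator does all of this in C#. Purely as a hedged illustration of the same ideas, here is a T-SQL sketch that continues the account IDs from the current maximum and looks up a random existing user for each new row; the table and column names are assumptions, and it assumes AccountID is not an IDENTITY column.

    -- Illustrative T-SQL version of the generation logic (the real package uses C#).
    DECLARE @NextID INT = (SELECT ISNULL(MAX(AccountID), 0) FROM dbo.Account);
    DECLARE @i INT = 0;

    WHILE @i < 100                                   -- generate 100 new account rows
    BEGIN
        -- Pick a random existing user to satisfy the foreign key, row by row.
        DECLARE @RandomUser INT =
            (SELECT TOP (1) UserID FROM dbo.Users ORDER BY NEWID());

        INSERT INTO dbo.Account (AccountID, UserID, Balance, CreatedOn)
        VALUES (@NextID + @i + 1,                                       -- sequential key
                @RandomUser,
                ABS(CHECKSUM(NEWID())) % 100000 / 100.0,                -- random amount
                DATEADD(DAY, -(ABS(CHECKSUM(NEWID())) % 365), GETDATE())); -- random recent date

        SET @i += 1;
    END;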

So which approach do you choose?

Let's return to our DBA, who thought he had created a rocking staging environment and longed to take undisturbed vacations in Hawaii. Using any of the test data processes we've discussed, he can now leave the development team with an automated way to keep the staging environments synchronized and to generate test data. Now, even when the DBA is away, development can test with a high level of confidence that the data they are using is current and compliant and that, if their tests pass, they won't break the database.

In this article we have demonstrated three ways to keep your test environments fresh with production or generated data, and it's important to remember that, of the multiple approaches we have discussed, there is no one-size-fits-all solution. This is mostly due to how unique and complex some environments can be, as well as your individual business requirements. Just to give you an example, at my current gig we use a mix of automated restores and SSIS packages. Once a month we do a full refresh of staging, which requires developers to check in their changes and also allows us to avoid any database drift that has crept in during testing. The SSIS packages let us update our test environments with recent production data in between.

I'd recommend you talk to your developers and testers to learn more about their processes, find out exactly what data they need to be effective, and how "fresh" they need it to be. Once you know that, you can work with them to pick and mix the approaches that will suit your needs, give you rock-solid database tests, and let DBAs go on vacation.
