Are your automated tests slowing you down?

Slow, unreliable tests prevent teams from doing great work, and make continuous delivery impossible.

This was true for our SQL Source Control team when I started working with them. Getting a complete set of test results after pushing a commit took almost 12 hours. Then, if you had time, you had to rerun the raft of tests that sometimes failed.

It had been years since a green build. The team was so used to seeing flaky test failures that they no longer noticed real ones.

We realized this was not okay. We needed our tests to give us:

  • Timely feedback on our work
  • Confidence to make changes to our product
  • Confidence to ship improvements to our users

For those to happen, we had to work differently. We had to make the tests work for us.

Report on what you want to know

Tests build up over time, but they aren’t always worth keeping. So how do you spot tests that aren’t useful any more? We went about this in two steps.

Organize tests by purpose

Previous teams had different ideas of how to organize a codebase. We had tests everywhere, with no structure. We decided to re-categorize all of our tests into a few groups:

  • Checking basic functionality of code
  • Checking high-level behavior of a specific feature
  • Checking integration with third-party products
  • Checking the behavior of an installed product

Organizing our tests in a meaningful way helped us get them running more efficiently:

  • Related tests ran together, giving us quick feedback on areas of the product
  • The test infrastructure was only used by test runs that needed it
  • The time taken to run tests suggested areas we were testing too little or too much
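One way to make those groups explicit in the code is to tag tests with categories. Here’s a sketch of how that could look with NUnit (the framework, category names and class names are illustrative assumptions, not taken from our codebase):

    // Grouping tests by purpose with NUnit categories. All names are illustrative.
    using NUnit.Framework;

    namespace SqlSourceControl.Tests
    {
        [TestFixture]
        [Category("Small")]              // checks basic functionality of a single class
        public class CommitMessageParserTests
        {
            [Test]
            public void Parses_an_empty_message_as_no_changes() { /* ... */ }
        }

        [TestFixture]
        [Category("InstalledProduct")]   // checks the behavior of an installed product
        public class InstallerSmokeTests
        {
            [Test]
            public void Installs_and_launches_on_a_clean_machine() { /* ... */ }
        }
    }

A category-aware runner can then run related tests together and only spin up the heavier infrastructure when a run actually needs it (the NUnit 3 console runner, for example, accepts a filter like --where "cat == Small").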

Know why they exist

Once related tests were next to each other, we applied a consistent naming scheme based on the popular acceptance test template:

Given<Setup>_When<Action>_Then<Outcome>

This revealed a lot of tests that were doing practically the same thing. For these tests we asked Are they different enough to be useful? When the answer was No, we deleted the duplicates.

We never said Yes.
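To make the template concrete, here’s a hypothetical pair of tests named this way. Once the setup, action and outcome are spelled out in the name, duplicates like these are hard to miss:

    using NUnit.Framework;

    [TestFixture]
    public class PendingChangeTests
    {
        [Test]
        public void GivenALinkedDatabase_WhenATableIsAdded_ThenItAppearsAsAPendingChange()
        {
            // ...
        }

        // Practically the same test as the one above, only worded differently.
        // This is the kind of duplicate we deleted.
        [Test]
        public void GivenALinkedDatabase_WhenANewTableIsCreated_ThenAPendingChangeIsShown()
        {
            // ...
        }
    }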

Do not tolerate flakiness

If a test changes from passing to failing (or vice versa) without the product changing, we’d call it ‘flaky’. In the past, the standard reaction was Just run it again. That led to wasted time and to genuine failures being missed.

We’d already agreed that some tests weren’t as valuable now as they were originally. That made it easy to ask Is this still useful? whenever a test looked flaky.

If we said It’s valuable, we’d look for a way to make the test more reliable. If we found one, we’d apply that fix and carry on.

If we didn’t, we would ask how to test the same thing in a different way. This often led to big, slow tests being broken down into a number of smaller, faster tests.

If we said It’s not valuable, we’d delete it. We did this a lot.
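One kind of reliability fix that often helps with flaky tests is to poll for the condition you actually care about rather than sleeping for a fixed time and hoping it’s long enough. A sketch with hypothetical helpers (not our actual code):

    using System;
    using System.Threading;
    using NUnit.Framework;

    [TestFixture]
    public class PushReliabilityTests
    {
        [Test]
        public void GivenACommit_WhenItIsPushed_ThenTheRemoteReceivesIt()
        {
            PushCommit();  // hypothetical action under test

            // Before: Thread.Sleep(5000); which was flaky whenever the push was slow.
            WaitUntil(() => RemoteHasCommit(), TimeSpan.FromSeconds(30));
        }

        // Poll for the outcome we care about instead of guessing a delay.
        private static void WaitUntil(Func<bool> condition, TimeSpan timeout)
        {
            var deadline = DateTime.UtcNow + timeout;
            while (!condition())
            {
                if (DateTime.UtcNow > deadline)
                    throw new TimeoutException("Condition was not met in time.");
                Thread.Sleep(100);
            }
        }

        private void PushCommit() { /* hypothetical */ }
        private bool RemoteHasCommit() => true;  // hypothetical
    }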

Size matters

When we split tests apart into smaller, quicker tests, we realized something. When it comes to tests, size does matter. Inspired by Google’s Test Sizes (but adapted to our context), we started to agree on what we wanted our tests to look like.

Smaller is (normally) better

We prefer tests to be as small as possible, to give fast feedback on the code we write. That means not interacting with system resources, databases, or even other classes where possible.

This made talking about our tests easier. We would actively look to split a larger test into a number of small tests, or justify why a larger test was worthwhile.
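To give a feel for what we mean by a small test, here’s a sketch of the sort of test we aim for: the class under test depends only on an interface, so the test never touches a real database or the file system. Every name here is made up for the example, not the product’s real API:

    using System.Collections.Generic;
    using System.Linq;
    using NUnit.Framework;

    // Illustrative stand-ins, not the product's real API.
    public interface ISchemaReader
    {
        IEnumerable<string> GetTableNames();
    }

    public class FakeSchemaReader : ISchemaReader
    {
        public IEnumerable<string> GetTableNames() => new[] { "Customers", "Orders" };
    }

    public class ChangeDetector
    {
        private readonly ISchemaReader _source;
        private readonly ISchemaReader _target;

        public ChangeDetector(ISchemaReader source, ISchemaReader target)
        {
            _source = source;
            _target = target;
        }

        // Tables present in the source schema but missing from the target.
        public IEnumerable<string> FindChanges() =>
            _source.GetTableNames().Except(_target.GetTableNames());
    }

    [TestFixture]
    [Category("Small")]
    public class ChangeDetectorTests
    {
        [Test]
        public void GivenTwoIdenticalSchemas_WhenCompared_ThenNoChangesAreReported()
        {
            var detector = new ChangeDetector(new FakeSchemaReader(), new FakeSchemaReader());

            Assert.That(detector.FindChanges(), Is.Empty);
        }
    }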

When is bigger also better?

As great as small, quick-running tests are, there are things they don’t tell us.

SQL Source Control works with third-party tools (like Git and Microsoft Team Foundation Server). It’s imperative that those integration points work reliably for our users. For now, we test that with integration tests between real systems, which is definitely a job for larger tests.
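For contrast, a larger test at one of those integration points might look something like this sketch. It shells out to a real git executable in a temporary directory, which is slower but exercises the real integration. It assumes git is on the PATH, and it isn’t our actual test code:

    using System.Diagnostics;
    using System.IO;
    using NUnit.Framework;

    [TestFixture]
    [Category("Integration")]
    public class GitIntegrationTests
    {
        [Test]
        public void GivenAnEmptyDirectory_WhenGitInitRuns_ThenARepositoryIsCreated()
        {
            var workingDir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
            Directory.CreateDirectory(workingDir);
            try
            {
                var git = Process.Start(new ProcessStartInfo("git", "init")
                {
                    WorkingDirectory = workingDir,
                    UseShellExecute = false
                });
                git.WaitForExit();

                Assert.That(git.ExitCode, Is.EqualTo(0));
                Assert.That(Directory.Exists(Path.Combine(workingDir, ".git")), Is.True);
            }
            finally
            {
                Directory.Delete(workingDir, recursive: true);
            }
        }
    }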

Quicker feedback is better

Our agreement about preferring smaller tests soon led to another idea: We want as much feedback on our work as possible, as quickly as we can get it.

Since we work in Visual Studio, the team were very keen to adopt NCrunch to get our small tests running as we wrote or changed code. This had been tried before, but it had failed because of the number and complexity of the tests. Splitting out our small tests made it possible.

Putting the tool in place was a good start, but getting all of our small tests running automatically took some work. It also highlighted areas of the product that we changed frequently but that weren’t covered by small tests, so we started adding those tests as we worked on the product.

Out with the old…

Maintaining code takes time and effort, and test automation code is no different.

Maintenance doesn’t just mean keeping the code working; to us it means keeping its quality high, by modern standards.

We use UI automation in some tests. These tests rely on a virtual machine automation tool to create test environments, and sticking with old technology here really slowed us down.

We ditched our home-grown virtual machine automation project in favor of Vagrant. That gave huge reductions in test runtime (thanks to quicker environment provisioning), freed us from maintaining a huge codebase, and gave us easy access to local, self-contained test environments.
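For a flavor of what that looks like, here’s a minimal Vagrantfile sketch for a disposable Windows test environment. The box name and provisioning script are placeholders, not our actual configuration:

    # Minimal Vagrantfile sketch for a disposable test environment.
    # The box name and provisioning script below are placeholders.
    Vagrant.configure("2") do |config|
      config.vm.box = "windows-server-2012-r2"   # hypothetical box name
      config.vm.communicator = "winrm"           # Windows guests use WinRM rather than SSH

      # Install the product under test and its prerequisites.
      config.vm.provision "shell", path: "provision/install-sql-source-control.ps1"
    end

Running vagrant up then gives any engineer a clean local environment on demand, and vagrant destroy throws it away again afterwards.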

Where are we now?

For the team working on SQL Source Control today, life looks like this:

  • Any small tests are automatically run as our software engineers change code
  • All other tests can be easily run in a development environment, whenever we wish
  • All tests automatically run on every change-set we push
  • End-to-end runtime of those tests is about 20 minutes
  • We expect those runs to be green, and take action if they’re not
  • A green test run is necessary before we merge changes
  • A green build gives us confidence to release to our users

That’s an awfully long way from Maybe it’ll be green one day, but we always want to get better. For example, we want instant, meaningful feedback on any code we change, and we want it in under ten minutes. And, of course, we want to know that a red test means we broke something.

So next time you’re frustrated waiting for feedback, or spending your time diagnosing flaky tests, try asking yourself:

  • Who are these tests helpful for?
  • How can I get this feedback in a quicker way?
  • How are our tools holding us back?