Continuous Integration (CI) is a software development practice that, done well, can help you build your software faster and at a higher level of quality. This piece examines what CI entails for software in general, and how it can be applied to databases in particular.
What Is Continuous Integration?
The original article I read on Continuous Integration was from ThoughtWorks and Martin Fowler, describing how they implemented a build process for their clients. This article goes back to 2000, at a time when I was a database developer working in an agile fashion at a small startup. I hadn't heard of CI back then, but I wish we had, as our source code was a mess. However, I'm not sure we would have used it, as I struggled just to get all our VB and T-SQL code into a Version Control System (VCS).
The Daily Build
One of the tomes of software development is The Mythical Man Month by Fred Brooks. In this book, Mr. Brooks writes about his experiences working on large software projects at IBM. He drew many conclusions and ideas from those experiences, and one of them was that large projects suffer from large numbers of errors when the work of many programmers is integrated. It's natural, and the longer the time between integrations of code, or even of subsystems of an application, the more likely it is that things will not work together.
One of the ideas that came out of these experiences was the daily build. Microsoft used this extensively in its processes, as did many other companies. In a daily build environment, the current version of all code is compiled at the end of the day, usually a little before the end of the day, since there may be problems and no one is allowed to go home until the build works. Developers learn very quickly that delaying their check-ins to the VCS doesn't make this process easier, as a few weeks of work checked in on one day might result in many hours of debugging by the team to get the code to compile.
This works well and helps ensure that developers check in their code regularly, but it is a very low bar for your software. Merely having the code compile in no way ensures that the code will perform as expected when it is used. It is important that you have some static code analysis or testing as well to minimize future issues.
The advantage of integrating your code every day is that you know you have code that compiles at the end of every day, and could potentially be released. The more testing you perform on this code, the more your confidence grows that the work performed during the day moved the application closer to meeting its goals.
Accelerating the Pace
If we could build software once a day, some developers wondered, could we do it more often? After all, if we find we have fewer integration bugs with only a day's worth of changes, wouldn't we have even fewer issues if we compiled twice a day, or eight times a day?
With the advent of cheap computing resources, the pace of builds accelerated, and one of the ideas in ThoughtWorks' updated paper was that every check-in should trigger a build. If the build process is automated (and it should be), then each time a developer checks code into the VCS, a new executable can be compiled. This allows each developer to determine whether their change (or small batch of changes) caused the build to break. At such a granular level, it becomes very quick and easy to determine what code is causing issues with your mainline.
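As a toy illustration of why per-check-in builds make diagnosis easy (the commit labels and the `build` callback here are invented for the example, not part of any real CI tool), the first failing build names the exact check-in that broke the mainline:

```python
def first_breaking_commit(commits, build):
    """With one build per check-in, the first failing build pinpoints the culprit.

    `commits` is the ordered list of check-ins; `build` runs the automated
    build-and-test pipeline for the code as of that check-in and returns
    True on success.
    """
    for commit in commits:
        if not build(commit):
            return commit
    return None  # every build passed


# Toy history: pretend check-in "c3" introduced a compile error.
history = ["c1", "c2", "c3", "c4"]
culprit = first_breaking_commit(history, build=lambda c: c != "c3")
assert culprit == "c3"
```

With a daily build, the same failure would only be narrowed down to "one of today's check-ins"; building on every check-in narrows it to one.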
One of the further tenets of CI is that the build must be self-testing. This means that automated tests are run against your code to determine whether various modules, results, or assertions are correct. With a strong set of unit tests added to a CI process, not only can developers learn that their code integrates with other developers' code, but they can also have more confidence that their code works correctly without causing regression bugs. The more testing that is performed, the more confidence everyone has in the code.
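As a sketch of what "self-testing" means in practice, here is a small Python unit-test suite for a hypothetical `apply_discount` function (the function and its rules are invented for the example); a CI server would run tests like these automatically on every build and fail the build if any assertion fails:

```python
import unittest


def apply_discount(price, pct):
    """Return price reduced by pct percent, rounded to cents."""
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    return round(price * (1 - pct / 100), 2)


class DiscountTests(unittest.TestCase):
    def test_typical_case(self):
        self.assertEqual(apply_discount(100.0, 25), 75.0)

    def test_boundaries(self):
        self.assertEqual(apply_discount(50.0, 0), 50.0)    # no discount
        self.assertEqual(apply_discount(50.0, 100), 0.0)   # full discount

    def test_invalid_input_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(50.0, 150)


if __name__ == "__main__":
    unittest.main()
```

A failing test here stops the build just as surely as a compile error would, which is what turns a build server into a quality gate rather than a compile check.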
In reality, most companies don't have exhaustive enough tests to be sure their code is bug-free, but they can be sure that where they have test coverage, often in the places where the code has broken before, changes made to the mainline are not causing old bugs to resurface.
Database Continuous Integration
As I've studied this subject, I've found that very few software teams actually include the database in their CI process, though the number is growing. As more software development teams realize that the changes they make to the database can cause severe application failures, they are looking to ensure their database code is versioned in a VCS just like their application code, and that the two work well together (if the CI process tests both the application and database code). This also gives your software development team some flexibility in knowing which version(s) of the front-end code work with which version(s) of the database.
Note that you do not need to keep the application and database at the same version in your VCS. Many teams have these two projects in the VCS with independent versions.
Extending CI to databases is a fairly straightforward process, though not necessarily simple to implement. In essence, all code changes made to the database must be applied to a copy of a database on each check-in, and all of the tests available for your database code must be run. This is the same as application code, but slightly more complex.
Since a database is often a standalone service, the "build" of a database consists of these parts:
1. Check out the DDL code from the VCS.
2. Execute the DDL scripts against a database on a database server, in the proper order to avoid dependency issues.
3. Verify that all objects exist.
4. Add lookup or reference data to the database.
5. Optionally, add sample data to tables.
6. Execute unit tests (and potentially integration tests) against the database.
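The steps above can be sketched in miniature. This is a hedged illustration using Python's built-in SQLite driver as a stand-in for a real database server; the table names, reference data, and script contents are invented for the example:

```python
import sqlite3

# Stand-ins for the scripts checked out of the VCS (step 1).
DDL = """
CREATE TABLE status (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE ticket (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    status_id INTEGER NOT NULL REFERENCES status(id)
);
"""
REFERENCE_DATA = [(1, "open"), (2, "closed")]


def build_database():
    conn = sqlite3.connect(":memory:")   # a throwaway build database
    conn.executescript(DDL)              # step 2: run the DDL in order

    # Step 3: verify that the expected objects exist.
    found = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    assert {"status", "ticket"} <= found, "build is missing objects"

    # Step 4: load the lookup/reference data.
    conn.executemany("INSERT INTO status (id, name) VALUES (?, ?)",
                     REFERENCE_DATA)

    # Step 6: a unit test against the freshly built database.
    (n,) = conn.execute("SELECT COUNT(*) FROM status").fetchone()
    assert n == 2, "reference data failed to load"
    return conn
```

In a real pipeline each of these steps would be a separate, logged stage, and a failure at any stage would fail the whole build.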
None of these processes is hard to perform manually, but automating them in a general sense is a bit complex. Database developers don't want to add a new object and then edit a build script of some sort to ensure the object is created. They also don't want to worry about dependencies and ensuring indexes and constraints are created after tables along with any other ordering requirements.
The problem of data becomes complex as well. While we can easily perform step #2 on a brand-new database and validate our code, we rarely get that opportunity for much of our database work. The changes we develop must be executed against live databases in a way that preserves the existing data. This is far more complex than deploying application code, where we can simply delete the old version and replace it with the new one.
We also have the challenge of writing unit tests that work with data in a variety of ways. Just as methods and functions in application code must deal with edge cases in their inputs, in databases we face similar issues not only with the values of data, but also with volumes of data that can affect our application in ways that are difficult to anticipate without adequate testing at scale.
A strong database CI process will ensure not only that your database DDL compiles, but also that the changes can be applied safely to an existing database without dropping and recreating tables. Often a CI process will produce an upgrade or change script that can be run against an existing database to apply the changes, giving you a database environment with the latest version of your objects.
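A minimal sketch of such an upgrade script, once more with SQLite standing in for a real server and an invented `customer` table: the schema change is applied in place, and the test's whole point is asserting that existing rows survive it:

```python
import sqlite3


def upgrade(conn):
    """A change script for one version bump: add a column without
    dropping the table, then backfill the existing rows."""
    conn.execute("ALTER TABLE customer ADD COLUMN email TEXT")
    conn.execute("UPDATE customer SET email = '' WHERE email IS NULL")


# Simulate an existing production database that already holds live data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customer (name) VALUES ('Ann'), ('Bob')")

upgrade(conn)

# The existing rows must survive the schema change intact.
rows = conn.execute("SELECT id, name, email FROM customer ORDER BY id").fetchall()
assert rows == [(1, "Ann", ""), (2, "Bob", "")]
```

Running the generated script against a restored copy of production data in CI is what gives you confidence the same script is safe to run for real.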
Many developers work on a shared database while building their applications. Each time a developer changes an object in the shared database, the change is essentially compiled and integrated with all the other objects that exist. There can be collisions and potentially lost work here, but the strong habit of checking in code regularly can avoid most of these issues.
In this case, is a CI process needed? Many developers believe it is, because the formal process of validating changes and executing tests ensures that no work is lost, and that developers are not making ad hoc alterations to correct errors, changes we would not want repeated in production databases. There is less integration checking, but there can certainly be more verification and validation of the database code changes. This suggests there is little to lose from implementing a CI process, and potentially large gains in quality assurance.
Continuous Integration has many benefits that can help improve your software development process, including the code changes you make to databases. It is actually more important for databases, as we benefit from testing our changes early, before additional front-end work is built on poorly implemented database objects. While implementing CI is not a trivial undertaking, the benefits should far outweigh the investment in time and resources. In addition, more tooling is being created by vendors that can dramatically ease the effort of implementing a CI process.
I would urge you to read the ThoughtWorks page on CI and consider how extending these ideas to your application code might help ensure that your developers are creating the best code they can. We also have a number of articles at Simple Talk that showcase how CI can be applied to databases.