DevOps, a set of practices that combines software development (Dev) and information-technology operations (Ops), has become a very popular way to shorten the systems development life cycle and provide continuous delivery of applications ("software"). Applying continuous delivery and DevOps practices to data analytics has been termed DataOps, which is the topic of this blog.
From a development perspective, databases are more difficult to manage than applications. Applications generally do not concern themselves with state: for any given release or build, an application can be deployed over the previous version without needing to preserve any portion of it. Databases are different. It's much harder to deploy the next version of your database when you need to be concerned with maintaining "state" in the database.
So what is the "state" you need to be concerned with?
Lookup data is the simple example. Almost every database has tables that are used for allowable values, lookup data, and reference data. If you need to change that data for a new release, how do you do that? What happens if the customer or user has already changed that data? How do you migrate that data?
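One common way to handle this is to treat reference data as code and deploy it idempotently. A minimal T-SQL sketch of the idea, assuming a hypothetical `dbo.OrderStatus` lookup table (all names here are illustrative, not from any particular product):

```sql
-- Hypothetical lookup table; re-running this script is safe because
-- MERGE only inserts missing rows and updates changed ones.
MERGE dbo.OrderStatus AS target
USING (VALUES
    (1, N'Pending'),
    (2, N'Shipped'),
    (3, N'Cancelled')
) AS source (StatusId, StatusName)
ON target.StatusId = source.StatusId
WHEN MATCHED AND target.StatusName <> source.StatusName THEN
    UPDATE SET StatusName = source.StatusName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (StatusId, StatusName)
    VALUES (source.StatusId, source.StatusName);
```

Note that this sketch silently overwrites any edits the customer made to matching rows; if user changes must be preserved, the script needs explicit rules for which side wins.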
Another example: a table undergoes a major schema migration. New columns are added and the table is split and normalized among new tables. How do we write the migration code to ensure it runs exactly once or runs multiple times without side effects (using scripts that are “idempotent”)?
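As a sketch of what "idempotent" means in practice, a migration script can guard each change so that re-running the script is a no-op (the table and column names below are made up for illustration):

```sql
-- Add the new column only if it is not already there.
IF COL_LENGTH('dbo.Customer', 'Email') IS NULL
BEGIN
    ALTER TABLE dbo.Customer ADD Email NVARCHAR(256) NULL;
END;

-- Copy data into the new normalized table only if it is still empty,
-- so the migration runs exactly once even if the script is replayed.
IF OBJECT_ID(N'dbo.CustomerAddress') IS NOT NULL
   AND NOT EXISTS (SELECT 1 FROM dbo.CustomerAddress)
BEGIN
    INSERT INTO dbo.CustomerAddress (CustomerId, AddressLine1)
    SELECT CustomerId, AddressLine1 FROM dbo.Customer;
END;
```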
Other objects also require state to be considered during deployment:
- Indexes: what happens if an index is renamed or an included column is added? What happens if the DBA adds a new emergency index? Will your DevOps tool remove it since it isn’t in an official build?
- Keys: if you change a primary key, will that change require the PK to be dropped and recreated? If so, what happens to the foreign keys?
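A deployment script can defend against some of these surprises by checking the catalog views before acting. A hedged sketch (the index and table names are illustrative):

```sql
-- Create the index only if it does not already exist, so a re-run
-- (or an index a DBA created by hand under the same name) is safe.
IF NOT EXISTS (
    SELECT 1
    FROM sys.indexes
    WHERE name = N'IX_Customer_LastName'
      AND object_id = OBJECT_ID(N'dbo.Customer')
)
BEGIN
    CREATE NONCLUSTERED INDEX IX_Customer_LastName
        ON dbo.Customer (LastName);
END;
```

Whether a tool drops objects it doesn't know about (such as that emergency index) is usually a configurable option, so it is worth checking your tool's settings before the first automated deployment.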
In most cases, database objects like functions, views, and stored procedures have no state considerations and can be re-deployed during every release.
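Because these objects carry no state, they can simply be re-created on every deployment. On SQL Server 2016 SP1 and later, CREATE OR ALTER makes this a one-step, rerunnable script (the view below is illustrative):

```sql
-- Safe to run on every release: replaces the definition if the view
-- exists, creates it otherwise, and no data is lost either way.
CREATE OR ALTER VIEW dbo.vw_ActiveCustomers
AS
SELECT CustomerId, FirstName, LastName
FROM dbo.Customer
WHERE IsActive = 1;
```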
So how do you overcome these "state" difficulties, especially if you are aiming for frequent releases and agile, collaborative development?
The first step is to make a major decision when including databases in your DevOps processes: how you will store the data model. There are two options:
Migration-based deployment: Sometimes called transformation-based deployment, this is the most common option today and is a very traditional way to work with databases during development. At some point you create an initial database (a “seed” database that is a single migration script stored inside source control), and after that you keep every script that’s needed to bring the database schema up to the current point (you can use SQL Server Management Studio to create the scripts). Those migration scripts will have an incremental version number and will often include data fixes or new values for reference tables, along with the Data Definition Language (DDL) required for the schema changes. So basically you are migrating the database from one state to another. The system of truth in a migration-based approach is the database itself. There are a few problems with this option:
- Deployments keep taking longer as more and more scripts need to be applied when upgrading a database. A way around this is to create new seed databases on a regular basis so upgrades don't have to start from the very first script
- A lot of time can be wasted with large databases when, for example, designing an index. If the requirements keep changing, a large index can be added to the database, then deleted, then reapplied slightly differently (e.g. with a new column added), and this can be repeated many times
- There is no data model that shows what the database should really look like; the only way to see it is to inspect a freshly upgraded database
- Upgrade scripts can break if schema drift occurs. This can happen when a patch is made directly to a production server and those changes either don't make it back to the development environment or are implemented differently there than they were in production
- Upgrade scripts can also break if not run in the correct order
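Many migration-based tools (Flyway, DbUp, and SQL Change Automation among them) address the run-once and ordering problems with a version table in the target database. A minimal sketch of the idea, with illustrative names:

```sql
-- One-time setup: a journal of every migration script already applied.
IF OBJECT_ID(N'dbo.SchemaVersion') IS NULL
    CREATE TABLE dbo.SchemaVersion (
        VersionNumber INT           NOT NULL PRIMARY KEY,
        ScriptName    NVARCHAR(260) NOT NULL,
        AppliedOn     DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME()
    );

-- Each numbered migration script wraps its work in a check, so it
-- runs exactly once even if the pipeline replays all the scripts.
IF NOT EXISTS (SELECT 1 FROM dbo.SchemaVersion WHERE VersionNumber = 42)
BEGIN
    -- ...DDL and data fixes for version 42 go here...
    INSERT INTO dbo.SchemaVersion (VersionNumber, ScriptName)
    VALUES (42, N'V042_split_customer_table.sql');
END;
```

Running the scripts in ascending `VersionNumber` order also gives you a deterministic ordering, which guards against the out-of-order problem above.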
State-based deployment: With this option you store the data model by taking a snapshot of the current state of the database, putting it in source control, and using comparison tools to figure out what needs to be deployed (i.e. doing a schema compare between your repository and the target database). Every table, stored procedure, view, and trigger is saved as a separate .sql file representing the true state of that database object. This is a much faster option, as the only changes deployed are those needed to move from the current state to the required state (usually via a DACPAC). This is what SQL Server Data Tools (SSDT) for Visual Studio does with its database projects, which include schema comparison and data comparison tools; alternatively, you can use a product like SQL Compare from Redgate. Returning to the index example above, with this option you simply create the final index instead of creating and modifying it multiple times. In a state-based approach the system of truth is the source code itself. Another benefit is that you never have to write ALTER scripts: the schema/data compare tool generates them and runs them against the target database without any manual intervention. The developer just keeps the database structure up to date and the tools do the rest. The end result is much less work than with migration-based deployment.
While it may seem state-based deployment is always the way to go, migration-based deployment may make more sense in scenarios where you need finer-grained control over the scripts, since with state-based deployment you are not able to modify the generated difference script. Having control over the scripts lets you write better ones than you think the schema compare would generate. Other reasons: by making each change a first-class artifact, you can "build once, deploy often" (as opposed to generating something new before each deployment); you encourage small, incremental changes (per Agile/DevOps philosophy); and it's much easier to support parallel development strategies with migrations, in part because the migrations themselves are small, incremental changes (i.e. the ability to deploy different features or development branches to target databases, that is, environments like stage and production).
Once you figure out which deployment method you will use, the next step is to learn the options for version control and an automated build process. Check out these blogs for help: Automating Builds from Source Control for the WideWorldImporters Database (state-based approach), Database Development in Visual Studio using SQL Change Automation: Getting Started (migration-based approach), Deploying Database changes with Redgate SQL Change Automation and Azure DevOps (lab), Introduction to SQL Server database continuous integration, Basic Database Continuous Integration and Delivery (CI/CD) using Visual Studio Team Services (VSTS), Continuous database deployments with Azure DevOps, Why You Should Use a SSDT Database Project For Your Data Warehouse.
One last point of clarification: DataOps focuses on keeping track of database objects such as tables, stored procedures, views, and triggers, while DevOps focuses on source control for application code. These are usually done separately (i.e. separate projects and pipelines in Azure DevOps). Also usually done separately are Azure Databricks (see CI/CD with Databricks and Azure DevOps), Power BI (see The future of content lifecycle management in Power BI), Azure Data Factory (see Continuous integration and delivery (CI/CD) in Azure Data Factory, Azure data Factory –DevOps, and Azure DevOps Pipeline Setup for Azure Data Factory (v2)), and Azure Analysis Services (see Azure Analysis Services deploy via DevOps and You, Me, & CI/CD: Deploying Azure Analysis Services Tabular Models through Azure Pipelines). There is also a whole separate category for machine learning operations (MLOps); see the MLOps workshop by Dave Wentzel (who helped with this blog). If you have any interest in these topics, please comment below, and based on the interest I will follow up with additional blog posts.