Date: 2024-12-02

My first exposure to Terraform was when I started to work with data in the AWS cloud. As a Cloud Data Engineer I was expected to cover a wide range of infrastructure provisioning so I found myself having to learn rather more about cloud infrastructure than I had expected.

We want repeatability when deploying infrastructure across environments. A web-based Cloud/SaaS console remains useful for visualising your infrastructure but for actual maintenance we must have infrastructure-as-code (IAC), just as we have SQL scripts to develop and maintain our databases.

When I started using AWS, it had three methods of spinning up infrastructure: the web console, the AWS CLI and CloudFormation.

Nowadays we also have the AWS CDK that allows us to maintain infrastructure in a programming language in which we might already have skills.

The AWS CLI has its uses, though for IAC I would keep it as a lump hammer for when all else has failed. It isn't really intended to script up and maintain a large infrastructure estate.

CloudFormation was fine for basic activities but as our needs became more sophisticated it became a bit of a nightmare. We also had to consider other SaaS products such as GitHub and working with other cloud providers. Maintaining several products is especially hard when each product and cloud has its own approach to IAC.

Terraform removed a lot of the stress by providing a common approach across all products. A look at the HashiCorp Providers page shows the breadth of providers available.

Terraform is a command line tool that allows you to define your infrastructure in the HashiCorp Configuration Language, HCL. HCL acts as an abstraction layer for the cloud APIs and libraries. Just like SQL, this is a declarative language, so you tell it what you want the end result to be and it works out how it is going to achieve it.

HCL has its own in-built functions for manipulating the contents of its data types. Examples include the following.

Look at the last point. HCL does not edit data, it takes an immutable input and provides an immutable output. HCL is a declarative, functional programming language.
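As a small sketch of that immutability, each function call below returns a new value and leaves its input untouched. The local names and tag values are hypothetical, chosen only for illustration.

```hcl
locals {
  # Hypothetical starting value, for illustration only.
  default_tags = { team = "data", managed_by = "terraform" }

  # merge returns a new map; local.default_tags itself is unchanged.
  all_tags = merge(local.default_tags, { environment = "dev" })

  # upper returns a new string rather than editing the map entry.
  environment_label = upper(local.all_tags["environment"])
}
```

There is no way to append to local.default_tags in place; we can only derive a new value from it.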

There are 3 types of component in Terraform

A unit of code you write in Terraform is called a module and your modules can be made up of other modules. An example would be a module that defines the infrastructure for pulling data (extraction) from outside your organisation. Sub-modules could include the following infrastructure.

A Terraform provider is a wrapper for an API and within that API its endpoints are represented by the following: -

Think of a data object as being a DESCRIBE or SHOW query, or a SELECT against INFORMATION_SCHEMA objects.

There are two things that a data object can do

A resource creates or updates infrastructure. In SQL terms this is akin to CREATE or ALTER.

Terraform Resource and Data objects are also modules; however, when we talk about Terraform modules we tend to mean something we write, as represented by the graphic below.

Let's look at these in a bit more detail.

Providers

Let us suppose that I want to use the GitHub provider. I would set up a requirements.tf file as follows.

terraform {
  required_version = ">= 1.9.0"

  required_providers {
    github = {
      source  = "integrations/github"
      version = ">= 6.2"
    }
  }
}

# Config settings for the Github provider
provider "github" {
  owner = "djp-corp"
}

I am telling Terraform that I wish to use Terraform version 1.9.0 as a minimum and that the GitHub provider (plugin) must be at least version 6.2. I have configured my GitHub provider so that the owner is my company, the DJP Corporation.

Provider authentication

For authentication Terraform reads environment variables, and for GitHub the default is to look for an environment variable called GITHUB_TOKEN. This contains a GitHub PAT (Personal Access Token) that I set up in GitHub and configure to expire every 30 days.

For AWS I use the aws sso (AWS Single Sign-On) command line and a simple shell script that assumes the relevant AWS role and makes the temporary AWS credentials available for my session as environment variables.

One of the challenges with Terraform authentication is to have just the permissions you need to deploy the infrastructure you want. The way my company approaches this is for the data engineers in the development environment to have close to admin privileges. Deployment to higher environments can only take place using the CI/CD pipeline.

Variables

Again, think of a module as being a function or stored procedure. It must have its input parameters defined and variables are how we define those input parameters. The convention (but not an enforced rule) is to store those variables in a file called variables.tf.

Here are two examples of variable declaration.

variable "api_concurrency" {
  description = "The maximum number of concurrent api calls that are allowed to run at one time."
  type        = number
  default     = 16
}

variable "log_level" {
  description = "Valid values are NONE, ERROR, WARN, INFO, DEBUG"
  type        = string

  validation {
    condition     = contains(["NONE", "ERROR", "WARN", "INFO", "DEBUG"], var.log_level)
    error_message = "log levels must be one of NONE, ERROR, WARN, INFO, DEBUG"
  }
}

So variables are declared using a variable {} block but they are referenced in Terraform code using the var prefix, for example var.log_level.

The minimum declaration would be the variable type; however, my company standard is also to provide a description that is meaningful to the maintainer.

The second declaration illustrates that you can choose to validate the input for your variables too.
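To see both declarations from the caller's side, here is a minimal sketch of a parent module passing values in. The module label and the source path are assumptions made for illustration.

```hcl
# Hypothetical call from a parent module; the source path is an assumption.
module "extraction" {
  source = "./modules/extraction"

  # Overrides the default of 16 declared in variables.tf.
  api_concurrency = 8

  # Has no default, so it must be supplied, and must pass the validation rule.
  log_level = "INFO"
}
```

Supplying a value outside the permitted list, such as "VERBOSE", would cause Terraform to stop and report the error_message from the validation block.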

A variable's type can be one of the following

Primitive types, which are number, string and bool

Collection types such as list, set or map

Structural types such as object.

These can be as complex or as simple as you need. For example if we wanted a variable that held a list of columns we might have something similar to the code below.

variable "column_list" {
  type = list(object({
    name           = string
    alias          = optional(string, null)
    type           = string
    required       = optional(bool, true)
    unique         = optional(bool, false)
    exclude_column = optional(bool, false)
  }))
}
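As a hedged illustration of how the optional attributes behave, a value for column_list might be supplied as below. The file name, column names and column types are hypothetical.

```hcl
# Hypothetical contents of a terraform.tfvars file.
column_list = [
  {
    # alias, required and exclude_column are omitted, so they take
    # their declared defaults of null, true and false respectively.
    name   = "customer_id"
    type   = "integer"
    unique = true
  },
  {
    name     = "email_address"
    alias    = "email"
    type     = "string"
    required = false
  }
]
```

Only name and type have to be present in every element; anything else falls back to the default given in optional().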

Outputs

The convention is to name the file outputs.tf.

Just as variables represent the contract for the inputs to a module, outputs describe what the module will return. Using the example of our extraction module, we may also have an ingestion module that needs to know the values of some of the extraction outputs.

An output can be a primitive type such as a string, number or boolean, or the complex output from a module.

output "rds_instance" {
  value       = module.postgres_rds_instance
  description = "The full set of exposed properties for the Postgres RDS instance created within the module"
}

The minimum definition is the value, however a description meaningful to the users of the module is a sensible standard to adopt, particularly if you write modules that are to be shared as common components.

By default Terraform will echo all outputs to the terminal. If you have some outputs that you don't want visible then an additional sensitive = true parameter will stop that happening.
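A minimal sketch of a sensitive output follows. The master_password attribute is a hypothetical output of the postgres_rds_instance module, used here only to show where the parameter goes.

```hcl
output "rds_master_password" {
  # master_password is an assumed output name, not one confirmed by the module.
  value       = module.postgres_rds_instance.master_password
  description = "The master password for the Postgres RDS instance"

  # Masks the value in terminal output.
  sensitive = true
}
```

Note that this only hides the value from the terminal. It is still written to the Terraform state file, so access to the state needs protecting too.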

Locals

The convention is to name the file locals.tf.

This is where we can use the HCL in-built functions to read data structures and write out other data structures. I'll illustrate this with an example.

# This would normally be in the module variables.tf file.
variable "name" {
  description = "The list of strings that will make up resource names"
  type        = list(string)
}

# This would be in the module locals.tf file
locals {
  name = concat(var.name, ["api", "data"])

  data_content = templatefile("${path.module}/templates/api-data-retrieval.json", {
    lambda_function_arn   = module.api_data_lambda.aws_lambda_function_arn
    lambda_name           = join("_", local.name)
    support_email_address = var.support_email_address
    hostname              = var.api_base_url
  })
}

So we have our extraction module that calls our api_data module.

var.name is passed into our api_data module from our extraction module and contains a list defining the name of our overall application, say ["external", "extract"]

The concat function will join these lists together to produce ["external", "extract", "api", "data"]

templatefile is an inbuilt HCL function that allows us to read a file containing place markers and replace those place markers with the values we assign to them. In this case we are submitting

The output from the api_data_lambda module

The lambda name external_extract_api_data, which we get by joining our local.name elements with an underscore character.

Variables declared in the module, passed in using the var prefix.

The utility module

Almost all the modules I have seen are for spinning up infrastructure. At a minimum they comprise the following

At least one provider

Variables

At least one resource module

Possibly a data module

Optionally but usually outputs

Optionally locals

In older versions of Terraform the capability to make infrastructure more data driven was limited. Today we might have a JSON or YAML file that describes the properties we wish to use to generate GitHub repositories. Our locals file extracts and builds the sets of data required to be able to generate infrastructure in a loop. This has allowed us to come up with modules that don't spin up infrastructure; they just carry out the data transformation necessary to provide data for other modules to use.

Such a module will consist of the following

Variables for input

Locals to transform that input

Output to expose the transformed input.

We simply call this utility module from whatever module needs it.

module "utility_transforms" {
  source = "../utility_transforms"

  name              = var.name
  a_parameter       = var.an_input_variable
  another_parameter = var.another_parameter
}

module "our_resource_module" {
  source   = "github.com/some_reusable_module?ref=v1.0.0"
  for_each = module.utility_transforms.some_list_of_properties

  # rest of module
}
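The inside of such a utility module might look something like the sketch below. This is not the author's actual module: the variable name, the output name and the assumed shape of the YAML file (a top-level repositories list whose entries each have a name) are all hypothetical.

```hcl
# variables.tf: the input (hypothetical)
variable "repository_config_file" {
  description = "Path to a YAML file describing the repositories to create"
  type        = string
}

# locals.tf: the transformation
locals {
  # file and yamldecode are built-in functions. The file is assumed to
  # contain a top-level "repositories" list.
  raw_config = yamldecode(file(var.repository_config_file))

  # Key each entry by its name so that a caller can loop with for_each.
  repositories = {
    for repo in local.raw_config.repositories : repo.name => repo
  }
}

# outputs.tf: expose the transformed input
output "repositories" {
  description = "Repository properties keyed by repository name"
  value       = local.repositories
}
```

No resource blocks appear anywhere, so calling this module deploys nothing; it only reshapes data for the caller.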

Resources

The convention is for the entry point for a module to be called main.tf.

A resource is what creates or alters infrastructure. It takes two arguments, followed by a {} block containing the parameters we wish to pass. The arguments are as follows

The function name/API endpoint for the object we wish to create

A label that we can use to refer to the resource outputs. Again, remember a resource is also a module.

Earlier my Provider example used Github so we can now create a GitHub repository. The code in our main.tf file would contain the following

resource "github_repository" "blog_content" {
  name        = "sql_server_central_content"
  description = "Keeps graphics, example code and documents for SSC articles"
}

Only the name property is mandatory but where a resource provides a property to supply a description I strongly recommend you use it.

Once our GitHub repository has been created then we can create any other objects either in or related to that repository. These could include

.gitignore file

Standardised .github/workflows files

Repository rules

Tags

resource "github_repository_topics" "blog_tags" {
  repository = github_repository.blog_content.name
  topics     = ["content", "graphics"]
}

So why don't we just put repository = "sql_server_central_content" in the code to create repository topics? Terraform is a declarative language. By referencing github_repository.blog_content.name we create a dependency between the github_repository resource and the github_repository_topics resource, which means that the repository will be created first. If you hard-code the name then Terraform won't know that there is a relationship and might try to create the topics before there is a repository in which to create them. This would fail.
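If a hard-coded name really were unavoidable, Terraform's depends_on meta-argument offers a way to state the ordering explicitly. The sketch below is an alternative for illustration, not the approach this article recommends.

```hcl
resource "github_repository_topics" "blog_tags" {
  # Hard-coded name, so Terraform cannot infer the relationship by itself.
  repository = "sql_server_central_content"
  topics     = ["content", "graphics"]

  # Explicitly tell Terraform to create the repository first.
  depends_on = [github_repository.blog_content]
}
```

Referencing the attribute directly is still preferable because the dependency and the value stay in step if the repository is ever renamed.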

For GitHub the resources use the name of the repository, which we provide. For other Terraform providers such as AWS, resources have an identifier known as an ARN (Amazon Resource Name) which isn't known until the object is created.
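As a hedged AWS illustration of the same idea, the subscription below refers to the ARN of a topic that doesn't exist until Terraform creates it. The resource labels and topic name are hypothetical; var.support_email_address is assumed to be declared in variables.tf.

```hcl
resource "aws_sns_topic" "extract_alerts" {
  name = "external-extract-alerts"
}

resource "aws_sns_topic_subscription" "support_email" {
  # The ARN is only known after the topic is created, so it has to be
  # referenced rather than hard-coded. The reference also sets the order.
  topic_arn = aws_sns_topic.extract_alerts.arn
  protocol  = "email"
  endpoint  = var.support_email_address
}
```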

Data

As mentioned earlier, data modules are akin to SELECT statements. They have the same two arguments that a resource does but their parameters tend to be those items that, in database terms, would be indexed columns we would use in our WHERE clause.
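A minimal sketch of a data object using the GitHub provider follows. It assumes the repository already exists and was not necessarily created by this Terraform code; the labels are hypothetical.

```hcl
# Look up an existing repository; full_name plays the role of the WHERE clause.
data "github_repository" "existing_content" {
  full_name = "djp-corp/sql_server_central_content"
}

# Properties returned by the lookup are referenced with the data prefix.
output "existing_content_url" {
  description = "The web URL of the existing repository"
  value       = data.github_repository.existing_content.html_url
}
```

Nothing is created or altered here; Terraform only reads the repository's properties so other parts of the code can use them.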

The Terraform State file

So far I have not mentioned the most important Terraform component. That is the Terraform State file which is a big JSON file that keeps track of the infrastructure that your Terraform code has deployed.

As AWS users we keep our state files in an S3 bucket with versioning switched on so that we can retrieve previous versions of that file.
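A minimal sketch of pointing Terraform at an S3 bucket for its state is shown below. The bucket name, key and region are assumptions for illustration only.

```hcl
terraform {
  backend "s3" {
    # Hypothetical bucket, key and region.
    bucket = "djp-corp-terraform-state"
    key    = "extraction/terraform.tfstate"
    region = "eu-west-1"
  }
}
```

Versioning is a property of the bucket itself rather than of this backend block, so it has to be switched on for the bucket separately.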

The relationship between your code, state file and infrastructure can be summed up in the table below.

Artefact | Description
Your code files | What you want the infrastructure to be when Terraform carries out a deployment
Terraform State file | Terraform's view of your infrastructure as of the last deployment
Infrastructure | What your infrastructure actually is: infrastructure deployed by Terraform, plus other infrastructure deployed by other means

The diagram below shows what happens when we tell Terraform to produce an execution plan or to deploy infrastructure.