Using pre-commit to prevent simple mistakes

There is no level of experience or intelligence where we cease to make mistakes. The simpler the mistake the harder we kick ourselves. The good news is that the simple mistakes often have a simple means of ensuring that we do not make the same mistake again.

I saw a Tweet from someone kicking themselves for committing a large file to their git repository. An easy mistake to make, a bit clunky to fix but very easy to prevent. This is where the pre-commit tool is useful.

What is the pre-commit tool?

At its most basic it is a tool that runs various linters, code formatters and file checkers using git hooks. Git hooks are scripts you wish to run before (pre) or after (post) you carry out some Git action. In this case, before we git commit, hence the name pre-commit. When you try to do a git commit, if your files fail the checks you have configured, then your commit will be rejected.

Look at the list of supported hooks on the pre-commit website. There are a huge number covering most of the languages and many of the day-to-day file formats that a data engineer is likely to use.

Docker
File format validation (JSON, TOML, XML, YAML)
Python
R
SQL
Terraform

Why use pre-commit?

Many of the code quality checks you would want in a CI/CD pipeline have either an equivalent hook to allow you to run those checks locally. In some cases, the CI/CD pipeline uses the same hook. This means that before your code is submitted to your CI/CD pipeline you know that it is unlikely to fail for linting or code quality rules. I say "unlikely" because, for those tools where we can tweak the configuration, we must make sure that both our local and CI/CD configuration has the same settings.

My work computer is an M1 MacBook which is the most powerful work machine I have ever had. It easily outperforms the servers used to run CI/CD pipelines. If your pipeline is in the cloud then a CI/CD run will cost you money. This can be direct run cost or the cost of lost time when work is queueing up in the CI/CD pipeline. It can be frustrating to have to wait only to find that either your run fails or you were queued up behind something that took ages only to eventually fail. If a CI/CD pipeline run fails then whatever broke it will have to be fixed and will have to queue up to run again. So briefly the reasons for using pre-commit are as follows:-

Speed
Cost saving
Consistency
Quality enforcement
Security best practice detection

Getting started with pre-commit

After following the installation instructions you will have to create a file called .pre-commit-config.yaml. This tells the tool where to get git hooks from and which hooks you wish to use.

As I mentioned earlier, I work on a MacBook so the Windows equivalent commands will be similar but with some differences. Fortunately the Windows Linux Subsystem will give you a near identical experience.

The tool allows you to generate a small configuration simply by running the following command.

pre-commit sample-config > .pre-commit-config.yaml

Your .pre-commit-config.yaml file should look similar to the one shown below.

# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.2.0
    hooks:
    -   id: trailing-whitespace
    -   id: end-of-file-fixer
    -   id: check-yaml
    -   id: check-added-large-files

These four hooks do the following

Remove any trailing whitespace from the end of a line of code
Ensure that the last character in a file is a new line character
Make sure that any yaml files are valid yaml
Enforce a maximum size of 500Kb (the default setting) for the files you try to commit.

The latter hook would have helped our Tweeter. As mentioned earlier, some hooks have optional parameters to allow us to tweak their behaviour. For example, if we wanted our check-added-large-files hook to prevent us committing files larger than 100Kb then the parameter for this would appear as follows.

- id: check-added-large-files
      args: ['--maxkb=100']

To attach the pre-commit hooks to your project run the following command.

pre-commit install

This will make sure that whenever you run a git commit the files in that commit will be scanned by the hooks you chose. If any of the hooks signal a failure then the commit will be prevented.

You can run the hooks for all the files in your project that git knows about, not just the ones in a commit. This us useful when you want to run a check without having to go through git add, git commit.

 pre-commit run --all-files

If you haven't run pre-commit install the above command will still work though failures will not prevent you from committing the files.

Dettaching pre-commit from your commits

Why would you want to? If you make changes to .pre-commit-config.yaml and test the changes you can find that your change affects more files than you intended. Ideally you want to git add and git commit just the changes to .pre-commit-config.yaml and then git add and git commit the other changes as their own commit.

For Python programming I use the MyPy type checker in a git hook. Trying to comply with MyPy can be a frustrating experience so again, when experimenting, I detach pre-commit using the command below.

pre-commit uninstall

Pre-commit hooks for SQL code

For linting SQL code and for fixing any linting errors detected SQLFluff is a popular package with two hooks

sqlfluff-lint
sqlfluff-fix

We would add these to our .pre-commit-config.yaml file as follows

-   repo: https://github.com/sqlfluff/sqlfluff
    rev: 1.4.5
    hooks:
    -   id: sqlfluff-lint
        args: [ '--dialect=tsql' ]
    -   id: sqlfluff-fix
        args: [ '--dialect=tsql' ]

This will check your .sql files against a large set of SQL code style rules. You may find that some of the rules are not suitable for what your particular scenario.

Telling SQLFluff to ignore rules.

As with other linters, SQLFluff does have the capability for you to configure which rules to ignore. As a principle I try not to use rule avoidance, and if I must, to do it in specific circumstances only. I have found rule avoidance to be a slippery slope.

For specific cases we can use an inline comment in our SQL code such as -- noqa: L044. This would ignore a SELECT * as a rules violation.

Even inline, instructions can be more sophisticated than simply ignoring a single rule or single violation of a rule. The SQLFluff documentation covers this quite well.

Where we want to configure SQLFluff's behaviour as a set of general rules then we can use a .sqlfluff configuration file. Python programmers have the option to use pyproject.toml instead.

Housekeeping for pre-commit

When new versions of your chosen hooks become available you will want to upgrade to those versions. For any repo the assocatied rev: line tells us what version we are using. Fortunately pre-commit provides a command to update the versions.

pre-commit autoupdate

This command is also useful if you do not know what version is appropriate for the hooks you wish to use. If in doubt. Set rev: 0.0.1 and run pre-commit autoupdate.

Periodically, I also run two clean up commands to get rid of old cached versions of the hooks.

pre-commit clean
pre-commit gc

I use pre-commit in all my projects so have a script to run a pre-commit autoupdate in ever repo in which it is installed.

Concluding thoughts

I have found that the pre-commit tool provides a good 1st line of defence for codes style and standard violations. It is good for having a shared set of rules across the team or beyond.

Many open-source repositories use it too which I take to be an endorsement.

Many of the hooks are written in Python or Ruby with the source-code being available both for modification and also as an educational resource. As an open-source project, if a hook doesn't quite have the facilities you require there is nothing to stop you applying your skills to add those facilities. If you are successful then you should consider contributing your modification back to the main code base.

Using pre-commit to prevent simple mistakes

What is the pre-commit tool?

Why use pre-commit?

Getting started with pre-commit

Dettaching pre-commit from your commits

Pre-commit hooks for SQL code

Telling SQLFluff to ignore rules.

Housekeeping for pre-commit

Concluding thoughts

Rate

Share

Categories

Share

Rate

Using pre-commit to prevent simple mistakes

What is the pre-commit tool?

Why use pre-commit?

Getting started with pre-commit

Dettaching pre-commit from your commits

Pre-commit hooks for SQL code

Telling SQLFluff to ignore rules.

Housekeeping for pre-commit

Concluding thoughts

Rate

Share

Categories

Share

Rate

Related content

Why Standards are Important

Creating and Using Inline Table-Valued Functions

Index Column Order – Be Happy!

Replicate NetSuite Data to Azure SQL (SSIS in Azure Data Factory)

Some of my greatest mistakes

Cookies on SQLServerCentral