Throughout my career I have seen arguments as to which IT standard in a particular area is the one true standard. Searching any forum for discussions on standards reveals that these debates frequently become acrimonious and heated. In effect, the combatants have re-imagined Gulliver's Travels in an IT context, waging war over which end of the egg to crack.
The friction of cognitive load
People in tech often have mixed feelings about standards, so why do we use them? Simply put, we use them to reduce unproductive cognitive load and to avoid introducing friction where and when it is least desirable. In other words, we want to focus on what the code is doing rather than how it is written.
Remember, WIP (work in progress) has no value. Delivery is where we earn our daily bread. You must deal ruthlessly with anything that introduces friction into the delivery process. When you deviate from a standard you introduce friction, and I don't want that friction when I have to solve a production support incident.
Wherever I have worked I have adopted the standards my employer or client required of me. In some cases I have been part of the process of specifying what that standard should be. Yes, I have personal preferences, but it is not about what I want; it is about what is best for the team.
From the perspective of diagnosing production support issues there are three things that introduce friction into the process:
- Non-existent or poorly thought out metrics, logs and traces
- Documentation that is non-existent or represents random brain activity
- Inconsistent coding styles and standards
Logging and metrics
I do a lot of software archaeology, both within my own organisation and for clients. My experience has been that there is a gap where there should be a company standard for logging and metrics. This is true for databases, data engineering pipelines and software engineering in general.
Production incidents are stressful, even when you have good information. The nightmare scenario is having no log messages, misleading messages, and no indication of where a system is under stress.
Give SQL Server credit for telling us what is wrong from the SQL Server perspective. Similarly, Python 3.13 has improved its error messages from a language perspective. However, those messages do not give you context from an overall system perspective, and especially not from a business perspective.
Logging standards
In terms of existing logging standards, SQL Server gives us severity levels. We can use RAISERROR to specify the severity level of an error message we wish to generate. THROW statements are always at severity level 16, unless re-throwing a caught exception, in which case the original severity is preserved.
For application/data pipeline code we have log levels, typically one of the following.
| LOG LEVEL | DESCRIPTION |
| --- | --- |
| TRACE | Extremely fine grained messages. Personally, I find these so verbose that finding relevant information is like looking for a needle in a haystack. |
| DEBUG | Messages that confirm things work at a fine grain. This can be down to the level of individual functions or parts of functions. |
| INFO | Generally high level confirmation that things work. This is usually at the level of components. |
| WARNING | Things that happen that shouldn't, but the code handles them. |
| ERROR | Something serious is broken. |
| CRITICAL | Something is seriously broken. The system cannot continue and must terminate. |
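As a point of reference, here is a minimal sketch of how those levels map onto Python's standard logging module (Python has no built-in TRACE level); the logger name and messages are purely illustrative:

```python
import logging

# Python's logging module has no TRACE level; DEBUG is its finest grain
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline.ingest")  # illustrative logger name

log.debug("Parsed 500 rows from chunk 3")            # fine grained; usually off in prod
log.info("Ingestion component started")              # component-level confirmation
log.warning("Duplicate key skipped for order 1234")  # shouldn't happen, but handled
log.error("Could not reach the source API after 3 retries")
log.critical("No database connection available; terminating run")
```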
Those standards are just the foundation capabilities. Built on those foundations we need to define logging standards for our company's use. Consider the following when defining a standard for log entries and messages:
- We are trying to reduce our cognitive load while diagnosing a problem.
- Consider the audience. Who does the message help?
- When is the message useful and in what circumstance?
- In a worst case scenario, how often will that message be produced? You don't want a tsunami of messages flooding your logs, which can create problems in its own right.
- What information is useful in the log entry?
Include an execution id of some sort with your log entries, as this allows you to follow the path of interest through the logs. For SQL Server, the session id (@@SPID, or SESSION_ID() in Azure Synapse) may provide a suitable proxy.
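For application code, a minimal Python sketch of attaching an execution id to every log entry; the execution_id field and logger name are my own illustration, not an established standard:

```python
import logging
import uuid

# Surface the execution id in every log line via the format string
logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(execution_id)s] %(name)s: %(message)s",
    level=logging.INFO,
)

# LoggerAdapter injects the id into every record without repeating it per call
log = logging.LoggerAdapter(
    logging.getLogger("pipeline.ingest"),
    extra={"execution_id": uuid.uuid4().hex},  # one id per pipeline run
)

log.info("Ingestion started")
log.warning("Row 42 rejected: missing mandatory field")
```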
Why system and business context is important
The table below shows a somewhat contrived example.
| Context | Example |
| --- | --- |
| Bare minimum | Non-existent file, numpty.json |
| System | CONFIG FILE MISSING: numpty.json |
| Business | CONFIG FILE MISSING: numpty.json for Data Provider Experian |
By trapping exceptions and re-throwing them with a message that includes the system and/or business context we gain information that speeds up diagnosis.
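In Python that might look like the following minimal sketch, reusing the contrived file and provider from the table above:

```python
import json

def load_provider_config(path: str, provider: str) -> dict:
    """Load a data provider's config, adding system and business context on failure."""
    try:
        with open(path, encoding="utf-8") as config_file:
            return json.load(config_file)
    except FileNotFoundError as exc:
        # Re-throw with context instead of letting the bare OS error escape
        raise FileNotFoundError(
            f"CONFIG FILE MISSING: {path} for Data Provider {provider}"
        ) from exc

# Raises: CONFIG FILE MISSING: numpty.json for Data Provider Experian
config = load_provider_config("numpty.json", "Experian")
```

Using `raise ... from exc` keeps the original traceback attached, so no diagnostic information is lost by adding the context.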
Business context is also important because it enables you to talk to the right people to get the problem resolved. Remember, the longer it takes you to determine the cause, the more likely people are to assume you are responsible and accountable for the issue.
Metrics standards
I started my career with SQL Server 6.5, and as far as I am aware SQL Server has always been a good citizen when it comes to emitting metrics. Metrics are numeric facts emitted by an application; look at sys.dm_os_performance_counters if you are not familiar with what is available. Think about what you would need in your data pipeline application to be able to spot when all is not well or the system is coming under stress.
In a cloud environment we have two major weapons in our arsenal:
- OpenTelemetry, an open-source framework and standard for collecting metrics (along with logs and traces). AWS, Azure and GCP all have tooling to collect and report on OpenTelemetry metrics. See the sketch after this list.
- Tagging and/or labelling, which provides a means to slice, dice and aggregate metrics
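As a minimal sketch of both points together, using the OpenTelemetry Python SDK; the meter, metric and attribute names are my own illustration:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Export to the console here; in the cloud you would swap in the
# AWS/Azure/GCP collector of your choice
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("data.pipeline")
files_ingested = meter.create_counter(
    "pipeline.files.ingested", description="Files successfully ingested"
)

# Attributes play the role of tags/labels for slicing, dicing and aggregating
files_ingested.add(1, {"application": "etl-tool", "environment": "prod"})
```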
Tagging standards
The ability to tag or label components is out-of-the-box functionality in the big three cloud providers. Which tags or labels to use is a standard for your organisation to define. Examples could be as follows.
- Application (ETL Tool)
- Application Role (Ingestion)
- Environment (Dev, Test, Prod etc)
- Cost Centre
What happens if you don't use tags/labels? As your system becomes more complex, drilling in to find the problem area becomes ever more difficult. If you don't have an existing standard, start simple and extend if needed. Do not start with an all-encompassing tagging scheme only to find that you have borne the cost and complexity and got very little out of it.
You need a means of cataloguing both the tags/labels and their values. If you don't have this you will end up with a lot of similar-but-not-identical items, which prevents you from gaining an accurate picture from your metrics. Where a tagging/labelling standard exists I would expect all components in a cloud based data pipeline to be tagged and labelled correctly.
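A minimal sketch of what that catalogue might look like in Python, with validation to stop near-duplicates creeping in; the tag names and allowed values are illustrative only:

```python
# Illustrative tag catalogue: names and values are examples, not a recommendation
ALLOWED_TAGS = {
    "application": {"etl-tool", "reporting"},
    "application_role": {"ingestion", "transform", "load"},
    "environment": {"dev", "test", "prod"},
    "cost_centre": None,  # free text, catalogued elsewhere
}

def validate_tags(tags: dict[str, str]) -> list[str]:
    """Return problems found, so near-duplicates never reach your metrics."""
    problems = [f"unknown tag: {key}" for key in tags if key not in ALLOWED_TAGS]
    for key, value in tags.items():
        allowed = ALLOWED_TAGS.get(key)
        if allowed is not None and value not in allowed:
            problems.append(f"bad value for {key}: {value}")
    return problems

print(validate_tags({"environment": "Production"}))  # ['bad value for environment: Production']
```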
Metric Collection
What metrics would I expect to collect as a minimum standard?
- Every component in a data pipeline emits a count of its successes and failures
- Every component emits a duration metric for the processes within it
- Every metric is documented so people know how to interpret it correctly
There are many ways for a component to fail, so taking the standard beyond success/failure requires you to consider the granularity of failure:
- Missing input
- Input that is an incorrect format (JSON when CSV was expected)
- Incomplete input
- Input that is too large
- ...etc
Depending on the impact of collecting the information, I would also consider emitting the number of records flowing through a component.
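Pulling that together, a hedged sketch of the minimum standard in Python with OpenTelemetry; it assumes a MeterProvider configured as in the earlier sketch, and the metric names, attributes and failure modes are illustrative:

```python
import time

from opentelemetry import metrics

meter = metrics.get_meter("data.pipeline")  # no-op unless a provider is configured
runs = meter.create_counter("component.runs", description="Success/failure counts")
duration = meter.create_histogram("component.duration", unit="s",
                                  description="Elapsed time per run")
records = meter.create_counter("component.records", description="Records processed")

def run_component(name, work, batch):
    """Run one pipeline component, emitting the minimum standard of metrics."""
    start = time.monotonic()
    try:
        processed = work(batch)
        records.add(len(processed), {"component": name})
        runs.add(1, {"component": name, "outcome": "success"})
        return processed
    except ValueError:
        # One granular failure mode, e.g. JSON arrived where CSV was expected
        runs.add(1, {"component": name, "outcome": "failure",
                     "failure_mode": "bad_format"})
        raise
    finally:
        duration.record(time.monotonic() - start, {"component": name})

run_component("transform", lambda rows: [r.upper() for r in rows], ["a", "b"])
```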
A consistent approach to documentation
My career started with SQL Server 6.5 and I used SQL Server until 2014. I used SQL Server BOL (Books Online) heavily and absorbed a lot of information rapidly because of the standards applied by its authors. This is another example of consistency reducing cognitive load and friction.
Help your database users by introducing a standard for annotating your databases. For SQL Server, use the MS_DESCRIPTION extended property, which tools such as Redgate SQL Doc harvest to provide database documentation in a consistent, easily consumed format. Back in 2014 I wrote "Re-oiling the gears for the data dictionary or catalog" to make this task less onerous. It shows how careful phrasing of MS_DESCRIPTION properties allows you to apply them automatically across the database. You can apply this technique to most databases, including Postgres and its derivatives.
The beauty of being able to query the database system catalog is that you can measure completeness of database documentation.
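For example, a hedged Python sketch that measures table documentation coverage via pyodbc; the connection string is a placeholder and the query assumes SQL Server's catalog views:

```python
import pyodbc

# Count tables with and without an MS_DESCRIPTION extended property
QUERY = """
SELECT
    COUNT(*) AS tables_total,
    SUM(CASE WHEN ep.value IS NOT NULL THEN 1 ELSE 0 END) AS tables_documented
FROM sys.tables AS t
LEFT JOIN sys.extended_properties AS ep
    ON ep.class = 1                 -- object/column level properties
   AND ep.major_id = t.object_id
   AND ep.minor_id = 0              -- the table itself, not a column
   AND ep.name = 'MS_DESCRIPTION';
"""

with pyodbc.connect("DSN=MyWarehouse") as conn:  # placeholder connection
    total, documented = conn.cursor().execute(QUERY).fetchone()
    print(f"{documented}/{total} tables documented")
```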
For those of you who use dbt, harvest these properties to provide dbt docs with the information it needs to incorporate their content into its generated documentation. Take a look at dbt Labs' Jaffle Shop example to see the sort of documentation it produces.
Some of us have to take part in an auditing regime. When you have an automated means of generating documentation, both the demonstration of the process and its result are powerful artefacts in a governance regime. Auditing is painful for all involved; auto-generated documentation is definitely a lubricant rather than a source of friction.
Code consistency and cognitive load
I have saved this until last because, as I said in the opening paragraph, discussions can become heated and acrimonious. Even when you have two well thought out, strongly expressed but opposing viewpoints on coding standards, both combatants are trying to achieve the same thing. Code that follows the same style allows you to focus on what the code is doing rather than taxing your brain interpreting its style.
Faced with the need to resolve two opposing, strongly held views we ask ourselves:
- Who is right? Rhetorical; both views are well thought out
- Does it matter? Also rhetorical. Flip the coin and move on!
For me the only relevant questions are:
- What needs to change?
- What are you going to do about code that already exists?
- Can your standard be enforced and/or applied mechanically?
Mechanical application, enforcement and monitoring
The ability to use mechanical means to govern a standard is a big deal. It eliminates the pointless arguing. Computer says NO!
This is one of the reasons I am a big fan of Python with its PEP conventions, especially PEP 8 for style (and PEP 20 for philosophy). Enforce the style rules using tools such as Ruff. There is no point arguing with mechanical enforcement, especially when it is built into a CI/CD pipeline.
For SQL we have SQLFluff, with its support for many SQL dialects and for templated SQL as used in dbt. Visual Studio Code also has extensions that integrate SQLFluff. There is more scope for choice and disagreement within the SQLFluff rules, though the thrust of my argument remains the same: pick your rules and make them an unavoidable part of your pipeline, thus eliminating an unproductive argument.
No code formatting/linting tool can be all-encompassing, so linters and formatters usually have a mechanism for ignoring rule violations in special cases. For this reason monitoring is important: you need to know if the ability to ignore a coding standards rule is being abused.
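A minimal sketch of that monitoring in Python; the suppression markers cover Ruff/flake8-style "# noqa" and SQLFluff's "-- noqa", and the path is an assumption:

```python
from pathlib import Path

# Suppression markers for Python (Ruff/flake8) and SQL (SQLFluff)
MARKERS = {".py": "# noqa", ".sql": "-- noqa"}

def count_suppressions(root: str) -> int:
    """Count lint-suppression comments so their growth can be tracked in CI."""
    total = 0
    for path in Path(root).rglob("*"):
        marker = MARKERS.get(path.suffix)
        if marker:
            total += path.read_text(encoding="utf-8", errors="ignore").count(marker)
    return total

print(count_suppressions("src"))  # alert when this number trends upwards
```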
Closing thoughts
There's a joke about why there were 'n' standards for web browsers and how what was needed was one final standard to unify them all. The result, of course, was 'n+1' standards.
I dealt with BS 5750 and ISO 9001 early in my career. Those standards were verbose and attempted to record every process in fine grained detail. They would be useful for replacing people with robots but weren't really useful for the human beings trying to do their jobs. As one director put it, "...this doesn't say anything about the standard of our product, we could be producing absolute crap. This just tells people we have an excellent standard for producing that crap".
Then there is the natural rejection of standards where people have a different opinion. I am mindful of the saying that "by the time a man thinks his father might have been right he has a son who thinks he is wrong".
Standards adoption requires the following:
- The standard must be useful and relevant, and shown to be such
- The team must view the standard as being useful and relevant
- Ownership of the standard must sit within the team
- Conflicts of opinion must be resolved quickly, preferably within the team
- Mechanical measurement and enforcement of the standard must be in place
Initially there will be a cost to adopting a standard. As it becomes "just the way we work" you should observe some measurable benefit from adoption.