Effective TechOps with Error Catalogues

Maintaining applications is often challenging, especially if you haven’t built them yourself. You track down the source of an obscure error, spend too much time working out the root cause of some error message, or see a perfectly clear error message but have no clear way of resolving it. This is a recurring issue for teams maintaining software. This blog post doesn’t describe the solution, but a slightly better, pragmatic way of handling these types of errors.

Error codes in Microservices

The idea is an old, tried and tested one, but it seems pretty uncommon in the Java world of microservices, where logging warnings and errors seems to be the go-to solution. This approach builds on that logging, with the goal of making applications much easier to maintain. The idea is to attach a unique code to every error we log, with the purpose of creating a living document with resolutions for these kinds of errors.

Languages like TypeScript use a similar approach; an example can be seen in the TypeScript diagnostics source code.

Of course, preventing errors is a much better solution, but sometimes unexpected things happen. Instead of trying to prevent all errors, we focus on a quick way of resolving these problems.

An example of using error codes

Imagine an application which needs the weather information to operate correctly. Whenever an error happens in our (Kotlin) application, we follow a similar pattern to the code below:

try {
	return weatherService.findWeather(city)
} catch (e: WeatherLoadingException) {
	log.error("Couldn't load the weather for {}", city, e)
}

Now, whenever the weather couldn’t be loaded, an error is logged, and it will most likely end up in some alerting system. The most common reason for this error could be that there’s no weather information available for the city. Maybe the city was misspelled. Or maybe the city isn’t available in the weatherService. Perhaps there was a timeout while retrieving the weather, or maybe an API key expired. As the developer of this system, you probably know the common errors, and it’s not always possible to recover from them.

Now, imagine it’s not you maintaining this application, but someone else, in a different team, someone who has never contributed to developing this application. Would they have the same knowledge about this system as the original developers? Do they, based on a message like this, know what to do? Do they even know whether they have to do something, or is this just a warning from which the system will recover by itself? It’s hard to say, based on just an error message and an exception.

A solution to managing error codes

Instead, we’ve changed all the log.error and log.warn calls in our codebase and introduced the concept of error codes. The format of the error code is quite simple:

<short prefix for the microservice><sequential number>, e.g. WS0001

This format makes a code unique enough to be found in the code and in our documentation without conflicting with error codes from other projects or microservices, and it allows us to properly document our errors and their possible solutions. These possible solutions don’t have to be a static document; they can be a living document, updated with other people’s experiences. A complete solution (in Kotlin) could look like this:

enum class ErrorMessage(private val code: String, private val message: String) {
    WEATHER_LOADING_ERROR("WS0001", "Couldn't load the weather for {}"),
    CITY_NOT_SUPPORTED("WS0002", "City not supported");

    override fun toString(): String = "$code: $message"
}

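// Extension methods to make logging easier. Assumes the Log4j 2 API, whose
// {} placeholders match the messages above; a trailing Throwable in args is
// logged as the exception.
fun org.apache.logging.log4j.Logger.warn(message: ErrorMessage, vararg args: Any?) {
	warn(message.toString(), *args)
}
fun org.apache.logging.log4j.Logger.error(message: ErrorMessage, vararg args: Any?) {
	error(message.toString(), *args)
}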

An example usage:

try {
	return weatherService.findWeather(city)
} catch (e: WeatherLoadingException) {
	log.error(ErrorMessage.WEATHER_LOADING_ERROR, city, e)
}

Now, whenever an error happens, the following is logged:

WS0001: Couldn't load the weather for Melbourne
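Because the sequential numbers are handed out by hand, it’s easy to reuse one by accident. A small check, run as part of the build, can guard against that. The sketch below is only an illustration: CodedError is a hypothetical stand-in for the enum above (with the code made public), findCodeProblems is a made-up helper, and the regular expression assumes an uppercase prefix followed by exactly four digits.

```kotlin
// Hypothetical stand-in for the ErrorMessage enum, with `code` made public for the check.
enum class CodedError(val code: String, val message: String) {
    WEATHER_LOADING_ERROR("WS0001", "Couldn't load the weather for {}"),
    CITY_NOT_SUPPORTED("WS0002", "City not supported")
}

// Assumed format: an uppercase service prefix followed by a four-digit number.
private val CODE_FORMAT = Regex("^[A-Z]+\\d{4}$")

/** Returns a list of human-readable problems; an empty list means the codes are fine. */
fun findCodeProblems(codes: List<String>): List<String> {
    // Codes that do not follow the assumed <prefix><number> format.
    val malformed = codes.filterNot { CODE_FORMAT.matches(it) }
        .map { "Malformed error code: $it" }
    // Codes that occur more than once in the list.
    val duplicated = codes.groupingBy { it }.eachCount()
        .filterValues { it > 1 }
        .keys
        .map { "Duplicate error code: $it" }
    return malformed + duplicated
}

fun main() {
    val problems = findCodeProblems(CodedError.values().map { it.code })
    // Fails with the list of problems if any code is malformed or reused.
    check(problems.isEmpty()) { problems.joinToString("\n") }
    println("All ${CodedError.values().size} error codes are unique and well-formed")
}
```

Keeping all codes in a single enum is what makes this kind of check cheap: there is one place to look, both for the check and for the person searching the codebase for a code they saw in the logs.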

A code like WS0001 is unique enough to be linked to an entry in a living document, such as a wiki page, which could look like this:

| Code   | Description               | Resolution                  |
| WS0001 | Couldn’t load the weather | Call Susan from Weathercorp |
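One way to keep the catalogue from drifting away from the code is to generate its skeleton from the enum and leave only the Resolution column to be filled in by hand. The sketch below assumes exactly that workflow; CatalogueError and renderCatalogueSkeleton are hypothetical names, and the output simply mimics the table layout shown above.

```kotlin
// Hypothetical catalogue entry: the code and description live in the codebase,
// while resolutions are assumed to be maintained by hand in the wiki.
enum class CatalogueError(val code: String, val description: String) {
    WEATHER_LOADING_ERROR("WS0001", "Couldn't load the weather"),
    CITY_NOT_SUPPORTED("WS0002", "City not supported")
}

/** Renders a header plus one pipe-separated row per entry, with an empty Resolution cell. */
fun renderCatalogueSkeleton(entries: List<CatalogueError>): String {
    val header = "| Code | Description | Resolution |"
    val rows = entries.map { "| ${it.code} | ${it.description} | |" }
    return (listOf(header) + rows).joinToString("\n")
}

fun main() {
    // Prints the skeleton so it can be pasted into (or compared against) the wiki page.
    println(renderCatalogueSkeleton(CatalogueError.values().toList()))
}
```

A new code added to the enum then shows up as a row with an empty resolution, which is a visible reminder that someone still needs to write down what to do when it occurs.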

Error codes to the rescue

Perfect? Certainly not. Useful? I think so. Especially in bigger organisations, having an error catalogue helps resolve issues quite a bit. There’s less communication needed, there’s less frustration, and overall, the time to resolve errors has improved significantly for applications using an approach such as the one above. Feedback or questions? Leave a message below, or check out the demo project linked here.
