What's in a Good Error Message?
Update Jan 13: This post is discussed on Reddit
Update Feb 7: This post is discussed on Hacker News
As software developers, we’ve all come across those annoying, not-so-useful error messages when using some library or framework: "Couldn’t parse config file", "Lacking permission for this operation", etc. Ok, ok, so something went wrong apparently; but what exactly? What config file? Which permissions? And what should you do about it? Error messages lacking this kind of information quickly create a feeling of frustration and helplessness.
So what makes a good error message then? To me, it boils down to three pieces of information which should be conveyed by an error message:
Context: What led to the error? What was the code trying to do when it failed?
The error itself: What exactly failed?
Mitigation: What needs to be done in order to overcome the error?
Let’s dive into these individidual aspects a bit. Before we start, let me clarify that this is about error messages created by library or framework code, for instance in form of an exception message, or in form of a message written to some log file. This means the consumers of these error messages will typically be either other software developers (encountering errors raised by 3rd party dependencies during application development), or ops folks (encountering errors while running an application).
That’s in contrast to user-facing error messages, for which other guidance and rules (in particular in regards to security concerns) should be applied. For instance, you typically should not expose any implementation details in a user-facing message, whereas that’s not that much of a concern — or on the contrary, it can even be desirable — for the kind of error messages discussed here.
In a way, an error message tells a story; and as with every good story, you need to establish some context about its general settings. For an error message, this should tell the recipient what the code in question was trying to do when it failed. In that light, the first example above, "Couldn’t parse config file", is addressing this aspect (and only this one) to some degree, but probably it’s not enough. For instance, it would be very useful to know the exact name of the file:
Couldn’t parse config file: /etc/sample-config.properties"
Using an example from Debezium, the open-source change data capture platform I am working on in my day job, the second message could read like so with some context about what happened:
Failed to create an initial snapshot of the data; lacking permission for this operation
Coming back to error messages related to the processing of some input or configuration file, it can be a good idea to print the absolute path. In case file system resources are provided as relative paths, this can help to identify wrong assumptions around the current working directory, or whatever else is used as the root for resolving relative paths. On the other hand, in particular in case of multi-tenant or SaaS scenarios, you may consider filesystem layouts as a confidential implementation detail, which you may prefer to not reveal to unknown code you run. What’s best here depends on your specific situation.
If some framework supports different kinds of files, the specific kind of the problematic file in question should be part of the message as well: "Couldn’t parse entity mapping file…". If the error is about specific parts of the contents of a file, displaying the line number and/or the line itself is a good idea.
In terms of how to convey the context of an error, it can be part of messages themselves, as shown above. Many logging frameworks also support the notion of a Mapped Diagnostic Context (MDC), a map for propagating arbitrary key/value pairs into log messages. So if your messages are meant to show up in logs, setting contextual information to the MDC can be very useful. In Debezium this is used for instance to propagate the name of the affected connector, allowing Kafka Connect users to tell apart log messages originating from different connectors deployed to the same Connect cluster.
|As far as propagating contextual information via log messages is concerned (as opposed to, say, error messages printed by a CLI tool), structured logging, typically in form of JSON, simplifies any downstream processing. By putting contextual information into separate attributes of a structured log entry, consumers can easily filter messages, ingest only specific sub-sets of messages based on their contents, etc.|
In case of exceptions, the chain of exceptions leading to the root cause is an important contextual information, too. So I’d recommend to always log the entire exception chain, rather than catching exceptions and only logging some substitute message instead.
The Error Itself
On to the next part then, the description of the actual error itself. That’s where you should describe what exactly happened in a concise way. Sticking to the examples above, the first message, including context and error description could read like so:
Couldn’t parse config file: /etc/sample-config.properties; given snapshot mode 'nevr' isn’t valid
And for the second one:
Failed to create an initial snapshot of the data; database user 'snapper' is lacking the required permissions
Other than that, there’s not too much to be said here; try to be efficient: make messages as long as needed, and as short as possible. One idea could be to work with different variants of messages for the same kind of error, a shorter and a longer one. Which one is used could be controlled via log levels or some kind of "verbose" flag. Java developers may find Cédric Champeau’s jdoctor library useful for implementing this. Personally, I haven’t used such an approach yet, but it may be worth the effort for specific situations.
Having established the context of the failure and what went wrong exactly, the last — and oftentimes most interesting — part is a description of how the user can overcome the error. What’s the action they need to take in order to avoid it? This could be as simple as telling the user about the constraints and/or valid values in case of the config file example (i.e. akin to test failure messages, which show both expected and actual values):
Couldn’t parse config file: /etc/sample-config.properties; given snapshot mode 'nevr' isn’t valid (must be one of 'initial', 'always', 'never')
In case of the permission issue, you may clarify which ones are needed:
Couldn’t take database snapshot: database user 'snapper' is lacking the required permissions 'SELECT', 'REPLICATION'
Alternatively, if longer mitigation strategies are required, you may point to a (stable!) URL in your reference documention which provides the required information:
Couldn’t take database snapshot: database user 'snapper' is lacking the required permissions. Please see https://example.com/knowledge-base/snapshot-permissions/ for the complete set of necessary permissions
If some configuration change is required (for instance database or IAM permissions), your users will love you even more if you share that information in "executable" form,
for instance as GRANT statements which they can simply copy,
or vendor-specific CLI invocations such as
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/SomePolicy --role-name SomeRole.
Speaking of external resources referenced in error messages, it’s a great idea to have unique error codes as part of your messages (such as Oracle’s ORA codes, or the error messages produced by WildFly and its components). Corresponding resources (either provided by yourself, or externally, for instance in answers on StackOverflow) will then be easy to find using your favourite search engine. Bonus points for adding a reference to your own canonical resource right to the error message itself:
Couldn’t take database snapshot: database user 'snapper' is lacking the required permissions (DBZ-42). Please see https://dbz.codes/dbz-42/ for the complete set of necessary permissions
(That’s a made-up example, we don’t make use of this approach in Debezium currently; but I probably should look into buying the dbz.codes domain 😉).
The key take-away is that you should not leave your users in the dark about what they need to do in order to address the error they ran into. Nothing is more frustrating than essentially being told "You did it wrong!", without getting hinted at what’s the right thing to do instead.
General Best Practices
Lastly, some practices in regards to error messages which I try to adhere to, and which I would generally recommend:
Uniform voice and style: The specific style chosen doesn’t matter too much, but you should settle on either active vs. passive voice ("couldn’t parse config file" vs. "config file couldn’t be parsed"), apply consistent casing, either finish or not finishes messages with a dot, etc.; not a big thing, but it will make your messages a bit easier to deal with
One concept, one term: Avoid referring to the same concept from your domain using different terms in different error messages; similarly, avoid using the same term for multiple things. Use the same terms as in other places, e.g. your API documentation, reference guides etc.; The more consisent and unambiguous you are, the better
Don’t localize error messages: This one is not as clear cut, but I’d generally recommend to not translate error messages into other languages than English; Again, this all is not about user-facing error messages, but about messages geared towards software developers and ops folks, who generally should command reasonable English skills; depending on your audience and target market, translations to specific languages might make sense, in which case a common, unambiguous error code should definitely be part of messages, so as to facilitate searching for the error on the internet
Don’t make error messages an API contract: In case consumers of your API should be able to react to different kinds of errors, they should not be required to parse any error messages in order to do so. Instead, raise an exception type which exposes a machine-processable error code, or raise specific exception types which can be caught separately by the caller
Be cautious about exposing sensitive data: if your library is in the business of handling and processing sensitive user data, make sure to to not create any privacy concerns; for instance, "show actual vs. expected value" may not pose a problem for values provided by an application developer or administrator; but it can pose a problem if the actual value is GDPR protected user data
Either raise an exception OR log an error, but not both: A given error should either be communicated by raising an exception or by logging an error. Otherwise, when doing both, as the exception will typically end up being logged via some kind of generic handler anyways, the user would see information about the same error in their logs twice, which only adds confusion
Fail early: This one is not so much about how to express error messages, but when to raise them; in general, the earlier, the better; a message at application start-up beats one later at runtime; a message at build time beats one at start-up, etc. Quicker feedback makes for shorter turn-around times for fixes and also helps to provide the context of any failures
With that all being said, what’s your take on the matter? Any best practices you would recommend? Do you have any examples for particularly well (or poorly) crafted messages? Let me know in the comments below!