Gunnar Morling

Random Musings on All Things Software Engineering

O Kafka, Where Art Thou?

Posted at Nov 29, 2021

The other day, I came across an interesting thread in the Java sub-reddit, with someone asking: "Has anyone attempted to write logs directly to Kafka?". This triggered a number of thoughts and questions for me, in particular: how should an application deal with a failed attempt to send messages to Kafka, for instance due to some network connectivity issue? What do you do when you cannot reach the Kafka broker?

While the Java Kafka producer buffers requests internally (primarily for performance reasons) and also supports retries, it cannot buffer and retry indefinitely (or can it?), so I went to ask the Kafka community on Twitter how they would handle this situation:
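As a point of reference, the internal buffering and retrying is governed by a handful of producer settings. Here is a minimal sketch, with a made-up broker address and illustrative values, not a recommendation for any particular workload:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSetup {

    public static KafkaProducer<String, String> createProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // hypothetical broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // How long send() may block when the in-memory buffer is full or metadata is unavailable
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 10_000);
        // Size of the in-memory buffer holding records not yet sent to the broker (default 32 MB)
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 33_554_432L);
        // Upper bound on the total time for a send, including all retries (default two minutes)
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
        // Retries are effectively bounded by delivery.timeout.ms rather than by this count
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);

        return new KafkaProducer<>(props);
    }
}
```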

This question spawned a great discussion with tons of insightful replies (thanks a lot to you all!), so I thought I’d try and give an overview on the different comments and arguments. As with everything, the right strategy and solution depends on the specific requirements of the use case at hand; in particular, whether you can or cannot afford potential inconsistencies between the state perceived by the caller of your application, the application’s own state, and the state in the Kafka cluster.

As an example, let’s consider an application which exposes a REST API for placing purchase orders. Acknowledging such a request while actually failing to send a Kafka message with the purchase order to some fulfillment system would be pretty bad: the user would believe their order has been received and will be fulfilled eventually, whereas that’s actually not the case.

On the other hand, if the incoming request was safely persisted in a database, and a message is sent to Kafka only for logging purposes, we may be fine accepting this inconsistency between the user’s state ("my order has been received"), the application’s state (order is stored in the database), and the state in Kafka (log message got lost; not ideal, but not the end of the world either).

Understanding these different semantics helps to put the replies to the question into context. There’s one group of replies along the lines of "buffer indefinitely, block inbound requests until messages are sent", e.g. by Pere Urbón-Bayes:

This strategy makes a lot of sense if you cannot afford any inconsistency between the state of the different actors at all: e.g. when you’d rather tell the user that you cannot receive their purchase order right now, instead of being at the risk of telling them that you did, whereas you actually didn’t.
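In the purchase order example, that strategy boils down to sending synchronously and failing the inbound request if the record cannot be acknowledged in time. A minimal sketch, where the topic name, timeout, and the Response stand-in are made up for illustration:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderResource {

    private final Producer<String, String> producer;

    public OrderResource(Producer<String, String> producer) {
        this.producer = producer;
    }

    // Hypothetical REST handler: only acknowledge the order once Kafka has acknowledged the record
    public Response placeOrder(String orderId, String orderJson) {
        try {
            producer.send(new ProducerRecord<>("purchase-orders", orderId, orderJson))
                    .get(10, TimeUnit.SECONDS); // block until the broker acks, or give up
            return Response.accepted(orderId);
        }
        catch (ExecutionException | TimeoutException e) {
            // Kafka unreachable or send timed out: reject instead of pretending we stored the order
            return Response.unavailable("Cannot accept orders right now, please retry later");
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Response.unavailable("Cannot accept orders right now, please retry later");
        }
    }

    // Minimal stand-in for whatever response type the actual REST framework provides
    record Response(int status, String body) {
        static Response accepted(String body) { return new Response(202, body); }
        static Response unavailable(String body) { return new Response(503, body); }
    }
}
```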

What, though, if we don’t want to let the availability of a resource like Apache Kafka — which is used for asynchronous message exchanges to begin with — impact the availability of our own application? Can we somehow buffer requests in a safe way, if they cannot be sent to Kafka right away? This would allow us to complete the inbound request, while hopefully still avoiding any inconsistencies, at least eventually.

Now simply buffering requests in memory isn’t reliable in any meaningful sense of the word; if the producing application crashes, any unsent messages will be lost, making this approach no different, in terms of reliability, from working with acks=0, i.e. not waiting for any acknowledgements from the Kafka broker. It may be useful for pure fire-and-forget use cases, where you don’t care about delivery guarantees at all, but these tend to be rare.
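For comparison, the difference between these two modes comes down to the acks producer setting, shown here in isolation with all other settings omitted:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;

public class AcksExample {

    public static void main(String[] args) {
        // Fire-and-forget: the producer does not wait for any broker acknowledgement,
        // comparable in reliability to buffering records only in application memory
        Properties fireAndForget = new Properties();
        fireAndForget.put(ProducerConfig.ACKS_CONFIG, "0");

        // Durable writes: the leader acknowledges only once all in-sync replicas
        // have the record; use this when messages must not get lost
        Properties durable = new Properties();
        durable.put(ProducerConfig.ACKS_CONFIG, "all");
    }
}
```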

Multiple folks therefore suggested more reliable means of implementing such buffering, e.g. by storing un-sent messages on disk or by using some local, persistent queuing implementation. Some have built solutions using existing open-source components, as Antón Rodriguez and Josh Reagan suggest:
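Independently of the specific tools they mentioned, the general idea could look roughly like the following sketch: records which could not be sent after all retries are appended to a local file (the path and record format are made up here), to be replayed later by a separate process once Kafka is reachable again.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BufferingSender {

    private final Producer<String, String> producer;
    private final Path bufferFile; // e.g. a file on a persistent volume

    public BufferingSender(Producer<String, String> producer, Path bufferFile) {
        this.producer = producer;
        this.bufferFile = bufferFile;
    }

    public void send(String topic, String key, String value) {
        producer.send(new ProducerRecord<>(topic, key, value), (metadata, exception) -> {
            if (exception != null) {
                // Send failed after all retries: append the record to a local buffer file,
                // so a separate drainer process can replay it later
                appendToBuffer(topic, key, value);
            }
        });
    }

    private synchronized void appendToBuffer(String topic, String key, String value) {
        String line = topic + "\t" + key + "\t" + value + System.lineSeparator();
        try {
            Files.writeString(bufferFile, line, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
        catch (IOException e) {
            // At this point neither Kafka nor the local disk is available; fail loudly
            throw new UncheckedIOException(e);
        }
    }
}
```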

You could even think of having a Kafka cluster close by (which then may have different availability characteristics than your "primary" cluster, e.g. one running in another availability zone) and keeping everything in sync via tools such as MirrorMaker 2. Others, like Jonathan Santilli, create their own custom solutions by forking existing projects:

There also are ready-made wrappers around the producer, e.g. Wix’s Greyhound Kafka client library, which supports producing via local disk, as per Derek Moore:

But there be dragons! Persisting to disk actually won’t be any better, if it is for instance the ephemeral disk of a Kubernetes pod which gets destroyed after an application crash. But even when using persistent volumes, you may end up with an inherently unreliable solution, as Mic Hussey points out:

So it shouldn’t come as a surprise that people in this situation have been looking at alternatives, e.g. using DynamoDB or S3 as an intermediary buffer; the team around Natan Silnitsky working on Greyhound at Wix is currently exploring this option:
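As a rough illustration of that idea (the bucket name and key scheme are made up; this uses the AWS SDK for Java v2 and is not how Greyhound does it), a record which could not be sent might be parked in S3 for later replay:

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class S3Buffer {

    private final S3Client s3 = S3Client.create();

    // Hypothetical bucket and key scheme; a separate process would later read
    // the buffered objects and replay them to Kafka
    public void buffer(String topic, String key, String value) {
        s3.putObject(
                PutObjectRequest.builder()
                        .bucket("kafka-overflow-buffer")
                        .key(topic + "/" + key)
                        .build(),
                RequestBody.fromString(value));
    }
}
```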

At this point it’s worth thinking about failure domains, though. Say your application sits in its own network and cannot write to Kafka due to some network split; chances are that it cannot reach other services like S3 either. So another option could be to use a datastore close by as a buffer, for instance a replicated database running on the same Kubernetes cluster or at least in the same availability zone.

If this reminds you of change data capture (CDC) and the outbox pattern, you’re absolutely right; multiple folks made this point as well in the conversation, including Natan Silnitsky and R.J. Lorimer:

As Kacper Zielinski tells us, this approach is an example of a staged event-driven architecture, or SEDA for short:

In this model, a database serves as the buffer for persisting messages before they are sent to Kafka, which makes for a highly reliable solution, provided the right degree of redundancy is in place, e.g. in the form of replicas. In fact, if your application needs to write to a database anyway, "sending" messages to Kafka via an outbox table and CDC tools like Debezium is a great way to avoid any inconsistencies between the state in the database and Kafka, without incurring any unsafe dual writes.
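To make the outbox approach a bit more concrete, here is a minimal JDBC sketch; the table and column names are illustrative, loosely following the conventions of Debezium’s outbox event router. The order itself and the corresponding outbox event are inserted in one local transaction, and the CDC pipeline takes it from there:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

import javax.sql.DataSource;

public class OrderService {

    private final DataSource dataSource;

    public OrderService(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Business data and outbox event are written in one transaction; a CDC tool
    // such as Debezium then picks the event up from the outbox table and sends it to Kafka
    public void placeOrder(UUID orderId, String customerId, String orderJson) throws SQLException {
        try (Connection connection = dataSource.getConnection()) {
            connection.setAutoCommit(false);
            try {
                try (PreparedStatement insertOrder = connection.prepareStatement(
                        "INSERT INTO purchase_orders (id, customer_id, payload) VALUES (?, ?, ?)")) {
                    insertOrder.setObject(1, orderId);
                    insertOrder.setString(2, customerId);
                    insertOrder.setString(3, orderJson);
                    insertOrder.executeUpdate();
                }
                try (PreparedStatement insertOutbox = connection.prepareStatement(
                        "INSERT INTO outbox (id, aggregatetype, aggregateid, type, payload) VALUES (?, ?, ?, ?, ?)")) {
                    insertOutbox.setObject(1, UUID.randomUUID());
                    insertOutbox.setString(2, "order");
                    insertOutbox.setString(3, orderId.toString());
                    insertOutbox.setString(4, "OrderCreated");
                    insertOutbox.setString(5, orderJson);
                    insertOutbox.executeUpdate();
                }
                connection.commit();
            }
            catch (SQLException e) {
                connection.rollback();
                throw e;
            }
        }
    }
}
```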

But of course there is a price to pay here too: end-to-end latency will be increased when going through a database first and then to Kafka, rather than going to Kafka directly. You should also keep in mind that the more moving pieces your solution has, the more complex it will become to operate, and the more subtle, hard-to-understand failure modes and edge cases it will have.

Adam Kotwasinski makes an excellent point by stating that it’s not a question of whether things will go wrong, but only when they will go wrong, and that you need to have the right policies in place in order to be prepared for that:

In the end it’s all about trade-offs, probabilities and acceptable risks. For instance, would you receive and acknowledge that purchase order request as long as you can store it in a replicated database in the local availability zone, or would you rather reject it, as long as you cannot safely persist it in a multi-AZ Kafka cluster?

These questions aren’t merely technical ones any longer; they require close collaboration with product owners and subject matter experts in the business domain at hand, so as to make the most suitable decisions for your specific situation. Managed services with defined SLAs guaranteeing high availability can make the decisive difference here, as Vikas Sood mentions:

Thanks a lot again to everyone chiming in and sharing their experiences, this was highly interesting and insightful! Do you have further ideas and thoughts to share? Let me and the community at large know, either by leaving a comment below or by replying to the thread on Twitter. I’m also curious about your feedback on this format of putting a Twitter discussion into some expanded context. It’s the first time I’ve done this, and I’d be eager to know whether you find it useful or not. Thanks!