"You Don't Need Kafka, Just Use Postgres" Considered Harmful
Looking to make it to the front page of HackerNews? Then writing a post arguing that "Postgres is enough", or why "you don’t need Kafka at your scale", is a pretty surefire way of achieving exactly that. No matter how often it has been discussed before, this topic always does well. And sure, what’s not to love about that? I mean, it has it all: Postgres, everybody’s favorite RDBMS—check! Keeping things lean and easy—sure, count me in! A somewhat spicy take—bring it on!
The thing is, I feel all these articles kinda miss the point: Postgres and Kafka are tools designed for very different purposes, and naturally, which tool to use depends very much on the problem you actually want to solve. To me, the advice "You Don’t Need Kafka, Just Use Postgres" does more harm than good, leading to systems built in a less-than-ideal way, and I’d like to discuss why that is in more detail in this post. Before getting started though, let me get one thing out of the way really quick: this is not an anti-Postgres post. I enjoy working with Postgres as much as the next person (for the use cases it is meant for). I’ve used it in past jobs, and I’ve written about it on this blog before. No, this is a pro-"use the right tool for the job" post.
So what’s the argument of the "You Don’t Need Kafka, Just Use Postgres" posts? Typically, they argue that Kafka is hard to run, expensive to run, or both. When you don’t have "big data", this cost may not be justified. And if you already have Postgres as a database in your tech stack, why not keep using it instead of adding yet another technology?
Usually, these posts then go on to show how to use SELECT ... FOR UPDATE SKIP LOCKED for building a… job queue. Which is where things already start to make a bit less sense to me, because queuing just isn’t a typical use case for Kafka to begin with: it requires message-level consumer parallelism as well as the ability to acknowledge individual messages, something Kafka historically has not supported. Now, the Kafka community is actually working towards queue support via KIP-932, but this is not quite ready for primetime yet. Until then, the argument boils down to not using Kafka for something it was not designed for in the first place. Hm, yeah, ok?
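For illustration, the pattern those posts typically demonstrate looks roughly like the following minimal JDBC sketch, assuming a hypothetical jobs table with id, payload, and status columns and the Postgres JDBC driver on the classpath (connection details are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PostgresJobQueueWorker {

    // Claims one pending job, processes it, and marks it as done, all within a
    // single transaction; SKIP LOCKED lets concurrent workers pick different
    // rows instead of blocking on the same one.
    static void processOneJob(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement claim = conn.prepareStatement(
                 "SELECT id, payload FROM jobs WHERE status = 'pending' " +
                 "ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED");
             PreparedStatement complete = conn.prepareStatement(
                 "UPDATE jobs SET status = 'done' WHERE id = ?")) {

            try (ResultSet rs = claim.executeQuery()) {
                if (rs.next()) {
                    // Keep the work done here short: the row stays locked and the
                    // transaction stays open until the commit below
                    handle(rs.getString("payload"));
                    complete.setLong(1, rs.getLong("id"));
                    complete.executeUpdate();
                }
            }
            conn.commit();
        }
        catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }

    static void handle(String payload) {
        // actual job logic goes here
    }

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app", "secret")) {
            processOneJob(conn);
        }
    }
}
```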
That being said, building a robust queue on top of Postgres is actually harder than it may sound. Long-running transactions by queue consumers can cause MVCC bloat and WAL pile-up; Postgres’ vacuum process failing to keep up with the rate of changes can quickly become a problem for this use case. So if you want to go down that path, make sure to run representative performance tests over a sustained period of time. You won’t find out about issues like this by running two-minute tests.
So let’s actually take a closer look at the "small scale" argument, as in "with such a low data volume, you can just use Postgres". But use it for what exactly? What is the problem you are trying to solve? After all, Postgres and Kafka are tools designed for addressing specific use cases. One is a database, the other is an event streaming platform. Without knowing and talking about what one actually wants to achieve, the conversation boils down to "I like this tool better than that one" and is pretty meaningless.
Kafka enables a wide range of use cases, such as microservices communication and data exchange; ingesting IoT sensor data, click streams, or metrics; log processing and aggregation; low-latency data pipelines between operational databases and data lakes/warehouses; or realtime stream processing, for instance for fraud detection and recommendation systems.
So if you have one of those use cases, but at a small scale (low volume of data), could you then use Postgres instead of Kafka? And if so, does it make sense? To answer this, you need to consider the capabilities and features you get from Kafka which make it such a good fit for these applications. And while scalability is indeed one of Kafka’s core characteristics, it has many other traits which make it very attractive for event streaming applications:
- Log semantics: At its core, Kafka is a persistent, ordered event log. Records are not deleted after processing; instead, they are subject to time-based retention policies or key-based compaction, or they can be retained indefinitely. Consumers can replay a topic from a given offset, or from the very beginning (see the replay sketch after this list). If needed, consumers can work with exactly-once semantics. This goes way beyond simple queue semantics, and replicating it on top of Postgres would be a substantial undertaking.
- Fault tolerance and high availability (HA): Kafka workloads are scaled out in clusters running on multiple compute nodes. This is done for two reasons: increasing the throughput the system can handle (not relevant at small scale) and increasing reliability (very much relevant even at small scale). By replicating the data to multiple nodes, instance failures can be easily tolerated. Each node in the cluster can be a leader for a topic partition (i.e., receive writes), with another node taking over if the previous leader becomes unavailable.
  With Postgres, in contrast, all writes go to a single node, while replicas only serve read requests. A broker failover in Kafka will affect (in the form of increased latencies) only those partitions it is the leader for, whereas the failure of the Postgres primary node in a cluster is going to affect all writers. And while Kafka broker failovers happen automatically, promoting a Postgres replica to primary requires manual intervention, or an external coordinator such as Patroni. Alternatively, you might consider Postgres-compatible distributed databases such as CockroachDB, but then the conversation shifts quite a bit away from "Just use Postgres".
- Consumer groups: One of the strengths of the Kafka protocol is its support for organizing consumers in groups. Multiple clients can share the load of reading the messages from a given topic, making sure that each message is processed by exactly one member of the group. This is very useful even when handling only a low volume of messages. For instance, consider a microservice which receives messages from another service. For the purposes of fault tolerance, that service is scaled out to multiple instances. By configuring a Kafka consumer group for all the service instances, the incoming messages will be distributed amongst them (see the consumer group sketch after this list).
  How would the same look when using Postgres? Considering the "small scale" scenario, you could decide that only one of the service instances should read all the messages. But which one do you select? What happens if that node fails? Some kind of leader election would be required. Ok, so let’s make each member of the application cluster consume from the topic then? For this, you need to think about how to distribute the messages from the Postgres-based topic, how to handle client failures, etc. So your job now essentially is to re-implement Kafka’s consumer rebalance protocol. This is far from trivial, and it certainly goes against the initial goal of keeping things simple.
- Low latency: Let’s talk about latency, i.e. the time it takes from sending a message to a topic until it gets processed by a consumer. Having a low data volume doesn’t necessarily imply that you don’t want low latency. Think about fraud detection, for example: even when processing only a handful of transactions per second, you want to be able to spot fraudulent patterns very quickly and take action accordingly. Or a data pipeline from your operational data store to a search index: for a good user experience, search results should be based on the latest data as much as possible. With Kafka, latencies in the millisecond range can be achieved for use cases like this. Trying to do the same with Postgres would be really tough, if possible at all: you don’t want to hammer your database too often with queries from a herd of poll-based queue clients, while LISTEN/NOTIFY is known to suffer from heavy lock contention problems.
- Connectors: One important aspect which is usually omitted from all the "Just use Postgres" posts is connectivity. When implementing data pipelines and ETL use cases, you need to get data out of your data source and put it into Kafka. From there, it needs to be propagated into all kinds of data sinks, with the same dataset oftentimes flowing into multiple sinks at once, such as a search index and a data lake. Via Kafka Connect, Kafka has a vast ecosystem of source and sink connectors, which can be combined, mix-and-match style. Taking data from MySQL into Iceberg? Easy. Going from Salesforce to Snowflake? Sure. There are ready-made connectors for pretty much every data system under the sun.
  Now, what would this look like when using Postgres instead? There is no connector ecosystem for Postgres like there is for Kafka. This makes sense, as Postgres has never been meant to be a data integration platform, but it means you’ll have to implement bespoke source and sink connectors for all the systems you want to integrate with.
- Clients, schemas, developer experience: One last thing I want to address is the general programming model of a "Just use Postgres" event streaming solution. You might think of using SQL as the primary interface for producing and consuming messages. That sounds easy enough, but it’s also very low level. Building some sort of client will probably make sense. You may need consumer group support, as discussed above. You’ll need support for metrics and observability ("What’s my consumer lag?"). How do you actually go about converting your events into a persistent format? Some kind of serializer/deserializer infrastructure will be needed, and while you’re at it, you probably should have support for schema management and evolution, too. What about dead letter queue (DLQ) support? With Kafka and its ecosystem, you get battle-proven clients and tooling for all kinds of programming languages, which help you with all of that. You could rebuild all this, of course, but it would take a long time and essentially amount to recreating large parts of Kafka and its ecosystem.
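To make the log semantics point above a bit more concrete, here is a minimal replay sketch; the topic name, partition number, and connection settings are illustrative assumptions. By manually assigning a partition and seeking to the beginning (or to any retained offset), a consumer can re-read everything the topic still holds:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TopicReplay {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manual partition assignment, no consumer group needed;
            // seekToBeginning() rewinds to the start, seek() could jump
            // to any specific retained offset instead
            TopicPartition partition = new TopicPartition("orders", 0);
            consumer.assign(List.of(partition));
            consumer.seekToBeginning(List.of(partition));

            while (true) { // stop with Ctrl-C; this is just a sketch
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```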
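And for the consumer groups point, this is roughly what each service instance would run (topic and group names are placeholders again): all instances sharing the same group.id split the topic's partitions between them, and Kafka rebalances the assignment when an instance joins or fails.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderEventsConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // All instances sharing this group id divide the topic's partitions
        // between them; if one instance dies, its partitions get reassigned
        props.put("group.id", "order-service");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) { // stop with Ctrl-C; this is just a sketch
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // process the message; offsets are committed automatically by default
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```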
So where does all that leave us? Should you use Postgres as a job queue then? I mean, why not? If it fits the bill for you, go for it. Don’t build it yourself though; use an existing extension like pgmq. And make sure to understand the potential implications in terms of MVCC bloat and vacuuming discussed above.
Now, when it comes to using Postgres instead of Kafka as an event streaming platform, this proposition just doesn’t make an awful lot of sense to me, no matter what the volume of the data is going to be. There’s so much more to event streaming than what’s typically discussed in the "Just use Postgres" posts; while you might be able to punt on some of the challenges for a while, you’ll eventually find yourself in the business of rebuilding your own version of Kafka, on top of Postgres. But what’s the point of recreating and maintaining the work already done by hundreds of contributors over the course of many years? What starts as an effort to "keep things simple" actually creates a substantial amount of unnecessary complexity. Solving this challenge might sound like a lot of fun purely from an engineering perspective, but for most organizations out there, it’s probably just not the right problem to focus on.
Another problem with the "small scale" argument is that what’s a low data volume today may be a much bigger volume next week. This is a trade-off, of course, but a common piece of advice is to build your systems for the current and the next order of magnitude of load: you should be able to sustain 10x your current load and data volume as your business grows. This is easily doable with Kafka, which has been designed with scalability at its core, but it may be much harder for a queue implementation based on Postgres. Postgres is single-writer, as discussed above, so you’d have to look at scaling up, which becomes really expensive really quickly. So you might decide to migrate to Kafka eventually, which will be a substantial effort when you think of migrating data, moving your applications from your home-grown clients to the Kafka clients, etc.
In the end, it all comes down to choosing the right tool for the job. Use Postgres if you want to manage and query a relational data set. Use Kafka if you need to implement realtime event streaming use cases. Which means, yes, oftentimes it actually makes sense to work with both tools as part of your overall solution: Postgres for managing a service’s internal state, and Kafka for exchanging data and events with other services. Rather than trying to emulate one with the other, use each one for its specific strengths. How do you keep Postgres and Kafka in sync in this scenario? Change data capture (CDC), and in particular the outbox pattern, can help there. So if there is a place for "Postgres over Kafka", it is actually here: for many cases it makes sense to write to Kafka not directly, but through your database, and then to emit events to Kafka via CDC, using tools such as Debezium. That way, both resources are (eventually) consistent, keeping things very simple from an application developer’s perspective.
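As a rough sketch of what the outbox pattern looks like on the application side (table and column names are illustrative here; CDC tools such as Debezium's outbox event router expect a similar, configurable layout), the service writes its state change and the corresponding event within a single transaction, and the CDC pipeline then publishes the outbox rows to Kafka:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

public class OrderService {

    // Persists the business data and the outbound event atomically; a CDC tool
    // reads the outbox row from the WAL and publishes it to Kafka, so there is
    // no dual write from the application's point of view.
    static void placeOrder(Connection conn, UUID orderId, String customer, String eventPayload)
            throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement insertOrder = conn.prepareStatement(
                 "INSERT INTO orders (id, customer) VALUES (?, ?)");
             PreparedStatement insertEvent = conn.prepareStatement(
                 "INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload) " +
                 "VALUES (?, ?, ?, ?, ?::jsonb)")) {

            insertOrder.setObject(1, orderId);
            insertOrder.setString(2, customer);
            insertOrder.executeUpdate();

            insertEvent.setObject(1, UUID.randomUUID());
            insertEvent.setString(2, "order");
            insertEvent.setString(3, orderId.toString());
            insertEvent.setString(4, "OrderPlaced");
            insertEvent.setString(5, eventPayload);
            insertEvent.executeUpdate();

            conn.commit();
        }
        catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app", "secret")) {
            placeOrder(conn, UUID.randomUUID(), "alice",
                    "{\"status\": \"PLACED\"}");
        }
    }
}
```

The nice property of this approach is that there is no dual write: if the transaction rolls back, neither the order nor the event becomes visible, and downstream consumers only ever see events for state changes that actually happened.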
This approach also has the benefit of decoupling (and protecting) your operational datastore from the potential impact of downstream event consumers. You probably don’t want to risk increased tail latencies for your operational REST API just because a data lake ingest process, perhaps owned by another team, happens to re-read an entire topic’s worth of data from a table in your service’s database at the wrong time. Adhering to the idea of the synchrony budget, it makes sense to separate the systems addressing these different concerns.
What about the operational overhead then? While this definitely warrants consideration, I believe that concern is oftentimes overblown. Running Kafka for small data sets really isn’t that hard. With the move from ZooKeeper to KRaft mode, running a single Kafka instance is trivial for scenarios not requiring fault tolerance. Managed services make running Kafka a very uneventful experience (pun intended) and should be the first choice, in particular when setting out with low-scale use cases. Cost will be manageable more or less by definition, by virtue of the low volume of data. Plus, for the TCO consideration to be meaningful, it should factor in the time and effort needed to solve all the issues of a custom implementation discussed above.
So yes, if you want to make it to the front page of HackerNews, arguing that "Postgres is enough" may get you there; but if you actually want to solve your real-world problems in an effective and robust way, use the right tool for the job.
Gunnar Morling is an open-source software engineer in the Java and data streaming space. He currently works as a Technologist at Confluent. In his past role at Decodable he focused on developer outreach and helped them build their stream processing platform based on Apache Flink. Prior to that, he spent ten years at Red Hat, where he led the Debezium project, a platform for change data capture.