I wrote a tool called “kafka-topic-analyzer”. It helps you gather statistics and metrics about a Kafka topic. You can find the project on GitHub: https://github.com/xenji/kafka-topic-analyzer
Before I tell you about the tool, I want to provide some background on why I wrote it. At trivago, we do a lot of Apache Kafka. One of our common use-cases is CDC, where we use Debezium to read the binlog of our MySQL servers. You can dive into the details by watching a talk that I gave with a colleague of mine at Code Talks 2017. I’ve embedded it at the end of the article.
With a technology stack like this, you can encounter a broad spectrum of challenges. On the one hand, there are technological problems, which should not surprise you. Most of the time, though, the bigger challenge for us is people’s trust in the tech stack. The most frequent question is “are all records of a certain table present in the Kafka topic?”. Unfortunately, this is one of the hardest questions to answer with Kafka. Not being able to respond to it creates uncertainty and may result in a lack of trust.
Let me explain the technical aspect a bit. We publish the changes to a “log-compacted” Kafka topic. A message in Kafka consists of a key and a value. In our case, the key consists of all parts of the table’s primary key to guarantee uniqueness across the topic. Kafka stores the messages in insertion order, so if there is more than one change to a certain key, that key occurs more than once until Kafka runs a compaction. A change is either an “upsert” or a delete. Deletes are represented as `key => NULL`, upserts as `key => some-value`. A compaction removes all older versions of a message, keeping only the latest one. This means a key can occur many times when reading a topic from start to end. As a consequence, counting the messages is not an appropriate way to determine the semantic size of the topic. To get to the “real” size, you must keep track of all mutations to every single key. Since topic sizes vary from a few thousand up to hundreds of millions of messages, stateful counting becomes a cumbersome task.
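To make the counting problem concrete, here is a minimal Python sketch of the idea (my own illustration, not code from the tool): replay the topic from start to end, keep only the latest value per key, and count the keys whose latest value is not a tombstone. The example keys and values are made up.

```python
def alive_key_count(messages):
    """Count keys whose latest value is not a delete (tombstone).

    `messages` is a list of (key, value) pairs in insertion order,
    mimicking a log-compacted topic; value None marks a delete.
    """
    latest = {}
    for key, value in messages:
        latest[key] = value  # later occurrences overwrite earlier ones
    return sum(1 for value in latest.values() if value is not None)

# A tiny example topic: two keys, four raw messages.
topic = [
    ("user:1", "created"),
    ("user:2", "created"),
    ("user:1", "updated"),  # second change to the same key
    ("user:2", None),       # tombstone: the row was deleted
]

print(len(topic))              # 4 raw messages on the topic
print(alive_key_count(topic))  # but only 1 key is still "alive"
```

The state (`latest`) grows with the number of distinct keys, which is exactly why this becomes cumbersome at hundreds of millions of messages.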
Here is a screenshot of an example output:
The kafka-topic-analyzer tries to fill this gap by providing a number of metrics and counters. One of them is the count of “alive” keys in a log-compacted topic. This enables you to make a clear statement about the semantic size of a topic, and to compare this value against a count query on the related table. Be aware that you have to expect a slight difference on a topic with a high change rate, due to the delay the services introduce.
Below is the video recording of the conference talk about the whole process we run at trivago.