Sunday, November 17, 2024

Topics, Partitions, and Offsets in Apache Kafka

Apache Kafka is a publish-subscribe messaging system. A messaging system lets you send messages between processes, applications, and servers. Broadly speaking, Kafka is software in which topics (a topic can be thought of as a category of messages) are defined and the messages in them are then processed. In this article, we discuss three of the most important concepts in Apache Kafka:

  1. Topics
  2. Partitions
  3. Offsets

Topics, Partitions, and Offsets

In Kafka, a topic represents a particular stream of data. A topic is similar to a table in a database, but without all the constraints: just as a database can hold many tables, a Kafka cluster can hold many topics. You can have as many topics as you want, and each one is identified by its name, so the name must be unique. Topics are split into partitions, and when you create a topic you have to specify how many partitions it should have. Each partition is itself an ordered stream of data. Every message within a partition gets an incremental ID that gives its position in that partition, and this ID is called an offset.

 

Take the example of a Kafka topic with 3 partitions. Partition 0 holds the messages with offsets 0, 1, 2, 3, and so on, say up to 11, so the next message written to it gets offset 12. Partition 1 belongs to the same topic and holds offsets 0 to 7, so its next message gets offset 8. Partition 2 holds offsets 0 to 9, so its next message gets offset 10. As this example shows, the partitions are independent: each one is written to at its own speed, so the offsets in each partition are independent too. A message is therefore identified by three coordinates: a topic name, a partition ID, and an offset.
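The offset bookkeeping described above can be sketched with a toy in-memory model. This is only an illustration of the idea, not how Kafka stores data: the `Topic` class, its methods, and the name `example_topic` are all made up for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Toy in-memory model of a topic: one append-only list per partition."""
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        self.partitions = [[] for _ in range(self.num_partitions)]

    def append(self, partition, message):
        """Append to one partition and return the message's coordinates."""
        log = self.partitions[partition]
        offset = len(log)  # next offset = number of messages already in this partition
        log.append(message)
        return (self.name, partition, offset)

    def next_offset(self, partition):
        """Offset that the next message written to this partition will get."""
        return len(self.partitions[partition])


topic = Topic("example_topic", num_partitions=3)  # hypothetical topic name

# Fill the partitions to match the example: 12, 8 and 10 messages.
for partition, count in enumerate([12, 8, 10]):
    for i in range(count):
        topic.append(partition, f"message {i}")

# Each partition keeps its own counter.
print([topic.next_offset(p) for p in range(3)])  # [12, 8, 10]
print(topic.append(1, "new message"))            # ('example_topic', 1, 8)
```

Writing to Partition 1 changes nothing in Partitions 0 and 2, which is why an offset on its own says nothing without the partition it belongs to.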

Topic Example

Let’s go through an example. We are a car company with a fleet of cars on the road, and we want to have the position of every car in Kafka. Why? Because we may have many applications that need that stream of car positions, for example a dashboard or an alerting service. So we create a Kafka topic named cars_gps that will contain the position of all the cars in real time. Each car sends a message to Kafka, say every 20 seconds. Each message contains the carID, so we know which car the position belongs to, and the car position itself, for example the latitude and longitude.

We could choose to add more data to that message: the speed, the weight of the car, how many hours the car has been on, and so on. Suppose we create the topic with 10 partitions. In Kafka, the more partitions a topic has, the more throughput it can generally handle, so the right number is something to settle through testing and capacity planning. From there, the consumer applications might be a location dashboard for a mobile application, or a notification service. For example, if a car hasn’t been moving for more than 10 minutes, maybe it has broken down or maybe it has arrived at its destination, and we want to send a notification about it.
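As a rough sketch of what the notification service could do with the cars_gps stream, the snippet below flags a car whose position has not changed for more than 10 minutes. It does not talk to Kafka at all: the `CarPosition` and `StationaryDetector` names are hypothetical, the readings are generated in a loop, and comparing coordinates for exact equality is a simplifying assumption (real GPS readings jitter, so a distance threshold would be needed).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarPosition:
    """One message from the cars_gps topic (field names are assumptions)."""
    car_id: str
    latitude: float
    longitude: float
    timestamp: float  # seconds since some reference point


STATIONARY_AFTER_SECONDS = 10 * 60  # the 10-minute threshold from the example


class StationaryDetector:
    """Remembers, per car, its position and when it was first seen there."""

    def __init__(self):
        self._since = {}  # car_id -> (latitude, longitude, first_seen_timestamp)

    def observe(self, pos):
        """Return True if the car has not moved for more than 10 minutes."""
        last = self._since.get(pos.car_id)
        if last is None or (last[0], last[1]) != (pos.latitude, pos.longitude):
            # New car, or the car moved: restart the clock.
            self._since[pos.car_id] = (pos.latitude, pos.longitude, pos.timestamp)
            return False
        return pos.timestamp - last[2] > STATIONARY_AFTER_SECONDS


detector = StationaryDetector()
alerts = []
for t in range(0, 641, 20):  # one reading every 20 seconds, car never moves
    if detector.observe(CarPosition("car_42", 48.8566, 2.3522, float(t))):
        alerts.append(t)
print(alerts)  # [620, 640]
```

In a real consumer the `observe` call would sit inside the loop that reads messages from the topic, and an alert would typically be sent once rather than on every later reading.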

 

Some Major Points to Remember in Topics, Partitions, and Offsets

Please refer to the same example.

 

  • Offsets only have a meaning for a specific partition. That means offset number 3 in Partition 0 does not represent the same data or the same message as offset number 3 in partition 1.
  • Order is guaranteed only within a partition.
  • Across partitions, there is no ordering guarantee. This is a very important property of Kafka: messages are ordered at the partition level only.
  • Data in Kafka is kept only for a limited amount of time, and the default is one week. After one week the data is erased from the partition, which lets Kafka reclaim disk space and makes sure it does not run out of it.
  • Data in Kafka is immutable. Once data is written into a partition, it cannot be changed: if you write message number 3 in Partition 0, you cannot overwrite it. So be careful about the kind of data you send to a Kafka topic, and have a recovery mechanism in place in case you send bad data.
  • Also, if you don’t provide a key with your message, the partition is chosen for you by the producer when you send the message, so from the application’s point of view the message can land on any partition.
  • Finally, a topic can have as many partitions as you want, but it is not common to have topics with, say, 10, 20, 30, or 1000 partitions unless you have a truly high-throughput topic.
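The point about keys has a flip side: when a message does carry a key, the producer hashes the key to pick the partition, so all messages with the same key land in the same partition and keep their order there. The sketch below illustrates this with plain Python lists. The `partition_for` helper is hypothetical, and CRC32 is only a stand-in for the hash (Kafka’s Java producer uses murmur2), so the partition numbers it yields will not match a real cluster.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash.

    Stand-in only: CRC32 is used for illustration and is not the hash
    used by Kafka's own producer.
    """
    return zlib.crc32(key) % num_partitions


NUM_PARTITIONS = 10
partitions = [[] for _ in range(NUM_PARTITIONS)]

# Made-up readings, keyed by car ID.
readings = [
    ("car_1", "pos A"),
    ("car_2", "pos B"),
    ("car_1", "pos C"),
    ("car_1", "pos D"),
]
for car_id, position in readings:
    p = partition_for(car_id.encode("utf-8"), NUM_PARTITIONS)
    partitions[p].append((car_id, position))

# All readings for car_1 share one partition, in the order they were sent.
p1 = partition_for(b"car_1", NUM_PARTITIONS)
car_1_history = [pos for cid, pos in partitions[p1] if cid == "car_1"]
print(car_1_history)  # ['pos A', 'pos C', 'pos D']
```

For the cars_gps example, using the carID as the key would keep each car’s positions in order, even though there is still no ordering across different cars that hash to different partitions.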

Topic Naming Convention

Naming a topic is a “free-for-all”: you can do whatever you want. But once you go into production with Kafka, you need to enforce guidelines internally to ease the management of your cluster. You are free to come up with your own guidelines. If you want, you can also use the following guidelines for naming a topic.

<message type>.<dataset name>.<data name>

  • Message Type:
    • logging: For logging data (slf4j, Syslog, etc)
    • queuing: For classical queuing use cases
    • tracking: For tracking events such as user clicks, page views, etc.
    • user: For user-specific data such as scratch and test topics.
  • Dataset Name: The dataset name is analogous to a database name in traditional RDBMS systems. It is used as a category to group topics together.
  • Data Name: The data name field is analogous to a table name in traditional RDBMS systems, though it’s fine to include further dotted notation if developers wish to impose their own hierarchy within the dataset namespace.
  • Use snake_case: Finally, to keep things simple, it is recommended to use snake_case, that is, all lowercase with underscores between words.
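
A convention like this is easier to enforce when it is checked automatically, for example before a topic is created. The following is a minimal sketch of such a check, assuming the four message types listed above are the only ones allowed. The function name `is_valid_topic_name` and the dataset name `fleet` are made up for the example.

```python
import re

MESSAGE_TYPES = {"logging", "queuing", "tracking", "user"}

# One snake_case segment: lowercase letters and digits, words joined by single underscores.
SEGMENT = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_valid_topic_name(name: str) -> bool:
    """Check <message type>.<dataset name>.<data name>, all in snake_case.

    Extra dotted parts after the dataset name are accepted, since the
    data name may carry its own hierarchy.
    """
    parts = name.split(".")
    if len(parts) < 3:
        return False
    if parts[0] not in MESSAGE_TYPES:
        return False
    return all(SEGMENT.fullmatch(part) for part in parts[1:])


print(is_valid_topic_name("tracking.fleet.cars_gps"))  # True
print(is_valid_topic_name("tracking.fleet.CarsGps"))   # False (not snake_case)
print(is_valid_topic_name("cars_gps"))                 # False (missing type and dataset)
```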