Apache Kafka is a well-known open-source stream-processing platform that aims to provide high throughput, low latency, and fault tolerance while handling real-time data feeds.
So what makes Apache Kafka the go-to platform for real-time data processing? Among the many perks Kafka provides, speed is one of the most important. Let us see how Kafka is built to be so fast.
1. Low-Latency I/O: Data can be stored and cached in two places: Random Access Memory (RAM) and disk.
- The conventional way to achieve low latency while delivering messages is to use RAM. It’s preferred over disk because disks have a high seek time, which makes them slower.
- The downside of this approach is that RAM becomes expensive when the data flowing through your system is around 10 to 500 GB per second or even more.
Thus, Kafka relies on the filesystem for storing and caching messages. Although it uses the disk rather than RAM, it still manages to achieve low latency! You might wonder how this is possible, given the high seek time. Let’s find out.
2. Kafka Avoids the Seek Time: Yes! Kafka smartly avoids the seek time by using a concept called Sequential I/O.
- It uses a data structure called a ‘log’, which is an append-only sequence of records ordered by time. The log is basically a queue: the producer appends records to its end, and subscribers process the messages at their own pace by maintaining pointers into it.
- The first record published gets an offset of 0, the second gets an offset of 1 and so on.
- Consumers read data by accessing the position specified by an offset. They periodically save their position in the log.
- This also makes Kafka fault-tolerant, since another consumer can use the stored offsets to pick up the new records if the current consumer instance fails. This approach removes the need for disk seeks because the data is laid out sequentially, as depicted below:
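The log and offset ideas above can be sketched in a few lines of Python. This is only an in-memory illustration, not Kafka's implementation: the names `Log`, `Consumer`, `offsets`, and the group name "billing" are all made up for the example.

```python
class Log:
    """Append-only sequence of records; a record's offset is its index."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end and return the offset assigned to it."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Reads from a log and saves its position in a shared offset store."""

    def __init__(self, log, offsets, group):
        self.log = log
        self.offsets = offsets  # hypothetical store: group name -> next offset
        self.group = group

    def poll(self, max_records=10):
        start = self.offsets.get(self.group, 0)
        records = self.log.read(start, max_records)
        # Save the new position so another instance can resume from it.
        self.offsets[self.group] = start + len(records)
        return records


log = Log()
for event in ["a", "b", "c", "d"]:
    log.append(event)

offsets = {}
first = Consumer(log, offsets, "billing")
batch_one = first.poll(max_records=3)

# Suppose the first consumer fails here. A replacement that shares the
# same offset store continues from the saved position.
replacement = Consumer(log, offsets, "billing")
batch_two = replacement.poll()
```

Because the position lives outside the consumer, the replacement starts at the first unread record instead of re-reading the log from offset 0.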
3. Zero Copy Principle: The most common way to send data over a network copies it from the kernel into the application’s buffers and then back into the kernel’s socket buffer. This requires multiple context switches between kernel mode and user mode and consumes memory bandwidth and CPU cycles. The Zero Copy Principle reduces this overhead by asking the kernel to move the data directly to the response socket rather than through the application. Implementing zero copy improves Kafka’s speed considerably.
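To make the idea concrete, here is a small Python sketch that hands a file to the kernel for sending instead of reading it into the application first. It is not Kafka code (Kafka is written for the JVM and uses Java's `FileChannel.transferTo`). The helper names `send_file_zero_copy` and `demo` are made up, and a local socket pair stands in for a real network connection.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path, sock):
    """Send a whole file over sock and return the number of bytes sent."""
    with open(path, "rb") as f:
        # socket.sendfile() uses os.sendfile() where the platform supports
        # it, and falls back to ordinary send() calls otherwise.
        return sock.sendfile(f)


def demo():
    payload = b"record-0\nrecord-1\nrecord-2\n"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name

    # A connected pair of local sockets stands in for broker and consumer.
    sender, receiver = socket.socketpair()
    try:
        sent = send_file_zero_copy(path, sender)
        sender.close()  # signals end-of-stream to the receiver
        received = b""
        while chunk := receiver.recv(4096):
            received += chunk
    finally:
        sender.close()
        receiver.close()
        os.unlink(path)
    return sent, received


sent, received = demo()
```

The application never touches the file's bytes on the sending side; it only passes file and socket handles to the operating system.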
4. Optimal Data Structure: Tree vs. Queue: The tree seems to be the data structure of choice when it comes to data storage. Most modern databases use some form of tree. For example, MongoDB uses a B-tree.
- Kafka, on the other hand, is not a database but a messaging system, and hence it experiences more read/write operations than a database.
- Using a tree for this workload may lead to random I/O and, eventually, disk seeks, which are catastrophic for performance.
Thus, Kafka uses a queue: all data is appended at the end, and reads are simple because they follow pointers. These operations are O(1), which makes the queue an efficient data structure for Kafka.
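Here is a rough sketch of what a file-backed queue could look like. It assumes a simplified layout of length-prefixed records and a full in-memory index of byte positions; Kafka's real segment and index files are more involved. `SegmentFile` and the file name are hypothetical.

```python
import os
import tempfile


class SegmentFile:
    """Append-only file of length-prefixed records with a position index.

    Assumes it starts from a new, empty file; the index is not rebuilt
    from existing data.
    """

    def __init__(self, path):
        self._file = open(path, "a+b")
        self._positions = []  # _positions[offset] = byte position of record

    def append(self, payload: bytes) -> int:
        """Write the record at the end of the file and return its offset."""
        self._file.seek(0, os.SEEK_END)
        self._positions.append(self._file.tell())
        self._file.write(len(payload).to_bytes(4, "big") + payload)
        return len(self._positions) - 1

    def read(self, offset: int) -> bytes:
        """Jump straight to the record's position; no tree traversal."""
        self._file.seek(self._positions[offset])
        size = int.from_bytes(self._file.read(4), "big")
        return self._file.read(size)

    def close(self):
        self._file.close()


directory = tempfile.mkdtemp()
segment = SegmentFile(os.path.join(directory, "segment.log"))
first_offset = segment.append(b"first")
second_offset = segment.append(b"second")
second_record = segment.read(second_offset)
first_record = segment.read(first_offset)
segment.close()
```

Appending costs one write at the end of the file, and a lookup costs one list access plus one positioned read, regardless of how many records the file holds.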
5. Horizontal Scaling: Kafka can split a single topic into multiple partitions that can be spread across thousands of machines. This enables it to maintain high throughput and provide low latency.
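One way to picture partitioning is a function that maps each record key to a partition number, so that every partition can live on a different machine. The sketch below is an assumption-laden illustration: `choose_partition` and `route` are made-up names, and CRC32 is used as a stand-in hash (Kafka's default partitioner for keyed records uses a murmur2 hash instead).

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition; CRC32 is a stand-in for Kafka's hash."""
    return zlib.crc32(key) % num_partitions


def route(records, num_partitions):
    """Distribute (key, value) pairs into per-partition lists."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[choose_partition(key, num_partitions)].append(value)
    return partitions


records = [
    (b"user-1", "login"),
    (b"user-2", "login"),
    (b"user-1", "click"),
    (b"user-3", "login"),
    (b"user-1", "logout"),
]
partitions = route(records, num_partitions=3)
```

Records with the same key always land in the same partition, so their relative order is preserved there, while different keys can be processed in parallel on different machines.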
6. Compression & Batching of Data: Kafka batches data into chunks, which reduces the number of network calls and converts most random writes into sequential ones. It’s more efficient to compress a batch of data than to compress individual messages.
Hence, Kafka compresses a batch of messages and sends it to the server, where it is written in compressed form. The messages are decompressed when the subscriber consumes them. Kafka supports the GZIP and Snappy compression codecs, among others.
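The benefit of compressing a batch rather than individual messages is easy to check with Python's standard `gzip` module. This is only a sketch of the principle: the sample messages are invented, and joining them with newlines is an assumption made for the example, not Kafka's wire format.

```python
import gzip
import json

# Invented sample data: many small, similar messages.
messages = [
    json.dumps({"user_id": i, "event": "page_view", "path": "/home"}).encode()
    for i in range(100)
]


def compress_individually(msgs):
    """Compress every message on its own."""
    return [gzip.compress(m) for m in msgs]


def compress_batch(msgs):
    """Compress all messages together (assumes no newlines inside them)."""
    return gzip.compress(b"\n".join(msgs))


def decompress_batch(blob):
    """Recover the original messages on the consumer side."""
    return gzip.decompress(blob).split(b"\n")


individual_size = sum(len(c) for c in compress_individually(messages))
batch_size = len(compress_batch(messages))
restored = decompress_batch(compress_batch(messages))
```

Comparing `individual_size` with `batch_size` shows the gain: each individually compressed message pays the GZIP header and trailer again, while the batch lets the compressor exploit the repetition across messages.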