Dec. 7, 2020, 11:43 p.m.

What is Apache Kafka?

Known Unknowns

I was talking to a recruiter today, and he mentioned that his company is switching to Kafka for a lot of their systems. To be honest, I had heard the name Kafka, but had literally no idea what it was. This was an unknown unknown to me until today. I decided to learn a little bit about what Apache Kafka is.

According to TechBeacon, Kafka is:

“a scalable, fault-tolerant, publish-subscribe messaging system that enables you to build distributed applications and powers web-scale Internet companies such as LinkedIn, Twitter, AirBnB, and many others.”

Yeesh, that’s a mouthful, and it didn’t make anything clearer to me.

Kafka was first created at LinkedIn in 2010, where it was used to ingest large amounts of data in real time. Traditionally, large quantities of data are handled through batch processing, but everything that happens on a site as large as LinkedIn needs to be captured almost as soon as it happens. Kafka is written in Scala and Java.

Kafka works as a large-scale message queue that captures data quickly and in a fault-tolerant way, which makes it reliable for big data applications.

Data in Kafka is referred to as events. An event is a single, atomic piece of data - like a message with some information attached. Kafka stores events in a log data structure; under the hood, each event is simply an array of bytes. Unlike a row in a traditional database table, an event isn't looked up by a primary key. Instead, events are referenced by their offset - their position counted from the first event in the log. The log is time-ordered and append-only.
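
To make offsets concrete, here's a toy append-only log in Java. This is just my sketch of the abstraction, not how Kafka actually implements its log:

    import java.util.ArrayList;
    import java.util.List;

    // Toy sketch of the log abstraction - not Kafka's real implementation
    public class AppendOnlyLog {
        private final List<byte[]> events = new ArrayList<>();

        // Appending returns the new event's offset: its position
        // counted from the first event in the log
        public long append(byte[] event) {
            events.add(event);
            return events.size() - 1;
        }

        // Events are fetched by offset, not by a database-style key
        public byte[] read(long offset) {
            return events.get((int) offset);
        }
    }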

It is a distributed system, built to scale horizontally (across multiple servers, by adding nodes) rather than vertically like other messaging systems (which scale by adding more power to the single machine the system runs on).

Events are added to the log by a producer (sometimes called a publisher). Multiple producers can add data to Kafka simultaneously - even up to 1 trillion events a day, according to LinkedIn. A producer is any source that creates data.
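
To get a feel for what a producer looks like, here's a minimal sketch using Kafka's official Java client. The broker address ("localhost:9092"), the topic name ("sales-invoices"), and the invoice payload are all placeholders I made up:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class InvoiceProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Append one event to the hypothetical "sales-invoices" topic.
                // The record key is optional; Kafka uses it to choose a
                // partition, not as a database-style lookup key.
                producer.send(new ProducerRecord<>(
                        "sales-invoices", "invoice-1001", "{\"amount\": 42.50}"));
            } // close() flushes any events still in flight
        }
    }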

Consumers request events from Kafka to do something with the information in the event. Kafka does not know anything about the consumers of the data. A piece of code can potentially be both a producer and a consumer of an event.
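
The consuming side looks like this - again a sketch with made-up names. Notice that the consumer pulls events from Kafka; Kafka doesn't push them:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class InvoiceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("group.id", "invoice-processors");      // consumer groups come up below
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("sales-invoices"));
                while (true) {
                    // Pull the next batch of events and do something with them
                    ConsumerRecords<String, String> records =
                            consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d value=%s%n",
                                record.offset(), record.value());
                    }
                }
            }
        }
    }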

Kafka is the middleman between producers and consumers. Each node in the cluster runs a piece of software called a broker. Because Kafka is distributed, the cluster can hold multiple copies of the same data on different nodes. These copies are called replicas, and they are what make Kafka fault-tolerant: if one broker goes down, another broker holding a replica can pick up the work and continue serving the data.

Topics are categories of events. For example, producers might publish sales invoices as events, and those events are grouped together under a topic. A consumer can ask for a specific topic and retrieve the sales invoices in it.

If a topic grows larger than the storage capacity of the machine it lives on, that topic can be broken into multiple partitions that reside on multiple machines. It's up to the administrator of the cluster to decide how many partitions a topic gets, and while the partition count can be increased later, it can never be decreased - so the decision is worth making carefully up front.
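
Both the partition count and the replication factor (the replicas from a couple of paragraphs back) are set when a topic is created. Here's a sketch using the Java client's AdminClient, with illustrative numbers - three partitions, each kept on two brokers:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateInvoiceTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions let the topic spread across machines;
                // replication factor 2 keeps a second copy of each
                // partition on another broker for fault tolerance
                NewTopic topic = new NewTopic("sales-invoices", 3, (short) 2);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }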

Locating a specific event requires three coordinates: the topic, the partition, and the event's offset within that partition. Together, these three uniquely identify the event.
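
The Java client exposes exactly these three coordinates. This sketch reads starting from a specific spot; the topic name, partition number, and offset are made-up values:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class ReadFromOffset {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // The three coordinates: topic "sales-invoices", partition 0, offset 42
                TopicPartition partition = new TopicPartition("sales-invoices", 0);
                consumer.assign(Collections.singletonList(partition));
                consumer.seek(partition, 42L);
                consumer.poll(Duration.ofSeconds(1))
                        .forEach(record -> System.out.println(record.value()));
            }
        }
    }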

Finally, you can create a consumer group - a logical way to group consumers that share a unit of work. A group can hold as many consumers as the topic has partitions; Kafka assigns each partition to at most one consumer in the group, so no two consumers inadvertently double-read the same events.
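
In the Java client, joining a group is just configuration: every consumer started with the same group.id (a name I made up below) becomes part of the same group, and Kafka divides the topic's partitions among the members:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GroupedConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            // Every instance launched with the same group.id joins the same group
            props.put("group.id", "invoice-processors");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("sales-invoices"));
                // Run this program once per consumer, up to the partition count.
                // Kafka gives each partition to at most one group member, so no
                // two members read the same partition and double-read events.
                while (true) {
                    consumer.poll(Duration.ofMillis(500))
                            .forEach(record -> System.out.println(
                                    record.partition() + ": " + record.value()));
                }
            }
        }
    }

If a member of the group crashes, Kafka rebalances its partitions across the surviving members - the same fault-tolerance idea as the broker replicas above.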

Kafka seems like a fascinating tool for highly distributed, highly scalable systems like social media platforms. I hope this introduction made sense and helps you understand how it works.

Resources:

  • Apache Kafka Use Cases
  • Kafka Tutorial - Core Concepts
