Apache Kafka is a distributed, low-latency, high-throughput, fault-tolerant platform for handling real-time data feeds. Kafka can publish and subscribe to streams of data, store them durably, and process them as needed. Consider a situation where many sources generate data and many target systems need that data. As the number of sources and targets grows, this point-to-point system becomes very difficult and inefficient to manage. Kafka is designed to handle these cases: it provides a middle layer that takes data from different sources, stores it on its servers, and makes it available to all target systems in a reliable way.


Kafka APIs:

Kafka has four core APIs.

  • Producer API: used by applications to publish records/data to Kafka topics.
  • Consumer API: allows applications to read records from one or more topics and process them.
  • Streams API: allows applications to consume records from one or more topics, process or transform them, and publish the results to one or more topics.
  • Connect API: allows building and running reusable connectors that move data between Kafka topics and external systems such as databases or file systems.
(Figure: a simple architecture for Apache Kafka)

Topics and Partitions:

In Kafka, topics are categories of the data feed. Each record/message is published to a topic, and consumers can read data from one or more topics. A cluster can hold any number of topics, as long as each topic name is unique.

Topics are split into partitions. Within a partition, each message is assigned an offset, an increasing number starting from 0. New messages are appended to the end of a partition. A topic can have any number of partitions.
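The partition/offset behaviour described above can be sketched with a tiny in-memory model (this is not real Kafka, just an illustration of the idea): each partition is an independent append-only log, and each append returns the next offset in that partition.

```python
# Minimal in-memory sketch (not real Kafka) of a topic split into
# partitions, each with its own increasing offsets starting at 0.

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        # each partition is just an append-only list of messages
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Append a message to the end of a partition; return its offset."""
        self.partitions[partition].append(message)
        return len(self.partitions[partition]) - 1  # offsets start at 0

topic = Topic("orders", num_partitions=3)
print(topic.append(0, "first"))   # offset 0 in partition 0
print(topic.append(0, "second"))  # offset 1 in partition 0
print(topic.append(1, "other"))   # offset 0 in partition 1 (independent)
```

Note that the offsets of different partitions are completely independent: each partition keeps its own counter.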

(Figure: Apache Kafka topics and partitions)

Messages written to partitions are immutable; they cannot be changed. The ordering of messages is guaranteed only within a partition, not across the topic as a whole. Data is assigned to partitions arbitrarily unless a key is provided with the message; the key is used to assign each message to a particular partition, so all messages with the same key land in the same partition.
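Key-based assignment boils down to hashing the key modulo the partition count. Kafka's default partitioner uses a murmur2 hash; the sketch below uses Python's `zlib.crc32` purely for illustration, but the property is the same: the same key always maps to the same partition, which is what preserves per-key ordering.

```python
# Sketch of key-based partition assignment. Real Kafka's default
# partitioner hashes the key with murmur2; zlib.crc32 is used here
# only for illustration. Same key -> same partition.
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions

p1 = partition_for(b"user-42", 3)
p2 = partition_for(b"user-42", 3)
assert p1 == p2  # every message for this key lands in one partition
```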

Data in Kafka is stored only for a limited, configurable retention period (7 days by default). After that time, the data is erased from the server. So roughly a week after data is written, offset 0 in partition 0 will be deleted. Remember that offsets always keep increasing: even after offset 0 is deleted, offset 0 will never be assigned to any new data.
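Retention is controlled by broker-level settings in `server.properties`. The two main knobs are time-based and size-based retention:

```properties
# Broker-level retention settings (server.properties)
log.retention.hours=168    # keep data for 7 days (the default)
log.retention.bytes=-1     # no size-based limit (the default)
```

Whichever limit is hit first triggers deletion of old log segments; both can also be overridden per topic.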


Brokers:

A Kafka cluster is made up of multiple servers, and these servers are called brokers. Each broker is identified by an integer ID. Once you connect to any broker within the cluster, you are connected to the entire cluster. Each broker contains some of the topic partitions. For every partition, one broker acts as the leader and zero or more brokers act as followers. All reads and writes for that partition are handled by the leader and replicated to the followers. If the leader of a partition goes down for some reason, one of that partition's followers automatically becomes the new leader.


Replication:

Each partition in Kafka is replicated to one or more servers. This gives us fault-tolerant storage: even if one of the servers goes down, we can use the replicated data from another server.


Producers:

Producers publish data to topics. A producer needs a topic name and at least one broker to connect to when publishing data. The producer is responsible for choosing which partition each record is assigned to. Producers can also choose how much acknowledgment they want for data writes:

  • acks = 0: the producer does not wait for any acknowledgment (fire and forget)
  • acks = 1: the producer waits for the leader's acknowledgment only
  • acks = all: the producer waits for the leader as well as all in-sync replicas to acknowledge
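In a producer's configuration this is the `acks` setting, for example in `producer.properties`:

```properties
# Producer acknowledgment setting (producer.properties)
acks=all    # safest: wait for leader + all in-sync replicas
# acks=1    # wait for the leader only (default in older clients)
# acks=0    # fire and forget: fastest, but writes can be lost
```

Higher `acks` values trade a little latency for stronger durability guarantees.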


Consumers:

Consumers read data from topics. Consumers can be grouped together in consumer groups. Within a group, each partition is read by exactly one consumer at a time. If there are 3 consumers in a group and 3 partitions, each consumer reads data from one partition in parallel. It is therefore pointless to have more consumers than partitions: the extra consumers will sit idle, because a partition is read by only one consumer in the group at a time.
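The idle-consumer effect is easy to see if we sketch a simple round-robin assignment of partitions to group members (real Kafka has pluggable assignment strategies; this is just the idea):

```python
# Sketch of round-robin partition assignment within a consumer group.
# Each partition goes to exactly one consumer; once partitions run out,
# any remaining consumers are left with nothing to read.

def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 4 consumers -> the fourth consumer sits idle
print(assign([0, 1, 2], ["c1", "c2", "c3", "c4"]))
```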

When a consumer has received and processed data, it commits offsets to Kafka. Kafka stores these offsets in an internal topic named ‘__consumer_offsets’. If a consumer dies and comes back online after some time, it uses the committed offset to resume reading from the point where it left off.
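The resume-after-restart behaviour can be modelled with a plain dictionary standing in for the ‘__consumer_offsets’ topic: the group commits the next offset it intends to read, and after a restart it looks that position up instead of starting from the beginning.

```python
# Sketch of offset commits. Kafka stores committed offsets in the
# __consumer_offsets topic; here a dict stands in for it.

committed = {}  # (group, topic, partition) -> next offset to read

def commit(group, topic, partition, next_offset):
    committed[(group, topic, partition)] = next_offset

def resume_position(group, topic, partition):
    # with no committed offset, start from the beginning of the partition
    return committed.get((group, topic, partition), 0)

log = ["m0", "m1", "m2", "m3"]
# the consumer processes m0 and m1, commits offset 2, then crashes
commit("g1", "orders", 0, 2)
# after a restart it picks up at offset 2, not at the beginning
assert log[resume_position("g1", "orders", 0):] == ["m2", "m3"]
```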


Kafka gives us the following guarantees:

  • Messages are appended to a partition in the order they are sent by a producer.
  • A consumer reads messages in the order they are stored in a partition.
  • For a topic with replication factor N, we can tolerate N-1 server failures without losing any committed records.

We have gone through the basics of Apache Kafka in this article. In the next article, we will learn how to install Kafka.

Mahesh Mogal