Apache Kafka is a distributed, low-latency, high-throughput, fault-tolerant platform for handling real-time data feeds. Kafka can publish and subscribe to streams of data, store them durably, and process them if needed. Consider a situation where you have many sources generating data and many target systems that need that data. As the number of sources and targets increases, this system becomes very difficult and inefficient to manage. Kafka is designed to handle these cases. Kafka provides a middle layer that can take data from different sources, store it on servers, and make it available to all target systems in a reliable way.
Kafka has four core APIs: the Producer API, the Consumer API, the Streams API, and the Connector API.
In Kafka, topics are named categories of the data feed. Each record/message is published to a topic, and consumers can read data from one or more topics. A server can have any number of topics as long as their names are unique.
Topics are split into partitions. Each message within a partition is assigned an offset: an increasing number starting from 0. Messages are appended to the end of a partition, and a topic can have any number of partitions.
Messages written to partitions cannot be changed; they are immutable. The ordering of messages is guaranteed only within a partition. Data is distributed across partitions with no fixed placement unless a key is provided with the message; the key determines which partition each message is assigned to, so all messages with the same key land in the same partition.
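The routing and offset behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not Kafka's actual implementation (Kafka's default partitioner uses murmur2 hashing; md5 here is just a deterministic stand-in):

```python
# Simplified sketch of keyed partition assignment and per-partition offsets.
# NOT Kafka's real partitioner -- md5 stands in for murmur2 for illustration.
import hashlib

NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}  # partition -> messages

def pick_partition(key: bytes, num_partitions: int) -> int:
    # Hash the key deterministically, then mod by the partition count,
    # so the same key always maps to the same partition.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def append(key: bytes, value: str):
    p = pick_partition(key, NUM_PARTITIONS)
    offset = len(partitions[p])          # offsets start at 0 and only grow
    partitions[p].append((offset, key, value))
    return p, offset

# Messages with the same key land in the same partition,
# so their relative order is preserved.
p1, o1 = append(b"user-42", "login")
p2, o2 = append(b"user-42", "logout")
assert p1 == p2 and o2 == o1 + 1
```

Because ordering is only guaranteed within a partition, choosing a key (for example, a user ID) is how you get per-key ordering in Kafka.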
Data in Kafka is retained only for a limited time (7 days by default, configurable via log.retention.hours). After the retention period, data is deleted from the server. So 7 days after being written, offset 0 in partition 0 will be deleted. Remember, offsets always increase: even after offset 0 is deleted, it will never be reassigned to new data.
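This retention behavior is easy to see in a toy model: deleting expired records does not reset the offset counter, because the next offset is tracked independently of what is still on disk. A minimal sketch:

```python
# Sketch: retention deletes old records, but offsets keep increasing
# and are never reused.
class PartitionLog:
    def __init__(self):
        self.records = {}       # offset -> message
        self.next_offset = 0    # monotonically increasing counter

    def append(self, msg):
        self.records[self.next_offset] = msg
        self.next_offset += 1

    def expire_before(self, offset):
        # Simulates retention: drop everything older than `offset`.
        self.records = {o: m for o, m in self.records.items() if o >= offset}

log = PartitionLog()
for i in range(5):
    log.append(f"msg-{i}")
log.expire_before(3)    # offsets 0-2 are gone for good...
log.append("msg-5")     # ...but the new message gets offset 5, not 0
assert sorted(log.records) == [3, 4, 5]
```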
Kafka is made up of many servers, and these servers are called brokers. Each broker is identified by its ID. Once you connect to any broker within the cluster, you are connected to the whole cluster. Each broker holds some of the topic partitions. Each partition has one broker that acts as the leader and zero or more brokers that act as followers. All reads and writes for a partition are handled by its leader and replicated to the followers. If the leader of a partition goes down for some reason, one of that partition's followers automatically becomes the new leader.
Each partition in Kafka is replicated to one or more servers. This gives us fault-tolerant storage: even if one of the servers goes down, we can use the replicated data from another server.
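Leader failover can be sketched with a small model. This is an illustrative simplification (real Kafka elects leaders from the in-sync replica set via the controller), but it captures the idea that a follower is promoted automatically when the leader dies:

```python
# Sketch of leader failover: each partition has one leader and some
# followers; if the leader's broker dies, a follower is promoted.
class Partition:
    def __init__(self, replicas):
        self.replicas = list(replicas)   # broker ids holding a copy
        self.leader = self.replicas[0]   # first replica leads

    def broker_down(self, broker_id):
        self.replicas.remove(broker_id)
        if self.leader == broker_id and self.replicas:
            self.leader = self.replicas[0]   # promote a follower

p = Partition(replicas=[1, 2, 3])   # broker 1 leads; 2 and 3 follow
assert p.leader == 1
p.broker_down(1)                    # the leader fails...
assert p.leader == 2                # ...a follower takes over; data survives
```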
Producers publish data to topics. When publishing, a producer must provide a topic name and the address of at least one broker to connect to. The producer is responsible for choosing which partition each record is assigned to. Producers can also choose to receive acknowledgments for data writes.
Consumers read data from topics. Consumers can be grouped together in consumer groups. Each consumer in a group reads data from one partition at a time. If there are 3 consumers in a group and 3 partitions, each consumer reads from one partition in parallel. It is therefore pointless to have more consumers than partitions: the extra consumers sit idle, because each partition is read by only one consumer in a group at a time.
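A round-robin sketch of partition assignment makes the "idle consumer" point concrete (real Kafka supports several assignment strategies, such as range and sticky; this is just the simplest illustration):

```python
# Sketch of consumer-group assignment: partitions are spread across the
# consumers in a group; consumers beyond the partition count get nothing.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

a = assign(partitions=[0, 1, 2], consumers=["c1", "c2", "c3", "c4"])
assert a == {"c1": [0], "c2": [1], "c3": [2], "c4": []}  # c4 sits idle
```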
After a consumer receives and processes data, it commits offsets to Kafka. Kafka stores these offsets in an internal topic named '__consumer_offsets'. If a consumer dies and comes back online after some time, it can use the committed offset to resume reading from the point where it left off.
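The commit-and-resume cycle can be sketched as follows. This is a conceptual model, not the real '__consumer_offsets' mechanism: committed positions are keyed by (group, partition), and a restarted consumer reads its position back instead of starting over:

```python
# Sketch of offset commits: a consumer commits the offset it has
# processed up to; after a restart it resumes from the committed offset.
committed = {}   # (group, partition) -> next offset to read

def commit(group, partition, offset):
    committed[(group, partition)] = offset

def resume_position(group, partition):
    # Unknown group/partition pairs start from the beginning.
    return committed.get((group, partition), 0)

messages = ["m0", "m1", "m2", "m3", "m4"]
pos = resume_position("grp-a", 0)        # first start: begin at offset 0
for offset in range(pos, 3):             # process m0..m2, then "crash"
    commit("grp-a", 0, offset + 1)

pos = resume_position("grp-a", 0)        # restart the consumer
assert pos == 3                          # resumes at m3, no reprocessing
assert messages[pos] == "m3"
```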
Kafka gives us the following guarantees: messages sent by a producer to a particular partition are appended in the order they are sent; a consumer sees messages in the order they are stored in the log; and a topic with replication factor N can tolerate up to N-1 server failures without losing committed messages.
We have gone through the basics of Apache Kafka in this article. In the next article, we will learn how to install Kafka.