Kafka (CCDAK)
What is Apache Kafka?
Kafka is a distributed streaming platform. This means that Kafka enables you to:
- Publish and subscribe to streams of data records
- Store the records in a fault-tolerant and scalable fashion for as long as you need
- Process streams of events as they occur or retrospectively
Kafka is written in Java and Scala. It was created at LinkedIn and open-sourced in 2011.
Kafka use cases
Messaging - building real-time data pipelines that reliably move data between systems or applications.
Kafka is mainly used by 2 types of applications:
- Real-time streaming data pipelines that reliably move data between systems or applications
- Real-time streaming applications that transform or react to streams of data
Benefits
- Strong reliability guarantees
- Fault tolerance
- Robust APIs
- Idempotent producer option (retries do not create duplicate messages)
Main concepts and terminology
Events - Also known as records or messages, they store information about something that happened in your applications. An event has a key, a value, a timestamp and optional metadata headers.
key: alice
value: "Made a payment of 200 to bob"
timestamp: "Jun. 25, 2020 at 2:06 p.m."
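As a rough illustration, the event above could be modelled as a small data class. The `Event` class and its field names are made up for this sketch; they are not the Kafka client's record type.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional


@dataclass
class Event:
    """Toy model of a Kafka event (hypothetical, not a Kafka client class)."""
    key: Optional[str]          # used to pick the partition; may be absent
    value: str                  # the payload
    timestamp: datetime         # when the event happened
    headers: Dict[str, str] = field(default_factory=dict)  # optional metadata


payment = Event(
    key="alice",
    value="Made a payment of 200 to bob",
    timestamp=datetime(2020, 6, 25, 14, 6),
)
```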
Producers - Applications that publish events to Kafka.
Offset - A sequential, unique ID of an event within a partition.
Consumers - Applications that subscribe to Kafka topics and read events from them.
Consumer group - Consumers that do not share a group each receive every record of a topic. Within a consumer group, the topic's partitions are divided among the members, so each record is processed by only one consumer in the group.
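A minimal sketch of how partitions could be shared inside a group. The round-robin rule and the `assign_partitions` helper are assumptions made for illustration; Kafka's real assignors are configurable and more involved.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the members of ONE group, round-robin (toy rule)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


topic_partitions = [0, 1, 2, 3]

# Two consumers in the same group split the partitions between them.
group_a = assign_partitions(topic_partitions, ["a1", "a2"])

# A different group is independent, so its single member reads everything.
group_b = assign_partitions(topic_partitions, ["b1"])
```

Here `group_a` ends up as `{"a1": [0, 2], "a2": [1, 3]}` while `group_b` is `{"b1": [0, 1, 2, 3]}`: every record is seen once per group.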
Guarantees - There are multiple types of delivery guarantees that can be provided:
- At most once - Messages may be lost but are never redelivered
- At least once - Messages are never lost but may be redelivered
- Exactly once - Each message is delivered once and only once
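On the consumer side, the difference between the first two guarantees can be pictured as the order of "commit the offset" and "process the record". The simulation below is a toy model in plain Python with a made-up `consume` function and a simulated crash; it does not use any Kafka API.

```python
def consume(log, committed, crash_at, commit_first):
    """Read `log` from offset `committed`; simulate a crash at offset `crash_at`.

    Returns (processed_values, committed_offset_after_the_run).
    """
    processed = []
    offset = committed
    while offset < len(log):
        if commit_first:
            committed = offset + 1          # commit, then process
            if offset == crash_at:
                return processed, committed  # crash before processing: record lost
            processed.append(log[offset])
        else:
            processed.append(log[offset])   # process, then commit
            if offset == crash_at:
                return processed, committed  # crash before commit: record comes again
            committed = offset + 1
        offset += 1
    return processed, committed


def run(commit_first):
    log = ["a", "b", "c"]
    before_crash, committed = consume(log, 0, crash_at=1, commit_first=commit_first)
    after_restart, _ = consume(log, committed, crash_at=None, commit_first=commit_first)
    return before_crash + after_restart


at_most_once = run(commit_first=True)    # "b" is never processed
at_least_once = run(commit_first=False)  # "b" is processed twice
```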
Topics - Events are stored in topics. Topics in Kafka are always multi-producer and multi-subscriber: they can have zero, one or many producers and consumers. Events can be read as often as needed, since they are not deleted on consumption. Deletion is controlled by a per-topic retention setting (for example retention.ms), which defines how long the topic keeps its messages.
Partition - Topics are partitioned, meaning that a topic is spread over a number of buckets located on different Kafka brokers. When a new event is published to a topic, it is appended to one of the topic's partitions. Events with the same key (e.g. customer id) are written to the same partition, and Kafka guarantees that any consumer of a partition reads its events in the order they were written.
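A small sketch of why events with the same key stay ordered: the key is hashed to choose the partition. Note that `zlib.crc32` is only a stand-in here (Kafka's default partitioner uses a murmur2 hash), and `partition_for` is a hypothetical helper.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a partition from the key. crc32 is a stand-in, NOT Kafka's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [("alice", "order 1"), ("bob", "order 2"), ("alice", "order 3")]
for key, value in events:
    # Appending keeps the write order inside each partition.
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, value))

# All of alice's events sit in one partition, in the order they were written.
alice_partition = partitions[partition_for("alice", NUM_PARTITIONS)]
alice_events = [value for key, value in alice_partition if key == "alice"]
```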
Log - An immutable, append-only and partitioned data structure used to store a sequence of events.
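The log and its offsets can be sketched as an append-only list where the offset is simply the position of a record. `PartitionLog` is a made-up class for illustration and ignores storage, segments and retention.

```python
class PartitionLog:
    """Toy append-only log for a single partition (hypothetical class)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return records starting at `offset`; reading never deletes anything."""
        return self._records[offset:offset + max_records]


log = PartitionLog()
first = log.append("payment created")   # offset 0
second = log.append("payment settled")  # offset 1
replay = log.read(0)                     # the same records can be read again later
```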
Replication - To make your data fault tolerant and highly available, every topic can be replicated across different brokers. A common configuration is a replication factor of 3.
Architecture
Brokers are the servers that make up a Kafka cluster. Producers and consumers communicate with them to publish and consume messages.
Kafka has traditionally depended on ZooKeeper to manage the cluster: it coordinates cluster communication and tracks brokers/nodes as they are added, removed and monitored. Newer Kafka versions can run without ZooKeeper in KRaft mode.
Kafka elects one broker to be the Controller, which coordinates assigning partitions and replicas to the other brokers. If the controller goes down, another broker is elected.
Kafka clients and brokers communicate over a binary protocol on top of TCP.
Design
https://kafka.apache.org/documentation/#design
Partitions and replication
A partition is assigned to a specific broker and, to achieve fault tolerance, it is replicated to other brokers. The replication factor is the number of replicas that will be kept on different brokers.
To ensure consistency, Kafka chooses a leader for each partition. The leader handles all reads and writes for the partition. If the leader goes down, the cluster tries to choose a new one through leader election.
By default, only in-sync replicas (ISR) are eligible to become leader. If no in-sync replica is available, Kafka waits until one comes back, and the partition is unavailable in the meantime. This can be changed by enabling unclean leader election (unclean.leader.election.enable), which lets an out-of-sync replica become leader at the risk of losing data.
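The election rule described above can be sketched as a small function. `elect_leader` and its parameters are hypothetical; the real election is carried out by the controller and has more cases than this.

```python
def elect_leader(replicas, isr, failed, allow_unclean=False):
    """Return the new leader's broker id, or None if the partition stays offline."""
    alive = [broker for broker in replicas if broker not in failed]
    in_sync = [broker for broker in alive if broker in isr]
    if in_sync:
        return in_sync[0]        # clean election: no data is lost
    if allow_unclean and alive:
        return alive[0]          # unclean election: this replica may be missing data
    return None                  # wait for an in-sync replica to come back


replicas = [1, 2, 3]  # broker ids holding a copy of the partition

clean = elect_leader(replicas, isr={1, 2}, failed={1})
waiting = elect_leader(replicas, isr={1}, failed={1})
unclean = elect_leader(replicas, isr={1}, failed={1}, allow_unclean=True)
```

With broker 1 down, `clean` is broker 2, `waiting` is `None` because no in-sync replica is left, and `unclean` falls back to broker 2 even though it was out of sync.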
Kafka Streams
Kafka Streams works much like Node.js transform streams: it takes an input, transforms it and pipes the result to an output.
There are 2 types of transformations in Kafka Streams:
- Stateless transformations - Do not require any additional storage to manage state (they process 1 record at a time)
- Stateful transformations - Require a state store to manage state (they need historical data)
Stateless Transformations
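- Branch - Splits a stream into multiple streams based on predicates
- Filter - Keeps only the messages that satisfy a condition (the rest are dropped)
- FlatMap - Takes each event and turns it into zero, one or more records
- ForEach - Terminal operation that runs an action on each record
- GroupBy/GroupByKey - Groups records by a new key (GroupBy) or by the existing key (GroupByKey), usually as a step before aggregating
- Map - Maps each record to another format
- Merge - Merges 2 streams into one
- Peek - Similar to ForEach, but it is not terminal: the stream continues to be processed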
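Kafka Streams itself is a Java library, so the snippet below only imitates a few of these operations with plain Python generators to show that each record is handled on its own, without any stored state. All function names are made up for the sketch.

```python
def filter_stream(stream, predicate):
    """Filter: keep only the records that satisfy the predicate."""
    return (record for record in stream if predicate(record))


def map_stream(stream, mapper):
    """Map: turn each record into exactly one new record."""
    return (mapper(record) for record in stream)


def flat_map(stream, mapper):
    """FlatMap: turn each record into zero, one or more records."""
    return (out for record in stream for out in mapper(record))


def peek(stream, action):
    """Peek: run a side effect, then pass the record on unchanged."""
    for record in stream:
        action(record)
        yield record


records = [("alice", "pay 200"), ("bob", ""), ("carol", "refund 5")]
seen = []  # filled by peek, just to show that it observed the records

pipeline = filter_stream(records, lambda r: r[1] != "")
pipeline = map_stream(pipeline, lambda r: (r[0], r[1].upper()))
pipeline = peek(pipeline, seen.append)
pipeline = flat_map(pipeline, lambda r: [(r[0], word) for word in r[1].split()])

result = list(pipeline)  # consuming the pipeline plays the role of a terminal operation
```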
Stateful Transformations
- Aggregate - Uses GroupBy/GroupByKey to aggregate/count/reduce records per key
- Join - Similar to Merge, but combines records by key like a SQL join (inner join, left join, outer join)
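A rough Python imitation of the two stateful operations. The dictionary stands in for the state store, and the joins are modelled on two keyed tables, ignoring the time windows that stream-to-stream joins need. The function names are hypothetical and this is not the Kafka Streams API.

```python
from collections import defaultdict


def count_by_key(stream):
    """Group by key and count; `store` plays the role of the state store."""
    store = defaultdict(int)
    for key, _value in stream:
        store[key] += 1  # needs what was seen before, hence "stateful"
    return dict(store)


def inner_join(left, right):
    """Keep only the keys present on both sides."""
    return {key: (left[key], right[key]) for key in left if key in right}


def left_join(left, right):
    """Keep every key from the left side; missing right values become None."""
    return {key: (left[key], right.get(key)) for key in left}


payments = [("alice", 200), ("bob", 50), ("alice", 30)]
counts = count_by_key(payments)

totals = {"alice": 230, "bob": 50}
names = {"alice": "Alice A.", "carol": "Carol C."}
inner = inner_join(totals, names)
left = left_join(totals, names)
```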