Apache Kafka: How It Works Under The Hood

In the previous blog, we introduced Apache Kafka and its key concepts: topics, partitions, producers, consumers, and consumer groups. We saw how Kafka helps decouple microservices and improve scalability and fault tolerance. But how does Kafka actually achieve this at scale? The answer lies in its architecture, which is built around distributed logs, offsets, and replication.

1. Kafka Architecture

1.1 Brokers

A Kafka broker is a server that stores messages and serves client requests. Producers send data to brokers, consumers fetch data from brokers, and brokers manage topics and partitions. Each broker can handle thousands of partitions and millions of messages per second.

A Kafka cluster is typically made up of multiple brokers for scalability (spreading the load) and resilience (no single point of failure).

Example: in a three-broker cluster, a topic with six partitions might be distributed so that each broker manages two partitions.

1.2 Log Files and Structure

When people say “Kafka is a distributed log,” they mean it literally. Kafka stores data in log files on disk, and the way these logs are structured is the secret to its performance and reliability.

Each partition in Kafka corresponds to a log file on disk. This log is an append-only commit log: you can only append new data to the end of the file, and you cannot remove or overwrite records that are already stored. New messages are written at the end of the file, and each message is assigned a unique, ever-increasing offset.

Each log entry typically contains:

- Offset – a unique ID for ordering within the partition.
- Message size – how many bytes the record occupies.
- Message payload – the actual data.
- Metadata – such as a timestamp and checksums for validation.

1.3 Consumer Offsets

Kafka doesn’t delete messages once a consumer reads them. Instead, each consumer group maintains its own record of offsets.
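To make the log structure concrete, here is a minimal Python sketch of an append-only log with per-record offsets. This is an illustration of the idea, not Kafka’s actual implementation; all class and field names are invented for the example.

```python
import time

class PartitionLog:
    """Toy append-only log: records can only be appended, never changed."""

    def __init__(self):
        self._records = []  # index in this list == offset

    def append(self, payload: bytes) -> int:
        """Append a record and return its unique, ever-increasing offset."""
        offset = len(self._records)
        self._records.append({
            "offset": offset,             # ordering within the partition
            "size": len(payload),         # bytes the record occupies
            "payload": payload,           # the actual data
            "timestamp": time.time(),     # metadata, as in a real log entry
        })
        return offset

    def read_from(self, offset: int):
        """Return all records at or after the given offset."""
        return self._records[offset:]

log = PartitionLog()
log.append(b"order-1")
log.append(b"order-2")
print(log.append(b"order-3"))   # offsets increase monotonically: prints 2
print(len(log.read_from(1)))    # records at offset >= 1: prints 2
```

Note that reading never mutates the log; a consumer’s position is just an offset it remembers, which is exactly what makes independent re-reads cheap.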
This means multiple consumer groups can read the same topic independently without interfering, and if a consumer crashes, it can restart and resume at the last committed offset. Offsets are stored in an internal Kafka topic (__consumer_offsets), which allows the cluster to keep track of every group’s progress reliably.

This design is what makes Kafka both a queue (messages are processed once per consumer group) and a publish–subscribe system (multiple groups can consume the same data).

Example: suppose a producer writes a message to partition 2 of the orders topic. Kafka appends it to the active log segment and assigns it offset 105. The consumer group order-processing has committed its last read offset as 104, so when it fetches again, Kafka delivers offset 105 onward. Meanwhile, another group, analytics, may still be reading from offset 90 independently.

1.4 Segments in Log Files

Kafka doesn’t keep one giant file for each partition. Instead, each partition’s log is split into segments (smaller files) on disk:

- A segment is typically a few megabytes to gigabytes in size (configurable).
- When a segment is full, Kafka closes it and starts writing to a new one.
- Each segment is named by the offset of its first message.

This segmentation makes log management efficient: old segments can be deleted or compacted without touching active ones.

1.5 Retention and TTL

Unlike traditional queues, Kafka doesn’t erase messages once they are consumed. Messages remain on disk until their retention policy is triggered. You can configure:

- Time-based retention (TTL) – for example, keep messages for seven days.
- Size-based retention – for example, keep up to 500 GB of logs.
- Infinite retention – messages are never deleted, so Kafka acts like a permanent log store.

This means consumers can re-read old messages or even rebuild state from scratch if needed.

1.6 Leaders and Leader Election

Each partition has a leader replica and one or more follower replicas. The leader handles all reads and writes, while followers replicate the data for fault tolerance. If the leader fails, Kafka performs a leader election and promotes one of the followers. This guarantees high availability and prevents data loss.

Traditionally, Kafka used ZooKeeper to keep track of cluster metadata (for example, broker membership and leader election for partitions). However, modern Kafka versions (2.8+) can run without ZooKeeper thanks to the KRaft (Kafka Raft) protocol, which simplifies operations.

2. Takeaway

Kafka’s true strength lies in its log-based architecture:

- Append-only commit logs provide durability.
- Consumer-managed offsets enable replay and resilience.
- Segmented log files keep storage efficient.
- Retention policies let you balance cost and reprocessability.
- Leader election ensures high availability.
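The segmented-log idea can be sketched as a toy in-memory model. Real Kafka segments are files on disk rolled by size and time; here the roll trigger is simply a record count, and all names are illustrative. Each segment is keyed by the offset of its first record, and retention drops whole old segments without touching the active one:

```python
class SegmentedLog:
    """Toy partition log split into fixed-capacity segments."""

    def __init__(self, records_per_segment: int = 3):
        self.records_per_segment = records_per_segment
        self.segments = {0: []}   # base offset of first record -> records
        self.next_offset = 0

    @property
    def active_base(self) -> int:
        return max(self.segments)  # the newest segment is the active one

    def append(self, payload: str) -> int:
        if len(self.segments[self.active_base]) == self.records_per_segment:
            # roll: close the full segment, name the new one by its first offset
            self.segments[self.next_offset] = []
        self.segments[self.active_base].append((self.next_offset, payload))
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def delete_oldest_segment(self):
        """Retention: drop a whole closed segment, never the active one."""
        if len(self.segments) > 1:
            del self.segments[min(self.segments)]

log = SegmentedLog(records_per_segment=3)
for i in range(7):
    log.append(f"msg-{i}")
print(sorted(log.segments))   # segment base offsets: [0, 3, 6]
log.delete_oldest_segment()
print(sorted(log.segments))   # oldest segment gone: [3, 6]
```

Because deletion happens at segment granularity, retention is a cheap file delete rather than a rewrite of the log, which is one reason Kafka can keep so much data on disk.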

Probabilistic Generative Models Overview

This post is the first and introductory post in the series From Probabilistic Modeling to Generative Modeling. It is my attempt to reproduce the knowledge I gained from learning about probabilistic generative models. In this entry, I break down the general concepts needed to understand five probabilistic generative models: Gaussian Mixture Models (GMM), Variational Autoencoders (VAE), Normalizing Flows (NF), Generative Adversarial Networks (GAN), and Diffusion Models (DM), which will be explained in the next posts.

In this series, these models will be presented in a logical order that I find more intuitive, rather than following a strict categorization. However, I will highlight the characteristics that can be used to differentiate between them for a clearer and deeper understanding. Before starting, it is necessary to understand why generative modeling is useful.

1 – Why Probabilistic Generative Modeling?

To answer this question, we need to understand the difference between discriminative (predictive) and generative models. Consider the famous task of image classification using deep neural networks that predict the class of an image: cat or dog. Szegedy et al. (2013) showed that adding noise to the images can result in false predictions:

- Original image: P(y=cat|x) = 0.95, P(y=dog|x) = 0.05
- Noisy image: P(y=cat|x) = 0.10, P(y=dog|x) = 0.90

The classifier predicts the probabilities of the labels (y) given the images (x), which means it learns the conditional probability p(y|x). How can adding a limited amount of noise, which barely affects the signal, produce such wrong probabilities? This shows that discriminative models don’t really understand the image; they only capture the patterns useful for making predictions. Once those patterns are changed even slightly, you get nonsense predictions.
According to the Deep Learning book:

“Classification algorithms can take an input from such a rich high-dimensional distribution and summarize it with a categorical label—what object is in a photo, what word is spoken in a recording, what topic a document is about. The process of classification discards most of the information in the input and produces a single output (or a probability distribution over values of that single output). The classifier is also often able to ignore many parts of the input. For example, when recognizing an object in a photo, it is usually possible to ignore the background of the photo.”

However, again quoting from the same book: “The goal of deep learning is to scale machine learning to the kinds of challenges needed to solve artificial intelligence.”

In other words, we need models that understand the reality represented by an image—meaning, models able to capture the structure of images and to build a semantic understanding of what they represent and of the whole environment shown in the image. The book gives concrete examples of tasks where this requirement is fundamental, such as density estimation, denoising, missing-value imputation, and, most relevant to this post series, sampling.

2 – Why Probability?

“Probability theory is nothing but common sense reduced to computation.” — Pierre-Simon Laplace (1812)

Probability is the mathematical toolkit we use to deal with uncertainty. It provides the rules and axioms required to quantify uncertain events. As already mentioned, being able to express uncertainty is fundamental for AI systems to make reliable decisions. According to the Deep Learning book, there are two main goals for applying probability theory in AI applications:

- Start from the laws of probability to specify how the system should behave and design it accordingly.
- Use probability and statistics as analysis tools to understand the decisions taken by AI systems.

2.1 – Bayesian vs. Frequentist Approach

There are two approaches to understanding probability: frequentist and Bayesian. The frequentist interpretation focuses on long-run frequencies of events. For example: if I flip a fair coin infinitely many times, half of the flips will be heads. It is the long-run probability of success in estimation and decision-making.

The Bayesian interpretation is about degree of belief. From the Bayesian view, the same example would mean: “we believe the coin is equally likely to land heads or tails on the next toss.” So it is more about information than about repeated trials. A better example to differentiate between them: “The probability that a candidate will win the election is 60%.” This is about a single occurrence of the event (winning the election), so the frequentist approach is not a natural fit here. For more details, refer to Machine Learning: A Probabilistic Perspective. The main point is that the Bayesian approach is used to quantify uncertainty, which is why it is more appropriate for probabilistic generative models.

2.2 – Most Relevant Rules Needed for This Series

Here we focus on the rules needed for the higher-level concepts and their applications. If you are interested in a deeper understanding, I recommend this YouTube video from the Machine Learning group at the University of Tübingen.

2.2.1 – Sum Rule

\[ P(X) = P(X, Y) + P(X, \bar{Y}) \]

- \(P(X)\): probability distribution of the random variable \(X\)
- \(P(X, Y)\): joint probability distribution of \(X\) and \(Y\)
- \(\bar{Y} = \mathcal{E} \setminus Y\), where \(\mathcal{E}\) is the space of events

The sum rule can be used to eliminate some random variables.

2.2.2 – Marginalization

\[ P(X) = \sum_{y \in \mathcal{Y}} P(X, y) \quad \text{if } Y \text{ is discrete} \]

\[ P(X) = \int_{y \in \mathcal{Y}} P(X, y) \, dy \quad \text{if } Y \text{ is continuous} \]

Marginalization gives you the probability distribution of \(X\) if you ignore \(Y\).
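These rules are easy to check numerically. Here is a small Python sketch that marginalizes a discrete joint distribution P(X, Y); the joint probabilities are made-up illustrative numbers:

```python
# Joint distribution P(X, Y) over X in {0, 1} and Y in {0, 1}
# (illustrative numbers; they sum to 1)
joint = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.40, (1, 1): 0.20,
}

# Marginalization: P(X = x) = sum over y of P(X = x, Y = y)
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

print(round(p_x[0], 10), round(p_x[1], 10))  # P(X=0)=0.4, P(X=1)=0.6
```

For two values of Y, this summation is exactly the sum rule above: P(X) = P(X, Y=1) + P(X, Y=0).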
2.2.3 – Conditional Probability

\[ P(Y|X) = \frac{P(X,Y)}{P(X)} \]

- \(P(Y|X)\): conditional probability of \(Y\) given \(X\)
- \(P(X)\): marginal probability

2.2.4 – Product Rule

\[ P(X,Y) = P(Y|X) \cdot P(X) = P(X|Y) \cdot P(Y) \]

The product rule lets you express joint distributions using conditional distributions.

2.2.5 – Law of Total Probability

\[ P(X) = \sum_{i=1}^{n} P(X|Y_i) \cdot P(Y_i), \quad \text{where the } Y_i \text{ are disjoint events covering the event space} \]

2.2.6 – Bayes’ Theorem

\[ P(Y_i|X) = \frac{P(X|Y_i) \cdot P(Y_i)}{\sum_{j=1}^{n} P(X|Y_j) \cdot P(Y_j)} = \frac{P(X|Y_i) \cdot P(Y_i)}{P(X)} \]

- \(P(Y_i|X)\): posterior probability of \(Y_i\) given \(X\)
- \(P(X|Y_i)\): likelihood of \(X\) under \(Y_i\)
- \(P(Y_i)\): prior probability of \(Y_i\)
- \(\sum_{j=1}^{n} P(X|Y_j) \cdot P(Y_j) = P(X)\): the evidence, i.e., the marginal probability of \(X\)
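Bayes’ theorem is easy to verify numerically. A classic sketch with made-up numbers: a disease test with a 1% prior, a 99% true-positive rate, and a 5% false-positive rate:

```python
# Hypothetical numbers for illustration
p_d = 0.01              # prior P(Y_1): person has the disease
p_t_given_d = 0.99      # likelihood P(X|Y_1): positive test given disease
p_t_given_not_d = 0.05  # likelihood P(X|Y_2): positive test given no disease

# Law of total probability: P(X) = sum_j P(X|Y_j) * P(Y_j)
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)

# Bayes' theorem: posterior P(Y_1|X) = P(X|Y_1) * P(Y_1) / P(X)
posterior = p_t_given_d * p_d / p_t
print(round(posterior, 3))  # 0.167
```

Even with a seemingly accurate test, the posterior is only about 17%, because the low prior dominates: this is exactly the kind of reasoning the rules above make precise.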

Apache Kafka: Intro and Key Concepts Every Developer Should Know

1. What You Need to Know Before Kafka

1.1. What’s a Microservice Architecture?

In a microservice architecture, a server is decomposed into different “smaller servers,” each responsible for a specific functionality—also known as a microservice. These can be deployed on separate hardware nodes or isolated within one node using containers or virtual machines. The microservices communicate with each other through endpoints (RESTful APIs) and/or interprocess communication (sockets, pipes). Typically, there is also an orchestrator (server logic) that receives client requests, processes them, and forwards them to the appropriate microservices, as well as a shared database accessed by the different nodes.

1.2. What’s a Message Queue?

In system design, a message queue is a system that allows one service to send messages that are stored temporarily in a queue. Other services can then read and process these messages. This helps decouple services so that if one fails or becomes slow, it doesn’t immediately cause the entire system to fail.

2. Introduction

A group of software-passionate friends decided to start a new project on GitHub: a simple client-server design continuously deployed on the internet. At first, the system was small and minimalistic, and the users who discovered it were happy with the service. But as more users and companies started adopting it, the number of requests skyrocketed. This created latency issues and, eventually, a complete system crash.

After days of debugging, the friends discovered the problem: the Data Analysis microservice was overloaded. While that service could normally afford to lose some requests, its tight coupling with other microservices caused failures to cascade across the system. This issue is a classic example of high coupling in system design. To solve it, the friends researched and decided to adopt Apache Kafka—a decision that could transform their project.

3. Key Concepts

3.1. Producer/Consumer Architecture

The simplest way to understand Apache Kafka is to imagine a message queue with a producer/consumer architecture: a producer sends messages into a queue, and consumers read messages from the queue and process them.

3.2. Offsets

Unlike a traditional queue, Kafka doesn’t delete messages once they are consumed. Instead, it uses offsets to track what each consumer has read. Think of it like a log file: you don’t delete old lines, but you mark the last one you’ve read. This makes Kafka better described as a publish–subscribe system. (In this blog, we’ll use subscriber and consumer interchangeably.)

3.3. Topics

Kafka organizes messages into topics. For example, the group of friends could have:

- sales
- data-analysis
- logging

Multiple services can subscribe to the same topic. For instance, a payment topic might be consumed by one service handling banking, another logging transactions, and another updating stock levels.

3.4. Partitions

Each topic can be split into partitions, which are ordered sequences of messages. Producers write messages to partitions, and consumers read from them. Kafka guarantees order within a partition, not across an entire topic. Each partition is consumed by exactly one consumer in a consumer group, but a consumer can read from multiple partitions.

3.5. Consumer Groups

Consumers can be grouped into consumer groups for load balancing and fault tolerance. If one consumer fails, Kafka redistributes its partitions to the others; if a new consumer joins, it takes over some partitions. This provides scalability (add more consumers to handle more data) and high availability (failures don’t crash the whole system).

4. Takeaway

Kafka helped our group of friends solve their high-coupling problem by acting as a buffer and decoupling their microservices. Instead of services being tightly connected and dependent on each other, they now communicate through Kafka topics.
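The partition and consumer-group mechanics above can be sketched in Python. This is a simplified model: real Kafka hashes keys with murmur2 and coordinates rebalancing through a group coordinator, and the function names here are purely illustrative:

```python
import zlib

NUM_PARTITIONS = 6

def partition_for(key: str) -> int:
    """Producers route a keyed message to a partition by hashing the key,
    so all messages with the same key stay ordered in one partition."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin assignment: each partition belongs to exactly one consumer
    in the group, while a consumer may own several partitions."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Same key always lands on the same partition, preserving per-key order
assert partition_for("user-42") == partition_for("user-42")

# Three consumers share six partitions, two each
print(assign(list(range(6)), ["c1", "c2", "c3"]))
# If c3 fails, its partitions are redistributed among the survivors
print(assign(list(range(6)), ["c1", "c2"]))
```

Running the assignment again with a different member list is, in miniature, what a rebalance does: no partition is ever left unowned and none is owned twice.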
This means one overloaded or failing service no longer brings down the entire system. In short, Kafka provides:

- Decoupling between services
- Scalability through partitions and consumer groups
- Fault tolerance through message retention and redistribution

In the next blog, we’ll look deeper into how Kafka is used in real-world scenarios and how it works internally under the hood. Stay tuned!