Author name: EfficientxInnovative

Probabilistic Generative Models Overview

This post is the first and introductory post in the series From Probabilistic Modeling to Generative Modeling. It is my attempt to reproduce the knowledge I gained from learning about probabilistic generative models. In this entry, I break down the general concepts needed to understand five probabilistic generative models: Gaussian Mixture Models (GMM), Variational Autoencoders (VAE), Normalizing Flows (NF), Generative Adversarial Networks (GAN), and Diffusion Models (DM), which will be explained in the next posts. In this series, these models are presented in a logical order that I find more intuitive, rather than following a strict categorization. However, I will highlight the characteristics that differentiate them, for a clearer and deeper understanding. Before starting, it is necessary to understand why generative modeling is useful.

1 – Why Probabilistic Generative Modeling?

To answer this question, we need to understand the difference between discriminative (predictive) and generative models. Consider the famous task of image classification using deep neural networks that predict the class of an image: cat or dog. Szegedy et al. (2013) showed that adding a small amount of noise to the images can result in false predictions.

[Figure: the original image is classified with P(y=cat|x) = 0.95, P(y=dog|x) = 0.05; the slightly noisy version with P(y=cat|x) = 0.1, P(y=dog|x) = 0.9.]

The classifier predicts the probabilities of the labels \(y\) given the images \(x\), which means it learns the conditional probability \(p(y|x)\). How can adding a limited amount of noise that barely affects the signal produce such wrong probabilities? This shows that discriminative models do not really understand the image; they only capture the patterns that are useful for making predictions. Once those patterns are changed even slightly, you get nonsense predictions.
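The fragility of the conditional probability \(p(y|x)\) can be sketched with a toy logistic classifier. This is a minimal stand-in for the deep networks studied by Szegedy et al.; the weights, features, and step size below are made up for illustration:

```python
import numpy as np

# Toy logistic "cat vs. dog" classifier with hand-picked (hypothetical) weights.
w = np.array([2.0, -1.5])
b = 0.1

def p_cat(x):
    """P(y=cat | x) under a logistic model."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

x = np.array([1.5, 0.5])   # an "image" reduced to two features
print(p_cat(x))            # confidently "cat"

# The logit's gradient w.r.t. the input is just w, so a small step
# against sign(w) lowers P(cat) as fast as possible per unit of noise.
eps = 0.9
x_adv = x - eps * np.sign(w)
print(p_cat(x_adv))        # the prediction flips toward "dog"
```

The point is not the specific numbers but the mechanism: the classifier only encodes a decision surface over a few salient patterns, so a perturbation aligned with its gradient changes the prediction without changing what the input "is".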
According to the Deep Learning book: "Classification algorithms can take an input from such a rich high-dimensional distribution and summarize it with a categorical label—what object is in a photo, what word is spoken in a recording, what topic a document is about. The process of classification discards most of the information in the input and produces a single output (or a probability distribution over values of that single output). The classifier is also often able to ignore many parts of the input. For example, when recognizing an object in a photo, it is usually possible to ignore the background of the photo."

However, again quoting from the same book: "The goal of deep learning is to scale machine learning to the kinds of challenges needed to solve artificial intelligence." In other words, we need models that understand the reality represented by an image: models able to capture the structure of images and have a semantic understanding of what they represent and of the whole environment depicted in the image. The book gives concrete examples of tasks where this requirement is fundamental, such as density estimation, denoising, missing value imputation, and, most relevant to this post series, sampling.

2 – Why Probability?

"Probability theory is nothing but common sense reduced to computation." — Pierre-Simon Laplace (1812)

Probability is the mathematical toolkit we use to deal with uncertainty. It provides the rules and axioms required to quantify uncertain events. As already mentioned, being able to express uncertainty is fundamental for AI systems to make reliable decisions. According to the Deep Learning book, there are two main goals when applying probability theory in AI applications:

1. Start from the laws of probability to specify how the system should behave, and design it accordingly.
2. Use probability and statistics as analysis tools to understand the decisions taken by AI systems.

2.1 – Bayesian vs. Frequentist Approach

There are two approaches to understanding probability: frequentist and Bayesian. The frequentist interpretation focuses on long-run frequencies of events. For example: if I flip a fair coin infinitely many times, half of the flips will be heads. Probability is the long-run frequency of success over repeated trials. The Bayesian interpretation is about degrees of belief. From the Bayesian view, the same example reads: "we believe the coin is equally likely to land heads or tails on the next toss." So it is about information rather than repeated trials. A better example to differentiate between them: "The probability that a candidate will win the election is 60%." This concerns a single occurrence of the event (winning the election), so the frequentist interpretation is not a natural fit. For more details, refer to Machine Learning: A Probabilistic Perspective. The main point is that the Bayesian approach quantifies uncertainty, which is why it is more appropriate for probabilistic generative models.

2.2 – Most Relevant Rules Needed for This Series

Here we focus on the rules needed for the higher-level concepts and their applications. If you are interested in a deeper treatment, I recommend this YouTube video from the Machine Learning groups at the University of Tübingen.

2.2.1 – Sum Rule

\[ P(X) = P(X, Y) + P(X, \bar{Y}) \]

- \(P(X)\): probability distribution of the random variable \(X\)
- \(P(X, Y)\): joint probability distribution of \(X\) and \(Y\)
- \(\bar{Y} = \mathcal{E} - Y\), where \(\mathcal{E}\) is the space of events

The sum rule can be used to eliminate random variables from a joint distribution.

2.2.2 – Marginalization

\[ P(X) = \sum_{y \in \mathcal{Y}} P(X, y) \quad \text{if } Y \text{ is discrete} \]

\[ P(X) = \int_{y \in \mathcal{Y}} P(X, y) \, dy \quad \text{if } Y \text{ is continuous} \]

Marginalization gives you the probability distribution of \(X\) when you average out \(Y\).
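The sum rule and marginalization can be checked numerically on a small joint table. The probabilities below are made up for illustration:

```python
import numpy as np

# Joint distribution P(X, Y) over X in {0, 1} and Y in {0, 1}.
# Rows index X, columns index Y; all entries sum to 1. (Made-up values.)
joint = np.array([[0.10, 0.30],
                  [0.25, 0.35]])

# Marginalization: P(X) = sum over y of P(X, y) -- collapse the Y axis.
p_x = joint.sum(axis=1)
print(p_x)  # -> [0.4 0.6]

# Sum rule with a binary Y: P(X) = P(X, Y) + P(X, not-Y).
p_x_sum_rule = joint[:, 0] + joint[:, 1]
print(np.allclose(p_x, p_x_sum_rule))  # -> True
```

For a binary \(Y\), marginalization and the sum rule are literally the same computation; marginalization is the general form of it over all values of \(Y\).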
2.2.3 – Conditional Probability

\[ P(Y|X) = \frac{P(X, Y)}{P(X)} \]

- \(P(Y|X)\): conditional probability of \(Y\) given \(X\)
- \(P(X)\): marginal probability of \(X\)

2.2.4 – Product Rule

\[ P(X, Y) = P(Y|X) \cdot P(X) = P(X|Y) \cdot P(Y) \]

The product rule lets you express joint distributions in terms of conditional distributions.

2.2.5 – Law of Total Probability

\[ P(X) = \sum_{i=1}^{n} P(X|Y_i) \cdot P(Y_i), \quad \text{where the } Y_i \text{ are disjoint for all } i \]

2.2.6 – Bayes' Theorem

\[ P(Y_i|X) = \frac{P(X|Y_i) \cdot P(Y_i)}{\sum_{j=1}^{n} P(X|Y_j) \cdot P(Y_j)} = \frac{P(X|Y_i) \cdot P(Y_i)}{P(X)} \]

- \(P(Y_i|X)\): posterior probability of \(Y_i\) given \(X\)
- \(P(X|Y_i)\): likelihood of \(X\) under \(Y_i\)
- \(P(Y_i)\): prior probability of \(Y_i\)
- \(\sum_{j=1}^{n} P(X|Y_j) \cdot P(Y_j)\): the evidence, which equals the marginal probability \(P(X)\) by the law of total probability
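These rules compose: the law of total probability supplies the denominator of Bayes' theorem. A minimal numeric check, with made-up priors and likelihoods for two disjoint hypotheses:

```python
import numpy as np

# Two disjoint hypotheses Y_1, Y_2. Priors and likelihoods are made up.
prior = np.array([0.3, 0.7])        # P(Y_i)
likelihood = np.array([0.8, 0.1])   # P(X | Y_i) for some observed X

# Law of total probability: P(X) = sum over j of P(X|Y_j) * P(Y_j).
evidence = np.sum(likelihood * prior)
print(evidence)   # -> 0.31

# Bayes' theorem: posterior P(Y_i | X) = P(X|Y_i) * P(Y_i) / P(X).
posterior = likelihood * prior / evidence
print(posterior)  # sums to 1; mass shifts toward the hypothesis that explains X
```

Note how the observation overturns the prior: \(Y_1\) starts with only 30% prior mass but ends with the larger posterior, because it explains \(X\) eight times better than \(Y_2\).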