Probabilistic Generative Models Overview

This post is the first and introductory post in the series: From Probabilistic Modeling to Generative Modeling. It is my attempt to consolidate the knowledge I gained while learning about probabilistic generative models. In this entry, I break down the general concepts needed to understand five probabilistic generative models: Gaussian Mixture Model (GMM), Variational Autoencoders (VAE), Normalizing Flows (NF), Generative Adversarial Networks (GAN), and Diffusion Models (DM), each of which will be covered in the next posts.

In this series, these models will be presented in a logical order that I find more intuitive, rather than following a strict categorization. However, I will highlight the characteristics that can be used to differentiate between them for clearer and deeper understanding. Before starting, it is necessary to understand why generative modeling is useful…

1 – Why Probabilistic Generative Modeling?

To answer this question, we need to understand the difference between discriminative (predictive) and generative models. Let us consider the famous task of image classification using deep neural networks that predict the class of an image: cat or dog. Szegedy et al. (2013) showed that adding carefully crafted, barely perceptible noise to an image can flip its predicted class.




[Figure: the same image before and after adding a small perturbation. Original image: P(y=cat|x) = 0.95, P(y=dog|x) = 0.05. Noisy image: P(y=cat|x) = 0.1, P(y=dog|x) = 0.9.]


The classifier predicts the probabilities of the labels \(y\) given the images \(x\), which means it learns the conditional probability \(p(y|x)\). How can a small amount of noise that barely changes the signal lead to such wrong predicted probabilities?

This shows that discriminative models don’t really understand the image; they only capture useful patterns to make predictions. Once those patterns are changed even slightly, you get nonsense predictions.
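
To make this concrete, here is a minimal, purely illustrative sketch of the phenomenon (not the setup or attack used by Szegedy et al.): a toy logistic-regression "classifier" whose predicted probabilities flip when each input feature is nudged by a small amount in the direction given by the model's weights. All names and numbers below are made up.

```python
import numpy as np

# Toy stand-in for an image classifier: P(y=cat|x) = sigmoid(w.x + b).
# Purely illustrative; not the architecture or attack from Szegedy et al.
rng = np.random.default_rng(0)
w = rng.normal(size=100)                 # "learned" weights over 100 pixel-like features
b = 0.0

def p_cat(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

# An input the model is confident about (constructed to align with w).
x = 0.05 * np.sign(w) + 0.01 * rng.normal(size=100)
print(f"clean input: P(y=cat|x) = {p_cat(x):.3f}")

# A small perturbation (0.1 per feature) in the direction that lowers the cat score.
eps = 0.1
x_noisy = x - eps * np.sign(w)
print(f"noisy input: P(y=cat|x) = {p_cat(x_noisy):.3f}")
```

Even though no single feature changes by more than 0.1, the predicted class flips, because the perturbation is aligned with exactly the patterns the classifier relies on.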

According to the Deep Learning book:

“Classification algorithms can take an input from such a rich high-dimensional distribution and summarize it with a categorical label—what object is in a photo, what word is spoken in a recording, what topic a document is about. The process of classification discards most of the information in the input and produces a single output (or a probability distribution over values of that single output). The classifier is also often able to ignore many parts of the input. For example, when recognizing an object in a photo, it is usually possible to ignore the background of the photo.”

However, again quoting from the same book: “The goal of deep learning is to scale machine learning to the kinds of challenges needed to solve artificial intelligence.” In other words, we need models that understand the reality represented by an image, meaning they capture the structure of images and have a semantic understanding of what they depict and of the whole environment shown in the image. The book gives concrete examples of tasks where this requirement is fundamental, such as density estimation, denoising, missing value imputation, and, most relevant to this post series, sampling.

2 – Why Probability?

“Probability theory is nothing but common sense reduced to computation.” — Pierre-Simon Laplace (1812).

Probability is the mathematical toolkit we use to deal with uncertainty. It provides us with the required rules and axioms to quantify uncertain events. As already mentioned, being able to express uncertainty is fundamental for AI systems to make reliable decisions.

According to the Deep Learning book, there are two main goals for applying probability theory in AI applications:

  1. Start from the laws of probability to specify how the system should behave and design it accordingly.
  2. Use probability and statistics as analysis tools to understand the decisions taken by AI systems.

2.1 – Bayesian vs. Frequentist Approach

There are two approaches to understanding probability: frequentist and Bayesian.

The frequentist interpretation defines probability as the long-run frequency of an event over repeated trials. For example: if I flip a fair coin infinitely many times, half of the flips will come up heads.

The Bayesian interpretation is about the degree of belief. From the Bayesian view, the same example would mean: “we believe the coin is equally likely to land heads or tails on the next toss.” So it is more about information rather than repeated trials.

A better example to differentiate between them is: “The probability that a candidate will win the elections is 60%.” This concerns a single, non-repeatable event (this particular election), so a long-run frequency interpretation does not really apply. For more details, refer to Machine Learning: A Probabilistic Perspective. The main point is that the Bayesian approach is used to quantify uncertainty, which is why it is more appropriate for probabilistic generative models.

2.2 – Most Relevant Rules Needed for This Series

Here we focus on the rules needed for higher-level concepts and their applications. If interested in deeper understanding, I recommend this YouTube video from the Machine Learning groups at the University of Tübingen.

2.2.1 – Sum Rule

\[ P(X) = P(X, Y) + P(X, \bar{Y}) \]
  • \(P(X)\): probability distribution of random variable \(X\)
  • \(P(X, Y)\): joint probability distribution of \(X\) and \(Y\)
  • \(\bar{Y} = \mathcal{E} \setminus Y\), the complement of \(Y\), where \(\mathcal{E}\) is the space of events

The sum rule can be used to eliminate some random variables.

2.2.2 – Marginalization

\[ P(X) = \sum_{y \in \mathcal{Y}} P(X, y) \quad \text{if } Y \text{ is discrete} \] \[ P(X) = \int_{\mathcal{Y}} P(X, y) \, dy \quad \text{if } Y \text{ is continuous} \]

Marginalization gives you the probability distribution of \(X\) if you ignore \(Y\).
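
As a quick illustration of the sum rule and marginalization in the discrete case, here is a toy joint probability table (the numbers are made up for this example):

```python
import numpy as np

# Toy joint distribution P(X, Y): rows index values of X, columns values of Y.
P_xy = np.array([[0.10, 0.20, 0.05],
                 [0.30, 0.25, 0.10]])   # entries sum to 1

# Marginalization (sum rule): sum out the variable you want to ignore.
P_x = P_xy.sum(axis=1)   # P(X) = sum_y P(X, y)  -> [0.35, 0.65]
P_y = P_xy.sum(axis=0)   # P(Y) = sum_x P(x, Y)  -> [0.40, 0.45, 0.15]
print("P(X) =", P_x)
print("P(Y) =", P_y)
```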

2.2.3 – Conditional Probability

\[ P(Y|X) = \frac{P(X,Y)}{P(X)} \]
  • \(P(Y|X)\): conditional probability of \(Y\) given \(X\)
  • \(P(X)\): marginal probability

2.2.4 – Product Rule

\[ P(X,Y) = P(Y|X)P(X) = P(X|Y)\cdot P(Y) \]

The product rule helps you express joint distributions using conditional distributions.
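
Continuing the toy joint table from the sketch above (redefined here so the snippet runs on its own), conditioning and the product rule look like this:

```python
import numpy as np

# Same toy joint table P(X, Y) as in the marginalization sketch.
P_xy = np.array([[0.10, 0.20, 0.05],
                 [0.30, 0.25, 0.10]])
P_x = P_xy.sum(axis=1)

# Conditional probability: P(Y | X=x_i) = P(X=x_i, Y) / P(X=x_i).
P_y_given_x = P_xy / P_x[:, None]
print("P(Y|X=x0) =", P_y_given_x[0])   # each row sums to 1

# Product rule: P(X, Y) = P(Y|X) * P(X) recovers the joint table.
print("joint recovered:", np.allclose(P_y_given_x * P_x[:, None], P_xy))
```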

2.2.5 – Law of Total Probability

\[ P(X) = \sum_{i=1}^{n} P(X|Y_i) \cdot P(Y_i) \quad , \quad Y_i \text{ disjoint with } \bigcup_{i=1}^{n} Y_i = \mathcal{E} \text{ (a partition of the event space)} \]

2.2.6 – Bayes’ Theorem

\[ P(Y_i|X) = \frac{P(X|Y_i) \cdot P(Y_i)}{\sum_{j=1}^{n} P(X|Y_j) \cdot P(Y_j)} = \frac{P(X|Y_i) \cdot P(Y_i)}{P(X)} \]
  • \(P(Y_i|X)\): posterior probability of \(Y_i\) given \(X\)
  • \(P(X|Y_i)\): likelihood of \(X\) under \(Y_i\)
  • \(P(Y_i)\): prior probability of \(Y_i\)
  • \(\sum_{j=1}^{n} P(X|Y_j) \cdot P(Y_j)\): normalization term
  • \(P(X)\): marginal probability of \(X\) (evidence)
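
A small numerical example (with made-up numbers) of the law of total probability and Bayes’ theorem for a partition with two events \(Y_1, Y_2\):

```python
import numpy as np

# Made-up priors and likelihoods for a partition Y_1, Y_2 and one observed value x.
P_y = np.array([0.3, 0.7])           # priors P(Y_i)
P_x_given_y = np.array([0.8, 0.1])   # likelihoods P(X=x | Y_i)

# Law of total probability: P(X=x) = sum_i P(X=x | Y_i) * P(Y_i).
P_x = np.sum(P_x_given_y * P_y)      # 0.8*0.3 + 0.1*0.7 = 0.31

# Bayes' theorem: posterior P(Y_i | X=x).
P_y_given_x = P_x_given_y * P_y / P_x
print("P(X=x)   =", P_x)
print("P(Y|X=x) =", P_y_given_x)     # ~[0.774, 0.226], sums to 1
```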

2.2.7 – Bayes’ Theorem in Learning

\[ P(X|D) = \frac{P(D|X) \cdot P(X)}{\sum_{x'} P(D|X=x') \cdot P(X=x')} = \frac{P(D|X) \cdot P(X)}{P(D)} \]
  • \(P(X|D)\): posterior
  • \(P(X)\): prior
  • \(P(D|X)\): likelihood given \(X\)
  • \(P(D)\): evidence

Bayes’ theorem tells us how to update our belief in a hypothesis \(X\) when observing data \(D\). \(X\) is usually the set of parameters of the model. Despite the name, the prior is not necessarily what we know before seeing the data, but rather the marginal distribution \(P(X) = \sum_{d \in \mathcal{D}} P(X,d)\) under all possible data.
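
As a minimal, hedged example of Bayes’ theorem in learning, take a coin with unknown bias \(\theta\) as the “hypothesis” and a few observed flips as the data \(D\); the posterior is computed on a grid of candidate \(\theta\) values purely for illustration:

```python
import numpy as np

# Candidate parameter values (a grid over theta, used only for illustration).
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)         # uniform prior P(theta)

D = np.array([1, 1, 0, 1, 1, 1, 0, 1])           # observed flips (1 = heads)
heads, tails = D.sum(), len(D) - D.sum()
likelihood = theta**heads * (1 - theta)**tails   # P(D | theta) for i.i.d. Bernoulli flips

evidence = np.sum(likelihood * prior)            # P(D): the normalization term
posterior = likelihood * prior / evidence        # P(theta | D)

print("posterior peaks at theta =", theta[np.argmax(posterior)])   # near 6/8 = 0.75
```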

3 – Probabilistic Modeling

3.1 – Probabilistic Modeling & Probabilistic Inference

Probabilistic modeling means using probability theory to represent uncertainty. Instead of deterministic values, probabilistic models use probability distributions to describe randomness, variability, and incomplete knowledge. Inference comes from the act of inferring unknown variables.

Probabilistic inference is the use of probabilistic models to compute the probability distribution of unknown variables given observed data. There are different methods to learn probabilistic models. In the following, three main methods will be explained considering this setup: \(D\) is the random variable representing data and \(\theta\) the random variable for parameters.

3.2 – Maximum Likelihood Estimation (MLE)

We assume the data samples are i.i.d. (independent and identically distributed). We choose the parameters \(\theta\) such that the likelihood \(P(D|\theta)\) is maximized. Usually, we maximize the log-likelihood instead: the logarithm is monotonic, so it preserves the maximizer, and it simplifies computations (e.g., turning the product over i.i.d. samples into a sum).

\[ \hat{\boldsymbol{\theta}}_{\text{ML}} = \arg \max_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}) = \arg \max_{\boldsymbol{\theta}} \log p(\mathcal{D} \mid \boldsymbol{\theta}) \]

In simple words: in MLE we look for the parameter values that maximize the probability of seeing the data.
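
A minimal MLE sketch, assuming a Gaussian model for i.i.d. one-dimensional data (for the Gaussian, the maximum-likelihood estimates have well-known closed forms: the sample mean and the sample standard deviation):

```python
import numpy as np

# Pretend this is our dataset D of i.i.d. samples.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)

# Closed-form maximizers of the Gaussian log-likelihood.
mu_ml = data.mean()       # argmax over mu
sigma_ml = data.std()     # argmax over sigma (the ML, i.e. biased, estimate)

def log_likelihood(mu, sigma):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (data - mu)**2 / (2 * sigma**2))

print("theta_ML = (mu, sigma) ≈", (round(mu_ml, 3), round(sigma_ml, 3)))
print("log-likelihood at theta_ML:", round(log_likelihood(mu_ml, sigma_ml), 2))
```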

3.3 – Bayesian Estimation / Bayesian Inference

In Bayesian estimation (also called the full Bayesian approach), we use the posterior distribution from Bayes’ theorem to make predictions. It encodes our beliefs in the value of \(\theta\) after observing data.

\[ p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta}) \cdot p(\boldsymbol{\theta})}{p(\mathcal{D})} \]

Bayesian inference is hard to apply in practice because finding closed-form solutions is often not possible. The problem lies in evaluating the evidence term through marginalization:

\[ p(\mathcal{D}) = \int p(\mathcal{D}, \boldsymbol{\theta}) \, d\boldsymbol{\theta} = \int p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta} \]
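
For a toy one-parameter model the integral can still be approximated numerically, which makes the role of the evidence concrete; with millions of parameters (e.g., a neural network) this is exactly what becomes intractable. The coin-flip numbers below are made up:

```python
import numpy as np

# One-dimensional Bernoulli model: p(D) = integral of p(D|theta) p(theta) over [0, 1].
theta = np.linspace(1e-3, 1 - 1e-3, 1000)
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)                      # uniform prior density on [0, 1]

heads, tails = 6, 2                              # summary of the observed flips
likelihood = theta**heads * (1 - theta)**tails   # p(D | theta)

evidence = np.sum(likelihood * prior) * dtheta   # crude numerical integration
posterior = likelihood * prior / evidence        # p(theta | D), properly normalized density

print("p(D) ≈", round(evidence, 5))              # analytic value is B(7, 3) = 1/252 ≈ 0.00397
```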

3.4 – Maximum a Posteriori (MAP)

Bayesian inference requires computing integrals, which are costly and intractable in many cases (e.g., neural networks). Instead, we can formulate learning as an optimization problem. Maximum a Posteriori (MAP) estimation is a compromise between MLE and Bayesian estimation.

\[ \theta_{\text{MAP}} = \arg \max_{\theta} p(\theta \mid \mathcal{D}) = \arg \max_{\theta} \frac{p(\mathcal{D} \mid \theta) \, p(\theta)}{p(\mathcal{D})} = \arg \max_{\theta} p(\mathcal{D} \mid \theta) \, p(\theta) \]

Since the evidence \(p(\mathcal{D})\) does not depend on \(\theta\), it can be dropped from the maximization. As with MLE, we use the log formulation:

\[ \hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg \max_{\boldsymbol{\theta}} \log p(\boldsymbol{\theta} \mid \mathcal{D}) = \arg \max_{\boldsymbol{\theta}} \big( \log p(\boldsymbol{\theta}) + \log p(\mathcal{D} \mid \boldsymbol{\theta}) \big) \]
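
A minimal MAP sketch for the same coin-flip model, now with a Beta(2, 2) prior on \(\theta\) (an assumption made just for this example); the prior pulls the estimate toward 0.5 compared with the MLE:

```python
import numpy as np

heads, tails = 6, 2
a, b = 2.0, 2.0                                   # Beta prior hyperparameters (assumed)

theta = np.linspace(1e-3, 1 - 1e-3, 10_000)
log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)
log_likelihood = heads * np.log(theta) + tails * np.log(1 - theta)
log_posterior = log_prior + log_likelihood        # up to a constant: p(D) can be dropped

theta_map = theta[np.argmax(log_posterior)]
print("theta_MLE =", heads / (heads + tails))     # 0.75
print("theta_MAP ≈", round(theta_map, 3))         # (heads + a - 1)/(n + a + b - 2) = 0.7
```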

These concepts will be useful later, since generative models rely on the same principles.

4 – Generative Models

These are probabilistic models that not only model the data as a probability distribution, but can also generate or sample from that learned distribution. Deep generative models use deep learning to learn the underlying probability distribution of data and then generate from it, in addition to applying probabilistic inference.

Some models give explicit access to the learned distribution, so the likelihood of a sample can be evaluated (e.g., Normalizing Flows), while others model the distribution only implicitly and can only draw samples from it (e.g., GANs).

5 – Latent Generative Models

First, let’s define latent variables.

5.1 – Latent Variables

There are observed variables, which represent the available data, and latent variables, which represent the hidden structure in the data. Latent variables are typically higher-level (simpler) and lower-dimensional than the observations. They are best understood in the generative context, meaning when we have some object and try to recreate it. For example, when we have a picture of a person and try to draw that same person again, instead of copying the picture line by line or point by point, we extract structural features such as a long nose, round eyes, or a beard and use those features to create a similar picture.




[Figure: another example of latent variables. Latent variable models are expected to automatically identify the hidden representations annotated on the image. Generated with GPT 5.]


5.2 – Latent Models

Latent models are probabilistic models where we assume there are hidden (latent) variables \(z\) that influence the observed data \(x\). We cannot directly observe \(z\), but we assume that \(x\) is generated conditionally from \(z\) following \(p(x \mid z)\). The latent variables \(z\) are drawn from a known prior distribution \(p(z)\), usually Gaussian.

The Gaussian prior is often chosen because it has convenient properties: it is fully defined by its mean and variance/covariance, and the sum of independent Gaussian random variables is again Gaussian, which simplifies computations. In latent models, we start from the joint distribution over observed variables (the data in learning settings), latent variables, and model parameters (the weights of the model). We then use marginalization to obtain the distribution of the observed variables.
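
A minimal sketch of this generative story, with a hypothetical linear “decoder” standing in for the neural networks used by the models covered later in the series:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 10
W = rng.normal(size=(data_dim, latent_dim))   # stand-in for decoder parameters
b = rng.normal(size=data_dim)

def sample(n):
    z = rng.normal(size=(n, latent_dim))      # z ~ p(z) = N(0, I), the Gaussian prior
    noise = 0.1 * rng.normal(size=(n, data_dim))
    x = z @ W.T + b + noise                   # x ~ p(x|z) = N(Wz + b, 0.1^2 I)
    return z, x

z, x = sample(5)
print("latent samples z:", z.shape)   # (5, 2)  low-dimensional, simple
print("observed samples x:", x.shape) # (5, 10) higher-dimensional "data"
```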

6 – Hands-On: Generating Images

In the next posts, generative models will be implemented to generate images. I believe that understanding the “setup” from a probabilistic perspective is very helpful.

6.1 – Data

When we generate images – for example, of cats – we treat those images as realizations of a random variable drawn from a probability distribution. Typically, we have a provided dataset of such images, which we use to train the model with parameters \( \theta \). The parameters \( \theta \) are fitted to represent the data distribution from which the images are drawn.

Why are images considered realizations of random variables with a probability distribution? A random variable is a variable that can take on different values randomly (see Deep Learning Book). It is called random because its outcomes are not necessarily predictable due to unobserved factors.

Images are captured through sensors, so they represent signals obtained by a measurement procedure, such as digital signals in the form of bits organized as pixels. If you take multiple photos of the same cat, the resulting pixel values may vary due to lighting, movement, or camera shake. These uncontrolled or unmeasured factors introduce noise and randomness, which is why images can be seen as random variables.

However, pixel values are not entirely random. Cats in pictures cannot have two heads, nor can they be purple (at least as far as we know). Some pixel values are highly likely, while others are not. This reflects the existence of a probability distribution over possible pixel values.

6.2 – Model

We need a model that learns the data distribution from the training dataset so that we can later use it to generate new samples. This raises two important questions:

  1. What does the model look like (its architecture)?
  2. How does the model learn?

There are many possible designs for a model’s architecture. These will be discussed for each model separately. The second question is more important: what is the loss function that evaluates whether the model has successfully learned the probability distribution?

Here, the maximum likelihood estimation (MLE) principle is applied. We estimate the parameters of the model that maximize the likelihood of the data. Since the likelihood function is not always easy to evaluate, certain tricks are applied in each model for training and sampling. Deep learning plays a fundamental role in designing powerful models.
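
As a minimal illustration of “MLE as the loss”, here is a sketch that fits the mean of a one-parameter Gaussian model \(p_\theta(x) = \mathcal{N}(x; \theta, 1)\) by gradient descent on the negative log-likelihood; deep generative models follow the same principle, only with a neural network in place of the single parameter \(\theta\):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=500)   # training data

theta = 0.0   # model parameter, initialized arbitrarily
lr = 0.1
for step in range(100):
    # NLL(theta) = sum_i 0.5 * (x_i - theta)^2 + const, so the (averaged) gradient is:
    grad = np.mean(theta - data)
    theta -= lr * grad                             # gradient descent on the NLL

print("theta after training ≈", round(theta, 3))   # converges to the sample mean (≈ 3.0)
```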

For completeness, it is worth mentioning that generative models can also be used as MAP estimators or conditional mean (CM) estimators, where the CM estimate is the expectation of the unknown under the posterior obtained from Bayesian inference.

7 – Applications of Generative Models

At the time of writing this post, deep generative modeling is one of the leading research fields in AI. The hype surrounding generative AI is very high. Reliable AI systems should be equipped with generative modeling mechanisms, as highlighted in the introduction.

Applications range from medical imaging, such as creating more precise brain scans for better diagnosis, to synthetic data generation and image translation tasks. For concrete examples, the paper “Generative artificial intelligence: a systematic review and applications” may be useful.

8 – Sources

The following resources were used as references for this post series:

  • Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.
  • Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
  • Christian Szegedy et al., “Intriguing properties of neural networks”, 2013.
  • The lectures of the Machine Learning group at the University of Tübingen (YouTube), referenced in Section 2.2.
  • “Generative artificial intelligence: a systematic review and applications”.
