Stanford CS236: Deep Generative Models | 2023 | Lecture 5 - VAEs
06 May 2024
Latent Variable Models
- Latent variable models introduce unobserved random variables (Z) to capture unobserved factors of variation in the data.
- These models aim to model the joint probability distribution between observed variables (X) and latent variables (Z).
- Latent variable models offer advantages such as increased flexibility and the ability to extract latent features for representation learning.
- By conditioning on latent features, modeling the distribution of data points becomes easier as there is less variation to capture.
- Deep neural networks are used to model the conditional distribution of X given Z: the parameters of that distribution are computed from the latent variables by the networks.
- The challenge lies in learning the parameters of the neural network since the latent variables are not observed during training.
- The function of Z is to represent latent variables that affect the distribution of X.
- In this model, X is not autoregressively generated, and P(x|Z) is a simple Gaussian distribution.
- The parameters of the Gaussian distribution for P(x|Z) are determined through a potentially complex nonlinear relationship with respect to Z.
- The individual conditional distributions P(x|Z) are typically modeled with simple distributions like Gaussians, but other distributions can be used as well.
- The functions used to model P(x|Z) are the same for every Z, but different prior distributions can be used for Z.
- The motivation for modeling P(x|Z) and P(X) is to make learning easier by clustering the data using the Z variables and to potentially gain insights into the latent factors of variation in the data.
- The number (dimensionality) of latent variables is a hyperparameter chosen before training, typically tuned by trying different values rather than learned from the data.
- Sampling from the model is straightforward by first sampling Z from a Gaussian and then sampling X from the corresponding Gaussian defined by the mean and covariance predicted by the neural networks.
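A minimal PyTorch sketch of this ancestral sampling procedure (the two-layer decoder, its layer sizes, and the diagonal-covariance Gaussian are illustrative assumptions, not details from the lecture):

```python
import torch
import torch.nn as nn

# Hypothetical decoder: maps a latent z to the mean and log-std of a
# diagonal Gaussian over x. The sizes below are arbitrary illustrations.
latent_dim, data_dim, hidden_dim = 2, 784, 256

decoder = nn.Sequential(
    nn.Linear(latent_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, 2 * data_dim),  # outputs [mu, log_sigma]
)

def sample_x(num_samples: int) -> torch.Tensor:
    """Ancestral sampling: z ~ N(0, I), then x ~ N(mu(z), diag(sigma(z)^2))."""
    z = torch.randn(num_samples, latent_dim)            # sample the Gaussian prior
    mu, log_sigma = decoder(z).chunk(2, dim=-1)         # networks give the Gaussian params
    return mu + log_sigma.exp() * torch.randn_like(mu)  # sample the conditional

samples = sample_x(16)  # 16 samples, each of dimension 784
```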
Mixture of Gaussians
- The mixture of Gaussians is a simple example of a latent variable model where Z is a categorical random variable that determines the mixture component, and P(x|Z) is a Gaussian distribution with different means and covariances for each mixture component (a sketch follows this list).
- Mixture models can be useful for clustering data and can provide a better fit to data compared to a single Gaussian distribution.
- Unsupervised learning aims to discover meaningful structures in data, but it's not always clear what constitutes a good structure or clustering.
- Mixture models, such as a mixture of Gaussians, can be used for unsupervised learning by identifying the mixture components and clustering data points accordingly.
- However, mixture models may not perform well on tasks like image classification unless the number of mixture components is extremely large.
- An example of using a generative model on MNIST data shows that it can achieve reasonable clustering, but the clustering is not perfect and there are limitations to what can be discovered.
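A minimal NumPy sketch of a mixture of Gaussians viewed as a latent variable model (the weights, means, and standard deviations below are made-up numbers for illustration): sampling first picks a component Z from the categorical prior, and the density marginalizes over Z.

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1-D mixture with K = 3 components.
weights = np.array([0.5, 0.3, 0.2])   # P(Z = k), the categorical prior
means   = np.array([-2.0, 0.0, 3.0])  # mean of P(x | Z = k)
stds    = np.array([0.5, 1.0, 0.8])   # std  of P(x | Z = k)

rng = np.random.default_rng(0)

def sample(n):
    """Sample Z from the prior, then x from the selected Gaussian component."""
    z = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[z], stds[z])

def density(x):
    """p(x) = sum_k P(Z = k) * N(x; mu_k, sigma_k), marginalizing over Z."""
    return sum(w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, stds))

xs = sample(5)
print(xs, density(xs))
```

Clustering then amounts to computing P(Z = k | x) for each data point via Bayes' rule and assigning the point to the most probable component.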
Variational Autoencoders (VAEs)
- Variational autoencoders (VAEs) are a powerful way of combining simple models to create more expressive generative models.
- VAEs use a continuous latent variable Z, which can take an infinite number of values, instead of a finite number of mixture components.
- The means and standard deviations of the Gaussian components in a VAE are determined by neural networks, providing more flexibility than a mixture of Gaussians with a lookup table.
- The sampling process in a VAE is similar to that of a mixture of Gaussians, involving sampling from the latent variable Z and then using neural networks to determine the parameters of the Gaussian distribution from which to sample.
- In a VAE, unlike a mixture of Gaussians, Z is continuous, allowing for smooth transitions between clusters.
- The mean of the latent representation Z can be interpreted as an average representation of all the data points.
- The prior distribution for Z does not have to be uniform; it can be any simple distribution that allows for efficient sampling.
- In a Gaussian mixture model (GMM), there is no neural network; instead, a lookup table is used to map the latent variable Z to the parameters of the Gaussian distribution.
- The marginal distribution over X is obtained by summing (for a mixture of Gaussians) or integrating (for a VAE) over all possible values of Z, as written out after this list.
- The dimensionality of Z is typically much lower than the dimensionality of X, allowing for dimensionality reduction.
- It is possible to incorporate more information into the prior distribution by using a more complex model, such as an autoregressive model.
- The number of components in a mixture of Gaussians is equal to the number of classes K.
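Written out, the marginal likelihood mentioned above makes the contrast explicit: the mixture of Gaussians sums over K components stored in a lookup table, while the VAE integrates over a continuous Z whose component parameters come from neural networks (assuming the standard Gaussian prior used in the lecture):

```latex
% Mixture of Gaussians: Z is categorical with K components (lookup table)
p(x) \;=\; \sum_{k=1}^{K} p(Z = k)\, \mathcal{N}\!\left(x;\, \mu_k,\, \Sigma_k\right)

% VAE: Z is continuous, so the sum becomes an integral and the component
% parameters are produced by neural networks \mu_\theta and \Sigma_\theta
p(x) \;=\; \int p(z)\, p(x \mid z)\, dz
      \;=\; \int \mathcal{N}\!\left(z;\, 0,\, I\right)\,
            \mathcal{N}\!\left(x;\, \mu_\theta(z),\, \Sigma_\theta(z)\right) dz
```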
Challenges in Learning Latent Variable Models
- The challenge in learning mixture models is that the latent variables are missing, requiring marginalization over all possible completions of the data.
- Evaluating the marginal probability over X requires integrating over all possible values of the latent variables Z, which can be intractable.
- If the Z variables can only take a finite number of values, the sum can be computed by brute force, but for continuous variables, an integral is required.
- Gradient computations are also expensive, making direct optimization challenging.
- One approach is to use Monte Carlo sampling to approximate the sum: randomly sample a small number of values for Z and use the sample average as the approximation (a sketch follows this list).
- The sum over Z can be rewritten as an expectation with respect to a uniform distribution over Z, which is why sampling Z uniformly yields a valid (if crude) approximation.
- However, this approach is not ideal because it does not take into account the actual distribution of Z.
- Uniformly sampling latent variables (completions) is not effective because most completions have low probability under the model, so the resulting estimator has high variance.
- A smarter weighting of selecting latent variables is needed to improve the model's performance.
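A minimal NumPy sketch of the naive Monte Carlo approximation described above, using a toy discrete latent variable so that the exact sum can be compared against the estimate (the lookup-table "decoder" and all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete latent variable model: Z takes one of num_z values and
# p(x | z) is a unit-variance 1-D Gaussian whose mean is looked up per z.
num_z = 1000
prior = np.full(num_z, 1.0 / num_z)        # p(z), uniform here for simplicity
means = rng.normal(0.0, 5.0, size=num_z)   # stand-in for a decoder network

def joint(x, z):
    """p(x, z) = p(z) * N(x; mean_z, 1)."""
    return prior[z] * np.exp(-0.5 * (x - means[z]) ** 2) / np.sqrt(2 * np.pi)

x = 1.3  # the data point whose marginal probability we want

# Exact marginal by brute-force summation (only feasible when Z is small and finite).
exact = sum(joint(x, z) for z in range(num_z))

# Naive Monte Carlo: rewrite the sum as |Z| * E_{z ~ Uniform}[p(x, z)] and
# approximate the expectation with a handful of uniformly sampled z values.
k = 20
zs = rng.integers(0, num_z, size=k)
naive = num_z * np.mean([joint(x, z) for z in zs])

print(exact, naive)  # unbiased, but the naive estimate has high variance
```

Because most uniformly drawn z values contribute almost nothing to the sum, the estimate fluctuates wildly from run to run, which is exactly the variance problem that motivates a smarter weighting.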
Latent Variables and Disentangled Representations
- Latent variables (Z) are not necessarily meaningful features like hair or eye color, but they capture important factors of variation in the data.
- Z can be a vector of multiple values, allowing for the representation of multiple salient factors of variation.
- There are challenges in learning disentangled representations where latent variables have clear semantic meanings.
- If labels are available, semi-supervised learning can be used to steer latent variables towards desired semantic meanings.
- Importance sampling is used to sample latent variables more efficiently by focusing on the important (high-probability) completions.
Importance Sampling and the Evidence Lower Bound (ELBO)
- The choice of the proposal distribution Q for importance sampling is crucial and can significantly affect the model's performance.
- The goal is to estimate the log marginal probability of a data point, which is intractable to compute directly.
- Importance sampling estimates the marginal probability by sampling Z from a proposal distribution Q(Z|X) and averaging the ratio of the joint probability P(X, Z) under the model to the proposal probability Q(Z|X).
- This estimator of P(X) is unbiased, but taking its log gives a biased estimate of log P(X); applying Jensen's inequality shows that the expected log-ratio is a lower bound on the log marginal probability.
- The evidence lower bound (ELBO) is this lower bound on the log marginal probability, and it can be optimized instead of the true log marginal probability (the derivation is written out after this list).
- The choice of Q(Z|X) controls how tight the ELBO is, and a good choice of Q(Z|X) can make the ELBO a very good approximation to the true log marginal probability.
- It is easier to derive a lower bound on the log marginal probability than an upper bound.
- The gap between the ELBO and the log marginal probability can be quantified: it is the KL divergence between Q(Z|X) and the model's posterior over Z given X.
- The bound therefore becomes tight when Q is chosen to be the conditional distribution of Z given X under the model (the true posterior).
- This optimal Q distribution is not easy to evaluate, which is why other methods are needed.
- The optimal way of inferring the latent variables is to use the true distribution.
- Inverting the neural network to find the likely inputs that would produce a given X is generally hard.
- The machinery for training a VAE involves jointly optimizing the ELBO with respect to both P (the generative model) and Q (the inference distribution).
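To make the chain of reasoning above concrete, here is the standard derivation in the notation used so far (importance sampling, then Jensen's inequality, with the KL-divergence gap that determines how tight the bound is):

```latex
\log p_\theta(x)
  \;=\; \log \int p_\theta(x, z)\, dz
  \;=\; \log \mathbb{E}_{z \sim q(z \mid x)}\!\left[ \frac{p_\theta(x, z)}{q(z \mid x)} \right]
  \;\ge\; \mathbb{E}_{z \sim q(z \mid x)}\!\left[ \log \frac{p_\theta(x, z)}{q(z \mid x)} \right]
  \;=\; \mathrm{ELBO}(x;\, \theta, q)

% The gap is exactly the KL divergence to the model's posterior, so the
% bound is tight when q(z | x) = p_\theta(z | x):
\log p_\theta(x) \;-\; \mathrm{ELBO}(x;\, \theta, q)
  \;=\; D_{\mathrm{KL}}\!\left( q(z \mid x) \,\middle\|\, p_\theta(z \mid x) \right)
```

Training a VAE maximizes this ELBO jointly over the decoder parameters theta and the inference distribution q, which is the machinery referred to in the last bullet above.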