Stanford CS236: Deep Generative Models I 2023 I Lecture 4 - Maximum Likelihood Learning

Autoregressive Models

  • RNNs are a type of autoregressive model that uses a hidden vector to summarize the context and make predictions.
  • Attention mechanisms allow models to take into account the full context when making predictions while being selective about which parts of the sequence are relevant.
  • Transformers can be trained in parallel across sequence positions, making them more efficient to train than RNNs and well suited to large-scale language modeling.
  • Autoregressive models can generate images pixel by pixel, but sampling is slow because the recursion must be unrolled one pixel at a time (see the sampling sketch after this list).
  • Convolutional architectures are better suited to images, but their kernels must be masked to enforce the autoregressive ordering (a minimal masked convolution is sketched below).
  • Attention mechanisms can also be used for images, but they are more computationally intensive to train.
  • Autoregressive models are easy to sample from and allow exact likelihood evaluation, which makes them useful for anomaly detection; they also extend naturally to continuous variables.
  • Autoregressive models can be trained by treating them as a sequence of classifiers.
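
The masked-convolution idea above is concrete enough to sketch. Below is a minimal PixelCNN-style masked convolution in PyTorch; the class name `MaskedConv2d` and all hyperparameters are illustrative choices, not the lecture's exact implementation.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is masked so each output pixel depends only on
    pixels above it and to its left (raster-scan autoregressive order).
    Mask type 'A' also hides the center pixel (used in the first layer);
    type 'B' allows it (used in subsequent layers)."""

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        self.register_buffer("mask", torch.ones_like(self.weight))
        _, _, kh, kw = self.weight.shape
        # Zero out "future" positions in raster order: pixels to the right
        # of the center (including the center itself for type 'A') and all
        # rows below the center row.
        self.mask[:, :, kh // 2, kw // 2 + (mask_type == "B"):] = 0
        self.mask[:, :, kh // 2 + 1:, :] = 0

    def forward(self, x):
        self.weight.data *= self.mask  # re-apply the mask before each conv
        return super().forward(x)

# First layer of a PixelCNN-style model on 1-channel images.
layer = MaskedConv2d("A", 1, 64, kernel_size=7, padding=3)
out = layer(torch.randn(8, 1, 28, 28))  # -> (8, 64, 28, 28)
```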

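To make the "slow to sample" point concrete, here is a hedged sketch of ancestral sampling in raster order. It reuses the illustrative `MaskedConv2d` above; `model` is a toy stack, not the lecture's architecture.

```python
import torch
import torch.nn as nn

# Toy stack of masked convolutions producing one Bernoulli logit per pixel.
model = nn.Sequential(
    MaskedConv2d("A", 1, 16, kernel_size=7, padding=3), nn.ReLU(),
    MaskedConv2d("B", 16, 1, kernel_size=3, padding=1),
)

@torch.no_grad()
def sample(model, n=4, h=28, w=28):
    """Ancestral sampling in raster order: one full forward pass per pixel,
    which is exactly why autoregressive image generation is slow."""
    x = torch.zeros(n, 1, h, w)
    for i in range(h):
        for j in range(w):
            logits = model(x)                      # logits for every position
            p = torch.sigmoid(logits[:, 0, i, j])  # p(x_ij = 1 | earlier pixels)
            x[:, 0, i, j] = torch.bernoulli(p)     # draw this pixel, continue
    return x

images = sample(model)  # (4, 1, 28, 28) binary samples, h*w forward passes
```

By contrast, evaluating the likelihood of a complete image needs only a single forward pass, since all conditionals can be computed in parallel.
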
Generative Models

  • Generative models aim to learn a joint probability distribution over random variables that approximates the unknown data distribution.
  • Autoregressive models measure similarity to the data distribution through the likelihood, which is closely tied to the Kullback-Leibler (KL) divergence.
  • The KL divergence measures the difference between two probability distributions; information-theoretically, it is the expected number of extra bits paid when compressing data from one distribution with a code optimized for the other.
  • Optimizing KL divergence is equivalent to building a generative model that can compress data efficiently.
  • Computing the KL divergence directly is intractable because it involves the unknown data distribution, but dropping the term that is constant in the model parameters turns it into an objective that can be optimized (see the derivation after this list).
  • Other distance metrics besides KL divergence can be used to compare distributions, leading to different types of generative models.
  • Because KL divergence is asymmetric, the choice of whether the data distribution or the model distribution is used as the reference affects the behavior of the learned model.
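
The simplification mentioned above can be spelled out in one line (a standard identity, written in generic notation):

```latex
D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta)
  = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_\theta(x)}\right]
  = \underbrace{\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_{\text{data}}(x)\right]}_{\text{constant in } \theta}
  \;-\; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_\theta(x)\right]
```

The first term does not depend on the model, so minimizing the KL divergence over the parameters is equivalent to maximizing the expected log-likelihood, which can be estimated by averaging over training samples even though the data distribution itself is unknown.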

Training Autoregressive Models

  • The objective of autoregressive models is to maximize the probability of observing a given dataset.
  • Evaluating the likelihood of a single data point is straightforward using the chain rule.
  • Assuming the data points are independent and identically distributed, the probability of the dataset is the product of the probabilities of the individual data points.
  • Maximum likelihood estimation involves finding the parameters that maximize the probability of observing the dataset.
  • Minimizing cross-entropy is equivalent to maximizing log-likelihood (spelled out in the equations after this list).
  • Training involves initializing the parameters randomly, computing gradients of the loss with backpropagation, and performing gradient ascent on the log-likelihood (a minimal training loop is sketched after this list).
  • Stochastic gradient descent over mini-batches makes training scale to large datasets.
  • Regularization techniques are used to prevent overfitting.
  • Cross-validation can be used to evaluate the performance of a model on unseen data and identify overfitting.
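
Combining the first few bullets in symbols (generic notation): for an i.i.d. dataset and an autoregressive factorization,

```latex
\max_\theta \, \log \prod_{x \in \mathcal{D}} p_\theta(x)
  \;=\; \max_\theta \sum_{x \in \mathcal{D}} \log p_\theta(x)
  \;=\; \max_\theta \sum_{x \in \mathcal{D}} \sum_{t} \log p_\theta\!\left(x_t \mid x_{<t}\right)
```

Each conditional is the probability a classifier assigns to the observed next value, which is why maximizing log-likelihood coincides with minimizing the cross-entropy of those classifiers.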

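A minimal training loop matching those bullets, written as a hedged PyTorch sketch: `TinyARModel`, the hyperparameters, and the random placeholder minibatches are all illustrative, not the course's reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, seq_len, batch_size = 256, 32, 16

class TinyARModel(nn.Module):
    """Toy autoregressive sequence model: at each position t, predict a
    categorical distribution over x_t from the prefix x_{<t}."""
    def __init__(self, vocab, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)  # logits for the next token at every position

model = TinyARModel(vocab_size)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    # Placeholder minibatch; a real run would sample from the dataset.
    x = torch.randint(0, vocab_size, (batch_size, seq_len))
    logits = model(x[:, :-1])            # predict x_1..x_{T-1} from prefixes
    loss = F.cross_entropy(              # average negative log-likelihood
        logits.reshape(-1, vocab_size), x[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()                      # gradients via backpropagation
    opt.step()                           # descent on NLL = ascent on log-lik.
```

Running the same loss on held-out data gives the validation cross-entropy used to detect overfitting, as the last bullets describe.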