Stanford AA228/CS238 Decision Making Under Uncertainty I Policy Gradient Estimation & Optimization
22 Nov 2024
Introduction and Parameterized Policies
- The project proposal should be a sequential decision problem; if you're unsure whether yours qualifies, come by office hours or post on Ed for clarification (27s).
- A policy, denoted as π, is a function that tells us what action to take based on the state we're in, and it can be used to control the temperature in a building by turning on the heat or AC (1m15s).
- The policy might look like a table with an associated action for every possible temperature, but storing such a table isn't feasible for large or continuous state spaces (2m13s).
- To address this issue, policies can be parameterized using parameters θ, and an example of a parameterized policy for temperature control is given using two parameters θ1 and θ2 (2m24s).
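As a concrete illustration of a parameterized policy, here is a minimal sketch; the exact functional form and the meaning of θ1 and θ2 in the lecture are not reproduced here, so the threshold interpretation below is an assumption.

```python
# A two-parameter temperature-control policy (illustrative assumption:
# theta1 is the temperature below which we heat, theta2 the temperature
# above which we cool, with theta1 < theta2).

def policy(theta, temperature):
    """Return an action ("heat", "cool", or "off") given the current temperature."""
    theta1, theta2 = theta
    if temperature < theta1:
        return "heat"
    elif temperature > theta2:
        return "cool"
    return "off"

# Instead of a table with one row per temperature, the policy is fully
# described by the two numbers in theta, so it scales to continuous states.
print(policy((18.0, 24.0), 16.5))  # -> "heat"
```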
Utility and Gradient Optimization
- The utility of the policy, denoted U(θ), is the quantity we are generally interested in optimizing, and many optimization methods rely on the gradient of the utility with respect to θ (3m16s).
- This gradient, denoted ∇θ U(θ), is a vector of partial derivatives ∂U/∂θi, one for each parameter θi (3m51s).
- The lecture covers a few different options for estimating this gradient, and it's noted that this portion of the lecture is calculus-heavy (4m27s).
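Written out (with n denoting the number of parameters, matching the notation used in the regression-gradient section below), the gradient is the vector of partial derivatives:

```latex
\nabla_\theta U(\theta) =
\left[
  \frac{\partial U}{\partial \theta_1},\;
  \frac{\partial U}{\partial \theta_2},\;
  \dots,\;
  \frac{\partial U}{\partial \theta_n}
\right]^\top
```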
Finite Differences Method for Gradient Estimation
- The first method for estimating the gradient of a function is finite differences, which involves computing the slope between two points that are a small distance apart, denoted as Delta, to estimate the derivative at a point of interest (4m56s).
- In the multivariable case, the finite difference method estimates the gradient of U(θ) by computing (U(θ + Δeᵢ) − U(θ)) / Δ for a standard basis vector eᵢ, repeating this for each parameter θᵢ (6m8s).
- The standard basis vector eᵢ has a one in the position corresponding to the parameter being perturbed and zeros elsewhere, allowing the method to estimate the gradient of U(θ) around a point (6m50s).
- The finite difference method can be impacted by the variance of U(θ), which may require a large number of rollouts to obtain a good estimate, and techniques can be used to mitigate this variance (7m46s).
- The θ here are the parameters of the policy parameterization, and the first-order approximation may not always work well in high-dimensional cases or when the parameters are on very different scales, since a single step size Δ may not suit every parameter (8m11s). A code sketch of the finite difference estimate follows below.
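A minimal sketch of forward finite differences, assuming a hypothetical function `U(theta, n_rollouts)` that returns a Monte Carlo estimate of the utility from policy rollouts (its name and interface are assumptions, not from the lecture); because each evaluation is noisy, a large number of rollouts may be needed, as noted above.

```python
import numpy as np

def finite_difference_gradient(U, theta, delta=1e-2, n_rollouts=100):
    """Estimate the gradient of U at theta with forward finite differences.

    U(theta, n_rollouts) is assumed to return a (noisy) Monte Carlo estimate
    of the utility of the policy parameterized by theta.
    """
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    u0 = U(theta, n_rollouts)                      # baseline utility estimate
    for i in range(len(theta)):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0                               # standard basis vector
        grad[i] = (U(theta + delta * e_i, n_rollouts) - u0) / delta
    return grad
```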
Regression Gradient Method for Gradient Estimation
- The next method discussed is the regression gradient, which involves collecting data points and fitting a regression line to estimate the gradient of U(θ) around a point (9m0s).
- The regression gradient method collects data points in the neighborhood of θ and fits a regression line to estimate the gradient at that point (10m38s).
- A matrix of perturbations, ΔΘ, is created, with dimensions m × n, where m is the number of perturbations (one per row) and n is the number of parameters (10m42s).
- The number of perturbations is chosen empirically; a common rule of thumb is to use roughly twice as many perturbations as parameters (m ≈ 2n), though the choice can also be based on computational constraints or expert knowledge (11m21s).
- The perturbations are obtained by randomly sampling from a normal distribution, which works well empirically, and these samples are normalized (11m48s).
- The perturbations can be visualized as points along the surface of a sphere, with each point representing a change in Theta (12m10s).
- When randomly sampling around the point θ, the data set consists of the differences from the original θ and the corresponding differences from the original U(θ); a regression method is then used to fit a line through these points (12m42s).
- The method of sampling on the sphere involves sampling from a normal distribution and then normalizing the samples, so that all points are at a distance of one from the original value (13m36s).
- Each perturbation changes all of the coordinates of θ at once, and stacking the m perturbations gives the m × n matrix (14m31s).
- The random points are called perturbations, and the perturbation is the change applied to Theta (14m53s).
- The data set is built by applying the perturbations to θ and computing the resulting differences in U(θ) (15m21s).
- The computation of ΔU uses rollouts, such as Monte Carlo simulation, to estimate the difference U(θ + Δθ) − U(θ) for each perturbation (15m29s).
- The gradient of U(θ) is estimated by fitting a line between ΔΘ and ΔU, where both come from policy rollouts (16m17s).
- The gradient ∇U(θ) is approximately equal to the pseudo-inverse of ΔΘ times ΔU; the pseudo-inverse is used instead of the regular inverse because ΔΘ is not guaranteed to be invertible (16m44s).
- The pseudo-inverse of a matrix X is X⁺ = (XᵀX)⁻¹Xᵀ, which allows a non-square matrix to be "inverted" (17m19s).
- ΔU has dimensions m × 1, where m is the number of perturbations, so there is one entry in ΔU for each perturbation of θ (18m15s).
- The perturbations are sampled from an n-dimensional normal distribution, so it is not necessary to perturb each parameter separately: every perturbation moves all of the parameters at once (19m1s). A code sketch of the regression gradient follows below.
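A minimal sketch of the regression-gradient estimate under the same assumptions as before (a hypothetical rollout-based estimator `U(theta, n_rollouts)`); perturbations are drawn from a normal distribution and normalized onto a sphere, and the gradient is recovered with the pseudo-inverse.

```python
import numpy as np

def regression_gradient(U, theta, m=None, radius=1.0, n_rollouts=100):
    """Estimate the gradient of U at theta by fitting a regression line.

    U(theta, n_rollouts) is a hypothetical rollout-based utility estimator.
    m defaults to twice the number of parameters, per the rule of thumb above.
    """
    theta = np.asarray(theta, dtype=float)
    n = len(theta)
    m = 2 * n if m is None else m

    # Sample m perturbations from an n-dimensional normal distribution and
    # normalize each row, so every perturbation lies on a sphere around theta.
    delta_theta = np.random.randn(m, n)
    delta_theta *= radius / np.linalg.norm(delta_theta, axis=1, keepdims=True)

    # ΔU: estimated change in utility for each perturbed theta.
    u0 = U(theta, n_rollouts)
    delta_u = np.array([U(theta + dt, n_rollouts) - u0 for dt in delta_theta])

    # ∇U(θ) ≈ pinv(ΔΘ) ΔU
    return np.linalg.pinv(delta_theta) @ delta_u
```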
Likelihood Ratio Method for Gradient Estimation
- The likelihood ratio method computes the gradient of U(θ) starting from an analytical form of U(θ) written as an integral over all possible trajectories τ (19m30s).
- The integrand includes the probability of a trajectory given θ and the return for that trajectory, which discounts rewards over the entire sequence (20m14s).
- The return for a trajectory is the discounted sum of rewards over the entire sequence (21m33s).
- Given the return for a trajectory, the likelihood of that trajectory is determined by the policy parameterized by θ, which tells us what action to take in each state (21m45s).
- The policy determines how likely a trajectory is, and the goal is to optimize the utility of the policy by finding the gradient of the utility (23m1s).
- The process of optimizing the policy is similar to optimizing the parameters of a neural network by finding the gradient of the loss and slowly changing the parameters to improve the utility (23m42s).
- The policy determines the likelihood of a trajectory, and the goal is to take the analytical expression and derive an expression for the gradient of U(θ) from it (24m16s).
- The first step in finding the gradient of U of theta is to take the gradient of both sides of the expression and move the gradient inside the integral (24m33s).
- Multiplying the expression inside the integral by one, in the form pθ(τ)/pθ(τ), allows it to be rewritten in a way that can be simplified into an expectation (25m44s).
- The expression becomes the expectation over τ of (∇θ pθ(τ) / pθ(τ)) R(τ), i.e., ∫ pθ(τ) (∇θ pθ(τ) / pθ(τ)) R(τ) dτ, which has the form of an expectation (26m19s).
- The log derivative trick, mentioned in chapter 10.5, allows ∇θ pθ(τ) / pθ(τ) to be rewritten as ∇θ log pθ(τ), but it will not be derived today (27m52s).
- The gradient ∇θ U(θ) can thus be expressed as an expectation, providing a relationship that can be used to estimate the gradient.
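Collecting the steps above into one line (a reconstruction of the derivation in the lecture's notation, with τ a trajectory and R(τ) its return):

```latex
\nabla_\theta U(\theta)
  = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau
  = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau
  = \int p_\theta(\tau)\, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}\, R(\tau)\, d\tau
  = \mathbb{E}_{\tau}\!\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right]
```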
- To estimate this gradient, it is necessary to find ∇θ log pθ(τ), which depends on the type of policy used, either stochastic or deterministic.
- A deterministic policy always returns the same action for a given state, whereas a stochastic policy provides a distribution over actions for a given state.
- For a stochastic policy, πθ(a | s) is the probability of taking action a from state s.
- The lecture will focus on the stochastic case, which is easier to compute and provides an interesting insight into why it is preferred.
- To find ∇θ log pθ(τ) for a stochastic policy, consider a sequence of states, actions, and rewards (or returns) up to a certain time horizon.
- The probability of a trajectory τ is the product of the probability of the initial state, the transition probabilities, and the likelihood of each action according to the policy.
- The policy does not affect the initial state, which is sampled from a distribution independent of the policy.
- The transition model, together with the action chosen by the policy, determines the likelihood of each next state in the trajectory.
- The product of these probabilities gives the likelihood of the trajectory τ according to the policy parameterized by θ.
- The likelihood of a trajectory is calculated by multiplying the probability of the first state by the product of the policy and transition probabilities for each state in the sequence, resulting in the overall likelihood of the trajectory (34m39s).
- The product involves the policy and transition probabilities for each state in the sequence, with the superscripts on the states representing the time step (35m17s).
- The product starts from the first state, not the second, as it represents the transition from the first state to the second (35m40s).
- Taking the log of the trajectory likelihood converts the product into a summation (36m36s).
- The log likelihood consists of the log probability of the first state plus a summation of the log policy terms and the log transition terms over the sequence (37m17s).
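In symbols (a reconstruction using the superscript-for-time-step notation above; the exact summation ranges are the subject of the bullets that follow):

```latex
p_\theta(\tau) = p\!\left(s^{(1)}\right) \prod_{k}
    \pi_\theta\!\left(a^{(k)} \mid s^{(k)}\right)
    T\!\left(s^{(k+1)} \mid s^{(k)}, a^{(k)}\right)
\;\Longrightarrow\;
\log p_\theta(\tau) = \log p\!\left(s^{(1)}\right)
  + \sum_{k} \log \pi_\theta\!\left(a^{(k)} \mid s^{(k)}\right)
  + \sum_{k} \log T\!\left(s^{(k+1)} \mid s^{(k)}, a^{(k)}\right)
```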
- Taking the gradient of the log likelihood with respect to the policy parameters θ makes the terms that do not depend on θ drop out (38m19s).
- The resulting expression is a summation over the policy terms, which provides another option for estimating the gradient (38m52s).
- The terms that drop out are those that do not have any dependence on the policy parameter, similar to taking a derivative with respect to a variable that is not present in the expression (39m12s).
- For the transition terms, the correct range for the summation is from k = 1 to d − 1, where d is the depth of the trajectory, since going beyond d − 1 would involve a state that is not part of the current trajectory (39m44s).
- When taking the gradient with respect to θ, we end up looking at the state-action pairs; the transition summation only goes to d − 1, but the policy summation can still go to d because the transition at the d-th level doesn't matter (41m29s).
- The counter k in the remaining summation therefore runs from 1 to d, covering the first state and action, then the second, and so on, all the way to d (43m7s).
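The surviving terms give the expression used in the likelihood ratio gradient:

```latex
\nabla_\theta \log p_\theta(\tau)
  = \sum_{k=1}^{d} \nabla_\theta \log \pi_\theta\!\left(a^{(k)} \mid s^{(k)}\right)
```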
- To compute the expectation, we would need to consider all possible trajectories; in practice it's not feasible to iterate over all of them, so we sample trajectories instead, and different sampling strategies can be used (43m55s). A code sketch of this sampling-based estimate follows below.
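A minimal sketch of the resulting Monte Carlo estimator, assuming hypothetical helpers `sample_trajectory(theta)` (returns a list of (state, action, reward) tuples from one rollout of πθ) and `grad_log_pi(theta, a, s)` (returns ∇θ log πθ(a | s)); the discount factor is an assumption, since none is specified above.

```python
import numpy as np

def likelihood_ratio_gradient(sample_trajectory, grad_log_pi, theta,
                              n_traj=100, gamma=0.95):
    """Monte Carlo estimate of ∇θ U(θ) = E[ ∇θ log pθ(τ) R(τ) ].

    sample_trajectory(theta) -> list of (state, action, reward) tuples.
    grad_log_pi(theta, a, s) -> gradient of log πθ(a | s) with respect to theta.
    """
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for _ in range(n_traj):
        tau = sample_trajectory(theta)
        # Return of the trajectory: discounted sum of rewards.
        R = sum(gamma**k * r for k, (_, _, r) in enumerate(tau))
        # ∇θ log pθ(τ): only the policy terms survive (see above).
        grad_log_p = sum(grad_log_pi(theta, a, s) for s, a, _ in tau)
        grad += grad_log_p * R
    return grad / n_traj
```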
- The goal is to calculate the gradient of the utility function, and three different methods for estimating it were presented (finite differences, the regression gradient, and the likelihood ratio method) so that we can optimize the parameters θ of our policy with respect to utility (45m14s).
- We're assuming we're not able to compute the gradient directly, and these methods are possible options to estimate the gradient of the utility function (45m27s).
- The methods discussed are used to estimate the gradient of the utility function so that we can optimize the parameters of our policy, and Kiana will discuss what to do with these estimates (42m53s).