Stanford AA228/CS238 Decision Making Under Uncertainty I Online Planning and Policy Search

05 Dec 2024 (30 days ago)
Stanford AA228/CS238 Decision Making Under Uncertainty I Online Planning and Policy Search

Introduction to Online Planning and Policy Search

  • The lecture focuses on online planning and policy search, covering the differences between offline and online planning, online planning algorithms, and hybrid planning approaches, such as AlphaZero style methods, which have seen significant success and can be extended to language model applications (5s).
  • Offline policies are typically pre-computed before any actions are taken in the real-world problem, often solving over the entire state space to prescribe an action from any state (59s).
  • Online policies, on the other hand, find actions based on reasoning about the reachable state space, planning forward from the initial state and looking at possible transitions (1m22s).
  • Online planning methods use a receding horizon planning scheme to plan over the reachable state space, which is the foundation of all online planning methods (1m39s).

Online vs. Offline Policies

  • The reachable state space refers to all states that lie in the future, affected by the different actions that can be taken, and is used to plan forward from the initial state (2m7s).
  • The concept of the reachable state space can be applied to real-life situations, such as considering different job offers or moving to a new city, and is also relevant in game setups like chess (2m20s).
  • Offline policies aim to plan from every possible state, prescribing an action to take from any state, whereas online policies only plan from the initial state and look at possible transitions (3m37s).
  • Offline policies are concerned with the entire game strategy, including opening moves, mid-game, and end-game, whereas online policies focus on the current state and possible future transitions (3m48s).
  • Planning from the current board state is focused on the reachable state space, which is directly connected to the idea of receding horizon planning, where planning is done up to a certain depth or horizon D, and then the horizon is shifted as actions are taken and new states are transitioned into (4m9s).

Receding Horizon Planning

  • In receding horizon planning, the process involves planning from the current state up to a certain depth, taking an action, transitioning to a new state, and then replanning, which is similar to what has been seen in the chess example (4m18s).
  • A real-world example of receding horizon planning is in autonomous driving, such as Waymo, where the car's current plan is updated as new information becomes available, and the plan is modified and executed as the car continues on (5m13s).
  • Neither offline nor online planning methods are inherently better at taking into account future rewards or information, as it depends on the problem, size of the state space, and action spaces being dealt with (6m8s).
  • Receding horizon planning is connected to the idea of an agent acting in an environment, taking an action, receiving an observation, and then replanning, which is a key concept in online planning methods (6m31s).

System 1 and System 2 Thinking in Online Planning

  • Online planning methods typically involve a trade-off between spending more time searching or planning forward over different possible futures, resulting in a better-performing policy, which can be related to the idea of system 1 and system 2 level thinking (7m15s).
  • The trade-off between system 1 and system 2 styles of thinking can be seen in online planning, where spending more time thinking before acting can result in better performance, as illustrated by the example of bullet chess (7m48s).
  • Bullet chess is an analogy to system 1 type of thinking, where players have to think and move quickly, relying on intuition and pre-memorized board patterns, without spending too much time thinking about their opponent's response (8m1s).
  • System 2 thinking, on the other hand, involves spending more time evaluating many different possible futures that could occur, thinking about the opponent's response to each move, and rolling out several steps into the future (8m43s).
  • The difference between system 1 and system 2 style of thinking is directly related to the concept of policy rollouts, which is used to estimate the utility of a policy (9m31s).

Policy Rollouts

  • To perform a policy rollout, two things are needed: a policy, typically a stochastic rollout policy that prescribes a distribution over actions to take in some state, and a model of transitions and rewards (9m47s).
  • The policy rollout process involves sampling an action from the stochastic rollout policy, sampling a transition from the transition model, and using the reward model to get a reward for taking an action in a specific state (10m17s).
  • The policy rollout example starts with an initial board state, samples an action from the rollout policy, samples a transition from the transition model, and repeats this process to estimate the utility of a policy (10m52s).
  • The rollout policy can be a random action or an offline learned policy specifically tailored to the game, such as chess (11m1s).
  • The transition model allows sampling successor state transitions, and the reward model gives a reward for taking an action in a specific state (10m25s).
  • The policy rollout process is used to estimate the utility of a policy by rolling out several steps into the future and evaluating the rewards obtained (10m48s).
  • The opponent's strategy is incorporated into the transition model, allowing for the simulation of different scenarios, such as a random opponent or a modeled policy for the opponent (11m48s).
  • A state transition is sampled from the transition model, which includes the opponent's play, and a reward is received before transitioning to a new state (12m5s).
  • This process is repeated, with actions sampled from a rollout policy, and rewards received, to simulate a trajectory of states and actions (12m30s).
  • The trajectory continues until a specified depth or horizon is reached, resulting in a trajectory represented by states, actions, and rewards (13m7s).
  • The discounted sum of rewards received along the rollout trajectory provides an estimate of the policy's performance (13m29s).
  • Multiple policy rollouts can be performed to estimate the utility of following a particular policy, with each rollout providing a reward that can be used to estimate the policy's utility (13m51s).
  • The utility of the rollout policy is estimated as the average of the rewards received over multiple policy rollouts (14m50s).
  • The average reward is calculated by summing the rewards received over m policy rollouts and dividing by m (15m21s).
  • This estimation method allows for the evaluation of a policy's performance and provides a foundation for online planning algorithms (14m26s).

Lookahead with Rollouts Algorithm

  • The lookahead with rollouts algorithm is an online planning algorithm that approximates the utility from a state by using m policy rollouts to estimate the utility from the next state, rather than exactly computing it (17m18s).
  • The lookahead equation is used to extract a policy by taking the argmax over actions, and the lookahead with rollouts policy is different from the rollout policy (17m8s).
  • The lookahead with rollouts algorithm replaces the exact computation of the utility with an estimation using m policy rollouts, which can be a worse approximation depending on the rollout depth (18m57s).
  • The original lookahead equation involves going over all actions, possible states, and evaluating the utility of being in the next state, whereas lookahead with rollouts uses policy rollouts to estimate the utility (18m20s).
  • One limitation of using policy rollouts to estimate the utility is that the rollouts might not visit all possible state transitions, resulting in high variance estimates of the rewards (19m39s).
  • The lookahead with rollouts algorithm uses a rollout policy to estimate the utility from a state, which can be a limited approximation depending on the number of rollouts and the rollout depth (17m49s).
  • The main difference between lookahead with rollouts and the original lookahead equation is that lookahead with rollouts uses policy rollouts to estimate the utility, whereas the original lookahead equation exactly computes the utility (18m17s).
  • The lookahead with rollouts method has limitations, including the need to perform rollouts over all actions and next state transitions, which can be computationally intensive for large state and action spaces (19m45s).
  • Another limitation of the lookahead with rollouts method is that it requires a rollout policy to begin with, and if the initial rollout policy is random, improving upon it may not be sufficient for the actual problem (20m14s).

Forward Search

  • Forward search is an online planning algorithm that builds the entire search tree of all states and action transitions up to a certain depth, allowing it to determine the best action to take from an initial state (20m56s).
  • The depth of the search tree in forward search refers to the number of levels of branching, with each level representing a possible transition (21m31s).
  • The branching factor of forward search is the size of the state space times the size of the action space, which determines how quickly the tree grows as the depth increases (22m1s).
  • The computational complexity of forward search is directly related to the branching factor and is represented as O(|S| * |A|^d), where |S| is the size of the state space, |A| is the size of the action space, and d is the depth of the search tree (22m42s).
  • The big O notation is used to compare algorithms and estimate the amount of work required to perform a specific search algorithm (22m50s).
  • In the example given, with three states and two actions, the computational complexity of forward search at a depth of 1 is 3 * 2^1 = 6, resulting in six nodes in the first layer of the search tree (23m18s).
  • Forward search constructs the entire search tree to find the perfect solution, but it has limitations due to the branching factor, which can lead to exponential complexity, especially in real-world problems with large state and action spaces (23m33s).

Branch and Bound Algorithm

  • The branch and bound algorithm is used to reduce the exponential complexity of forward search by reasoning about bounds on the value function, requiring a lower bound on the value function and an upper bound on the action value function (24m26s).
  • The branch and bound algorithm allows for pruning parts of the search tree when the upper bound of an action at a state is lower than the lower bound of a previously explored action from that state (24m54s).
  • Pruning occurs when the upper bound estimate of an action is less than the best utility estimate already seen, indicating that exploring that part of the search tree would not lead to a better solution (25m2s).
  • The upper bound on the action value function represents the best possible outcome of taking an action in a state, and the lower bound estimate is used to determine the best utility estimate (25m22s).
  • The branch and bound algorithm uses the upper and lower bounds to determine whether to explore a part of the search tree, and if the upper bound is less than the best utility estimate already seen, it prunes that part of the tree (25m25s).
  • The algorithm works by exploring the search tree and using the bounds to determine which actions to take and which parts of the tree to prune, with the goal of finding the best solution without exploring the entire tree (26m40s).
  • The branch and bound algorithm works by comparing the upper bound estimate of the Q value function with the best utility seen so far, and if the upper bound is less than the best utility, it prunes off the rest of the search tree to avoid wasting time exploring that branch (27m22s).
  • When calculating the best utility estimate (U best) versus the upper bound on the action value function, the algorithm goes all the way down to the depth of the search tree to get U best, but uses a lower bound estimate at the bottom of the search tree (28m3s).
  • The upper bound estimate on the Q function only looks at the state-action pair and requires offline prior knowledge of what that upper bound is going to be (28m30s).
  • One of the drawbacks of the branch and bound algorithm is that it requires knowing the upper and lower bounds, which can be difficult to come up with, especially in real-world problems where the reward function is learned from data (28m53s).
  • If the bounds are not tight, the worst-case computational complexity of branch and bound is the same as forward search, as no pruning of the search tree occurs (29m26s).

Sparse Sampling

  • Sparse sampling is an algorithm that tries to reduce the computational complexity of forward search and branch and bound by only considering a limited number of samples of the state transition, introducing an approximation into the search (30m13s).
  • Sparse sampling reduces the branching factor of forward search and branch and bound by not branching on all possible states, but instead taking samples of the state transition (30m28s).
  • Sparse sampling is a technique used to reduce the computational complexity of forward search by sampling state transitions instead of branching on all possible state transitions (31m1s).
  • In sparse sampling, the algorithm still branches on all possible actions but only takes a limited number of samples of the state transition, resulting in a reduced computational complexity of m times A to the d, where m is the number of samples (31m40s).
  • The drawback of sparse sampling is that it relies on the number of samples taken, and if the state space is very large, taking a small number of samples may not provide a good understanding of the state transitions (32m9s).
  • The trade-off in sparse sampling is between taking too many samples and not taking enough samples, and finding the right balance is crucial (32m25s).
  • The algorithms discussed so far are forward search, branch and bound, and sparse sampling, with forward search and branch and bound having a computational complexity of S times A to the d, and sparse sampling having a computational complexity of m times A to the d (32m38s).
  • A way to remember the difference between the algorithms is that forward search and branch and bound make you "sad" due to their high computational complexity, while sparse sampling makes you "mad" due to the need to use sampling-based techniques (33m9s).
  • Using sparse sampling with the number of samples equal to the size of the state space is not equivalent to forward search, as sampling may result in duplicate state transitions (33m31s).
  • Even if the number of samples is equal to the size of the state space, sparse sampling may not build up a unique search tree, especially in large state spaces (34m16s).

Monte Carlo Tree Search (MCTS)

  • Monte Carlo Tree Search (MCTS) is a popular online planning algorithm due to its practicality and scalability to larger problems (34m42s).
  • MCTS works by running m simulations from the current state to avoid exponential complexity, requiring a model of the transition function and the reward function (34m58s).
  • During simulations, MCTS tracks two things: the Q function, an estimate of the Q function as the search tree is built, and the counts, which count the number of times an action is taken in a state (35m29s).
  • The counts can also be used to track the number of times a state is visited by marginalizing out over the actions (36m7s).
  • After building the search tree, MCTS can choose the action that maximizes the Q function, or use other methods such as maximizing the action with the most counts, or a combination of both (36m33s).
  • MCTS has four main steps: selection, expansion, simulation, and backpropagation (37m3s).
  • The selection step involves choosing an action to expand further, which can be done using various methods (37m5s).
  • The expansion step involves sampling a state transition after taking the selected action (37m30s).
  • The simulation step can involve using a rollout policy, such as a random rollout policy, or other methods like an offline learned neural network that estimates the utility of being in a particular state (37m43s).
  • MCTS can be used in hybrid planning methods, which will be discussed later (38m17s).
  • The Monte Carlo Tree Search (MCTS) process involves selection, expansion, simulation, and update steps, which are repeated over and over to estimate the utility of actions in a given state (38m46s).
  • In MCTS, action selection can be done using different exploration strategies, such as the upper confidence bound (UCB) metric, which balances exploration and exploitation (38m59s).
  • The UCB metric is calculated by adding the action value function to a constant C times the logarithm of the number of times a state has been visited, divided by the number of times an action has been taken in that state (39m9s).
  • The UCB metric incentivizes taking actions that have not been tried yet, as the logarithmic term will be infinity if an action has never been taken in a particular state (39m51s).
  • The UCB metric also includes an exploitation term, which incentivizes taking actions that have been seen to perform well in the past (40m9s).
  • The constant C in the UCB metric weights the importance of exploration versus exploitation, allowing for a trade-off between trying new actions and sticking with known good actions (40m27s).
  • The goal of MCTS is to build up an estimate of the action value function, which can be used to determine the best policy to follow (41m6s).
  • The action value function is updated as the MCTS process continues, with the ultimate goal of getting a fully updated action value function that can be used to make decisions (40m46s).
  • MCTS can be used with different policies, but a common approach is to use a greedy policy with respect to the action value function (41m17s).

MCTS Example: 2048 Game

  • The game 2048 is a simple game played on a 4 by 4 board, where the game starts with a 2 or 4 randomly appearing on the board, and the player has four actions to take: move left, right, up, or down (42m13s).
  • The goal of the game is to get the tiles to add up to 2048 by combining like tiles, and the general strategy is to keep high-value tiles in one of the corners of the board (42m52s).
  • The game can be played using the Monte Carlo Tree Search (MCTS) algorithm, which consists of four steps: selection, expansion, simulation, and backpropagation (43m29s).
  • In the selection step, the algorithm selects the action with the highest upper confidence bound, but if all actions have the same count and Q estimate, it arbitrarily selects the first action (44m5s).
  • In the expansion step, the algorithm samples a state transition from the transition function, which in this case is sliding the tiles to the left and randomly generating a new 2 or 4 on the board (44m35s).
  • In the simulation step, the algorithm uses a random rollout policy to take 10 random actions into the future and estimates the utility of the current state (44m55s).
  • The algorithm then updates the visitation counts and Q estimates based on the utility estimate obtained from the simulation step (45m25s).
  • The MCTS algorithm initializes all counts and Q estimates to 0 at the beginning of the game, and updates them as it explores the game tree (43m47s).
  • The process of selecting an action involves choosing the action with the highest upper confidence bound value, which is calculated using the exploration constant and the visitation count of each action (45m54s).
  • The upper confidence bound is used to balance exploration and exploitation, with a higher value indicating a greater need for exploration (45m59s).
  • The process of expanding a node in the search tree involves sampling a state transition, simulating a rollout of 10 random actions, and estimating the utility of the resulting state (46m25s).
  • The utility estimate is then used to update the visitation count and the estimated Q-value of the action (46m52s).
  • The Q-value represents the action value function, which tells us the utility of taking an action in a particular state (49m8s).
  • The process of selecting an action is repeated, with the algorithm choosing the action with the highest Q-value when all actions have been explored (47m37s).
  • When all actions have been explored, the algorithm chooses the action with the highest Q-value, which represents the most exploitative choice (47m53s).
  • The algorithm continues to explore and update the Q-values until a satisfactory solution is found (48m58s).
  • The visitation count is updated by 1 each time an action is taken, and the Q-value is updated by taking the average of the utility estimates (48m56s).
  • The algorithm uses the upper confidence bound to balance exploration and exploitation, and the Q-value to represent the action value function (49m8s).

Q Function and Reward Formulation in 2048

  • The Q function is explicitly written as the immediate reward plus the discount factor times the sum over all states that we could transition to, the transition probability, and then the utility of being in that next state, with the action to take passed in and forcing that forward with that particular action (49m17s).
  • In the context of the game 2048, the reward is formulated based on the board configuration, with different weighting functions that can be used on the tiles, such as higher weights in the corners to incentivize the agent to put high-value tiles in those corners, or an even weighting and summing up the tiles on the board (49m53s).
  • Different heuristics or utility functions can be used, such as an end state of the game of 1 or 0, did you win, but this can be challenging as it requires searching for a really long depth horizon to get any signal (50m28s).
  • After updating the Q function, the search tree is left, and this process would repeat over and over until the number of simulations to perform has been reached or a time limit has been exceeded (50m53s).
  • In Monte Carlo Tree Search (MCTS), the number of simulations can be set, and a time limit can be specified, making it an any-time algorithm that can be stopped whenever, with the estimates returned being used to make an action (51m5s).
  • The reason for having two counts for the up action is that a new board state is randomly generated, and it's unlikely to end up with the exact same state as before due to the random appearance of 2 or 4 in open squares (51m31s).
  • In MCTS, it's expected to go through every single action at least once, but this is dependent on the number of iterations and the use of the upper confidence bound, as a large action space and limited iterations may prevent branching on all actions (52m41s).

Upper Confidence Bound (UCB) in MCTS

  • The Upper Confidence Bound (UCB) algorithm is an exploration strategy that was originally developed for multi-armed bandit problems, which involves choosing which lever to pull in a multi-armed bandit (53m44s).
  • The UCB algorithm chooses an action that hasn't been seen yet, unless it has an infinite Q value, which is not reasonable (53m16s).
  • One of the drawbacks of the algorithm is that it needs to keep track of each possible state, which can take up a huge amount of memory as the number of steps increases (54m4s).
  • However, the algorithm only stores state transitions once they are seen, which remedies the drawback of storing every possible state (54m33s).
  • The algorithm can result in a shallow search tree, especially in problems with a large branching factor, such as the game 2048 (55m8s).
  • The shallowness of the search tree can be overcome by changing the value of C, which controls the exploration term (55m54s).
  • The algorithm can be modified to enforce going deeper in the tree, such as by adjusting the value of C or using other techniques (56m14s).
  • In the example of the game 2048, the algorithm would select the action "left" after the initial simulations, due to its higher exploration bonus and Q estimate (56m37s).

MCTS in Autonomous Agents

  • In the context of an autonomous agent using the Monte Carlo Tree Search (MCTS) approach in a game like 2048, the real board state in the real world is perceived by the agent, which then builds a search tree internally through forward simulation and rollouts based on its model of the game, without actually affecting the real game state (57m17s).
  • The agent's search tree is generated internally, and after selecting an action, the agent acts in the real-world environment, changing the actual board state, and then the process repeats (57m46s).
  • When the process repeats, the agent can maintain the same search tree but continue from the actual state it's in, which is a more efficient implementation, but not necessarily part of a vanilla MCTS implementation (58m18s).
  • There are different variants of MCTS, including tree search depth and rollout depth, which can be set to a fixed number or until an end game state is reached (58m47s).
  • Another variant is the use of a value function estimate, which can be an offline learned estimate, instead of relying solely on rollouts, leading to a hybrid planning style (59m13s).
  • The exploration strategy used in MCTS is not limited to upper confidence bound, and the source of the model is crucial, as MCTS requires a good understanding of the transition and reward model to be effective (59m26s).
  • If the transition and reward model are not well understood, using MCTS may not be the best approach, as it can lead to wasting time building an unrealistic search tree (59m33s).

Transition and Rollout Policies in MCTS

  • In the context of the 2048 game simulation, the transition model used in the rollouts assumes a uniform probability distribution over possible actions (up, down, left, right) after the initial action is chosen (1h0m29s).
  • The state transitions in a game simulator are based on random rollouts and action selection, with the transition model determining where the 2 or 4 is generated on the board and the action that is selected (1h1m8s).
  • The rollout policy is used to select actions, but the exploration/exploitation step is not run on the future rollout steps, and the upper confidence bound is not used in the rollouts (1h1m25s).
  • The rollout depth is the number of steps that are simulated, and in the example, it was set to 10, while the tree search depth was 3 (1h1m48s).
  • Progressive widening is a method that can be used to limit the number of branches in the search tree, forcing it to go deeper before it gets wider (1h2m7s).

AlphaGo Zero and Hybrid Planning

  • AlphaGo Zero achieved superhuman performance in the game of Go using a hybrid planning approach that combined an offline learned neural network with Monte Carlo Tree Search (MCTS) and self-play (1h3m0s).
  • The self-play step in AlphaGo Zero involved having the agent play against itself, using MCTS and the offline learned neural network to select actions and update the network (1h3m24s).
  • The offline learned neural network in AlphaGo Zero took the raw board state as input and output a vector of action probabilities and a value function (1h4m20s).
  • The AlphaGo Zero algorithm uses a neural network to output two things: the probability of selecting each action and the value estimate, which is the probability that the current agent ends up winning the game (1h4m35s).
  • The algorithm updates the network parameters based on examples from self-play iterations, where the outcome of the game is passed back through to all the training examples (1h4m59s).
  • The algorithm tries to match the action probabilities to what was produced by the tree search setup, and by repeating this process of self-play and network updates, AlphaGo Zero achieved superhuman performance (1h5m26s).
  • The results show that AlphaGo Zero outperforms previous versions of AlphaGo, with the y-axis showing the Elo rating, a measure of human performance in games (1h5m37s).
  • The raw network, without the tree search, performs worse than the AlphaGo Zero algorithm, illustrating the importance of the system 2 level of thinking, which involves looking at multiple possible actions and using intuition to guide decision-making (1h6m0s).
  • Currently, no one has trained a raw neural network that is superhuman in Go, although it is possible that this may happen soon (1h6m37s).
  • DeepMind released a paper that achieved grandmaster performance in chess without using search, but this has not been done yet for Go (1h6m48s).
  • During the self-play step, the search comes into the algorithm by building up an MCTS search tree using the network estimates, and then sampling an action from the policy built up as a result of this search tree (1h7m36s).
  • In the self-play step, each player has a separate MCTS search tree, although this needs to be double-checked (1h8m4s).

Hybrid Planning in Language Models

  • Extending hybrid planning approaches to language model applications is an interesting area, where the idea of system 1 and system 2 thinking also applies (1h8m21s).
  • The field of online planning algorithms is rapidly advancing, with many researchers working on it, and it's interesting to relate it to current research, particularly in the context of language models and hybrid planning approaches (1h8m36s).
  • Current language models use an autoregressive next token prediction approach, which is a system 1 level approach that involves predicting the next token based on the input sequence, similar to the AlphaZero style example (1h8m57s).
  • This approach is like playing with the raw network, taking the next action with respect to the raw network, and playing greedily with respect to that (1h9m20s).
  • System 2 level thinking involves taking a hybrid planning approach, searching over future sequences of tokens to build up a better strategy (1h9m31s).
  • Hybrid planning methods have a trade-off between offline and online compute, and research has shown that investing in online compute can achieve the same performance as investing in offline learned neural networks, but with a 10 to 15 trade-off (1h10m17s).

Policy Search

  • Policy search is a method that focuses on searching the space of policies without directly computing a value function, and it's often more efficient than searching over the state space (1h11m36s).
  • Policy search involves introducing a parameterized policy, represented by a vector of parameters, which can be the weights and biases of a neural network or the parameters of a linear function (1h12m1s).
  • The parameterized policy can be searched more efficiently than the state space, depending on how the policy is parameterized (1h11m47s).
  • A policy is represented as pi theta, denoted by a vector of parameters theta, which takes in a state and outputs an action to take from that particular state (1h12m34s).
  • The goal is to update the parameters in a way that improves the policy, and the utility of the policy is represented as u of theta (1h13m18s).
  • The utility of the policy can be approximated using m policy rollouts, which involves performing m rollouts using the current policy and estimating the utility of those specific policy parameters (1h13m35s).

Local Search for Policy Search

  • Local search is a method for policy search that starts with a current policy parameter vector and performs m policy rollouts to estimate the utility of those parameters (1h14m9s).
  • Local search works by taking a plus and minus step in each parameter direction, evaluating each candidate point using m policy rollouts, and stepping to the best performing point (1h14m36s).
  • The process is repeated, and if none of the candidate points outperform the current policy, the alpha factor is shrunk, and the process is repeated until convergence criteria are met (1h15m47s).
  • The local search algorithm terminates when the convergence criteria are met, and the final policy parameters are determined (1h16m5s).
  • Local search involves making changes to the current policy and evaluating the results through policy rollouts, but it has drawbacks such as requiring m rollouts for each of n candidate points, making it intensive for large parameterizations, and needing to set an initial alpha step size (1h16m11s).

Cross Entropy Method for Policy Search

  • The cross entropy method improves on local search by using a search distribution, represented by p theta of psi, which parameterizes the policy vector of parameters, and can be a Gaussian distribution or other types of distributions (1h16m47s).
  • The cross entropy method works by sampling k points from the search distribution, evaluating each sample using policy rollouts, and refitting the search distribution to the best-performing samples, repeating the process until convergence criteria are met (1h17m32s).
  • The method uses a Gaussian distribution, with the policy parameters distributed according to this distribution, and the contours of the Gaussian are used to visualize the search space (1h17m21s).
  • The cross entropy method is more efficient than local search in distributing candidate points and allows control over the number of samples taken, but requires choosing the number of best-performing samples to keep around, which is a trade-off (1h19m12s).
  • If too many best-performing samples are kept, the search distribution will not change much, but if too few are kept, the search distribution will collapse quickly (1h19m32s).
  • The cross entropy method can converge to a different local optima than local search, and the process is repeated until convergence criteria are met (1h18m47s).
  • Policy search involves a trade-off between keeping too many samples and not keeping enough, with the number of samples being a tuning parameter (1h19m49s).

Summary of Online Planning and Policy Search

  • Policy rollouts can be used to estimate the performance of a parameterized policy (1h20m0s).
  • Local search focuses on making local changes to the policy (1h20m7s).
  • The cross entropy method introduces a search distribution to improve upon local search (1h20m13s).
  • Online planning and policy search methods focus on reasoning about states reachable from the current state using a receding horizon planning scheme (1h20m20s).
  • Approaches to online planning include pruning the state space, sampling state transitions, and replanning along promising trajectories (1h20m34s).
  • Policy search methods, including policy rollouts, local search, and the cross entropy method, can be used in conjunction with online planning (1h20m44s).

Overwhelmed by Endless Content?