Stanford CS234 Reinforcement Learning I Value Alignment I 2024 I Lecture 16

03 Nov 2024

CS234 Last Lecture Review and Quiz Discussion

  • The last lecture for CS234 will cover a review and wrap-up of the course, as well as a discussion of the quiz, which will be returned to the students within a day (5s).
  • The quiz was comprehensive, covering the entire course, and some questions proved to be more challenging for the students than others (57s).
  • A clarification was provided that when a justification for a choice was requested, an explanation or rationale was expected, rather than just restating the choice itself (1m7s).
  • The second question on the quiz was identical to the midterm question, providing an opportunity for students to refresh their knowledge (1m29s).
  • The third question on the quiz focused on proximal policy optimization (PPO), which was implemented by the students, and emphasized the importance of understanding that the first step in PPO is always on-policy (1m45s).
  • PPO allows data to be reused for multiple gradient steps, but the first step is always on-policy; subsequent steps are off-policy, using data collected in the previous round (2m4s), as in the clipped-objective sketch after this list.
  • The second part of the third question stated that there are no guarantees on B, and the third part emphasized that importance sampling is only done over actions, not states (2m42s).
  • The last part of the third question noted that various types of advantage estimators can be used, and that generalized advantage estimation is not the only option (3m17s).
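
Below is a minimal sketch of the clipped PPO surrogate discussed in this question, assuming generic tensors of log-probabilities and advantages collected under the previous policy; the function name and arguments are illustrative, not the assignment's actual code.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss (to be minimized).

    new_log_probs: log pi_theta(a|s) under the policy being updated
    old_log_probs: log pi_theta_old(a|s) from the policy that collected the data
    advantages:    any advantage estimate (GAE is one option, not the only one)
    """
    # The importance ratio is over action probabilities only; the states simply
    # come from the old policy's rollout and are not reweighted.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# On the very first gradient step after collecting data, new_log_probs equals
# old_log_probs (ratio = 1), so that step is on-policy; later epochs on the same
# batch are off-policy with respect to the updated parameters.
```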

Alignment Problem and Autonomy in AI

  • The fourth question was provided by a guest lecturer, Dan, who discussed the alignment problem and its importance; the first statement in that question was not true (3m30s).
  • There are different ways to think about autonomy, but the focus is on the implications for broader society and collections of individuals, rather than just individual preferences (3m51s).
  • Moral theories provide a way to think about broader benefits to society and collections of individuals, rather than just individual benefits (3m58s).
  • Autonomy is often a core principle when thinking about the value of different decisions, and an AI agent that allows people to have autonomy might support suboptimal decisions (4m28s).
  • An AI agent that decides what is not in a person's best interest, such as not telling them where to buy cigarettes, can be seen as a form of paternalism that undermines user autonomy (5m12s).
  • The concept of "best interest" and "suboptimal decisions" can be contradictory, as decisions that are in a person's best interest are often considered optimal (5m41s).
  • Different notions of what is considered optimal or not optimal can lead to conflicting objectives, such as valuing user autonomy over health outcomes (6m12s).
  • The importance of user autonomy can be weighed against other factors, such as health outcomes, when determining what is in a person's best interest (6m33s).
  • It can be challenging to generalize what is in a single user's best interest, as people may have different needs and abilities when it comes to making decisions (7m6s).
  • The principle of autonomy suggests that everyone needs some amount of autonomy, and this can be used as a guiding principle when designing AI systems (7m28s).
  • Autonomy matters for individuals and includes the freedom to make bad decisions, which a supporting Large Language Model (LLM) should respect, provided one accepts the premise that every individual should have some amount of autonomy (7m42s).
  • Different people in society have different amounts of autonomy, with children generally having less than adults (8m3s).
  • The justification for autonomy is not something that can be assumed, and it's a premise based on theories that promote autonomy for everyone (8m19s).

Monte Carlo Tree Search (MCTS) and its Applications

  • Monte Carlo Tree Search (MCTS) is a method that can be used in both Markov and non-Markov systems, as it samples from the dynamics model to get to a next state (8m56s).
  • MCTS uses sampling to estimate the expectation of the next state, allowing for an accurate estimation without having to enumerate all possible states (9m33s).
  • Upper confidence bounds are useful in MCTS even when the reward model is known, since they prioritize which actions to try and which branches of the tree to expand (10m3s); a minimal selection-rule sketch appears after this list.
  • AlphaGo uses MCTS with upper confidence bounds to prioritize actions and expand the tree (10m16s).
  • AlphaZero uses MCTS with self-play to improve the network that predicts values and action probabilities (10m28s).
  • UCT (upper confidence bounds applied to trees) is a specific instance of MCTS; MCTS is the broader family (11m22s).
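
A minimal sketch of the kind of upper-confidence selection rule used to prioritize actions in MCTS/UCT, assuming per-action visit counts and value sums are tracked at a node; the names and the exploration constant are illustrative.

```python
import math

def uct_select(value_sum, visit_count, total_visits, c=1.4):
    """Return the action index maximizing mean value plus an exploration bonus.

    value_sum[a]   : sum of returns observed after taking action a at this node
    visit_count[a] : number of times action a has been tried at this node
    total_visits   : total number of visits to this node
    c              : exploration constant trading exploitation against exploration
    """
    best_action, best_score = None, float("-inf")
    for a in range(len(visit_count)):
        if visit_count[a] == 0:
            return a  # always expand untried actions first
        mean_value = value_sum[a] / visit_count[a]
        bonus = c * math.sqrt(math.log(total_visits) / visit_count[a])
        if mean_value + bonus > best_score:
            best_action, best_score = a, mean_value + bonus
    return best_action

# Example: an action with a slightly lower mean but far fewer visits can win,
# which is how the bound decides which branch of the tree to expand next.
print(uct_select(value_sum=[45.0, 4.0], visit_count=[50, 5], total_visits=55))  # -> 1
```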

Reinforcement Learning (RL) Algorithms and Applications

  • ChatGPT learned rewards from humans by collecting preferences (rankings) over prompt-output pairs and then used PPO to train a better policy; this is useful for long-horizon problems and large action spaces where forward search would be expensive (11m46s).
  • Using methods like AlphaZero, which builds only a subset of the full search tree, can be helpful in such cases; Probably Approximately Correct (PAC) methods guarantee learning an epsilon-optimal policy, but epsilon might be non-zero, meaning the policy may not be fully optimal (12m16s).
  • PAC RL algorithms can be used if a slightly suboptimal policy is acceptable, since they make only a finite number of mistakes and keep the policy near-optimal most of the time (12m31s).
  • Offline RL may be beneficial in high-stakes settings like healthcare, where online exploration can be risky or expensive (12m39s).
  • Optimism in RL is not guaranteed in general, but PAC algorithms guarantee epsilon-optimality, and minimizing regret is equivalent to maximizing expected cumulative rewards (13m18s).
  • PAC algorithms do not guarantee sublinear regret, as epsilon mistakes can still occur, leading to linear regret (13m33s).
  • The only true statement about PAC algorithms in that question is that they make a finite number of mistakes, which implies polynomial convergence toward the optimal policy, though that polynomial can still be expensive (14m20s).
  • PAC algorithms require that the number of mistakes made is, with high probability, a polynomial function of the problem parameters, including the sizes of the state and action spaces and 1/epsilon (15m32s); the formal condition is sketched after this list.
  • Reinforcement learning can be used as long as there is a guarantee with high probability that the total number of mistakes would be small, and algorithms have been developed to accommodate different epsilons based on available budget and data (16m3s).
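
A sketch of the PAC condition described above, written in standard notation (here |S| and |A| are the state and action space sizes, gamma the discount factor, and delta the failure probability); the exact set of parameters varies by paper, so this is indicative rather than a single canonical definition.

```latex
\Pr\Big( \#\big\{\, t : V^{\pi_t}(s_t) < V^{*}(s_t) - \epsilon \,\big\}
   \;\le\; \mathrm{poly}\!\big(|S|,\, |A|,\, \tfrac{1}{\epsilon},\, \tfrac{1}{\delta},\, \tfrac{1}{1-\gamma}\big) \Big) \;\ge\; 1 - \delta
```

Here a "mistake" at time t means acting more than epsilon worse than optimal; the bound says such timesteps are few with high probability, but it does not imply sublinear regret.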

Course Objectives and Expectations

  • The course covers key features of reinforcement learning, its differences from supervised learning and AI planning, and how to apply it to various application problems (17m36s).
  • Students are expected to understand how to implement and code RL algorithms, compare and contrast different algorithms, and evaluate their performance using metrics such as regret, sample complexity, and empirical performance (17m55s).
  • The exploration-exploitation challenge is a fundamental problem in reinforcement learning, where there is a trade-off between gathering data to learn about the environment and using that information to obtain high rewards (18m21s).
  • The course aims to equip students with the skills to decide whether reinforcement learning is an appropriate tool to solve a given problem, and to apply RL to various domains (18m52s).

Motivating Domains for Reinforcement Learning: AlphaTensor, Plasma Control, and COVID Testing

  • The motivating domains for reinforcement learning include AlphaGo, which was discussed in the first lecture (19m20s).
  • AlphaTensor is an algorithm that uses reinforcement learning to discover more efficient ways to multiply matrices, a task that comes up everywhere in AI and machine learning; the DeepMind researchers were looking to invent better algorithms for such basic substructures (19m31s).
  • AlphaTensor is an example of using reinforcement learning to learn algorithms themselves, which is a creative and exciting approach (19m41s).
  • Matrix multiplication is a basic operation that appears throughout AI and machine learning, and the question was how to operationalize the search for algorithms with lower computational complexity (19m51s).
  • Students now have the tools to attempt the same type of algorithm discovery for other basic substructures (20m25s).
  • Three problems are being revisited: AlphaTensor, plasma control, and COVID testing, and students are asked to think about how to formulate them given what they know now (20m41s).
  • Plasma control is a control problem where the goal is to manipulate and control plasma to achieve different configurations, and a control policy is needed to achieve this (20m59s).
  • COVID testing is a problem where the goal is to figure out who to test given finite resources, and the problem is to determine who to test to better understand who might be sick and restrict the spread of COVID (21m17s).
  • Students are asked to think about what type of problem each domain is, the setting, the states, actions, and rewards, and what algorithms they would use to tackle the problem (22m15s); a minimal formulation scaffold appears after this list.
  • The problems may have issues with distribution shift, and it's essential to consider whether to be conservative with respect to the results being generalized (26m10s).
  • The discussion involves considering whether distribution shift is a concern in certain cases, and whether there could be unsafe states or overly risky situations, with a focus on exploring and understanding the implications of these scenarios (26m20s).
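
As a scaffold for the formulation exercise above, here is a minimal sketch of the ingredients to pin down for each domain; the field names and the example entries for COVID testing are illustrative guesses, not the instructors' answers.

```python
from dataclasses import dataclass, field

@dataclass
class ProblemFormulation:
    """Checklist for casting a domain as an RL problem."""
    setting: str          # e.g. "online", "offline", "batch with delayed feedback"
    horizon: str          # e.g. "bandit (single step)" or "multi-step MDP"
    state: str            # what information the agent conditions on
    action: str           # what the agent can choose
    reward: str           # the signal being optimized (watch for proxy vs. true outcome)
    concerns: list = field(default_factory=list)  # e.g. distribution shift, unsafe states

# Hypothetical example for the COVID border-testing domain:
covid_testing = ProblemFormulation(
    setting="batch bandit with ~24h delayed outcomes and a limited test budget",
    horizon="repeated bandit",
    state="traveler features / arrival information",
    action="which arrivals to test under the budget",
    reward="infections detected (a proxy for limiting spread)",
    concerns=["distribution shift over time", "fairness constraints"],
)
```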

Student Project Discussions and Problem Formulations

  • The conversation shifts to a group discussion among students who worked on different problems, including plasma control and AlphaTensor, to compare their answers and formulations (26m40s).
  • One student mentions that they considered the total number of test cases for a country as a possible solution, but notes that this may not be the best approach unless it can be proxied to be closer to an offline setting (30m5s).
  • The discussion highlights the importance of exploring and understanding the distinction between online and offline settings, with a focus on the batch setting with delay, where decisions must be made without immediate feedback (31m21s).
  • The conversation touches on the idea of using reinforcement learning with policy networks, as well as Monte Carlo tree search, to solve complex problems such as AlphaTensor (34m31s).
  • AlphaTensor is described as a multi-step reinforcement learning problem, where the goal is to learn an algorithm that correctly solves matrix multiplication through a series of steps (35m15s).
  • The problem involves considering different operations, such as products and sums, and learning to refactor them in a way that can be applied to different matrix multiplication problems (35m47s).
  • The ultimate goal of AlphaTensor is to learn a general algorithm for matrix multiplication, rather than just learning to multiply two specific matrices (36m1s).
  • The problem involves finding the minimum number of steps to achieve the goal, with the reward reflecting the number of steps and computations used while requiring correctness of the algorithm (36m7s); a toy sketch of this formulation appears after this list.
  • The goal is to search within the space of correct algorithms, and there are nice properties in this particular problem that allow for this (36m19s).
  • Given the ability to verify correctness, the approach is to optimize for length, using a neural network with both a policy head and a value head, similar to AlphaZero (36m52s).
  • The approach uses forward search, policy networks, and value networks to find the best possible algorithm, without doing additional Monte Carlo tree search at runtime (36m59s).
  • Unlike AlphaGo, the algorithm does not continue to do Monte Carlo tree search at runtime, as the assumption is that the best algorithm has been found and will be applied (37m26s).
  • The approach combines Monte Carlo tree search, policy networks, and value networks, with a single neural network sharing representations (37m39s).
  • The algorithm is designed to overcome distribution shifts, as all searched algorithms are correct, and there is no distribution shift in correctness (38m15s).
  • The algorithm may not find the most optimal solution, but it will always find a correct one, and there is no problem with deploying it on new matrix multiplication problems (38m49s).
  • The cleverness of the approach lies in having the policy network search within the space of correct algorithms, ensuring correctness (39m0s).
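
A toy sketch of the formulation described above (DeepMind's "TensorGame"): the state is the residual matrix-multiplication tensor, each action subtracts a rank-1 term, the reward is -1 per step so shorter algorithms score higher, and reaching the zero tensor certifies a correct algorithm. The dimensions, step cap, and class names here are illustrative.

```python
import numpy as np

def matmul_tensor(n=2):
    """The n x n matrix-multiplication tensor T: C = A @ B corresponds to
    c_{ij} = sum over i,j,k entries where T[i*n+k, k*n+j, i*n+j] = 1."""
    T = np.zeros((n * n, n * n, n * n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                T[i * n + k, k * n + j, i * n + j] = 1.0
    return T

class TensorGame:
    """State: residual tensor. Action: a rank-1 update (u, v, w). Reward: -1 per step."""

    def __init__(self, n=2, max_steps=12):
        self.residual = matmul_tensor(n)
        self.max_steps = max_steps
        self.steps = 0

    def step(self, u, v, w):
        # Subtract the rank-1 tensor (outer product of u, v, w) from the residual.
        self.residual = self.residual - np.einsum("i,j,k->ijk", u, v, w)
        self.steps += 1
        done = np.allclose(self.residual, 0) or self.steps >= self.max_steps
        # Any trajectory that reaches the zero tensor is a *correct* algorithm, so
        # minimizing the number of steps searches within the space of correct algorithms.
        return self.residual, -1.0, done
```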

Multi-Step Reinforcement Learning Problems and Applications

  • There is no high-level insight that can be translated to other problems, and it would be interesting to revisit the paper to see if there are any insights that were missed (39m33s).
  • Researchers relearned known algorithms and discovered new ones for solving a problem, highlighting the potential for utility functions to provide a set of solutions for downstream use, allowing people to pick the best ones (39m50s).
  • A multi-step reinforcement learning problem involves finding the operations needed to solve a system, with the state being the operations done so far and the reward being the length of the solution (40m14s).
  • Learning plasma control for fusion science requires an offline phase due to safety concerns, and a simulator can be used to address this issue, allowing for optimization and model-based reinforcement learning (40m45s).
  • The architecture used in this case involves an actor-critic method, with a control policy and a critic learned simultaneously, and a simulator is used to represent the problem and provide a model for the offline phase (41m18s).
  • The objective in this problem is not just to minimize computations but to manipulate plasma into specific configurations, requiring careful consideration of the reward function and the ability to quickly learn policies (41m28s).
  • A high-fidelity simulator is used to address the offline safety issue, allowing for optimization and model-based reinforcement learning, and the simulator is constructed from a physics model rather than data (42m7s).
  • The use of simulators is of interest to mechanical engineers, as they can be used to model computationally expensive physical processes, and this case highlights the potential for simulators in reinforcement learning (41m58s).
  • The actor-critic method used is related to, but not exactly the same as, methods previously discussed, and is called MPO (maximum a posteriori policy optimization) (42m58s).
  • During simulated training, a complicated critic can be used, but at deployment the policy has to run in real time, requiring a fast and computationally efficient actor (43m15s).
  • The actor-critic architecture allows for a low-dimensional actor with a small network, while the critic can be a complicated network with many parameters specifying the value function (43m48s); a minimal sketch of this asymmetric setup follows this list.
  • The actor is trained to find a good point in policy space using the complicated critic, with the representation of the control policy (the actor) restricted so it can run on TCV with real-time guarantees (44m32s).
  • The critic is unrestricted, allowing for a nice asymmetry between computational efficiency and offline affordances. (44m17s)
  • To address the problem of translating from offline to online, safety guarantees are implemented by defining areas that could cause bad outcomes and putting them inside the reward function to lead to a policy that veers away from those areas. (45m4s)
  • This approach is similar to the idea of pessimism over uncertain places, whether due to data or simulator inaccuracies, and is a common idea in robotics and other fields, used by researchers such as Claire Tomlin. (46m30s)
  • Reward hacking and sensitive rewards can be challenging in reinforcement learning due to sparsity or known problems in simulators, and avoiding these issues may require conservative approaches or double checks to ensure constraints are not violated (46m36s).
  • To address these challenges, it's essential to verify that the system will not reach unsafe regions, and there are methods available to achieve this verification, although most of the discussed approaches do not involve verification (47m37s).
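
A minimal sketch of the asymmetric actor-critic idea described above: a small, fast actor that could meet real-time constraints at deployment, trained against a much larger critic that only ever runs in simulation. The layer sizes are illustrative, not those of the actual TCV controller.

```python
import torch
import torch.nn as nn

class SmallActor(nn.Module):
    """Compact policy network: must be cheap enough to run under real-time limits."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # bounded control outputs
        )

    def forward(self, obs):
        return self.net(obs)

class LargeCritic(nn.Module):
    """High-capacity value network: only used during (simulated) training,
    so it can afford many more parameters than the deployed actor."""
    def __init__(self, obs_dim, act_dim, hidden=1024, depth=4):
        super().__init__()
        layers, dim = [], obs_dim + act_dim
        for _ in range(depth):
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        layers += [nn.Linear(dim, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))
```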

COVID-19 Border Testing as a Reinforcement Learning Problem

  • Efficient and targeted COVID-19 border testing is a multi-step reinforcement learning problem that can be viewed as a repeated bandit problem with delayed outcomes, where the goal is to optimize testing policies based on limited information and constraints (47m51s).
  • The COVID-19 border testing problem involves a batch bandit with delayed outcomes, where people arrive with some prior information, and the policy must decide who to test, considering limited test capacity and potential fairness constraints (48m33s).
  • The delayed-outcome structure means that algorithms like Thompson sampling may be helpful, and the presence of multiple constraints, such as test capacity and fairness, adds complexity to the problem (49m9s); a minimal batched Thompson sampling sketch follows this list.
  • The constraints in the COVID-19 border testing problem can be thought of as restricting the policy class, and addressing these constraints is crucial to developing an effective testing strategy (49m51s).
  • The problem also involves updating the policy based on the results of previous tests, which are received 24 hours later, and using this information to inform future testing decisions (48m58s).
  • The interaction problem in this case is interesting due to its budgeted nature, which means that outcomes are coupled in a way that they might not be if there were no budget constraints (49m57s).
  • The data that can be observed is also affected by the interaction, as the data available for one test or action may not be available for others (50m13s).
  • Defining the reward is crucial, and it's often challenging because the immediate rewards that can be used to optimize a policy may be different from the downstream outcomes that are truly important (50m30s).
  • This challenge of short-term outcomes versus long-term rewards is common in many areas, including advertising, and companies like Netflix and Spotify face this issue when making policy decisions (51m10s).
  • The paper being discussed argues that relying on lagged information can lead to suboptimal decisions, and shorter-term outcomes are necessary for effective decision-making (51m34s).
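
A minimal sketch of the batched, delayed-feedback Thompson sampling idea mentioned above, with a Beta-Bernoulli posterior per arm (e.g., per traveler type) that is only updated when a batch's delayed results arrive; the arm structure, budget, and one-round delay are illustrative simplifications of the actual border-testing system.

```python
import numpy as np

rng = np.random.default_rng(0)

class BatchedThompson:
    """Beta-Bernoulli Thompson sampling where outcomes arrive one batch late."""

    def __init__(self, n_arms):
        self.alpha = np.ones(n_arms)  # pseudo-counts of positive outcomes
        self.beta = np.ones(n_arms)   # pseudo-counts of negative outcomes

    def choose_batch(self, budget):
        """Allocate a limited test budget by sampling once per test from the posterior."""
        samples = rng.beta(self.alpha, self.beta, size=(budget, len(self.alpha)))
        return samples.argmax(axis=1)  # arm (e.g. traveler type) chosen for each test

    def update(self, arms, outcomes):
        """Called only when the delayed (e.g. ~24h later) results come back."""
        for a, r in zip(arms, outcomes):
            self.alpha[a] += r
            self.beta[a] += 1 - r

policy = BatchedThompson(n_arms=5)
pending = None
for t in range(3):
    if pending is not None:          # last round's delayed results arrive now
        policy.update(*pending)
    arms = policy.choose_batch(budget=10)
    outcomes = rng.binomial(1, 0.1, size=len(arms))  # stand-in for lab results
    pending = (arms, outcomes)       # not observed until the next round
```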

Key Characteristics and Settings of Reinforcement Learning

  • The class has covered a range of topics, including supervised training of policies, directly eliciting preferences from humans, and DPO, all of which can be applied to training large language models (52m0s).
  • The key characteristics of reinforcement learning include learning directly from data, optimization, delayed consequences, exploration, and generalization (52m32s).
  • A crucial aspect of reinforcement learning is that actions can impact the data distribution, including the rewards observed and the states that can be reached (52m50s).
  • Reinforcement learning is different from supervised or unsupervised learning due to the dynamic nature of the data, which presents both opportunities and challenges, particularly in terms of distribution shift (53m1s).
  • Standard settings in reinforcement learning include Bandits, where the next state is independent of the prior state and action, and general decision processes, where the next state may depend on all previous actions and states (53m25s).
  • Online and offline settings are also common, where either historical data is used to learn better policies or new data is actively gathered, with many real-world settings falling between these two extremes (53m44s).
  • Experimental design is used to gather a small amount of new online data to learn a good decision policy, often in combination with a large pool of offline data (54m12s).
  • Core ideas in reinforcement learning include function approximation, off-policy learning, and the challenges of generalization and extrapolation when combining online and offline data (54m31s).
  • Function approximation is necessary for handling complex problems and off-policy learning, but it is a hard problem due to the potential for data distribution shift when using a new policy (54m51s).
  • Off-policy learning involves using data generated by one decision policy to learn about another policy, a challenging problem that many reinforcement learning papers aim to address (55m5s); a minimal importance-sampling sketch follows this list.
  • The goal of using offline data is to be data-efficient, but this can be difficult due to the potential for generalization or extrapolation errors when combining online and offline data (55m44s).
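
A minimal sketch of the off-policy idea described above: using data logged by a behavior policy to estimate the value of a different evaluation policy via importance sampling. The toy two-action, single-step setup is illustrative.

```python
import numpy as np

def importance_sampling_estimate(actions, rewards, behavior_probs, eval_probs):
    """Estimate E[reward] under the evaluation policy from behavior-policy data.

    actions        : actions actually taken by the behavior policy
    rewards        : rewards observed for those actions
    behavior_probs : pi_b(a|s) for each logged action
    eval_probs     : pi_e(a|s) for each logged action
    """
    weights = eval_probs / behavior_probs   # importance ratios
    return np.mean(weights * rewards)

# Toy example: behavior policy is uniform over 2 actions; the evaluation policy
# prefers action 1 (probability 0.9). Action 1 pays reward 1, action 0 pays 0.
rng = np.random.default_rng(1)
actions = rng.integers(0, 2, size=10_000)
rewards = actions.astype(float)
behavior_probs = np.full(actions.shape, 0.5)
eval_probs = np.where(actions == 1, 0.9, 0.1)
print(importance_sampling_estimate(actions, rewards, behavior_probs, eval_probs))
# ~ 0.9, the evaluation policy's true expected reward, without ever running it.
```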

Challenges and Approaches in Reinforcement Learning

  • Chelsea Finn's work highlights the importance of considering common themes and challenges in reinforcement learning, including the need to mitigate the risks of generalization and extrapolation errors (54m40s).
  • The extrapolation problem in reinforcement learning can be mitigated by techniques such as clipping, which limits how much the policy can change in a single update and so prevents overly optimistic, extrapolation-driven steps (56m31s).
  • In the DAgger case, the extrapolation problem was addressed by collecting more expert labels, especially in states where the learned policy differs from the expert policy (56m42s).
  • Pessimistic Q-learning methods, such as CQL from Berkeley and MPO from Stanford, introduce pessimism into offline RL to limit the extrapolation problem (57m2s); a minimal sketch of a CQL-style penalty follows this list.
  • These methods are not the only solutions to the extrapolation problem, but rather inspire further thinking on how to address this issue, which is present throughout reinforcement learning (57m21s).
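
A minimal sketch of a CQL-style conservatism term, assuming a discrete-action Q-network: Q-values over all actions are pushed down (via a log-sum-exp) while Q-values on actions actually present in the offline dataset are pushed up, discouraging optimistic extrapolation to unseen actions. This is a simplified illustration, not the full CQL algorithm.

```python
import torch

def cql_penalty(q_values, dataset_actions):
    """Conservative regularizer added to a standard TD loss in offline RL.

    q_values        : tensor [batch, n_actions] of Q(s, a) for all actions
    dataset_actions : tensor [batch] of the actions actually logged in the data
    """
    # Soft maximum over all actions (what an optimistic learner would exploit) ...
    logsumexp_q = torch.logsumexp(q_values, dim=1)
    # ... minus the Q-values of in-distribution actions from the dataset.
    data_q = q_values.gather(1, dataset_actions.unsqueeze(1)).squeeze(1)
    return (logsumexp_q - data_q).mean()

# The total offline loss would look roughly like: td_loss + alpha * cql_penalty(...),
# where alpha trades off pessimism against fitting the Bellman target.
```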

Models, Values, and Policies in Reinforcement Learning

  • Models, values, and policies are core objects in reinforcement learning, and each has its own use cases, with models being particularly useful for representing uncertainty (57m42s).
  • Representing uncertainty is often easier with models, as they are prediction problems that can leverage tools from supervised learning and statistics, unlike policies and Q-functions, which involve planning and decision-making (57m56s).
  • Models can represent uncertainty about how the world works, whereas policies and Q-functions involve joint uncertainty about the world and the best actions to take (59m4s).
  • Policy uncertainty combines uncertainty about the world and the best actions, making it a more complex problem than modeling the world, which has a single source of uncertainty (59m17s).
  • There are ways to directly represent uncertainty over policies and Q-functions, but models provide a more straightforward prediction problem (59m31s).
  • Representing uncertainty in reinforcement learning (RL) can be easier when done at the level of the model rather than propagating it through the policy and value functions, although there is no free lunch and some trade-offs are involved (1h0m4s).
  • Models can be useful for things like Monte Carlo tree search, simulators, and domains such as plasma control, and can help when reasoning about risky domains or when data efficiency matters (1h0m21s).
  • The Q-function is a central object in RL, summarizing the performance of a policy and allowing actions to be chosen directly by taking an argmax over actions (1h0m34s); a minimal sketch follows this list.
  • Policies are ultimately what is desired in RL, as they enable good decision-making, and the Q function can help in understanding how good a policy is (1h0m50s).
  • There is a trade-off between computation and data efficiency in RL, and in some cases, they are the same, such as when using a simulator, where data is equivalent to computation (1h1m10s).
  • In some situations, data is limited, and being data-efficient requires trading off for computational cost, leading to the use of more computationally intensive methods (1h2m11s).
  • Real-world constraints, such as in plasma, self-driving cars, or robotics, can require fast computation, and there may be hidden actions or default actions that occur if decisions are not made quickly enough (1h2m28s).
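
A minimal sketch of the point made above that a Q-function yields a policy directly via an argmax over actions, shown here for a small tabular, discrete-action case; the names and numbers are illustrative.

```python
import numpy as np

def greedy_policy(q_table):
    """Extract the greedy policy pi(s) = argmax_a Q(s, a) from a tabular Q-function."""
    return np.argmax(q_table, axis=1)

def greedy_value(q_table):
    """The Q-function also summarizes how good acting greedily would be: V(s) = max_a Q(s, a)."""
    return np.max(q_table, axis=1)

# Toy example: 3 states, 2 actions.
Q = np.array([[0.2, 0.8],
              [1.5, 0.3],
              [0.0, 0.0]])
print(greedy_policy(Q))   # -> [1 0 0]
print(greedy_value(Q))    # -> [0.8 1.5 0.0]
```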

Open Challenges and Future Directions in Reinforcement Learning

  • Open challenges in RL include developing off-the-shelf, robust, and reliable methods, as many current algorithms have hyperparameters that need to be picked, such as the learning rate (1h3m11s).
  • Challenges in reinforcement learning include optimizing hyperparameters in real-world settings where only one trajectory or deployment is available, necessitating automatic hyperparameter tuning and model selection (1h3m27s).
  • There is a need for robust methods, including model selection and guarantees that performance will not suddenly degrade (1h3m58s).
  • Reinforcement learning often requires balancing data and computation efficiency, and it would be beneficial to have methods that allow practitioners to trade off between these two factors (1h4m13s).
  • The hybrid offline-online case is also important, as many organizations may be willing to collect additional data but not engage in fully online learning (1h4m44s).
  • The Markov decision process formulation, which is commonly used in reinforcement learning, may not be the best way to solve data-driven decision-making problems (1h5m6s).
  • Alternative formulations, such as multi-agent settings or partially observable Markov decision processes, may be more efficient or effective in certain situations (1h5m21s).
  • Historically, reinforcement learning has focused on learning from a single task from scratch, but humans often build on prior experience and learn across multiple tasks (1h6m2s).
  • Learning across multiple tasks, as seen in generative AI and large language models, may be a powerful approach in reinforcement learning (1h6m11s).
  • Shared representations, as seen in AlphaZero and AlphaTensor, can have huge benefits in reinforcement learning (1h6m26s).
  • Alternative forms of reinforcement learning, such as those that incorporate prior experience or multiple tasks, may be productive ways to accelerate decision-making and learning (1h6m38s).
  • Feedback in reinforcement learning can be more than just scalar rewards, and with the help of large language models, richer feedback such as preference pairs, thumbs up or thumbs down, or detailed examples can be used, offering a more significant opportunity for exploration (1h6m42s).
  • Most of the class has focused on stochastic settings, but real-world settings often involve other stakeholders or multi-agents that can be adversarial or cooperative, making these settings important to consider (1h7m9s).
  • The class has integrated learning, planning, and decision-making, but there are approximations to this, such as system identification, which involves learning the dynamics model and then planning, offering flexibility but also introducing complexity (1h7m46s).
  • There is enormous room for improvement in data-driven decision-making in domains that could benefit, and many areas where society could benefit from better decision-making, making it an exciting area for impact (1h8m20s).

Stanford Resources and Further Learning Opportunities

  • Stanford offers various classes and resources for those interested in reinforcement learning, including classes on deep RL, decision-making under uncertainty, and advanced courses on decision-making and RL (1h8m53s).
  • The course has covered various application areas, but there are many more areas where reinforcement learning can be applied, and students are well-equipped to start answering these questions and making an impact (1h8m43s).
