Stanford CS234 | Guest Lecture on DPO: Rafael Rafailov, Archit Sharma, Eric Mitchell | Lecture 9
01 Nov 2024
Logistical Announcements and Review
- The lecture begins with logistical announcements, including details about an upcoming midterm exam scheduled for Wednesday, where students are allowed to bring one side of a normal sheet of paper with notes. Solutions for Homework 2 will be released by the end of the day, although it will not be graded in time for the midterm due to late submissions. (5s)
- A quick review of previous material is conducted, focusing on the Bradley-Terry model, which expresses the probability of selecting one option over another. It is confirmed that in RLHF (Reinforcement Learning from Human Feedback), the model is not updated after each policy rollout. (2m32s)
- The discussion includes a clarification on the impact of multiplying the reward by a negative constant, which flips the preference ordering and so does not preserve preferences, while shifting by a constant does preserve them (see the short math note at the end of this section). (4m35s)
- The lecture revisits topics from the previous session, such as maximum entropy inverse reinforcement learning and RLHF, and discusses the use of the Bradley-Terry model for Markov decision processes. (5m3s)
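As a short reference for the invariance point above, here is the Bradley-Terry model and why shifting rewards by a constant preserves preferences while negating them flips them (standard notation, not a formula copied from the slides):

```latex
% Bradley-Terry: probability that response y_1 is preferred over y_2
P(y_1 \succ y_2) \;=\; \frac{\exp(r(y_1))}{\exp(r(y_1)) + \exp(r(y_2))} \;=\; \sigma\big(r(y_1) - r(y_2)\big)

% Shifting every reward by a constant c leaves the difference, hence the preference, unchanged:
\sigma\big((r(y_1) + c) - (r(y_2) + c)\big) \;=\; \sigma\big(r(y_1) - r(y_2)\big)

% Negating the rewards flips the sign of the difference and therefore the preference:
\sigma\big((-r(y_1)) - (-r(y_2))\big) \;=\; 1 - \sigma\big(r(y_1) - r(y_2)\big)
```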
Imitation Learning and Preference Learning
- The concept of imitation learning is contrasted with pairwise preference learning, which serves as an intermediary between dense reward labeling and simple demonstrations. This approach has motivated research in preference learning, including learning parameters of the Bradley Terry model. (5m23s)
- The lecture covers how to fit preferences by maximizing likelihood with a cross-entropy loss and applies these concepts to both trajectories and bandit-like problems with finite actions. Homework 3 will involve implementing DPO (Direct Preference Optimization) and RLHF for Markov decision processes. (6m14s)
ChatGPT Pipeline and Reward Model Evaluation
- The discussion begins with an exploration of using rollouts from MuJoCo-like problems and transitions into a brief overview of the process from learning reward models to developing systems like ChatGPT. This is based on Tatsu Hashimoto's lecture notes from an NLP class. (6m30s)
- The pipeline for ChatGPT involves collecting demonstration data, comparison data, and then optimizing a policy. The process includes generating pairwise preferences or full rankings to learn a reward model. (6m58s)
- An example is given involving language preferences, where different scenarios are ranked based on their outcomes, such as the impact of earthquakes in San Francisco. This illustrates how context can be seen as a prompt and outputs as various responses that people rank. (7m25s)
- Before moving on to policy optimization (e.g., with PPO), it is important to assess the quality of the reward model. How well the underlying reward is captured depends on the amount of data and the model's capacity; reward models similar in size to large language models are often used. (7m59s)
- Validation accuracy is crucial: with sufficient data and a large enough model, complex reward models can be captured (a minimal accuracy check is sketched at the end of this section). This is relevant for projects and homework, where understanding the capacity needed to capture human preferences is essential. (8m42s)
- The pipeline requires a significant amount of preference data, though not as much as needed for training a large language model. There is ongoing research on reducing the amount of online preference data required for training. (9m5s)
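As a concrete illustration of the validation-accuracy check mentioned above, a minimal sketch (the `reward_model(prompt, response)` interface and the pair format are placeholders for illustration, not the lecture's code): accuracy on held-out preference pairs is simply the fraction of pairs where the learned reward scores the chosen response above the rejected one.

```python
def preference_accuracy(reward_model, held_out_pairs):
    """Fraction of held-out preference pairs where the learned reward
    scores the chosen response above the rejected one."""
    correct, total = 0, 0
    for prompt, chosen, rejected in held_out_pairs:
        correct += float(reward_model(prompt, chosen)) > float(reward_model(prompt, rejected))
        total += 1
    return correct / max(total, 1)
```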
Reinforcement Learning from Human Feedback (RLHF)
- In reinforcement learning from human feedback, once a reward model is learned, it is used to drive policy optimization, a stage that relies on newly generated rollouts rather than only the historical data collected so far. (9m36s)
- A reference policy, typically obtained via behavior cloning or supervised fine-tuning, is needed to regularize policy optimization and prevent the learned policy from deviating too far from it. (9m55s)
- Leveraging a learned reward with reinforcement learning from human feedback (RLHF) has shown substantial gains over previous approaches, even when the model size is fixed, as demonstrated by the success of ChatGPT. (10m23s)
- The focus is on training large language models to perform a wide range of tasks, akin to meta or multitask reinforcement learning, rather than training an agent for a single task (10m56s).
- The "best of n" approach is an alternative to reinforcement learning: multiple samples are generated and the reward model selects the best one (sketched at the end of this section). This method performs well relative to PPO, though generally not quite as well. (11m45s)
- The area of training and refining models using reward models and off-the-shelf large language models (LLMs) is active and ongoing, with efforts to determine the best methods (12m34s).
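A minimal sketch of the best-of-n idea described above (the `policy.generate` and `reward_model` interfaces are assumed for illustration, not a specific library's API):

```python
def best_of_n(policy, reward_model, prompt, n=16):
    """Sample n candidate responses from the policy and return the one
    the reward model scores highest (the best-of-n baseline)."""
    candidates = [policy.generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: float(reward_model(prompt, response)))
```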
Direct Preference Optimization (DPO)
- Direct Preference Optimization (DPO) is highlighted as an exciting alternative to RLHF, having received recognition as an outstanding paper runner-up at the NeurIPS conference, and it has significantly impacted the LLM community. (13m14s)
- The discussion focuses on reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), with a background setup to ensure understanding of these concepts. (13m38s)
- The significance of reinforcement learning (RL) on language models is highlighted, noting that it has been practiced for a long time but gained prominence with the advent of models like ChatGPT, which demonstrated effective results. (14m48s)
Three-Stage Pipeline for Language Models
- The success of RL in language models is attributed to starting with strong pre-trained models that possess pre-learned skills, allowing for effective fine-tuning rather than starting from scratch. (15m14s)
- A three-stage pipeline popularized by ChatGPT is described, beginning with unsupervised pre-training to develop a generative model from extensive text data, followed by supervised fine-tuning using human demonstrations to create a reference policy. (15m34s)
- The second stage involves learning a reward model by collecting preference data, where responses are sampled from the supervised fine-tuned model, and human feedback is used to guide the learning process. (16m35s)
- The process involves three main stages: obtaining ranking annotations from a supervised fine-tuned model, using these preferences to learn a reward model, and performing policy learning to fine-tune the model to generate high-reward responses. (16m50s)
Supervised Fine-tuning and Preference Data Collection
- The first step, supervised fine-tuning, is straightforward and involves generating a dataset of prompts and responses, where binary preferences are given over responses. (17m15s)
- It is often more effective to have more prompts with fewer responses per prompt, as returns can plateau quickly with more responses. (17m53s)
- Preferences over responses are used instead of direct reward annotations because humans may not be calibrated to each other in terms of absolute rewards, and it is cognitively easier to compare responses rather than assign absolute scores. (18m1s)
- Gathering preferences is seen as a way to obtain higher quality annotation information with less cognitive effort from human labelers. (19m12s)
- The Bradley-Terry model is used to relate a scoring function or reward function to a probabilistic decision over discrete choices, with the choices being the preferred and non-preferred responses in the dataset. (19m29s)
- A probabilistic model is trained to maximize the likelihood of observed data using a scoring function related to choices, specifically employing the Bradley-Terry model for maximum likelihood estimation. This approach is treated as a binary classification problem where the logit is the difference in reward between the chosen and rejected responses. (19m54s)
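A minimal sketch of that binary-classification view of reward modeling (a `reward_model(prompts, responses)` call returning one scalar score per example is assumed for illustration; this is not the lecture's code):

```python
import torch.nn.functional as F

def reward_modeling_loss(reward_model, prompts, chosen, rejected):
    """Bradley-Terry maximum-likelihood loss for a batch of preference pairs."""
    r_chosen = reward_model(prompts, chosen)      # shape: (batch,)
    r_rejected = reward_model(prompts, rejected)  # shape: (batch,)
    # Binary classification: the logit is the reward difference between the
    # chosen and rejected responses; minimize -log(sigmoid(logit)).
    logits = r_chosen - r_rejected
    return F.softplus(-logits).mean()  # numerically stable -log(sigmoid(x))
```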
Policy Optimization and KL Penalty
- After developing the reward model, the next step involves finding a policy that optimizes this reward, which is a part of reinforcement learning. The policy, denoted as Pi Theta, is fine-tuned to achieve high rewards for responses sampled from it, based on a dataset of prompts or conversation histories. (20m44s)
- There is a concern that optimizing solely for maximum reward may not generalize well outside the distribution of the training data. To address this, a KL penalty is introduced to prevent the policy from drifting too far from the starting reference model, ensuring the reward model is only queried within the distribution it was trained on (the objective is written out at the end of this section). (21m44s)
- The reference model, which is fixed in the canonical version of RLHF (Reinforcement Learning from Human Feedback), can in principle be updated over time. Various methods have been proposed to update the reference model and select response pairs for human preference evaluation, although the original version used a fixed model. (22m34s)
- In practice, two responses are typically sampled from a reference model to obtain a preference, which is then used to train a reward model. This process involves ensuring good coverage over the state-action space, which includes conversational history and responses, to achieve meaningful rewards when updating the policy. A diverse preference dataset is beneficial, balancing high-quality data with a wide variety to avoid overestimating rewards for poor responses. (23m14s)
- Even with a limited dataset, a large enough network trained to near-zero error on the training data can generalize well on test data. However, there are limits to this generalization due to the finite information content in a dataset. For example, a dataset focused on preferences about pets cannot provide insights into unrelated topics like quantum field theory, regardless of model size. (24m27s)
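For reference, the KL-penalized objective described above, in the notation commonly used for this pipeline (with β trading off reward against the KL term):

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \,\big]
```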
From Proximal Policy Optimization (PPO) to DPO
- The complexity of policy learning with PPO (Proximal Policy Optimization) motivated the development of Direct Preference Optimization (DPO). PPO was found to be challenging to implement effectively for specific research problems, leading to the creation of DPO to address these issues. (25m30s)
- The discussion transitions to Archit Sharma, who is set to provide an overview of DPO, following the background on reward model training and the challenges associated with PPO. (26m5s)
- The discussion explores the concept of using a reward function to capture what humans like and dislike, parameterized as a separate network that scores answers as good or bad. The goal is to have the language model itself assign higher probability to the responses humans prefer, creating a mapping between the language model and the reward model. This approach of directly producing a distribution of responses aligned with human preferences is known as direct preference optimization. (26m40s)
- The objective is to maximize the expected reward over completions subject to a KL constraint toward the reference distribution. The math applies to any reward function, typically a learned one, and the problem has a closed-form solution. This solution is related to the Boltzmann distribution, which reweights the reference distribution by the exponentiated reward, giving higher probability to higher-reward responses. (27m42s)
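The closed-form solution referred to above, in standard notation (Z(x) is the partition function discussed in the next section):

```latex
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
\exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),
\qquad
Z(x) \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)
```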
DPO and the Partition Function Challenge
- The process involves normalizing the distribution using a partition function, which sums over every possible completion for a given question. However, computing this partition function is intractable due to the complexity of evaluating every sentence and its probability. The partition function is defined as the sum over all responses, weighted by the exponentiated reward and a temperature term that balances the reward and the KL constraint. (28m59s)
- A relationship is established between the optimal policy π* and the reward r: with some algebra, the reward can be rewritten in terms of the optimal policy itself as a β-scaled log ratio between π* and the reference distribution, plus a term involving the partition function that is difficult to handle. (29m50s)
- The intuition behind this relationship is that if an optimal policy assigns a higher probability to a response than the reference distribution, the reward is higher, and vice versa. This aligns with the idea that preferred responses should have higher probabilities and rewards. (30m18s)
- The main challenge is the intractability of the partition function, which complicates the process. The approach involves using a loss function on reward functions and transforming it to obtain a loss function on the policies. (30m50s)
- The Bradley-Terry logit is the difference between the rewards of the preferred and dispreferred responses, and because both responses share the same prompt, the partition function appears in both terms and cancels. Expressing the reward in terms of the policy and plugging it into the reward modeling loss therefore yields a loss directly on the policy, called the DPO loss function. (31m27s)
- The DPO loss function aims to maximize the difference between the log probabilities of preferred and dispreferred responses for a given question. This means increasing the log probability of the preferred response and decreasing that of the dispreferred response compared to the reference distribution. (32m15s)
- The cancellation of the log partition function works because the Bradley-Terry model is invariant to shifting rewards by a per-prompt constant, a property leveraged here; the resulting loss is written out and sketched in code below. (32m56s)
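For reference, the reparameterization and the resulting loss in standard DPO notation (reproduced here for convenience, not quoted from the lecture):

```latex
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
\;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
\;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]
```

A minimal sketch of this loss in code, assuming per-sequence log probabilities have already been computed for each pair (not the authors' reference implementation):

```python
import torch.nn.functional as F

def dpo_loss(policy_logps_chosen, policy_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """DPO loss over a batch of preference pairs; each argument holds the
    summed per-token log probability of a response under the trainable
    policy or the frozen reference model."""
    chosen_logratio = policy_logps_chosen - ref_logps_chosen
    rejected_logratio = policy_logps_rejected - ref_logps_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    return F.softplus(-logits).mean()  # numerically stable -log(sigmoid(x))
```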
DPO Experiments and Pareto Curves
- An initial control experiment was conducted using the IMDb reviews dataset to train a model to generate positive movie reviews. A pre-trained sentiment classifier was used as a reward function, and synthetic preferences were created by ranking data generated from a base model. The goal was to evaluate the effectiveness of DPO as an optimizer for the core objective, focusing on the reward versus KL tradeoff. (33m24s)
- The concept of a Pareto curve, which is used to analyze tradeoffs between different factors, was applied to assess the optimal tradeoff between reward and KL divergence. DPO was found to achieve a better tradeoff compared to other methods, as it could provide more reward for the same KL value. (34m27s)
- Extensive experimentation with baselines was conducted, and it was noted that while other methods like PPO showed some results, they could not match the tradeoff achieved by the DPO objective. The importance of plotting Pareto curves in research papers was emphasized, as it provides a clearer understanding of optimization performance beyond simple win rates or comparisons (a sketch of the KL estimate used for such plots follows this section). (34m59s)
- The discussion highlighted that many research papers might be evaluating optimization incorrectly by not considering the full tradeoff curve. It was suggested that the community should focus more on this aspect to better understand optimization effectiveness. (35m47s)
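One way such reward-versus-KL curves are produced is by estimating the sequence-level KL from samples of the trained policy; a minimal sketch under the assumption that summed log probabilities are available from both models (not the paper's exact evaluation code):

```python
import torch

@torch.no_grad()
def estimate_kl(policy_logps, ref_logps):
    """Monte-Carlo estimate of KL(pi_theta || pi_ref): the mean log ratio
    over completions sampled from the current policy is an unbiased
    estimate of the sequence-level KL divergence."""
    return (policy_logps - ref_logps).mean()
```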
Signal-to-Noise Ratio and Sampling
- A key insight was that sampling more answers per prompt improved performance, addressing the variance problem in the RLHF setting. This was noted as a significant factor in achieving better results, despite being a relatively simple adjustment. (36m22s)
- The signal-to-noise ratio in standard PPO-based policy optimization is poor (quoted as roughly 40%), and combined with the rest of the pipeline the variance can explode, making the signal sparse and difficult to learn from. (37m0s)
- In reinforcement learning here, the goal is to maximize reward subject to a KL constraint. The graph discussed plots the reward achieved at different KL values for various baselines, the aim being Pareto optimality: maximizing reward for a given KL. (37m23s)
- Comparing different points on the graph, such as DPO and PPO, can be misleading if only win rates or rewards are considered without accounting for the KL value. (37m51s)
- The interpretability of the model and the ability to explain the process under noisy data conditions is complex and involves multiple lines of research, including noisy feedback and multimodal feedback. (38m10s)
- The poor signal-to-noise ratio in standard policy optimization is a complex issue, and part of the solution involves sampling more answers per prompt. (38m51s)
- In sentiment generation, the reward is based on the sentiment of the sentence, with a score of one indicating very good sentiment and zero indicating very bad sentiment. (39m15s)
- Choosing the right Kullback-Leibler (KL) divergence trade-off in a real model is data-dependent, and performance is often measured on other benchmarks. Typically, a lower KL is preferred to preserve performance on those benchmarks. (39m32s)
DPO Adoption and Effectiveness
- The paper includes various experiments demonstrating that DPO works effectively, and its widespread adoption is a testament to the algorithm's success. (40m21s)
- A few months ago, nine out of the top ten models on the Hugging Face leaderboard for open language models were trained using the DPO (Direct Preference Optimization) algorithm, indicating its popularity and effectiveness in the open-source community. (40m33s)
- The Mistral paper exclusively used DPO as its reinforcement learning from human feedback (RLHF) algorithm, and some Mistral models are competitive with GPT-4, demonstrating DPO's effectiveness at large scales. (40m55s)
- Llama 3 has incorporated DPO as part of its optimization pipeline, using it in combination with other methods, showing the algorithm's growing adoption. (41m13s)
- DPO can be derived as an inverse Q-learning algorithm in a Max Entropy Reinforcement Learning setting, as discussed in the paper titled "Your Language Model is Secretly a Q Function." (41m48s)
- DPO does not work for control problems under the classical formulation and requires a formulation of preferences under regret rather than reward functions. (42m12s)
DPO vs. PPO Debate and RewardBench
- There is an ongoing debate in the community and industry, particularly on Twitter, about DPO versus PPO (Proximal Policy Optimization), with questions about the effectiveness of DPO's implicit reward function compared to an explicitly parameterized reward function. (42m31s)
- The debate also involves whether the suboptimality in machine learning optimization induces a regularization effect that strengthens the model. (43m36s)
- RewardBench, a large-scale evaluation of reward models, has been introduced to tackle these questions, since a DPO model functions as both a generative model and a reward model. (43m50s)
- DPO models are evaluated on tasks related to chat, safety, and reasoning, showing that the top four models in these tasks are DPO models, which outperform larger proprietary models. (44m2s)
- In reasoning tasks, while a proprietary model from Cohere is the top performer, the next five models are DPO models, indicating their strong performance. (44m26s)
- The DPO implicit reward is considered to be as effective as a classically trained explicit reward model, with no loss of generality or capability from using an implicit model versus an explicitly parameterized one. (44m42s)
- There is a discussion on whether using a weaker optimizer like PPO provides better regularization, with feedback indicating that large-scale DPO models can become overly verbose and go off track. (44m54s)
- An analysis of datasets on summarization and dialogue shows a bias towards longer responses, with DPO models pushing the distribution of response lengths significantly outside the covered data set. (45m30s)
Reward Hacking and DPO
- The concept of reward hacking is introduced, referencing a famous OpenAI paper on scaling laws for reward model optimization, which involved training a reward model with human preferences and synthetic data. (46m21s)
- A graph is discussed that plots the proxy reward being optimized against the true reward, highlighting that while models keep increasing the proxy reward, their actual quality can degrade, illustrating the issue of reward hacking. (46m55s)
- The AI safety community is concerned about the potential for models to exploit reward functions, leading to unintended and potentially harmful outcomes, a phenomenon known as "reward hacking" (47m35s).
- Despite extensive research and over 200 citations on mitigating reward hacking in traditional reinforcement learning, the community has not fully recognized its occurrence in direct alignment methods, where models are optimized directly on the data without proxy reward functions or synthetic data (48m10s).
- Recent findings indicate that reward hacking is prominent in Direct Preference Optimization (DPO) and its variants like IPO and SLiC, potentially even more so than in Proximal Policy Optimization (PPO), because DPO works with an exact analytical form of the optimal solution. (48m33s)
- The research shows that as models are trained more extensively, their performance can decrease, contrary to theoretical expectations of monotonic improvement, highlighting the prevalence of reward hacking in these algorithms (49m58s).
- The discussion suggests that weaker optimizers, such as PPO, might offer more stability and be less susceptible to reward hacking compared to stronger optimizers like DPO (50m25s).
- The current state of research on these algorithms is seen as an exciting area for further exploration, with a focus on making reinforcement learning more robust against reward hacking (50m35s).
- Alignment algorithms are highly susceptible to hacking, indicating a need for increased robustness in direct learning algorithms. (51m0s)
Future Directions and Multimodal RL
- There is growing interest in online fine-tuning algorithms, focusing on efficiently eliciting preferences and fine-tuning models. (51m10s)
- There has been a significant expansion of reinforcement learning across various modalities, including vision-language models, diffusion models, and text-to-image and text-to-video work. (51m20s)
- Future areas of exploration include speech and music, with upcoming research on protein synthesis with feedback and robot safety in large-scale robotics foundation models. (51m34s)
- Efforts are being made to enable multi-turn interactions and develop agents for use in these models. (51m47s)
- The field is advancing, with a deeper understanding of its complexities emerging over time. (52m6s)
- Reward hacking is fundamentally a finite-data issue: models push the (implicit) reward ratios to extremes beyond what the observed data supports, leading to unintended outcomes. (52m30s)
- Even with different loss objectives, such as hinge or squared losses, reward hacking persists, highlighting the challenge of extrapolating beyond the observed data (see the note that follows). (53m12s)
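A rough sketch of why the hacking persists across these variants (a paraphrase of the published objectives, not text from the lecture): the methods mentioned differ mainly in the loss applied to the same implicit-reward margin, so they extrapolate similarly outside the data. Writing the margin as

```latex
h_\theta(x, y_w, y_l) \;=\;
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
\;-\;
\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)},
```

DPO applies the logistic loss -log σ(h), IPO a squared loss that pulls the margin toward a fixed target, and SLiC-style variants a hinge on the log-probability margin.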
Challenges in Ranking and Exploration
- A question was raised about ranking samples when none are particularly good, suggesting a need for a method to indicate confidence in rankings. (54m1s)
- The issue is likened to the exploration problem in reinforcement learning, where the absence of good trajectories poses challenges. (54m33s)
- The discussion addresses the challenge of learning without direct feedback, suggesting alternative forms of feedback such as comparative feedback, where options are compared, and using thumbs up or thumbs down to optimize responses. This is identified as an open problem worth exploring. (54m42s)
- There is a focus on the exploration problem in training, emphasizing the importance of gathering data for preference learning. It is noted that effective exploration during data collection is crucial, as it impacts the reward model's ability to correctly label trajectories. (55m11s)
- The concept of applying similar ideas to multi-step reward processes is discussed, where a reward is received after multiple steps. It is suggested that this can be viewed as a learning problem, with credit assignment occurring on intermediate steps, even without explicit bootstrapping. (56m4s)
Real vs. Synthetic Data and Reward Hacking
- The difference between real and synthetic data is explained in the context of training reward functions: real human data is used to train a reward function, which is then used to rank synthetic data generated by a base model, making it possible to measure the real reward exactly during later training. (57m34s)
- Reward hacking occurs when a proxy reward function is optimized in place of the actual reward function; errors outside the trained data distribution skew the perceived reward, causing the learned reward to increase while the true reward decreases. (58m16s)
- Even checkpoints whose downstream policies have low success rates can show low reward-model losses and high accuracies, indicating that the quality of the reward function is not necessarily connected to the performance of the downstream policy. (59m11s)
- In scenarios where preferences are not transitive, as in games like rock-paper-scissors, it is suggested to move away from reward maximization and instead optimize the policy directly: searching for a policy with the highest expected win rate against an adversary policy (a minimal statement of this objective follows this section). (1h0m2s)
- The approach of using adversary policy classes allows for the comparison of actions even when there is only a partial ordering, without being limited by fitting a reward function first. (1h1m2s)
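A minimal statement of the game-theoretic objective being described, in the form typically used in Nash learning from human feedback (regularization toward a reference policy is usually added; this is standard notation, not a formula from the lecture): the policy is chosen to maximize its expected win rate against the worst-case opponent policy,

```latex
\pi^{*} \;=\; \arg\max_{\pi}\; \min_{\pi'}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
\big[\, \mathcal{P}(y \succ y' \mid x) \,\big]
```

where P(y ≻ y' | x) is a preference (win-rate) model rather than a scalar reward.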
Nash Learning and Diverse Preferences
- Methods like direct Nash optimization or Nash learning from human feedback are emerging as new algorithms to address issues of partial ordering and plurality of preferences, which are theoretically unsatisfiable in some cases. (1h1m17s)
- The discussion addresses the challenge of training models that satisfy diverse preferences across different population distributions, highlighting the complexity of maintaining internal consistency within these models. (1h1m50s)
KL Regularization, Ensemble Models, and Weight Averaging
- In the context of Direct Preference Optimization (DPO), the role of KL regularization and the beta term is emphasized as a mechanism to prevent large optimization steps, with the beta term influencing how quickly the loss function changes. Other hyperparameters, such as the learning rate, also affect this process. (1h2m12s)
- Ensemble models are suggested as a method to address reward hacking issues, with the possibility of using ensemble sub-pieces or representing reward models as distributions rather than single scalars. This approach can accommodate a variety of preferences in data that may not be consistent with each other. (1h3m10s)
- A promising direction for addressing reward hacking involves weight averaging of reward models, which has been shown to improve robustness. This technique, explored under the name WARM (weight-averaged reward models), is supported by research and can be applied to both reward models and DPO models to enhance their performance (a minimal sketch of weight averaging follows this section). (1h4m49s)
- Weight averaging of models has been found to significantly improve performance, an approach that spread through the community (notably on Twitter) without the underlying reasons being fully understood. There is interest in exploring how much robustness can be achieved through strategies like evolutionary merging and averaging. (1h5m50s)
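A minimal sketch of the weight-averaging idea in PyTorch (a uniform interpolation of two checkpoints with identical architectures and floating-point parameters; an illustration of the general technique, not the exact recipe from the papers discussed):

```python
import copy

def average_weights(model_a, model_b, alpha=0.5):
    """Return a copy of model_a whose parameters are the convex combination
    alpha * model_a + (1 - alpha) * model_b (assumes matching architectures
    and floating-point parameters/buffers)."""
    merged_model = copy.deepcopy(model_a)
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    merged_state = {
        key: alpha * state_a[key] + (1.0 - alpha) * state_b[key]
        for key in state_a
    }
    merged_model.load_state_dict(merged_state)
    return merged_model
```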
KL-Penalized Reward Maximization and Weight Averaging
- The discussion involves starting with a KL-penalized reward maximization objective, which aims to maximize rewards while keeping the KL divergence small to avoid over-optimizing the reward function. This method is considered somewhat crude, and there is interest in exploring alternatives that account for reward model uncertainty. (1h6m33s)
- There is a consideration of moving away from the KL regularized policy optimization objective, as it might be too restrictive and leave some performance potential untapped. The idea is to explore more flexible approaches that do not overly constrain the policy. (1h7m32s)
- The weight averaging technique is inspired by results in deep Q-networks (DQN), where similar methods have been used to stabilize and improve performance. This involves using a target Q function that incorporates weight averaging, which has shown to improve stability and performance. (1h8m16s)
Overfitting Concerns and Proxy vs. True Reward
- There is a concern about overfitting, especially in domains with small datasets like medical fields. The challenge is to avoid extrapolating incorrectly due to limited data coverage, which is essentially an overfitting problem. (1h8m51s)
- In certain settings, such as DPO, overfitting can be beneficial, allowing for multiple epochs on small datasets to improve performance, although this may lead to losses in other areas depending on how the model is evaluated. (1h9m11s)
- There is a discrepancy between the proxy reward optimized during training and the true reward, which is not directly observed. Additionally, models are often evaluated based on win rates against baseline policies rather than average rewards, creating a disconnect between training objectives and practical utility. (1h9m41s)
Instruction Tuning and Preference Learning
- The training process involves two stages: supervised training of the language model followed by DPO training, with Kullback-Leibler (KL) divergence used to prevent deviation from the original supervised model. There is ongoing research into combining supervised instruction tuning and preference tuning into a single process. (1h10m38s)
- Instruction tuning is performed before reinforcement learning from human feedback (RLHF) to ensure the model generates responses aligned with given instructions, especially when starting with a pre-trained model that may produce irrelevant outputs. (1h11m32s)
- Some research efforts focus on integrating instruction tuning and preference learning into a unified optimization algorithm, although this remains an active area of research. (1h12m0s)
- There is skepticism about the effectiveness of certain optimization techniques due to the complexity and pitfalls of the optimization landscape, making single-shot optimization challenging. (1h12m50s)
- The process of obtaining preferences for models often requires exploration to gather trajectories, which is difficult with purely offline data. Iterative sampling and preference gathering can be beneficial. (1h13m31s)
Win Rate vs. Reward Maximization
- A discrepancy exists between training models to maximize a reward function and evaluating them based on win rate. Alternative objective functions, such as those used in Nash algorithms, can optimize directly for win rate by comparing responses from different policies. (1h14m1s)
- These alternative methods, which do not rely on a reward function, can be less constraining and may show improvements in win rate according to experimental results. However, the advantages over traditional methods like DPO depend on interpretation and context. (1h15m2s)
- The discussion addresses the concept of training for win rate instead of reward maximization, highlighting that improvements can be seen when considering the Bradley Terry model as the preference model and data generation model. (1h15m50s)
- It is noted that maximizing reward and maximizing the probability of winning are closely related, and OpenAI normalizes reward functions by subtracting a human baseline to reduce variance. This normalization does not change the optimal policy but significantly reduces variance (see the note after this section). (1h16m7s)
- The concept of a baseline is emphasized, which helps in reducing variance and is directly tied to maximizing the probability of winning. (1h17m1s)
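A short note on why the baseline subtraction mentioned above is "free" (a standard argument, not worked through in the lecture): shifting the reward by any prompt-dependent baseline b(x) leaves the KL-regularized optimum unchanged, because the shift is absorbed into the partition function,

```latex
r'(x, y) \;=\; r(x, y) - b(x)
\quad\Longrightarrow\quad
\pi^{*}_{r'}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,
e^{\,r(x, y)/\beta}\, e^{-b(x)/\beta} \;\propto\; \pi^{*}_{r}(y \mid x),
```

while Monte-Carlo policy-gradient estimates computed with r'(x, y) can have much lower variance when b(x) tracks the typical reward for prompt x.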
Multi-objective DPO and Uncertainty
- The applicability of DPO (Direct Preference Optimization) to multi-objective settings is discussed, referencing a paper called MODPO (Multi-Objective DPO). In this setting, DPO can be applied by conditioning on a scalarization of multiple objectives (sketched below), allowing for a weight-conditioned policy without needing to learn a reward function for each objective. (1h17m22s)
- The approach allows for selecting different weightings over objectives without retraining for each scalarization, and there are methods that incorporate uncertainty over the reward model. (1h18m11s)
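A minimal sketch of the scalarization idea described above (illustrative notation; see the MODPO paper for the exact formulation): a weight vector w combines the per-objective rewards,

```latex
r_{w}(x, y) \;=\; \sum_{i} w_i\, r_i(x, y),
\qquad w_i \ge 0,\;\; \sum_i w_i = 1,
```

and a single policy conditioned on w is trained with the DPO-style objective, so different trade-offs are obtained at inference time by changing w rather than retraining for each scalarization.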