Stanford CS25: V4 I Aligning Open Language Models
10 May 2024
Key Historical Milestones in Language Modeling
The Rise of Language Models
- 2020: GPT-3 release marked a significant improvement in language models.
- 2021: The "Stochastic Parrots" paper raised questions about the risks and societal costs of ever-larger language models.
- 2022: ChatGPT's release reshaped the narrative around language models.
Reinforcement Learning from Human Feedback (RLHF)
- RLHF is crucial for advanced language models like ChatGPT.
- RLHF proved cost- and time-effective, which surprised many NLP researchers.
- Examples of RLHF's impact, such as Anthropic's models.
Alignment and Open Alignment
- Timeline of alignment and open alignment, showcasing RLHF benefits.
- Various alignment concepts: instruction fine-tuning, supervised fine-tuning, RLHF, preference fine-tuning.
Recent Developments in Fine-Tuning and Alignment
- ChatGPT sparked discussions on open-sourcing models and forming development coalitions.
- Stanford's Alpaca model, an instruction-tuned fine-tune of Meta's LLaMA, and its capabilities.
- Subsequent models like Vicuna introduced new prompt sources and the concept of an LLM as a judge.
- Diverse datasets like ShareGPT accelerated progress in fine-tuning models.
- Legal considerations due to unlicensed datasets highlighted the need for responsible data collection.
- Recent datasets like LMSYS-Chat-1M and WildChat address data quality and user consent issues.
- Releasing weight diffs (deltas) against the base model as a workaround for licensing restrictions on the original weights.
- Notable models: Dolly (human-written instruction data), Open Assistant (human-generated prompts and rankings), StableVicuna (early RLHF proficiency).
- Efficient fine-tuning methods: LoRA (low-rank adaptation) and QLoRA (quantization plus GPU memory tricks); see the sketch after this list.
- New evaluation tools: Chatbot Arena, AlpacaEval, MT-Bench, and the Open LLM Leaderboard.
- Challenges in interpretability and specificity of evaluation metrics.
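As a rough illustration of the efficient fine-tuning methods listed above, here is a minimal sketch of LoRA adapters applied to a 4-bit-quantized base model in the QLoRA style, using the Hugging Face `transformers` and `peft` libraries. The model name and hyperparameters are illustrative choices, not values from the lecture.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-hf"  # illustrative base model

# QLoRA-style loading: keep the frozen base weights in 4-bit to save GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config)

# LoRA: train small low-rank adapter matrices instead of the full weight matrices.
lora_config = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

The adapters can then be trained with any standard fine-tuning loop; only the adapter weights receive gradients.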
Reinforcement Learning Fundamentals
- Review of reinforcement learning (RL) fundamentals, reward functions, and optimization.
- Introduction to direct preference optimization (DPO) as a simple and scalable training method.
- DPO directly optimizes a simple classification-style loss on preference pairs, so no separate reward model needs to be trained (see the loss sketch after this list).
- Successful scaling of DPO to a 70 billion parameter model, achieving performance close to GPT-3.5 on Chatbot Arena.
- Contributions from other projects like NVIDIA's SteerLM and Berkeley's Starling-LM-7B-alpha.
- PPO currently outperforms DPO among alignment methods.
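A minimal sketch of the DPO loss referenced above, assuming the summed per-token log-probabilities of each chosen and rejected completion have already been computed under the policy and under a frozen reference model. The function and tensor names, and the value of beta, are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much more the policy prefers each completion than
    # the frozen reference model does, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry-style classification loss: push the chosen completion's
    # implicit reward above the rejected completion's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -8.0]), torch.tensor([-15.0, -9.5]),
                torch.tensor([-13.0, -8.5]), torch.tensor([-14.0, -9.0]))
```

Because the reward is implicit in the difference between policy and reference log-probabilities, no separate reward model or RL rollout loop is needed; the loss is minimized with ordinary gradient-based training.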
The Modern Ecosystem of Open Models
- Growth of the open models ecosystem with diverse models and companies.
- Emerging specialized models like Genstruct for rephrasing raw text into instruction data.
- Open models catching up to closed models, but demand for both types persists.
- Data limitations in alignment research, with a few datasets driving most work.
- Need for more diverse and robust datasets to improve model performance.
Ongoing Research and Future Directions
- Continued research on DPO with various extensions and improvements.
- Increasing prevalence of larger model sizes and alignment research at scale.
- Growing popularity of smaller language models for accessibility and local running.
- Personalized language models for enhanced user experience and capabilities.
- Active contributions to alignment research by organizations and individuals.
- Model merging as an emerging technique for combining model weights cheaply, without needing a GPU (see the sketch after this list).
- Alignment's impact beyond safety, improving user experience and capabilities in areas like code and math.
- Synthetic data limitations and the need for controlled and trusted domain-specific models.
- Ongoing search for a better evaluation method with a stronger or more robust signal.
- Importance of embracing new developments and rapid progress in the language model space.
- Alignment involves changing the distribution of the language model's output and can involve multiple tokens and different loss functions.
- Watermarking for language models is seen as a losing battle; the focus is shifting toward proving content is human-made rather than detecting AI-generated content.
- Exploring optimization objectives beyond maximum likelihood estimation (MLE), such as reinforcement learning from human feedback (RLHF); see the objectives sketched after this list.
- Layered approach required to defend against attacks like Crescendo, considering specific use cases and limiting model capabilities.
- Potential of low-bit quantization methods like BitNet, but more expertise is needed to explore them further.
- Need to control large-scale data extraction from large language models, with synthetic data generation as a potential solution.
- Self-play as a broad field with no consensus on effective implementation.
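Two short illustrations of points from this list. First, on model merging: a minimal sketch of the simplest variant, a weighted average of checkpoint parameters, which runs on CPU tensors and therefore needs no GPU. The function and argument names are illustrative; real merging toolkits implement more sophisticated schemes such as spherical interpolation or TIES merging.

```python
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Weighted average of model checkpoints; runs entirely on CPU tensors."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name].to(torch.float32)
                           for w, sd in zip(weights, state_dicts))
    return merged
```

Second, on optimization beyond MLE: standard fine-tuning maximizes the likelihood of reference outputs, while RLHF maximizes a learned reward under a KL penalty that keeps the policy close to a reference model. The notation below is the standard formulation, not transcribed from the lecture:

$$\text{MLE: } \max_\theta \sum_i \log \pi_\theta(y_i \mid x_i) \qquad \text{RLHF: } \max_{\pi_\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big]$$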