Stanford Webinar - Large Language Models Get the Hype, but Compound Systems Are the Future of AI

04 Dec 2024

Large Language Models and the Rise of Compound AI Systems

  • Large language models receive significant attention, but compound systems are the future of AI, and this concept is often overlooked due to the focus on models in headlines (11s).
  • The trend of emphasizing large language models began with the GPT-3 paper, which introduced a 175 billion parameter model, an order of magnitude larger than previous models, and demonstrated the potential of scaling to create new systems (56s).
  • Google's announcement of PaLM, a 540-billion-parameter model, is another example of this trend: the focus is on the model rather than the system it is part of (1m30s).
  • Even companies like OpenAI, which focus on full software systems, tend to announce new models, such as GPT-4, rather than the systems they power (2m7s).
  • The emphasis on models can create a misconception that they are the primary focus, but in reality, people only interact with systems, not individual models (42s).
  • A large language model is inert on its own and requires additional components, such as a prompt and a sampling method, to function as a system (2m57s).
  • The choices of prompt and sampling method are non-trivial and play a crucial role in creating a functional system (3m49s).
  • Compound systems such as Gemini and ChatGPT are already being developed and used, but they are often described in terms of their models rather than as whole systems (1m50s).
  • A minimal system in AI consists of a prompt, a model, and a sampling method, but modern systems are taking this further by giving these minimal systems access to calculators, programming environments, databases, web APIs, and the web itself, making them software systems with the language model as a hub (4m23s).
  • The capabilities of these systems are defined by how all components work together, not solely by the language model, which is the essence of the thesis that compound AI systems are the future of AI (4m43s).
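The minimal system described above (a prompt, a model, and a sampling method) can be sketched in a few lines. This is an illustrative toy, not code from the webinar: `toy_model`, `greedy`, and `MinimalSystem` are hypothetical names, and the "model" is a hard-coded next-token distribution standing in for a real language model.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-in for a language model: maps a prompt to a
# next-token probability distribution. A real model would compute this.
def toy_model(prompt: str) -> Dict[str, float]:
    if prompt.rstrip().endswith("Answer:"):
        return {"Paris": 0.7, "Lyon": 0.2, "Rome": 0.1}
    return {"the": 0.5, "a": 0.5}

# Sampling method 1: always pick the most probable token.
def greedy(dist: Dict[str, float], rng: random.Random) -> str:
    return max(dist, key=dist.get)

# Sampling method 2: draw a token in proportion to its probability.
def sample(dist: Dict[str, float], rng: random.Random) -> str:
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

@dataclass
class MinimalSystem:
    template: str                                        # the prompt
    model: Callable[[str], Dict[str, float]]             # the model
    sampler: Callable[[Dict[str, float], random.Random], str]  # the sampling method

    def __call__(self, question: str, seed: int = 0) -> str:
        prompt = self.template.format(question=question)
        dist = self.model(prompt)
        return self.sampler(dist, random.Random(seed))

system = MinimalSystem("Question: {question}\nAnswer:", toy_model, greedy)
print(system("What is the capital of France?"))  # prints "Paris"
```

Swapping the template or the sampler changes the behavior of the whole system even though the model is untouched, which is the point the talk makes about the model being inert on its own.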

The Shift from Models to Systems

  • This idea has been explored in a blog post titled "The Shift from Models to Compound AI Systems" from February 18, 2024, which emphasizes the importance of thinking in terms of systems rather than just models (5m0s).
  • The idea of shifting from models to systems has been echoed by others in the field, such as Sam Altman, who recently mentioned expecting a shift from talking about models to talking about systems (5m31s).
  • Investing in the right things when developing an AI solution requires thinking in terms of systems, rather than just focusing on the model, which is analogous to designing a Formula 1 race car where all components must work together to achieve success (6m16s).
  • Focusing solely on the model, like focusing solely on the engine of a Formula 1 race car, will not lead to a good overall system, and it's essential to think in terms of the entire system (7m26s).
  • Building the best system for specific goals and constraints will likely emphasize system components, and a small model embedded in a smart system will always be better than a big model in a simplistic system (7m48s).
  • When designing AI systems, considerations such as cost, latency, safety, and privacy are crucial, and in such cases, a small model might be the only viable choice, especially when cost is a significant factor (7m58s).
  • Regulation should shift its focus from model artifacts to entire systems, since focusing solely on models risks both neglecting dangerous systems and overregulating (8m30s).

Sampling Methods and Their Importance

  • The method used for sampling when generating output from a model is a critical component of the system, and there are various methods available, including greedy decoding, top-p sampling, beam search, and insisting on token diversity (9m12s).
  • These sampling methods can be used to impose high-level ideals, such as ensuring generated output conforms to valid JSON or a specific grammar, and there are innovative ideas in the literature that make use of gradients and other techniques to achieve this (10m9s).
  • Researchers are exploring new methods to improve sampling, including adding parameters to the model to adaptively find a good temperature for creative or constrained output, depending on the task (11m1s).
  • The concept of sampling can be expanded to include creative exploration, such as majority completion strategies, which involve using the model to complete a prompt with a hard reasoning task (11m31s).
  • Large language models can generate answers in one step, but it might be more effective to have them generate a reasoning path and then produce an answer, allowing the model to explore and reason before providing a final answer (11m46s).
  • By sampling multiple reasoning paths, the model produces a distribution of answers, and the most common answer is taken as the final one (11m56s).
  • This approach is not strictly a sampling strategy, but rather a way to let the model explore and reason before producing a final answer, and it can be made to look like a simple sampling strategy by hiding the intermediate steps from the user (12m27s).
  • There is no one true sampling method for a model, and the choice of sampling method is highly consequential for the overall system, as it is what makes the model "speak" and can interact in complex ways with the language model (13m0s).
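Three of the ideas above (greedy decoding, top-p sampling, and taking the most common answer across several sampled reasoning paths) can be sketched as follows. This is a minimal illustration under stated assumptions: the distribution is a hand-written dictionary rather than real model output, and `generate_path` is a hypothetical callable that would, in a real system, sample one reasoning path and return its final answer.

```python
import random
from collections import Counter
from typing import Callable, Dict

def greedy(dist: Dict[str, float]) -> str:
    """Greedy decoding: always take the most probable token."""
    return max(dist, key=dist.get)

def top_p_filter(dist: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of most-probable tokens whose total mass
    reaches p, then renormalize over that set."""
    kept: Dict[str, float] = {}
    total = 0.0
    for tok, prob in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        total += prob
        if total >= p:
            break
    return {tok: prob / total for tok, prob in kept.items()}

def top_p_sample(dist: Dict[str, float], p: float, rng: random.Random) -> str:
    """Top-p (nucleus) sampling: sample from the filtered distribution."""
    filtered = top_p_filter(dist, p)
    toks = list(filtered)
    return rng.choices(toks, weights=[filtered[t] for t in toks], k=1)[0]

def majority_answer(generate_path: Callable[[random.Random], str],
                    n: int, seed: int = 0) -> str:
    """Sample n reasoning paths and return the most common final answer."""
    rng = random.Random(seed)
    answers = [generate_path(rng) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy distribution over final answers (an assumption for illustration).
dist = {"42": 0.6, "41": 0.3, "7": 0.1}
print(greedy(dist))
print(sorted(top_p_filter(dist, 0.8)))
print(majority_answer(lambda rng: top_p_sample(dist, 0.8, rng), n=25))
```

From the caller's side, `majority_answer` looks like an ordinary sampling call, which mirrors the remark that the intermediate reasoning steps can be hidden from the user.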

Prompting: The Heart of Modern AI Systems

  • Prompting is a crucial aspect of modern AI system development, and it is the heart of what makes these systems work, allowing for complex and interesting behaviors to be achieved through careful engineering of prompts (13m38s).
  • The origins of prompting in AI systems can be traced back to the GPT-2 paper from 2019, which demonstrated that language models can perform downstream tasks in a zero-shot setting without any parameter or architecture modification (14m12s).
  • The GPT-2 paper showed that by adding a simple prompt, such as "TL;DR", to the input text, the model could be induced to perform tasks such as summarization, translation, question answering, and reading comprehension (14m36s).
  • Large language models then developed rapidly, from GPT-2 with 1.5 billion parameters to GPT-3 with 175 billion parameters, showing the impact of scaling on complex in-context learning (15m43s).
  • GPT-3 demonstrated successful question-answering capabilities by learning to imitate behavior from a prompt with a context passage, demonstrations, and a target question, leading to striking behavior where the model answers questions as substrings of the passage (16m28s).
  • The GPT-3 paper identified a general pattern for this behavior, consisting of context or instructions, a list of demonstrations, and a target, which was applied to various tasks such as QA, reading comprehension, and machine translation (16m41s).
  • A template for building systems in this modern mode has been established, allowing for exciting possibilities, but also highlighting the potential dark side of prompting, including sensitivity to prompt formatting (17m20s).
  • Research has shown that language models can be highly sensitive to prompt choice, with minor changes leading to significant differences in performance, emphasizing the importance of considering the model-prompt combination when evaluating a model's capabilities (18m12s).
  • The concept of systems thinking is essential in this context, where the focus should be on finding the optimal prompt-model combination to achieve specific goals, rather than evaluating the model in isolation (18m59s).
  • Chain-of-thought reasoning has been explored in papers such as "EchoPrompt," which empirically examines different strategies for step-by-step reasoning, highlighting the importance of considering the prompting strategy when evaluating a model's capabilities (19m17s).
  • The performance of a model can vary greatly based on how a question is framed, as shown by the results of an experiment with the davinci-002 model, highlighting the importance of considering model-prompt combinations when evaluating a model (19m38s).
  • Evaluating a model requires thinking in terms of model-prompt combinations, which is a systems-level approach rather than a model-level approach, and can bring clarity to development cycles (20m7s).
  • A tweet describing someone who upgraded from GPT-4 to GPT-4o and had to re-engineer their prompts highlights the inextricable link between models and prompts (20m25s).
  • Prompts are not just written in English, but are an effort to communicate with the language model, and understanding this link is crucial for developing effective systems (20m57s).
  • The Apple Intelligence prompts found in system files demonstrate the tightly knit relationship between prompts and models, with prompts being more like compiled binaries meant to be paired with a particular model (21m41s).
  • The DSPy library is presented as a tool for developing systems that account for the interaction between models and prompts, taking a systems-level approach to AI development (22m37s).
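The prompt pattern described above (instructions, a list of demonstrations, and a target) and the idea of scoring model-prompt combinations rather than models alone can be sketched as follows. Everything here is a toy: `toy_model_a` and `toy_model_b` are hypothetical stand-ins with deliberately different formatting sensitivities, and the "Q:"/"A:" layout is one assumed format among many.

```python
from typing import Callable, Dict, List, Tuple

def build_prompt(instructions: str, demos: List[Tuple[str, str]], target: str) -> str:
    """Assemble instructions, demonstrations, and the target question."""
    parts = [instructions, ""]
    for q, a in demos:
        parts += [f"Q: {q}", f"A: {a}", ""]
    parts += [f"Q: {target}", "A:"]
    return "\n".join(parts)

FACTS = {
    "What is the capital of France?": "Paris",
    "What is the capital of Italy?": "Rome",
}

def last_question(prompt: str) -> str:
    lines = [l for l in prompt.splitlines() if l.startswith("Q: ")]
    return lines[-1][3:] if lines else ""

# Hypothetical model A: only answers when the prompt ends with the "A:" cue.
def toy_model_a(prompt: str) -> str:
    return FACTS.get(last_question(prompt), "unknown") if prompt.endswith("A:") else "unknown"

# Hypothetical model B: only answers when there is no instruction preamble.
def toy_model_b(prompt: str) -> str:
    return FACTS.get(last_question(prompt), "unknown") if prompt.startswith("Q: ") else "unknown"

def score_combinations(models: Dict[str, Callable[[str], str]],
                       prompts: Dict[str, Callable[[str], str]],
                       dataset: List[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """Accuracy for every (model, prompt strategy) pair."""
    scores = {}
    for m_name, model in models.items():
        for p_name, make_prompt in prompts.items():
            correct = sum(model(make_prompt(q)).strip() == gold for q, gold in dataset)
            scores[(m_name, p_name)] = correct / len(dataset)
    return scores

prompts = {
    "few_shot": lambda q: build_prompt(
        "Answer briefly.", [("What is the capital of Spain?", "Madrid")], q),
    "bare": lambda q: f"Q: {q}",
}
scores = score_combinations(
    {"model_a": toy_model_a, "model_b": toy_model_b}, prompts, list(FACTS.items()))
for pair, acc in scores.items():
    print(pair, acc)
```

In this toy grid each model looks perfect under one prompt strategy and useless under the other, so a ranking that fixes a single prompt would say more about the prompt than about either model.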

From Prompt Engineering to Language Model Programming with DSPy

  • The core lessons of artificial intelligence, including modular system design, data-driven optimization, and generic architectures, have been successful in part because they have been adopted throughout the field (22m51s).
  • The development of AI systems has moved rapidly, especially in the deep learning era, leading to the creation of exciting large language models, and libraries like Torch, Theano, Chainer, and PyTorch have embodied time-tested lessons that have helped speed up development (23m7s).
  • However, the current moment in AI system development is characterized by a lot of manual adjustments to prompts, resulting in complete model dependence, which is tragic given the progress made by time-tested lessons (23m34s).
  • DSPy is a programming model and library that aims to move away from prompt engineering and towards language model programming, honoring the insight that a prompt, language model, and sampling strategy together define a software system (24m3s).
  • DSPy allows users to define a system using high-level tools and then compile it down into an actual prompt to a language model, abstracting away some model dependence (24m34s).
  • A simple example in DSPy is a minimal system for basic question answering, which can be written in just one line of code and gets compiled down into an actual prompt to a language model (24m56s).
  • More complex programs can also be written in DSPy, such as a program for multihop question answering, which allows users to freely express the kind of system they want to develop in code (25m43s).
  • DSPy's design principles are closely aligned with those of PyTorch, and a program can be optimized to find a successful prompting strategy for whichever language model and tools are chosen (26m0s).
  • The optimizer in DSPy can simultaneously optimize the instructions and few-shot demonstrations, moving the burden of finding good prompts onto the optimizer (26m29s).
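The two ideas above, compiling a declarative program into a prompt and letting an optimizer choose few-shot demonstrations, can be illustrated with a standard-library toy. This is explicitly not DSPy's API: `Predict`, `compile_prompt`, `bootstrap_few_shot`, and `toy_lm` are hypothetical names written for this sketch, and the real library's interfaces should be taken from its own documentation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Predict:
    """Hypothetical module: an 'input -> output' signature plus demos."""
    signature: str                               # e.g. "question -> answer"
    demos: List[Tuple[str, str]] = field(default_factory=list)

    def compile_prompt(self, value: str) -> str:
        """Turn the declarative signature and demos into a concrete prompt."""
        inp, out = [s.strip() for s in self.signature.split("->")]
        lines = [f"Given the field `{inp}`, produce the field `{out}`.", ""]
        for x, y in self.demos:
            lines += [f"{inp.capitalize()}: {x}", f"{out.capitalize()}: {y}", ""]
        lines += [f"{inp.capitalize()}: {value}", f"{out.capitalize()}:"]
        return "\n".join(lines)

    def __call__(self, lm: Callable[[str], str], value: str) -> str:
        return lm(self.compile_prompt(value)).strip()

def bootstrap_few_shot(program: Predict, lm: Callable[[str], str],
                       trainset: List[Tuple[str, str]],
                       metric: Callable[[str, str], bool],
                       max_demos: int = 2) -> Predict:
    """Run the program on training inputs and keep the outputs that pass
    the metric as demonstrations for the optimized program."""
    demos: List[Tuple[str, str]] = []
    for x, gold in trainset:
        pred = program(lm, x)
        if metric(pred, gold):
            demos.append((x, pred))
        if len(demos) == max_demos:
            break
    return Predict(program.signature, demos)

# Hypothetical language model: knows two sums, answers "?" otherwise.
def toy_lm(prompt: str) -> str:
    question = prompt.splitlines()[-2].split(": ", 1)[1]
    return {"2+2": "4", "3+3": "6"}.get(question, "?")

qa = Predict("question -> answer")               # the one-line program
trainset = [("2+2", "4"), ("5+5", "10"), ("3+3", "6")]
optimized = bootstrap_few_shot(qa, toy_lm, trainset, lambda p, g: p == g)
print(optimized.compile_prompt("7+7"))
```

The program itself stays a single declarative line; the prompt text and its demonstrations are produced by `compile_prompt` and the optimizer, so changing the model means re-running the optimizer rather than hand-editing prompt strings.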

Evaluating AI Systems: A Framework and Case Study

  • A framework for evaluation is presented, consisting of a program, an optimizer, and a language model, to specify a complete system and demonstrate the importance of thinking in systems terms (27m7s).
  • The baseline system, which goes directly from questions to answers, achieves a score of 34 for GPT-3.5 Turbo and 27.5 for Llama 2 13B, with the metric being exact match on the desired answer (27m47s).
  • Adding a simple DSPy program for retrieval-augmented generation results in a boost in performance, and using bootstrap few-shot optimization leads to significant gains, with scores increasing to 42 and 38 (28m23s).
  • The use of ReAct agents, which enable the model to reflect and think about how to solve the task, shows less success but still demonstrates the power of systems thinking (28m36s).
  • Human-written reasoning prompts underperform compared to prompts generated through simple bootstrapping, highlighting the power of data-driven optimization (28m58s).
  • A program designed for multihop reasoning, which gathers evidence from multiple passages, achieves high scores, with GPT-3.5 Turbo reaching almost 55 and Llama 2 13B reaching 50, demonstrating the power of intelligent system design and small models (29m37s).
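The comparison above, where the same "model" scores differently depending on the program wrapped around it, can be imitated on a toy scale. The sketch below is an assumption-laden illustration: the four-sentence corpus, the word-overlap retriever, and the regex-based `toy_answer` are all hypothetical, and the resulting scores say nothing about the real benchmark numbers.

```python
import re
from typing import Callable, List, Set, Tuple

CORPUS = [
    "The Eiffel Tower was designed by the firm of Gustave Eiffel.",
    "Gustave Eiffel was born in Dijon.",
    "The Mona Lisa was painted by Leonardo da Vinci.",
    "Leonardo da Vinci was born in Vinci.",
]

def tokens(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, exclude: List[str]) -> str:
    """Return the unseen passage sharing the most words with the query."""
    candidates = [p for p in CORPUS if p not in exclude]
    q = tokens(query)
    return max(candidates, key=lambda p: len(q & tokens(p)))

def toy_answer(question: str, context: str) -> str:
    """Hypothetical stand-in for the model's answer step."""
    match = re.search(r"born in (\w+)", context)
    return match.group(1) if match else "unknown"

def multihop(question: str, hops: int) -> str:
    """Gather one passage per hop, feeding earlier evidence into the next query."""
    passages: List[str] = []
    for _ in range(hops):
        query = " ".join([question] + passages)
        passages.append(retrieve(query, passages))
    return toy_answer(question, " ".join(passages))

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def evaluate(program: Callable[[str], str], dataset: List[Tuple[str, str]]) -> float:
    return sum(exact_match(program(q), g) for q, g in dataset) / len(dataset)

dataset = [
    ("Where was the designer of the Eiffel Tower born?", "Dijon"),
    ("Where was the painter of the Mona Lisa born?", "Vinci"),
]
print("single-hop EM:", evaluate(lambda q: multihop(q, hops=1), dataset))
print("multi-hop EM: ", evaluate(lambda q: multihop(q, hops=2), dataset))
```

The answer step is identical in both runs; only the program around it changes, and on this toy data that alone moves exact match from 0.0 to 1.0.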

The Importance of System Design for Small Models

  • The importance of designing systems that can get the most out of small models is emphasized, as 77% of enterprise usage of models is at the 13 billion parameter size or smaller, according to an analyst at Theory Ventures (30m16s).
  • In industrial systems, latency is a significant concern, with ideal latency being around 18 milliseconds, but as latency increases above 50 milliseconds and up to 750 milliseconds, it can become expensive and potentially prohibitive (30m35s).
  • To get the most out of small models, it's essential to think about the systems being designed around them, with the prompt being a crucial factor in system performance (31m19s).

Tool Access and System Consequences

  • Tool access is an area where entire systems, not just models, are considered, involving calculators, programming environments, databases, the web, and web APIs (31m36s).
  • When designing systems, it's essential to consider the overall consequences, such as reliability, preference, and danger, rather than just focusing on technical details (31m57s).
  • A giant large language model with a snapshot of the entire web may be less reliable than a tiny language model working with an up-to-date web search engine (32m20s).
  • A small model doing autocomplete tasks locally on a phone may be preferred over a giant large language model doing contextless autocomplete via a centralized service (32m41s).
  • A 10-billion-parameter language model with access to databases and the web may be more dangerous than GPT-4 with no access to these tools (33m5s).
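A small sketch can make the tool-access point concrete: the language model acts as a hub that requests a tool, and the surrounding system decides whether the call is permitted and executes it. The `CALL name: argument` convention, `toy_model`, and `run_system` are hypothetical choices for this illustration, not a format used by any particular product.

```python
import ast
import operator
from typing import Set

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expr: str):
    """Evaluate +, -, *, / over numeric literals without using eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

TOOLS = {"calculator": calculator}

# Hypothetical model: asks for the calculator first, then reports the result.
def toy_model(prompt: str) -> str:
    if "Tool result:" in prompt:
        return "The answer is " + prompt.rsplit("Tool result:", 1)[1].strip()
    return "CALL calculator: 17 * 23"

def run_system(question: str, allowed_tools: Set[str]) -> str:
    prompt = f"Question: {question}"
    reply = toy_model(prompt)
    if reply.startswith("CALL "):
        name, arg = reply[len("CALL "):].split(":", 1)
        if name not in allowed_tools:            # system-level guardrail
            return f"Tool '{name}' is not permitted."
        result = TOOLS[name](arg.strip())
        reply = toy_model(f"{prompt}\nTool result: {result}")
    return reply

print(run_system("What is 17 * 23?", {"calculator"}))  # prints "The answer is 391"
print(run_system("What is 17 * 23?", set()))
```

The same model yields a different system depending on `allowed_tools`, which is why reliability and risk are properties of the system configuration rather than of the model's parameter count.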

The Future of AI: Systems, Not Just Models

  • In 2026, it's expected that systems consisting of multiple models and tools working together will be more prevalent than massive foundation models that do everything within their own parameters (33m38s).
  • Recent legislation, such as California's SB 1047, has attempted to regulate models based on their size, with a focus on models that cost over $100 million to train and use more than 10^26 floating-point operations during training (34m1s).
  • The concept of regulating large language models is being explored, with the idea that smaller specialized models may emerge as equally or more dangerous than larger models, and that complex systems composed of these models could be more hazardous than a single expensive model (34m35s).
  • The focus of future legislation should be on regulating systems rather than individual models, as systems are the entities that can cause harm (35m26s).

Rethinking Evaluation: Focusing on Systems, Not Just Models

  • Current research evaluations, such as leaderboards, are flawed as they only evaluate individual models, whereas in reality, a system consisting of a model, prompting strategy, and generation procedure is being evaluated (35m40s).
  • The community should reorient leaderboard evaluations to focus on entire systems, considering all components working together, rather than just individual language models (36m42s).

Scaling Up: Multiple Dimensions and the Power of Systems

  • The future of AI may involve multiple notions of scaling, including scaling up unsupervised training and scaling up instruction fine-tuning, with the latter driving significant gains in AI progress (37m21s).
  • The power of large teams of humans creating good input-output pairs for model updates has been demonstrated, particularly with the emergence of ChatGPT in 2022 (38m4s).
  • Scaling up large language models still leads to gains, but it is not a silver bullet, which has led to a growing interest in scaling sampling for generation: sophisticated sampling procedures that can be thought of as scaling up inference-time processing or search (38m24s).
  • This theme is expected to continue from 2024 onward, with transformative things happening in virtue of the fact that perfectly good language models, even small ones, are given access to lots of different tools and other things that make them really capable as systems (38m54s).
  • The future of AI lies in the scaling up of systems, where language models are given access to various tools and capabilities, making them more productive and leading to bigger gains (39m0s).

Generative Agents and the Importance of Systems Thinking

  • Generative agents play a crucial role in system-level scaling, as they can be used to have a language model do things it couldn't do on its own and to make use of tools and tool output (40m5s).
  • Systems thinking is essential, and people should not be purists when designing software systems, as they can design agents that depend on the model doing complex things in generation or write code to bridge the gaps between the language model's capabilities and the desired outcome (40m31s).

Reasoning Paths and Transparency in AI Systems

  • Modern AI systems are already producing multiple reasoning paths in the background, and the system produces reasoning paths through methods like Chain of Thought, which involves prompting the model to think step by step and generate tokens in response (41m5s).
  • To give users more access to the inner workings of these systems, it's essential to understand how the system produces reasoning paths and provide users with more transparency and control over the evaluation of results (41m13s).
  • The concept of Chain of Thought reasoning involves generating multiple inference paths to arrive at different outcomes, and then statistically analyzing those outputs to decide on a final generation, which is a systems thinking approach that considers the prompt, overall system structure, and generation methods (42m0s).
  • This approach is exemplified in OpenAI's o1 models, which perform extensive inference-time work before producing an answer, but the details of these processes are considered trade secrets (42m47s).
  • Researchers can explore the behaviors of smaller models to understand how to coax out desired behaviors, and this topic is expected to be explored in the research literature going forward (43m16s).

The Future of System-Level Scaling and Tool Access

  • The future of system-level scaling will involve increasingly complex systems, similar to Google search, which has evolved from a simple search technology to a complicated software system that functions through teams of people and dynamic behavior (43m41s).
  • The development of GenAI systems will lead to incredible systems being built over the next decades, and a key aspect to watch is when people provide tool access, allowing language models to interact with the web and other systems like humans do (44m17s).
  • As language models become more integrated with the web and other systems, they will have significant consequences, both productive and problematic, and it is essential to establish guardrails to mitigate potential negative consequences (44m51s).
  • Establishing guardrails requires considering both positive and negative consequences and thinking about how to regulate and manage the development and deployment of these systems (45m28s).

Regulating AI: Systems vs. Models and Human Considerations

  • Regulating AI systems is more likely to be effective than regulating the models themselves, as there is already existing legislation that governs software system behavior, which can be applied to the AI realm (45m39s).
  • Considering human aspects, such as requiring AI systems to identify themselves as non-human, can help people calibrate to them as agents and control the situations they are allowed to enter (46m1s).
  • Fundamental restrictions may be necessary, such as limiting AI models' ability to log into certain websites or interact freely on social networks (46m20s).

The Near-Term Impact of AI and the Need for Vigilance

  • The initial disasters caused by AI systems may not be cataclysmic, allowing society to learn from them and figure out how to respond (46m39s).
  • AI will impact people's lives in the next five years, even if they are not thinking about it, as more systems that can help with daily tasks and provide companionship, discovery, and creative expression will emerge (47m11s).
  • AI can also help with education, providing customized experiences at low costs, but there is a risk of bad actors using AI for malicious purposes, such as social engineering (47m54s).
  • Individuals and society need to be on the lookout for AI systems and take steps to prevent malicious activities (48m29s).

Getting Involved in the AI Community and Learning Resources

  • For those interested in learning more about AI, there are resources available, such as a Discord community and tutorials for technical individuals, and recommendations for business leaders to grasp the implications of AI for their businesses (48m44s).
  • To get involved in the community and contribute to projects, one can start by filing issues or making pull requests, which can lead to positive impacts and learning opportunities, and joining the Discord community to see what others are working on and potentially collaborate with them (49m19s).
  • The research community is also open to new contributors, and YouTube can be a helpful resource for learning about different prompting strategies, agent tool usage, and other relevant topics (49m46s).

The Changing Landscape of AI Research and Information Access

  • However, the industry is becoming increasingly closed, making it harder to gain insight into the decisions being made and why, even at the level of research innovation (50m4s).
  • For leaders in organizations looking to define a generative AI strategy, it's essential to think about what success looks like, what kind of testing to do, and designing a system that balances goals with known risks (50m21s).
  • To stay up-to-date with the field, one can dedicate 10 minutes a day to reading and learning, and recommended resources include the ACL Anthology for NLP papers, Semantic Scholar, and tools like ChatGPT that can do retrieval-augmented generation (51m14s).
  • Twitter, or X, is no longer as reliable a resource as it once was due to changes, and communities have spread out to other platforms like Bluesky, Threads, and Mastodon (51m48s).
  • The rise of generative AI has made it easier to get a sense of an area by typing common-sense questions into tools like ChatGPT, which can provide a starting point for learning and exploring the literature (52m4s).
  • NLP is a well-organized community with a comprehensive literature, and resources like the ACL Anthology and Semantic Scholar can be used to find relevant papers and stay current (52m42s).
  • A course on natural language understanding is available, which includes a project development phase that involves building a literature review, forming an experimental protocol, writing a paper, and creating associated code, providing a guided way to do a focused research project and understand the rhythms of research in the domain (52m55s).

Practical Advice for Building with DSPy and LangChain

  • DSPy is gaining traction in the business world, with various organizations, including JetBlue and startups, using it in different ways, and its website, dspy.ai, offers documentation, use cases, and starter code (53m35s).
  • When starting out with DSPy or LangChain, it's essential to make principled choices and avoid designing a system entirely around prompt templates, as this can lead to unintended consequences and make changes difficult (54m26s).
  • Using prompt templates can be productive for teaching purposes, but it's crucial to express things as proper software systems to avoid failure modes and ensure flexibility and adaptability (54m32s).
  • DSPy is a great choice for building software systems, especially for those experienced in machine learning, as it's tailored to their needs and follows PyTorch principles, although it may have a learning curve (55m28s).
  • The most important thing to take away is to avoid thinking entirely in terms of models and instead focus on building software systems that can respond to new requirements and changes in the underlying environment (56m16s).

The Importance of Systems Thinking: A Final Emphasis

  • Software systems like ChatGPT are often viewed as models, but they are actually compound systems that require a broader focus beyond just the model choice or its properties (56m34s).
  • A more effective approach is to concentrate energy on the entire system, similar to an F1 race car design team that focuses on all the complicated pieces working together in concert (56m49s).
  • In the industry, most energy is focused on small models, which makes system design even more crucial, as simplistic system design and prompting strategies are not sufficient (57m10s).
  • With small models, it is necessary to do everything possible to achieve a significant impact, which places more pressure on system design, but this pressure also presents a huge opportunity (57m29s).

Conclusion and Next Steps

  • The discussion concluded with appreciation for the questions and participants, and the session will be posted on YouTube and shared as a recorded session (57m49s).
