Retrieval-Augmented Generation (RAG) Patterns and Best Practices

09 Jul 2024

Language Models and Their Capabilities

Generative AI, particularly large language models (LLMs), has the potential to be as transformative as previous technological shifts like personal computers, the web browser, and smartphones.
Language models offer both language understanding and language generation, supporting tasks such as summarization, copywriting, and text rewriting that go well beyond chat.
AI's most reliable use cases are in search, text classification, and categorization, where language models excel.

The concept of semantic search using language models involves dense retrieval and query rewriting to improve search relevance and capture user intent.
Dense retrieval involves sending a query to a language model to generate an embedding vector, which is then used to find the nearest neighbors in a vector database.
Reranking improves search results by passing the top candidates from a first-stage search system (for example, the top 100) to a language model, which reorders them by relevance to the query.
Retrieval-augmented generation combines search and generation by retrieving relevant documents and then using them to inform the generation of a response.
Retrieval-augmented generation offers several benefits: it makes smaller models viable, keeps answers up to date, exposes the sources behind an answer, and improves accuracy.
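
The dense retrieval and reranking steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: embed is a toy hashed bag-of-words function standing in for a real embedding model, the brute-force scan in dense_retrieve stands in for a vector database, and rerank uses token overlap as a placeholder for a language-model reranker. All function names are hypothetical.

```python
import math
import zlib

DIM = 64  # toy embedding size; real models use hundreds or thousands of dimensions


def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (w.strip(".,?!").lower() for w in text.split())
    return [t for t in tokens if t]


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for tok in tokenize(text):
        vec[zlib.crc32(tok.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; embed() returns unit-length vectors, so a dot product suffices."""
    return sum(x * y for x, y in zip(a, b))


def dense_retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Brute-force nearest-neighbor search (a vector database would index this)."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def rerank(query: str, candidates: list[str]) -> list[str]:
    """Placeholder reranker: order candidates by the share of query tokens they contain."""
    q = set(tokenize(query))

    def score(doc: str) -> float:
        return len(q & set(tokenize(doc))) / (len(q) or 1)

    return sorted(candidates, key=score, reverse=True)
```

In a RAG setting, the output of rerank(query, dense_retrieve(query, docs)) would be the passages placed into the generation prompt.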

A basic RAG pipeline consists of a search step followed by a generation step.
A reranker can be added to the search step to improve the quality of the results passed to the generation model.
Query rewriting is a technique used to improve the initial results of a retrieval-augmented generation (RAG) system by generating a search query from a natural language question.
Multi-query RAG involves breaking down a complex question into multiple simpler queries, retrieving results for each query, and then synthesizing the results to answer the original question.
The language model can control the flow of the search process by determining whether to search for information, generate text directly, or perform multiple rounds of searching and generating (multi-hop RAG).
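
The multi-query flow described above (decompose the question, search once per sub-query, then synthesize) can be sketched as follows. This is a rough sketch, not a reference implementation: decompose, search, and generate are hypothetical callables supplied by the caller, and in a real system decompose and generate would each be a language-model call. The toy versions below exist only so the control flow can be run end to end.

```python
from typing import Callable


def multi_query_rag(
    question: str,
    decompose: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
    top_k: int = 3,
) -> str:
    """Split a question into sub-queries, retrieve for each, then answer once."""
    sub_queries = decompose(question) or [question]
    context: list[str] = []
    for sub_query in sub_queries:
        for doc in search(sub_query)[:top_k]:
            if doc not in context:  # avoid passing duplicate passages
                context.append(doc)
    return generate(question, context)


# Toy stand-ins (assumptions for illustration only).
CORPUS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]


def toy_decompose(question: str) -> list[str]:
    """Naive decomposition: split the question on the word 'and'."""
    parts = (p.strip(" ?") for p in question.split(" and "))
    return [p for p in parts if p]


def toy_search(query: str) -> list[str]:
    """Keyword-overlap search over CORPUS, best match first."""
    q = set(query.lower().split())

    def overlap(doc: str) -> int:
        return len(q & set(doc.lower().rstrip(".").split()))

    return sorted((d for d in CORPUS if overlap(d) > 0), key=overlap, reverse=True)


def toy_generate(question: str, context: list[str]) -> str:
    """Stand-in for the generation model: echoes the question and its sources."""
    return f"Q: {question}\nSources: " + " | ".join(context)
```

Multi-hop RAG would wrap the same loop in an outer loop in which the model inspects the retrieved context and decides whether to issue further searches before answering.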

Language models can be used to interact with external systems and APIs, enabling them to perform actions such as searching, retrieving, and posting information, which extends their capabilities beyond simple text generation.
The concept of an "LLM-backed agent" refers to a software system that utilizes a large language model as its core component, allowing it to read and write from multiple sources, use tools, and perform various tasks.
Providing citations for the sources used in the generated text is important for enabling users to verify the accuracy and reliability of the information.
The final system architecture uses three language models in the retrieval step (an embedding model, a reranking model, and a generation model that writes the search queries), followed by a separate generation step that produces a grounded answer with citations.
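
One way to make citations checkable is to number the retrieved passages in the prompt and then confirm that every citation marker in the answer points to a real passage. The sketch below assumes a bracketed [n] citation style and a hypothetical prompt wording; neither comes from the source, and the function names are illustrative only.

```python
import re


def build_grounded_prompt(question: str, docs: list[str]) -> str:
    """Number each retrieved passage so the model can cite it as [n]."""
    sources = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, start=1))
    return (
        "Answer using only the sources below and cite them as [n].\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )


def resolve_citations(answer: str, docs: list[str]) -> dict[int, str]:
    """Map each [n] marker in the answer to its passage; reject unknown numbers."""
    cited: dict[int, str] = {}
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        if not 1 <= n <= len(docs):
            raise ValueError(f"citation [{n}] does not match any source")
        cited[n] = docs[n - 1]
    return cited
```

A check like resolve_citations only confirms that a citation points at a supplied passage. Whether the passage actually supports the claim still needs a separate evaluation or a human reader.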

Evaluation of RAG systems involves assessing accuracy, which is discussed in more detail in a separate chapter.
One way to evaluate search accuracy is to compare the relevance of the results returned by two systems for the same query.
Another is to take ranking into account: a system scores higher when it places relevant results nearer the top of the list.
Mean average precision (MAP) is a metric that takes into account both the relevance and ranking of search results.
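
As a concrete illustration, MAP can be computed as below. The sketch uses the common convention of dividing each query's summed precision by the total number of relevant documents for that query, and it assumes each ranked list contains no duplicate IDs. The function names and the document IDs in the example are made up for illustration.

```python
def average_precision(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """Average of precision@k taken at each rank k where a relevant doc appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean of average precision over (ranked results, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                 # AP = (1/2) / 1
]
print(round(mean_average_precision(runs), 4))  # 0.6667
```

Moving a relevant document up the list raises the score, which is how the metric captures both relevance and ranking.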

It is important to invest early in building robust search systems and to experiment with different methods to identify failure points.
A hybrid search system that combines keyword search with embedding-based search can be more effective than relying solely on one method.
Injecting other relevant signals into search systems can further improve their performance.
Software testing and unit tests should be incorporated into machine learning development to catch regressions in model behavior.
The choice of framework for building search systems depends on the specific use case and latency requirements.
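
A hybrid system needs some way to merge the keyword ranking with the embedding ranking. One widely used option is reciprocal rank fusion (RRF), shown here as an example. The source does not say which fusion method it has in mind, so treat this as one possibility. The constant k = 60 is a conventional default, and the document IDs are placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["d1", "d2", "d3"]    # e.g. from a keyword search engine
embedding_hits = ["d3", "d1", "d4"]  # e.g. from dense retrieval
print(reciprocal_rank_fusion([keyword_hits, embedding_hits]))
# ['d1', 'd3', 'd2', 'd4']
```

Because RRF works on ranks instead of raw scores, it sidesteps the problem that keyword scores and embedding similarities are on different scales. Other signals, such as recency or popularity, could be added as further ranked lists.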

Language models have evolved beyond their original definition as systems that predict the next word in a sequence, now exhibiting capabilities such as answering questions, generating code, and performing various tasks based on the data they are trained on.
Language models can become general problem-solving tools when trained and optimized properly.

Overtrusting language models without considering their limitations and potential vulnerabilities is a common pitfall.
Language models are probabilistic, not deterministic, so caution is needed when integrating them with systems of record.
The feasibility of integrating language models with systems of record depends on the use case and requires careful consideration of risks, guard rails, and human involvement.
Techniques like majority voting, in which the model is sampled several times and the most common answer is kept, can improve the reliability of language model outputs.
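
A minimal sketch of majority voting follows. It assumes a hypothetical sample_fn callable that returns one model completion per call (with sampling randomness enabled) and that answers are short enough to compare after simple normalization. The agreement threshold is an illustrative choice, not a recommendation from the source.

```python
from collections import Counter
from typing import Callable, Optional


def majority_vote(
    sample_fn: Callable[[str], str],
    prompt: str,
    n: int = 5,
    min_agreement: float = 0.5,
) -> Optional[str]:
    """Sample n answers and return the most common one, or None without a majority."""
    answers = [sample_fn(prompt).strip().lower() for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    if count / n <= min_agreement:
        return None  # no clear majority: a natural point to involve a human
    return answer
```

Returning None when the samples disagree gives the surrounding system an explicit signal to fall back to a guard rail or a human reviewer, which matters most when the output would be written to a system of record.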