Generative Search: Practical Advice for Retrieval Augmented Generation (RAG)
08 Aug 2024
Large Language Models (LLMs) and Their Challenges
- Large language models (LLMs) are used in many applications, including summarization and question answering.
- LLMs have several challenges, including cost, quality, performance, and security.
- LLMs are expensive, especially generative ones.
- LLMs often generate incorrect information, commonly referred to as "hallucinations."
- LLMs have low query per second (QPS) rates, which can create performance bottlenecks.
- LLMs can pose security risks when used with internal and external data.
Retrieval Augmented Generation (RAG)
- Retrieval Augmented Generation (RAG) is a technique that aims to address the challenges of LLMs by rethinking the data strategy.
- RAG systems are often used to create private versions of ChatGPT that are grounded in internal knowledge bases rather than trained on them.
- Fine-tuning LLMs on internal data can improve their performance but can also lead to security issues.
- RAG systems can use vector databases to store and retrieve relevant information at runtime, which can improve performance and reduce costs (a minimal end-to-end sketch follows this list).
- RAG systems can be chained together using tools like LangChain, Relevance AI, or Llama Index.
- The principle behind RAG is that more relevant context leads to better answers.
- RAG systems can use vector range queries (a distance threshold) to ensure that only sufficiently relevant information is retrieved, which helps prevent hallucinations.
- RAG systems offer advantages over fine-tuning, including lower cost, faster processing, better security, real-time updates, and multi-tenancy.
- RAG systems can be used for various applications beyond question answering, such as summarization, customer service, and feature stores.
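To make the runtime flow concrete, here is a minimal RAG sketch in Python. It is illustrative only: it assumes the modern `openai` client, treats `vector_search` as a placeholder for whatever vector database client is in use, and the model names are assumptions rather than the speaker's choices.

```python
# Minimal RAG flow (sketch): embed the question, retrieve context, generate an answer.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def vector_search(query_embedding: list[float], k: int = 3) -> list[str]:
    """Placeholder: return the k most similar document chunks from your vector DB."""
    raise NotImplementedError("replace with your vector database query")


def answer(question: str) -> str:
    # 1. Embed the user question.
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    # 2. Retrieve relevant context from the knowledge base.
    context = "\n\n".join(vector_search(emb))
    # 3. Ask the generative model to answer using only the retrieved context.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

The same pattern extends beyond question answering: swapping the system instruction turns it into a summarizer or a customer-service responder over the same knowledge base.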
Vector Databases
- LLMs are often supplemented by external knowledge bases called vector databases, which use vector similarity search.
- Vector databases perform vector similarity search, which involves calculating the semantic distance between a query and a set of vectors representing sentences or documents.
- This distance is typically calculated with cosine similarity, a computationally cheap operation that allows searches to scale to large collections (a toy calculation follows this list).
- Vector databases can be more effective than traditional keyword methods like BM25 because they account for synonyms and semantic relationships.
- Retrieval augmented generation (RAG) uses a vector database as a knowledge base to provide additional context to a large language model.
- The process involves embedding a user query into a vector, searching the vector database for relevant documents, and then using those documents as context for the language model to generate a response.
- RAG can be used for various tasks, including question answering and summarization.
- Redis is a popular vector database that can be used for RAG applications.
- A typical RAG system pairs a vector database like Redis with a generative model like OpenAI's to supply relevant information to the prompt.
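As a toy illustration of the cosine similarity calculation mentioned above (independent of any particular database), compare a query vector against two document vectors:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 = same direction (semantically close), 0.0 = orthogonal (unrelated)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 3-dimensional vectors; real embeddings have hundreds or thousands of dimensions.
query = np.array([0.9, 0.1, 0.3])
doc1 = np.array([0.8, 0.2, 0.4])
doc2 = np.array([0.1, 0.9, 0.0])
print(cosine_similarity(query, doc1))  # ~0.98 -> doc1 is the better match
print(cosine_similarity(query, doc2))  # ~0.21
```

In practice a vector database does not compare the query against every stored vector; it typically uses approximate nearest-neighbor indexes (such as HNSW) to keep searches fast at scale.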
Data Preparation and Retrieval
- Searching an entire corpus of full-length documents directly is generally not recommended, because relevant passages are hard to surface within large bodies of text.
- Instead, it is more effective to create summaries of the text, such as abstracts, which can be used for semantic search.
- This approach involves two separate vector searches: one to retrieve relevant summaries and another to retrieve specific document chunks from those summaries.
- Another approach is to create embeddings for individual sentences and their surrounding context, which can improve retrieval accuracy but requires more computational resources.
- Data preparation is crucial for RAG systems, as the quality of the retrieved information directly impacts the quality of the generated response.
- Hybrid querying, which combines vector search with other methods, can be used to filter results based on specific criteria, such as author or document type.
- Hybrid search combines keyword frequency (BM25) and vector search to improve retrieval accuracy.
- It allows for filtering documents based on specific criteria like author, user, customer, or geographic location.
- LangChain integrates with Redis to enable hybrid search, allowing users to filter documents based on metadata like category and year (see the sketch after this list).
- This approach enables users to perform complex searches using Boolean operators and SQL-like expressions.
- The system calculates relevancy scores based on cosine distance, ensuring that only documents with a certain level of semantic similarity to the query are returned.
- This enhances retrieval accuracy and improves the performance of LLMs.
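Below is a hedged sketch of such a hybrid query against Redis, using the `redis-py` search API directly rather than through LangChain: it pre-filters on tag and numeric metadata fields, ranks the survivors by vector distance, and applies a distance cutoff so that only sufficiently similar documents are returned. The index name `docs_idx`, the field names, and the 0.3 cutoff are illustrative assumptions, and an index with those fields is assumed to already exist.

```python
import numpy as np
import redis
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)


def hybrid_search(query_vec: np.ndarray, category: str, year: int, k: int = 5):
    # Pre-filter on metadata (tag and numeric fields), then rank by vector distance.
    q = (
        Query(
            f"(@category:{{{category}}} @year:[{year} {year}])"
            f"=>[KNN {k} @embedding $vec AS score]"
        )
        .sort_by("score")
        .return_fields("title", "score")
        .dialect(2)
    )
    params = {"vec": query_vec.astype(np.float32).tobytes()}
    results = r.ft("docs_idx").search(q, query_params=params)
    # Keep only hits under a cosine-distance cutoff (smaller distance = more similar).
    return [doc for doc in results.docs if float(doc.score) < 0.3]
```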
Updating Vector Embeddings
- Three approaches for updating vector embeddings are discussed: document level records, context level records, and hot swapping the index.
- Document level records use the last modified timestamp to determine whether a document has been updated (a sketch of this check follows the list).
- Context level records track changes within a document, which can be computationally expensive for large datasets.
- Hot swapping the index involves replacing the entire index with a new one.
- The speaker recommends using document level records for large datasets and context level records for smaller datasets with strict boundaries.
- When updating a Frequently Asked Questions (FAQ) document, for example, it is not necessary to re-embed the entire document on every change; doing so can be costly in embedding compute.
- Hot swapping updates an index by building a new index and switching the alias from the old index to the new one. This is the most expensive method but can be useful for correcting errors.
- Metadata should be stored alongside the vectors rather than in a separate system, so it can be returned in the same vector search; this can significantly improve performance.
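A minimal sketch of the document-level strategy described above: compare each document's last-modified timestamp with the timestamp recorded when it was last embedded, and only re-embed what has changed. `embed` and `upsert` are placeholders for the embedding model and the vector database client.

```python
import time


def refresh_index(documents, last_indexed: dict, embed, upsert) -> None:
    """Document-level update sketch.

    `documents` yields (doc_id, text, last_modified) tuples; `last_indexed`
    maps doc_id -> timestamp of the previous embedding run; `embed` and
    `upsert` stand in for the embedding model and the vector database client.
    """
    for doc_id, text, last_modified in documents:
        if last_indexed.get(doc_id, 0.0) >= last_modified:
            continue  # unchanged since it was last embedded: skip the embedding cost
        upsert(doc_id, embed(text))
        last_indexed[doc_id] = time.time()
```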
Feature Injection and Semantic Caching
- Feature injection allows for the inclusion of rapidly updating information, such as user addresses or sensor data, into the prompt. This can be achieved using an online feature store.
- Query results can be cached in a vector database (semantic caching), preventing the need to recompute the same answer for similar queries (a minimal sketch follows this list).
- Semantic caching can be used to significantly improve the performance of retrieval augmented generation (RAG) systems.
- Semantic caching involves returning a cached answer if the query is semantically similar enough to a pre-populated answer.
- This approach can save money and increase query per second (QPS) performance.
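A brute-force sketch of semantic caching: if a new query's embedding is close enough to a previously answered query, return the cached answer instead of calling the LLM. The 0.92 threshold and the linear scan are illustrative; in a real system the cache lookup would itself be a vector database query.

```python
import numpy as np


class SemanticCache:
    """Return a cached answer when a query is semantically close to a cached one."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, query_vec: np.ndarray) -> str | None:
        for cached_vec, answer in self.entries:
            sim = float(
                np.dot(query_vec, cached_vec)
                / (np.linalg.norm(query_vec) * np.linalg.norm(cached_vec))
            )
            if sim >= self.threshold:
                return answer  # cache hit: skip the expensive generative call
        return None  # cache miss: caller computes the answer and stores it

    def put(self, query_vec: np.ndarray, answer: str) -> None:
        self.entries.append((query_vec, answer))
```

The calling code checks `cache.get(query_vec)` before invoking the generative model and calls `cache.put(query_vec, answer)` after a miss, which is where the cost and QPS savings come from.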
Orchestration and Data Pipelines
- The speaker recommends using an orchestration framework, such as LangChain or Llama Index, to integrate vector databases, feature stores, and LLM APIs into the system.
- The speaker also mentions the importance of data pipelines for updating embeddings in RAG systems.
- The speaker acknowledges that reinforcement learning from human feedback (RLHF) is a crucial aspect of these systems but was not covered in the presentation.
- The speaker suggests that companies should consider using existing libraries like LangChain or Llama Index instead of building their own orchestration systems.
- The speaker highlights the importance of using vector databases to create indexes from relational databases for RAG systems.
- The speaker emphasizes that there is a lot of valuable information and functionality baked into libraries like LangChain and Llama Index.
- The speaker encourages viewers to explore examples and resources provided by Redis Ventures and to follow the speaker on Twitter for updates.
RAG Agents and Freshness
- The speaker discusses the use of vector databases in retrieval augmented generation (RAG) and how they can be used to provide contextual information to large language models (LLMs).
- The speaker mentions that LangChain is a popular tool for building RAG applications, but that some developers are moving towards lower-level vector database clients for more control.
- The speaker also discusses the concept of "agents" in RAG, which are LLMs that can interact with external tools and databases.
- The speaker notes that while agents can be powerful, they can also be fragile and prone to errors, especially when allowed to access the internet.
- The speaker suggests that a "bounding box" should be placed around agents to limit their access to information and prevent them from making mistakes.
- The speaker also discusses the importance of freshness in RAG and suggests using a numerical recency score to rank documents in a vector database.
- The speaker notes that this recency score could be generated by a separate service and stored in the vector database.
- The speaker acknowledges that using a numerical recency score may limit the search to a specific range, but suggests that other methods, such as sorting on demand, could be used to address this issue.
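One way to blend recency with semantic similarity without restricting the search to a fixed range is to re-rank the retrieved hits. This is a sketch, not the speaker's implementation; the half-life and weighting are arbitrary tuning knobs.

```python
import time


def rank_with_recency(hits, half_life_days: float = 30.0, recency_weight: float = 0.3):
    """Re-rank hits by a blend of similarity and recency (sketch).

    `hits` is a list of dicts with `similarity` (0..1, higher is better) and
    `published_at` (a Unix timestamp); the half-life and weight are arbitrary.
    """
    now = time.time()

    def blended(hit):
        age_days = (now - hit["published_at"]) / 86400.0
        recency = 0.5 ** (age_days / half_life_days)  # exponential decay toward 0
        return (1.0 - recency_weight) * hit["similarity"] + recency_weight * recency

    return sorted(hits, key=blended, reverse=True)
```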
Other Sorting Methods and SQL Database Integration
- The speaker was asked about other ways to sort search results besides freshness.
- The speaker acknowledged that recency is important, especially for news applications.
- A question was asked about using a large language model (LLM) to generate a query against a structured data store like a SQL database.
- The speaker explained that this approach is sometimes exposed as a "tool" the LLM can invoke to trigger a query against the SQL database.
- The speaker stated that this approach is often used to feed data into a graph or plot for a dashboard.
- The speaker indicated that this approach is not commonly used as a standalone method.
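A hedged text-to-SQL sketch of that pattern: the LLM is shown the schema, writes a query, and the resulting rows can be fed into a chart on a dashboard. The model name, schema string, and database path are illustrative assumptions, and generated SQL should be validated (read-only, allow-listed tables) before it is executed against real data.

```python
import sqlite3

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set


def ask_dashboard(question: str, schema: str, db_path: str = "analytics.db"):
    """Have the LLM write a SQL query over a known schema, run it, return the rows."""
    prompt = (
        f"Schema:\n{schema}\n\n"
        f"Write a single SQLite query (no explanation) that answers: {question}"
    )
    sql = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    # Strip Markdown code fences the model sometimes adds around generated SQL.
    sql = sql.strip("`").removeprefix("sql").strip()
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()
```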
Fake Answers and the HyDE Technique
- RedisVL is a command-line tool for inspecting Redis index schemas.
- Fake answers can be used to look up context in a Q&A system: an LLM generates a fake answer that guides the search for relevant information.
- This technique, called HyDE (Hypothetical Document Embeddings), uses an LLM to generate a fake answer that is semantically similar to a real one; the fake answer is then embedded and used for vector search (a sketch appears at the end of this section).
- This technique can be helpful when a traditional vector search fails to retrieve relevant context.
- The HyDE approach can be implemented using tools like Llama Index and LangChain.
- A practical example of the HyDE approach is a hotel recommendation system that uses user input to generate a fake review embodying the desired positive and negative qualities.
- The system then uses semantic search to recommend a hotel based on the generated review and user preferences.
- The prompt engineering used in this example includes telling the LLM "you're not that smart" to encourage it to generate reviews that are more similar to the existing data set.
- The speaker built a system that allows users to interact with a large language model (LLM) to read reviews and generate desired information.
- The system retrieves context and saves it in a state variable, allowing users to refresh and search again.
- Metadata like hotel name, state, city, and address can be stored, enabling features like directions.
- The speaker emphasizes the importance of maintaining the underlying text dataset, since the source material is unstructured data.
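A minimal HyDE sketch, assuming the modern `openai` client and a placeholder `vector_search` function: generate a hypothetical answer, embed it, and use that embedding to retrieve real documents that resemble it.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set


def hyde_search(question: str, vector_search, k: int = 5):
    """HyDE sketch: embed a hypothetical (fake) answer instead of the raw question.

    `vector_search` is a placeholder for your vector database query; the model
    names are assumptions.
    """
    # 1. Ask the LLM to invent a plausible answer. It does not need to be correct;
    #    it only needs to *look like* the documents we want to retrieve.
    fake_answer = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": f"Write a short passage that answers: {question}"}
        ],
    ).choices[0].message.content
    # 2. Embed the fake answer and use it as the search vector.
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=fake_answer
    ).data[0].embedding
    # 3. Retrieve real documents that are semantically close to the fake answer.
    return vector_search(emb, k)
```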