Generative Search: Practical Advice for Retrieval Augmented Generation (RAG)

08 Aug 2024

Large Language Models (LLMs) and Their Challenges

  • Large language models (LLMs) are used in many applications, including summarization and question answering.
  • LLMs have several challenges, including cost, quality, performance, and security.
  • LLMs are expensive to run, especially generative models.
  • LLMs often generate incorrect information, commonly referred to as "hallucinations."
  • LLMs have low query per second (QPS) rates, which can create performance bottlenecks.
  • LLMs can pose security risks when used with internal and external data.

Retrieval Augmented Generation (RAG)

  • Retrieval Augmented Generation (RAG) is a technique that aims to address the challenges of LLMs by rethinking the data strategy.
  • RAG systems are often used to create private, ChatGPT-style assistants that are grounded in internal knowledge bases rather than retrained on them.
  • Fine-tuning LLMs on internal data can improve their performance but can also lead to security issues.
  • RAG systems can use vector databases to store and retrieve relevant information at runtime, which can improve performance and reduce costs.
  • RAG systems can be chained together using tools like LangChain, Relevance AI, or LlamaIndex.
  • The principle behind RAG is that more relevant context leads to better answers.
  • RAG systems can use range queries so that only sufficiently similar information is retrieved, which helps prevent hallucinations.
  • RAG systems offer advantages over fine-tuning, including lower cost, faster processing, better security, real-time updates, and multi-tenancy.
  • RAG systems can be used for various applications beyond question answering, such as summarization, customer service, and feature stores.

Vector Databases

  • LLMs are often supplemented with external knowledge bases called vector databases, which use vector similarity search.
  • Vector databases perform vector similarity search, which involves calculating the semantic distance between a query and a set of vectors representing sentences or documents.
  • This distance is calculated using cosine similarity, a computationally efficient operation that allows for large-scale searches.
  • Vector search can be more effective than traditional keyword methods like BM25 because it accounts for synonyms and semantic relationships.
  • Retrieval augmented generation (RAG) uses a vector database as a knowledge base to provide additional context to a large language model.
  • The process involves embedding the user query into a vector, searching the vector database for relevant documents, and then passing those documents to the language model as context for generating a response (a minimal sketch of this flow follows the list).
  • RAG can be used for various tasks, including question answering and summarization.
  • Redis is a popular vector database that can be used for RAG applications.
  • RAG systems pair a vector database like Redis with a generative model, such as one from OpenAI, to supply relevant information to the prompt.
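
The following is a minimal, framework-free sketch of that flow. The `query_vec`, `doc_vecs`, and `docs` inputs are assumed to come from whatever embedding model and corpus you use; they are illustrative, not part of the talk:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Semantic closeness between two embeddings; 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray, doc_vecs: list[np.ndarray], docs: list[str], k: int = 3) -> list[str]:
    # Rank every stored chunk by similarity to the query embedding and keep the top k.
    scored = sorted(zip(docs, doc_vecs),
                    key=lambda pair: cosine_similarity(query_vec, pair[1]),
                    reverse=True)
    return [doc for doc, _ in scored[:k]]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    # The retrieved chunks become the context the generative model answers from.
    context = "\n\n".join(context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```

In practice the brute-force loop is replaced by an approximate nearest-neighbor index inside a vector database such as Redis.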

Data Preparation and Retrieval

  • It is generally not effective to run semantic search over an entire corpus of raw text, because relevant passages are hard to surface within a large dataset.
  • Instead, it is more effective to create summaries of the text, such as abstracts, which can be used for semantic search.
  • This approach involves two separate vector searches: one to retrieve relevant summaries and another to retrieve specific document chunks from those summaries.
  • Another approach is to create embeddings for individual sentences and their surrounding context, which can improve retrieval accuracy but requires more computational resources.
  • Data preparation is crucial for RAG systems, as the quality of the retrieved information directly impacts the quality of the generated response.
  • Hybrid querying, which combines vector search with other methods, can be used to filter results based on specific criteria, such as author or document type.
  • Hybrid search combines keyword frequency (BM25) and vector search to improve retrieval accuracy.
  • It allows for filtering documents based on specific criteria like author, user, customer, or geographic location.
  • LangChain integrates with Redis to enable hybrid search, allowing users to filter documents based on metadata like category and year.
  • This approach enables users to perform complex searches using Boolean operators and SQL-like expressions.
  • The system calculates relevancy scores based on cosine distance, ensuring that only documents with a certain level of semantic similarity to the query are returned.
  • This enhances retrieval accuracy and improves the quality of LLM responses (a minimal sketch of the filter-plus-threshold pattern follows the list).
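
A hedged sketch of the filter-plus-threshold idea, independent of any specific client library. The `Doc` record, the `author`/`year` fields, and the distance threshold are illustrative assumptions, not the talk's code:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Doc:
    text: str
    vector: np.ndarray
    author: str
    year: int

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_search(query_vec: np.ndarray, docs: list[Doc], *,
                  author: str | None = None, year: int | None = None,
                  max_distance: float = 0.25, k: int = 5) -> list[Doc]:
    # 1) Metadata filter: narrow the candidate set before any vector math.
    candidates = [d for d in docs
                  if (author is None or d.author == author)
                  and (year is None or d.year == year)]
    # 2) Range query: only keep chunks within a cosine-distance threshold,
    #    so weakly related text never reaches the LLM.
    scored = [(cosine_distance(query_vec, d.vector), d) for d in candidates
              if cosine_distance(query_vec, d.vector) <= max_distance]
    return [d for _, d in sorted(scored, key=lambda pair: pair[0])[:k]]
```

Real deployments push both steps into the vector database (for example, Redis filter expressions combined with a KNN or range query) rather than doing them in Python.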

Updating Vector Embeddings

  • Three approaches for updating vector embeddings are discussed: document-level records, context-level records, and hot swapping the index.
  • Document-level records use the last-modified timestamp to determine whether a document has been updated.
  • Context-level records track changes within a document, which can be computationally expensive for large datasets.
  • Hot swapping the index involves replacing the entire index with a new one.
  • The speaker recommends document-level records for large datasets and context-level records for smaller datasets with strict boundaries.
  • When updating a Frequently Asked Questions (FAQ) document, for example, it is not necessary to re-embed the entire document on every change; creating embeddings is costly.
  • Hot swapping updates an index by building a new index and then switching the alias over to it. This is the most expensive method but can be useful for correcting errors (see the sketch after this list).
  • Metadata should not be stored separately from vectors. Instead, it should be retrieved along with the vectors during a vector search. This can significantly improve performance.
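
A rough sketch of the document-level and hot-swapping strategies. The `index` object and `embed` function stand in for whatever vector database client and embedding model are in use; their methods are hypothetical placeholders, not a specific library's API:

```python
import time

def sync_changed_documents(documents: list[dict], index, embed, last_sync_ts: float) -> float:
    # Document-level records: re-embed only documents whose last-modified
    # timestamp is newer than the previous sync, instead of the whole corpus.
    for doc in documents:
        if doc["last_modified"] > last_sync_ts:
            index.upsert(id=doc["id"], vector=embed(doc["text"]), metadata=doc["metadata"])
    return time.time()  # becomes last_sync_ts for the next run

def hot_swap(alias: str, old_index, new_index) -> None:
    # Hot swapping: build a complete replacement index offline, then move the
    # alias so readers switch over at once; drop the old index afterwards.
    new_index.set_alias(alias)
    old_index.drop()
```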

Feature Injection and Semantic Caching

  • Feature injection allows for the inclusion of rapidly updating information, such as user addresses or sensor data, into the prompt. This can be achieved using an online feature store.
  • Semantic caching stores the results of past queries in a vector database, avoiding the need to recompute the same answer for semantically similar queries.
  • Semantic caching can be used to significantly improve the performance of retrieval augmented generation (RAG) systems.
  • Semantic caching involves returning a cached answer if the query is semantically similar enough to a pre-populated answer.
  • This approach can save money and increase query-per-second (QPS) throughput (a minimal in-memory sketch follows the list).
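
A minimal in-memory sketch of the semantic-cache idea. The 0.9 threshold and the linear scan are illustrative; a real system would keep the cached query embeddings in the vector database itself:

```python
import numpy as np

class SemanticCache:
    """Return a stored answer when a new query is close enough to one already answered."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (query embedding, answer)

    def get(self, query_vec: np.ndarray) -> str | None:
        for vec, answer in self.entries:
            sim = float(np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
            if sim >= self.threshold:
                return answer  # cache hit: skip the generative call entirely
        return None

    def put(self, query_vec: np.ndarray, answer: str) -> None:
        self.entries.append((query_vec, answer))
```

Every cache hit avoids one generative call, which is where the cost and QPS gains come from.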

Orchestration and Data Pipelines

  • The speaker recommends using an orchestration framework, such as LangChain or LlamaIndex, to integrate vector databases, feature stores, and RAG APIs into the system.
  • The speaker also mentions the importance of data pipelines for updating embeddings in RAG systems.
  • The speaker acknowledges that reinforcement learning from human feedback (RLHF) is a crucial aspect of RAG systems but was not covered in the presentation.
  • The speaker suggests that companies should consider using existing libraries like LangChain or LlamaIndex instead of building their own orchestration systems.
  • The speaker highlights the importance of using vector databases to create indexes from relational databases for RAG systems.
  • The speaker emphasizes that there is a lot of valuable information and functionality baked into libraries like LangChain and LlamaIndex.
  • The speaker encourages viewers to explore examples and resources provided by Redis Ventures and to follow the speaker on Twitter for updates.

RAG Agents and Freshness

  • The speaker discusses the use of vector databases in retrieval augmented generation (RAG) and how they can be used to provide contextual information to large language models (LLMs).
  • The speaker mentions that LangChain is a popular tool for building RAG applications, but that some developers are moving towards lower-level vector database clients for more control.
  • The speaker also discusses the concept of "agents" in RAG, which are LLMs that can interact with external tools and databases.
  • The speaker notes that while agents can be powerful, they can also be fragile and prone to errors, especially when allowed to access the internet.
  • The speaker suggests that a bounding box should be placed around agents to limit their access to information and prevent them from making mistakes.
  • The speaker also discusses the importance of freshness in RAG and suggests using a numerical recency score to rank documents in a vector database.
  • The speaker notes that this recency score could be generated by a separate service and stored in the vector database.
  • The speaker acknowledges that using a numerical recency score may limit the search to a specific range, but suggests that other methods, such as sorting on demand, could be used to address this issue (a sketch of blending similarity with a recency score follows the list).
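
One way to picture the recency idea is a decaying freshness score, computed by a separate service and stored next to each vector, then blended with similarity at ranking time. The half-life and the weighting below are illustrative assumptions, not values from the talk:

```python
import time

def freshness_score(published_ts: float, half_life_days: float = 30.0) -> float:
    # Exponential decay: a document loses half its freshness every half_life_days.
    age_days = (time.time() - published_ts) / 86_400
    return 0.5 ** (age_days / half_life_days)

def rank(results: list[dict], alpha: float = 0.8) -> list[dict]:
    # Blend semantic similarity with freshness instead of hard-limiting the
    # search to a date range; alpha controls how much similarity dominates.
    return sorted(results,
                  key=lambda r: alpha * r["similarity"] + (1 - alpha) * r["freshness"],
                  reverse=True)
```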

Other Sorting Methods and SQL Database Integration

  • The speaker was asked about other ways to sort search results besides freshness.
  • The speaker acknowledged that recency is important, especially for news applications.
  • A question was asked about using a large language model (LLM) to generate a query against a structured data store like a SQL database.
  • The speaker explained that this approach is sometimes exposed as a tool that triggers a query against a SQL database.
  • The speaker stated that this approach is often used to feed data into a graph or plot for a dashboard.
  • The speaker indicated that this approach is not commonly used as a standalone method (a hedged sketch of the pattern follows the list).
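
A hedged sketch of that pattern, with a made-up table and placeholder `llm` and `db` objects, not the talk's code:

```python
def answer_with_sql(question: str, llm, db) -> list[tuple]:
    # The LLM acts as a text-to-SQL tool: it writes the query, the structured
    # store executes it, and the rows typically feed a chart on a dashboard.
    sql = llm(
        "Write one SQL query over the table sales(region, month, revenue) "
        f"that answers: {question}. Return only the SQL."
    )
    return db.execute(sql).fetchall()
```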

Fake Answers and the HyDE Technique

  • RedisVL is a Python client library for Redis vector search that includes a command-line interface for inspecting index schemas.
  • Fake answers can be used to look up context in a Q&A system, where an LLM generates a fake answer to guide the search for relevant information.
  • A technique called HyDE (Hypothetical Document Embeddings) uses an LLM to generate a fake answer that is semantically similar to a real one; this fake answer is then embedded and used for the vector search (a minimal sketch follows the list).
  • This technique can be helpful when a traditional vector search fails to retrieve relevant context.
  • The HyDE approach can be implemented using tools like LlamaIndex and LangChain.
  • A practical example of the HyDE approach is a hotel recommendation system that uses user input to generate a fake review embodying the desired positive and negative qualities.
  • The system then uses semantic search to recommend a hotel based on the generated review and user preferences.
  • The prompt engineering used in this example includes telling the LLM "you're not that smart" to encourage it to generate reviews that are more similar to the existing data set.
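
A minimal sketch of the HyDE flow, with `llm`, `embed`, and `index` as placeholders for whatever model and vector database client are in use; these interfaces are hypothetical, not LlamaIndex's or LangChain's actual API:

```python
def hyde_retrieve(question: str, llm, embed, index, k: int = 5) -> list[str]:
    # 1) Ask the LLM for a plausible (possibly wrong) answer to the question.
    fake_answer = llm(f"Write a short passage that answers: {question}")
    # 2) Embed the fake answer rather than the raw question; it usually lies
    #    closer to real answer passages in embedding space.
    fake_vec = embed(fake_answer)
    # 3) Retrieve real documents near the fake answer.
    return index.search(vector=fake_vec, k=k)
```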

System for Reading Reviews and Generating Information

  • The speaker built a system that allows users to interact with a large language model (LLM) to read reviews and generate desired information.
  • The system retrieves context and saves it in a state variable, allowing users to refresh and search again.
  • Metadata like hotel name, state, city, and address can be stored, enabling features like directions.
  • The speaker emphasizes the importance of maintaining the underlying text dataset, since review text is unstructured data.
