Using your repository for RAG: Learnings from GitHub Copilot Chat
02 Nov 2024
- This session is about using a repository for Retrieval Augmented Generation (RAG) and how it looks in Copilot Chat, as well as the learnings from the experience (10s).
- RAG translates into project awareness in Copilot Chat, which can be achieved through different methods in various platforms, such as VS Code, Visual Studio, and JetBrains (34s).
- The methods include @workspace or intent detection in VS Code, hashtags in Visual Studio, and implicit intent detection in JetBrains (40s).
- The feature is currently being rolled out in JetBrains, and users of the JetBrains extension version 1.5.26 or later may have seen it (56s).
- The presentation will focus on the implementation details for JetBrains, but the general concepts are applicable to all implementations (1m2s).
- The speaker, Kimin Miguel, is a senior software engineer working on Copilot for IDEs, focusing on project context for JetBrains Chat and prompt crafting for code completions (1m22s).
- Kimin Miguel is based in France and enjoys taking care of plants and playing Final Fantasy 14 in their free time (1m32s).
- The agenda is a breakdown of how project context works in JetBrains, explaining the key steps behind it: local indexing, local search, and the final reranking step (1m44s).
Project Context in JB Chat (2m0s)
- Project context is a feature that helps large language models provide more accurate and relevant responses by enriching the prompt with code snippets from the user's codebase, so the model can reference symbols and constructs defined there and give more helpful answers (2m29s).
- The goal of project context is to ground the model with factual data, reducing the likelihood of generic or misleading responses (2m43s).
- Project context can be used through various methods, including the @workspace action, implicit intent detection in Visual Studio, and implicit intent detection in JetBrains (3m1s).
- Initially, the plan for JetBrains was to implement project context through an @project action, but due to low usage numbers, the team shifted to implicit intent detection (3m11s).
- Based on current numbers, about 11.8% of chat answers in JetBrains include project context (3m36s).
- Project context can be enabled or disabled through a checkbox in the chat panel, and users who do not have it enabled yet will receive it after the Universe update (4m17s).
- When project context is enabled, the model takes a little longer to respond, and users can see an info message indicating that the model is collecting relevant project context (5m18s).
- The feature provides references to relevant code snippets from the user's codebase, which can be clicked to open the corresponding code snippet in the code editor (4m50s).
- The demo showcases how project context works in a live scenario, where the user asks a question and the model provides a response with references to relevant code snippets (4m11s).
How does project context work (5m35s)
- Local project context is built using three main building blocks: local indexing, the first ranking pass, and the second ranking pass, which work together to provide relevant snippets from a project to answer a user's question (5m37s).
- The local indexing step starts when a project is opened in an IDE and the extension is activated; it indexes the project in the background in a non-blocking way by tokenizing files with the Microsoft byte pair encoding (BPE) tokenizer and splitting each file into chunks of 500 tokens (6m48s).
- When a user asks a question, the query is processed and the result is used to retrieve a wide set of candidate snippets from the local index built earlier; this first ranking pass uses the BM25 scoring function to rank the snippets (6m5s).
- The top 50 snippets from the first ranking pass are then refined through a second ranking pass, which uses a text-embedding-3 model to return embeddings, or vector representations, of the snippets and the user query, and compares them using cosine similarity (7m52s).
- The top five snippets from the second ranking pass are extracted and included in the prompt, along with references to the original files and selection ranges, and used to generate an answer; a sketch of this prompt-assembly step follows this list (8m26s).
- The local indexing process is done on startup and is a non-blocking process that runs in the background, allowing the user to interact with the project while the indexing is being completed (7m11s).
- The keywords related to the user's question are used to find the most similar snippets in the local index, which are then ranked and selected for inclusion in the prompt (7m32s).
- The entire process is designed to provide relevant project context to the user's question, using a combination of natural language processing and machine learning techniques to generate accurate and helpful responses (6m39s).
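As an illustration of the final step, here is a minimal sketch of how the selected snippets and their file references might be assembled into a prompt. The `Snippet` dataclass, the `build_prompt` function, and the prompt layout are all hypothetical; Copilot's actual prompt format is not described in the talk.

```python
# Hypothetical sketch of prompt assembly: the snippet structure and prompt
# layout below are invented for illustration, not Copilot's real format.
from dataclasses import dataclass

@dataclass
class Snippet:
    path: str        # file the chunk came from
    start_line: int  # selection range, used for the clickable reference
    end_line: int
    text: str        # the chunk text itself

def build_prompt(question: str, snippets: list[Snippet]) -> tuple[str, list[str]]:
    """Return the enriched prompt plus the file references shown to the user."""
    context_blocks, references = [], []
    for s in snippets:  # the five top-ranked snippets from the second pass
        context_blocks.append(f"File: {s.path} (lines {s.start_line}-{s.end_line})\n{s.text}")
        references.append(f"{s.path}:{s.start_line}-{s.end_line}")
    prompt = (
        "Use the following snippets from the user's project to answer.\n\n"
        + "\n\n".join(context_blocks)
        + f"\n\nQuestion: {question}"
    )
    return prompt, references

# Example with a single made-up snippet:
prompt, refs = build_prompt(
    "Where is the retry policy configured?",
    [Snippet("src/http/client.py", 40, 72, "class RetryPolicy: ...")],
)
print(refs)  # ['src/http/client.py:40-72']
```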
What constraints are we operating under? (8m41s)
- The key takeaway from the session is the need to balance engineering and science when implementing a solution, with the goal of achieving a snappy experience that provides relevant information backed by data (8m51s).
- To achieve this balance, it is essential to iterate, experiment, and ultimately compromise between engineering and science (9m8s).
- The process involves considering constraints and key points at each step, starting with local considerations (9m19s).
Local Indexing (9m19s)
- Local indexing does not persist the index, meaning that when a project is closed, everything gets released, and when the project is reopened, everything is reindexed (9m19s).
- Indexing needs to be done in the background to avoid blocking operations, which can be achieved through multi-threading (9m50s).
- A bug was fixed that caused issues with indexing a file with 10,000 nested open arrays, which was taking a minute to index (10m3s).
- Files specified in .gitignore and excluded in the IDE are not indexed, and node_modules directories and Python environments are also excluded from indexing (10m29s).
- A cap is placed on the number of files and code chunks stored in memory to prevent indexing and ranking from taking too long (10m47s).
- If a project has more than 10,000 files, only partial project context can be offered, and an info message is displayed when project context is invoked on large projects (11m9s).
- The time it takes to index a project depends on the number of files, with smaller repos taking around 5 seconds, medium-sized repos taking around 50 seconds, and large repos taking over 2 minutes (11m34s).
- Having everything in memory allows for listening to file system notifications and reindexing files quickly to keep the freshest information in the codebase (12m7s).
- The indexing algorithm used is fixed-size chunking with 500 tokens, as sketched below, but alternative methods like semantic chunking could be explored for better results, although they may come with a performance hit (12m47s).
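A minimal sketch of this fixed-size chunking might look like the following. It uses the `tiktoken` tokenizer as a stand-in for the BPE tokenizer the extension actually uses, hard-codes a few of the exclusions mentioned above instead of reading .gitignore and IDE settings, and leaves out the background threading and file-system-notification reindexing described in this section.

```python
# Sketch of fixed-size 500-token chunking with a file cap and directory
# exclusions; tiktoken stands in for the real BPE tokenizer.
import os
import tiktoken

CHUNK_TOKENS = 500
MAX_FILES = 10_000                      # beyond this, only partial context is offered
EXCLUDED_DIRS = {".git", "node_modules", ".venv", "venv", "__pycache__"}

enc = tiktoken.get_encoding("cl100k_base")

def chunk_file(path: str) -> list[dict]:
    """Split one file into consecutive 500-token chunks."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        tokens = enc.encode(f.read(), disallowed_special=())
    return [
        {"path": path, "text": enc.decode(tokens[i:i + CHUNK_TOKENS])}
        for i in range(0, len(tokens), CHUNK_TOKENS)
    ]

def build_index(root: str) -> list[dict]:
    """Walk the project, skip excluded directories, and chunk each file in memory."""
    index, seen_files = [], 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in filenames:
            if seen_files >= MAX_FILES:
                return index            # cap reached: the index stays partial
            index.extend(chunk_file(os.path.join(dirpath, name)))
            seen_files += 1
    return index
```

Keeping the chunks in memory is what makes reindexing on file system notifications cheap, at the cost of reindexing from scratch whenever the project is reopened.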
First ranking pass (13m31s)
- The first ranking pass involves extracting keywords from the user question and using the BM25 scoring function to score documents, in this case code snippets, against those keywords (13m36s).
- BM25 is a scoring function that weights each keyword by how often it appears in a document, normalized by document length, and by how few documents contain that keyword; a minimal sketch follows this list (13m55s).
- Two key questions were how to extract keywords from a natural language question and how to make it fast, with the solution being to leverage an LLM (Large Language Model) to do the work (14m14s).
- The LLM used was GPT-3.5 Turbo or GPT-4o mini, hosted on Azure OpenAI, to extract the most relevant keywords, synonyms, and variations (14m29s).
- Testing showed that GPT-4o mini returned more generic results than GPT-3.5 Turbo, highlighting the need for A/B experimentation when swapping models (14m52s).
- The keyword and synonym request is independent of the repository size and takes between 600 milliseconds and 1.5 seconds, depending on the query complexity (15m14s).
- The processing time for the scoring logic using BM25 scales with the number of index chunks, with search times ranging from 900 milliseconds for a small repository to 10 seconds for a large one (15m36s).
- To optimize performance, only a subset of snippets is passed to the second ranking step, with the current approach selecting the top 47 highest-scored snippets (16m29s).
- The fixed snippet count may not work for larger repositories, and experimenting with scaling the number of snippets to pass to the second ranking pass is being considered (16m48s).
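The sketch below illustrates the scoring side of this pass. The BM25 formula is the standard Okapi variant with conventional parameter values (k1 = 1.5, b = 0.75) rather than Copilot's actual settings, the "documents" are the indexed chunks, and the keyword list is hard-coded where the real implementation gets it from GPT-3.5 Turbo or GPT-4o mini.

```python
# Minimal Okapi BM25 ranking of indexed chunks against LLM-extracted keywords.
# The keywords are hard-coded here; in the real flow they come from an LLM call.
import math
from collections import Counter

K1, B = 1.5, 0.75  # conventional BM25 parameters, not Copilot's actual values

def bm25_rank(chunks: list[str], keywords: list[str], top_k: int = 50) -> list[tuple[float, str]]:
    docs = [chunk.lower().split() for chunk in chunks]
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / max(n_docs, 1)
    df = {kw: sum(1 for d in docs if kw in d) for kw in keywords}  # document frequency

    def score(doc: list[str]) -> float:
        freqs = Counter(doc)
        total = 0.0
        for kw in keywords:
            f = freqs[kw]                         # keyword frequency in this chunk
            if f == 0:
                continue
            idf = math.log((n_docs - df[kw] + 0.5) / (df[kw] + 0.5) + 1)
            total += idf * f * (K1 + 1) / (f + K1 * (1 - B + B * len(doc) / avgdl))
        return total

    ranked = sorted(((score(d), c) for d, c in zip(docs, chunks)), key=lambda pair: -pair[0])
    return ranked[:top_k]

# Keywords that would normally come from the keyword/synonym request described above.
print(bm25_rank(
    ["read the config file and parse the settings", "open an http connection to the server"],
    ["config", "parse", "settings"],
    top_k=2,
))
```

Only the top-scored snippets (currently 47) would then move on to the embedding-based second pass.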
Second ranking pass and final prompt (17m2s)
- The process involves taking the 47 snippets and the user query and sending them to a text-embedding-3 model hosted on Azure OpenAI, which returns a vector for each string (17m3s).
- The vectorized snippets are then compared to the vectorized user query using cosine similarity, and the top five results are selected; a minimal sketch of this step appears below (17m19s).
- There are two levers that can be adjusted in this process: the amount of data sent to the embedding model and the vector size (17m23s).
- Increasing the amount of data sent to the model increases the time it takes to process, with 48 strings taking around 4.5 seconds (17m38s).
- The vector size can also be adjusted, with the default size being 1,536 dimensions, but reducing the dimensions can result in an accuracy cost (18m3s).
- A compromise was reached by using 1,024 dimensions, which speeds up the process to around 4-4.5 seconds without significantly impacting the ranking output (19m1s).
- The top five snippets are selected and included in the prompt, with some exceptions, such as files under content exclusion policies (19m18s).
- References to the files and code snippets included in the prompt are displayed to the user (19m51s).
- The process involves some waiting time, and an info message is displayed to inform users that responses may take a little longer (20m19s).
- Users can opt out of project context or add file references directly to the prompt if they know which files are useful (20m27s).
- A file picker can be used to select files and add them directly to the context that will be sent as part of the prompt (20m54s).
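A minimal sketch of this second ranking pass is shown below. It assumes the OpenAI Python client and the text-embedding-3-small model with `dimensions=1024`; the talk only says a text-embedding-3 model hosted on Azure OpenAI is used, so treat the client, model variant, and request batching as assumptions.

```python
# Sketch of the second ranking pass: embed the query plus all candidate
# snippets, then keep the five most similar by cosine similarity.
# Assumes the OpenAI Python client and text-embedding-3-small; the real
# service endpoint, model variant, and batching may differ.
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rerank(question: str, snippets: list[str], top_k: int = 5) -> list[str]:
    # One request covering the question plus the 47 candidates (48 strings total).
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=[question] + snippets,
        dimensions=1024,  # reduced from the default 1536, as discussed above
    )
    vectors = [item.embedding for item in response.data]
    query_vec, snippet_vecs = vectors[0], vectors[1:]
    scored = sorted(
        zip(snippets, snippet_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:top_k]]
```

With 1,024 dimensions, the talk puts this step at around 4-4.5 seconds for the 48 strings.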
- Building blocks for project context include local indexing, local search, and reranking using embeddings, which are essential components for creating a functional system (21m46s).
- A balance between engineering and science is necessary; there is a healthy tension between them that feeds both sides, so the impact of any change needs to be considered on both (21m54s).
- Having a baseline to compare against is crucial, as any change is not necessarily a good change, allowing for evaluation and improvement (22m8s).
- The needs and goals of the users should not be forgotten, as the ultimate objective is to create a helpful and successful product that enables users to accomplish more during their day (22m16s).