Building agentic systems with VS Code extensibility and GitHub Copilot at Uber
Introduction (0s)
- The presentation is given by Sorab Sherti, a product manager at Uber, and his colleague, Mat Ranis, a senior software engineer at Uber, who will discuss their experience in building agentic systems with Visual Studio Code extensibility and GitHub Copilot at Uber (11s).
- Uber is a well-known company that operates in over 10,000 cities, does 30 million trips a day, and has a massive code base of tens of millions of lines of code (42s).
- The company faces challenges with its large code base and the constant addition of new code, which led to the development of a tool aimed at one specific problem: authoring tests (51s).
- The presentation will cover a packed agenda, including a live demo, and will also allow time for questions (59s).
- The speakers are excited about the announcements around Copilot extensibility, including the Copilot multi-file edit feature, and believe that their product has the right abstractions in place to take advantage of these features (1m38s).
- Although the product has not been updated to use the new features announced in the past 24 hours, the speakers are confident that the audience can extrapolate from what they will be showing (1m45s).
Problem Statement (1m50s)
- Writing unit tests is extremely important while developing code, but it is a burden on engineers, and it is difficult to make tests complete enough and keep them adhering to company-specific conventions (1m54s).
- Maintenance after a test is written is also often a challenge, and a very complete suite carries a significant maintenance burden (2m15s).
- The process of writing a unit test at Uber involves 29 steps, which is time-consuming and cuts into the engineer's productivity (2m30s).
- Most engineers do not enjoy writing unit tests every day, and the goal is to provide a tool to speed up the process and help arrive at better unit tests (2m45s).
- The initial tooling for writing unit tests consisted of prompt presets that could generate unit tests with Uber conventions, but they were clunky and not very usable in practice (3m28s).
- The initial tooling required engineers to read the documentation, get the prompts, put them in, generate the test, splice them in, test them, and iterate, which led to people not using them very often (3m43s).
- The goal is to provide a better tool to solve the problem of writing unit tests and make the process more efficient and productive (2m55s).
Introducing AutoCover (3m55s)
- AutoCover is the product being iterated on; it uses GitHub Copilot through a chat participant to generate tests, and after some internal looping it produces a table test with edge cases and everything else needed (4m22s).
- An agentic system is a system comprised of multiple steps, each performing a specific task, and these steps are chained together to allow agents to collaborate and specialize in solving problems (4m47s).
- By adding guardrails from deterministic tools like compilers and linters to the output of a probabilistic model, the overall precision and quality of the output can be greatly increased (5m24s).
- The approach involves breaking down a large problem like writing a test into smaller, solvable problems, similar to what a human would do when writing a unit test, and then feeding that into a model to get better results (6m1s).
- The process of writing a unit test involves diving into the code base, understanding how it works, figuring out the state of coverage, trying to write a test, compiling it, and feeding all that back into the system (6m22s).
- The presentation will walk through each of the systems involved in the process, starting with an example of a complex code snippet that is indicative of something that may be seen in a code base (6m43s).
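As a rough illustration of that chaining idea, here is a minimal sketch (not the actual AutoCover implementation) of a probabilistic generation step wrapped in a deterministic guardrail; `llm_generate` is a hypothetical placeholder for a model call, and `go vet` stands in for whatever build tooling the target language uses:

```python
import subprocess
from pathlib import Path

def llm_generate(prompt: str) -> str:
    """Placeholder for a call to a language model (hypothetical; e.g., an internal LLM gateway)."""
    raise NotImplementedError

def run_build(package_dir: str) -> tuple[bool, str]:
    """Deterministic guardrail: ask the Go toolchain whether the candidate test compiles."""
    result = subprocess.run(["go", "vet", "./..."], cwd=package_dir, capture_output=True, text=True)
    return result.returncode == 0, result.stderr

def generate_with_guardrails(prompt: str, test_path: Path, max_iters: int = 3) -> str:
    """Chain a probabilistic step with a deterministic check, feeding errors back into the model."""
    candidate = llm_generate(prompt)
    for _ in range(max_iters):
        test_path.write_text(candidate)
        ok, errors = run_build(str(test_path.parent))
        if ok:
            break
        # The compiler output becomes extra context for the next attempt.
        candidate = llm_generate(f"{prompt}\n\nFix these build errors:\n{errors}")
    return candidate
```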
How does it work? (6m59s)
- The example code used for the demonstration is a piece of source code that parses resource identifiers, such as URIs; it lives in a util folder in prod, with one Go file and no existing tests. (7m0s)
- With no test files, the code's coverage status can be inferred to be zero, but as soon as at least one test case is added, the coverage status can no longer be inferred and must be calculated. (7m45s)
- AutoCover is introduced as a tool that can be dropped into such a code base to generate coverage, starting with the preparer agent, which sets up the environment for the rest of the agents to operate properly. (8m2s)
- The preparer agent performs a setup step: creating the test file if needed, scaffolding it out for some languages, and collecting context such as surrounding files and summaries of imports. (8m10s)
- The preparer agent also performs other setup tasks, such as regenerating mocks, to prevent issues in subsequent steps. (8m48s)
- The preparer agent's tasks can be done deterministically, and this is a key aspect of the system that will be referred to later. (9m3s)
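A minimal sketch of the kind of deterministic setup the preparer performs, assuming a single Go source file; the scaffold contents, helper name, and context fields are illustrative assumptions, not the actual implementation:

```python
from pathlib import Path

GO_TEST_SCAFFOLD = """package {package}

// Scaffolded by the preparer; the generator fills in test cases below.
"""

def prepare(source_file: Path, package: str) -> dict:
    """Deterministic setup: make sure a test file exists and gather context for later agents."""
    test_file = source_file.with_name(source_file.stem + "_test.go")
    if not test_file.exists():
        test_file.write_text(GO_TEST_SCAFFOLD.format(package=package))

    # Collect lightweight context: the source under test and its sibling files.
    siblings = [p.name for p in source_file.parent.glob("*.go") if p != source_file]
    return {
        "source": source_file.read_text(),
        "test_file": str(test_file),
        "sibling_files": siblings,
    }
```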
- The generator is the first probabilistic agent, which uses the collected intelligence to generate test cases for the uncovered lines of code, producing candidate test cases that can be refactored later. (9m21s)
- The generator takes the target area and generates test cases, which are individual and atomic units that can be deterministically checked against build tooling and coverage tooling. (9m33s)
- The initial phase of the process generates test cases along with setup and teardown sequences for each specific test case, which are then packaged and placed in shared memory for other agents to use (10m32s).
- The test generation itself is similar to the "/test" command in Copilot, which generates a test; everything else around it is machinery to aid it and produce better results (10m41s).
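One plausible shape for the shared memory the agents pass around, sketched as plain dataclasses; the field names and status values are assumptions for illustration, not the actual AutoCover schema:

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    """One atomic candidate test case that later agents can build, run, fix, and refactor."""
    name: str
    code: str                    # the generated test function
    setup: str = ""              # per-case setup sequence
    teardown: str = ""           # per-case teardown sequence
    status: str = "generated"    # e.g. generated / build_failed / test_failed / passed / validation_failed
    feedback: list[str] = field(default_factory=list)  # tool or critic output for the next agent

@dataclass
class SharedMemory:
    """State passed between agents; each agent reads what it needs and appends its results."""
    target_file: str
    context: dict
    candidates: list[TestCase] = field(default_factory=list)
```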
- The executor is a deterministic agent that does a lot of work to determine whether the test cases are initially good or not; it runs tooling to make sure dependencies are imported and the file is syntactically correct (11m7s).
- The executor tries out the test cases one by one, and if a test case fails, the feedback is included in the shared memory for future agents to use (11m39s).
- Building the executor requires knowledge of all languages and frameworks used in the organization, and it can be challenging to consolidate and prepare the organization for this (12m16s).
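A hedged sketch of what "trying out the test cases one by one" could look like for Go, using the standard `go test -run` flag to execute a single test and returning the raw tool output as feedback; the function name and return shape are illustrative:

```python
import subprocess
from pathlib import Path

def execute_case(package_dir: Path, test_name: str) -> tuple[bool, str]:
    """Deterministically run a single candidate test and capture tool output as feedback."""
    result = subprocess.run(
        ["go", "test", "-run", f"^{test_name}$", "./..."],
        cwd=package_dir,
        capture_output=True,
        text=True,
        timeout=120,
    )
    passed = result.returncode == 0
    # Build or test failures come back verbatim so the fixer has something concrete to act on.
    return passed, result.stdout + result.stderr
```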
- The results of the executor are then passed to the fixer, a probabilistic step whose specific purpose is fixing failed test cases (13m4s).
- The fixer sends parallel prompts or calls to the language model (LM) for each test case, giving it a larger context window to elaborate and produce better results (13m26s).
- The process is built in LangGraph, which provides a nice abstraction for doing these types of things (13m38s).
- Fixer is a tool that addresses issues in test cases by analyzing compilation errors and making the necessary changes to get the test cases to work properly (13m58s).
- Fixer can collaborate with the executor to mark candidate test cases for re-execution, as the initial fix may not be valid, and it may take multiple iterations to get it right (14m32s).
- The system is a complex state machine that manages the flow of test cases, and it's not just a linear flow through the system (14m51s).
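The talk names LangGraph as the orchestration layer, so here is a minimal LangGraph sketch of the non-linear state machine described above, with stub nodes and a conditional edge that loops failed candidates back through the fixer; the state fields, status strings, and iteration cap are assumptions for illustration, not the actual AutoCover graph:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AutoCoverState(TypedDict):
    candidates: list   # candidate test cases with their latest status and feedback
    iterations: int

def stub(state: AutoCoverState) -> AutoCoverState:
    # Each real agent (preparer, generator, executor, fixer, refactor, validator)
    # would read the shared state and write its results back here.
    return state

def after_execution(state: AutoCoverState) -> str:
    # Route failed candidates back through the fixer for another attempt,
    # otherwise continue to refactoring.
    failed = any(c.get("status") in ("build_failed", "test_failed") for c in state["candidates"])
    return "fixer" if failed and state["iterations"] < 3 else "refactor"

graph = StateGraph(AutoCoverState)
for name in ("preparer", "generator", "executor", "fixer", "refactor", "validator"):
    graph.add_node(name, stub)

graph.set_entry_point("preparer")
graph.add_edge("preparer", "generator")
graph.add_edge("generator", "executor")
graph.add_conditional_edges("executor", after_execution)  # not a straight line: may loop
graph.add_edge("fixer", "executor")                        # fixed candidates are re-executed
graph.add_edge("refactor", "validator")
graph.add_edge("validator", END)

app = graph.compile()
```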
- After fixing, test cases are passed to the refactor agent, which ensures that they adhere to Uber's conventions and combines them into larger table tests (15m9s).
- The refactor agent uses a directive that looks at Uber's conventions, which are split up by language and kept modular (15m13s).
- The abstraction used for an agent must support composability of multiple sub-agents, allowing for the integration of different types of tooling, such as syntax tree-based linters (15m42s).
- The refactor agent can integrate multiple different types of tooling, such as gofmt or Black, to reduce verbosity and improve test cases (16m9s).
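A small sketch of the deterministic formatting pass such a refactor step could run after the model's rewrite; which formatters are invoked, and how, is assumed here purely for illustration:

```python
import subprocess
from pathlib import Path

def apply_formatters(test_file: Path) -> None:
    """Deterministic post-processing after the LLM refactor: keep formatting noise out of the result."""
    if test_file.suffix == ".go":
        subprocess.run(["gofmt", "-w", str(test_file)], check=True)
    elif test_file.suffix == ".py":
        subprocess.run(["black", "--quiet", str(test_file)], check=True)
```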
- The validator is the QA component that checks if the generated test case asserts anything and if it asserts the right things (16m31s).
- The validator ensures that the test cases are not just blindly executing code and not asserting anything, which is a big worry for engineers (16m41s).
- The validator is composed of multiple sub-agents, which are applied in turn to validate the test cases (16m56s).
- These sub-agents include a simple deterministic check for assertions, a probabilistic sub-agent (the LLM critic) that validates tests against a list of conventions, and a mutation tester. (17m0s)
- The LLM Critic sub-agent goes through conventions and maps them back to the test, flagging inconsistencies and providing feedback, such as suggesting additional test cases for concurrency functionality. (17m14s)
- The system uses a multi-shot approach to adherence, where a model is asked to generate or refactor a test, and then another model validates whether the first model complied with the style guide. (18m7s)
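To make the two kinds of sub-agents concrete, here is a hedged sketch of a deterministic assertion check and the prompt an LLM critic might be given; the regex patterns and prompt wording are illustrative assumptions, not Uber's actual conventions:

```python
import re

def has_assertions(test_source: str) -> bool:
    """Cheap deterministic sub-agent: flag tests that execute code but never assert anything.
    The patterns below are illustrative for Go tests using testify or t.Errorf-style checks."""
    patterns = [r"\bassert\.", r"\brequire\.", r"\bt\.(Error|Errorf|Fatal|Fatalf)\("]
    return any(re.search(p, test_source) for p in patterns)

def critic_prompt(test_source: str, conventions: list[str]) -> str:
    """Probabilistic sub-agent (LLM critic): ask a second model to map each convention
    back to the generated test and flag anything missing or inconsistent."""
    rules = "\n".join(f"- {c}" for c in conventions)
    return (
        "Review the following test against these conventions and list any violations "
        "or missing cases (for example, concurrency scenarios):\n"
        f"{rules}\n\nTest:\n{test_source}"
    )
```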
- The agent abstraction needs to account for situations where the model cannot infer what's going on and needs to return control to the human, such as when there are insufficient doc comments or the code is written in an obtuse way. (18m43s)
- The system is being developed to add more interaction paradigms and UI elements in the IDE, allowing the model to ask for more context when it doesn't understand the code. (19m5s)
- An example of the validator's feedback is shown, where it suggests adding edge cases to a table test, such as checking for Unicode symbols or empty identifiers. (19m25s)
- The system iterates on the test cases based on the feedback, passing it back to the fixer and executor to arrive at a new iteration of the test case with more comprehensive coverage. (19m50s)
- The mutation testing subcomponent is being developed to further improve the test cases. (20m10s)
- The concept of a mutant is introduced, which refers to making an unintended change to the source code and checking if the test suite catches it, with the goal of identifying potential bugs or shortcomings in the test suite (20m31s).
- An example of a mutant is provided, where the code is changed to parse a string as V1 instead of V2, which is a breaking change that the test suite should catch (21m5s).
- Another example of a mutant is given, where an error is swallowed instead of being returned, which is a contract-breaking change that the test suite should detect (21m31s).
- The mutants that survive the test suite, meaning no test catches them, are bundled up into feedback and used to add more test cases that address the shortcomings found (21m55s).
- The generation of mutants can be done using both deterministic and model-based approaches, such as parsing the syntax tree and inverting if conditions, or using a large language model (LLM) to introduce subtle bugs (22m7s).
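A minimal sketch of the deterministic approach, shown on Python source for brevity (the talk's example target is Go): walk the syntax tree, invert each `if` condition, and then run the mutant against the test suite to see whether anything fails:

```python
import ast

class InvertIfConditions(ast.NodeTransformer):
    """Deterministic mutant generator: negate every `if` condition in the syntax tree."""
    def visit_If(self, node: ast.If) -> ast.If:
        self.generic_visit(node)
        node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
        return node

def make_mutant(source: str) -> str:
    tree = ast.parse(source)
    mutated = InvertIfConditions().visit(tree)
    return ast.unparse(ast.fix_missing_locations(mutated))

original = """
def parse(identifier):
    if identifier.startswith("v2:"):
        return ("v2", identifier[3:])
    return ("v1", identifier)
"""

# A good test suite should fail against this mutant; if it still passes,
# that gap is fed back as a suggestion for additional test cases.
print(make_mutant(original))
```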
- The system uses a complex state machine to generate and test code, and the candidate code travels through each state without the user's knowledge (22m37s).
- A live demo is mentioned, where the execution of AutoCover will be stopped at pre-selected states to show what the model or system was thinking and what the feedback is before the next iteration (23m2s).
Demo (23m21s)
- A sample project is demonstrated, which includes the "bars" and "bars V2" projects, as well as a stringification tool, but lacks tests and has 0% coverage (23m22s).
- The project's source code is highlighted in red, indicating that it hasn't been tested, but some lines are not marked as uncovered because they are just declarations (23m50s).
- A more targeted approach is taken by scoping in on specific functions that need testing, rather than selecting the entire file and generating tests for it (24m22s).
- The process of generating tests is started by invoking the "autocover" command in the chat panel, which opens up a new test file automatically (24m48s).
- The "autocover" command is a fire-and-forget process, but in the future, it is planned to allow users to chat again with the copilot to ask questions, make corrections, or add more scenarios (25m8s).
- A debug viewer is used to show the intermediate state of the test generation process, which is not intended for human consumption (25m30s).
- The first state in the graph is shown, which is the initial state after the test cases have been generated, and the test configuration panel is introduced, which shows the current node, the number of tests generated, and the number of tests needed to cover the code (25m49s).
- The current state of the code is unknown, as it hasn't been run yet, and the test case state is also unknown; the system has generated a table test in a good format to make refactoring easier (26m36s).
- The feedback section is empty because the test case was just generated, and there's no information on whether it passes or builds (27m5s).
- The imports were split up because sometimes interaction with the build system is necessary, such as running the package manager to pull in new dependencies (27m14s).
- After the initial run of the executor, the test functions remain the same, but a new test case state is added, showing that the generated code builds, the test passes, and coverage is raised (27m36s).
- The system currently uses coverage as its metric but plans to expand to other metrics, such as mutation-testing mutant survival rate and branch coverage (27m55s).
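For the coverage metric itself, a hedged sketch of how a Go package's statement coverage could be measured and parsed from the standard toolchain output; the helper name is illustrative:

```python
import re
import subprocess
from pathlib import Path

def measure_coverage(package_dir: Path) -> float:
    """Run the package's tests with coverage enabled and parse the percentage the tool reports."""
    result = subprocess.run(
        ["go", "test", "-cover", "./..."],
        cwd=package_dir, capture_output=True, text=True,
    )
    match = re.search(r"coverage:\s+([\d.]+)% of statements", result.stdout)
    return float(match.group(1)) if match else 0.0
```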
- If a test fails, the system would show a different state with a build or test failure, and provide feedback in the form of output from the test runner or build system (28m28s).
- The system has simplified the feedback to a simple textual field, but it can also handle complex structured data that can be passed back to the model for subsequent generation phases (29m0s).
- After the initial execution, the system passes through a few more steps and reaches validation, where it determines that the new test state is "validation failed", indicating something wrong with the test case (29m26s).
- The system generates test cases and provides feedback on how to improve them, including suggestions for adding edge cases and improving readability (29m34s).
- The feedback is based on various heuristics, such as catching edge cases, improving readability, and using named parameters, which are continuously evolving (30m17s).
- The system also incorporates mutation testing results, which can show up in the feedback if a surviving mutant is found (30m42s).
- The final state of the system is that it outputs a test case that has been iterated upon and improved based on the feedback, with the exact examples suggested in the feedback added to the test case (31m3s).
- The system can be adjusted to iterate a certain number of times, and the feedback is flagged as addressed once it has been incorporated into the test case (31m33s).
- A user of AutoCover would see a populated test file with multiple tests and expansions, including new edge cases that have been added (31m44s).
- The system is able to achieve 100% coverage of the file, with covered parts highlighted in green and potential indicators for tested branches and covered code (32m20s).
- The system also provides metrics on the number of test cases that cover a particular piece of code, which is a new metric being experimented with (32m41s).
- The AutoCover feature can generate around 110 lines of code from just 15 characters typed to invoke it, making it a keystroke-saving and effective tool, with many engineers at Uber using it (33m0s).
- If there is still missing coverage afterwards, it is likely that the code is too complex or that there is a bug in the source, and the uncovered lines arguably should not be covered (33m17s).
- The goal is to expand the feature to support other languages, with Java and Python support currently in development and expected to be productionized at Uber (33m39s).
- A standalone mode is being developed to allow users to invoke features like validation logic and mutation testing without having to fully adopt AI-generated code (33m55s).
- The standalone mode aims to provide value to developers who are not ready to fully trust AI-generated code, but still want to use specific features (34m8s).
- The feature will allow users to get feedback earlier in the development cycle, rather than waiting until submitting a PR (34m21s).
- Mutation testing will be made easier by allowing users to invoke a tool to do it for them, making it more accessible to developers (34m30s).
- A consulter mechanism is being added to integrate with native UI or UX in the IDE, allowing agents to cede control back to humans when needed (34m47s).
- The consulter mechanism will integrate with features like chat boxes, doc comments, and other paradigms to prompt users for input (35m0s).
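LangGraph, named earlier as the orchestration layer, supports this kind of human-in-the-loop pause via interrupts; the sketch below is a toy illustration of a hypothetical consulter node and is not the actual AutoCover mechanism:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    question: str      # what the agent needs the human to clarify
    human_answer: str  # filled in through the IDE (chat box, doc comment prompt, ...)

def generator(state: State) -> State:
    # Stub for an agent that realizes it cannot infer intent and formulates a question.
    return {"question": "What should parse() return for an empty identifier?"}

def consulter(state: State) -> State:
    # Runs only after the human has answered; folds the answer back into the shared state.
    return state

graph = StateGraph(State)
graph.add_node("generator", generator)
graph.add_node("consulter", consulter)
graph.set_entry_point("generator")
graph.add_edge("generator", "consulter")
graph.add_edge("consulter", END)

# Interrupt before the consulter node so control is ceded back to the human.
app = graph.compile(checkpointer=MemorySaver(), interrupt_before=["consulter"])
config = {"configurable": {"thread_id": "autocover-session"}}

app.invoke({"question": "", "human_answer": ""}, config)            # pauses at the interrupt
app.update_state(config, {"human_answer": "Empty identifiers should return an error."})
app.invoke(None, config)                                             # resumes with the answer
```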
- The feature will also integrate with other entry points in the IDE, such as gutter indicators, code lens, and highlighting, to make it easier for users to discover and use (35m25s).