How to measure the effectiveness of GitHub Copilot
13 Nov 2024
Introduction (0s)
- This session is led by Kitty Trueu, a Senior Developer Architect, and Mickey, a Staff Developer Architect, both from GitHub's Customer Success Strategy team, who work with customers daily to solve complex problems (12s).
- The session aims to provide three key takeaways: productivity insights from the Copilot platform, insights from the Copilot Metrics API, and suggestions on how to survey developers without causing fatigue (53s).
- The session is motivated by a common scenario: a developer requests approval for a GitHub Copilot subscription, but the boss wants to know what value it will bring to the business and what numbers can justify the expense (1m33s).
- The session focuses on providing numbers and insights to help make a business case for GitHub Copilot, going beyond the hype surrounding AI tools so that attendees understand exactly what they are measuring (2m49s).
- The presenters will provide a demo and offer suggestions on how to measure the effectiveness of GitHub Copilot, with the goal of helping attendees show the value of the tool to their business and make a case for its adoption (3m10s).
Why we measure (3m17s)
- When modernizing Engineering Systems, it's essential to build a Competitive Edge for the business by constantly building competency around people, process, and technology (3m33s).
- The "people" aspect involves investing in the team to increase productivity, ensuring they have the right tools, skills, and communication, as well as adequate feedback and mental space to work effectively (3m57s).
- The "process" aspect refers to the orchestration of activities to achieve a goal, and leaders should constantly challenge the status quo and remove red tape to improve efficiency (4m26s).
- The quote from James Clear's "Atomic Habits" emphasizes that systems need to evolve to achieve goals, and leaders should focus on building systems that support their objectives (4m42s).
- The "technology" aspect involves using tools to enable productivity, and investing in GitHub Copilot is not just about individual productivity but also about increasing system efficiency (5m9s).
- The ultimate goal of investing in GitHub Copilot is to ship faster and reduce rework (5m27s).
- To measure the effectiveness of GitHub Copilot, it's essential to establish a baseline of the engineering system and start measuring now, as you can't manage what you don't measure (6m25s).
- GitHub users have access to metrics and statistics that can provide insights into their platform's productivity, and it's crucial to have a holistic set of signals from people, process, and technology to measure performance (6m50s).
- Engineering systems may have performance signals from various dimensions, including high-quality code, which is a delight to both developers and the business (7m39s).
- A developer's achievement is writing quality code, while a business's goal is avoiding the cost of delay and rework and adapting to market dynamics (7m45s).
- High-quality code is characterized by being readable, reusable, maintainable, concise, resilient, and secure, which are the quality attributes that developers aim to write into their code (8m0s).
- Developer happiness is about how satisfied developers are with their tasks, whether they can focus, feel challenged, and are engaged, as well as their mental energy when working from home (8m21s).
- Velocity in engineering systems is not just about speed at a single point in time; it encompasses delivery throughput, pace, friction, waste, delays, and manual validation (8m46s).
- The goal is to build things right and build the right things, ensuring that modernizing work and effort enable and accelerate engineering system success, making a positive business impact and return on investment (9m23s).
- When defining the purpose of using a tool like GitHub Copilot, considerations should include building quality code, developer happiness, velocity in engineering systems, and making a positive business impact (9m49s).
What we measure (9m52s)
- When implementing GitHub Copilot, organizations typically go through four phases: adoption, activity, satisfaction, and impact, with the adoption phase focusing on ensuring users are actually using the product (10m21s).
- To measure adoption, an API can be used to track basic usage information, such as the number of people using Copilot every day (10m54s).
- The activity phase involves analyzing how users are utilizing Copilot's features, such as chat suggestions and auto-completion, which can be pulled from the API (11m26s).
- Acceptance rate is a key metric, calculated by dividing the number of accepted suggestions by the total number of suggestions made, with a 30% acceptance rate considered good (11m45s).
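The acceptance-rate calculation described above can be sketched in a few lines. This is a minimal illustration, assuming the JSON field names used by the Copilot usage API (total suggestion and acceptance counts per day); check the current API documentation for the exact schema:

```python
def acceptance_rate(day: dict) -> float:
    """Accepted suggestions divided by total suggestions, as a percentage."""
    total = day["total_suggestions_count"]
    if total == 0:
        return 0.0
    return 100.0 * day["total_acceptances_count"] / total

# One day of metrics shaped like the API payload (illustrative numbers).
sample = {"total_suggestions_count": 1200, "total_acceptances_count": 390}
print(f"{acceptance_rate(sample):.1f}%")  # 32.5% -- above the ~30% benchmark
```

The zero-suggestions guard matters in practice: a newly onboarded team can have days with no Copilot activity at all.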
- Low acceptance rates often indicate that users need training or guidance on how to effectively use Copilot (12m21s).
- The length of these phases can vary, lasting from weeks to months, and cannot be rushed (13m2s).
- The satisfaction phase involves assessing whether developers are happy using the tool, feel satisfied, and believe it helps them solve problems and get into a flow state (13m18s).
- Measuring the effectiveness of GitHub Copilot ties into the developer experience, but there isn't an API available for this, making surveys a viable solution (13m29s).
- When creating surveys, it's essential to keep them short, as respondents are more likely to complete a four-question survey than a 20-question one (13m46s).
- The ultimate goal is to determine the downstream impact of using GitHub Copilot, such as whether it improves software development speed and quality (14m0s).
- Various metrics can be used to calculate the downstream impact, some of which can be obtained through the GitHub API, while others may require data from other systems (14m11s).
- GitHub-specific metrics that can be used include average pull request merge frequency, which can help determine if GitHub Copilot is leading to shorter, smaller commits and faster development (14m41s).
- When working with customers to adopt GitHub Copilot, it's essential to take a phased approach and not rush the process, as changing people's behavior takes time (15m0s).
- It's crucial to understand that people's behavior, not just machine performance, is a critical factor in measuring the effectiveness of GitHub Copilot (15m19s).
- A reasonable timeframe for measuring the downstream impact of GitHub Copilot is months, not days or hours (15m28s).
- To measure the effectiveness of GitHub Copilot, the first step is to pull information from the API, which can be done using the GitHub CLI, a tool that encapsulates authentication and makes it easy to interact with GitHub and its APIs (15m33s).
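A minimal sketch of scripting that CLI call from Python. The org-level usage endpoint path is an assumption based on the API available at the time of the talk (GitHub has since introduced a newer metrics endpoint), so verify it against the current REST API docs:

```python
import subprocess

def copilot_usage_cmd(org: str) -> list[str]:
    """Build the gh CLI invocation for the org-level Copilot usage endpoint."""
    return ["gh", "api", f"/orgs/{org}/copilot/usage"]

def fetch_usage(org: str) -> str:
    """Run the gh CLI (which handles authentication) and return the raw JSON."""
    result = subprocess.run(copilot_usage_cmd(org), capture_output=True,
                            text=True, check=True)
    return result.stdout

print(" ".join(copilot_usage_cmd("my-org")))  # gh api /orgs/my-org/copilot/usage
```

Delegating to gh rather than calling the REST API directly means the script inherits whatever authentication the developer has already configured.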
- The API call returns the last 28 days of information, which is a rolling 28-day period, and the data is in JSON format, including metrics such as total suggestion count, total acceptance count, and active user count (16m15s).
- The data can be used to calculate the acceptance rate, which is a key metric for measuring the effectiveness of GitHub Copilot, and can also be broken down by language and IDE (16m52s).
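The language breakdown mentioned above can be aggregated across the rolling 28-day window. This sketch assumes each day in the payload carries a breakdown array with per-language suggestion and acceptance counts; the field names are assumptions modeled on the usage API, not verified:

```python
from collections import defaultdict

def acceptance_by_language(days: list[dict]) -> dict[str, float]:
    """Aggregate suggestions and acceptances per language across all days,
    returning the acceptance rate (%) for each language."""
    totals = defaultdict(lambda: [0, 0])  # language -> [suggested, accepted]
    for day in days:
        for row in day.get("breakdown", []):
            totals[row["language"]][0] += row["suggestions_count"]
            totals[row["language"]][1] += row["acceptances_count"]
    return {lang: round(100.0 * accepted / suggested, 1) if suggested else 0.0
            for lang, (suggested, accepted) in totals.items()}

# Illustrative single-day payload.
days = [{"breakdown": [
    {"language": "python", "suggestions_count": 800, "acceptances_count": 280},
    {"language": "go", "suggestions_count": 400, "acceptances_count": 100},
]}]
print(acceptance_by_language(days))  # {'python': 35.0, 'go': 25.0}
```

A large gap between languages (as in this made-up data) is exactly the kind of signal that points to where training or configuration help is needed.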
- Managers and stakeholders often prefer to see data presented visually, and the data and visualization can be used to provide them with insights and recommendations (17m24s).
- A Power BI dashboard created by GitHub can be used to visualize the data, providing a nicer look at metrics such as engaged users, suggestions made, and lines accepted (17m42s).
- The dashboard provides a better understanding of what's going on in the environment, including acceptance rate, acceptance by language and IDE, and engaged users (18m14s).
- The data can also be used to identify trends and areas for improvement, such as a decline in acceptance rate over time, and to inform decisions about how to improve the use of GitHub Copilot (18m20s).
- As GitHub adds more features to Copilot, more data will be added to the API, allowing for more detailed analysis and visualization (18m42s).
- To measure the effectiveness of GitHub Copilot, surveys can be used to gather information from developers, and there are various options available, including third-party survey engines or a simple email to developers (19m2s).
- A free, open-source project called the Copilot Survey Engine is available, which can be set up with a database backend and uses a GitHub app to open an issue with four questions every time a pull request is closed (19m21s).
- The four questions in the Copilot Survey Engine ask if Copilot helped with the task, how the developer felt about using it, if it saved time, and what was done with the time saved (19m52s).
- Another Copilot developer survey example is available with 10 more detailed questions; the choice of survey depends on the specific needs and understanding of the users (20m11s).
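Once responses come in, they can be tallied into the kind of summary stakeholders ask for. This is an illustrative sketch only: the response keys below are hypothetical, not the survey engine's actual schema:

```python
def summarize_survey(responses: list[dict]) -> dict:
    """Tally 'did Copilot help?' answers and average the reported time saved.
    Response keys (copilot_helped, minutes_saved) are illustrative."""
    helped = sum(1 for r in responses if r["copilot_helped"])
    minutes = [r["minutes_saved"] for r in responses if r["copilot_helped"]]
    return {
        "responses": len(responses),
        "helped_pct": round(100.0 * helped / len(responses), 1),
        "avg_minutes_saved": round(sum(minutes) / len(minutes), 1) if minutes else 0.0,
    }

responses = [
    {"copilot_helped": True, "minutes_saved": 30},
    {"copilot_helped": True, "minutes_saved": 10},
    {"copilot_helped": False, "minutes_saved": 0},
]
print(summarize_survey(responses))
# {'responses': 3, 'helped_pct': 66.7, 'avg_minutes_saved': 20.0}
```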
- It is recommended to start with fewer questions, at least initially, to get developers on board and used to answering the questions seriously (20m32s).
- Downstream metrics, such as deployment frequency and faster merge rates, can be used to measure the effectiveness of GitHub Copilot, and some of this data can be obtained from GitHub using the GitHub API (20m51s).
- A shell script can be used to pull data from GitHub and calculate downstream metrics, such as deployment frequency, and this data can be used to determine if Copilot is helping to solve problems (21m15s).
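The deployment-frequency calculation such a script performs can be sketched as follows, once deployment dates have been extracted (for instance from created_at timestamps in the GitHub deployments API; the data below is made up for illustration):

```python
from datetime import date

def deployment_frequency(deploy_dates: list[date]) -> float:
    """Average deployments per week over the span of the observed dates."""
    if not deploy_dates:
        return 0.0
    span_days = (max(deploy_dates) - min(deploy_dates)).days + 1
    return round(len(deploy_dates) / (span_days / 7), 2)

# Illustrative data: five deployments over a two-week window.
deploys = [date(2024, 10, 1), date(2024, 10, 4), date(2024, 10, 8),
           date(2024, 10, 11), date(2024, 10, 14)]
print(deployment_frequency(deploys))  # 2.5 deployments per week
```

Tracking this number before and after a Copilot rollout, over months rather than days as the presenters recommend, is what turns it into a downstream-impact signal.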
- The discussion revolves around measuring the effectiveness of GitHub Copilot, focusing on key metrics that matter in different phases (22m29s).
- The key takeaways from the presentation include recalling why measuring the effectiveness of GitHub Copilot is important and going beyond the hype to understand the value it brings to the engineering system and the business as a whole (22m51s).
- Measuring the effectiveness of GitHub Copilot is crucial to understand the return on investment and to build a competitive edge for the business, and it's essential to start measuring now if not already done (23m30s).
- To measure effectiveness, consider defining signals from your existing engineering excellence metrics or using the four dimensions presented (23m44s).
- The presentation demonstrated how to measure from the first-party GitHub platform and the Copilot Metrics API, and the importance of integrating third-party system telemetry for a holistic view of the value of the AI investment (23m50s).
- A dashboard was demoed, which will be open-sourced, and the Copilot-generated script and the dashboard link will be posted on the community site for further discussion and questions (24m20s).
- Measuring value is a subjective matter, and the community's input and discussions are encouraged to help define the right metrics and numbers to measure value (25m0s).
- The presentation concluded with a thank you note to the attendees and an invitation to provide feedback through a survey to improve future experiences (25m17s).