Courtney Nash Discusses Incident Management, Automation, and the VOID Report

03 Oct 2024

InfoQ Dev Conference and VOID Report Introduction

The InfoQ Dev conference in Boston will feature senior software practitioners sharing their experiences on critical topics such as generative AI, security, and modern web applications, with plenty of time for attendees to connect with peers and speakers at social events (37s).
Courtney Nash is the author of the recent VOID report and a longtime contributor to the incident management space, having delivered a talk on the topic at QCon New York last year (1m1s).
The VOID report explores the unintended consequences of automation in software, including the role of AI, and provides a call to action for the community to share incident management data (2m24s).
Courtney Nash is the founder of the VOID and has a background as an editor at O'Reilly, where she worked on the Head First series of books and the SRE book (1m37s).
The VOID was started in 2021 as a research program to collect examples of incidents in the wild, using public incident reports as a starting point (2m44s).
The report aims to provide a comprehensive analysis of incidents and their causes, and to encourage the community to share their own incident data to facilitate learning and improvement (1m19s).
The discussion covers a wide range of topics, including DORA metrics, working with socio-technical systems, and comparing different approaches to incident management (1m21s).
Courtney Nash's goal is to create a shared understanding of incidents and their causes, and to encourage the community to work together to improve incident management practices (1m31s).
A collection of over 3,000 incident reports was gathered, highlighting the lack of a centralized library for incident reports in the industry, unlike in aviation where anonymized incident reports have been shared to improve safety (3m45s).
This collection has grown to over 10,000 public incident reports, including write-ups from industry professionals, news articles, and tweets, which are used to tackle broad research goals (4m40s).
Research was conducted to dispel myths about incident response, management, and analysis, using real data rather than relying on attitudes and perceptions (5m10s).
A study on mean time to recovery (MTTR) was inspired by the work of an engineer at Google; it revealed wide variability in MTTR across incidents and challenged the assumption that a lower MTTR indicates greater resilience or reliability (5m22s).
The findings on MTTR were shared with the DORA metrics team, leading to a shift in the metrics they encourage people to use (5m53s).
Research was also conducted on root cause analysis, coinciding with Microsoft Azure's shift away from this approach in favor of post-incident reviews and publishing postmortem analyses (6m16s).
The Microsoft Azure team now publishes video postmortems with customers and executives, indicating a change in approach to incident analysis (6m42s).
The time is ripe to explore automation in incident management, with the VOID report participating in this shift (7m4s).

VOID Report Research and Findings

A qualitative methodology, specifically thematic analysis, was used to examine the role of automation in incidents, analyzing around 10,000 reports and narrowing it down to 200 incidents that showed some indication of automation involvement (7m43s).
The analysis revealed that automation plays multiple roles in incidents, sometimes within the same incident: it can detect the problem, contribute to it, or help resolve it, and it often makes the incident harder to resolve (9m30s).
The research found that humans have to intervene in about 75% of incidents involving automation to resolve or deal with the issue (10m12s).
The study's findings contradict the common narrative that automation will free humans from tasks they are bad at or do not want to do, instead highlighting the importance of human intervention in complex systems (10m32s).
Research from other domains, such as aviation, healthcare, and nuclear power plant systems, supports the idea that automation can be a challenging team player in complex systems (10m38s).
The VOID report's analysis of automation in incidents is limited by the data available, as public incident reports only provide a partial view of the incident, and internal discussions may reveal more information (8m18s).
The study used keyword searches to identify incidents involving automation, initially yielding around 5,600 incidents, which were then narrowed down to 200 (7m52s).
The analysis developed codes to identify similar phenomena in the text data, which were then revised and revisited to develop themes or categories of automation's role in incidents (8m49s).
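The two steps described above, a keyword search followed by manual coding, can be sketched roughly as follows. This is a minimal illustration only: the keyword list, the report excerpts, and the code labels are all hypothetical, since the report's actual search terms and codebook are not given in this summary.

```python
from collections import Counter

# Hypothetical keyword list; the report's real search terms are not stated here.
AUTOMATION_KEYWORDS = ("automation", "automated", "automatically", "auto-scaling")

# Hypothetical incident report excerpts, invented for illustration.
reports = [
    {"id": 1, "text": "An automated failover moved traffic to a degraded replica."},
    {"id": 2, "text": "A certificate expired and requests began to fail."},
    {"id": 3, "text": "Monitoring automatically paged the on-call engineer."},
    {"id": 4, "text": "Auto-scaling removed healthy nodes during the rollback."},
]


def mentions_automation(text):
    """Return True if any keyword appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in AUTOMATION_KEYWORDS)


# Step 1: the keyword search narrows the corpus to candidate reports.
candidates = [r for r in reports if mentions_automation(r["text"])]

# Step 2: a human reader assigns codes to each candidate. These labels are
# made up for this sketch; in thematic analysis they are revised repeatedly.
manual_codes = {
    1: ["contributed to problem"],
    3: ["detection"],
    4: ["contributed to problem", "hindered resolution"],
}

# Counting codes across reports hints at which themes recur.
code_counts = Counter(
    code for r in candidates for code in manual_codes.get(r["id"], [])
)

print([r["id"] for r in candidates])
print(code_counts.most_common())
```

The keyword match only selects candidates. The interpretive work of assigning and revising codes stays with a human reader, which is why a report can carry more than one code.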
The research highlights the varied roles automation plays in incidents, which can be categorized into archetypes, and emphasizes the importance of considering these complexities when designing and implementing automation systems (9m26s).
The prevalent mental model is that automation is good at certain things and humans are good at others, but this model does not accurately reflect how complex systems work (11m0s).
Research has shown that there are "ironies of automation," a concept introduced by Lisanne Bainbridge in the 1980s, which highlights the paradoxes that arise when humans and automation interact (11m39s).
The idea is not to eliminate automation, but to readjust mental models and rethink how systems are built to help humans, rather than replace them (12m1s).
The VOID report suggests retiring MTTR (Mean Time To Recovery) as a metric, as it can be misleading, and instead focusing on more nuanced approaches to incident management (12m24s).
Some organizations are moving away from shortsighted approaches like root cause analysis (RCA) and towards more comprehensive methods (12m31s).
The criticism that some people may not be ready to move on from MTTR or RCA is acknowledged, but it is argued that averages can be misleading and that organizations should focus on more nuanced approaches (13m58s).
The work of DORA (DevOps Research and Assessment) is praised for its contributions to the industry, particularly in highlighting the importance of developer experience and culture (13m6s).
The use of data science and actual data can help organizations better understand the distribution of their metrics and make more informed decisions (13m41s).
Many organizations deal with data that is not normally distributed, so measures of central tendency such as the mean, median, or mode are hard to interpret because the data is often skewed (14m18s).
Large organizations with thousands of incidents per year may be able to derive some value from their data, such as Cloudflare, Google, and Microsoft, which publish their incident data (14m39s).
Noisy and skewed data require a large sample size or transformations, such as log transformations, to make sense of them, but these transformations can make the data meaningless to others in the organization (15m18s).
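The effect of skew on averages, and what a log transformation does, can be illustrated with a small simulation. This is a sketch under stated assumptions: the durations are drawn from a log-normal distribution with arbitrary parameters and are not data from the VOID.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Simulated incident durations in minutes. The log-normal shape and its
# parameters are assumptions chosen only to produce right-skewed data.
durations = [rng.lognormvariate(4.0, 1.2) for _ in range(500)]

mean_minutes = statistics.mean(durations)
median_minutes = statistics.median(durations)

# Log-transform, average, then transform back: this is the geometric mean.
# It is far less sensitive to a few very long incidents than the plain mean.
log_durations = [math.log(d) for d in durations]
geometric_mean_minutes = math.exp(statistics.mean(log_durations))

# Share of incidents that were shorter than the "average" incident.
share_below_mean = sum(d < mean_minutes for d in durations) / len(durations)

print(f"mean:           {mean_minutes:.0f} min")
print(f"median:         {median_minutes:.0f} min")
print(f"geometric mean: {geometric_mean_minutes:.0f} min")
print(f"share of incidents shorter than the mean: {share_below_mean:.0%}")
```

With right-skewed data, a handful of long incidents pulls the mean above the median, so most incidents are shorter than the reported average. The geometric mean is more stable, but it is also harder to explain to people outside the analysis, which is the tradeoff described above.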
Assigning numbers to complex systems can be challenging, and using metrics like MTTR (Mean Time To Recovery) can be misleading and may not provide meaningful insights (15m43s).
Tying incentives to metrics like MTTR can lead to unintended behaviors, for example when engineers and incident response teams are assigned OKRs (Objectives and Key Results) that affect their bonuses (16m5s).
Making metrics a target can change people's behaviors, often leading to negative consequences, as illustrated by an XKCD cartoon (16m41s).
The focus on metrics and targets can negatively impact the people responsible for making systems work, particularly those at the "sharp end" of these systems (17m6s).

Challenging Traditional Incident Management Metrics

Research often focuses on studying incidents and failures, but normal work is not studied as much, despite its importance in understanding how systems function (17m15s).
Studying normal work and tradeoff decisions in incident management can provide valuable insights, as seen in research by Dr. Laura Maguire (17m38s).
Research is being conducted to study normal work in high-pressure and high-tempo situations to understand what information incident responders and software engineers need to collect to help managers and senior leadership make decisions (17m58s).
The focus is on understanding what works and what is successful, rather than just analyzing failures, and recognizing the socio-technical side of systems, including the roles people play in making them work (18m44s).
There is a tendency to focus on technology and architecture, but it's essential to consider both the social and technical aspects and how they interact (19m11s).
Automation is beneficial for companies, and high-performing organizations are correlated with mature practices in this space, although there is currently no data to prove this (19m27s).
Investing in learning from incidents and having dedicated incident analysis roles can give organizations a competitive advantage and increase the confidence of engineers in handling and understanding their systems (19m50s).
A preliminary survey was conducted as part of the 2024 report to understand what people are actually doing in the space, as the current understanding is skewed towards those at the cutting edge of incident analysis (20m42s).
The Learning from Incidents community, started by Nora Jones, has a large Slack community, and the perception of what people are doing is influenced by interactions with those who are personally and intellectually at the cutting edge of this field (20m57s).
Incident analysis is crucial for organizations, and investing in it can lead to higher confidence in feeding information back into the organization in useful and beneficial ways, with dedicated roles, executive support, and funded programs being key factors (21m47s).
Organizations that invest in incident analysis may have certain organizational and competitive advantages, but more data is needed to prove this (22m34s).
Research on incident management is focused on people, as they are essential to this field, and more people are needed to participate in this research to provide data and help improve practices (22m56s).
Digital services are mission-critical to the world, and participating in this work, sharing experiences, and publicly discussing incidents can help improve their resilience and reliability (23m30s).
A call to action is made for people to participate in research and share their experiences to help improve incident management practices, with the promise that it is not just marketing or surveys, but actual research (23m22s).
Automation has multiple roles in incidents, and understanding these roles can help improve incident management, with archetypes such as the Sentinel, Gremlin, Meddler, Unreliable Narrator, Spectator, and Action Item being identified (24m36s).
The VOID report and other research aim to provide insights into incident management and improve practices, with the goal of making digital services more resilient and reliable (24m11s).

Automation in Complex Systems

The concept of automation can be complex and nuanced, with different definitions and interpretations, but a common understanding is that automation involves a computer performing tasks instead of a human, often in a repetitive and faster manner (25m38s).
The definition of automation often includes a third layer, which is the notion that it can perform tasks better than humans, but this is where the concept can become problematic (26m12s).
In complex systems, automation can fail in unexpected and surprising ways, which can be confusing and difficult to diagnose (26m40s).
Research by Lisanne Bainbridge and others has focused on the ironies of automation and the concept of automation surprises, highlighting the challenges of designing automation that can cope with complex and unpredictable systems (27m1s).
Complex systems are defined as those that cannot be fully modeled or understood by one person, and are characterized by non-linear relationships between inputs and outputs (27m16s).
The concept of automation originated from the idea of assembly lines and linear systems, where tasks can be easily modeled and predicted, but this does not always translate to complex systems (27m35s).
In complex systems, automation failures can be difficult to diagnose and may not fail in expected ways, making it challenging to design and implement effective automation solutions (28m0s).
The design of automation systems in complex environments requires careful consideration of the potential failure modes and the limitations of automation in handling unexpected events (28m10s).
Automation can make complex systems harder to understand and manage, especially when things go wrong, as it can be difficult to introspect and understand what the automated system is doing (28m17s).
This issue is not unique to IT and is also seen in other domains such as medicine, healthcare, and aviation, where automation can make it harder for humans to deal with problems when they arise (29m26s).
The mental model of automation needs to be rethought, and it should be designed as a joint player or team member, rather than just a tool, to make it easier to work with and understand (29m38s).
To achieve this, it's recommended to focus on making automation more introspectable and easier to understand, and to prioritize developer UX for tooling and incident management (30m2s).
The amount of time and money spent on product UX should also be invested in developer UX for complex systems, to make it easier for developers to work with these systems (30m11s).
Automation in simple linear systems is different from automation in complex systems, and they should not be treated the same (30m42s).
While AI may be able to help with explainability and understandability, it's recommended not to jump into AI solutions too quickly, but rather to focus on getting the basics of automation right first (31m29s).
The argument that AI will make things better and smarter is not necessarily true, as humans are still responsible for making AI systems, and the basics of automation need to be understood before moving on to more complex solutions (31m42s).
There's a logical black hole in understanding how AI can diagnose complex systems better than humans, especially when the AI hasn't been given access to all the necessary inputs, and this is a major challenge in incident management and automation (31m47s).
The idea that AI can model an entire system is flawed, and it's essential to acknowledge that AI needs to know where to look for information to be effective (32m21s).
The hype surrounding autonomous cars and generative AI is similar, and it's based on the same misconceptions about automation and complex systems, which can lead to unrealistic expectations (33m16s).

Improving Incident Management through Joint Cognitive Systems

It's crucial for organizations to focus on foundational automation and understanding complex systems before implementing AI solutions, rather than relying on AI to recommend solutions that may not work (33m54s).
Learning from incidents and introspecting system failures can help organizations identify pain points and improve response times, making it easier for engineers and incident responders to do their jobs (34m2s).
Providing better tools for incident responders and engineers is essential for organizations to improve their incident management capabilities (34m45s).
The concept of joint cognitive systems and making automation a team player is a challenging area, with 10 key challenges identified in the VOID report, including the need for a deeper understanding of agent theory and complex systems (35m1s).
The challenges in incident management can be addressed by leveraging human factors research and user interface expertise to improve the joint cognitive systems that work together to achieve a common goal (35m38s).
Joint cognitive systems research originated from the fields of aviation and surgical environments, where the collaboration between humans and machines was critical to preventing harm (36m4s).
The concept of joint cognitive systems involves anthropomorphizing computers to create a team-like environment where humans and machines work together to achieve a common goal (36m33s).
This approach requires updating the mental model from "computers are better at" and "humans are better at" to "we are a team trying to achieve this common goal" (36m56s).
To achieve this, it's essential to have better tooling, introspective systems, and user experience (UX) experts who can design internal tools that facilitate collaboration between humans and machines (37m11s).
The idealized version of a joint cognitive system is exemplified by the Iron Man suit, which represents a system that can communicate with its user and provide real-time information (37m43s).
Designing such a system requires a deep understanding of human factors, user experience, and the ability to create a system that can guide the user's attention and summarize information (38m21s).
Implementing this approach requires significant work, investment, and a shift in corporate priorities, which can be challenging, especially in the current economic climate (38m37s).

Apples and Volkswagens: The Problem with Aggregate Incident Metrics

A recording titled "Comparing Apples and Volkswagens: The Problem with Aggregate Incident Metrics" is available online, discussing the issue of using metrics that don't accurately represent the reality of one's experience (39m3s).
The title is inspired by the speaker's late mother, a sociologist who used the metaphor of comparing apples to Volkswagens to convey that different things can't be directly compared (39m17s).
The recording explores whether duration and severity of incidents are related, finding no statistical correlation between the two (40m13s).
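One way to check for such a relationship in your own incident data is a rank correlation, which suits skewed durations and ordinal severity levels. The sketch below computes Spearman's rho by hand. The incident values are invented for illustration, and this is not necessarily the method used in the recording.

```python
import math


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical incidents: duration in minutes and severity (1 = most severe).
durations = [12, 45, 240, 30, 600, 90, 15, 180]
severities = [2, 3, 1, 1, 3, 2, 3, 2]

rho = spearman(durations, severities)
print(f"Spearman's rho: {rho:.2f}")
```

A rho near zero suggests no monotonic relationship between the two. With a sample this small the value says little on its own, so a real analysis would also need a significance test and far more incidents.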
The speaker aims to empower people at the sharp end of incident response to have more meaningful conversations with those at the blunt end by providing data-driven insights (40m57s).
The speaker is collecting incident data and invites people to submit their own incidents through a simple form or by reaching out through LinkedIn or the website's contact form (41m19s).
A larger survey will be fielded this year to gather more data on the experiences and effectiveness of incident responders and on-call teams (41m51s).
People can sign up for the newsletter to receive updates on the survey and other related information (42m8s).
The conversation closes with the host thanking the guest for her time, noting that there was a lot to cover, and promising to link to the relevant reports and other materials (42m16s).