Stanford Seminar - Accounting for Human Engagement Behavior to Enhance AI-Assisted Decision Making

22 Oct 2024

Introduction

  • Recent research has focused on enhancing AI-assisted decision-making, as AI technology has made significant progress in the past decade and is now widely applied in various areas, including finance, healthcare, and criminal justice (12s).
  • AI models have been trained to uncover hidden insights from big data, and in decision-making tasks, they can analyze the task, come up with decision recommendations, and present them to human decision-makers to assist in their decision-making (47s).
  • The collaboration between human decision-makers and AI models has transformed decision-making from a human-only task to a joint task, with the hope that they can complement each other's strengths and weaknesses and achieve better decision-making performance (1m14s).
  • However, this kind of complementary performance is rarely observed in reality, so it is essential to understand why, and how AI-assisted decision-making performance can be improved (1m38s).

Improving AI-Assisted Decision-Making

  • To improve AI-assisted decision-making, two steps are essential: obtaining more empirical understanding of how human decision-makers engage with AI decision aids, and leveraging that understanding to improve the design of those aids (1m59s).
  • Empirical research has identified that human trust and reliance on AI are shaped by various performance indicators of the AI model, especially its performance observed in the field (3m30s).
  • Humans' confidence in their own independent judgment also influences how much they are willing to listen to AI's suggestions, and they are more likely to be receptive to AI's decision recommendations when their own decision confidence is relatively low (3m52s).
  • In cases where humans do not observe AI models' performance, it is essential to understand how they decide how much to rely on AI's decision recommendations (4m16s).
  • A human subject experiment was conducted to study how humans gauge the trustworthiness of AI models when performance information is absent, and it was found that humans tend to use the level of agreement between AI model decisions and their own independent judgment to assess trustworthiness (4m20s).
  • In the experiment, human subjects were asked to complete speed dating prediction tasks, where they were given a speed dating profile containing demographic background information and other relevant details, and then asked to make their own independent judgment on whether the participant would be willing to see their date again (4m49s).
  • After making their independent judgment, the subjects were shown an AI model's prediction as a decision recommendation, and then asked to submit their final decision, with the option to follow the AI's suggestion or not (5m44s).

Experimental Design and Methodology

  • A pilot study was conducted to sort decision-making tasks into groups based on the decisions and confidence levels of the human subjects (6m2s).
  • The pilot study found that decision-making tasks fall into four groups: tasks where the majority of subjects make correct decisions with high confidence, tasks where they make wrong decisions with high confidence, tasks where they make correct decisions with low confidence, and tasks where they make wrong decisions with low confidence (6m40s).
  • The formal experimental design involved asking subjects to complete a sequence of 40 decision-making tasks with the help of an AI model, without providing information on the AI model's performance, and the tasks were divided into two phases of 20 tasks each (7m42s).
  • The first 20 tasks in Phase One were selected from the first two categories of tasks identified in the pilot study, which included tasks where the majority of subjects make correct decisions with high confidence and tasks where subjects make wrong decisions with high confidence (8m11s).
  • The experiment involves three experimental treatments that vary in the frequency of AI model decision recommendations agreeing with human majority judgment on 20 tasks in phase one, with agreement levels ranging from 40% to 100% (8m30s).
  • In phase one, human subjects make independent decisions on the same 20 tasks but receive different AI model recommendations based on their assigned treatment, resulting in varying levels of agreement between AI and human judgment (9m10s).
  • The AI model's actual accuracy is kept the same across all treatments to avoid confounding: in the high-agreement treatment the AI's recommendations match the human majority judgment, while in the low-agreement treatment the AI's recommendations are flipped on a balanced mix of tasks where humans are correct and tasks where they are incorrect, so that overall accuracy is unchanged (9m45s).
  • The low-agreement AI model is correct in many of the cases where humans are wrong, making it the model that complements human expertise the most (10m38s).
  • In phase two, the goal is to determine if subjects will trust or rely on AI differently after experiencing varying levels of agreement between AI and human judgment in phase one (11m1s).
  • The 20 tasks in phase two are selected from the third and fourth categories of instances in the pilot study, where humans tend to make independent judgments with low confidence (11m24s).
  • In phase two, subjects in all treatments receive the same AI model recommendations, with the AI's predictions assigned so that its simulated performance is identical across treatments (11m49s).
  • To quantify subjects' trust and reliance on an AI model, researchers asked subjects to report their perception of the AI model at the middle and end of the experiment, and also looked at how frequently human subjects' final decisions matched the AI model's decision recommendations, especially when their initial judgments differed from the AI's recommendations (12m23s).
  • The metrics used to quantify people's reliance on AI were the agreement fraction (how often the final decision matches the AI's recommendation) and the switch fraction (how often, on tasks where the initial judgment differed from the AI, the final decision switched to the AI's recommendation), with higher values indicating more reliance; a sketch of both metrics follows this list (12m57s).
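
The talk does not spell out the formulas for these metrics; the following is a minimal sketch, under common definitions from this literature, of how the agreement and switch fractions could be computed from per-task records of the subject's initial judgment, the AI's recommendation, and the final decision (function names and the toy data are illustrative):

```python
from typing import Sequence

def agreement_fraction(ai_rec: Sequence[int], final: Sequence[int]) -> float:
    """Fraction of all tasks on which the final decision matches the AI's recommendation."""
    return sum(a == f for a, f in zip(ai_rec, final)) / len(final)

def switch_fraction(initial: Sequence[int], ai_rec: Sequence[int], final: Sequence[int]) -> float:
    """Among tasks where the initial human judgment differed from the AI's recommendation,
    the fraction on which the final decision switched to agree with the AI."""
    disagreed = [(a, f) for i, a, f in zip(initial, ai_rec, final) if i != a]
    if not disagreed:
        return 0.0
    return sum(a == f for a, f in disagreed) / len(disagreed)

# Example with 5 binary prediction tasks.
initial = [1, 0, 1, 1, 0]   # subject's independent judgments
ai_rec  = [1, 1, 0, 1, 1]   # AI's decision recommendations
final   = [1, 1, 0, 1, 0]   # subject's final decisions
print(agreement_fraction(ai_rec, final))        # 0.8
print(switch_fraction(initial, ai_rec, final))  # 0.666...
```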

Results and Findings

  • The experiment found that as the level of agreement between the AI model and humans' independent judgments increased, humans generally reported seeing the AI model as more reliable, trustworthy, and understandable, and relied more on its decision recommendations (13m12s).
  • The findings suggest that humans tend to use the level of agreement between AI and their own independent judgments as a heuristic to gauge the trustworthiness of the AI model, especially when they lack performance information about the AI (13m51s).
  • This tendency is related to the human cognitive bias of confirmation bias, where individuals tend to follow recommendations from those who agree with them and ignore those who disagree (14m17s).
  • The study found that humans may exhibit confirmation bias towards AI models that provide decision recommendations, which can lead to risks such as over-trusting AI models' wrong recommendations and under-trusting correct recommendations (14m34s).
  • The results also showed that humans may miss the opportunity to fully realize the potential of human-AI complementary decision-making due to their tendency to under-trust AI models that complement their own expertise (16m7s).

Implications for AI System Design

  • The study's findings have implications for the design of AI-assisted decision-making systems, highlighting the need to consider human cognitive biases and develop strategies to mitigate their effects (16m29s).
  • Human-AI collaboration may be vulnerable to adversarial parties who exploit humans' confirmation bias towards AI to erode their trust in it, and such attacks can be planned strategically with computational models such as hidden Markov models so as to minimize the cost of the attack while maximizing the reduction in human-AI joint decision-making performance (16m55s).
  • Adversaries can use such computational models to characterize how attacks on the AI model influence people's trust in and reliance on it, and to develop strategies for deploying those attacks (17m12s).
  • It is therefore urgent to leverage our understanding of human engagement behavior with AI to defend against such adversaries and secure human-AI collaboration; a hypothetical sketch of the trust-tracking idea follows this list (17m48s).
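
The hidden Markov model formulation is not detailed in the talk; the sketch below is one hypothetical way an attacker might track a decision maker's latent trust with a two-state HMM, where the observation on each task is whether the human follows the AI and the transition dynamics depend on whether that task's recommendation was corrupted (all states, probabilities, and function names are assumptions):

```python
import numpy as np

# Hypothetical two-state HMM of a decision maker's latent trust in the AI.
# Hidden states: 0 = low trust, 1 = high trust.
# The attacker chooses on each task whether to corrupt the AI's recommendation;
# a visibly bad recommendation makes trust more likely to drop.
T_clean  = np.array([[0.90, 0.10],
                     [0.05, 0.95]])   # row = current state, column = next state
T_attack = np.array([[0.98, 0.02],
                     [0.40, 0.60]])
# Emission probabilities: P(human follows the AI | trust state).
# Column 0 = does not follow, column 1 = follows.
E = np.array([[0.70, 0.30],
              [0.20, 0.80]])

def forward_update(belief, attacked, followed):
    """One forward-algorithm step: propagate the trust belief through the
    (attack-dependent) transition, then condition on the observed behavior."""
    T = T_attack if attacked else T_clean
    predicted = belief @ T
    posterior = predicted * E[:, 1 if followed else 0]
    return posterior / posterior.sum()

belief = np.array([0.2, 0.8])  # start out mostly trusting
for attacked, followed in [(False, True), (True, False), (True, False)]:
    belief = forward_update(belief, attacked, followed)
    print(belief)
```

With such a belief tracker, an attacker could search over which tasks to corrupt so as to degrade trust, and thus joint performance, at minimal cost; the same machinery could in principle be used defensively to flag suspicious trust trajectories.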

Estimating and Communicating Correctness Likelihood

  • To improve human-AI team performance in decision-making, humans need to combine their independent judgment with the AI's decision recommendations, but they often do this in ways that fail to play to their strengths or avoid their weaknesses (18m29s).
  • To help humans utilize their strengths and avoid their weaknesses, it is necessary to let them know where their strengths and weaknesses are, especially in comparison to AI, and estimate the likelihood of human and AI decision correctness on individual decision-making tasks (19m4s).
  • Estimating these correctness likelihoods is a practical challenge: on the AI side, calibrated confidence scores can serve as an indication of correctness likelihood, but estimating humans' correctness likelihood on different decision-making tasks in real time is harder (19m50s).
  • Communicating the correctness likelihood estimation to human decision-makers in a way that encourages them to take more rational actions by either following AI or themselves is another challenge (20m25s).
  • There are two main challenges in accounting for human engagement behavior to enhance AI-assisted decision making: estimating humans' correctness likelihood on different decision-making tasks and communicating this information back to the human decision maker (20m39s).
  • Estimating humans' correctness likelihood is complicated, especially when the ground truth answer is unknown, so an important assumption is made that if a person's decision on a similar task is correct, their decision on the current task is more likely to be correct (21m9s).
  • To estimate humans' correctness likelihood, a four-step workflow is developed: human decision makers first complete a small set of decision-making tasks, a decision tree is then learned to characterize their decision-making process, rules are extracted from the decision tree, and those rules are presented back to the decision maker for modification (22m6s).
  • Using the user-modified set of decision rules, the decision maker's predicted decisions on each training instance of the AI model's training data set are computed and checked for correctness (23m1s).
  • To estimate the decision maker's correctness likelihood on a new task instance, training instances similar to the new instance are fetched, and the frequency of correct predicted decisions on those similar instances is used as the correctness likelihood estimate; a sketch of this workflow follows this list (23m21s).
  • The second challenge is communicating this information back to the human decision maker, and one approach is to directly display the estimated correctness likelihood of the human decision maker and the AI model's calibrated confidence score on the interface (24m8s).
  • The direct display approach presents both types of information to the human decision maker and asks them to process the information and react accordingly (24m36s).
  • This direct display is the first intervention: a deliberately straightforward way of communicating the information to users (24m42s).
  • However, this approach may be too blunt, leading to the consideration of more subtle and indirect methods of communication, such as using interfaces that decrease human reliance on AI (25m1s).
  • Previous research has shown that certain interfaces in AI-assisted decision-making processes can decrease human reliance on AI, but this decrease occurs regardless of whether the AI is correct or incorrect (25m30s).
  • Despite this limitation, these interfaces provide opportunities for indirect communication of correctness estimation between humans and AI, allowing for the design of adaptive interfaces that decrease human reliance on AI when necessary (26m1s).
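
As a rough sketch of the estimation workflow under the stated assumption (correctness on similar tasks predicts correctness on the current task), one could fit a decision tree to the person's own judgments, apply the resulting (user-editable) rule set to the AI's training data, and average correctness over the nearest training instances. The use of scikit-learn, the synthetic data, and the choice of k = 10 neighbors are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Steps 1-2: the person completes a small set of tasks, and a shallow decision
# tree is fit to mimic their decision process (synthetic data for illustration).
X_person = rng.normal(size=(30, 5))                                   # task features
y_person = (X_person[:, 0] + 0.5 * X_person[:, 1] > 0).astype(int)    # the person's decisions
human_rules = DecisionTreeClassifier(max_depth=3).fit(X_person, y_person)
# (Step 2b in the talk: the extracted rules are shown to the person for editing;
# here the fitted tree stands in for the user-modified rule set.)

# Step 3: apply the rule set to the AI model's training data and record whether
# each predicted human decision matches the ground-truth label.
X_train = rng.normal(size=(500, 5))
y_train = (X_train[:, 0] + 0.3 * X_train[:, 2] > 0).astype(int)
human_pred_correct = (human_rules.predict(X_train) == y_train).astype(float)

# Step 4: for a new task, fetch similar training instances and use the frequency
# of correct predicted decisions among them as the correctness likelihood.
neighbors = NearestNeighbors(n_neighbors=10).fit(X_train)

def human_correctness_likelihood(x_new: np.ndarray) -> float:
    _, idx = neighbors.kneighbors(x_new.reshape(1, -1))
    return float(human_pred_correct[idx[0]].mean())

print(human_correctness_likelihood(rng.normal(size=5)))
```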

Interventions for Enhanced Decision-Making

  • The second intervention, called adaptive workflow, involves adaptively deciding when to present AI model decision recommendations to human decision-makers based on the likelihood of human or AI correctness (26m48s).
  • If AI is more likely to be correct, the AI's decision recommendation is shown upfront; otherwise, humans are asked to make their own independent judgment before seeing the AI's recommendation (27m12s).
  • The third intervention, called adaptive recommendation, involves adaptively changing whether to provide AI model recommended decisions to humans based on the AI's correctness likelihood (27m46s).
  • If the AI's correctness likelihood is higher, both the AI's explanation and its recommended decision are shown; otherwise, only the AI's explanation is provided (a sketch of this selection logic follows this list) (28m7s).
  • A human subject experiment was conducted to test whether these interventions improve AI-assisted decision-making performance, and the results showed that providing any of the three interventions to human decision makers led to a marginal or significant increase in their final decision-making accuracy (28m53s).
  • The increase in decision-making accuracy was mostly caused by the decrease of humans' overreliance on AI when AI makes a wrong decision recommendation, and the interventions did not significantly affect people's trust and reliance on AI when AI makes correct decision recommendations (29m45s).
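
A minimal sketch of how the three interventions might be driven by the two likelihood estimates: the thresholding rule (present the AI's recommendation when the AI is more likely correct than the human) follows the description above, while the data structure and function names are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Presentation:
    show_human_likelihood: bool      # direct display of the human's estimated correctness likelihood
    show_ai_confidence: bool         # direct display of the AI's calibrated confidence
    ai_recommendation_upfront: bool  # adaptive workflow: show the AI's advice before the human decides
    include_ai_recommendation: bool  # adaptive recommendation: include the AI's decision at all

def choose_presentation(p_human: float, p_ai: float, intervention: str) -> Presentation:
    """Pick what to show on one task, given the estimated correctness likelihoods."""
    if intervention == "direct_display":
        # Show both likelihood estimates and let the person weigh them.
        return Presentation(True, True, True, True)
    if intervention == "adaptive_workflow":
        # Show the AI's recommendation upfront only when the AI is more likely correct;
        # otherwise elicit the human's independent judgment first.
        return Presentation(False, False, p_ai > p_human, True)
    if intervention == "adaptive_recommendation":
        # Provide the AI's recommended decision only when the AI is more likely correct;
        # otherwise show the AI's explanation alone.
        return Presentation(False, False, True, p_ai > p_human)
    raise ValueError(f"unknown intervention: {intervention}")

print(choose_presentation(p_human=0.72, p_ai=0.55, intervention="adaptive_workflow"))
```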

Behavior-Aware AI Models

  • The conventional way to improve AI-assisted decision making is to adaptively change how the AI's recommendation is presented to human decision makers, but an alternative is to retrain the AI model itself to anticipate and accommodate humans' engagement behavior (30m12s).
  • This requires moving away from training behavior-oblivious AI models that optimize for AI's own decision-making performance, and instead training behavior-aware AI models that explicitly account for how humans will react to AI models' decision recommendations (30m53s).
  • A recent study used a human behavior model inspired by empirical understandings of humans' engagement with AI, which assumes that human decision makers will go with their own independent judgment when their confidence in their own judgment is higher than a threshold, and otherwise will go with AI models' decision recommendations (31m31s).
  • The study found that optimizing for human-AI team performance in decision making is effectively equivalent to solving a weighted loss optimization problem, where the weight assigned to each training task instance is one minus the CDF value evaluated at the human's own decision confidence on that instance (see the sketch after this list) (32m10s).
  • The results of the human subject study and simulation showed that when humans collaborate with behavior-aware AI models, they are able to achieve a significantly higher level of final decision-making accuracy than when they collaborate with behavior-oblivious AI models (32m42s).
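
Under the stated behavior model (the human keeps their own judgment when their confidence exceeds a threshold and defers to the AI otherwise), the weight 1 − F(c_i) can be read as the probability that the person will actually listen to the AI on instance i. The sketch below, with a synthetic data set, an assumed uniform threshold distribution, and a logistic regression model chosen purely for illustration, shows how such per-instance weights plug into training:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Behavior model from the talk: the human keeps their own judgment when their
# decision confidence c_i exceeds a personal threshold and defers to the AI otherwise.
# Reading F as the CDF referred to in the talk, the AI's recommendation only matters
# on instance i with probability 1 - F(c_i), so optimizing team performance becomes
# a weighted loss with per-instance weight w_i = 1 - F(c_i).

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))                  # synthetic features (illustrative)
y = (X[:, 0] - X[:, 1] > 0).astype(int)         # synthetic labels
human_conf = rng.uniform(0.5, 1.0, size=1000)   # hypothetical human confidence per instance

def F(c):
    """Assumed CDF: deferral thresholds uniform on [0.5, 1.0]."""
    return np.clip((c - 0.5) / 0.5, 0.0, 1.0)

weights = 1.0 - F(human_conf)  # low weight where the human is confident and will ignore the AI

behavior_oblivious = LogisticRegression().fit(X, y)
behavior_aware = LogisticRegression().fit(X, y, sample_weight=weights)
# The behavior-aware model concentrates on the instances where the human will
# actually defer to it, which is what matters for the team's final accuracy.
```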

AI-Assisted Group Decision-Making

  • The goal is to enhance AI-assisted decision making, but in reality, many decisions are made by a group of people, so it's essential to think about how to enhance AI-assisted group decision making (33m3s).
  • To achieve this, it's necessary to start by obtaining a better understanding of how a group of decision makers will engage with AI-based decisions and then design AI-based decisions that account for the group's engagement behavior (33m35s).
  • Research was conducted through human subject studies, where decision makers made recidivism predictions by looking at the profiles of criminal defendants and deciding whether they may reoffend in the future (34m31s).
  • Two experimental treatments were created: one where decision makers made independent judgments with the help of an AI model, and another where decision makers were organized into groups of three to discuss and come up with a consensus decision with the help of an AI model (34m50s).
  • The study looked into various aspects of how individuals and groups engage with AI, including final decision-making accuracy, reliance on AI, confidence, accountability, and understanding of AI (35m26s).
  • One of the most striking results was that groups of decision makers relied on AI's decision recommendations even more than individual decision makers, both when AI was correct and when AI was wrong (35m51s).
  • This led to a lower degree of under-reliance but a higher degree of over-reliance for groups, and further analysis of chat logs revealed that group members often used AI's agreement with their own judgment to persuade others to vote together with them (36m22s).
  • Another frequently observed pattern in group discussions was the use of AI's decision recommendations to support their own judgments and persuade others (36m56s).
  • In some cases, a group cannot reach a consensus, and someone may suggest using an AI model's recommendation as a tiebreaker, which can significantly increase the chance of people using AI models' decision recommendations as their final decision (37m2s).
  • To improve AI-assisted group decision-making performance, it is necessary to help groups critically reflect on the trustworthiness of AI models' decision recommendations and engage in more critical deliberation (37m26s).

The Devil's Advocate Approach

  • Introducing a devil's advocate into the group discussion process is a frequently used approach to encourage critical deliberation, but the traditional devil's advocate has limitations, such as the assigned person not genuinely holding the opposing views they are asked to present (37m54s).
  • Another limitation of the traditional devil's advocate approach is that the person assigned to this role may suffer from psychological safety issues, feeling isolated in the process (38m40s).
  • To address these limitations, using a large language model to power the devil's advocate is a potential solution, as it can generate more genuine opposing views without suffering from psychological harms (39m0s).
  • The language model-powered devil's advocate can vary along two dimensions: whom it objects to, either opposing the majority view or specifically arguing against the AI model's decision recommendation, and its interactivity, ranging from a static to a more interactive design (a prompt-construction sketch follows this list) (39m54s).
  • The simplest design is a static devil's advocate that generates critical initial questions to encourage groups to engage in critical thinking, but more interactive designs are also possible (40m44s).
  • A Dynamic Devil's Advocate, a language model that actively participates in group discussions and responds to other members' arguments, can enhance AI-assisted decision-making (41m8s).
  • This approach was tested in a human subject experiment, which found that the Dynamic Devil's Advocate can help decision-makers achieve higher decision-making accuracy and decrease their overreliance on AI recommendations (41m34s).
  • The language model-powered Devil's Advocate exhibits emergent behaviors, such as encouraging group members to express their opinions and ensuring everyone's voice is heard, even when not explicitly prompted to do so (42m27s).
  • The Devil's Advocate also recognizes when the group's decision is based on wrong information and corrects them, and encourages holistic evaluation of the case (43m2s).
  • Additionally, the Devil's Advocate identifies and challenges assumptions made by the group, promoting critical thinking and more informed decision-making (44m5s).
  • The benefit of the Dynamic Devil's Advocate comes from group members seeing and engaging with the adversarial interaction, which helps the group reach a better decision, rather than from the AI system itself simply performing better in this environment (44m51s).
  • The role of AI in decision-making can be enhanced by having a "Devil's Advocate" that challenges AI suggestions and encourages critical reflection, especially when the AI model's performance is on par with human performance (45m56s).
  • The effectiveness of the Devil's Advocate approach may not generalize to settings where the AI model's performance is significantly better than human performance, and it may be necessary to redesign the approach to account for such cases (46m14s).
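
The talk describes the two design dimensions but not the prompts themselves; the following is a hypothetical sketch of how a system prompt for the language model-powered devil's advocate might be assembled along those dimensions (the wording, function name, and parameters are assumptions, not the study's actual prompts):

```python
from typing import Optional

def devils_advocate_prompt(target: str, dynamic: bool, case_summary: str,
                           ai_recommendation: str,
                           majority_view: Optional[str] = None) -> str:
    """Assemble a system prompt for an LLM devil's advocate along the two design
    dimensions discussed in the talk: whom to object to, and static vs. dynamic."""
    if target == "ai":
        stance = (f"The AI decision aid recommends: {ai_recommendation}. "
                  "Argue against this recommendation and surface evidence that it may be wrong.")
    elif target == "majority":
        stance = (f"The group currently leans toward: {majority_view}. "
                  "Argue against the majority view and raise counter-evidence it may have overlooked.")
    else:
        raise ValueError("target must be 'ai' or 'majority'")

    if dynamic:
        interaction = ("Participate throughout the discussion: respond to each member's arguments, "
                       "challenge unsupported assumptions, and point out factual errors.")
    else:
        interaction = ("Post a single opening message with thought-provoking questions that "
                       "encourage the group to critically examine the recommendation.")

    return ("You are a devil's advocate in a group decision-making discussion.\n"
            f"Case summary: {case_summary}\n"
            f"{stance}\n{interaction}\n"
            "Be constructive and concise, and do not reveal what you believe the correct answer is.")

print(devils_advocate_prompt("ai", dynamic=True,
                             case_summary="(defendant profile would go here)",
                             ai_recommendation="will reoffend within two years"))
```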

Experiment with a Language Model-Powered Devil's Advocate

  • In an experiment, a Devil's Advocate powered by a language model was introduced to a group of decision-makers, and the group was informed about the presence of the Devil's Advocate (46m57s).
  • The experiment used a recidivism prediction task, where participants were asked to predict whether a defendant would reoffend within two years, and the data set provided the ground truth answer (47m24s).
  • The language model was prompted to generate thought-provoking questions and encourage deliberation, but it was observed to play a facilitator role, which may be due to its design or its ability to adapt to the situation (49m1s).
  • The language model's ability to adopt a facilitator role may be a result of its strategy to engage in arguments and encourage discussion, even when playing the role of a Devil's Advocate (48m40s).
  • The experiment highlights the importance of considering the role of human engagement behavior in AI-assisted decision-making and the need to design AI systems that can effectively interact with humans (45m39s).
  • Language models often try to ensure everyone is contributing equally and considering all possible features of a task, but the reason behind this behavior is not well understood (49m33s).
  • Future work could involve engineering language models to take on separate roles, such as a facilitator or coordinator versus one that raises opposing views, to see how the different roles interact and which is most helpful for group decision-making accuracy (49m51s).

Future Directions and Conclusion

  • Understanding human engagement with AI is crucial as it can help design adaptive AI that promotes more appropriate engagement from humans and potentially retrain AI models to tailor to human engagement behavior (50m29s).
  • The goal is to promote a more appropriate engagement from the human side and to design AI that can adapt to human behavior, leading to more effective human-AI collaboration (50m37s).
