Stanford CS329H: Machine Learning from Human Preferences | Guest Lecture: Joseph Jay Williams
22 Nov 2024
Introduction and Goals
- The goal of the talk is to provide a shareable message about human-computer interaction and experimentation, so that the audience can explain how experimentation can be used in various behavior change contexts and potentially contribute to helping someone win a Nobel Prize for experimentation (52s).
- The speaker's research can be found on their website, and they have a mailing list that people can join to stay updated on lab meetings and special meetings (41s).
- The speaker's objectives for the talk include providing concrete information that the audience can share with others and inspiring the audience to take action or contribute to the field of experimentation (54s).
- The speaker's program aims to coach behavior change by transforming everyday technology touch points into intelligent interventions, using adaptive experimentation tools, personalized and contextualized interventions, and experimentation to accelerate science and practice (2m13s).
Behavior Change and Intelligent Interventions
- The speaker's lab has worked on various projects related to behavior change, including papers presented by Ananya Bhattacharjee and others (2m47s).
- The science of behavior change can help solve many human problems, and the speaker invites the audience to imagine what behaviors they wish they could start or stop doing (3m3s).
- Behavior change is a crucial aspect of various areas, including education and health, and can be influenced by technology to help people make better decisions and change their behavior for good (3m45s).
- The vision for 2034 is to have on-demand intelligent coaches that measurably help people think through many problems and change their behavior (4m11s).
- The AdapComp framework has been used as a foundation from 2012 to 2024 to design intelligent interventions, which are tools that change behavior and learn over time (4m22s).
- Intelligent interventions can be in the form of text messages, app messages, or explanations of concepts, and they try to come up with the best thing to give a person at a specific point in time (4m31s).
- To build intelligent interventions, multiple disciplines need to be integrated, and continuous learning is essential (4m47s).
- The AdapComp framework takes components of everyday experience and turns them into micro-labs for intelligent intervention, where ideas can be tested and improved over time (5m17s).
- Examples of intelligent interventions include text messages, such as a daily message that is crowdsourced and tested to figure out which one is best for a person (5m30s).
- Emails can also be turned into intelligent coaches by generating multiple versions and testing what works when, such as getting students to start their homework early (6m4s).
- Any everyday experience, such as explanations on a website, can be turned into an intelligent intervention, and it should be able to learn and adapt over time (6m31s).
A/B Testing and Machine Learning
- Machine learning from human preferences can be applied to various aspects of life, making everything an intelligent intervention to test out different versions and changes over time (7m54s).
- A/B testing can be used to improve learning management systems (LMS) by testing different versions of explanations, messages, and answers to figure out what works best for students (7m37s).
- However, A/B testing in LMS is not being utilized to its full potential, with many opportunities for improvement, such as testing 200 times more versions than currently being done (7m52s).
- One challenge in implementing A/B testing is achieving sufficient statistical power, especially in small exchanges where the impact on metrics like click rates may be very small (8m15s).
- To overcome this challenge, collaborations can be built to conduct A/B testing at scale, such as in education and mental health settings (9m2s).
- Another approach is qualitative A/B testing, which involves gathering feedback and thinking through different options, even with limited data, such as when sending an email to only one person (9m24s).
- The ABScribe paper by Mohi Reza explores qualitative A/B testing and its potential applications (9m17s).
- A/B testing can also be applied as a thought experiment, using tools like ABScribe to create different versions of an email and thinking through how they would come together (9m43s).
Qualitative A/B Testing and Thought Experiments
- The concept of A/B testing can be applied qualitatively, rather than just quantitatively, and a new name for this approach is being sought, with suggestions including "qualitative A/B comparisons" and "super A/B comparisons" (10m36s).
- Intelligent intervention involves a process that can work even without quantitative data, consisting of two parts: generating options and exploiting the evaluation of these options, which can be achieved through adaptive experiments that integrate reinforcement learning and human judgment (10m56s).
- To generate options, one can use large language models (LLMs) and human input to create alternative versions, and then explore the space of options to come up with better ideas (11m10s).
- Exploiting the evaluation of options means acting on what has been learned so far, which can be done using adaptive experiments that give people what looks better, integrating reinforcement learning and human judgment (11m30s).
- A concrete example of this process is assigning probabilities to different versions of an email, such as 70% to one version and 30% to another, and then giving each version to the corresponding percentage of people to gather more data (11m51s).
- This approach allows for rethinking the world to not just give 100% or 0% of something, but to give things proportional to the probability of having the best outcome (13m47s).
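The proportional assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's tooling: the version names, the 70/30 split, and the helper `assign_version` are assumptions chosen to mirror the email example.

```python
import random
from collections import Counter


def assign_version(probabilities, rng):
    """Pick one version at random, weighted by its assignment probability."""
    versions = list(probabilities)
    weights = [probabilities[v] for v in versions]
    return rng.choices(versions, weights=weights, k=1)[0]


# Hypothetical belief: 70% that email A is the best version, 30% that email B is.
assignment_probabilities = {"email_A": 0.7, "email_B": 0.3}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
counts = Counter(
    assign_version(assignment_probabilities, rng) for _ in range(1000)
)
print(counts)  # roughly 70% of recipients get email_A, the rest get email_B
```

Because every recipient still has some chance of receiving the less favored version, data keeps arriving for both, and the probabilities can be revised later.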
Probabilistic Models and Human Judgment
- Assigning items or actions proportionally to the probability of being the best requires defining what "best" means and deciding on the complexity of the model used to weigh different versions against each other (14m7s).
- Building a probabilistic model in the background is necessary to estimate the belief of the student and plan subsequent actions, but there are many ways to do this and many different ways to plan subsequent actions (14m33s).
- The process of figuring out relevant components of a model and setting probabilities for A/B testing is complex and depends on various factors, with no one-size-fits-all solution (15m8s).
- One possible model for setting probabilities in a two-arm case with binary outcomes will be presented, but it has its limitations (15m18s).
- An alternative approach is to use a crowdsourced mixture of experts, where a group of people, such as marketing experts, provide their opinions and update their beliefs based on the data (15m33s).
- There is a need for more research on getting human beings to set probabilities, and statistical models can be used, but more innovative work is required (15m58s).
- The goal is to make A/B testing accessible to 8 billion people, who may not run formal A/B tests but can still use tools to randomize and test different versions of emails or messages (16m11s).
- The intention is to rethink the way decisions are made and bring elements of testing together to help think about signals that indicate what might be better or worse (17m23s).
- The goal is to encourage randomization, even if it's not a 50/50 split, and to build the habit of A/B testing, which is progress towards making better decisions (17m57s).
- The problem of choosing probabilities is a problem to solve, and worrying about it gives confidence that a solution can be found (18m12s).
- Research has been done on models that can be used for A/B testing, but there is still a lot of room for improvement and innovation (18m21s).
- Quantitative methods are being used to estimate probabilities, but this approach may not be enough to help people make better decisions, and crowdsourced approaches can be more effective in this regard (18m27s).
- The quality of work will be improved with crowdsourced approaches, and one way to see this is by using models that tell us what our probabilities are (18m49s).
ABScribe and Thought A/B Testing
- An example of an experiment is ABScribe, a tool that enables users to create and test different versions of a message, such as an email, to see which one is more effective (19m21s).
- ABScribe allows users to select pieces of the message and make them an A/B test, and then use a large language model to generate different versions of the message (20m10s).
- The tool can be used to experiment with different versions of a message and see which one works better for a specific audience, such as someone who is very scientific or someone who believes in the benefits of yoga (20m30s).
- The goal of ABScribe is to help users create more effective messages by testing different versions and seeing which one performs better (20m53s).
- The tool can be used in various contexts, such as writing an email to students before a test, and can help users think through different versions of a message and get feedback from a language model (21m10s).
- The approach used in ABScribe is called thought A/B testing or thought A/B comparison, which involves defining options, thinking through different versions, getting feedback from a language model, and traversing the options (21m25s).
- The future of this approach is promising, and researchers can build on this work to publish high-quality papers in this area (21m47s).
- A study was conducted to test the effectiveness of a three-minute message in improving test performance, with the message being that stress can actually help improve performance on a test, not hurt it (21m52s).
- The study involved six different elements, including varying the text, showing people instructions, a video of the speaker, and asking participants to explain what they thought they'd learn (22m36s).
- The core message was tested in many ways, and the results showed that the message on its own, without any additional information, resulted in an average grade of 76% (23m7s).
- When the message was elaborated on, such as explaining how stress can help put more resources into the task and make the person pay more attention, the results showed a significant improvement, with an average grade of around 80% (23m38s).
- The study involved running six different experiments in a couple of months, which would have taken a couple of years to run with traditional research methods (24m2s).
- The results of the study suggest that the message could be effective in improving test performance, but it's not a guarantee, and the probability of it working in a particular population is not 100% (24m29s).
- To increase the chances of the message being effective, it's suggested to use crowdsourcing to come up with different versions of the message and then test them using A/B testing and machine learning models (24m36s).
- The data from the study could be used to calculate the probability of the message replicating in a particular setting, and this probability could be used to inform decisions about whether to use the message in a particular context (25m7s).
- Instead of just using statistical models to predict the effectiveness of the message, it's suggested to give the data to someone who is teaching a class and let them decide whether to use the message based on their own judgment (25m30s).
- A class was conducted to demonstrate how quantitative data can be combined with human judgment to make decisions, with the results showing a 76% chance of a particular message being the best to show students (25m34s).
Combining Quantitative Data and Human Judgment
- The process involves taking quantitative data, computing probabilities based on a model, and then giving that data to a human for judgment, allowing for the combination of human judgment with model probabilities (25m49s).
- This approach may seem subjective, but it can be a good way to connect technical tools with human behavior and decision-making (26m23s).
- The approach does not rely on individual estimates of probabilities, but rather on combining different beliefs and models, such as Bayesian combinations of beliefs (26m54s).
- There are different tools that can be used in this process, including statistical models, crowdsourcing opinions from people, and combining mathematical concepts with human behavior (27m25s).
- The goal is to take mathematical concepts and apply them to real-world problems, rather than just focusing on pure math or theoretical models (28m3s).
- This approach requires a combination of technical skills and human intuition, with the goal of designing experiences and interventions that can help people solve problems (28m9s).
Crowdsourcing and Human Feedback
- The idea of "high-dimensional intuition" is mentioned, which refers to the ability to navigate complex and messy problems, and to know which problems are worth working on (28m17s).
- Crowdsourcing opinions from people is a well-defined process that can be used to combine mathematical concepts with human behavior, and can be reproduced and changed as needed (28m50s).
- Human feedback is essential in machine learning, and people are generally bad at quantifying probabilities, making it challenging to work with human preferences (29m25s).
- The importance of using probabilities as signals in human feedback is questioned, and alternative signals or lower-effort methods are considered (29m47s).
Adaptive A/B Testing and Reinforcement Learning
- Deviating from uniform random sampling can increase power in certain cases, and quantitative adaptive A/B testing is a valuable approach (30m23s).
- A/B testing is a fundamental form of reinforcement learning, and Thompson sampling is a method that samples from the current belief about each arm's success probability to make decisions (31m6s).
- A simple model for A/B testing is the beta-Bernoulli bandit (the "beta bandit"), which involves two arms (e.g., explanation A vs. B) and a binary reward outcome (1 or 0) (31m17s).
- In the beta bandit model, a policy learning algorithm like Thompson sampling assigns probabilities to each arm, and a beta distribution is used to model the probability of a positive outcome (31m50s).
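As a minimal sketch of the beta bandit's bookkeeping, each arm can carry a Beta(alpha, beta) belief that is updated after every binary reward. The class name `BetaArm` and the reward sequence are illustrative assumptions, not taken from the lecture.

```python
from dataclasses import dataclass


@dataclass
class BetaArm:
    """Beta(alpha, beta) belief about one arm's probability of a reward of 1."""

    alpha: float = 1.0  # 1 + number of observed successes
    beta: float = 1.0   # 1 + number of observed failures

    def update(self, reward: int) -> None:
        """Add one success (reward 1) or one failure (reward 0)."""
        if reward == 1:
            self.alpha += 1
        else:
            self.beta += 1

    def mean(self) -> float:
        """Posterior mean of the arm's success probability."""
        return self.alpha / (self.alpha + self.beta)


# Hypothetical rewards for explanation A: three people liked it, one did not.
arm = BetaArm()
for reward in [1, 0, 1, 1]:
    arm.update(reward)

print(arm, round(arm.mean(), 3))
```

Starting from `BetaArm()` corresponds to a uniform prior, so the counts alone determine the belief once data arrives.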
Beta Bandit Model and Thompson Sampling
- A beta distribution with parameters 1 and 1, Beta(1, 1), is a uniform distribution, indicating no prior knowledge or data (32m17s).
- Using Beta(1, 1) as a prior means that all possible probabilities are equally likely; it can be read as having no data, or as one fictional success and one fictional failure (32m21s).
- A beta distribution can represent the probability that people will think something is helpful: the beta distribution describes the belief about that success probability, and each reward (1 or 0) is a draw with that probability. The model is simple and interpretable (32m54s).
- For example, if five people are given arm one and three of them like it while two do not, the distribution becomes Beta(1+3, 1+2) = Beta(4, 3), which is skewed a bit higher, indicating a higher probability that people think it's helpful (33m19s).
- If five people are given arm two and all of them like it, the distribution becomes Beta(1+5, 1+0) = Beta(6, 1), which is skewed even higher, indicating a high probability that people like it, but still with some uncertainty (33m55s).
- Having more observations, such as Beta(60, 10) versus Beta(6, 1), results in a more peaked distribution, indicating more confidence in the probability, even though the mean is the same (34m22s).
- The mean of Beta(α, β) is α / (α + β); for example, the mean of Beta(60, 10) is 60/70 = 6/7, indicating a high probability that people will like something (34m35s).
- Thompson sampling assigns arms by drawing one sample from each arm's beta distribution and giving the next person the arm with the highest sampled value; repeating those draws many times also gives an estimate of the probability that one arm is better than another (35m48s).
- For example, if the beta distribution for arm A is Beta(6, 1) and for arm B is Beta(3, 3), Thompson sampling would sample from each distribution and assign the arm with the higher sample, which in this case would usually be arm A (36m0s).
- Thompson sampling makes sense because comparing samples, rather than only the means, takes into account the uncertainty of the probabilities (36m35s).
- To determine the probability of one arm being better than another in an adaptive experiment, one can sample from the beta distributions 10,000 times and compute the probability empirically, which can result in a probability of 80% or 70% that one arm is better than the other (37m5s).
- The inference in adaptive experiments is not trivial, and there are many mistakes people make when using these experiments without realizing it (37m32s).
- Sampling works by choosing the arm that gives the highest reward under certain parameters, and one can compute the probability empirically by repeating the process 10,000 times (36m44s).
- The online model can be used to determine the probability of one arm being better than another, and it is possible to formalize human judgment and manual assignment using beta distributions (37m46s).
- If a person has a prior belief about the probability of one arm being better than another, it is possible to back out their prior into beta distributions, which can be represented as fictional successes and failures (38m11s).
- For example, if someone thinks there is a 70% chance that arm A is better than arm B, they might represent this as 8 successes and 2 failures in arm A, and 5 successes and 5 failures in arm B (38m22s).
- There is a need for research on how people express probabilities that one arm is better than another and how to map these probabilities to beta distributions (38m56s).
- It is possible to help people formalize their beliefs about whether one arm is better than another by providing interactive explorers or tools that help them understand how their intuitions can be mapped to beta distributions (39m12s).
- It is also possible to weight people's approaches differently, for example, by giving more weight to the estimates of someone who is considered an expert (39m37s).
- For instance, if someone thinks that an expert's estimates are 10 times more valuable than their own, they can weight the expert's successes and failures 10 times more than their own (40m12s).
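The sampling steps in this section can be sketched with Python's standard library. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the expert's 5/5 counts are invented for the example, and adding the counts on top of a Beta(1, 1) prior is one possible convention rather than the lecture's stated method.

```python
import random


def thompson_choose(posterior_a, posterior_b, rng):
    """Draw one sample per arm and return the arm with the larger draw."""
    draw_a = rng.betavariate(*posterior_a)
    draw_b = rng.betavariate(*posterior_b)
    return "A" if draw_a >= draw_b else "B"


def prob_a_better(posterior_a, posterior_b, rng, draws=10_000):
    """Monte Carlo estimate of P(arm A's success rate > arm B's)."""
    wins = sum(
        rng.betavariate(*posterior_a) > rng.betavariate(*posterior_b)
        for _ in range(draws)
    )
    return wins / draws


def combine_beliefs(beliefs, weights):
    """Weighted sum of fictional (successes, failures), added to Beta(1, 1)."""
    successes = sum(w * s for (s, _), w in zip(beliefs, weights))
    failures = sum(w * f for (_, f), w in zip(beliefs, weights))
    return (1 + successes, 1 + failures)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Arm A is Beta(6, 1) and arm B is Beta(3, 3), as in the example above.
p = prob_a_better((6, 1), (3, 3), rng)
print(f"Estimated P(A better than B): {p:.2f}")
print("Next assignment:", thompson_choose((6, 1), (3, 3), rng))

# Hypothetical: my belief about arm A is 8 successes and 2 failures;
# an expert's belief is 5 and 5, and I weight the expert 10 times more.
combined_a = combine_beliefs([(8, 2), (5, 5)], weights=[1, 10])
print("Combined belief for arm A: Beta", combined_a)
```

The same `prob_a_better` helper could be used to check whether a set of fictional successes and failures actually matches a stated belief such as "70% that A is better".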
Combining Expertise and Opinions
- A method of combining expertise and opinions involves assigning 50/50 using human judgment or a model, then evaluating the results with both quantitative and qualitative metrics to determine the probability of one option being better than the other (40m21s).
- This approach, called Conant experimenting, involves continuously adding new ideas and updating probabilities over time (41m2s).
- The method requires a lot of trust and effort from users, and it's possible for individuals to try to skew the results if they're very confident about something (41m22s).
- A paper titled "Eliciting beliefs from people to get better decision making even when some parties want to sway the vote" investigates ways of gathering feedback from different people to come to a group consensus (41m42s).
- The paper examines policies for combining input from different people and investigates weighting schemes, showing that some schemes can suffer from bias if someone tries to sway the majority (41m55s).
- The "HiPPO" approach, where decisions are made by the highest paid person's opinion, can be problematic, and a paper on this topic could explore ways to defend against this bias (42m32s).
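To make the concern about swaying concrete, here is a small illustration of how one extreme report can move a simple average of elicited probabilities while a median stays put. The numbers are invented, and mean versus median is only a stand-in for the idea of comparing aggregation schemes; it is not the weighting scheme studied in the paper mentioned above.

```python
from statistics import mean, median

# Hypothetical elicited probabilities that option A is better than option B.
honest_estimates = [0.55, 0.60, 0.58, 0.62]

# One additional participant reports an extreme value to sway the group.
swayed_estimates = honest_estimates + [0.0]

print("mean before / after:  ", mean(honest_estimates), mean(swayed_estimates))
print("median before / after:", median(honest_estimates), median(swayed_estimates))
```

In this toy case the average drops below 0.5 after the extreme report, which would flip the group decision, while the median still favors option A.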
Language Models and Personalized Interventions
- Using language models to quickly refine priors and simulate human behavior is a potential area of research, with a paper by Robb Willer at the university exploring the use of language models to simulate human behavior (43m45s).
- Writing a paper on using language models to help with decision-making could be a quick, easy, and highly impactful project (44m1s).
- There are numerous potential research papers that can be written on the topic of using large language models (LLMs) to simulate human behavior and provide personalized interventions, with at least 200 possible papers in the next 10 years (44m11s).
- Personalized and contextualized interventions are an area of research that involves using LLMs to provide tailored advice and support to individuals in specific situations (44m35s).
- Ana's paper on "Small Steps SMS" is an example of a personalized intervention that sends daily messages to help users manage stress and be happier (44m46s).
- Another paper by Ana discusses the use of design prompts for self-reflection, allowing users to ask themselves questions to manage stressful situations (44m53s).
- A paper on "Stories and Messages" presents a system that provides users with stories about others who are going through similar experiences, and allows users to abstract the lessons to their own setting (45m8s).
- A recent paper explores the use of LLMs as a thought partner to help users adapt stories to their own context (45m20s).
- Another paper discusses the use of LLMs as a thought partner to help users overcome procrastination, using prompt engineering and interface design to facilitate effective messaging (45m33s).
- The "Teny Spark" paper or demo presents an interface that allows users to type in their situation and receive a message from an LLM, with options to customize the tone and style of the response (46m40s).
- The interface allows users to edit the prompt and generate multiple messages, and can be used to provide support for a range of situations, including procrastination and stress management (47m13s).
- The development of effective interfaces for LLMs is crucial for unlocking their potential value, and can make a significant difference in the user experience (47m31s).
- The business value of AI can be demonstrated through the development of use cases that show its impact, such as personalized interventions and support systems (47m51s).
Adaptive Experimentation and Statistical Minorities
- Adaptive experimentation using psychology and human-computer interaction (HCI) can help discover the best intervention for a statistical minority, which may be the opposite of what's best on average (48m1s).
- A graph is presented showing the outcome of an experiment where students receive messages to generate questions about what they're learning in an online course, with two different messages (blue and red) having different effects on students with higher and lower accuracy (48m31s).
- The data shows that on average, the red message is more effective in getting students to generate questions, but when broken down by accuracy, the blue message is more effective for students with lower accuracy, who are a statistical minority (49m1s).
- This means that taking just the average would actually discriminate against the statistical minority, and using the red message for everyone would hurt the students who would benefit more from the blue message (49m35s).
- The blue message is better for 20% of the data (students with lower accuracy), while the red message is better for 80% of the data (students with higher accuracy), but the blue message is substantially better for the lower accuracy group (50m15s).
- Not running experiments and relying on intuition or qualitative data would make this problem bigger, as it would lead to missing the opportunity to discover the subset of students who benefit from the blue message (50m41s).
- Using contextual bandits, an experiment can be designed to assign the messages separately for the low accuracy and higher accuracy groups, allowing for adaptation over time to help the higher accuracy group and collect more data for the low accuracy group (51m15s).
- For example, the red message could be assigned more often in the higher accuracy group, reflecting a 70% chance that it is better than the blue message there, while keeping a 50/50 split for the low accuracy group to collect more data (51m32s).
- An approach to experimentation allows for continuous data collection over time, adapting to discover what's better for subgroups, and potentially leading to better outcomes for people in smaller groups, with a 20% increase in statistical power to detect effects (51m55s).
- This method can help identify scenarios where option B is better than option A for most people, but might be hurting a minority, and can increase the statistical power to detect this difference (52m11s).
- In a traditional experiment, outcomes for minorities can be uniform, with 36% of people generating a question, but this approach can lead to better learning outcomes and increased statistical power to detect the best option for minorities (52m53s).
- The statistical power to detect the best option for minorities increased from 87% to 89%, which may not seem like a lot, but can translate to 50 fewer samples needed to be equally confident (53m6s).
- The approach can also reduce the number of people needed in an experiment, with even five fewer people being a good outcome (53m18s).
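A very simple form of the contextual bandit idea above is to keep a separate beta belief for each (group, message) pair and run Thompson sampling within each group. The sketch below assumes that design; the class name, the response rates, and the 80/20 group split are hypothetical values for illustration, not results from the lecture.

```python
import random


class GroupedThompsonSampler:
    """Thompson sampling with one Beta(alpha, beta) belief per (group, arm)."""

    def __init__(self, groups, arms, seed=0):
        self.rng = random.Random(seed)
        # Every belief starts at Beta(1, 1), the uniform prior.
        self.posteriors = {g: {a: [1.0, 1.0] for a in arms} for g in groups}

    def choose(self, group):
        """Sample each arm's belief for this group and pick the largest draw."""
        draws = {
            arm: self.rng.betavariate(alpha, beta)
            for arm, (alpha, beta) in self.posteriors[group].items()
        }
        return max(draws, key=draws.get)

    def update(self, group, arm, reward):
        """Record a success (reward 1) or failure (reward 0) for this group."""
        self.posteriors[group][arm][0 if reward == 1 else 1] += 1


# Hypothetical rates at which each message gets a student to generate a question.
TRUE_RATES = {
    "higher_accuracy": {"red": 0.40, "blue": 0.30},
    "lower_accuracy": {"red": 0.20, "blue": 0.45},
}
N_STUDENTS = 2000

sampler = GroupedThompsonSampler(groups=TRUE_RATES, arms=["red", "blue"], seed=1)
env = random.Random(2)  # simulates which students arrive and how they respond

for _ in range(N_STUDENTS):
    group = "lower_accuracy" if env.random() < 0.2 else "higher_accuracy"
    arm = sampler.choose(group)
    reward = 1 if env.random() < TRUE_RATES[group][arm] else 0
    sampler.update(group, arm, reward)

for group, arms in sampler.posteriors.items():
    print(group, arms)
```

Keeping the beliefs separate means the smaller group is never averaged away: its assignment probabilities depend only on its own data.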
Epsilon Thompson Sampling and Algorithm-Induced Test Statistic
- The algorithm used is not Thompson sampling, which was previously thought to be effective, but rather Epsilon Thompson sampling, which involves doing uniform random sampling Epsilon of the time and Thompson sampling the rest of the time (54m0s).
- Epsilon Thompson sampling can be thought of as doing some amount of traditional experimentation to reduce the biases of Thompson sampling and guard against changes in the world or missing minority populations (54m32s).
- The analysis of data from Epsilon Thompson sampling can be done using a test statistic, which is a more effective method than debiasing techniques (55m30s).
- When comparing two options, A and B, using Thompson sampling or epsilon Thompson sampling, a Z test can be used to determine if there is a significant difference between the two options, but this assumes uniform random sampling, which is not the case with these algorithms (55m51s).
- The algorithm-induced test statistic takes the Z test statistic and regenerates it using the algorithm to collect the data, resulting in a different distribution for the test statistic and a different cutoff value for the hypothesis test (56m21s).
- Using this approach, the false positive rate can be fixed, whereas Thompson sampling can push the false positive rate to 15% compared to 5% with uniform random sampling, and the power of the test is not bad, with epsilon Thompson sampling giving a power of 66% (56m50s).
- The key takeaway is that when running experiments adaptively, epsilon Thompson sampling or a version called TS-PostDiff can be used, and then the algorithm-induced test statistic can be applied to get better results (57m22s).
- Traditional Thompson sampling has a specific tradeoff between false positives and other errors, but modifying the algorithm can result in other tradeoffs that might be more practical for a study, giving good inference (57m43s).
- Thompson sampling is not necessarily on the Pareto frontier, and changing the algorithm can result in better uniformly more statistically sensitive results (58m3s).
- The issues with Thompson sampling have been characterized: using epsilon Thompson sampling, ideally adaptive epsilon Thompson sampling, will give more statistically sensitive results, but the way statistical tests are done should also be modified using algorithm-tuned analysis (58m19s).
- The Z test compares means by taking the difference between the sample means of people who said they like explanation A and explanation B and dividing it by the standard error of the difference, which largely depends on the sample sizes (59m7s).
- The paper shows how Thompson sampling affects false positive rates compared to uniform random sampling, and further tests show that alternatives such as the t-test and inverse propensity weighting also run into issues (59m31s).
- The algorithm-induced test controls the false positive rate at 5% but has a power of only 17%, which is not great. A false positive means concluding that a difference exists when it does not (59m44s).
- This occurs due to sampling issues, where a random low mean is observed, and sampling is stopped, leading to a false conclusion of a difference when there is none (1h0m11s).
- The vertical axis represents a sample estimate of the mean, and in reality, both arms have a mean of 0.5, but due to sampling, one arm appears to have a lower mean, leading to a false positive (1h0m26s).
- The adaptive algorithm stops sampling from the arm that appears worse, resulting in a false conclusion of a difference, even with 800 participants (1h0m40s).
- This issue also explains why there is lower power in experiments, as the sample size is not evenly distributed between arms, leading to a lack of confidence in the results (1h1m26s).
- Even when there is no difference in means, the assignment probability can be 80% or higher 53% of the time, indicating a high probability of one arm being better when there is no actual difference (1h1m55s).
- Empirical exploration of this issue is necessary, and using a tuned analysis algorithm can help control the false positive rate to 5% and increase power (1h2m50s).
- The algorithm allows for better behavior in experiments, with lower false positive rates and better power, especially when there is a small difference between arms (1h2m40s).
- Under uniform random sampling, the Wald Z test statistic is distributed in a way that allows for hypothesis testing, and the probability of getting a test statistic of 1.96 or bigger can be calculated to determine the alpha level (1h3m12s).
- Power is the probability of getting a test statistic bigger than 1.96 when there is an arm difference of 0.1, and in this case, power is 80% with a sample size of 785 (1h3m41s).
- Thompson sampling (TS) can give extreme values of the test statistic even when there is no real difference, and it has lower power because it reduces the test statistic when there is a difference (1h3m58s).
- TS can't tell apart when there's a difference versus when there's not, reducing the signal from the two distributions (1h4m28s).
- The algorithm-induced test statistic can be adjusted to set the cut-off values to be higher, giving a 5% false positive rate, but this results in lower power, at 17% (1h4m52s).
- The problem is that the data are not discriminating, and TS or adaptive epsilon Thompson sampling drives the distributions apart (1h5m21s).
- Statistically sensitive algorithms are needed, and a paper is being worked on to investigate hybrid combinations of reward-maximizing bandits like Thompson sampling with traditional uniform random assignment (1h5m40s).
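One way to read the algorithm-induced test statistic described above is as a simulation: rerun the same assignment algorithm many times with no true difference between arms, record the Wald Z statistic each time, and use the empirical 95th percentile of its absolute value as the cutoff instead of 1.96. The sketch below follows that reading under stated assumptions; the function names, the null success rate of 0.5, and the sample and simulation sizes are illustrative choices, not the paper's implementation.

```python
import math
import random


def wald_z(successes, trials):
    """Wald Z statistic for arm 0 minus arm 1; 0.0 when it is undefined."""
    (s0, s1), (n0, n1) = successes, trials
    if n0 == 0 or n1 == 0:
        return 0.0
    p0, p1 = s0 / n0, s1 / n1
    se = math.sqrt(p0 * (1 - p0) / n0 + p1 * (1 - p1) / n1)
    return (p0 - p1) / se if se > 0 else 0.0


def run_experiment(p0, p1, n, epsilon, rng):
    """Epsilon Thompson sampling: uniform random with probability epsilon."""
    successes, trials = [0, 0], [0, 0]
    for _ in range(n):
        if rng.random() < epsilon:
            arm = rng.randrange(2)
        else:
            draws = [
                rng.betavariate(1 + successes[i], 1 + trials[i] - successes[i])
                for i in range(2)
            ]
            arm = 0 if draws[0] >= draws[1] else 1
        reward = 1 if rng.random() < (p0, p1)[arm] else 0
        trials[arm] += 1
        successes[arm] += reward
    return wald_z(successes, trials)


def induced_cutoff(n, epsilon, rng, null_rate=0.5, sims=300, alpha=0.05):
    """Empirical (1 - alpha) quantile of |Z| when both arms are identical."""
    stats = sorted(
        abs(run_experiment(null_rate, null_rate, n, epsilon, rng))
        for _ in range(sims)
    )
    return stats[min(sims - 1, int((1 - alpha) * sims))]


rng = random.Random(0)
cutoff_uniform = induced_cutoff(n=200, epsilon=1.0, rng=rng)  # all uniform random
cutoff_eps_ts = induced_cutoff(n=200, epsilon=0.2, rng=rng)   # mostly Thompson
print("cutoff under uniform random:", round(cutoff_uniform, 2))
print("cutoff under epsilon-TS:    ", round(cutoff_eps_ts, 2))
```

Rejecting the null only when the observed |Z| exceeds the simulated cutoff for the algorithm that actually collected the data is what holds the false positive rate near the chosen alpha.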
TS-PostDiff and Adaptive Epsilon Thompson Sampling
- Epsilon Thompson sampling has a fixed probability, but it's not clear what epsilon value to use, and it's not the statistician's business to decide (1h6m3s).
- A clever idea is to use the posterior probability of the arm difference being less than a certain value (e.g., 0.1) to determine epsilon, which is called TS-PostDiff (1h6m56s).
- TS-PostDiff says that if the probability of the arm difference being smaller than the specified value is X, then that X is epsilon, and this adaptive epsilon Thompson sampling adjusts epsilon based on the posterior probability (1h7m31s).
- The algorithm provides a trade-off across different sample sizes and arm differences or effect sizes, and it is believed to beat existing algorithms because it adapts epsilon, rather than using a fixed uniform random sampling rate such as 10% regardless of the sample size (1h8m2s).
- Using adaptive epsilon Thompson sampling (TS-PostDiff) results in a better trade-off between reward, power, and false positive rate, with a fixed false positive rate of 5% (1h8m31s).
- The algorithm's power is comparable to other methods, with TS-PostDiff having a power of 82%, uniform random having a power of 87%, and Thompson sampling having a power of 66% (1h8m50s).
- The use of TS-PostDiff and the algorithm-induced test allows for a good trade-off between reward, inference, and false positive rate, making it suitable for use in experiments (1h9m3s).
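The adaptive epsilon rule described in this section can be sketched as follows, assuming beta posteriors for two arms. Drawing one sample per arm and checking whether the draws differ by less than the threshold `c` triggers uniform random assignment with exactly the posterior probability that the difference is smaller than `c`. The function names and the example posteriors are hypothetical, and this is a reading of the lecture's description rather than the authors' code.

```python
import random


def ts_postdiff_choose(posterior_a, posterior_b, c, rng):
    """Uniform random if a posterior draw says |pA - pB| < c, else Thompson."""
    draw_a = rng.betavariate(*posterior_a)
    draw_b = rng.betavariate(*posterior_b)
    if abs(draw_a - draw_b) < c:
        return rng.choice(["A", "B"])  # arms look similar: explore uniformly
    # Otherwise take a fresh Thompson sampling step.
    draw_a = rng.betavariate(*posterior_a)
    draw_b = rng.betavariate(*posterior_b)
    return "A" if draw_a >= draw_b else "B"


def estimate_epsilon(posterior_a, posterior_b, c, rng, draws=10_000):
    """Monte Carlo estimate of the posterior P(|pA - pB| < c)."""
    hits = sum(
        abs(rng.betavariate(*posterior_a) - rng.betavariate(*posterior_b)) < c
        for _ in range(draws)
    )
    return hits / draws


rng = random.Random(0)

# Hypothetical posteriors: no data yet versus a clear difference after many users.
eps_early = estimate_epsilon((1, 1), (1, 1), c=0.1, rng=rng)
eps_late = estimate_epsilon((80, 20), (20, 80), c=0.1, rng=rng)
print("epsilon with no data:          ", round(eps_early, 3))
print("epsilon with a clear difference:", round(eps_late, 3))
print("next arm:", ts_postdiff_choose((80, 20), (20, 80), 0.1, rng))
```

The effective epsilon shrinks on its own once the data make a meaningful difference likely, so nobody has to pick a fixed exploration rate up front.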
Future Research and Vision
- The development of more statistically sensitive algorithms, such as adaptive epsilon Thompson sampling and algorithm-tuned analyses, is a potential area of future research (1h9m22s).
- The vision of experimentation and coaching involves the use of intelligent interventions, such as adaptive experimentation tools like ABScribe, and the use of TS-PostDiff and algorithm-tuned tests (1h10m11s).
- The goal of this research is to accelerate science and impact practice through the development of more effective experimentation and coaching methods (1h10m23s).