Stanford Webinar - Creating Fair, Useful, and Reliable AI in Healthcare

12 Dec 2024

Bringing AI into Clinical Use

  • Dr. Nigam Shah is a professor of medicine at Stanford University and chief data scientist for Stanford Health Care, with research focused on bringing AI into clinical use safely, ethically, and cost-effectively (14s).
  • Dr. Shah has an extensive background, including being an inventor on patents, authoring over 300 scientific publications, co-founding three companies, and being inducted into the American College of Medical Informatics and the American Society for Clinical Investigation (27s).
  • The quality of AI and machine learning models in healthcare is heavily dependent on the quality of the data they are trained on, with data being collected from patient timelines (1m33s).
  • Patient timelines are visualized as a series of data points collected over time, including ECGs, blood pressure, respiratory rate, cardiac output, medication orders, lab tests, and reports (2m7s).
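As a toy illustration of the patient-timeline idea above, a timeline can be modeled as a time-sorted list of coded events; the event kinds, codes, and values here are invented for illustration, not a real clinical schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Event:
    time: datetime
    kind: str                      # e.g. "vital", "lab", "med_order"
    code: str                      # e.g. a LOINC or RxNorm code (illustrative)
    value: Optional[float] = None  # reports/orders may carry no numeric value

# A toy timeline: sparse, irregularly sampled, mixed event types.
timeline = [
    Event(datetime(2024, 1, 3, 9, 15), "lab", "LOINC:2160-0", 1.4),
    Event(datetime(2024, 1, 3, 8, 0), "vital", "systolic_bp", 142.0),
    Event(datetime(2024, 2, 10, 11, 0), "med_order", "RxNorm:197361"),
]

# Models consume events in temporal order, so sort by timestamp first.
timeline.sort(key=lambda e: e.time)
print([e.kind for e in timeline])  # ['vital', 'lab', 'med_order']
```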

Data Quality and Patient Timelines

  • In a typical healthcare setting, not all data is collected at every point in time, and there is often a lack of longitudinal coverage for individual patients (2m46s).
  • The manipulation and processing of patient timeline data have a significant impact on the performance of AI and machine learning models (3m22s).
  • With large amounts of patient timeline data, models can be built to support decision-making in healthcare, including whether to treat a patient and how to treat a patient (4m1s).

Classification vs. Prediction vs. Recommendation

  • The decision of whether to treat a patient can be broken down into classification or diagnosis tasks, or prediction tasks, such as prognosis (4m18s).
  • The terms "prediction" and "classification" are often conflated: classification is the correct term when analyzing an image to determine whether it contains a specific object or condition, such as pneumonia or a dog, because the outcome is already present (4m46s).
  • In medicine, many things that claim to be predictions are actually classifications, such as sepsis predictors, which are actually figuring out if a patient has sepsis, and this distinction is important as it affects how the information is used (5m13s).
  • The distinction between prediction and classification is crucial, as predicting an outcome may lead to attempts to prevent it, while classifying and diagnosing a condition leads to treatment, not prevention (5m30s).
  • Recommendation is the hardest task, given the data's limitations and biases, and it has taken medicine a 40-year journey to figure out how to make reliable recommendations (5m58s).
  • There are three things that can be done with AI in healthcare: classification, prediction, and recommendation, and it's essential to consider whether these technical exercises are advancing the science of medicine, the practice of medicine, or the delivery of medical care (6m13s).

Advancing Science, Practice, and Delivery of Medical Care

  • An example of advancing the science of medicine is the discovery of three subtypes of heart failure with preserved ejection fraction, which would be a classification (6m42s).
  • Advancing the practice of medicine would involve developing a test to determine the subtype of heart failure and having a treatment available to target the specific subtype (6m58s).
  • Advancing the delivery of medical care would involve implementing the test and treatment over time, resulting in improved patient outcomes, such as longer life, lower costs, and better quality of life (7m28s).

The Green Button Project and On-Demand Data Analysis

  • The "Green Button Project" is an example of advancing the practice of medicine, where a simple query system was developed to help clinicians make decisions at the bedside by analyzing similar patient cases (8m1s).
  • A bedside consultation service was created to provide written reports with recommendations for patient care, utilizing aggregated data from millions of patients to make better decisions (8m16s).
  • Research has shown that medical evidence can be unreliable, with physicians often making decisions without prior published data, highlighting the need for on-demand data analysis (8m47s).
  • A project was conducted to analyze data on demand, which was later scaled and shared through a company called Atropos Health, reducing the time it takes to conduct bedside studies from a day or two to under 24 hours (9m36s).
  • The use of generative AI has further reduced the time it takes to conduct studies, allowing for on-demand analysis in a matter of minutes (10m9s).

Predictive AI and Cost Savings

  • In one example, simple AI was used to predict which patients will become medically costly in the future, and proactively enrolling them in management programs resulted in an estimated 10-15% cost savings without sacrificing quality (10m54s).
  • AI and machine learning can be used to make various predictions, including operational, biological, and delivery-related predictions, such as predicting no-shows, classifying images, and deciding who to put on an air ambulance (11m53s).
  • The AI model provides a risk estimate, but the actual value comes from taking responsive action based on that estimate, such as early intervention or advance care planning, depending on the specific case (12m28s).
  • A three-star logo is used to remember the interplay between the model's risk estimate, the work capacity to follow through, and the action taken, with the goal of achieving net benefit (13m10s).
  • About 25 papers by five to six faculty members have studied this interplay, leading to the key insight that the focus should be on what can be achieved given the available work capacity (13m36s).
  • A plot is used to illustrate the relationship between the rank-ordered cases based on the probability of an event happening and the cumulative benefit of taking action, with the goal of determining how far down the list action can be taken before diminishing returns are seen (13m53s).
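A minimal sketch of that rank-then-act analysis, using invented risk scores and benefit/cost numbers: rank cases by predicted probability, accumulate expected net benefit, and stop where the curve peaks:

```python
def cumulative_benefit(risks, benefit_if_true, cost_of_action):
    """Expected cumulative net benefit of acting on the top-k ranked cases."""
    ranked = sorted(risks, reverse=True)  # highest predicted risk first
    total, curve = 0.0, []
    for p in ranked:
        # Expected net benefit of acting on this one case.
        total += p * benefit_if_true - cost_of_action
        curve.append(total)
    return curve

curve = cumulative_benefit(
    risks=[0.9, 0.7, 0.4, 0.2, 0.05],
    benefit_if_true=100.0,  # benefit when the event would actually have occurred
    cost_of_action=30.0,    # fixed cost of intervening on any case
)
# The curve rises while expected benefit exceeds cost, then declines; the peak
# says how far down the ranked list to act before diminishing returns.
peak_k = curve.index(max(curve)) + 1
print(peak_k)  # 3
```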

Fair, Useful, Reliable Models (FURM)

  • The approach is called Fair, Useful, Reliable Models (FURM), a multi-step process that involves usefulness simulations, financial projections, ethical considerations, and prospective evaluation (14m59s).
  • The current way of doing things in AI is unsustainable, with 220 atomic pieces of guidance on how to do good AI, but half of them focus only on building the model, with little emphasis on workflow analysis and implementation (16m22s).
  • The FURM approach is a packaged process, developed over five to seven years of work, that is used on a routine basis in the campus healthcare system (15m43s).

Unsustainable AI Practices and the FURM Approach

  • The current organization of medical research is unsustainable, with an example of a model taking 10 years and $28 million to be tested and validated at multiple sites, highlighting the need for more efficient processes like the FURM assessment in healthcare (16m47s).
  • To create fair, useful, and reliable AI in healthcare, a three-step approach can be taken: Discovery (solving for the science), Development (validating the intent), and Dissemination (scaling) (17m43s).
  • In the Discovery stage, the process is often too slow and costly, while in the Development stage, the focus should be on achievable benefit and financial sustainability, which may require changes to business models (18m15s).
  • The FURM assessment is a tool used to evaluate the development of AI models, with a link to the assessment available at fm.stan.edu (18m49s).

FURM Assessment and Workflow Definition

  • The first step in the FURM assessment is to define the workflow, including what actions will be taken and by whom, with an example workflow provided for a classifier that identifies undiagnosed peripheral artery disease (19m8s).
  • Clarity on responsible action is key when building AI models, including defining the policy and workflow, and determining the threshold for action (19m42s).
  • An ethics assessment is also conducted as part of the FURM assessment, considering factors such as equity, reliability, governance, and autonomy in decision-making (20m6s).
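One hedged sketch of such a threshold-and-capacity policy for acting on model output; the patient IDs, threshold, and capacity values are illustrative:

```python
def select_for_outreach(patient_risks, threshold, capacity):
    """Pick patients whose predicted risk meets the action threshold,
    highest risk first, up to the team's available work capacity."""
    eligible = [(pid, r) for pid, r in patient_risks.items() if r >= threshold]
    eligible.sort(key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in eligible[:capacity]]

# Illustrative risk estimates from some upstream classifier.
risks = {"A": 0.82, "B": 0.35, "C": 0.91, "D": 0.55}
print(select_for_outreach(risks, threshold=0.5, capacity=2))  # ['C', 'A']
```

The point of making the threshold and capacity explicit parameters is that they encode the policy, which the talk argues matters as much as the model itself.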

FURM Assessment and Capacity Planning

  • The FURM assessment process is used to evaluate the impact of AI projects in healthcare, considering factors such as the number of patients affected, sustainability, and potential ethical problems, with the goal of identifying good projects to pursue (20m46s).
  • The assessment process involves analyzing six cases, with the first row of the table showing an example where 1,400 patients are impacted, and the project is deemed sustainable with no ethics problems (20m57s).
  • Capacity planning is necessary for responsible AI in healthcare, as launching multiple projects simultaneously can be challenging with limited personnel, and operational engineering work is done to determine the number of concurrent assessments needed to achieve a certain throughput (21m33s).
  • Little's law, a basic operations engineering principle, is used to calculate the required team size, indicating that a team that can handle two assessments at the same time is needed to complete at least one assessment per month (22m12s).
  • Good governance is essential to ensure that everything needed is done, and a life cycle is established to make sure that the FURM assessment is integrated into the workflow, with governance, IT support, and standard work (22m35s).
  • The governance process involves making decisions, assigning responsibility, and conducting analyses to produce numbers and inform decision-making, with the goal of ensuring that AI projects are fair, useful, and reliable (23m12s).
  • The four key components of the process are standard work, IT support, governance, and the FURM assessment, which work together to ensure that AI projects are well-planned and executed (23m34s).
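The Little's law calculation mentioned above (L = λ × W) can be written out directly; the two-month assessment duration used here is an assumption chosen to reproduce the talk's numbers:

```python
def required_concurrency(throughput_per_month: float,
                         months_per_assessment: float) -> float:
    """Little's law: L = lambda * W, where L is the average number of items
    in the system, lambda the throughput, and W the time each item spends
    in the system."""
    return throughput_per_month * months_per_assessment

# To complete at least one assessment per month when each takes ~2 months
# (assumption), the team must be able to run two assessments concurrently.
print(required_concurrency(1, 2))  # 2
```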

Language Models and Patient Timelines

  • The process was developed to ensure that machine learning projects are fair, useful, and reliable, but the emergence of large language models (LLMs) in 2022 has introduced new challenges and uncertainties (23m49s).
  • A language can be viewed as a sequence of tokens from a finite vocabulary, and with this lens, a patient timeline can be seen as a language, consisting of tokens such as ICD codes, CPT codes, and LOINC codes (24m46s).
  • There are two ways to build language models: the classical way using natural language, which can be used for chat and summarization, and using patient timelines to forecast what will happen, a unique way of using language models in healthcare (25m17s).
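A minimal sketch of the "patient timeline as a language" idea above: each clinical event becomes a token from a finite vocabulary of codes, and a patient's history becomes a token sequence a language model can consume. The codes and vocabulary here are illustrative, not a real clinical tokenizer:

```python
# Illustrative vocabulary mapping clinical codes to token ids.
vocab = {
    "ICD10:E11.9": 0,   # type 2 diabetes
    "LOINC:4548-4": 1,  # hemoglobin A1c lab
    "CPT:80053": 2,     # metabolic panel procedure
    "ICD10:I10": 3,     # essential hypertension
}

# A patient's history, ordered in time, as a sequence of coded events.
timeline = ["ICD10:I10", "LOINC:4548-4", "CPT:80053", "ICD10:E11.9"]

# Tokenize exactly as one would tokenize natural-language text.
token_ids = [vocab[code] for code in timeline]
print(token_ids)  # [3, 1, 2, 0]
```

With this framing, "forecasting what will happen" is next-token prediction over future clinical events rather than over words.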

Evaluating Language Models in Healthcare

  • A study was conducted to verify the effectiveness of language models in healthcare, using GPT-3.5 and GPT-4 to answer questions from a bedside service. Agreement with reference answers increased and disagreement decreased from GPT-3.5 to GPT-4, but around 40-50% of the time physicians couldn't decide whether the answers were correct (26m19s).
  • Another study, MedAlign, was conducted to evaluate the alignment of language model outputs with medical needs, and the results showed a 35% error rate in answering medical prompts, even in the best-case setup (27m57s).
  • Research is being conducted to train models that can forecast patient outcomes, using a forecasting classifier or predictor, and the results show that the number of positive examples used to train the model affects its performance (28m22s).
  • The performance of different models, including gradient boosted models, logistic regression, random forest, and timeline-trained language models, was compared using receiver operating characteristic curves, with the timeline-trained language model (dark blue line) consistently showing higher accuracy (28m39s).
  • The timeline-trained language model achieved an accuracy of around 78%, outperforming the highest accuracy of classical methods (red, orange, or light blue dots) while using 95% less training data and training eight times faster (29m11s).

CLMBR and MOTOR: Open-Source Language Models

  • The models, called CLMBR and MOTOR, have been publicly released and can be found on GitHub (29m50s).
  • The focus should be on verifying the benefits of language models and generative AI, as there are many tech companies building models that cost millions of dollars, which academic sites cannot afford (30m7s).
  • It is essential to ask hard questions about whether these models actually work as advertised, and there is a need to develop a worldview for generative AI (30m36s).
  • The development and dissemination of this worldview are uncertain, and there is a need to focus on verifying benefits, as seen in conflicting articles about whether AI is better than doctors (30m50s).

Building Fair, Useful, and Reliable Models

  • To build fair, useful, and reliable models, data engineering and data science work must be done collaboratively, with data engineers and data scientists working side by side (31m56s).
  • The data science team should have more data engineers than data scientists, and they should work together to extract, clean, and make decisions about the data, as these decisions will affect the kind of science that can be done (32m7s).

Efficiency in AI Model Development

  • The time it takes to develop and refine AI models in healthcare decreases with each replication, as the team's maturity and established platforms and procedures contribute to increased efficiency and reliability (32m45s).
  • The first end-to-end development of an AI model in healthcare took around 5-7 years, but subsequent replications took significantly less time, with the goal of reducing the time by 50% with each replication (33m14s).

Verification of EHR Data Accuracy

  • Electronic Health Record (EHR) data is inherently noisy and prone to errors, so it's essential to verify conclusions by looking for multiple lines of corroboration, such as independent pieces of information confirming a diagnosis (34m2s).
  • To establish the accuracy of EHR data, it's crucial to look beyond a single code or piece of information and instead consider multiple factors, such as lab results, medications, and other relevant data points (34m18s).
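A hedged sketch of the corroboration idea above, using diabetes as an example; the code prefixes, lab threshold, and medication name are illustrative choices, not the speaker's actual rules:

```python
def corroborated_diabetes(record, min_signals=2):
    """Treat a diagnosis in noisy EHR data as verified only when multiple
    independent lines of evidence agree (illustrative criteria)."""
    signals = 0
    if "ICD10:E11" in record.get("diagnosis_codes", []):
        signals += 1  # a diagnosis code alone is a weak signal
    if record.get("hba1c", 0.0) >= 6.5:
        signals += 1  # confirmatory lab result
    if "metformin" in record.get("medications", []):
        signals += 1  # a disease-specific medication order
    return signals >= min_signals

patient = {"diagnosis_codes": ["ICD10:E11"], "hba1c": 7.1, "medications": []}
print(corroborated_diabetes(patient))  # True: code plus lab agree
```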

Stanford's Approach to LLMs

  • Stanford is not training its own Large Language Model (LLM) for medical analysis, instead opting to use an open-source model with no copyright issues and fine-tuning it for their specific needs (34m37s).
  • Training an LLM from scratch is expensive and not considered an efficient approach, especially given the rapid pace of advancements in the field (35m1s).

Promising Applications of Machine Learning in Medicine

  • The most promising applications of machine learning in medicine include both clinical and non-clinical areas, such as operational applications like transcription and billing, as well as clinical applications like disease diagnosis and management (35m10s).
  • The choice of application and prioritization of AI development in healthcare depends on the specific environment and structural issues of the healthcare system, with different priorities in resource-rich versus resource-poor settings (35m31s).
  • In resource-poor settings, AI may be used for clinical care, such as retinal scanning algorithms for diabetic patients, due to the lack of alternative options and the potential for AI to provide better-than-no-care solutions (36m27s).
  • The development and implementation of AI in healthcare must consider the local context and prioritize applications that address the most pressing needs and challenges in that specific environment (36m50s).

Real-world Applications and Generative AI in Behavioral Health

  • Academic work has been successfully applied to solve real-life problems in the healthcare industry, such as predicting mortality to improve advance care planning, which has improved the care of over 6,000 patients at Stanford Health Care.
  • Generative AI has potential applications in Behavioral Health, but there are concerns about its reliability, as seen in incidents like Gemini providing harmful advice to a high schooler, making it uncertain if it's the best use of the technology today.

Patient Involvement and Stakeholder Collaboration

  • Patient involvement in implementing AI models, such as the FEMR model, is ensured through a patient family advisory council, which reviews the workflow and actions taken by the algorithm to ensure patient comfort and consent.
  • Multiple stakeholders, including clinicians, patients, administrators, and developers, should be involved in the development and deployment of AI algorithms to ensure their appropriateness and effectiveness.

Machine Learning in Laboratories

  • Machine learning has practical applications in laboratories, such as in histology, cytology, or flow cytometry, with examples including deep neural nets that can assist pathologists in reading slides and identifying areas of interest.
  • AI is widely used in pathology, ranging from cell sorters that use lasers and calculations to count white blood cells, to more advanced tools that help pathologists read slides and augment their work.

Data Gaps in EHR Systems

  • Critical data gaps exist in current Electronic Health Record (EHR) systems, with the biggest issue being the presence of too many systems, often ranging from 500 to 1,000, which can include EHR systems like Epic or Cerner, as well as specialized systems for various departments, resulting in medical data being scattered across hundreds of systems (41m7s).
  • It is a myth that all medical data is in the EHR; instead, there is an opportunity to combine this scattered data in one place (41m46s).

Addressing Bias in Predictions

  • When addressing bias in predictions made by models, it is essential to distinguish between two interpretations of bias: systematic differences in model performance for people belonging to different subgroups, and systematic differences in the actual benefit or reward resulting from the model's output (42m9s).
  • The latter type of bias is more concerning, as it can result in unequal benefits for people belonging to different subgroups, and addressing this requires focusing on policies and workflows driven by the model's output rather than just removing model-side differences (42m48s).
  • Algorithmic fixes may not always be effective in addressing bias, and it is crucial to consider other factors that can impact the actual benefit or reward, such as policies and workflows (43m16s).
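A minimal sketch of checking the first kind of bias (systematic performance differences across subgroups), on invented data; as the talk notes, equal model metrics alone do not guarantee equal downstream benefit:

```python
def accuracy_by_group(y_true, y_pred, groups):
    """Compute a simple performance metric (accuracy) per subgroup."""
    stats = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        correct = sum(y_true[i] == y_pred[i] for i in idx)
        stats[g] = correct / len(idx)
    return stats

# Invented labels, predictions, and subgroup membership for illustration.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
groups = ["A", "A", "A", "B", "B", "B"]

stats = accuracy_by_group(y_true, y_pred, groups)
print(stats)  # group A underperforms group B on this toy data
```

A gap like this flags the model-side kind of bias; auditing the second kind requires following the resulting actions and outcomes, not just the predictions.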

Traceability, Explainability, and Trust

  • Traceability and explainability requirements can slow down the progress of AI model advancements, but it depends on the purpose of these requirements, which can include debugging, mitigating outcomes, or establishing trust (43m50s).
  • Different scenarios require different types of interpretability, including engineers' interpretability for debugging, causal interpretability for mitigating outcomes, and transparency for establishing trust (44m25s).
  • Providing the wrong type of explanation can be counterproductive, and it is essential to understand the purpose of the request for explainability or interpretability to provide the most effective response (45m10s).
  • Establishing trust in AI models can be done through prospective studies, similar to how trust is established in medications, even if the exact mechanism of action is not fully understood (45m20s).

AI Assistance and Medical Errors

  • Studies have compared error rates in medical diagnosis and treatment with and without AI assistance, with one recent study by Jonathan Chen finding that doctors sometimes make mistakes when using AI, possibly due to suboptimal use (45m59s).
  • The study by Jonathan Chen involved giving case vignettes to physicians with and without access to AI, and found that AI alone performed better than doctors with AI assistance (46m6s).
  • More research is needed to understand how AI can be effectively used in practice, rather than just debating whether it should be used (46m56s).

EHRs as Raw Material for AI Training

  • Electronic Health Records (EHRs) can be considered one of the raw materials for teaching AI, but should not be the only source, and should be supplemented with information from textbooks and online sources (47m15s).

Medical Imaging AI and Resources

  • Medical imaging AI has many practical uses, with the FDA having approved around 1,000 image-based models, mostly in radiology and cardiology (48m4s).
  • The Stanford Center for Health Education offers resources for learning more about machine learning and medicine, including a program and online content (49m1s).
