241121 CHE NigamShah final
Dr. Nigam Shah's Background and Expertise
- Dr. Nigam Shah is a professor of medicine at Stanford University and the chief data scientist for Stanford Health Care, with research focused on safely and effectively integrating AI into clinical use (14s).
- Dr. Shah has an extensive background, including being an inventor on patents, authoring over 300 scientific publications, and co-founding three companies (27s).
- He was inducted into the American College of Medical Informatics in 2015 and the American Society for Clinical Investigation in 2016 (37s).
- Dr. Shah holds an MBBS from Baroda Medical College in India, a PhD from Penn State University, and completed post-doctoral training at Stanford University (46s).
Data Quality and Patient Timelines in AI/ML Models
- The quality of AI and machine learning models is heavily dependent on the quality of the data they are trained on, with data being collected from patient timelines (1m33s).
- Patient timelines can include various data types, such as ECGs, blood pressure, respiratory rate, and medication orders, which are used to build models (2m7s).
- In a typical healthcare setting, not all data is collected at every point in time, and there is often a lack of longitudinal coverage for individual patients; a minimal sketch of such a sparse timeline follows this list (2m46s).
- The manipulation and processing of patient timeline data have a significant impact on the performance of AI and machine learning models (3m22s).
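To make the patient-timeline idea concrete, here is a minimal sketch of a timeline as a sparse, irregularly sampled sequence of events; the schema, field names, and specific codes are illustrative assumptions, not from the talk:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Event:
    """One observation on a patient timeline (illustrative schema)."""
    time: datetime
    code: str                      # e.g., a LOINC, ICD, or RxNorm code
    value: Optional[float] = None  # None for events with no numeric value

# A sparse timeline: not every signal is measured at every point in time,
# and gaps in longitudinal coverage are the norm rather than the exception.
timeline = [
    Event(datetime(2023, 1, 5), "LOINC:8480-6", 142.0),  # systolic blood pressure
    Event(datetime(2023, 1, 5), "LOINC:9279-1", 18.0),   # respiratory rate
    Event(datetime(2023, 6, 2), "ICD10:I50.3", None),    # HFpEF diagnosis code
    Event(datetime(2023, 6, 2), "RxNorm:310798", None),  # medication order
]

# How such timelines are manipulated and processed (binning, imputation,
# windowing) has a significant impact on downstream model performance.
```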
AI/ML in Healthcare Decision-Making: Classification, Prediction, and Recommendation
- With large amounts of patient timeline data, models can be built to support decision-making in healthcare, including whether to treat a patient and how to treat them (4m1s).
- The decision of whether to treat a patient can be broken down into classification or diagnosis tasks, or prediction tasks, from a computational standpoint (4m18s).
- In everyday language, terms such as prognosis, prediction, and classification are often used interchangeably, but the distinction matters: classification is not the same as prediction, and many things in medicine that masquerade as predictions are actually classifications (4m29s).
- A classification identifies something that already exists, such as diagnosing pneumonia or recognizing a dog in an image, whereas a prediction forecasts a future event (4m46s).
- The distinction between classification and prediction is crucial because it affects how we approach treatment and prevention; many deployed models, such as so-called sepsis predictors, are actually classifiers and not predictors (see the sketch after this list) (5m13s).
- Recommendation is the hardest task, given the data's limitations and biases, and it's been a 40-year journey in medicine to figure out reliable recommendation systems (5m48s).
- There are three things that can be done with AI and machine learning in medicine: classification, prediction, and recommendation, and it's essential to consider whether these technical exercises are advancing the science of medicine, the practice of medicine, or the delivery of medical care (6m13s).
- An example of advancing the science of medicine is the discovery of three subtypes of heart failure with preserved ejection fraction, which would be a classification that advances our scientific understanding of the disease (6m42s).
- An example of advancing the practice of medicine would be developing a test that can identify the subtype of heart failure and provide a treatment plan accordingly, which would improve patient outcomes and reduce costs (6m58s).
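As flagged above, the classification-versus-prediction distinction comes down to how labels are constructed relative to an index time. A minimal sketch, reusing the `Event` records from the earlier timeline sketch; the helper names and the 30-day horizon are illustrative assumptions:

```python
from datetime import timedelta

def classification_label(timeline, code, index_time):
    """Classification: does the condition already exist at the index time?
    The model identifies something present now (pneumonia, a dog in an image)."""
    return any(e.code == code and e.time <= index_time for e in timeline)

def prediction_label(timeline, code, index_time, horizon=timedelta(days=30)):
    """Prediction: will the event occur in a future window after the index time?
    Only this formulation actually forecasts, which is what prevention needs."""
    return any(e.code == code and index_time < e.time <= index_time + horizon
               for e in timeline)

# A "sepsis predictor" trained on labels like classification_label is really
# a classifier: it detects sepsis already underway rather than forecasting it.
```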
Advancing Medical Science, Practice, and Delivery with AI
- An example of advancing the delivery of medical care is the "Green Button Project," which aimed to query similar patients' data to inform treatment decisions at the bedside, and it has been successful in improving patient outcomes and reducing costs (8m1s).
- A bedside consultation service was developed to provide written reports with recommendations for patient care by analyzing timeline objects from millions of other patients, which helped make better decisions than would have been made otherwise (8m14s).
- A study found that about 80% of the time, physicians did not have prior published data to inform their decisions, and fewer than 3% of decisions had a study specific to the question at hand that the physician knew of (9m4s).
- The lack of access to prior published data highlights the need to analyze data on demand, which was achieved through a prior project that was later spun out into a company called Atropos Health (9m36s).
- Atropos Health reduced the time it took to conduct bedside studies from a day or two to under 24 hours, and sometimes even a few hours, and with the advent of generative AI, studies can now be done in a few minutes (10m3s).
AI-Driven Healthcare Advancements and Cost Savings
- The use of different technologies such as machine learning, chatbots, and generative AI enables rapid decision-making and can improve patient care (10m37s).
- A relatively simple AI model was used to predict which patients would become medically costly the next year, and proactive action was taken to enroll them in care programs, resulting in an estimated 10-15% cost savings without sacrificing quality (10m54s).
- AI and machine learning can drive advancements across cost, biology, practice, and delivery, with applications such as predicting no-shows, arranging patient transport, and classifying images (11m42s).
- A consistent theme in AI applications is that the AI provides a risk estimate, and the value comes from taking action based on that estimate, such as early intervention in cost blooms or advanced care planning in mortality prediction (12m34s).
- The model stratifies by risk value, and the actual intervention necessary may vary, such as providing transportation support in the case of predicting no-shows (12m51s).
- A three-box diagram is used to remember the interplay between the model, work capacity, and the action taken: the yellow box represents the computer science and statistics, the green box the number the model produces, and the red box the action taken; a minimal sketch of this pipeline follows this list (13m10s).
- The interplay between the model, work capacity, and action taken is studied, with about five or six faculty members working on it and publishing around 25 papers on the topic (13m36s).
- The key insight from this research is that the focus should be on what can be achieved given the work capacity, rather than just building a model (13m42s).
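A minimal sketch of the three-box interplay described above: the model produces a risk score (the green box), and a policy constrained by work capacity turns scores into actions (the red box). The function and parameter names are illustrative assumptions:

```python
def act_on_risk(patients, risk_score, capacity, action):
    """Rank patients by model risk and act on as many as work capacity allows.
    The value comes from the action taken, not from the risk estimate itself."""
    ranked = sorted(patients, key=risk_score, reverse=True)  # green box: numbers
    selected = ranked[:capacity]                             # capacity constraint
    for patient in selected:
        action(patient)                                      # red box: the action
    return selected

# The same pattern serves different interventions: enroll likely high-cost
# patients in a care program, trigger advance care planning on mortality risk,
# or arrange transportation for patients predicted to no-show.
```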
The FURM Approach and Responsible AI Development
- A plot is used to illustrate the cumulative benefit of taking action on cases ranked by probability, with the goal of determining how far down the list action can be taken before diminishing returns set in; a toy version of this calculation is sketched after this list (13m56s).
- The approach used is called Fair, Useful, Reliable Models (FURM), a multi-step process that includes usefulness simulations, financial projections, ethical considerations, and prospective evaluation (14m59s).
- The FURM approach is used on a routine basis in the healthcare system, but it was developed because the current way of doing things is unsustainable: there are 220 atomic pieces of guidance on how to do good AI, yet only half of them focus on how to build a model (16m23s).
- A model was developed over 10 years at a cost of $28 million to determine who should receive immediate attention in the emergency department (ED) and who can wait for registration, highlighting the unsustainable nature of current medical research practices (16m45s).
- The current organization of work in medical research is unsustainable, which is why processes like the FURM assessment are needed to make healthcare activities more responsible and sustainable (17m20s).
- A three-step approach is proposed for responsible and sustainable innovation in healthcare: Discovery (solving for the science), Development (validating the intent), and Dissemination (scaling) (17m43s).
- For standard AI, the Discovery stage is too slow and costly, and the Development stage needs to focus on achievable benefits and financial sustainability, which may require changing business models (18m12s).
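A toy version of the cumulative-benefit plot described above: rank cases by predicted probability, accumulate expected net benefit down the list, and stop where the marginal expected benefit no longer covers the per-case cost of acting. All quantities here are illustrative assumptions:

```python
def benefit_cutoff(probabilities, benefit_per_hit, cost_per_action):
    """Walk down the ranked list and find how far action stays worthwhile,
    i.e., while marginal expected benefit >= per-case cost of acting."""
    ranked = sorted(probabilities, reverse=True)
    cumulative, cutoff = 0.0, 0
    for rank, p in enumerate(ranked, start=1):
        marginal = p * benefit_per_hit - cost_per_action
        if marginal < 0:      # diminishing returns: stop acting at this rank
            break
        cumulative += marginal
        cutoff = rank
    return cutoff, cumulative

# With a $1,000 benefit per true positive and a $150 cost per intervention,
# acting stays worthwhile only while p >= 0.15 down the ranked list:
cutoff, net = benefit_cutoff([0.9, 0.6, 0.3, 0.1, 0.05], 1000, 150)
print(cutoff, net)  # 3 1350.0
```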
FURM Assessment for Responsible AI Deployment
- The FURM assessment is a tool used to evaluate the responsible development and deployment of AI in healthcare; a blog post and a website (furm.stanford.edu) provide more information (18m49s).
- The FURM assessment involves evaluating the workflow, including the steps involved in using a classifier to identify patients with undiagnosed diseases, and ensuring clarity on the responsible action to take (19m8s).
- The assessment also includes an ethics evaluation, considering factors such as equity, reliability, governance, autonomy, and decision-making processes (20m6s).
- The FURM assessment process is used to evaluate the impact of algorithms on patients, considering factors such as the number of people affected, sustainability, and potential ethical problems, to ensure that projects are fair, useful, and reliable (20m46s).
- The assessment process involves analyzing six cases, with the goal of identifying projects that can benefit a large number of people without causing harm to certain subgroups (20m57s).
- Capacity planning is necessary to ensure that responsible AI can be implemented in a healthcare system, and operational engineering work is done to determine the number of concurrent assessments needed to achieve a certain goal (21m36s).
- Little's law is used to calculate the required team size, indicating that a team able to handle two assessments at the same time is needed to complete at least one assessment per month; the calculation is written out after this list (22m12s).
- Good governance is essential to ensure that everything that needs to be done is actually done, and a life cycle is established to make sure the FURM assessment is integrated into the workflow (22m35s).
- The workflow consists of four key components: standard work, IT support, governance, and the FURM assessment, which together are necessary to make informed decisions about AI projects (23m8s).
- The FURM assessment process produces numbers and analyses that help the governance body make decisions about AI projects, considering factors such as patients affected, sustainability, and potential harm to certain subgroups (23m20s).
- The goal of the FURM assessment process is to ensure that AI projects are fair, useful, and reliable, do not cause harm to certain subgroups, and that informed decisions are made about which projects to pursue (23m49s).
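The Little's law calculation referenced above, written out. Little's law relates average work in progress L, throughput λ, and the average time W an item spends in the system. Assuming each FURM assessment takes about two months end to end (an illustrative figure; the talk states only the conclusion), a throughput of one assessment per month requires:

```latex
L = \lambda W = 1\,\tfrac{\text{assessment}}{\text{month}} \times 2\,\text{months} = 2\ \text{concurrent assessments}
```

Hence a team sized to carry two assessments concurrently sustains completion of at least one per month.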
Large Language Models (LLMs) and Their Application in Healthcare
- The introduction of Large Language Models (LLMs) in 2022 has changed the landscape and raised new questions about the FURM assessment process (24m2s).
- A language, in the context of computers, is a sequence of tokens from a finite vocabulary, such as a dictionary, and can include natural languages like English, Spanish, and Gujarati, as well as the "EHR language" used in electronic health records (EHRs) consisting of codes like ICD, CPT, and LOINC codes (24m24s).
- The EHR language is a sequence of tokens representing events and actions in a patient's timeline, such as visits, prescriptions, and diagnoses, which can be used to build language models for forecasting and generative AI in healthcare (24m55s).
- There are two ways to build language models: the classical approach using natural language text and documents, and the use of patient timelines to forecast future events and outcomes; a toy tokenization of such a timeline is sketched at the end of this section (25m17s).
- A study was conducted to evaluate the effectiveness of language models, specifically GPT-3.5 and GPT-4, in answering questions from a bedside service, with results showing an increase in agreement and a decrease in disagreement and uncertainty among 12 physicians (26m19s).
- However, the study also found that around 40-50% of the time, physicians couldn't decide whether the language model's answers were correct or not, limiting its utility (26m53s).
- Another project, MedAlign, aimed to assess the alignment of language model outputs with medical needs, with results showing a 35% error rate in answering medical prompts, even in the best-case scenario (27m57s).
- Research is being conducted to train models that can forecast patient outcomes, with a focus on using positive examples to develop forecasting classifiers (28m19s).
- A receiver operating characteristic (ROC) curve was used to compare the performance of different models, including logistic regression, random forest, and a timeline-trained language model, with the language model performing best (28m35s).
- The language model achieved an accuracy of around 78%, better than the best accuracy achieved by the classical methods, and it trains eight times faster and uses 95% less training data (29m16s).
- The models, called CLMBR and MOTOR, have been publicly released and can be found on GitHub (29m53s).
- The focus should be on verifying the benefits of language models and generative AI, rather than just building them, as many tech companies are doing (30m9s).
- Academic sites have a crucial role in asking hard questions about the effectiveness of these models, despite not having the same resources as tech companies (30m25s).
- A similar discovery, development, and dissemination worldview is needed for generative AI, but it is unclear how this will scale (30m38s).
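To make the "EHR language" idea from this section concrete, here is a toy tokenization of a patient timeline into a sequence over a finite vocabulary of event codes, which is exactly the input shape a language model expects. The specific codes and vocabulary are illustrative assumptions; the released CLMBR and MOTOR repositories define the real tokenization:

```python
# Toy "EHR language": a patient timeline as a token sequence over a finite
# vocabulary of event codes (ICD, CPT, LOINC, RxNorm, ...), ordered by time.
vocab = {
    "<pad>": 0,
    "VISIT": 1,
    "ICD10:E11.9": 2,    # type 2 diabetes diagnosis
    "LOINC:4548-4": 3,   # HbA1c lab result
    "RxNorm:860975": 4,  # metformin prescription
    "ICD10:I50.3": 5,    # HFpEF diagnosis
}

timeline_codes = ["VISIT", "ICD10:E11.9", "LOINC:4548-4",
                  "RxNorm:860975", "VISIT", "ICD10:I50.3"]

token_ids = [vocab[code] for code in timeline_codes]
print(token_ids)  # [1, 2, 3, 4, 1, 5]

# A timeline-trained language model learns P(next event | prefix of events):
# generating the next token is, literally, forecasting the patient's future.
```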
Collaboration between Data Engineers and Data Scientists
- Building fair, useful, and reliable models requires collaboration between data engineers and data scientists, who should work together as part of the same team (31m56s).
- Data engineering and data science are two functions that need to work together, with data engineers playing a crucial role in extracting, cleaning, and preparing data for analysis (32m4s).
- The arrangement between data engineers and data scientists should be collaborative, with decisions made during data cleanup and extraction affecting the kind of science that can be done (32m28s).
- The time it takes to analyze or model data and come up with something presentable depends on the team's maturity, with the first time taking significantly longer and subsequent replications becoming faster, cheaper, and more reliable (32m44s).
- The first end-to-end project took around five to seven years to complete, while the next project took a year and a half, and the third project took four months, with the goal of reducing the time by 50% with each replication (33m14s).
Data Integrity and Model Development in Healthcare
- Electronic Health Record (EHR) data is noisy and contains errors, so it is essential to look for multiple lines of corroboration to confirm any conclusion, such as verifying a diabetes diagnosis against HbA1c levels and medication data; a minimal version of this check is sketched below (34m2s).
- Stanford is not training its own Large Language Model (LLM) for medical analysis, instead opting to use an open-source model with no copyright issues and fine-tuning it, as training from scratch is expensive (34m37s).
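A minimal sketch of the corroboration idea for the diabetes example above, under an assumed record schema; the helper and field names are hypothetical, while the 6.5% HbA1c threshold is the standard diagnostic cutoff:

```python
def corroborated_diabetes(record):
    """Trust a diabetes diagnosis code only when an independent line of
    evidence agrees: an elevated HbA1c or an active glucose-lowering drug."""
    has_dx_code = "ICD10:E11.9" in record.get("diagnosis_codes", [])
    elevated_hba1c = any(v >= 6.5 for v in record.get("hba1c_percent", []))
    on_medication = any(m in record.get("medications", [])
                        for m in ("metformin", "insulin"))
    return has_dx_code and (elevated_hba1c or on_medication)

record = {
    "diagnosis_codes": ["ICD10:E11.9"],
    "hba1c_percent": [7.1],      # corroborates the diagnosis code
    "medications": ["metformin"],
}
print(corroborated_diabetes(record))  # True
```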
Applications of AI in Medicine: Clinical vs. Non-Clinical
- The most promising applications of machine learning in medicine include operational applications such as transcription, responding to messages, and billing, which tend to have higher adoption rates in academic centers (36m7s).
- In resource-constrained settings, AI may be used for clinical care, such as retinal scanning algorithms for diabetic patients, as the alternative may be no care at all (36m27s).
- The choice of whether to use AI for clinical or non-clinical applications depends on the structural issues of the healthcare system and the environment in which the work is being done (35m29s).
- The circumstances in which technology is deployed can significantly impact its effectiveness, and there are many examples of academic work being applied to solve real-life problems in the healthcare industry (36m52s).
- One such example is Stanford Health Care's work on predicting mortality to improve advance care planning, which has improved the care of over 6,000 patients (37m12s).
Generative AI, Patient Involvement, and Ethical Considerations
- Generative AI may be applicable in behavioral health, but there are concerns about its potential for errors, such as the reported incident in which Gemini told a high schooler to harm themselves while giving homework help (37m42s).
- Patient involvement is crucial in implementing the FURM process, and this can be achieved through a patient family advisory council, which reviews the workflow and ensures that patients are comfortable with algorithm-driven care (38m28s).
- Multiple stakeholders, including clinicians, patients, administrators, and developers, should be involved in the development and implementation of algorithms to ensure that they are effective and unbiased (39m23s).
AI Applications in Medical Imaging and Pathology
- Machine learning has practical applications in laboratories, such as in histology, cytology, and flow cytometry, with examples including the use of deep neural nets to sort and analyze cells (39m35s).
- The Pathology Department at Stanford has developed a system called Nuclei, which uses a deep neural net to help pathologists read slides and identify areas of interest (39m50s).
- Other examples of AI in pathology include cell sorters, which use lasers and calculations to analyze cells, and have become a widely used tool in healthcare systems (40m33s).
Challenges and Opportunities with EHR Systems
- The biggest issue with current EHR systems is that there are too many of them, with a typical hospital running anywhere from 500 to 1,000 systems; the belief that all medical data is in "the EHR" is therefore a myth, as the data is scattered across hundreds of systems (41m19s).
- The biggest gap or opportunity is to combine all the data in one place, as the current system makes it difficult to access and utilize the data effectively (41m55s).
Addressing Bias in AI Models and Healthcare Workflows
- When it comes to removing bias in predictions made by models, there are two interpretations of bias: a systematic difference in the model's performance for people belonging to different subgroups, and a systematic difference in the actual benefit or reward that people receive as a result of the model's output (42m12s).
- The latter interpretation is the one to worry about, as it is dependent on policies and workflows driven by the model's output, rather than just the model itself (42m51s).
- To address this issue, it is suggested to focus on policies and workflows rather than only removing model-side differences, and to consider the Stanford Human-Centered AI Institute's blog post "When Algorithmic Fairness Fixes Fail" (43m25s).
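The two interpretations of bias above correspond to two different measurable gaps. A minimal sketch under assumed inputs (per-subgroup model scores and per-subgroup realized benefit after the workflow has acted); the point made in the talk is that the second gap is the one to worry about, and it can be large even when the first is zero:

```python
from statistics import mean

def subgroup_gap(values_by_group):
    """Largest difference in group means; reused for both interpretations."""
    group_means = {g: mean(v) for g, v in values_by_group.items()}
    return max(group_means.values()) - min(group_means.values())

# Interpretation 1: systematic difference in the model's output across groups.
scores = {"group_a": [0.7, 0.6], "group_b": [0.7, 0.6]}

# Interpretation 2: systematic difference in the benefit people actually
# receive once policies and workflows act on the model's output.
realized_benefit = {"group_a": [1.0, 1.0], "group_b": [0.2, 0.0]}

print(subgroup_gap(scores))            # 0.0 -- the model looks unbiased
print(subgroup_gap(realized_benefit))  # 0.9 -- but the workflow is not
```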
Traceability, Explainability, and Trust in AI Models
- Traceability and explainability requirements can slow the progress of AI model advancements, but the impact depends on the purpose of the explainability, which can be for debugging, for mitigating bad outcomes, or for establishing trust (43m55s).
- The need for explainability, interpretability, or transparency can be broken down into different scenarios, each with different requirements, and it is essential to understand the purpose of the request to provide the appropriate explanation (44m53s).
- Establishing trust in AI models for medical diagnosis and treatment requires prospective studies to test their effectiveness, similar to how trust is established in conventional drugs like Tylenol, even when their exact mechanisms of action are not fully understood (45m18s).
- Studies have been conducted to compare error rates in medical diagnosis and treatment with and without AI assistance; one study by Jonathan Chen found that doctors sometimes make mistakes when using AI, possibly due to suboptimal use of the technology (45m59s).
- In that study, physicians were given case vignettes with and without access to AI, and the AI alone performed better than doctors with AI assistance, highlighting the need for more research on how to use AI effectively in medical practice (46m4s).
Data Sources for Training AI in Healthcare
- Electronic Health Records (EHRs) can be considered one of the raw materials to teach AI, but they should not be the only source, as sanitized versions of online sources and textbooks are also necessary (47m7s).
- There are many medical imaging AI models in practical use, with the FDA having approved roughly 992 AI-enabled medical devices, about half of which are image-based, primarily in radiology and cardiology (47m44s).