Denys Linkov on Micro Metrics for LLM System Evaluation
Micrometrics and Business Value
- A micrometric is a narrowly scoped metric defined to measure a problem observed in production, or one that is anticipated, and is used to move a specific business metric; this distinguishes it from broad data science metrics such as accuracy, F1, or ROUGE (42s).
- The idea behind micrometrics is to optimize for business value rather than prematurely optimizing for broader metrics that may not reflect the user experience (53s).
- An example micrometric is how often a large language model unexpectedly switches languages, an issue that upset users; a retry mechanism was implemented to fix it (1m28s).
- The retry mechanism resolved about 99% of the language-switching issues, a simple solution whose effect could be tracked and measured (1m48s) (a minimal sketch follows at the end of this section).
- Micrometrics can also help identify fundamental flaws within models; the language-switching issue, for instance, was not caused by a prompt template update (2m8s).
- Different industries and domains will have different micrometrics, and it's up to the domain experts to define what's actually happening and what metrics to use (2m36s).
- As a platform provider, one can only guess at problems or observe them through customer complaints and interactions, so domain experts should learn to define their own metrics (2m43s).
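The conversation does not show how the language-switching micrometric was implemented; the following is a minimal sketch of what such a check with a retry loop could look like, assuming a generic `generate` callable and using the third-party `langdetect` package for detection. The counter names are illustrative stand-ins for a real metrics client.

```python
from langdetect import detect  # third-party language detector; one possible choice


def generate_with_language_check(generate, prompt, expected_lang="en", max_retries=3, metrics=None):
    """Call `generate` and retry if the response comes back in the wrong language.

    `generate` is any callable that takes a prompt and returns text; `metrics`
    is a plain dict standing in for a real metrics client.
    """
    metrics = metrics if metrics is not None else {}
    for attempt in range(max_retries + 1):
        response = generate(prompt)
        if detect(response) == expected_lang:
            if attempt > 0:
                metrics["language_switch_recovered"] = metrics.get("language_switch_recovered", 0) + 1
            return response
        # Track every occurrence so the micrometric can be reported over time.
        metrics["language_switch_detected"] = metrics.get("language_switch_detected", 0) + 1
    metrics["language_switch_unrecovered"] = metrics.get("language_switch_unrecovered", 0) + 1
    return response  # give up after max_retries; the failure is visible in the metric
```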
Accuracy Metrics and Model Evaluation
- There is a difference between overall accuracy metrics and micrometrics, and the choice of large language model may still depend on the specific use case (3m4s).
- Evaluating models is challenging: accuracy can be measured in different ways, such as exact match or ROUGE, and different models perform better on different use cases (3m18s).
- Evaluating the accuracy of large language models (LLMs) is difficult because every metric has flaws, so accuracy is best treated as an approximation rather than an absolute measure (3m27s) (see the sketch at the end of this section).
- Human feedback and labeling can also be inconsistent, with low overlap between expert labelers and average annotators, making it difficult to converge on an agreed answer (3m41s).
- Leaning on LLMs can be a symptom of laziness in defining good training and evaluation sets; it is crucial to return to defining these sets and knowing exactly what to look for (4m10s).
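To illustrate why accuracy is an approximation, here is a small self-contained sketch of two common scoring approaches: strict exact match and a simplified unigram-overlap F1 standing in for ROUGE-1. The same answer can score very differently under the two.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Strictest form of accuracy: 1.0 only if the strings match after normalization."""
    return float(prediction.strip().lower() == reference.strip().lower())


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1, a simplified stand-in for ROUGE-1."""
    pred_tokens = Counter(prediction.lower().split())
    ref_tokens = Counter(reference.lower().split())
    overlap = sum((pred_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)


# A reasonable answer fails exact match but gets partial credit from overlap.
print(exact_match("The capital is Paris", "Paris"))  # 0.0
print(rouge1_f1("The capital is Paris", "Paris"))    # 0.4
```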
Defining Training and Evaluation Sets
- Defining training and evaluation sets starts with understanding the desired metrics, which can be granular and tracked through complex systems such as retrieval and generation pipelines (4m19s).
- In retrieval pipelines, metrics can include the relevancy of retrieved documents (sketched after this list), while generation pipelines involve more complex metrics such as the BLEU score, which can be hard to measure, especially for customer-facing agents (5m26s).
- Measuring the value or accuracy of compound answers that draw on multiple sources can be particularly difficult, as seen in multi-hop QA datasets (5m53s).
- Generation metrics can encode specific requirements, such as mentioning the user's name, using a particular greeting, or adhering to a brand voice, and these requirements can become defining factors and micrometrics (6m27s).
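The episode does not spell out the retrieval metrics used; a common, simple way to score the retrieval half of a RAG pipeline against a labeled evaluation set is precision@k and recall@k, sketched below with illustrative document IDs.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in relevant_ids if doc_id in top_k) / len(relevant_ids)


# One evaluation-set entry: labeled relevant docs and what the retriever returned.
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4"]
relevant = {"doc_2", "doc_4", "doc_11"}
print(precision_at_k(retrieved, relevant, k=4))  # 0.5
print(recall_at_k(retrieved, relevant, k=4))     # ~0.67
```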
Metrics and Brand Voice
- Defining metrics and priorities is not a one-time exercise but a continuous process that involves updating the system to adapt to evolving information and data drift (5m13s).
- A list of 100 brand voice characteristics can be used to measure a language model's responses, with company-specific guidelines that can be quantified, for example by searching for certain keywords (6m53s).
- Companies often struggle to define a brand response because there is no well-defined metric for qualities like kindness, so user or human evaluation is needed to reach statistically significant conclusions (7m0s).
- Programmatic guidelines can measure specific aspects of a brand's voice, such as how to respond when a customer orders a product that is not available at the store (7m37s) (see the sketch at the end of this section).
- In a multi-model world, the response to complex scenarios could involve escalating to a more expensive model or connecting the customer to a human (8m21s).
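A minimal sketch of programmatic brand-voice checks follows. The specific rules (greeting, user name, banned phrases, offering an alternative when an item is unavailable) are illustrative; each company would encode its own guidelines, and each check can be tracked as its own micrometric.

```python
import re


def brand_voice_checks(response: str, user_name: str) -> dict:
    """Run a handful of illustrative programmatic brand-voice checks on one response."""
    banned_phrases = ["no can do", "that's not my problem"]
    return {
        "greets_user": bool(re.match(r"^(hi|hello|hey)\b", response.strip(), re.IGNORECASE)),
        "mentions_user_name": user_name.lower() in response.lower(),
        "avoids_banned_phrases": not any(p in response.lower() for p in banned_phrases),
        # If the item is out of stock, the brand guideline here is to offer an alternative.
        "offers_alternative_when_unavailable": "out of stock" not in response.lower()
            or "instead" in response.lower(),
    }


checks = brand_voice_checks(
    "Hello Dana, that item is out of stock, but we can order it for you instead.", "Dana"
)
pass_rate = sum(checks.values()) / len(checks)  # each check becomes a trackable micrometric
```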
Prompt Engineering and Optimization
- A balance between short-term and long-term improvements can be achieved by using techniques like prompt engineering, which is still an immature field that requires more rigor and measurement (8m32s).
- Auto-optimization frameworks such as DSPy can be used to define a training set, a test set, and an optimizer to find a good set of prompts, and assertions can validate that a prompt's output is correct (9m1s) (see the sketch after this list).
- Prompt engineering is expected to keep evolving, with a focus on building rigor and on sampling multiple responses to evaluate a prompt's effectiveness (9m22s).
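This is not DSPy's actual API; the sketch below only illustrates the general idea behind such frameworks under stated assumptions: score candidate prompt templates against a small training set with a metric, keep the best one, and report its held-out score, with an assertion-style validity check on each output. `generate`, `metric`, and the data format are hypothetical.

```python
def score_prompt(prompt_template, eval_set, generate, metric):
    """Average a metric over an evaluation set for one candidate prompt template."""
    scores = []
    for example in eval_set:
        output = generate(prompt_template.format(**example["inputs"]))
        assert output.strip(), "empty responses should fail fast"  # assertion-style validity check
        scores.append(metric(output, example["expected"]))
    return sum(scores) / len(scores)


def optimize_prompt(candidates, train_set, test_set, generate, metric):
    """Pick the candidate prompt that scores best on the training set,
    then report its held-out test score so the choice is not overfit."""
    best = max(candidates, key=lambda p: score_prompt(p, train_set, generate, metric))
    return best, score_prompt(best, test_set, generate, metric)
```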
User Experience and Design Patterns
- In end-user applications, presenting multiple responses to the user can be useful in specific contexts, such as content evaluation, where the user is guiding the model and providing human preference feedback (9m43s).
- The design patterns for generative AI are still in the early stages, and more research is needed to develop effective user experience (UX) patterns (10m5s).
- Many current chatbots and language models still use methods from 20 years ago, and consistency across answers is a concern, especially when upgrading or switching between models, since customers can be confused by receiving different responses to the same question (10m16s).
- The amount of context used can affect the model's performance, and intentional choices must be made about what factors to consider, such as time of day or user behavior, to improve the customer relationship (10m34s).
Model Upgrades and Performance
- Model upgrades require evaluation to mitigate regressions, and a test suite is essential for tracking changes in model performance (11m7s) (a regression-check sketch follows at the end of this section).
- Upgrading from one model version to another can significantly change performance even when it is nominally the same model, as seen when moving from ChatGPT's original version to the November version caused a roughly 10% drop in accuracy (11m13s).
- Model providers often do not provide transparency about changes made to the model or updated metrics, making it necessary for developers to conduct their own evaluations (11m38s).
- Having multiple models can lead to conflicts, as different versions may perform better on certain tasks or have specific benefits, such as cost savings, requiring careful evaluation and migration strategies (11m58s).
- Large language models can be complex and non-deterministic, making it challenging to understand the impact of changes, even when using traditional model training methods (12m48s).
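A minimal sketch of such a regression suite, assuming two callables wrapping the current and candidate model versions, a fixed evaluation set, and an illustrative tolerance threshold:

```python
def run_regression_suite(generate_old, generate_new, eval_set, metric, max_drop=0.02):
    """Compare a candidate model version against the current one on a fixed
    evaluation set and flag the upgrade if the score drops more than `max_drop`.

    `generate_old` / `generate_new` are callables wrapping the two model versions;
    the names and threshold are illustrative.
    """
    def average(generate):
        return sum(metric(generate(ex["prompt"]), ex["expected"]) for ex in eval_set) / len(eval_set)

    old_score, new_score = average(generate_old), average(generate_new)
    regression = old_score - new_score
    return {
        "old_score": old_score,
        "new_score": new_score,
        "safe_to_upgrade": regression <= max_drop,
    }
```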
Micrometrics and Content Moderation
- Micrometrics can be useful for evaluating model performance, and a crawl-walk-run approach can be used to start with macrometrics and gradually move to more specific metrics (13m18s).
- Specific metrics can be used to evaluate LLM systems, such as measuring retrieval separately from generation in a RAG pipeline, and these micrometrics enable more targeted improvements (13m34s).
- Content moderation policy is a common source of metrics: the goal is to decide how to respond when a user says something inappropriate, which can be measured by tracking how many inappropriate or out-of-domain questions users ask (13m42s).
- Different industries make different trade-offs between false positives and false negatives, and measuring how often each occurs can be an important metric (14m30s) (see the sketch below).
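A minimal sketch of how that trade-off could be tracked, assuming a sample of production moderation decisions that were later human-reviewed; the data format is hypothetical.

```python
def moderation_confusion(decisions):
    """Tally false-positive and false-negative rates for a moderation classifier.

    `decisions` is a list of (flagged, actually_inappropriate) booleans, e.g.
    collected from production samples that received a human review label.
    """
    fp = sum(1 for flagged, bad in decisions if flagged and not bad)
    fn = sum(1 for flagged, bad in decisions if not flagged and bad)
    total_good = sum(1 for _, bad in decisions if not bad)
    total_bad = sum(1 for _, bad in decisions if bad)
    return {
        "false_positive_rate": fp / total_good if total_good else 0.0,  # safe content wrongly blocked
        "false_negative_rate": fn / total_bad if total_bad else 0.0,    # inappropriate content let through
    }
```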
Voiceflow Platform Overview
- Voiceflow is an AI orchestration platform that helps customers build workflows for tasks such as customer support and lead generation, with a focus on team collaboration, control, and hosted solutions (14m39s).
- Voiceflow's platform allows users to define business logic and build out workflows using a low-code approach, and they also provide features such as content moderation and event logging (15m13s).
- The hosted aspect of the platform lets users build prototypes quickly and launch them to production, which is useful for tasks such as building a new bank account opening workflow (16m0s).
- The platform allows users to instrument the system in different ways, such as using different prompts or API calls, which can be used to track the milestones users pass through in a workflow (16m29s).
- The platform provides useful analytics, such as tracking where users drop off in a workflow, which can be used to improve the overall user experience (16m51s) (a drop-off tracking sketch follows below).
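This is not Voiceflow's API; it is a generic sketch of how milestone events could be turned into drop-off analytics for something like the bank-account-opening workflow mentioned above. The event log and milestone names are hypothetical.

```python
from collections import Counter

# Hypothetical event log: each entry is (session_id, milestone reached in the workflow).
events = [
    ("s1", "started"), ("s1", "identity_verified"), ("s1", "account_opened"),
    ("s2", "started"), ("s2", "identity_verified"),
    ("s3", "started"),
]

milestones = ["started", "identity_verified", "account_opened"]
reached = Counter(milestone for _, milestone in events)

# Funnel view: how many sessions reach each milestone, and where they drop off.
for prev, curr in zip(milestones, milestones[1:]):
    drop_off = 1 - reached[curr] / reached[prev]
    print(f"{prev} -> {curr}: {reached[curr]}/{reached[prev]} sessions ({drop_off:.0%} drop-off)")
```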
Custom Functions and Multimodal Models
- Custom functions and components can be written and reused for specific business requirements, allowing developers to build tailored user experiences (16m58s).
- Multimodal models are emerging, but they are not replacing existing tools; instead, they provide another option for people to use, and their application depends on the specific scenario (17m30s).
- A platform can be used to orchestrate different models and tools, for example choosing a model to process user-uploaded receipts and then verifying the extracted information against an ERP or API-based system (17m59s) (see the sketch at the end of this section).
- The platform allows for balancing general knowledge from AI systems with the need to program specific workflows and constrain users to a particular process (18m19s).
- The platform provides flexibility, enabling users to build differently and create custom workflows, from simple user input loops to more complex, strictly defined requirements (18m36s).
- In certain industries, such as banking, specific legal policies or terms and conditions must be output verbatim, requiring a more controlled workflow and less interpretation by large language models (19m9s).
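A sketch of that orchestration pattern under stated assumptions: `vision_model` and `erp_lookup` are hypothetical callables, and the prompt and output format are illustrative. The point is that the model's extraction is verified against a system of record rather than trusted on its own, with escalation to a human when verification fails.

```python
import re


def process_receipt(image_bytes, vision_model, erp_lookup):
    """Extract a total from a receipt with a multimodal model, then verify it
    against a system of record instead of trusting the model's output."""
    extracted = vision_model(
        image=image_bytes,
        prompt="Return the order number and total amount as 'order=<id> total=<amount>'.",
    )
    match = re.search(r"order=(\S+)\s+total=([\d.]+)", extracted)
    if not match:
        return {"status": "escalate_to_human", "reason": "could not parse model output"}

    order_id, claimed_total = match.group(1), float(match.group(2))
    record = erp_lookup(order_id)  # ground truth from the ERP / API-based system
    if record is None or abs(record["total"] - claimed_total) > 0.01:
        return {"status": "escalate_to_human", "reason": "mismatch with system of record"}
    return {"status": "verified", "order_id": order_id, "total": record["total"]}
```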
AI Product Development and Iteration
- Many people start by building a basic product and then refine it based on user feedback and production data, treating AI as a product that requires ongoing development and improvement (19m43s).
- Launching an AI-powered product is not a one-time event; it requires ongoing evaluation and adaptation to changing business needs and user behaviors (20m1s).
- The technology landscape is constantly evolving, and it's essential to update and adapt, just like software, using agile release processes so that artifacts stay living rather than static (20m9s).
- Building these experiences is interesting, especially when working with a large number of users, including 4 million free users and 60 enterprises, each building different things (20m29s).
- Integrating with other APIs can still be a challenge, and having a natural language interface might make it easier, but tool use is still immature for what is needed (20m42s).
- Defining business processes and making specific API calls can be more efficient than using natural language queries, which can be too vague (20m59s).
LinkedIn Courses and Learning Preferences
- Creating LinkedIn courses on various topics, including prompt engineering, AI pricing, and grounding techniques, has been a successful experience, with a wide range of attendees (21m42s).
- The process of creating courses on LinkedIn involves working with a content manager to define priorities and interests, creating a table of contents, writing the content, and collaborating with a producer to bring the course to life (22m25s).
- The courses attract a variety of people, including those who are curious about AI, with some having more advanced experience, but the platform is not yet known for expert-curated courses (22m44s).
- The platform is available to educational providers, universities, libraries, and companies, which use it as their Learning Management System (LMS), resulting in a diverse range of attendees (23m6s).
- The best-performing courses tend to be introductory ones, such as "Intro to GPT-4," as people are looking for a basic understanding of the AI field (23m25s).
- Individuals have different learning preferences, with some opting for courses over self-directed learning through resources like ChatGPT (23m37s).
Conclusion
- The conversation concludes with appreciation for the guest's participation in the podcast and their presence at QCon San Francisco (23m42s).