Mind Your Language Models: an Approach to Architecting Intelligent Systems
Introduction and Background
- Large language models have gained significant attention in the past year, with models like ChatGPT and Gemini demonstrating impressive capabilities due to their massive size and extensive training data (42s).
- The journey of architecting systems with large language models began before their recent popularity, with a lot of work building on top of existing research and technologies (38s).
- DALL-E had already drawn attention earlier in 2022, and ChatGPT's release in November 2022 generated significant buzz and speculation about the potential for tech giants to monopolize AI (55s).
- However, in February 2023, it became clear that open-source alternatives and fine-tuning on models like LLaMA would play a crucial role in the development of AI, with companies like Meta and NVIDIA investing heavily in the technology (1m20s).
The Rise of AI and its Impact
- The revenue of companies utilizing AI has increased significantly, with OpenAI's annualized revenue growing from $200 million to $1.6 billion and NVIDIA's quarterly revenue reaching $18.1 billion (2m12s).
- The use of AI tools like ChatGPT and Gemini has become widespread across various industries, including marketing and sales, with many organizations recognizing the potential for AI to provide a competitive advantage (2m25s).
- However, the increasing use of AI also raises concerns about risks, such as deepfakes, which have already led to significant financial losses in some cases (2m33s).
- As AI continues to occupy the spotlight, organizations are tasked with finding ways to safely and reliably integrate these powerful technologies into their operations without sacrificing speed or innovation (2m59s).
- The global AI market is expected to grow to $1.8 trillion, with nine out of ten organizations believing that AI will provide a competitive advantage and roughly four out of five companies considering AI their top priority (3m34s).
- Despite the potential risks, investments in AI continue to grow across the entire landscape of machine learning, AI, and data (4m4s).
Adoption and Deployment of Language Models
- Many organizations understand the risks associated with language models and are working to address them, while others are unaware of how to proceed and are waiting for others to take the lead (4m17s).
- Over 60% of organizations are either running proofs of concept (POCs) or have already adopted language models for widespread use with their customers (4m37s).
- Deploying AI in production is complex and demands more than POC-level thinking; striking the right balance is crucial (5m0s).
- The AI landscape is constantly changing, but not everything requires attention, and the goal is to build a stack that allows for flexibility and innovation while driving value for customers (5m23s).
Talk Objectives and Speaker Introduction
- The purpose of the talk is to discuss the potential pitfalls of language models and share experiences from the past 16-18 months of bringing them into production (5m38s).
- The speaker will share their journey, including the effort, time, money, and meltdowns that occurred, and will provide an opinionated perspective on the topic (5m50s).
- The speaker is Nischal, VP of Data Science and ML Engineering at Scoutbee, with over 13 years of experience in the ML space, primarily in insurance and supply chain (6m20s).
Key Topics and Case Study Focus
- The talk will cover enabling language models in products, improving conversation quality, improving trust in results, and improving data coverage and quality (6m34s).
- The case study will focus on the company's experience working in the supply chain space, helping organizations like Unilever and Walmart with supplier discovery (7m11s).
- Supplier discovery is a complex task that requires more than just a Google search, as it involves understanding nuances in the supply chain space (7m24s).
- The supply chain has a significant impact on everyone, and every manufacturer is dependent on other manufacturers, making it a critical area of focus (7m55s).
- Companies are looking for manufacturers to collaborate with to mitigate risks and handle disruptions, which requires finding innovative manufacturers (8m0s).
Integrating LLMs into an Existing Product
- The challenge was to integrate large language models (LLMs) into an existing product that has been a market leader for 6-7 years and has multiple generations (8m11s).
- LLMs are being sought after for their efficiency and effectiveness, enabling organizations to do things they couldn't do before, such as solving complex problems and providing dynamic solutions (8m32s).
- The integration of LLMs allows users to ask questions or provide problem statements, and the model figures out what data to bring in to help solve the problem (9m1s).
- To enable LLMs in their product, the company connected their application to ChatGPT's APIs and performed prompt engineering with LangChain; a minimal client sketch follows this list (9m43s).
- The company's existing stack included knowledge graphs and smaller machine learning models, which were used for distributed inferencing with Spark (10m9s).
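The stage-one wiring described above can be sketched roughly as follows. This is not the team's actual code: the model name, prompt wording, and chain composition are assumptions, written against current LangChain idioms.

```python
# Minimal sketch: routing a product's user query to a hosted chat model via
# LangChain. Prompt text and model choice are illustrative, not from the talk.
from langchain_openai import ChatOpenAI                     # pip install langchain-openai
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a supplier-discovery assistant. Answer briefly and only about "
     "suppliers, certifications, and sourcing."),
    ("human", "{question}"),
])

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)      # API key read from OPENAI_API_KEY
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"question": "Find sustainable coffee bean suppliers in South America."}))
```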
Initial Challenges and User Feedback
- However, the integration of LLMs revealed several issues, including a lack of domain knowledge, hallucinations, and concerns about enterprise security and privacy (10m26s).
- The foundational models did not understand the domain-specific concepts, leading to irrelevant conversations and fabricated results (10m44s).
- The company's customers were not satisfied with the new system, citing its chattiness and lack of relevance, and some even requested to go back to the old system (11m10s).
Stage One: Market Validation and Initial Challenges
- The initial goal was to determine if there's a market for bringing large language models (LLMs) to a product, and user feedback indicated a big market for the new generation of the product, with users enjoying the experience and wanting more (12m24s).
- To proceed, several challenges needed to be addressed, including efficiency, domain adaptation, removing hallucinations, building trust and reliability, and implementing guardrails (12m33s).
Example of Unsatisfactory LLM Response
- A foundational model was used to test the product, and it provided a response to a user's query about finding sustainable, fair-trade-certified coffee beans from South America (13m7s).
- However, the response was deemed unsatisfactory, as it didn't provide a direct answer to the user's question and was too chatty (13m15s).
Stage Two: Hosting and Domain Adaptation
- In stage two, the focus was on determining if hosting a large language model was feasible, considering privacy and data security concerns (13m30s).
- An open-source LLM, LLaMA 13B, was integrated into the product and exposed through an API built with FastChat; a client sketch follows this list (13m56s).
- Integrating LLaMA 13B required adapting the original prompt engineering work done for ChatGPT, which increased the cost and complexity of the project (14m49s).
- Domain adaptation was identified as a significant challenge in enabling large language models in an organization, and the cost of retraining or fine-tuning a large language model was highlighted, with OpenAI reportedly spending around $63 million to ship GPT-4 (15m49s).
- The cost of using a large language model (LLM) API from the big providers can be around $30 per million tokens, whereas retraining a foundational model requires a lot of data and infrastructure, resulting in a significant cost difference (16m2s).
- Domain adaptation can be achieved without retraining an entire model, through approaches such as zero-shot prompting, in-context learning, few-shot prompting, and building agents (16m26s).
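As a rough illustration of stage two, the sketch below calls a self-hosted open model through an OpenAI-compatible endpoint (FastChat can expose one) and adapts it to the domain with in-context examples instead of retraining. The endpoint URL, port, model name, and example prompts are assumptions.

```python
# Minimal sketch: talking to a locally served open model through an
# OpenAI-compatible API, with a few in-context examples for domain adaptation.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local server, no real key

few_shot = [
    {"role": "system", "content": "You help procurement teams find suppliers. Be concise."},
    # In-context examples teach domain vocabulary without any fine-tuning:
    {"role": "user", "content": "Who can supply FSC-certified packaging in Poland?"},
    {"role": "assistant", "content": "Searching packaging suppliers, region=Poland, certification=FSC."},
]

resp = client.chat.completions.create(
    model="llama-13b",            # whatever name the local model worker registered under
    messages=few_shot + [
        {"role": "user", "content": "Find fair-trade coffee bean suppliers in Brazil."}
    ],
    temperature=0,
)
print(resp.choices[0].message.content)
```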
Building Agents and Implementing Guardrails
- An agent is a system that can be given instructions, examples, and the capability to make requests to different systems, allowing it to understand tasks, pick relevant data, make queries, summarize answers, and provide results (16m40s).
- Feeding domain knowledge into an agent and using heavy prompt engineering can enable the agent to perform tasks effectively, but this process can be challenging and requires significant effort (17m16s).
- Guardrails are necessary to validate whether an LLM is making the right decisions, as it cannot be trusted entirely; they can be implemented with open-source libraries such as NVIDIA's NeMo Guardrails or Guardrails AI, or hand-rolled (a simple sketch follows this list) (17m54s).
- Implementing guardrails with a graph-of-thoughts approach can help capture the dynamic nature of business processes and invoke different kinds of guardrails to support users (18m35s).
- The graph-of-thoughts approach involves modelling the entire business process as a graph and invoking different guardrails at any given point in time to ensure the LLM is not misleading the user (19m21s).
- Not having guardrails can lead to real problems, as in the case of Air Canada's LLM-enabled chatbot (19m40s).
- A chatbot agent informed a customer that the airline owed them money, making the airline liable for the chatbot's actions, highlighting the need for guardrails and domain adaptation when enabling agents or large language models (LLMs) (19m51s).
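A hand-rolled guardrail layer might look like the sketch below: validate a draft answer before it reaches the user, blocking fabricated suppliers and unauthorized commitments. The checks, supplier names, and phrases are illustrative; dedicated libraries such as NeMo Guardrails or Guardrails AI cover far richer policies.

```python
# Illustrative guardrail sketch: reject answers that cite unknown suppliers or
# make commitments the business has not authorised the bot to make.
from dataclasses import dataclass

KNOWN_SUPPLIERS = {"Cafebras", "Sancoffee"}                      # would come from the knowledge graph
BANNED_PHRASES = ("we guarantee", "you are owed a refund", "legal advice")

@dataclass
class GuardrailResult:
    allowed: bool
    reason: str = ""

def check_answer(answer: str, cited_suppliers: list[str]) -> GuardrailResult:
    # 1. Block fabricated suppliers: every cited name must exist in our data.
    unknown = [s for s in cited_suppliers if s not in KNOWN_SUPPLIERS]
    if unknown:
        return GuardrailResult(False, f"unknown suppliers cited: {unknown}")
    # 2. Block commitments the bot must not make on the company's behalf.
    lowered = answer.lower()
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        return GuardrailResult(False, "answer makes an unauthorised commitment")
    return GuardrailResult(True)

print(check_answer("Cafebras ships certified beans from Brazil.", ["Cafebras"]))
```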
Addressing Initial Issues and Remaining Challenges
- Issues identified in stage one included the need for guardrails, domain adaptation, building trust and reliability, reducing chattiness, and minimizing hallucinations (20m24s).
- Changes made in stage two addressed some of these issues, including domain adaptation and guardrails, which increased trust and reliability, but hallucinations remained a significant challenge (20m40s).
- Hallucinations affected the trust and reliability of the system, as users received different answers each time they used the system, making it difficult to reuse the system for the same problem (21m6s).
- Users were happier with the quality of conversations with ChatGPT, but wanted the same quality without using ChatGPT, creating a challenge (21m43s).
- Testing agents was a significant challenge, as it was difficult to understand their thought process and debug their actions (22m1s).
Improving Conversation Quality and User Experience
- The conversation with the agent improved, as it was able to understand the user's input and provide relevant information, enhancing the user's experience and leading them down the path of efficiency and effectiveness (22m41s).
- The agent was able to provide information on sustainability certificates without requiring additional training, as it had already seen a large amount of data and was aware of the relevant information (23m35s).
Addressing Hallucinations with RAGS and Chain of Thoughts
- A language model was initially allowed to proceed with its task, but it began creating its own suppliers, which raised concerns about the reliability of the system (23m47s).
- To address this issue, the concept of RAG (Retrieval-Augmented Generation) was introduced, which involves providing the model with data and context to answer questions (24m29s).
- The engineering stack and system grew larger with the implementation of RAG, and the planner and execution layer for LLMs became more complex (24m52s).
- The Chain of Thoughts framework was used to improve the model's reasoning process, allowing it to provide more reliable answers (25m23s).
- This framework involves prompting the model to go through a reasoning process, which can be taught to the model without retraining the entire model (26m37s).
- LangChain, a popular framework in the community, was used to implement the Chain of Thoughts framework (27m18s).
- The new system allows for more robust and reliable conversations with users, but also requires adding new services or technologies to enable it (25m58s).
- The Chain of Thoughts framework provides a roadmap for the model to follow, making it easier to implement and improve the system (27m9s).
- The use of RAG and the Chain of Thoughts framework has improved the overall reliability and trustworthiness of the system (24m24s).
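A minimal sketch of the retrieve-then-reason pattern described above: fetch facts for the query, then prompt the model to reason step by step over only those facts. The retriever is stubbed and the prompt wording is an assumption, not the production prompt.

```python
# Illustrative RAG + chain-of-thought prompt assembly. The retriever returns
# canned facts; a real system would query the knowledge graph or an index.
def retrieve_facts(query: str) -> list[str]:
    return [
        "SupplierA produces arabica coffee beans in Minas Gerais, Brazil.",
        "SupplierA holds a Rainforest Alliance certificate (2023).",
    ]

def build_rag_cot_prompt(query: str) -> str:
    facts = "\n".join(f"- {fact}" for fact in retrieve_facts(query))
    return (
        "Answer using ONLY the facts below. If a needed fact is missing, say so.\n"
        f"Facts:\n{facts}\n\n"
        f"Question: {query}\n"
        "Reason step by step: 1) pick the relevant facts, 2) state what they imply, "
        "3) give a short answer that cites the facts used."
    )

print(build_rag_cot_prompt("Which certified coffee suppliers operate in Brazil?"))
```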
Improving User Input and Query Processing
- Users of a conversational-based system initially typed in keywords, thinking it was similar to Google search, while others provided entire stories, resulting in lengthy passages in the first message rather than a clear problem statement (27m54s).
- To address this, a system was developed to understand what the user wants to do, transform it into a standard form, and potentially split it into several queries rather than one single query, utilizing the Chain of Thought approach (28m33s).
- This approach breaks down the problem into smaller parts, makes data calls to fetch relevant data, and uses this data to find an answer (28m55s).
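A minimal sketch of this "understand, standardise, split" step: ask the model to turn a free-form message into one or more structured sub-queries, then validate the JSON before making data calls. The schema and example output are assumptions for illustration.

```python
# Illustrative query normalisation: prompt the LLM for structured sub-queries
# and validate the result before routing it to downstream data calls.
import json

REWRITE_PROMPT = """Rewrite the user's message as a JSON list of search queries.
Each query has: "intent", "product", "region", "constraints" (a list of strings).
Split the message into several queries if the user asks for more than one thing.

User message: {message}
JSON:"""

def parse_subqueries(llm_output: str) -> list[dict]:
    queries = json.loads(llm_output)
    if not isinstance(queries, list) or not all("intent" in q for q in queries):
        raise ValueError("LLM output did not match the expected schema")
    return queries

example_output = (
    '[{"intent": "find_supplier", "product": "coffee beans", '
    '"region": "South America", "constraints": ["sustainable", "fair-trade"]}]'
)
print(parse_subqueries(example_output))
```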
Observing System Performance and Improving Metrics
- Observing the system's performance is crucial, focusing on two aspects: the Large Language Model's (LLM) response times and the number of tokens it processes, as well as the conversation and result (29m19s).
- The LLM's performance can be affected by its size, leading to slower response times, and efforts are being made to improve its efficiency and run it on smaller GPUs (29m32s).
- Evaluating the conversation's effectiveness involves understanding if the LLM understood the conversation, picked up relevant data, and provided accurate answers, using metrics such as context precision and recall (30m10s).
- The RAGAS framework, an open-source tool, helps generate scores for both generation and retrieval quality, providing insights into improving the system (30m49s).
- The introduction of RAG with knowledge graphs as the data source significantly reduced hallucinations, and storing all data and results enabled better observability and product metrics (31m19s).
- Eliminating agents made testing easier, and the system's performance improved overall (31m45s).
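The RAGAS evaluation mentioned above might look roughly like the sketch below. Column names and metric imports follow ragas 0.1.x and can differ between versions; the sample question, answer, and contexts are invented, and an evaluator LLM (an OpenAI key by default) is required to compute the scores.

```python
# Illustrative RAGAS evaluation of one question/answer pair with its retrieved contexts.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

eval_data = {
    "question": ["Which certified coffee suppliers operate in Brazil?"],
    "answer": ["SupplierA, which holds a Rainforest Alliance certificate."],
    "contexts": [[
        "SupplierA produces arabica coffee beans in Minas Gerais, Brazil.",
        "SupplierA holds a Rainforest Alliance certificate (2023).",
    ]],
    "ground_truth": ["SupplierA"],
}

result = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(result)  # per-metric scores between 0 and 1
```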
Addressing System Challenges and Enhancing Data Quality
- The initial system had challenges, including the need for users to have a deeper conversation with the data itself, which was not enabled at the time, and higher latency with more users, resulting in slower response rates and annoyed users (31m59s).
- To address these challenges, the system was modified to reduce hallucinations with Retrieval-Augmented Generation (RAG), which forces the system to use data and provide citations with data provenance, allowing users to verify the information (32m48s).
- With RAG, the system provides more transparent and trustworthy answers, such as telling users that a coffee supplier is based in Brazil and providing a source for the information, and also informs users if a supplier lacks sustainability certificates (33m1s).
- However, this led to users requesting more information, such as revenue and delivery times, which the system did not have, highlighting the need to expand and improve the data (33m52s).
- The effectiveness of an LLM-based system depends on the quality of the data behind it, and expanding and improving that data is crucial for the system's success (34m10s).
- To enhance and scale the data, the system uses a knowledge graph as a system of records, integrating data from different domains, partners, and customers, and providing a semantic understanding of the data (34m51s).
- The knowledge graph allows for the integration of different data fields, such as revenue, risks, and customer data, and enables the system to understand how the data is related (35m27s).
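A toy illustration of the knowledge-graph-as-system-of-record idea: suppliers, products, certificates, and financial facts as typed nodes and relations, with provenance on every edge so answers can cite their source. networkx and all node values here are placeholders; a real deployment sits on a graph database with a curated ontology.

```python
# Illustrative only: a tiny knowledge graph with typed relations and per-edge provenance.
import networkx as nx

kg = nx.MultiDiGraph()
kg.add_node("SupplierA", type="supplier", country="Brazil")
kg.add_node("Coffee beans", type="product")
kg.add_node("Rainforest Alliance", type="certificate")

kg.add_edge("SupplierA", "Coffee beans", relation="produces", source="supplier_website")
kg.add_edge("SupplierA", "Rainforest Alliance", relation="holds_certificate", source="audit_2023.pdf")
kg.add_edge("SupplierA", "$12M (FY2023)", relation="reports_revenue", source="third_party_financials")

# Every edge carries provenance, so the conversational layer can cite sources.
for _, target, data in kg.out_edges("SupplierA", data=True):
    print(f"SupplierA --{data['relation']}--> {target}  (source: {data['source']})")
```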
Data Operations and Knowledge Graph Management
- The data operation part of the system required maintaining control over the explainability and provenance of the data, which led to exploring embeddings and knowledge graphs (35m52s).
- Embeddings were found to be relatively faster to work with, but explaining them to enterprise users and correcting them posed challenges, leading to a focus on knowledge graphs for the time being (36m0s).
- The current ontology of the knowledge graph was designed about 2.5 years ago, but surprisingly, large language models (LLMs) can be used to design ontologies for specific domains, potentially reducing the effort required (36m51s).
- The knowledge graph was populated using a transformer-based model, but the quality of the data needed to be higher, leading to the use of a superior LLM to generate high-quality training data (37m35s).
- This approach reduced the effort required to generate annotated data by 10-20 times, as humans were only needed to validate the data generated by the LLM (38m46s).
- The use of a smaller LLM model, adapted to a specific task, was chosen for ease of operation and economical reasons, as running massive models can be expensive and require a lot of GPUs (39m1s).
- The availability of GPUs is currently a challenge, with difficulties in obtaining them across regions, including Frankfurt, Ireland, and Northern Virginia (39m20s).
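The LLM-assisted data generation step could be sketched as follows: a larger model is prompted to emit (subject, relation, object) triples, malformed lines are dropped, and the rest is queued for human validation before training the smaller model. The prompt and output format are assumptions, not the team's actual pipeline.

```python
# Illustrative triple extraction: a strong LLM generates candidate knowledge-graph
# facts; humans validate them; the validated set trains a smaller, cheaper model.
EXTRACTION_PROMPT = """Extract (subject, relation, object) triples from the text.
Return one triple per line as: subject | relation | object

Text: {passage}
Triples:"""

def parse_triples(llm_output: str) -> list[tuple[str, str, str]]:
    triples = []
    for line in llm_output.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:                    # drop malformed lines
            triples.append((parts[0], parts[1], parts[2]))
    return triples

sample = "SupplierA | produces | arabica coffee beans\nSupplierA | located_in | Brazil"
for triple in parse_triples(sample):
    print(triple)   # queued for human validation before use as training data
```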
Improving Semantic Search and Data Expansion
- The problem of semantic search is complicated by the fact that people interacting with the system may use different language, even when working on the same problem (39m41s).
- To improve the range of language that users can interact with, a knowledge graph was expanded using a large language model (LLM) to generate facts from data, which were then converted into the knowledge graph with their synonyms and different ways of asking for the data (39m55s).
- The LLM was used to generate training data for smaller models and to augment and expand the data within the domain, and third-party data providers were also worked with to get financial information, risk information, and more (40m41s).
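A small sketch of the vocabulary-expansion idea: ask an LLM for alternative phrasings of a concept and map each back to the canonical knowledge-graph node, so semantic search matches however users phrase the request. The prompt and the generated phrases are illustrative.

```python
# Illustrative synonym expansion for semantic search over the knowledge graph.
SYNONYM_PROMPT = """List 5 short phrases a procurement user might type when they
mean "{concept}". One per line, no numbering."""

def index_synonyms(concept: str, llm_output: str, index: dict[str, str]) -> None:
    # Map every alternative phrasing back to the canonical knowledge-graph node.
    for phrase in llm_output.strip().splitlines():
        index[phrase.strip().lower()] = concept

search_index: dict[str, str] = {}
index_synonyms(
    "sustainability certificate",
    "eco label\ngreen certification\nenvironmental accreditation\n"
    "sustainable sourcing badge\nESG certificate",
    search_index,
)
print(search_index["green certification"])   # -> "sustainability certificate"
```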
Scaling the System and Addressing Engineering Challenges
- The engineering problem of scaling this to millions and billions of web pages and documents was a challenge, as it involved big data, big models, and big problems, including orchestrating and running the system (41m5s).
- The machine learning (ML) pipelines had to run LLM workloads, which had a big impact on throughput and cost, and data scientists wanted to run experiments at scale, but the ML pipelines were not observable and did not use infrastructure efficiently (41m35s).
- The challenges of running Spark pipelines alongside LLMs also had to be addressed: the data science team was not Spark-aware while the ML engineering team was, requiring a translator between the two worlds (42m31s).
- Spark, written predominantly in Java and Scala, was a poor fit for data scientists used to working with scientific packages in Python, and observing and understanding Spark failures was a challenge (43m9s).
- To address these challenges, the entire ML and LLM Ops platform was changed, and a universal compute framework called Ray was introduced for ML, LLM, and data workloads (43m57s).
- Ray, an open-source project from UC Berkeley, was used to host and run large language models, enabling optimized performance on smaller GPUs and faster execution with minimal management required (44m8s).
- Ray also enabled the use of decorators to scale code and run on massive infrastructure, making it easier for data science teams to manage large data workloads and pipelines (45m1s).
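A minimal Ray sketch of the pattern described above: decorate ordinary Python functions so the same code fans out across a cluster for data and model workloads. The resource request and the enrichment stub are illustrative.

```python
# Illustrative Ray usage: parallelise a per-record enrichment step across workers.
import ray

ray.init()  # on a cluster, ray.init(address="auto") connects to the head node

@ray.remote(num_cpus=1)          # LLM inference tasks could request num_gpus=1 instead
def enrich_record(record: dict) -> dict:
    record["enriched"] = True    # placeholder for running a (small) model on the record
    return record

records = [{"supplier": f"S{i}"} for i in range(1000)]
futures = [enrich_record.remote(r) for r in records]   # scheduled in parallel
results = ray.get(futures)                             # gather results when done
print(len(results), results[0])
```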
Improved User Experience and Data Enrichment
- The outcome of using this framework was the ability to provide users with more detailed information, such as revenue data, and to design a Chain of Thought prompt to help users obtain missing data by drafting emails to suppliers (45m30s).
Considerations for LLM Adoption and ROI
- It is essential to carefully consider whether a product warrants the use of a large language model (LLM), as they come with significant costs, including upskilling, running, and maintaining the model, as well as managing failures and aiming for continuous improvement (46m51s).
- LLMs are not a silver bullet and still require high-quality data, data contracts, DataOps, and management of the entire data life cycle (47m11s).
- It is crucial to compute the return on investment (ROI) for LLMs, as they require significant time, money, and people, and to measure everything to ensure efficiency and effectiveness (47m24s).
- LLMs should be used to augment human capabilities, not replace them, and the value of guardrails and domain adaptation should not be underestimated (48m7s).
User Experience, Team Care, and Upskilling
- To bring out the best in Large Language Models (LLMs) and user interaction, it is essential to invest in the user experience; significant UX work is needed for the product to add value (48m12s).
- Taking care of the team is crucial, as they may experience prompt engineering fatigue, burnouts, and fear of being replaced by LLMs, which can lead to meltdowns (48m27s).
- Embracing failure is necessary, as there will be many failures before LLMs are ready for production, and actively investing in upskilling is vital to build a support system for the team (48m46s).
- The field of LLMs is new, and there are many resources available, including free content, workshops, and upskilling opportunities that can help the team stay updated (48m55s).
System Design and Sustainable Improvements
- System design is critical, and it is essential to think about sustainable improvements, designing systems to work with flexibility and reliability, and implementing version control for everything, including prompts, data, agents, APIs, and metadata (49m12s).
- It is crucial to consider the entire system, rather than just relying on LLMs to solve problems, and to think about the relationships between different components, such as the "plus operator" in the equation 1 + 1 = 2 (49m46s).
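As one way to make "version control for everything" concrete, the sketch below keeps prompts as versioned, checksummed artifacts in a registry so any answer can be traced back to the exact prompt text that produced it; the same idea extends to agents, data snapshots, APIs, and metadata. Names and structure are illustrative, not the speaker's system.

```python
# Illustrative prompt registry: versioned templates with content checksums for traceability.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str
    template: str

    @property
    def checksum(self) -> str:
        # Content hash lets logs record exactly which prompt text produced an answer.
        return hashlib.sha256(self.template.encode()).hexdigest()[:12]

REGISTRY: dict[tuple[str, str], PromptVersion] = {}

def register(prompt: PromptVersion) -> None:
    REGISTRY[(prompt.name, prompt.version)] = prompt

register(PromptVersion(
    name="supplier_search",
    version="1.2.0",
    template="Answer using only the provided facts. Question: {question}",
))
print(REGISTRY[("supplier_search", "1.2.0")].checksum)
```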