Meryem Arik on LLM Deployment, State-of-the-art RAG Apps, and Inference Architecture Stack

03 Oct 2024

InfoQ Dev Summit in Boston

  • InfoQ Dev Summit in Boston is an upcoming event where senior software practitioners will share their experiences and practical insights on critical topics such as generative AI, security, and modern web applications, with plenty of time for attendees to connect with peers and speakers at social events (41s).
  • Srini Penchikala is the lead editor for the AI/ML and Data Engineering community at InfoQ and a podcast host, and he will be speaking with Meryem Arik, co-founder and CEO of Titan ML, about the current state of large language models (LLMs) and how to deploy and support them in production (58s).
  • Meryem Arik is the co-founder and CEO of Titan ML, which builds infrastructure to help regulated industries deploy generative AI within their own environments, and she has a background in theoretical physics and philosophy (1m38s).

Meryem Arik and Titan ML

  • Arik's company, Titan ML, aims to close the infrastructural gap between what is possible in the research literature and what enterprises can actually achieve with AI, a gap identified by Arik and her co-founders, who are also physicists (2m22s).

Generative AI and LLMs: Current State and Future Trends

  • Generative AI and LLM technologies are rapidly advancing, with recent developments including Google Gemini updates, generative AI in search, OpenAI's GPT-4o, which can work with audio, vision, and text in real time, and Llama 3 (3m9s).
  • The growth and technological advancement in the LLM industry have been phenomenal since the company first started, with the industry evolving from NLP to LLMs; the company initially worked with GPT-2 models (4m10s).
  • The rate of progression in language model development is enormous, with significant advancements in capabilities, such as generating Shakespeare-like poems and providing real-time audio feedback, making it an exciting time to be in the space (4m32s).
  • Even if LLM innovation stopped today, there is around a decade of enterprise innovation that can be unlocked with current technologies, indicating enormous potential for growth and development (5m1s).
  • The next year is expected to see increasingly impressive capabilities from surprisingly small models, such as the Llama 3 8B model, which was roughly as performant as GPT-3.5, with smaller models providing better and better outputs (5m42s).
  • The quality and number of tokens used to train these models are expected to improve, enabling smaller models to provide better outputs, which is particularly helpful for enterprises looking to scale (6m9s).
  • Emergent technologies and phenomena, such as the GPT-4o model, which can handle multimodal tasks including audio-to-audio conversations, are being developed and are expected to lead to impressive capabilities (6m20s).
  • The next year is expected to see more models at an enterprise-friendly scale, as well as huge models with impressive multimodal abilities (6m41s).

LLMs in Regulated Industries

  • LLMs can be used in various important use cases, especially in regulated industries, to help with tasks such as security, privacy, and compliance, by thinking of them as an intern working under supervision (7m29s).
  • LLMs can be used to automate tasks that would typically require a supervised intern, such as data processing and analysis, and can help organizations unlock their potential (7m50s).
  • Large language models (LLMs) can be used to delegate and break down tasks into smaller ones, with one common use case being as a research assistant or knowledge management system, which is good at searching through large documents and summarizing information (8m16s).
  • LLMs can augment human efforts but will not completely replace them, and they can be used to automate workflows in businesses (8m58s).

Types of LLMs and Model Selection

  • There are base models, and both open-source and proprietary models, such as the Llama models, LLaVA models for vision, GPT models, Google's Gemini models, and Anthropic's Claude models, which can be used for different tasks (9m11s).
  • Base models come pre-trained, which means they already have a good base level of understanding for a specific modality, and developers can work on top of them (9m44s).
  • When choosing a model, developers should consider the modality they care about, such as text, image, or audio, and choose a model that works with that modality (10m34s).
  • Another key decision is whether to use API-based models or self-hosted models, with API-based models being suitable for experimentation or small-scale deployment, and self-hosted models being more suitable for mass scale production or applications with strict compliance or data residency regulations (11m1s).
  • API-based models, such as those from OpenAI and Anthropic, are good places to start for experimentation or small-scale deployment (11m29s).
  • Self-hosted models are more suitable for applications that require data residency, privacy, and cost sensitivity, and that are looking to deploy at a huge scale (11m49s); a minimal code sketch of both options follows this list.
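
As a minimal sketch of how that choice shows up in code, the snippet below calls a hosted API and a self-hosted model through the same OpenAI-compatible client interface. The base URL and model names are placeholders, and it assumes the self-hosted inference server exposes an OpenAI-compatible endpoint, as many open-source servers do.

```python
from openai import OpenAI

# Option 1: API-based model -- fastest way to experiment or ship a small-scale app.
hosted = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = hosted.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarise this contract clause: ..."}],
)
print(resp.choices[0].message.content)

# Option 2: self-hosted model behind an OpenAI-compatible endpoint
# (placeholder URL; assumes the server speaks the OpenAI wire format).
self_hosted = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = self_hosted.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Summarise this contract clause: ..."}],
)
print(resp.choices[0].message.content)
```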

Deploying LLMs: Key Considerations

  • When deploying Large Language Models (LLMs), there are four key considerations: modality, whether to use an API-based or self-hosted approach, the cost to accuracy tradeoff, and whether to use fine-tuned variants for domain-specific or niche applications (11m55s).
  • The cost-to-accuracy tradeoff involves choosing between models of different sizes that perform differently and cost very different amounts to run, with smaller models suited to easy tasks and larger models to complex tasks (12m1s); a simple routing sketch after this list illustrates the idea.
  • Self-hosted models, such as the Llama family of models, offer advantages around data privacy and latency, but also come with their own challenges (13m14s).
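
One way to act on the cost-to-accuracy tradeoff is to route requests by difficulty, sending easy tasks to a small, cheap model and harder tasks to a larger one. A minimal sketch, assuming an OpenAI-compatible client and placeholder model names:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder model names: a small, cheap model for easy tasks,
# a larger, more expensive model for hard ones.
SMALL_MODEL = "gpt-4o-mini"
LARGE_MODEL = "gpt-4o"

def answer(prompt: str, hard_task: bool = False) -> str:
    """Route the request by task difficulty to manage the cost/accuracy tradeoff."""
    model = LARGE_MODEL if hard_task else SMALL_MODEL
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Easy extraction task -> small model; multi-step reasoning -> large model.
answer("Extract the invoice number from: 'Invoice #4821, due 30 days'")
answer("Draft a risk analysis comparing these two supplier contracts ...", hard_task=True)
```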

Challenges of Self-Hosting LLMs

  • There are three main challenges to self-hosting: the quality of the model, the need for additional infrastructure, and access to GPUs (13m42s).
  • The quality of self-hosted models has improved significantly, with open-source models like LLaMA 3 performing as well as API-based models at a better cost tradeoff (14m22s).
  • Self-hosting requires building and maintaining additional infrastructure, including batching servers, model optimization, and function calling, which can be a challenge for teams (14m43s); a simple batching sketch follows this list.
  • Access to GPUs is still a challenge for teams, despite the easing of the GPU crunch since last year (15m23s).
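
As an illustration of the batching-server piece of that infrastructure, the sketch below shows dynamic request batching with asyncio: incoming requests are queued and grouped so the model sees one batched call instead of many single requests. It is an illustrative toy, not any particular product's implementation, and run_model_batch is a stand-in for a real batched model call.

```python
import asyncio

MAX_BATCH_SIZE = 8
MAX_WAIT_SECONDS = 0.02  # wait briefly so a batch can fill up

async def generate(queue: asyncio.Queue, prompt: str) -> str:
    """Client-facing call: enqueue the prompt and wait for its batched result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future

async def batching_loop(queue: asyncio.Queue, run_model_batch):
    """Collect waiting requests into a batch and run them through the model together."""
    while True:
        prompt, future = await queue.get()
        batch = [(prompt, future)]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        outputs = run_model_batch([p for p, _ in batch])  # one batched GPU call
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)

async def main():
    def run_model_batch(prompts):
        # Stand-in for a single batched forward pass through the model.
        return [f"echo: {p}" for p in prompts]

    queue: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(batching_loop(queue, run_model_batch))
    print(await asyncio.gather(*(generate(queue, f"request {i}") for i in range(5))))

asyncio.run(main())
```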

Benefits and Adoption of Self-Hosted LLMs

  • Self-hosting Large Language Models (LLMs) can be beneficial for companies of various sizes, not just large enterprises, as it provides more control over the process and can offer better performance in terms of latency and throughput (16m24s).
  • Midmarket businesses and scale-ups are investing in self-hosted capabilities, often due to performance reasons, whereas large companies are driven by privacy and data residency concerns (16m47s).
  • Self-hosting allows access to a wider range of models, including open-source options, which can be beneficial for building state-of-the-art RAG (Retrieval Augmented Generation) apps (17m19s).

Retrieval Augmented Generation (RAG) Applications

  • RAG is a technique used in the majority of production-scale AI applications, enabling LLMs to call on stored business data, and its key components include the retrieval system, the ranker, and the generator (18m33s).
  • The choice of vector database is not crucial, as they are commoditized and fairly similar, and companies should focus on other aspects of the RAG app (19m9s).
  • Building a state-of-the-art RAG app may involve using a hybrid solution with both open-source and self-hosted components to achieve better performance (17m45s).
  • The characteristics of a state-of-the-art RAG app include the ability to efficiently retrieve and generate data, as well as the ability to integrate with various models and systems (18m24s).
  • The choice of generative model typically doesn't matter as much as other factors, such as data pipelines and embedding search, in determining the quality of output, as "garbage in, garbage out" still applies, even with the best models (19m21s).
  • Two important factors in achieving good output are the document or data processing pipeline and semantic search, with the former covering how text is chunked and how images and tables are parsed so they can be searched efficiently (19m57s).
  • Semantic search is best achieved through a two-stage process: embedding search followed by reranking, with the former good for searching across vast document sets and the latter for refining the most relevant results from a shortlist (20m33s); a sketch of this two-stage retrieval follows this list.
  • For a good RAG application, several models need to be deployed and optimized, including an LLM, a table parser, an image parser, an embedding model, and a reranker model, which can be challenging to orchestrate (21m12s).
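
A minimal sketch of that two-stage semantic search using the sentence-transformers library; the model names and documents are illustrative choices, not ones discussed in the episode. A bi-encoder embeds the corpus for fast candidate retrieval, and a cross-encoder reranks the shortlist before it is handed to the generator.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

# Stage 1: embedding (bi-encoder) search over the whole corpus -- fast but coarse.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "Quarterly revenue grew 12% year over year.",
    "The data-residency policy requires EU-only storage.",
    "Employees may expense travel up to a fixed limit.",
]
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

query = "Where must customer data be stored?"
query_embedding = embedder.encode(query, normalize_embeddings=True)
scores = doc_embeddings @ query_embedding        # cosine similarity (vectors are normalized)
shortlist_ids = np.argsort(scores)[::-1][:2]     # top-k candidates

# Stage 2: cross-encoder reranking of the shortlist -- slower but more accurate.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, documents[i]) for i in shortlist_ids]
rerank_scores = reranker.predict(pairs)
best = shortlist_ids[int(np.argmax(rerank_scores))]
print(documents[best])  # passed to the generator as retrieved context
```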

Simplifying LLM Deployment

  • To simplify deployment, a container has been built to deploy all necessary models in one, allowing for a single endpoint to be hit (21m32s).
  • A recent presentation highlighted tips and tricks for LLM deployment, including the importance of considering deployment requirements and boundaries early on to avoid last-minute scrambling (22m41s).
  • The presentation has been written up into a blog post and will be revised for a future talk, reflecting changes in the landscape (22m18s).
  • When designing an application, it's crucial to know the deployment requirements upfront, as it can significantly change the system's architecture, and having hard and fast deployment requirements like real-time or batched processing, or specific GPU or cost profile needs, should be considered from the start (22m56s).
  • Unless resources are unlimited, it's recommended to quantize the model to 4 bits, as a larger model quantized down to a given size will generally retain more performance and accuracy than a natively small model of the same size (23m30s); a loading sketch after this list shows one common way to do this.
  • Even if a model like GPT-4 is the best, it doesn't necessarily mean it needs to be used for everything, and smaller models can be just as performant, cheaper, and easier to deploy, so the task's difficulty should be considered before choosing a model (24m12s).
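
As one common way to follow the 4-bit quantization advice, the sketch below loads a model with Hugging Face transformers and bitsandbytes NF4 quantization; the model name is a placeholder, and other formats such as AWQ or GPTQ can achieve a similar effect.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model

# NF4 4-bit quantization: weights stored in 4 bits, compute done in bfloat16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spreads layers across available GPUs
)

inputs = tokenizer("Summarise the key risks in this contract:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```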

Titan's Takeoff Inference Stack

  • Titan's Takeoff Inference Stack is an inference software designed to make self-hosting AI apps simple, scalable, and secure, allowing users to run language model applications efficiently on their own private hardware (24m56s).
  • The Takeoff Inference Stack is a Docker container that exposes an API for interacting with a language model, can be deployed on various hardware, and fully integrates with tools like Kubernetes, making it a containerized and Kubernetes-native product (25m40s).
  • The server part of the Takeoff Inference Stack is written in Rust, making it a multi-threaded server, and the inference engine uses techniques like quantization, caching, and inference optimization to make the model run faster (26m17s).
  • The technical stack is written in a combination of Rust, Python, and OpenAI's Triton language, and is packaged with Docker, allowing for hardware agnosticism and compilation to Nvidia, AMD, and Intel hardware rather than being tied to the CUDA stack (26m39s).
  • The stack automates the process of turning raw hardware and a model into a scalable endpoint, supporting various modalities such as generative models, embedding models, re-rank models, and image-to-text models (27m19s).
  • The platform allows for deploying multiple models in one container, automating multi-GPU setup, and providing a declarative interface, saving clients around two to three months per project (28m17s).
  • The stack enables developers to focus on building applications, working with a stable API that supports model swapping, and scaling properly (28m42s); a hypothetical client-side sketch follows this list.
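
To make the single-endpoint idea concrete, here is a hypothetical client-side sketch of hitting one self-hosted container for both generation and embeddings. The URL, routes, and payload fields are invented for illustration and are not the actual Takeoff API.

```python
import requests

BASE_URL = "http://localhost:3000"  # hypothetical address of the self-hosted container

# Generation request (hypothetical route and schema).
gen = requests.post(f"{BASE_URL}/generate", json={
    "model": "chat",
    "prompt": "Summarise the attached policy document.",
})

# Embedding request against the same container (hypothetical route and schema).
emb = requests.post(f"{BASE_URL}/embed", json={
    "model": "embedder",
    "text": ["data residency policy", "expense policy"],
})

print(gen.json(), emb.json())
```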

AI Regulation and Ethical Considerations

  • The number of AI regulations in the United States is predicted to sharply increase, with a need for balance between control and innovation to avoid disrupting progress (29m4s).
  • The challenge lies in finding a balance between responsible AI development and regulation, considering the societal impact of the technology, including disinformation, misinformation, and underemployment (30m1s).
  • Regulation may not be able to stop the negative consequences of AI, as they already exist, and a more nuanced approach may be necessary to address these issues (30m20s).
  • The current state of AI technology raises concerns about its potential impact on humanity, with both closed-source and open-source AI worlds having their own set of scary implications, such as concentrated power in the hands of a few individuals or companies, and the potential for harmful models to fall into the wrong hands (30m25s).
  • The speaker leans towards an open-source AI regime but acknowledges the risks associated with it, such as the creation of deep fakes and other harmful content, and emphasizes the need for governments to be thoughtful and concerned about these issues (31m11s).
  • There is a need for regulatory alignment between major powers, including the EU, Britain, the US, and Asia, to address the misalignment in regulations that is currently causing confusion for players in the field (31m55s).
  • Self-regulation can play a role in addressing these issues, but it may need to come from big platforms, and there is a risk of concentration of power in these platforms, which can be problematic (32m50s).
  • The speaker hopes that lessons have been learned from the past experiences with social media and that the next generation of AI will be more careful in terms of what is allowed on platforms (33m30s).

The Future of AI in Work and Daily Life

  • AI is expected to play a larger role in work and daily life in the future, with predictions suggesting that anything not connected to AI will be considered broken or invisible within the next three years (33m44s).
  • AI is expected to be deeply embedded in everything we do, but the pace of enterprise adoption might be slower than expected, taking more than three years to make significant changes (34m5s).
  • Exciting use cases for AI involve micro improvements in every single workflow, making them 10% more efficient, leading to real transformation over time (34m24s).
  • Automating entire organizations, not just parts, can bring significant benefits and synergies (35m10s).
  • Tech-enabled services companies, such as law firms, are being transformed by AI, with new firms emerging that are Tech-first and AI-first (35m30s).

Resources for Learning about LLMs

  • Recommendations for learning about LLMs include listening to this podcast, reading blogs on the website, and checking out the free quantized models on the Hugging Face page (36m0s).
  • The Hugging Face course is a good resource for starting to learn with LLMs (36m30s).
  • Self-hosted models and deployments require continued rigor and discipline after going into production, not just during the development phase (36m58s).
