Namee Oberst on Small Language Models and How They are Enabling AI-Powered PCs
04 Nov 2024
Introduction to AI and Small Language Models (SLMs)
- Microsoft Reactor provides events, training, and community resources to help developers, entrepreneurs, and startups build on AI technology; more information is available at aka.ms/infoq/reactor (9s).
- Namee Oberst is the founder of Ai Bloks, the company behind the open-source LLM framework LLMWare, used for Gen AI-based applications in the financial services and legal industries (34s).
- LLMWare aims to make generative AI easy to use, deploy, and develop for the enterprise and regulated industries, with a focus on secure, safe, and cost-effective development and deployment (1m40s).
- Namee Oberst's background as a corporate attorney in big law motivated her to start an AI company to automate repetitive tasks, and her experience working with highly regulated industries led her to focus on small language models (2m22s).
- Small language models can perform focused and targeted tasks, such as contract analysis, information retrieval, and providing concrete facts, and have been found to be more reliable and predictable than OpenAI API calls in some ways (3m29s).
- Ai Bloks started working on small language models 20 months ago and launched an open-source project about a year ago, with the goal of enabling mobile devices and edge computing servers to leverage Gen AI solutions that were previously limited to large language models (2m1s).
- The emergence of small language models is enabling AI-powered PCs and allowing for the deployment of Gen AI solutions in industries that were previously limited by the need for large language models (1m7s).
- OpenAI had significant issues with outages and accuracy, prompting the development of alternative models that can be run locally and produce comparable results for specific use cases (3m58s).
- The first models released were called Dragon, which are seven billion parameter models trained to not hallucinate and provide accurate answers with quality scores, aiming to offer users the accuracy of a corporate lawyer (4m37s).
- The use of small language models has evolved to include smaller RAG fine-tunes and function calling models with one to three billion parameters, which can automate workflows and processes (5m11s).
The Rise of Small Language Models (SLMs)
- Large language models (LLMs) have been a major innovation in AI and ML, but they require significant computing resources and raise privacy concerns, limiting their adoption (6m7s).
- Small language models (SLMs) offer an alternative to LLMs, providing a more feasible solution for companies with limited computing resources and data privacy concerns (7m26s).
- The development of SLMs has the potential to make a significant impact on the adoption of AI solutions, offering a more accessible and efficient alternative to LLMs (5m57s).
- SLMs can be used in various business and technical use cases, empowering end-users, software engineers, and devops engineers to be more productive (6m39s).
- The use of SLMs can also address the challenges associated with LLMs, such as the need for significant computing resources and data privacy concerns (7m5s).
- Small language models (SLMs) offer many of the same benefits as large language models (LLMs) but are smaller in size, trained using smaller data sets, and don't require a lot of computing resources (7m31s).
- SLMs are valuable for use cases where there are constraints on resources or a need to localize model execution, and they open up new opportunities to run language models on smartphones and other mobile devices used for edge computing (7m58s).
- SLMs keep data within the device, making them great candidates for use cases where privacy, latency, or other concerns exist in sending data to the cloud (8m17s).
- The definition of a small language model has changed over time, and it is now possible to run models with up to 14 billion parameters on an Intel-based AI PC (9m23s).
- The latest AI PCs, such as those from Intel, are enabling the development of SLMs that can run on commodity laptops, with prices starting from around $1,100 (10m27s).
- The small language models themselves are getting better by the week, making innovation possible, and the hardware on edge devices is also improving (10m6s).
- The innovation in training SLMs is also advancing, with new technologies such as Apple Intelligence being released at the end of the month (11m8s).
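The parameter counts mentioned above can be related to laptop memory with simple arithmetic. The sketch below estimates the memory needed for model weights alone, which is parameters times bits per weight divided by 8. The bit widths shown (16, 8, 4) are assumptions chosen for illustration; the talk does not say which precision was used, and real memory use is higher once activations and the context cache are included.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights only, in gigabytes (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Model sizes taken from the discussion; bit widths are illustrative assumptions.
for size in (1, 3, 7, 14):
    for bits in (16, 8, 4):
        gb = weight_memory_gb(size, bits)
        print(f"{size}B parameters at {bits}-bit: ~{gb:.1f} GB of weights")
```

Under these assumptions, a 14 billion parameter model stored at 4 bits per weight needs about 7 GB for its weights, which is why it can plausibly fit on a well-equipped laptop, while the same model at 16 bits would need about 28 GB.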
Advantages and Applications of SLMs
- SLMs are not a one-size-fits-all solution and are not suitable for every use case, but they are valuable for specific applications where resources are limited or model execution needs to be localized (7m42s).
- A 3 billion parameter model is being developed to run on-device, made possible by pruning a 6.4 billion parameter model, which involves removing unnecessary parts and using distillation techniques to create a smaller version (11m11s).
- The use of small language models is becoming increasingly innovative, making it difficult to say that large language models are necessarily better, as the choice of model depends on the specific use case and task (11m47s).
- Using the right size model for the right problem is crucial, and leveraging small models with proprietary data can result in comparable accuracy to larger models (12m21s).
- A study by Nvidia found that small language models trained with 40 times fewer training tokens can achieve comparable results to larger models, and even outperform them in some cases (12m40s).
- The combination of distillation, pruning, and fine-tuning with proprietary data can result in a model that is 16% better than a model of the same size trained from scratch (13m2s).
- The ability to run these models on a laptop, without the need for a special GPU farm, is democratizing AI and making it more accessible to a wider range of users (13m14s).
- The increased accessibility of AI models is expected to revolutionize the field and open up new use cases, making it possible for people to use AI for day-to-day tasks and micro-tasks (13m43s).
- The development of small language models is also addressing concerns around data privacy and leakage, as users will be able to query their documents and chat with the model without worrying about data security (14m48s).
- Small language models can be used to bring AI to workers' fingertips, increasing productivity by automating various tasks and making information more accessible, especially for those with laptops (15m7s).
- The use of small language models can democratize AI use cases, making them more accessible to a wider range of people and organizations (15m41s).
- Basic use cases for small language models include finding information in documents, such as searching for specific details in an 80-page contract, which can be done locally on a laptop without the need for internet access (15m54s).
- Small language models can also be used for tasks like summarization, SQL queries, and transcription of voice recordings, automating microtasks that are part of day-to-day work life (16m40s).
- The use of small language models on local devices can help address data privacy concerns, as sensitive information does not need to be uploaded to the cloud (17m2s).
- Small language models can be used to automate workflows, such as creating reports for financial analysts, which can include tasks like looking up company information, stock prices, and historical data (17m42s).
- Agent workflows can be created to automate tasks, such as making API calls to services like Yahoo Finance or Wikipedia, to gather information and generate reports (17m59s).
- The promise of AI is to serve as a co-pilot for everyday working life, making it accessible to users on their devices, rather than being a behemoth use case only large companies can access (18m20s).
- Small language models and Retrieval-Augmented Generation (RAG) are a good fit for each other, as they can be combined with a company's private information and used to ask domain-specific questions, making them suitable for commodity hardware (19m0s).
- RAG is not necessarily better with large models, and studies have shown that large language models are not designed for complex RAG; the key to successful RAG deployment is the workflow and the accuracy of the chain from document ingestion to inferencing (19m35s).
- Combining RAG with a specialized embedding model that understands the domain can lead to fast inference speeds, especially with new AI PCs that have integrated GPUs, resulting in performance differences compared to older Intel-based devices (20m14s).
- The difference between running a model with PyTorch and running it with OpenVINO on the GPU can be significant, with a 5-second to 15-second difference for a 21-question inference test on a 1.1 billion parameter model, allowing for subsecond response times with the right inferencing technique and hardware (20m43s).
- AI is becoming more accessible, coming to users at their fingertips, and bringing value to those who can utilize it, rather than sending data to the cloud (21m1s).
- The AI/ML and Data Engineering Trends report, recorded in August, provides more information on the topic and will be linked for reference (21m20s).
- Small language models are being used in applications like auditing and compliance, enabling proactive compliance with regulations by design rather than by accident (21m38s).
- Features related to compliance and auditability, such as AI explainability and guardrails, are crucial for AI-powered PCs (22m3s).
- Small language models shine in AI explainability, allowing for visibility into every single step of the decision-making process (22m16s).
- Chaining workflows with small language models can create decision trees based on model inferences and answers, enabling course correction and fault identification (22m32s).
- Unlike large language models, small language models provide visibility into the decision-making process, allowing for the identification of mistakes and the rationale behind the workflow design (23m24s).
- AI explainability is critical for enterprise applications, enabling the exposure of options considered by the model and the chosen outcome at every step of the process (24m0s).
- Observability and explainability factors are crucial for systematic and observable decision-making, especially when deploying new processes (24m36s).
- The use of small language models captures every interaction, inference, and decision, providing all necessary data for auditability and compliance purposes (24m46s).
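The retrieval step of the RAG pairing described in this section can be sketched in a few lines. The example below is a minimal illustration and not the LLMWare pipeline: it stands in for a real embedding model with a plain bag-of-words vector, and the function names (`embed`, `retrieve`, `build_prompt`) and the sample contract passages are hypothetical. The resulting prompt would then be passed to a locally running small model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, top_k=1):
    """Return the top_k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, passages):
    """Assemble a grounded prompt for a local model (model call not shown)."""
    context = "\n".join(retrieve(question, passages))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical passages, as if parsed from a contract stored on the laptop.
PASSAGES = [
    "The agreement may be terminated by either party with 30 days written notice.",
    "The governing law of this agreement is the State of New York.",
    "Payment is due within 45 days of the invoice date.",
]

print(build_prompt("When is payment due?", PASSAGES))
```

Everything here runs locally with the standard library, which mirrors the point made above: the documents never leave the device, and the quality of the ingestion and retrieval chain matters as much as the size of the model that answers.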
Comparing SLMs with LLMs and Their Combined Use
- Large language models are good at preserving the context of conversations, making them suitable for customer-facing chatbots that can go on for days, due to their large context window that can keep the conversation going and preserve context over long sessions (26m40s).
- Small language models can be used for processes that run in the background, such as hourly, daily, weekly, or monthly tasks, and can be chained together on CPUs to run on inexpensive hardware (27m36s).
- Small language models can be powerful on the edge in devices, doing real-time analytics for finding defects or other use cases in manufacturing processes, and then sending results to the cloud for offline analytics to generate more insights (28m30s).
- The combination of large and small language models can be the best choice for certain use cases, where large models can capture conversations and preserve context, and small models can drive insights and analytics off of that conversation in batch processes (26m1s).
- Large language models are not necessarily better than small models, as small models can be just as good in terms of performance or accuracy, despite being smaller in size (25m51s).
- The use of small language models can provide visibility into every single step in the process, which is beneficial for auditors who want to know all the under-the-hood details (25m30s).
- Small language models can be used for specific use cases, such as on-device analytics at the edge, real-time defect detection, and other use cases in manufacturing processes (28m11s).
- Local modeling can generate insights that can be sent to the cloud to train large language models, creating a feedback loop in which localized small language models and cloud-based large language models make each other better (28m35s).
Cost-Effectiveness and Adoption of SLMs
- Using a large language model for everything can be overkill and a tremendous waste of resources, especially for startups that are financially constrained (29m5s).
- Small language models can be a first step in the learning and adoption process for companies, allowing them to learn the process, solutions, and invest in a bigger solution later (29m38s).
- Enterprises are extremely cost-sensitive and look for cost efficiency, security, safety, and performance, making small language models a viable option (29m52s).
- During the exploration phase, companies can try low-cost, high-performance, and easy-to-run models and then grow into large models (30m30s).
- To adopt or try small language models, the required infrastructure or tools depend on the laptop, with different recommendations for Mac and Intel-based machines (30m48s).
- For Mac users, the GGUF quantized version and a solution like AMA are recommended, while for Intel-based machines, the OpenVINO library is the preferred choice (31m17s).
- The OpenVINO library is supported by LLMWare, but users need to download it themselves to work with the library (31m45s).
- The Microsoft version of the ONNX model is a cross-platform approach that offers middle-of-the-road performance, with negligible differences in performance compared to the GGUF model on a Dell machine (32m4s).
- ONNX is a good option for non-Mac and non-Intel based machines, while GGUF is the fastest way to run inferencing on a Mac, and OpenVINO is the best option for Intel-based machines (32m34s).
- A four-year-old Dell machine can achieve the same performance as an M3 in terms of inference speed using the right software, highlighting the importance of matching software to hardware (32m48s).
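The hardware-to-format guidance in this section can be written down as a small lookup. The sketch below is only an encoding of the recommendations as summarized here (GGUF on Mac, OpenVINO on Intel, ONNX elsewhere); the function name `pick_model_format` is hypothetical, and the detection at the bottom relies on `platform.processor()`, which can return an empty or generic string on some systems, so the CPU description may need to be supplied by hand.

```python
import platform


def pick_model_format(system: str, cpu_description: str) -> str:
    """Map a machine description to the model format suggested in the talk.

    system: value such as "Darwin", "Windows", or "Linux".
    cpu_description: free-text CPU string; only checked for the word "intel".
    """
    if system == "Darwin":
        return "gguf"      # reported as the fastest option on a Mac
    if "intel" in cpu_description.lower():
        return "openvino"  # preferred for Intel-based machines
    return "onnx"          # cross-platform fallback


# Best-effort detection; the CPU string is not reliable on every platform.
print(pick_model_format(platform.system(), platform.processor()))
```

Keeping the choice in one function makes it easy to benchmark each format on the same machine and adjust the mapping if the measured results differ.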
The Future of AI and SLMs
- The development of small language models is enabling AI-powered PCs, allowing for the democratization of AI and making it as ubiquitous as regular software (34m17s).
- Model HQ is a product that allows users to run OpenVINO models without needing to know C++, making it easy to use and accessible to a wider audience (34m36s).
- The future of AI development and PC hardware is expected to lead to better and more ubiquitous AI, with the potential for AI to be as common as regular software in the next three years (35m5s).
- Small language models can operate at the edge, without the need for cloud connectivity, and can be used in a variety of devices, including laptops, smartphones, IoT devices, and sensors in manufacturing plants or autonomous vehicles (33m49s).
- The emergence of small models is accelerating the power of AI PCs and devices, with limitless use cases and potential applications (34m7s).
- AI will become ubiquitous in the future, making it a standard feature in applications, similar to software, with some applications having to explicitly state that they do not include AI (35m29s).
- The power of small language models is increasing due to advancements in distillation, pruning, and combination techniques, allowing them to run on smaller hardware footprints (35m46s).
- The definition of a small language model is changing, enabling the possibility of running large models, such as a 20 billion parameter model, on a laptop (36m6s).
- Smaller models, like three billion parameter models, are becoming increasingly powerful, making them a viable option for many applications (36m26s).
- Small language models and AI-powered PCs are complementary technologies that will drive innovation in each other at a rapid pace (36m39s).
- The price point of AI-powered PCs is relatively inexpensive, considering their powerful capabilities, with options available for around $1,000 to $2,000 (37m35s).
- The Lunar Lake version of AI-powered PCs is becoming available for consumers, offering powerful GPU capabilities (37m12s).
Best Practices and Recommendations for SLMs
- Best practices for using small language models include starting with Microsoft models, such as the Phi series, and being aware of the rapidly evolving nature of these models (38m20s).
- It is recommended to try out small language models and experiment with different options, such as using LLMWare for inference on Mac devices (38m35s).
- Small language models can be easily installed and used, and they are capable of performing various tasks, with the caveat that they may not be as good as larger models for tasks like video or image generation (38m39s).
- These models can be stacked together to create workflows, allowing for workflow automation, and can be used for tasks such as sentiment analysis, named entity recognition, and information extraction (39m24s).
- There are a dozen or so models available that have specific functions and can be chained together in a workflow, and there are also many examples and YouTube videos available to help users get started (40m1s).
- When deploying small language models to production, they are more secure in some ways because they are less susceptible to suggestions and hacks, and are less likely to respond to prompt injection attacks (40m52s).
- Small language models are also great for observability, as they can be easily tested and debugged, and can be swapped out or fine-tuned if they are not performing as desired (41m35s).
- The use of small language models can also make it easier to identify and fix issues, as it is clear where the model is failing or succeeding, and the data set can be examined to identify any problems (41m43s).
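The idea of stacking single-purpose models into a workflow while keeping every step inspectable can be sketched as follows. This is an illustration under stated assumptions and not LLMWare's API: the two "models" (`classify_sentiment`, `extract_amount`) are hypothetical rule-based stand-ins for small function-calling models, and the workflow name and routing rule are invented for the example.

```python
import re


def classify_sentiment(text):
    """Stand-in for a small sentiment model (hypothetical rule)."""
    return "negative" if "late" in text.lower() else "positive"


def extract_amount(text):
    """Stand-in for a small extraction model: first dollar amount, or ""."""
    match = re.search(r"\$\d[\d,]*", text)
    return match.group(0) if match else ""


def review_invoice_note(text):
    """Run the chain and record every input, output, and decision."""
    trace = []

    def record(step, fn, value):
        output = fn(value)
        trace.append({"step": step, "input": value, "output": output})
        return output

    sentiment = record("sentiment", classify_sentiment, text)
    record("amount", extract_amount, text)

    # Branch on the model's answer; the branch itself is logged as a step.
    decision = "escalate" if sentiment == "negative" else "file"
    trace.append({"step": "route", "input": sentiment, "output": decision})
    return decision, trace


decision, trace = review_invoice_note("Payment of $1,200 is late again.")
for entry in trace:
    print(entry)
```

Because each step is a separate call with a logged input and output, a wrong final decision can be traced to the exact step that produced it, and that step's model can be swapped or fine-tuned without touching the rest of the chain.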
SLM Ops and Deployment Considerations
- The deployment of small language models, or "SLM Ops", is an important consideration, and users should be aware of the potential benefits and challenges of using these models in production (40m34s).
- When creating an AI workflow, it's often better to start with small language models, chain them together, and then increase the model size if necessary, rather than starting with a large model such as those from OpenAI (42m16s).
- Starting with a small model, such as a 1 billion parameter model, and then substituting larger models, like 3 billion or 7 billion, can be an effective approach (42m36s).
- A 10 billion parameter model can likely solve most workflows, except for hard exceptions like video creation and image generation (42m51s).
- It's recommended to start small, explore, and iterate when working with AI models (43m18s).
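The "start small, then substitute larger models" approach described above can be expressed as an escalation ladder. The sketch below is a hypothetical illustration: the stub functions stand in for real 1B, 3B, and 7B models, and the acceptance check (`bool`, meaning any non-empty answer) is a placeholder for whatever quality test a real workflow would apply.

```python
def run_with_escalation(task, ladder, is_acceptable):
    """Try models from smallest to largest and stop at the first good answer.

    ladder: list of (name, callable) pairs ordered by model size.
    Returns the name of the last model tried, its answer, and all names tried.
    """
    if not ladder:
        raise ValueError("ladder must contain at least one model")
    tried = []
    for name, model in ladder:
        answer = model(task)
        tried.append(name)
        if is_acceptable(answer):
            break
    return name, answer, tried


# Hypothetical stubs: the smallest model only handles short tasks.
def tiny_model(task):
    return "done" if len(task) < 20 else ""


def larger_model(task):
    return "done"


LADDER = [("1B", tiny_model), ("3B", larger_model), ("7B", larger_model)]

print(run_with_escalation("short task", LADDER, bool))
print(run_with_escalation("a much longer and harder task", LADDER, bool))
```

The list of models tried doubles as a record of how often the smallest model suffices, which is useful evidence when deciding whether a larger model is worth its cost.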
Online Resources and Communities for AI
- For online resources, MarkTechPost is a good source for leading-edge AI research, and YouTube channels like AI Anytime and World of AI are great for tutorials and exposure to the latest AI developments (43m35s).
- Hugging Face's LinkedIn site is also a good resource for promoting new models, and InfoQ's website has good information on AI, including articles on Apple Foundation models (44m18s).
- AI is becoming ubiquitous and is being integrated into various communities, including architecture, devops, cloud, security, and machine learning (44m52s).
- The question remains when AI will become a regular thing, like having a personal website, and no longer be a topic of excitement (45m14s).
- The integration of AI in everyday technology is expected to become the norm, similar to how it is now understood that everything will have the internet in it (45m27s).
- Looking back, it will be clear when AI became an integral part of daily life, but it may not be immediately apparent when it happens (45m42s).
- Namee encourages the InfoQ community to keep experimenting and playing around with AI-powered technologies (46m0s).
- LLMWare, an open-source project, offers an end-to-end solution for small language models and is free for anyone to try out (46m7s).
- Small language models have the potential to commoditize and localize language model solutions, allowing for a bigger impact on the software development community (46m24s).
- The AI/ML and data engineering community page on the InfoQ website is a resource for learning more about AI/ML topics, including recent podcasts and the AI/ML Trends report for 2024 (46m40s).
- The AI/ML Trends report for 2024 covers topics such as small language models, AI-powered PCs, coding assistance, and other trends in the field (46m56s).