CoreWeave’s Brannin McBee on the future of AI infrastructure, GPU economics, & data centers | E1925

06 Apr 2024

CoreWeave’s Brannin McBee joins Jason (0s)

  • CoreWeave is one of the largest operators of AI infrastructure in the world.
  • The demand for AI infrastructure is growing rapidly and CoreWeave is struggling to keep up.
  • CoreWeave started in the space renting GPUs to crypto miners.
  • When the AI revolution happened, CoreWeave was in a prime position to capitalize on the demand for AI hardware.
  • NVIDIA's market cap has increased more than 7x from $300 billion to $2.3 trillion.
  • The growth in NVIDIA's market cap is due to the increasing demand for GPUs for AI training.
  • CoreWeave is a capital-intensive business that spends a lot of money on GPUs and setting up infrastructure.

The founding story of CoreWeave and the transition from the cryptocurrency market to AI (3m36s)

  • CoreWeave's founders, with backgrounds in institutional commodity trading, saw cryptocurrency mining as an arbitrage opportunity due to predictable power costs and efficient revenue modeling.
  • They focused on GPU-oriented mining for Ethereum because GPUs could also run AI workloads, providing more versatility.
  • CoreWeave transitioned into the cloud infrastructure market in 2019, building a cloud specifically designed for running AI and highly parallelizable workloads.
  • CoreWeave's cloud infrastructure outperforms competitors due to engineering decisions made for AI workloads rather than website hosting and data storage.
  • CoreWeave specializes in providing AI infrastructure at scale, with clients using 10,000 GPUs simultaneously in a single contiguous fabric, creating supercomputers.

Energy reliance of GPUs, the future of data centers, and the cost and efficiency of GPU cloud infrastructure (14m4s)

  • GPUs consume a significant amount of energy; power accounts for 10% of the total cost of delivering GPU infrastructure.
  • Data center capacity, especially data centers with sufficient power, is the bottleneck in the cloud infrastructure market.
  • GPUs are more power-efficient than CPUs on a workload basis but consume more power on a density basis.
  • The immense power consumption of GPUs is justified by their ability to unlock value from data.
  • There is a need for more baseload power, such as nuclear power, to support the consistent demand growth of data centers and electric vehicles.
  • CoreWeave's Brannin McBee discusses the future of AI infrastructure, GPU economics, and data centers.
  • McBee emphasizes the importance of contracting a substantial amount of capacity to ensure business growth and warns that the lack of capacity will hinder other market participants.
  • McBee also mentions the challenge of managing the heat generated by these systems.

Direct-to-chip liquid cooling and other cooling solutions for GPUs (19m28s)

  • Data centers consume a significant amount of energy, with cooling infrastructure requiring 2 to 3 times more energy than the infrastructure itself.
  • The industry is transitioning from forced air cooling to liquid cooling for improved energy efficiency.
  • CoreWeave focuses on direct-to-chip liquid cooling for its operational efficiency, involving pipes running to the chips for easier servicing.
  • Data centers are highly controlled environments with strict measures to prevent accidents.
  • GPU economics are changing due to the rising demand for AI infrastructure, leading to increased GPU costs and the need for data center reconfiguration to accommodate power and cooling requirements.

Demand trends for GPU capacity and the growth of inference linked to user growth (24m52s)

  • CoreWeave’s revenue is expected to increase fourfold this year.
  • The company is already sold out of all its capacity through the end of the year.
  • There is an immovable wall of demand for GPU compute.
  • The demand is driven by the move from training models to inference.
  • Inference demand is linked to user growth: the more users a product attracts, the more inference compute it needs.
  • Users who were using 10,000 GPUs for training need hundreds of thousands for their early-stage inference products.
  • Demand for GPU infrastructure is expected to continue to grow.
  • An H100 GPU can process hundreds of queries per hour.
  • The cost per query is a few pennies.
  • The cost per query is expected to fall over time as efficiency improves.
  • The existing cloud infrastructure was not built for the parallelizable workloads of AI.
  • There is a need to rebuild the cloud infrastructure at the pace of AI software adoption.
  • CoreWeave is building 28 data centers this year in North America and is still unable to keep up with demand.
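
The per-query figures above reduce to simple arithmetic: divide what a GPU costs per hour by the number of queries it serves in that hour. The sketch below is illustrative only. The hourly GPU price and the exact query rate are assumptions chosen for the example (the episode gives only "hundreds of queries per hour" and "a few pennies"), and the function name is hypothetical.

```python
def cost_per_query(gpu_cost_per_hour: float, queries_per_hour: float) -> float:
    """Return the average GPU cost attributed to a single query, in dollars."""
    if queries_per_hour <= 0:
        raise ValueError("queries_per_hour must be positive")
    return gpu_cost_per_hour / queries_per_hour


# ASSUMPTION: illustrative values only, not figures from the episode.
ASSUMED_GPU_COST_PER_HOUR = 4.00  # dollars per GPU-hour (hypothetical)
ASSUMED_QUERIES_PER_HOUR = 200    # one point within "hundreds of queries per hour"

cost = cost_per_query(ASSUMED_GPU_COST_PER_HOUR, ASSUMED_QUERIES_PER_HOUR)
print(f"Cost per query: ${cost:.3f}")
```

Under these assumed inputs the result is two cents per query, which matches the "few pennies" range. The same formula shows why the cost falls as efficiency improves: serving more queries per GPU-hour at the same hourly price lowers the cost of each query.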

CoreWeave’s Brannin McBee on the future of AI infrastructure, GPU economics, & data centers | E1925 (0s)

  • Brannin McBee is a co-founder of CoreWeave.
  • CoreWeave is a company that provides AI infrastructure solutions.
  • Brannin has a background in electrical engineering and computer science.
  • He has worked in the tech industry for over 20 years.
  • AI is becoming increasingly important in various industries.
  • This growth is driving the demand for AI infrastructure.
  • AI infrastructure includes hardware, software, and services that are used to develop, train, and deploy AI models.
  • The future of AI infrastructure is expected to see continued growth and innovation.
  • GPUs (Graphics Processing Units) are essential for AI workloads.
  • The cost of GPUs has been increasing in recent years.
  • This increase is due to the high demand for GPUs and the limited supply.
  • The cost of GPUs is expected to continue to increase in the future.
  • Data centers are facilities that house computer systems and provide the necessary infrastructure for them to operate.
  • Data centers are essential for AI workloads.
  • The demand for data centers is expected to grow in the future.
  • This growth is driven by the increasing demand for AI infrastructure.
  • CoreWeave provides AI infrastructure solutions that help businesses to reduce the cost of their AI workloads.
  • CoreWeave’s solutions include hardware, software, and services.
  • CoreWeave’s solutions are designed to be scalable and efficient.
  • CoreWeave’s solutions are used by a variety of businesses, including Fortune 500 companies.

LPUs Vs. GPUs (30m28s)

  • GPUs are expensive, but inference engines and LPUs (Language Processing Units) are emerging as purpose-built hardware for inference, potentially lowering costs.
  • Different models will require different types of infrastructure for optimal efficiency.
  • Companies like Microsoft and Meta are building their own silicon to solve for specific models they run internally, not to replace GPUs.
  • GPUs will continue to dominate in training the most demanding and complex workloads, especially the latest-generation models.
  • Inference may see various levels of infrastructure solutions, but models trained on A100s will likely run best on A100s due to software compatibility.
  • Nvidia's CUDA software platform has become the default across the market, giving the company a strong advantage in the AI infrastructure sector.

Dominance of Nvidia in training models, implications of open source chips and chip architecture (34m13s)

  • Nvidia dominates the market for training models due to their high-performance training fabric called NVLink.
  • Open-source chips and chip architectures may have limited impact in the short term due to performance and configurability losses.
  • AMD lacks a performant training fabric, making it less suitable for training models compared to Nvidia GPUs.
  • Most large-scale consumers of GPUs prefer Nvidia infrastructure due to its superior performance and established ecosystem.
  • InfiniBand is a high-speed networking technology that Nvidia gained through its acquisition of Mellanox and integrated into its DGX systems.
  • It is the most performant fabric solution for data throughput in GPU infrastructure.

CoreWeave’s Brannin McBee on the future of AI infrastructure, GPU economics, & data centers | E1925 (0s)

  • Brannin McBee, co-founder of CoreWeave, discusses the future of AI infrastructure, GPU economics, and data centers.
  • AI is rapidly evolving and becoming more accessible, leading to increased demand for AI infrastructure.
  • Traditional data centers are not well-suited for AI workloads, which require high-performance computing and specialized hardware.
  • CoreWeave is developing new AI infrastructure solutions that are optimized for AI workloads.
  • GPUs are essential for AI workloads, but they are also expensive.
  • The cost of GPUs is a major barrier to entry for many organizations that want to use AI.
  • CoreWeave is working to reduce the cost of GPUs by developing new GPU architectures and optimizing GPU utilization.
  • Data centers are critical for AI infrastructure, as they provide the compute power and storage capacity needed for AI workloads.
  • CoreWeave is building new data centers that are optimized for AI workloads.
  • These data centers will provide the infrastructure needed to support the growth of AI.
  • Brannin McBee concludes by discussing the importance of AI infrastructure and the role that CoreWeave is playing in developing new AI infrastructure solutions.

Challenges in training large language models and the use of infrastructure as a weapon by big tech (40m0s)

  • CoreWeave's Brannin McBee discusses the challenges and solutions in building AI infrastructure, particularly for large language models.
  • A key challenge is the throughput of data movement between GPUs, which can become a bottleneck; a non-blocking InfiniBand fabric is crucial for maintaining high performance and efficiency.
  • Building a 16,000 GPU fabric involves complex physical engineering, with 48,000 discrete connections and 500 miles of fiber optic cabling.
  • CoreWeave provides a software solution and platform specifically designed for these types of workloads, catering to clients who require the best engineering solutions.
  • Large companies like Microsoft, Meta, and Google are building infrastructure as a way to deploy capital that regulatory restrictions prevent them from deploying through M&A, which in turn creates potential jobs and opportunities for innovation.
  • Promising use cases for AI infrastructure include integrating AI into existing products seamlessly, without requiring users to learn new interfaces or applications.
  • The growth of AI integration into existing user processes is limited by the availability of cloud infrastructure.
  • Cloud infrastructure limitations can delay product launches and hinder the growth of AI-powered applications, especially for those that require extensive GPU usage, such as co-pilot products and search engines.
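
The scale of the 16,000-GPU fabric described above is easier to picture as per-GPU averages. The sketch below only divides the three figures quoted in the episode. Treating every connection as a single cable run of equal length is a simplifying assumption, so the results are rough averages rather than a description of the actual topology, and the function name is hypothetical.

```python
FEET_PER_MILE = 5280


def fabric_ratios(gpus: int, connections: int, fiber_miles: float) -> tuple[float, float]:
    """Return (connections per GPU, average feet of fiber per connection)."""
    connections_per_gpu = connections / gpus
    # ASSUMPTION: every connection is one cable run of equal length.
    feet_per_connection = fiber_miles * FEET_PER_MILE / connections
    return connections_per_gpu, feet_per_connection


# Figures quoted in the episode for a 16,000-GPU fabric.
per_gpu, feet = fabric_ratios(gpus=16_000, connections=48_000, fiber_miles=500)
print(f"Connections per GPU: {per_gpu:.1f}")
print(f"Average fiber per connection: {feet:.0f} ft")
```

On these numbers the fabric averages three connections per GPU and about 55 feet of fiber per connection, which gives a sense of why the physical build is described as complex engineering.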

The potential impact of AI on search engines, advertising sector, and ecommerce (45m36s)

  • The high cost of processing AI queries using GPUs is a barrier to widespread adoption.
  • Integrating AI into software products may become mandatory, leading to increased competition and potential market share loss for companies that don't adopt AI.
  • Companies like Microsoft are investing heavily in AI infrastructure to gain a strategic advantage.
  • Generative AI, particularly in advertising, will drive significant demand for infrastructure, especially GPU-based infrastructure.
  • The demand for AI infrastructure exceeds the capabilities of current cloud systems, necessitating a fundamental shift in infrastructure design.

Timeline for supply-demand balance in AI infrastructure and the operational feat of running multiple data centers (52m46s)

  • Supply and demand for AI infrastructure may normalize by the end of this decade.
  • Infrastructure build-out is on a steep growth trajectory, so supply may catch up with demand by then.
  • CoreWeave's background in commodity trading helps them assess supply and demand in the AI infrastructure market.
  • Unlike the fungible cloud infrastructure used for hosting websites, AI infrastructure is not fungible: differences in software and infrastructure design are de-commoditizing it.
  • CoreWeave is hiring 20 people a week to configure and manage their infrastructure.
  • They have hundreds of people unboxing and racking equipment, and semi-trucks arriving at their 28 data centers across the US.
