Inside distributed inference with llm-d ft. Carlos Costa | Technically Speaking with Chris Wright podcast

Tech Business Red Hat Emerging Technologies Machine Learning Chris Wright Hybrid Clouds Data Development Security Linux Software Cloud Computing Enterprise Technology Artificial Intelligence Programming Coding Careers Technology

Content provided by Red Hat. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Red Hat or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://podcastplayer.com/legal.

Technically Speaking with Chris Wright »
Inside distributed inference with llm-d ft. Carlos Costa

1d ago 26:23

Scaling LLM inference for production isn't just about adding more machines; it demands new intelligence in the infrastructure itself. In this episode, we're joined by Carlos Costa, Distinguished Engineer at IBM Research, a leader in large-scale compute and a key figure in the llm-d project. We discuss how to move beyond single-server deployments and build the intelligent, AI-aware infrastructure needed to manage complex workloads efficiently.

Carlos Costa shares insights from his deep background in HPC and distributed systems, including:

• The evolution from traditional HPC and large-scale training to the unique challenges of distributed inference for massive models.
• The origin story of the llm-d project, a collaborative, open-source effort to create a much-needed "common AI stack" and control plane for the entire community.
• How llm-d extends Kubernetes with the specialization required for AI, enabling state-aware scheduling that standard Kubernetes wasn't designed for.
• Key architectural innovations like the disaggregation of prefill and decode stages and support for wide parallelism to efficiently run complex Mixture of Experts (MoE) models.

Tune in to discover how this collaborative, open-source approach is building the standardized, AI-aware infrastructure necessary to make massive AI models practical, efficient, and accessible for everyone.

4 episodes
