Episode 45 - LLM on K8s with Seif Bassem

43:03
 
Content provided by Nicholas Chang. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Nicholas Chang or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://podcastplayer.com/legal.

We start by weighing the trade-offs: managed AI gives you speed, safety, and a deep model catalog, but steady high-volume workloads, strict compliance, or edge latency often tilt the equation. That’s where AKS shines. With managed GPU node pools, NVIDIA drivers and operators handled for you, and Multi-Instance GPU to prevent noisy neighbours, you get reliable performance and better utilisation. Auto-provisioning brings GPU capacity online when traffic surges, and smart scheduling keeps pods where they need to be.
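To make the GPU scheduling point concrete, here is a minimal sketch, not taken from the episode, of landing an inference pod on a labelled GPU node pool with the official Kubernetes Python client. The node-pool label, pod name, and container image are illustrative assumptions; on AKS the NVIDIA device plugin exposes the nvidia.com/gpu resource (or a MIG slice) that the pod requests.

```python
# Illustrative sketch: request one GPU and pin the pod to an assumed GPU node pool.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-inference"),          # assumed pod name
    spec=client.V1PodSpec(
        node_selector={"agentpool": "gpupool"},                   # assumed node-pool label
        containers=[
            client.V1Container(
                name="vllm",
                image="vllm/vllm-openai:latest",                   # illustrative image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}                 # one full GPU or a MIG slice
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```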
The breakthrough is Kaito, the Kubernetes AI Toolchain Operator that treats models as first-class cloud native apps. Using concise YAML, we containerise models, select presets that optimise vLLM, and expose an OpenAI-compatible endpoint so existing clients work by changing only the URL. We walk through a demo that labels GPU nodes, deploys a model, serves it via vLLM, and validates responses from a simple chat UI and a Python client. Tool calling and MCP fit neatly into this setup, allowing private integrations with internal APIs while keeping data in your environment.
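As a companion to the demo described above, here is a hedged sketch of what the Python client side can look like once the OpenAI-compatible endpoint is up. The service URL and model name are placeholders, and the standard OpenAI Python SDK is assumed; the point is that only the base URL changes relative to calling a hosted API.

```python
# Illustrative sketch: talk to a Kaito/vLLM OpenAI-compatible endpoint with the OpenAI SDK.
from openai import OpenAI

client = OpenAI(
    base_url="http://workspace-llm.default.svc.cluster.local/v1",  # assumed in-cluster service URL
    api_key="not-needed",  # vLLM typically accepts any key unless auth is configured
)

resp = client.chat.completions.create(
    model="phi-3-mini-128k-instruct",  # illustrative preset/model name
    messages=[{"role": "user", "content": "Summarise why we run LLMs on AKS."}],
)
print(resp.choices[0].message.content)
```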

Text Us About the Show


Chapters

1. Episode 45 - LLM on K8s with Seif Bassem (00:00:00)

2. Welcome And Guest Intro (00:00:50)

3. Why LLMs On Kubernetes (00:03:16)

4. Managed AI Platforms: Pros And Cons (00:06:36)

5. When Self‑Hosting LLMs Makes Sense (00:12:27)

6. Challenges: GPUs, Models, Engines, Ops (00:18:20)

7. AKS Capabilities For AI Workloads (00:23:05)

8. Meet Kaito: Models As Cloud Native Apps (00:28:32)

9. Demo Setup: AKS GPU And Labels (00:34:12)

10. Deploying Models With Kaito Presets (00:38:18)

11. Serving Via vLLM And OpenAI API (00:43:08)

12. Observability With Prometheus And Grafana (00:47:12)

13. Alternatives To Kaito And YAML Deep Dive (00:52:40)

14. LLMOps, GitOps, And Fleet Scale (00:57:20)

15. Resources, Next Steps, And Wrap (01:01:59)

16. Personal Side And Closing (01:03:02)

45 episodes
