Go offline with the Player FM app!
Episode 45 - LLM on K8s with Seif Bassem
Manage episode 523940014 series 3551436
We start by weighing the trade-offs: managed AI gives you speed, safety, and a deep model catalog, but steady high-volume workloads, strict compliance, or edge latency often tilt the equation. That’s where AKS shines. With managed GPU node pools, NVIDIA drivers and operators handled for you, and Multi-Instance GPU to prevent noisy neighbours, you get reliable performance and better utilisation. Auto-provisioning brings GPU capacity online when traffic surges, and smart scheduling keeps pods where they need to be.
The breakthrough is Kaito, the Kubernetes AI Toolchain Operator that treats models as first-class cloud native apps. Using concise YAML, we containerise models, select presets that optimise vLLM, and expose an OpenAI-compatible endpoint so existing clients work by changing only the URL. We walk through a demo that labels GPU nodes, deploys a model, serves it via vLLM, and validates responses from a simple chat UI and a Python client. Tool calling and MCP fit neatly into this setup, allowing private integrations with internal APIs while keeping data in your environment.
Chapters
1. Episode 45 - LLM on K8s with Seif Bassem (00:00:00)
2. Welcome And Guest Intro (00:00:50)
3. Why LLMs On Kubernetes (00:03:16)
4. Managed AI Platforms: Pros And Cons (00:06:36)
5. When Self‑Hosting LLMs Makes Sense (00:12:27)
6. Challenges: GPUs, Models, Engines, Ops (00:18:20)
7. AKS Capabilities For AI Workloads (00:23:05)
8. Meet Kaito: Models As Cloud Native Apps (00:28:32)
9. Demo Setup: AKS GPU And Labels (00:34:12)
10. Deploying Models With Kaito Presets (00:38:18)
11. Serving Via VLLM And OpenAI API (00:43:08)
12. Observability With Prometheus And Grafana (00:47:12)
13. Alternatives To Kaito And YAML Deep Dive (00:52:40)
14. LLMOps, GitOps, And Fleet Scale (00:57:20)
15. Resources, Next Steps, And Wrap (01:01:59)
16. Personal Side And Closing (01:03:02)
45 episodes
Manage episode 523940014 series 3551436
We start by weighing the trade-offs: managed AI gives you speed, safety, and a deep model catalog, but steady high-volume workloads, strict compliance, or edge latency often tilt the equation. That’s where AKS shines. With managed GPU node pools, NVIDIA drivers and operators handled for you, and Multi-Instance GPU to prevent noisy neighbours, you get reliable performance and better utilisation. Auto-provisioning brings GPU capacity online when traffic surges, and smart scheduling keeps pods where they need to be.
The breakthrough is Kaito, the Kubernetes AI Toolchain Operator that treats models as first-class cloud native apps. Using concise YAML, we containerise models, select presets that optimise vLLM, and expose an OpenAI-compatible endpoint so existing clients work by changing only the URL. We walk through a demo that labels GPU nodes, deploys a model, serves it via vLLM, and validates responses from a simple chat UI and a Python client. Tool calling and MCP fit neatly into this setup, allowing private integrations with internal APIs while keeping data in your environment.
Chapters
1. Episode 45 - LLM on K8s with Seif Bassem (00:00:00)
2. Welcome And Guest Intro (00:00:50)
3. Why LLMs On Kubernetes (00:03:16)
4. Managed AI Platforms: Pros And Cons (00:06:36)
5. When Self‑Hosting LLMs Makes Sense (00:12:27)
6. Challenges: GPUs, Models, Engines, Ops (00:18:20)
7. AKS Capabilities For AI Workloads (00:23:05)
8. Meet Kaito: Models As Cloud Native Apps (00:28:32)
9. Demo Setup: AKS GPU And Labels (00:34:12)
10. Deploying Models With Kaito Presets (00:38:18)
11. Serving Via VLLM And OpenAI API (00:43:08)
12. Observability With Prometheus And Grafana (00:47:12)
13. Alternatives To Kaito And YAML Deep Dive (00:52:40)
14. LLMOps, GitOps, And Fleet Scale (00:57:20)
15. Resources, Next Steps, And Wrap (01:01:59)
16. Personal Side And Closing (01:03:02)
45 episodes
All episodes
×Welcome to Player FM!
Player FM is scanning the web for high-quality podcasts for you to enjoy right now. It's the best podcast app and works on Android, iPhone, and the web. Signup to sync subscriptions across devices.