Episode 45 - LLM on K8s with Seif Bassem

43:03
 
Content provided by Nicholas Chang. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Nicholas Chang or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://podcastplayer.com/legal.

We start by weighing the trade-offs: managed AI gives you speed, safety, and a deep model catalog, but steady high-volume workloads, strict compliance, or edge latency often tilt the equation. That’s where AKS shines. With managed GPU node pools, NVIDIA drivers and operators handled for you, and Multi-Instance GPU to prevent noisy neighbours, you get reliable performance and better utilisation. Auto-provisioning brings GPU capacity online when traffic surges, and smart scheduling keeps pods where they need to be.
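To make the GPU scheduling point concrete, here is a minimal sketch, not taken from the episode, of landing an inference pod on a labelled GPU node pool with the official Kubernetes Python client. The node-pool label, pod name, and container image are illustrative assumptions; on AKS the NVIDIA device plugin exposes the nvidia.com/gpu resource (or a MIG slice) that the pod requests.

```python
# Illustrative sketch: request one GPU and pin the pod to an assumed GPU node pool.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-inference"),          # assumed pod name
    spec=client.V1PodSpec(
        node_selector={"agentpool": "gpupool"},                   # assumed node-pool label
        containers=[
            client.V1Container(
                name="vllm",
                image="vllm/vllm-openai:latest",                   # illustrative image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}                 # one full GPU or a MIG slice
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```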
The breakthrough is Kaito, the Kubernetes AI Toolchain Operator that treats models as first-class cloud native apps. Using concise YAML, we containerise models, select presets that optimise vLLM, and expose an OpenAI-compatible endpoint so existing clients work by changing only the URL. We walk through a demo that labels GPU nodes, deploys a model, serves it via vLLM, and validates responses from a simple chat UI and a Python client. Tool calling and MCP fit neatly into this setup, allowing private integrations with internal APIs while keeping data in your environment.
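As a companion to the demo described above, here is a hedged sketch of what the Python client side can look like once the OpenAI-compatible endpoint is up. The service URL and model name are placeholders, and the standard OpenAI Python SDK is assumed; the point is that only the base URL changes relative to calling a hosted API.

```python
# Illustrative sketch: talk to a Kaito/vLLM OpenAI-compatible endpoint with the OpenAI SDK.
from openai import OpenAI

client = OpenAI(
    base_url="http://workspace-llm.default.svc.cluster.local/v1",  # assumed in-cluster service URL
    api_key="not-needed",  # vLLM typically accepts any key unless auth is configured
)

resp = client.chat.completions.create(
    model="phi-3-mini-128k-instruct",  # illustrative preset/model name
    messages=[{"role": "user", "content": "Summarise why we run LLMs on AKS."}],
)
print(resp.choices[0].message.content)
```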

Text Us About the Show


Chapters

1. Episode 45 - LLM on K8s with Seif Bassem (00:00:00)

2. Welcome And Guest Intro (00:00:50)

3. Why LLMs On Kubernetes (00:03:16)

4. Managed AI Platforms: Pros And Cons (00:06:36)

5. When Self‑Hosting LLMs Makes Sense (00:12:27)

6. Challenges: GPUs, Models, Engines, Ops (00:18:20)

7. AKS Capabilities For AI Workloads (00:23:05)

8. Meet Kaito: Models As Cloud Native Apps (00:28:32)

9. Demo Setup: AKS GPU And Labels (00:34:12)

10. Deploying Models With Kaito Presets (00:38:18)

11. Serving Via vLLM And OpenAI API (00:43:08)

12. Observability With Prometheus And Grafana (00:47:12)

13. Alternatives To Kaito And YAML Deep Dive (00:52:40)

14. LLMOps, GitOps, And Fleet Scale (00:57:20)

15. Resources, Next Steps, And Wrap (01:01:59)

16. Personal Side And Closing (01:03:02)

45 episodes
