Content provided by Redpoint Ventures. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Redpoint Ventures or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://player.fm/legal
Ep 74: Chief Scientist of Together.AI Tri Dao On The End of Nvidia's Dominance, Why Inference Costs Fell & The Next 10X in Speed

58:37
 
Manage episode 505534886 series 3495253

Fill out this short listener survey to help us improve the show: https://forms.gle/bbcRiPTRwKoG2tJx8

Tri Dao, Chief Scientist at Together AI and Princeton professor who created Flash Attention and Mamba, discusses how inference optimization has driven costs down 100x since ChatGPT's launch through memory optimization, sparsity advances, and hardware-software co-design. He predicts the AI hardware landscape will shift from Nvidia's current 90% dominance to a more diversified ecosystem within 2-3 years, as specialized chips emerge for distinct workload categories: low-latency agentic systems, high-throughput batch processing, and interactive chatbots. Dao shares his surprise at AI models becoming genuinely useful for expert-level work, making him 1.5x more productive at GPU kernel optimization through tools like Claude Code and o1. The conversation explores whether current transformer architectures can reach expert-level AI performance, or whether approaches like mixture of experts and state space models are needed to achieve AGI at reasonable cost. Looking ahead, Dao sees another 10x cost reduction coming from continued hardware specialization, improved kernels, and architectural advances like ultra-sparse models, while emphasizing that the biggest challenge remains generating expert-level training data for domains lacking extensive internet coverage.

(0:00) Intro

(1:58) Nvidia's Dominance and Competitors

(4:01) Challenges in Chip Design

(6:26) Innovations in AI Hardware

(9:21) The Role of AI in Chip Optimization

(11:38) Future of AI and Hardware Abstractions

(16:46) Inference Optimization Techniques

(33:10) Specialization in AI Inference

(35:18) Deep Work Preferences and Low Latency Workloads

(38:19) Fleet Level Optimization and Batch Inference

(39:34) Evolving AI Workloads and Open Source Tooling

(41:15) Future of AI: Agentic Workloads and Real-Time Video Generation

(44:35) Architectural Innovations and AI Expert Level

(50:10) Robotics and Multi-Resolution Processing

(52:26) Balancing Academia and Industry in AI Research

(57:37) Quickfire

With your co-hosts:

@jacobeffron

- Partner at Redpoint, Former PM Flatiron Health

@patrickachase

- Partner at Redpoint, Former ML Engineer LinkedIn

@ericabrescia

- Former COO GitHub, Founder Bitnami (acq’d by VMware)

@jordan_segall

- Partner at Redpoint


78 episodes
