38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future
In this episode, I chat with David Duvenaud about two topics he's been thinking about: first, a paper he wrote on evaluating whether frontier models can sabotage human decision-making or human monitoring of those same models; and second, the difficult situation humans could find themselves in after the arrival of AGI, even if AI is aligned with human intentions.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/03/01/episode-38_8-david-duvenaud-sabotage-evaluations-post-agi-future.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: @FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/
Topics we discuss, and timestamps:
01:42 - The difficulty of sabotage evaluations
05:23 - Types of sabotage evaluation
08:45 - The state of sabotage evaluations
12:26 - What happens after AGI?
Links:
Sabotage Evaluations for Frontier Models: https://arxiv.org/abs/2410.21514
Gradual Disempowerment: https://gradual-disempowerment.ai/
Episode art by Hamish Doodles: hamishdoodles.com