Content provided by LessWrong. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by LessWrong or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://podcastplayer.com/legal.

"How to game the METR plot" by shash42

12:05
 
TL;DR: In 2025, we were in the 1-4 hour range, which has only 14 samples in METR's underlying data. The topic of each sample is public, making it easy to game METR horizon length measurements for a frontier lab, sometimes inadvertently. Finally, the “horizon length” under METR's assumptions might be adding little information beyond benchmark accuracy. None of this is to criticize METR; in research, it's hard to be perfect on the first release. But I’m tired of what is being inferred from this plot, pls stop!
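To get a feel for how little 14 samples can pin down, here is a rough back-of-the-envelope sketch. It treats each of the 14 tasks as an independent pass/fail trial, which is an assumption of this sketch and not METR's actual estimation procedure; the function names are hypothetical.

```python
import math

# Number of tasks in the 1-4 hour range, as stated in the TL;DR.
N_SAMPLES = 14


def accuracy(successes: int, n: int = N_SAMPLES) -> float:
    """Fraction of tasks solved out of n."""
    return successes / n


def binomial_std_error(p: float, n: int = N_SAMPLES) -> float:
    """Standard error of an accuracy estimate, assuming independent
    pass/fail trials (an assumption made only for this illustration)."""
    return math.sqrt(p * (1 - p) / n)


# How much the measured accuracy moves when a single task flips from fail to pass.
step = accuracy(8) - accuracy(7)

# Sampling noise on the accuracy estimate when the true success rate is 50%.
se = binomial_std_error(0.5)

print(f"one task flip changes accuracy by {step:.3f}")
print(f"standard error at p=0.5 with n={N_SAMPLES}: {se:.3f}")
```

Under these assumptions, solving one extra task moves accuracy in this bucket by about 7 percentage points, and the sampling noise alone is roughly 13 percentage points, so small targeted gains can look like large shifts.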
14 prompts ruled AI discourse in 2025
The METR horizon length plot was an excellent idea: it proposed measuring the length of tasks models can complete (in terms of estimated human hours needed) instead of accuracy. I'm glad it shifted the community toward caring about long-horizon tasks. They are a better measure of automation impacts and economic outcomes (for example, labor laws are often based on the number of hours of work).
However, I think we are overindexing on it far too much, especially the AI Safety community, which makes huge updates to timelines and research priorities based on it. I suspect (from many anecdotes, including roon's) the METR plot has influenced significant investment [...]
---
Outline:
(01:24) 14 prompts ruled AI discourse in 2025
(04:58) To improve METR horizon length, train on cybersecurity contests
(07:12) HCAST Accuracy alone predicts log-linear trend in METR Horizon Lengths
---
First published:
December 20th, 2025
Source:
https://www.lesswrong.com/posts/2RwDgMXo6nh42egoC/how-to-game-the-metr-plot
---
Narrated by TYPE III AUDIO.
---
Images from the article:
roon tweets:
2 popular AI safety researchers making massive updates based on the Claude 4.5 Opus result today, 200+ likes, within 6 hours.

709 episodes
