Show notes are at https://stevelitchfield.com/sshow/chat.html
…
continue reading
Content provided by LessWrong. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by LessWrong or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://podcastplayer.com/legal.
Player FM - Podcast App
Go offline with the Player FM app!
Go offline with the Player FM app!
“OpenAI:” by Daniel Kokotajlo
MP3•Episode home
Manage episode 470807453 series 3364760
Content provided by LessWrong. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by LessWrong or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://podcastplayer.com/legal.
Exciting Update: OpenAI has released this blog post and paper which makes me very happy. It's basically the first steps along the research agenda I sketched out here.
tl;dr:
1.) They notice that their flagship reasoning models do sometimes intentionally reward hack, e.g. literally say "Let's hack" in the CoT and then proceed to hack the evaluation system. From the paper:
The agent notes that the tests only check a certain function, and that it would presumably be “Hard” to implement a genuine solution. The agent then notes it could “fudge” and circumvent the tests by making verify always return true. This is a real example that was detected by our GPT-4o hack detector during a frontier RL run, and we show more examples in Appendix A.
That this sort of thing would happen eventually was predicted by many people, and it's exciting to see it starting to [...]
The original text contained 1 image which was described by AI.
---
First published:
March 11th, 2025
Source:
https://www.lesswrong.com/posts/7wFdXj9oR8M9AiFht/openai
---
Narrated by TYPE III AUDIO.
---
…
continue reading
tl;dr:
1.) They notice that their flagship reasoning models do sometimes intentionally reward hack, e.g. literally say "Let's hack" in the CoT and then proceed to hack the evaluation system. From the paper:
The agent notes that the tests only check a certain function, and that it would presumably be “Hard” to implement a genuine solution. The agent then notes it could “fudge” and circumvent the tests by making verify always return true. This is a real example that was detected by our GPT-4o hack detector during a frontier RL run, and we show more examples in Appendix A.
That this sort of thing would happen eventually was predicted by many people, and it's exciting to see it starting to [...]
The original text contained 1 image which was described by AI.
---
First published:
March 11th, 2025
Source:
https://www.lesswrong.com/posts/7wFdXj9oR8M9AiFht/openai
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

508 episodes
MP3•Episode home
Manage episode 470807453 series 3364760
Content provided by LessWrong. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by LessWrong or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://podcastplayer.com/legal.
Exciting Update: OpenAI has released this blog post and paper which makes me very happy. It's basically the first steps along the research agenda I sketched out here.
tl;dr:
1.) They notice that their flagship reasoning models do sometimes intentionally reward hack, e.g. literally say "Let's hack" in the CoT and then proceed to hack the evaluation system. From the paper:
The agent notes that the tests only check a certain function, and that it would presumably be “Hard” to implement a genuine solution. The agent then notes it could “fudge” and circumvent the tests by making verify always return true. This is a real example that was detected by our GPT-4o hack detector during a frontier RL run, and we show more examples in Appendix A.
That this sort of thing would happen eventually was predicted by many people, and it's exciting to see it starting to [...]
The original text contained 1 image which was described by AI.
---
First published:
March 11th, 2025
Source:
https://www.lesswrong.com/posts/7wFdXj9oR8M9AiFht/openai
---
Narrated by TYPE III AUDIO.
---
…
continue reading
tl;dr:
1.) They notice that their flagship reasoning models do sometimes intentionally reward hack, e.g. literally say "Let's hack" in the CoT and then proceed to hack the evaluation system. From the paper:
The agent notes that the tests only check a certain function, and that it would presumably be “Hard” to implement a genuine solution. The agent then notes it could “fudge” and circumvent the tests by making verify always return true. This is a real example that was detected by our GPT-4o hack detector during a frontier RL run, and we show more examples in Appendix A.
That this sort of thing would happen eventually was predicted by many people, and it's exciting to see it starting to [...]
The original text contained 1 image which was described by AI.
---
First published:
March 11th, 2025
Source:
https://www.lesswrong.com/posts/7wFdXj9oR8M9AiFht/openai
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

508 episodes
All episodes
×Welcome to Player FM!
Player FM is scanning the web for high-quality podcasts for you to enjoy right now. It's the best podcast app and works on Android, iPhone, and the web. Signup to sync subscriptions across devices.