20251106 - Oxford pretends AI benchmarks are science, not marketing
How could all these benchmarks be fake? It's a mystery.
https://pivot-to-ai.com/2025/11/06/oxford-pretends-ai-benchmarks-are-science-not-marketing/ - text
- Patreon: https://www.patreon.com/davidgerard
- Ko-Fi: https://ko-fi.com/A1529D5
- Buy me nice things: https://www.amazon.co.uk/hz/wishlist/ls/3Q8VZW46J6DM6
- Get an extremely cool Pivot to AI shirt or mug: https://pivot-to-ai.redbubble.com
Send in your story tips: [email protected]
Sources:
- Measuring what Matters: Construct Validity in Large Language Model Benchmarks (press release) https://oxrml.com/measuring-what-matters/
- Measuring what Matters: Construct Validity in Large Language Model Benchmarks (PDF) https://openreview.net/pdf?id=mdA5lVvNcU
Previously on Pivot to AI:
- OpenAI o3 beats FrontierMath — because OpenAI funded the test and had access to the questions https://pivot-to-ai.com/2025/01/20/openai-o3-beats-frontiermath-because-openai-funded-the-test-and-had-access-to-questions/
- AI benchmarks are self-promoting trash — but regulators keep using them https://pivot-to-ai.com/2025/02/25/ai-benchmarks-are-self-promoting-trash-but-regulators-keep-using-them/
- The finance press finally starts talking about the 'AI bubble' https://pivot-to-ai.com/2025/09/28/the-finance-press-finally-starts-talking-about-the-ai-bubble/
- video: https://www.youtube.com/watch?v=AgR1TCllRgc&list=UU9rJrMVgcXTfa8xuMnbhAEA
Full Pivot to AI playlist: https://www.youtube.com/playlist?list=UU9rJrMVgcXTfa8xuMnbhAEA
235 episodes