Go offline with the Player FM app!
Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening
Manage episode 487468089 series 3524393
This paper critiques GRPO's bias in training language models for theorem proving and introduces the unlikeliness reward to enhance performance and sample diversity, achieving competitive results.
https://arxiv.org/abs//2506.02355
YouTube: https://www.youtube.com/@ArxivPapers
TikTok: https://www.tiktok.com/@arxiv_papers
Apple Podcasts: https://podcasts.apple.com/us/podcast/arxiv-papers/id1692476016
Spotify: https://podcasters.spotify.com/pod/show/arxiv-papers
2317 episodes
Manage episode 487468089 series 3524393
This paper critiques GRPO's bias in training language models for theorem proving and introduces the unlikeliness reward to enhance performance and sample diversity, achieving competitive results.
https://arxiv.org/abs//2506.02355
YouTube: https://www.youtube.com/@ArxivPapers
TikTok: https://www.tiktok.com/@arxiv_papers
Apple Podcasts: https://podcasts.apple.com/us/podcast/arxiv-papers/id1692476016
Spotify: https://podcasters.spotify.com/pod/show/arxiv-papers
2317 episodes
Alla avsnitt
×Welcome to Player FM!
Player FM is scanning the web for high-quality podcasts for you to enjoy right now. It's the best podcast app and works on Android, iPhone, and the web. Signup to sync subscriptions across devices.