“FrontierMath Score of o3-mini Much Lower Than Claimed” by YafahEdelman

2025/3/18

LessWrong (30+ Karma)

OpenAI reports that o3-mini with high reasoning and a Python tool scores 32% on FrontierMath. However, Epoch's official evaluation[1] found a score of only 11%. There are a few reasons to trust Epoch's score over OpenAI's:

- Epoch built the benchmark and has better incentives.
- OpenAI reported a 28% score on the hardest of the three problem tiers, suspiciously close to their overall score.
- Epoch has published quite a bit of information about its testing infrastructure and data, whereas OpenAI has published close to none.

[1] Which had Python access.

The original text contained 1 footnote which was omitted from this narration.


First published: March 17th, 2025

Source: https://www.lesswrong.com/posts/z8zPL2hBqTmx7Kf6J/frontiermath-score-of-o3-mini-much-lower-than-claimed

Narrated by TYPE III AUDIO.