OpenAI reports that o3-mini with high reasoning and a Python tool scores 32% on FrontierMath. However, Epoch's official evaluation[1] found only 11%. There are a few reasons to trust Epoch's score over OpenAI's:
Epoch built the benchmark and has better incentives. OpenAI reported a 28% score on the hardest of the three problem tiers, suspiciously close to its overall score. Epoch has published quite a bit of information about its testing infrastructure and data, whereas OpenAI has published close to none.
[1] Which had Python access.
First published: March 17th, 2025
---
Narrated by TYPE III AUDIO.