Over the winter break, OpenAI announced their new model, o3. Much of the media attention rightly focused on its impressive results on the ARC-AGI benchmark, but for us at Galois, the most significant result was something else—the model’s 25% score on a benchmark called Frontier Math.
Frontier Math is designed to measure human-like mathematical abilities in an AI. Before o3, generative AIs were scoring just 2% on this benchmark. The fact that OpenAI was able to achieve a leap to 25% suggests that AI mathematics is improving very rapidly.1 In fact, we may soon see AI mathematicians that rival or surpass humans on certain kinds of problems.
Let's break this down: What is Frontier Math? What did o3 do? And what does it mean for the future of mathematics?
A long-term problem in AI is figuring out whether the AI is truly capable at a particular task—or just regurgitating answers it has memorized. Modern AIs like GPT-4 are trained on gargantuan quantities of data—essentially every piece of text available on the Internet. That makes it hard to devise a fair test: if you ask an AI a question, maybe it just saw the answer somewhere before!
This is a particular problem for advanced mathematics. Producing the answer to a specific math puzzle is often easy once you know the “trick.” It's finding the trick that’s the real challenge. To test whether AIs are truly able to reason mathematically, we need a set of questions that are both extremely challenging and brand new—questions that have never been posted online before. That’s exactly what Frontier Math is: a set of very challenging math questions written specifically to test AIs. The whole benchmark set is private, so no AI can simply memorize the answers.
Another challenge in testing AI’s math ability is knowing when the AI got the answer right. For a simple arithmetic question, such as 417 + 292, this isn’t too hard. But most advanced math problems aren’t like this—the answer might be a proof or an equation. Frontier Math is cleverly designed so that this isn’t a problem: all the answers are simple objects (integer numbers, sets, and similarly straightforward structures). This makes it relatively easy to check whether the answer is correct.
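To make that concrete, here is a minimal sketch of what grading this style of answer could look like. Frontier Math's actual harness is private, so the problem IDs, expected answers, and `grade` function below are purely illustrative.

```python
# Illustrative only: Frontier Math's real grading harness is private.
# The point is that when every answer is a simple object (an integer, a
# tuple, a finite set), checking a submission is a single equality test.
EXPECTED_ANSWERS = {
    "problem-001": 2520,                  # hypothetical integer-valued answer
    "problem-002": frozenset({2, 3, 7}),  # hypothetical set-valued answer
}

def grade(problem_id: str, submitted) -> bool:
    """Return True if the submitted answer exactly matches the expected one."""
    return EXPECTED_ANSWERS[problem_id] == submitted

if __name__ == "__main__":
    print(grade("problem-001", 2520))                  # True
    print(grade("problem-002", frozenset({2, 3, 5})))  # False
```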
So, what questions does Frontier Math contain? We don’t know for sure—the test set is private—but we do know a few things:

- The problems are grouped into tiers of increasing difficulty, from Tier 1 (the most accessible) up to Tier 3 (the hardest).
- The Frontier Math website publishes a handful of sample questions, randomly selected from each quintile of difficulty.
All these tiers represent extremely hard problems. Solving a Tier 1 problem is within the abilities of many expert mathematicians, but it’s unlikely that any single individual could solve every Tier 2 or Tier 3 problem, each of which requires extensive specialized knowledge. Still, no problem in Frontier Math requires truly superhuman abilities.
The last few years have seen impressive leaps in AI capabilities. Starting around GPT-3, AIs became able to write poems, understand jokes, plan vacations, and even play certain video games. But up to now, LLM-based AIs haven’t been very good at reasoning. One reason is that even the most advanced LLM is only able to “think” one word (or one token) at a time. If you ask an LLM such as GPT-4o to solve a puzzle, it typically just gives the first answer that comes to mind—no deep reasoning or reflection.
The O-series models (o1, o1-mini, and now o3) are profoundly different. Instead of just “blurting out” an answer, these models are trained to produce long chains of thought. The model iteratively generates some reasoning, then examines the reasoning produced so far; the next step is based on this evaluation of its previous thoughts. Essentially, the model is allowed to “speak to itself” before answering, which lets it tackle complex problems step by step and catch incorrect reasoning along the way.
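As a rough mental model only (OpenAI has not published o3's training or inference details), the loop looks something like the sketch below; `generate_step`, `looks_sound`, and `extract_answer` are placeholders for behaviors the model itself learns.

```python
# A toy sketch of iterative chain-of-thought reasoning, not OpenAI's
# actual mechanism. The callables passed in stand for behaviors the
# model learns end-to-end.
def solve(question, generate_step, looks_sound, extract_answer, max_steps=50):
    chain = [question]                     # the growing chain of thought
    for _ in range(max_steps):
        step = generate_step(chain)        # propose the next reasoning step
        if not looks_sound(chain, step):   # the model critiques its own work...
            continue                       # ...and discards a step it judges wrong
        chain.append(step)                 # otherwise, keep reasoning
        answer = extract_answer(chain)     # has an answer emerged yet?
        if answer is not None:
            return answer, chain
    return None, chain                     # give up after the step budget
```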
This iterative reasoning process takes more time than just answering immediately. In practical terms, each answer requires more computation, and is therefore more expensive. We know from OpenAI that some questions answered by o3 cost more than $1,000 per question. This is dramatically more expensive than the models available to consumers. For comparison, a query to OpenAI’s flagship GPT-4o model through their API costs $2.50 per million input tokens and $10 per million output tokens.
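A quick back-of-envelope comparison, using only the prices quoted above, shows how large that gap is:

```python
# Back-of-envelope comparison using the figures quoted above.
o3_cost_per_question = 1_000.00     # dollars; reported cost of some o3 answers
gpt4o_per_million_output = 10.00    # dollars per one million GPT-4o output tokens

tokens_for_same_money = o3_cost_per_question / gpt4o_per_million_output * 1_000_000
print(f"${o3_cost_per_question:,.0f} buys roughly {tokens_for_same_money:,.0f} "
      "GPT-4o output tokens")
# -> $1,000 buys roughly 100,000,000 GPT-4o output tokens
```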
Right now, the most powerful mathematical instrument we have is the human mind. But AI models like o3 are starting to approach (or even exceed) the abilities of their human creators. We don’t know what will happen next, but one thing is undeniable: AI math is accelerating rapidly.
How much could that change the world? Opinions differ. One argument is that math lies at the heart of many scientific disciplines, so AI math could accelerate scientific progress dramatically. Another argument is that, although math is important, the real barriers to progress lie elsewhere—so super-human AI math could end up as more of a curiosity. Think of chess: AIs can beat even the best human players, but it doesn’t affect everyday life much.
One area where super-human AI math could have a transformative impact is in formal methods—the use of math to build highly reliable systems. Galois has pioneered formal methods (FM) technologies for more than 25 years, and we’ve used FM to build secure systems for U.S. government agencies and many commercial clients. However, there has always been a challenge: FM techniques depend on mathematically capable humans, which makes them prohibitively expensive for many applications.
If AIs surpass humans in mathematics, it could dramatically change what is possible in formal methods. We might be able to:
All of these ideas are in a sense possible today, but only if we have clever humans to perform the work. The problem is that humans are scarce and expensive. If we can replace (or significantly augment) human experts with AI, we may be able to apply formal methods at scale to many more systems in the world.
In fact, formal methods may be able to address one of the major limitations of AI: its propensity to make up false answers (or to ‘hallucinate’). An AI coder may not be trustworthy enough to use, even if it has super-human abilities. But if we pair it with a highly trustworthy formal methods tool, we may be able to establish that the code it produces is correct: automatically, at scale, and with limited human oversight.
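As a toy illustration of that pairing, the sketch below uses the Z3 SMT solver's Python bindings to check a hypothetical "AI-generated" function against a simple specification. This is not Galois's tooling or o3's output; it just shows the kind of fully automatic check that formal methods make possible.

```python
# A minimal sketch of "AI writes code, a formal tool checks it", using Z3.
# The "generated" code is a branch-free absolute-value trick; the spec is
# the obvious definition. Z3 searches for any 32-bit counterexample.
from z3 import BitVec, If, Solver, unsat

def generated_abs(x):
    # candidate produced by a (hypothetical) AI coder: abs(x) via bit tricks
    mask = x >> 31            # arithmetic shift: 0 for x >= 0, -1 for x < 0
    return (x + mask) ^ mask

def spec_abs(x):
    # the specification we actually want
    return If(x >= 0, x, -x)

x = BitVec("x", 32)
solver = Solver()
solver.add(generated_abs(x) != spec_abs(x))  # ask Z3 for any disagreement

if solver.check() == unsat:
    print("Verified: generated code matches the spec for all 32-bit inputs.")
else:
    print("Counterexample found:", solver.model())
```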
Right now, the o3 model is at the frontier of what’s possible. We don’t know how much it cost to run the model, but some observers estimate tens or hundreds of thousands of dollars for the entire Frontier Math benchmark. That’s much too expensive to reshape the world today. But the lesson from recent years is that once an AI capability becomes possible, it rapidly becomes much cheaper. We may be a few years (or months!) away from having o3-level math assistants running on our phones or laptops.
We also don’t know yet which questions o3 was able to solve. It’s reasonable to assume many of them were in Tier 1, the “easier” group of questions. Still, scoring 25% on such a demanding test is a remarkable achievement.
o3 isn’t the only important result in AI mathematics this year. In June, Google DeepMind announced AlphaProof, an AI that can also solve challenge problems of the kind that Frontier Math contains. Both AlphaProof and o3 attack the same problem, but in very different ways. AlphaProof is a specialized AI trained specifically for mathematical reasoning, using a ‘reinforcement learning’ approach similar to that of previous game-playing AIs such as AlphaGo. What’s particularly exciting about o3 is that it is a general AI, similar to GPT-4, that also exhibits advanced mathematical capabilities.
Where does that leave us? The truth is, no one knows what happens next—it depends on how well OpenAI and other labs succeed in building cheaper and more effective AI models for math. At Galois, we’re closely watching the results on Frontier Math. If we see scores climb and costs go down, we would view this as an indicator that AIs may soon reach or surpass human capabilities.
Regardless, Epoch AI, the organization behind Frontier Math, isn't standing still. They want the benchmark to stay useful and avoid becoming saturated as AI capabilities increase. Epoch recently announced that they are planning Frontier Math Tier 4, an even harder set of problems. Their aim is that answering a Tier 4 problem should require the expertise of an entire math department. If an AI model can answer such questions, we might truly judge that it has "superhuman" mathematical abilities.