
OpenAI’s IMO Team on Why Models Are Finally Solving Elite-Level Math

In just two months, a scrappy three-person team at OpenAI sprinted to achieve what the entire AI field has been chasing for years: gold-level performance on International Mathematical Olympiad problems. Alex Wei, Sheryl Hsu and Noam Brown discuss their distinctive approach, which used general-purpose reinforcement learning techniques on hard-to-verify tasks rather than formal verification tools. The model showed surprising self-awareness by admitting it couldn't solve problem six, and the team reflects on the humbling gap between solving competition problems and genuine mathematical research breakthroughs.

Summary

OpenAI researchers Alex Wei, Sheryl Hsu and Noam Brown took a different approach than other AI labs and achieved gold medal performance at this year's International Mathematical Olympiad. They prioritized general-purpose AI reasoning techniques over specialized mathematical tools. Their breakthrough demonstrates how test-time compute scaling and reinforcement learning can tackle hard-to-verify tasks, representing a significant leap in AI's mathematical reasoning capabilities.

Build with general techniques, not specialized solutions: Alex emphasized that their team “really prioritized general purpose techniques” rather than developing specialized systems for mathematical competition. Unlike previous AI projects that required years of domain-specific engineering, this approach focused on scalable reinforcement learning methods that could improve reasoning across multiple domains, not just mathematics.

Small teams can achieve breakthrough results: The core team consisted of just three researchers working for only two months on the final sprint, though they built on broader OpenAI infrastructure. They leveraged existing work from inference, scaling, and training teams—demonstrating how focused execution can amplify organizational capabilities.

Self-awareness prevents hallucination in difficult problems: When the model encountered the most difficult problem, it acknowledged its inability rather than generating a plausible-sounding but incorrect solution. Training a model to give “no answer” represents a crucial advancement for AI reliability.

Test-time compute scaling enables deeper reasoning: The breakthrough came from scaling inference compute from seconds to hours, allowing models to think longer about complex problems. However, as problems run longer, evaluation itself becomes a bottleneck, since each eval takes as long as the model spends thinking.

Competitions represent stepping stones, not destinations: The IMO is emblematic of AI progress generally, but there remains a large gap between competition performance and real research breakthroughs. Ultimately, real-world utility is the standard by which AI systems are judged.

Transcript

Contents

Noam Brown: The pace of progress is really, I think you see it so clearly in math. And I think Alex tweeted about this, where even a few years ago, these models were struggling with grade school math. And then I remember even in 2024, GSM8K was used as the standard eval when everybody would release a model, and then it was MATH for a short period of time, and then it became AIME and then it became USAMO. And the pace at which it's blown through all of these math benchmarks is really astonishing.

Sonya Huang: Today, we’re joined by Alex Wei, Sheryl Hsu and Noam Brown, the trio behind the OpenAI model that just achieved gold medal performance at the International Math Olympiad. The IMO gold is one of the most important milestones in the race to artificial superintelligence, and what makes this breakthrough particularly fascinating isn’t just the mathematical chops, but the underlying architecture: general purpose techniques for scaling test-time compute and handling hard-to-verify tasks that extend far beyond competition math. We’ve now gone from models that can reason about math for a tenth of a minute just a year ago to systems that can reason and concentrate on the order of a hundred minutes. The hope for superintelligence is that as we scale reasoning to thousands or hundreds of thousands of hours, we can begin to solve humanity’s greatest unsolved problems in math, the sciences and more. Alex, Sheryl and Noam joined us on Training Data to talk about their approach, and share some of the behind-the-scenes fun and learnings behind this historic result. Enjoy the show.

Alex, Sheryl, Noam, thank you so much for joining us today. We have with us the team behind OpenAI’s first gold medal at the IMO. Congratulations to you all. It’s a momentous achievement.

Noam Brown: Thanks.

Alex Wei: Thank you.

The journey to IMO gold

Sonya Huang: I’d love to get into a little bit of the origin story behind this. I know that the IMO gold has just been this elusive thing that everyone in AI has been chasing for a long time. I remember back when Sam pitched us in 2021, it was on the slides, and I remember thinking, “Oh, that seems really far away.” I’d love to understand, you know, the more immediate origin story for this specific effort. When did you guys start thinking about this and how did it come about?

Alex Wei: Yeah, I think it’s sort of like something that we’ve been thinking about for a long time. And I remember in my first week at OpenAI, Noam asked me, like, “When do you think the model will get IMO gold?” And I thought it was really unlikely in 2025, but I feel like it’s something that’s always been on our minds. As you said, Sam, many years ago as well. But this specific effort, I think it was really only like, you know, maybe a couple months since …

Sonya Huang: Just a couple months. Wow.

Alex Wei: Like, the sort of last sprint to, like, get everything ready for this year’s IMO. And of course, we’ve been working on, like, improving our RL algorithms. The ideas for this started coming together maybe six months ago, but really the last push, like, you know, we’re going to try to do something for this year’s IMO was only a couple months long.

Sonya Huang: That’s amazing. And how big is the team involved?

Alex Wei: So we’re, like, you know, definitely building on, like, a lot of folks’ work at OpenAI. Like, this is not possible without a lot of help from people, from, you know, the people working on inference and the scaling org, the people who do the pre-training and the RL training. But in terms of, like, the core team, I would say it’s just three of us. So it was a super small, scrappy effort here.

Sonya Huang: That’s crazy. Just the three of you.

Noam Brown: Also, it was mostly Alex. Alex had been working on this technique for a while, and Sheryl and I were happy to help out as we were getting closer to the IMO to make it a reality.

Early signs of success

Sonya Huang: That’s so cool. And how does this even come about? Like, do you self direct and self choose? You know, “I want to work on IMO gold and I’m gonna get us there.” How do you even raise your hand to work on something like this?

Alex Wei: I think it was something where it just felt like, you know, maybe it’s possible. Like, maybe if we, like, you know, push a bit for a couple of months we can just, like, you know, get there.

Noam Brown: One of the nice things about OpenAI is that I think the researchers are really empowered to do the kinds of research that they think is impactful. And, you know, so Alex had this pitch that, hey, there’s this new technique that I think could help out a lot. And honestly, there’s a decent amount of skepticism. You know, I think some people were supportive, but everybody felt like we should give them the freedom to be able to explore this and pursue it. And then it started showing some strong evidence, and I think people still were a little skeptical, but more people were getting excited about it. And eventually it turns into something more substantial. And I think now people are obviously very excited about it. [laughs]

Sonya Huang: Can you say a little bit more about the strong evidence? Like, what were some of the early signs that you all were seeing that made you really lean in?

Alex Wei: I think it was just, like, progress on hard-to-verify tasks, where previously a lot of RL was more focused around verifiable rewards, like, you know, what can you do if you have those? Just seeing more improvement on these hard-to-verify tasks is, I think, what made us excited.

Sonya Huang: Maybe on that front, how did you even verify that the results you had were right? I saw that you published the proofs on GitHub, but can you just say a little bit more about how you even know that you’ve discovered the answers? Because my understanding is that they’re done a bit differently from how a human might answer them.

Alex Wei: Yeah, I do think the style of the model outputs is a little atrocious.

Sonya Huang: “Atrocious” isn’t the word I was going to use. Creative. Like an alien language.

Alex Wei: Yeah, it’s a little. I think we could have. I mean, you know, it was a very small, scrappy effort, and so we didn’t optimize as hard for human readability. But that’s something that we know how to do. Like, you know, we can do the same stuff in the same way that ChatGPT is very readable. We can do the same things here.

Sonya Huang: Do you even need to optimize for human readability? Is that even important?

Noam Brown: I think if you’re showing this to humans, they prefer readability. We were actually discussing, you know, we got the proofs, like, okay, because you could actually just, like, run them through ChatGPT and, like, ask ChatGPT to, like, rewrite them in a more readable way. And it’s like, the proofs are still correct, they’re just like, a little bit more readable. And we were like, “Oh, when we post these online, should we post the more readable version that’s run through ChatGPT, or should we just post the raw version?” And we decided, you know, I think for full transparency, we’ll just, like, post the originals, and people will figure it out.

Sonya Huang: You guys have a bunch of IMO medalists and participants in the staff at OpenAI, right? Do you guys, like, moonlight in your spare time grading the answers that the model produces?

Alex Wei: During the testing, like, you know, we read a lot of samples, but for grading these specifically, we hired external former IMO medalists. So each proof was graded by three medalists, and for each one, they reached unanimous consensus on the correctness.

Noam Brown: I should also say that for me, I don’t know about Sheryl, but for me, like, the proofs are beyond my ability to comprehend. I was a math major, and I never really did competition math, and already the stuff that this model is writing about is beyond my ability to grade.

Sheryl Hsu: Yeah, same. I think that’s what makes it, like, even more amazing just how smart the model is.

Problem six

Sonya Huang: Totally. What about problem six? How come none of the models at this year’s IMO had a solution, and your model didn’t even attempt it? Can you say more about what makes that problem so hard? And traditionally problem six is always the hardest at the IMO. Is that right?

Alex Wei: Yeah, I think problem three or problem six, usually.

Sonya Huang: Okay. Say a bit more about what made problem six different. And I think you tweeted that the fact that your model knew it couldn’t solve problem six was one of the things that gave you hope, so say a bit more about that as well.

Alex Wei: For problem six, it’s just a really tough problem. I think if you gave me months to think about it, if you even gave me a big hint about, like, the main idea to solve problem six, I don’t think I’d be able to get there. It’s just a crazy, tough problem where there’s so many things you can do and there’s a very narrow path to finding the proof. And I think it’s one of those things. I think math is just hard.

Sheryl Hsu: Yeah. And we, like, threw a lot of compute at problem six, but I think it was good to see, like, the model doesn’t, like, kind of hallucinate or try to just make up some solution, but instead will say, like, “No answer.” I mean, it is kind of disappointing when you feel like it’s done so much work just to say, “No answer,” but I think it’s good that it actually acknowledges that.

Sonya Huang: Yeah, that’s an amazing level of self-awareness about your own ceiling. Because I remember, at least a couple years ago with these models, they’d always try to be helpful and make up an answer, right? And so to see this is just, I think, an amazing level of self-awareness from these models.

Noam Brown: When we released the reasoning models, I talked to some professors, mathematicians, computer scientists, and I was asking them, like, you know, are you finding value in these models? And the answer is frequently yes, but the one thing that they would complain about is like, if they would ever ask the model a question that it didn’t know the answer to, it would just, like, output a very convincing but wrong answer. And they would have to go through it very carefully to figure out was it exactly correct or was there some flip of an inequality or something that the model snuck in there. And it’s nice to see that this model, like, if it doesn’t know, it will just acknowledge that it doesn’t know, at least more frequently.
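The abstention behavior Noam describes can be illustrated with a toy scoring rule. This is a hypothetical sketch, not OpenAI's actual training objective: if wrong answers are penalized more than abstentions, an uncertain model maximizes its expected reward by saying "no answer" rather than guessing.

```python
# Toy illustration (not OpenAI's actual setup): under a scoring rule that
# penalizes confident wrong answers while treating abstention as neutral,
# a model maximizes expected reward by answering only when sufficiently sure.

def expected_reward(p_correct: float,
                    r_correct: float = 1.0,
                    r_wrong: float = -1.0) -> float:
    """Expected reward if the model commits to an answer it believes
    is correct with probability p_correct."""
    return p_correct * r_correct + (1 - p_correct) * r_wrong

def should_answer(p_correct: float, r_abstain: float = 0.0) -> bool:
    # Answer only when the expected reward beats the abstention reward.
    return expected_reward(p_correct) > r_abstain

assert should_answer(0.9)        # confident -> commit to an answer
assert not should_answer(0.3)    # unsure -> say "no answer" instead
```

With symmetric +1/-1 scoring the break-even confidence is 50 percent; making the penalty for wrong answers harsher pushes that threshold higher, so abstention becomes the rational default whenever the model is unsure.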

The internal vibe

Sonya Huang: I guess internally, did you guys have a betting pool, like a Polymarket or something, going on whether you guys were gonna win IMO gold this year? And, like, what was the internal vibe?

Alex Wei: I think we felt like we had a strong shot, but I think we also felt that it wasn’t like a lock, where there’s definitely a distribution of questions where the models would probably struggle more than the humans. But then there’s another distribution of questions where the models would be really, really strong. And I think this year was somewhere in the middle where, you know, problem six, I think, is just out of reach of state-of-the-art models today. And I think maybe in general, like, you know, these hard combinatorics problems, which problem six was, are, I think, more challenging. And that’s still something that the models struggle with.

Sonya Huang: What is it about combinatorics that makes it challenging, versus geometry, for example, which seems like you guys do well at?

Alex Wei: I think for combinatorics it’s probably because it’s a little more abstract, a little more high dimensional. And oftentimes combinatorics problems require, like, leaps of faith or leaps of insight that the models aren’t as good at. I think the models are better at problems that require a bunch of smaller steps, for example.

Sonya Huang: What about from your guys’ perspective? Was the internal vibe optimistic or not that you all were gonna get gold?

Sheryl Hsu: I feel like it wasn’t super optimistic. Like, I think we definitely knew that it could happen, but I think even a month or two months back, it definitely felt like we’d have to improve quite a bit, which I guess we did.

Noam Brown: I remember I was talking to another researcher at OpenAI, maybe two months before the competition, and we were saying, like, “Okay, if we were to bet …” You know, I’m a betting man. I’m happy to bet.

Sonya Huang: Yes, you are. [laughs]

Noam Brown: And I was saying, like, what odds would you take? Because I was willing to bet, I’m like, we were going to get gold here. And he was like, there’s really no chance. And, you know, he said that he would gladly take, like, two-to-one odds against, like, the model winning. So, like, you know, less than one-third chance. But he didn’t want to bet against us. So, you know, he thought it’d be bad vibes to bet against the team winning. So he didn’t go for the bet.

Sonya Huang: So did you make some pocket change Noam?

Noam Brown: I wish.

Sonya Huang: [laughs] I mean, you need it. Because you guys were at, I think you tweeted, 12 percent on AIME, like, 15 months ago, right? So even though you never want to bet against scale at OpenAI, it’s just an astounding slope of what you all have accomplished here.

Noam Brown: The pace of progress is really, I think you see it so clearly in math. And I think Alex tweeted about this, where even a few years ago, these models were struggling with, like, grade school math. And then I remember even in 2024, GSM8K was used as the standard eval when everybody would release a model, and then it was, like, MATH for a short period of time, and then it became AIME and then it became USAMO. And the pace at which it’s blown through all of these math benchmarks is really astonishing.

Sheryl Hsu: Yeah, I remember training a model on GSM8K two years ago.

Sonya Huang: Yeah, we’re past those days, huh? Saturated the evals. What’s next? Do you think, I mean, at this point next year, you think we’ll be solving Millennium Prizes?

Alex Wei: I think those are still very far away. On one hand, you think about how much math progress has been made since GSM8K, which just two years ago was sort of the standard that people were trying to push on. That’s an astounding level of progress. But also you think about how much time these take people: GSM8K problems are grade school math. It takes someone good at math a couple seconds. And now we’ve gone from a couple seconds to something that takes these brilliant students an hour and a half per problem on average, and the IMO is three problems in four and a half hours.

And then research math is going to be, like, you know, these same brilliant students. They’ve grown up, they’re researchers. It’s going to take them, like 1,500 hours. So there’s 1000x more thinking time. And then Millennium Prize problems have taken entire fields, like, you know, people’s lifetimes of thinking, and we still don’t have much progress on most of those.

And so it’s on one hand super exciting that we’ve made so much progress. On the other hand, it’s sort of also humbling to see how much further progress has to go from an hour and a half to tens of thousands, hundreds of thousands of hours of human thinking.

Seeing the future

Sonya Huang: Totally. Noam, I think you deserve a lot of credit for seeing the future on this. I remember you visited us before you even joined OpenAI, talking about the results from game play, and what happens if you let a model think for hours and tens of hours. And credit you, you’ve really seen the future on this.

Noam Brown: Thank you. Yeah, I mean, it’s exciting to see it actually happen.

Sonya Huang: What are the hard things that happen as you scale compute time, inference time, from the order of 0.1 minutes to the order of 100 minutes? I guess at a high level, because most of our listeners are not AI researchers, but what are the hard things that happen to keep the model on the rails, so to speak?

Noam Brown: One thing I think we can point to is, like, pretty clearly a challenge, is that if you have the model thinking for 1,500 hours, then in order to eval it, you have to have it think for 1,500 hours. And so eventually, the evaluation of the models becomes a significant speed bump on progress. So we’re not really at that point. You know, if we have the model think for an hour and a half, it’s no big deal, you know, we can run those tests. But to run a test where the model is thinking for a month, it takes a month to finish that test. And so progress can only advance so fast if you want to wait for those kinds of results.
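The arithmetic behind the bottleneck Noam describes is simple: because a model's chain of thought is serial, an eval run cannot finish faster than the model finishes thinking, no matter how many problems run in parallel. A quick back-of-the-envelope sketch using the illustrative numbers from the conversation:

```python
# Minimum wall-clock time to evaluate a model whose reasoning is serial:
# parallelism across problems helps throughput, but a single problem's
# chain of thought cannot be sped up by adding machines.

def eval_wall_clock_days(hours_per_problem: float) -> float:
    """Minimum days to complete one eval run, assuming fully serial thinking."""
    return hours_per_problem / 24

print(eval_wall_clock_days(1.5))   # IMO-scale thinking -> 0.0625 days: easy to iterate on
print(eval_wall_clock_days(1500))  # research-scale thinking -> 62.5 days: months per eval
```

At 1,500 hours of thinking per problem, each eval ties up roughly two months of wall-clock time, so iteration speed on research, not just compute, becomes the limiting factor.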

Sonya Huang: I think both of you are on the multi-agent team. Help me understand the role that multi-agent systems played in this.

Noam Brown: Yeah. So in addition to having the model think for a very long time and make a lot of progress on hard-to-verify tasks, this also involved scaling up parallel compute. And so there’s a multi-agent component to that. We’re probably not going to be able to go into too much detail about the exact techniques, but that was certainly one way that we were able to scale up test time compute for the IMO.

By the way, one thing I’ll add on the multi-agent scaling of parallel compute is that in the way we did it, we really tried to prioritize generality in our techniques. For example, I worked on AI for poker. Alex and I actually both worked on AI for Diplomacy, so Alex was on the team that worked on CICERO.

Sonya Huang: CICERO?

Noam Brown: Yeah.

Sonya Huang: Nice.

Noam Brown: And, you know, those were projects that I’m really proud of, but they were also projects that we spent years working on to achieve that result. And with the pace of AI progress being so fast, it felt like that wasn’t the best use of time to, like, develop a very bespoke system that could only do that one task. And so we all really prioritized general purpose techniques in all this. And the techniques that we used for everything, for scaling up the thinking time, for working on hard-to-verify tasks and for the parallel compute are all general purpose techniques that we’re either planning or have used for other systems as well.

Why not Lean?

Sonya Huang: And is that the reason you all chose not to do this in Lean? Like, my understanding is the official kind of IMO AI track was a Lean interpretation this year. Is that why you guys chose not to go with Lean?

Noam Brown: Yeah, that’s right. I mean, there is certainly, I think there is a lot of value in Lean as a tool. You know, mathematicians find it useful, for example, but the priority for us is really general purpose reasoning capabilities. And Lean has its limitations, and so that’s why we wanted to prioritize natural language.

Sonya Huang: My layman’s understanding is Lean is a formal verification tool. Does your result here basically say that, like, informal verification with scale can perform at the same level or even surpass formal verification? Is that the right takeaway?

Noam Brown: I would not say that’s the right takeaway. I don’t know. Alex, do you have thoughts?

Alex Wei: I would say that these are just two orthogonal components here. We found informal math an interesting problem because it represents a kernel of difficulty around scaling up test-time compute on hard-to-verify tasks, difficulties that are representative of a very broad set of tasks we were interested in from a general-purpose standpoint. I think Lean is a little bit more narrow, where a lot more of the world can be approached with informal reasoning than is formalizable.

Noam Brown: I don’t think there’s anything wrong with narrow AI. Like, narrow AI can be very effective and obviously far surpass general purpose AI in certain domains. And I think the right way to think about it is in the same way that humans, human mathematicians find a lot of value in Lean, general AI can be compatible with a more narrow system that’s focused on formal mathematics. And the combination, I think, can be better because of it.

Sonya Huang: I think I saw on Twitter from multiple folks at OpenAI, and I think you guys have mentioned this as well, that this system was built with a very similar approach and infrastructure to many of the recent launches from OpenAI. We had Isa from the ChatGPT agent launch on the podcast last week. Can you say a little bit more about what the similar kind of foundation and approach is?

Sheryl Hsu: I think infrastructure-wise, I mean, we all kind of just use the same infrastructure. But I think as far as the core of this question, like, Noam and Alex said, there’s nothing that’s very bespoke to IMO here. And the hope is really that we can use the techniques that Alex worked on as far as, like, non-verifiable tasks, and as far as just scaling up test time compute, and be able to apply this to other areas of reasoning or other areas of model capabilities in general, and just build stronger models, keep improving agent, keep improving ChatGPT and everything else.

IMO day itself

Sonya Huang: Tell me about the actual experience of IMO day. What was it like?

Noam Brown: Yeah. I mean, we were waiting for the problems to come through because, like, you know, once the participants finish the exam, then they get posted. And so we plugged the problems into our model, and that was around, I guess, pretty late at night, maybe 1:00 am or something. And honestly, I went to sleep because it’s like, you know, it’s 1:00 am, I’m not going to stay up for four and a half hours to see the output. I’ll just wake up in the morning and see. But I think these two actually stayed up and got to watch the model, and see it come in in real time.

Sheryl Hsu: It was a lot of fun.

Sonya Huang: Did anyone call Noam, like, “Wake up, wake up! We got this!”

Noam Brown: There were a couple moments where, like, Alex was so exhausted that he decided to take a nap. But we told him, “Okay, just make sure your phone is unsilenced so that if we need to wake you up we can call you.” And at one point we did actually have to call him, but I don’t think he woke up. [laughs]

Sonya Huang: That’s awesome. It must have been such a thrill and such a high, especially for that to come through at, like, so you started at 1:00 am, so you must have known at, like, 9:00 am, then?

Sheryl Hsu: Oh, it’s four and a half hours.

Sonya Huang: Four and a half hours.

Noam Brown: For the first day.

Sonya Huang: Okay.

Sheryl Hsu: Yeah, I don’t know. I mean, we can kind of see the problems come in. So I’d just be, like, making sure the systems are staying stable. And then Alex is over there reading and seeing how the model’s doing.

Sonya Huang: So you were doing the live human proof checking to see if it was actually …

Alex Wei: I was naturally very anxious about the results.

Sonya Huang: Totally.

Alex Wei: So I was just looking at the partial progress the model was making. You can sort of observe that. And then I also hand checked things. Like, we were going to send these out to the graders, but I also just hand checked them because I was so curious.

Sonya Huang: Okay, well call me next time. I want to come hang out for that. I’m not going to go to sleep. That sounds awesome.

Noam Brown: One of the cool things about these models is, like, you know, I can’t understand the proofs, but when you see the model thinking about it, it will express its uncertainty or its confidence in natural language throughout the process, and it will just kind of say words that will hint at its, you know, if it’s really confident that it figured it out, it’ll say “good” a lot. And if it’s unsure, it’ll throw in a lot of question marks. And so it’s cool that I can kind of follow along and see how the model is feeling about its progress, even though I can’t really tell if it’s got it correct or not.

Sheryl Hsu: You get the dreaded “Seems hard.”

Sonya Huang:  [laughs] You got that on problem six?

Sheryl Hsu: We got that a lot. No progress. Hard.

What’s next?

Sonya Huang: Keep going. Too bad. [laughs] Wonderful. I guess, looking ahead, you’ve gotten, like, the pinnacle results in competition math. I guess you can go do Putnam next year, but you’re basically at the top, right? And so what’s next?

Alex Wei: Yeah, so actually for Putnam, since the exam gives less time per problem than the IMO and it’s a little more knowledge heavy, we actually found in our evals that the model was really, really good at Putnam problems. Like, better than it was at IMO problems. And so I think the frontiers here are really not about these very time-boxed competition problems anymore, but about problems that take longer periods of time and deeper thinking to solve.

Sonya Huang: That’s really cool. Okay, so you’re going to start proving novel theorems now?

Alex Wei: Again, I think there’s this very intimidating gap between these very time-boxed competition problems and, like, a real research breakthrough, which takes a year’s worth of work that’s on the order of 1,500 hours instead of 1.5.

Sonya Huang: Yeah, totally. I guess relatedly, I was listening to the Demis podcast last night, and he mentions that the hardest thing is actually coming up with the interesting problems to solve. And I’m curious if you all agree with that.

Noam Brown: I think there’s some truth to that, that these models are really good now at solving these problems. Coming up with them is still a challenge, but I think it’s also worth noting the incredible pace of progress that we’re seeing. And there’s always a next hurdle, you know? And originally when LLMs came out, it was like, well, how do we get them to reason? And then we got them to reason, but then how do we get them to reason on hard-to-verify tasks? And now they can reason on hard-to-verify tasks. And I think the next hurdle is going to be, okay, well, how do we get them to come up with these novel questions? You know, even creating an IMO question is a challenge, and it takes expert mathematicians a lot of work to do that. But I don’t see any fundamental barriers that block us from getting there.

Sonya Huang: I love that. Do your results in math just fully generalize, so that, you know, you’re going to be better at scientific reasoning, better at general reasoning? Does being great at competition math make you great at everything else?

Alex Wei: I think how we approached this was not that we should be great at competition math; really, we were focused on developing general-purpose techniques to make reinforcement learning better. And with those, we are very excited to improve our models in other domains beyond math, and hopefully make models more useful in everyday usage.

Noam Brown: This is a pretty late-breaking result. Honestly, it was a surprise even to people internally at OpenAI. And so the next step is to incorporate this more broadly into our models and, you know, improve the reasoning capabilities across the board. But it’s going to take some time to go through that process and deploy it to the world. So I think it’s going to come, but yeah, it’ll just take a little bit more time.

Sonya Huang: Is it harder for these models to do the IMO or the Physics Olympiad?

Alex Wei: I think definitely the Physics Olympiad, because the Physics Olympiad has, I think, like an experimental section where you have to …

Sonya Huang: [laughs]

Sheryl Hsu: We need to solve robotics first.

Sonya Huang: I didn’t realize that. Okay. I thought it was just on a piece of paper.

Alex Wei: Yeah. So I think the model will probably be good at the on-the-paper part, but yeah, I think there’ll be a little bit of time before it can do the experiments.

Sonya Huang: Not with like a world model. [laughs] Okay, cool. Are you going to release this model for customers to play with? Roelof’s son is a Math Olympiad kid and he’s like, “I want access to the Math Olympiad model.” Will people be able to play with this?

Noam Brown: So we want to make this accessible to mathematicians to use. We’re still trying to figure out the exact details of how we make that happen, but I think it’s really cool that we’ve developed this system that is incredibly good at math, and it makes sense that we want to see what mathematicians can do with it. I’ve actually already been emailing with this Stanford mathematics professor. He actually emailed me about a year ago, before we announced o1, and he was like, “Hey, do you want to do a collaboration on solving hard math problems?” And basically what I told him is, I think we just have to advance general reasoning capabilities and eventually they’re going to be able to help you with your hard math problems. And I think that’s actually the most promising route to getting there. He was a little skeptical, but every reasoning model release, he’s emailed me with a follow-up: “Can it solve this problem now?” And I’ve been plugging them in, and I can’t tell whether the output is right, but I email it back to him and he says, “Yeah, that’s wrong.”

Sonya Huang: [laughs]

Noam Brown: And he emailed me a follow up this time with the same problem asking, like, “Hey, can it solve it now?” It still can’t solve it, but at least this time it recognizes that it can’t solve it. So I think that’s a big step. But we’re curious to see if there’s, like, a lot of other problems out there that mathematicians want to challenge this model with, and see if it can take them on.

Sonya Huang: Amazing. Congratulations to you all. I think this is a momentous result that the entire field has been waiting for for a very long time. And the fact that it was accomplished by a team of three people in a span of two months, it’s extraordinary. Congratulations, and thanks for joining us on Training Data.

Sheryl Hsu: Thank you.

Alex Wei: Thank you.

Noam Brown: Thanks for having us.

Mentioned in this episode
