The Breakthroughs Needed for AGI Have Already Been Made: OpenAI Former Research Head Bob McGrew
Training Data: Ep50
As OpenAI’s former Head of Research, Bob McGrew witnessed the company’s evolution from GPT-3’s breakthrough to today’s reasoning models. He argues that there are three legs of the stool for AGI—Transformers, scaled pre-training, and reasoning—and that the fundamentals that will shape the next decade-plus are already in place. He thinks 2025 will be defined by reasoning while pre-training hits diminishing returns. Bob discusses why the agent economy will price services at compute costs due to near-infinite supply, fundamentally disrupting industries like law and medicine, and how his children use ChatGPT to spark curiosity and agency. From robotics breakthroughs to managing brilliant researchers, Bob offers a unique perspective on AI’s trajectory and where startups can still find defensible opportunities.
Summary
Bob McGrew, OpenAI’s former Head of Research, led the research organization that brought reasoning capabilities to frontier models and shaped the technical roadmap from GPT-3 to the reasoning breakthroughs of o1 and o3. His insights center on the fundamental shifts happening in AI development and where opportunities remain for startups in an increasingly competitive landscape dominated by frontier labs.
Reasoning represents the biggest opportunity in 2025, with significant low-hanging fruit still available: McGrew emphasizes that reasoning has an “overhang of compute, data, and algorithmic efficiency improvements” that creates substantial room for rapid progress. The difference between o1-preview and o3—adding tool use to chain of thought—exemplifies how obvious improvements can take months to implement, suggesting many more such advances await discovery. Every frontier lab is focusing intensively on reasoning, making it the primary battleground for capability improvements this year.
Agents will be commoditized and priced at compute cost, not human replacement value: The fundamental economics of AI agents differ dramatically from human services. While lawyers command high fees due to scarcity, AI lawyers face infinite supply once the capability exists. McGrew warns founders against assuming they can charge based on human job values, noting that “some other startup can come in and compete that away” using the same underlying frontier models. The real value lies in creating genuine scarcity through network effects, brand or economies of scale.
Enterprise applications requiring deep domain integration remain safe from frontier lab competition: Companies like Palantir and Distyl succeed by building systems around models rather than training specialized models. Frontier labs see business problems as opportunities to train new models, but individual enterprise needs are too small to warrant dedicated model development. The opportunity lies in creating the infrastructure that extracts context from businesses, feeds it to models and transforms outputs into actionable decisions—turning many small problems into one scalable solution.
Robotics has reached an inflection point due to language interfaces and vision capabilities: McGrew contrasts his 2016 experience teaching a robot to play checkers—”very fun and super cool, and really far away from any form of commercialization”—with today’s reality where companies like Physical Intelligence can solve diverse problems like laundry folding in months rather than years. The combination of LLMs providing natural language task description and strong vision encoders gives robots “a head start at doing generic tasks” by building on the existing frontier model and research infrastructure.
Proprietary data value diminishes as AI replicates the embodied labor behind it: Much proprietary data represents accumulated human effort—calling customers, working through case studies, conducting surveys. Since AI can now perform these tasks, competitors can replicate proprietary datasets without the original time investment. The exception is real-world data that customers trust you to use on their behalf, like financial portfolios or personal preferences, where the data enables better service delivery rather than teaching general skills. This shift challenges assumptions about data moats and suggests focusing on trust relationships rather than data accumulation.
Transcript
Chapters
- 2025 is going to be the year of reasoning
- The changing role of pre-training
- The stool has three legs
- The history of reasoning at OpenAI
- The commoditization of agents
- How do startups sell agents?
- A good time for robotics
- Where can startups play?
- Is proprietary data overrated?
- AI coding passes a threshold
- Why “member of the technical staff” at OpenAI?
- ChatGPT equals everything you don’t want to do yourself
- The curiosity of an eight-year-old
- How to prepare the next generation
- Favorite AI apps
- Lessons for engineering managers
- Mentioned in this episode:
Bob McGrew: I think what’s really changed is that now that you have LLMs, you have this language interface to the robot so that now you can describe the tasks much more cheaply, and you have really strong vision encoders, you know, that are tied into that intelligence. So that gives the robots really a head start at doing generic tasks. So we spent years solving one specific problem: teaching a robot to manipulate a Rubik’s Cube. And now a company like, let’s say, Physical Intelligence can spend months solving a huge variety of problems like laundry folding and cardboard and packing egg crates. And that’s something that they can only have because they’re building on top of existing frontier models and, you know, the entire tech and research stack that we’ve built over the last 10 years.
Stephanie Zhan: Bob, thank you so much for joining us today.
Bob McGrew: Oh, it’s great to be here.
2025 is going to be the year of reasoning
Stephanie Zhan: We’re at a really interesting time in AI development. We have a beautiful new trifecta: pre-training, post-training, reasoning. Can you help us unpack what’s left? What else is there in each of them?
Bob McGrew: So I think we’re going to continue to see capabilities increase. It’s going to continue to feel like it’s felt over the last five years: super fast, super exciting. And I think it’s going to keep feeling like that. There’s not a wall here, but what is going to be different is that 2025 is going to be the year of reasoning.
Stephanie Zhan: Yeah.
Bob McGrew: So it makes a lot of sense. Reasoning is a new technique. When you have a new technique, you know, there’s often an overhang of compute, of data, of algorithmic efficiency improvements that you can make. And you can see that in the incredible progress from o1-preview back in September to o3 six months later in April. And at the same time, we also see diffusion of reasoning from OpenAI, where we’d been working on it for years, out to Google, DeepSeek, Anthropic, again just in a few months. And so this is really the right place for every lab to focus for the year.
Stephanie Zhan: Yeah.
Bob McGrew: And just as a fun example of how low-hanging the fruit is right now: if you look at the most interesting difference between o1-preview and o3, o1-preview is not able to use tools. o3 can use tools as part of the chain of thought. And this is pretty obvious, right? When we were training o1, we knew that this was a thing that we wanted to do, but it was difficult to implement. It took time. And so that took six months to get done and released. The next step on reasoning is going to be a lot less obvious than that. It’s going to be a lot harder. And so, you know, as reasoning continues to mature, we’re going to see the overhang get eaten up, and it’s going to start being slower and slower to make progress.
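To make the tool-use point concrete, here is a minimal sketch of a reasoning loop that can call tools mid-chain-of-thought. The function and type names are hypothetical illustrations, not OpenAI’s implementation.

```python
# Sketch of a reasoning loop that interleaves tool calls with chain of thought.
# call_model and run_tool are hypothetical placeholders for illustration only.
from dataclasses import dataclass

@dataclass
class Step:
    kind: str            # "thought", "tool_call", or "answer"
    content: str
    tool: str = ""       # tool name when kind == "tool_call"

def call_model(context: list[str]) -> Step:
    """Placeholder for a frontier-model call that returns the next reasoning step."""
    raise NotImplementedError

def run_tool(name: str, arguments: str) -> str:
    """Placeholder for executing a tool (search, code interpreter, etc.)."""
    raise NotImplementedError

def reason(question: str, max_steps: int = 20) -> str:
    context = [question]
    for _ in range(max_steps):
        step = call_model(context)
        if step.kind == "answer":
            return step.content              # the model decided it is done
        context.append(step.content)         # keep the thought in the context
        if step.kind == "tool_call":
            result = run_tool(step.tool, step.content)
            context.append(result)           # the tool output feeds the next thought
    return "No answer within the step budget."
```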
Sonya Huang: You said there’s not a wall. I think there’s this meme in the Twittersphere right now that pre-training is hitting a wall. Can you say more about that dynamic?
The changing role of pre-training
Bob McGrew: Yeah, and I think that’s a great question, because pre-training is not going away. But what we’re seeing out of pre-training is that we are at the place where it’s working really well, and we’re hitting diminishing returns. And those diminishing returns are baked in, because the intelligence of a model is log-linear in the amount of compute that you use to train it, which means that you have to have exponential increases in compute to get each increment in intelligence.
When you pre-train a model, that’s a giant training run. It takes all of your data center for, you know, a period of months. And when you go to pre-train the next model, you can’t really do it on the same data center. You can rely a little bit on algorithmic efficiency to make it better, but fundamentally you have to wait until you get a new data center. And that’s not something you can do in six months the way you can make improvements in reasoning right now. That’s something that takes years. So that doesn’t mean pre-training is useless, though, because the real lever for pre-training in 2025 is improving architectures. So even though you’re working on reasoning, you want to improve pre-training so that you can have better inference-time efficiency, or so that you can have longer context or better use of the context. And when you’re doing that, you have to start back from the beginning, do pre-training on this new architecture, and then go through the whole reasoning process again. So that’s the role of pre-training now. It’s still important, it’s just doing something different in the pipeline.
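To make the log-linear relationship concrete, here is a toy illustration (the constants are invented, not a formula from the episode): if capability scales with the logarithm of training compute, each equal step in capability costs a constant multiple of compute.

```python
import math

# Toy illustration of a log-linear scaling law: capability ≈ a + b * log10(compute).
# The constants are invented for illustration; real scaling-law fits differ.
a, b = 0.0, 1.0

def capability(compute_flops: float) -> float:
    return a + b * math.log10(compute_flops)

# Each +1 "unit" of capability requires 10x the training compute of the last step.
for compute_flops in (1e23, 1e24, 1e25):
    print(f"{compute_flops:.0e} FLOPs -> capability {capability(compute_flops):.1f}")
```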
Stephanie Zhan: Can you help us unpack what’s left in post-training?
Bob McGrew: Yeah. So post-training is pretty interesting, because both pre-training and reasoning are about increasing intelligence. And there’s a very clear scaling law where you put in more compute and you get out increasing intelligence. Post-training isn’t like that. Post-training is about model personality. And, you know, intelligence is sort of a thin problem, right? If you can get better at it, it turns out to be very generalizable and it applies to everything. So you can work on math, and you find that it makes you better at legal reasoning.
But model personality, it’s a thick problem. You actually need a lot of human effort to think about what makes a good personality. How do I want this agent to act? And it’s much more of a training process like you would go through over many years of interacting with people. And now it’s a very hard research problem to take that specification for what the agent is and turn it into an actual appealing personality. But when you think about post-training, I think about people like Joanne Jang at OpenAI or Amanda Askell at Anthropic who really spend a lot of time crafting these model personalities. And they’re not research practitioners.
Stephanie Zhan: Interesting.
Bob McGrew: Right? They’re people who are product managers, or they’re people with a very deep understanding of human nature.
Stephanie Zhan: And are there more legs to the stool?
The stool has three legs
Bob McGrew: Well, okay, so I’m going to—I’m going to say something potentially controversial, and I think actually there aren’t.
Stephanie Zhan: Huh.
Bob McGrew: So I think if you go forward to 2030 or if you go forward to 2035, and you look back and you say, “What were the fundamental concepts that you needed in order to create, you know, more and more intelligence?” Maybe, maybe that’s AGI, maybe it’s something different at that point. I think you’re going to come up with the idea of language models with transformers, the idea of scaling the pre-training on those language models—so GPT-1 and GPT-2, basically—and then the idea of reasoning. And woven throughout that, increasingly more multimodal capabilities. And I think even in 2035 we’re not going to see any new trends beyond those. And the reason I think this is, if you go back to 2020, GPT-3 has just been trained. You know, imagine yourself sitting—we’re at OpenAI. We haven’t released this thing, but we know something epochal has happened, and Dario Amodei, Ilya Sutskever, Alec Radford, you know, we’re all sitting there in the room looking at this thing. And it was fairly obvious internally what the roadmap was. We knew that at this point going from GPT-3 to GPT-4 by increasing pre-training was absolutely critical. We could see that we needed to increase multimodality, ultimately ending in a model that could use a computer. We were beginning to run experiments with test-time compute.
Stephanie Zhan: Got it.
Bob McGrew: And in 2021, after the Anthropic people left, we really started developing the idea of reasoning at OpenAI. And it’s funny actually, sometimes my friends ask me after Anthropic released computer use, they’re like, “Did you see that coming?” And I was like, “Well, we were working on that together back before they left.” One of the people who did that project went to Anthropic, and the other stayed at OpenAI and developed Operator. And it just took many years before the multimodality had matured enough to get to that point. That was obvious to us way back then. And so that’s why I think, you know, from here on out there’s very important scaling, there’s very important development and refinement of these ideas. That is extremely hard. It takes a lot of brain power. It’s not going to be easy. But I think if we look back from 2035, we’re not going to see anything new and fundamental. I think I’m right. I kind of hope I’m wrong. It would be a lot more fun if I’m wrong. But I think we’ll have to see.
Sonya Huang: That’s a hot take. I’m glad we have it on the record.
Stephanie Zhan: Yeah.
The history of reasoning at OpenAI
Sonya Huang: We’ll see in 2035. That’s amazing. I’m curious about reasoning. It seems to me that OpenAI really leaned in big on this paradigm, probably before the others. And now everybody has reasoning models. What did you see in reasoning that caused you to lean in so far, so quickly?
Bob McGrew: Well I mean, effectively it really was sort of this missing piece, where with pre-training the model has an intuitive sense of how to answer the question. But if I ask you to multiply two five-digit numbers, that’s something that’s completely within your capability, yet if I asked you to do it right now, on the spot, you wouldn’t be able to, because as a human you naturally need to think about something before you answer, to have a scratch pad to work through a problem. And that is something that, you know, the initial models, even GPT-3, really didn’t have.
And so we began to see, you know, glimmers of this publicly—things like, you know, thinking step by step, and the idea of having a chain of thought that you could train, where the model would learn itself how to guide a chain of thought, not just be guided by cloning from publicly available data on how humans think. That was very powerful. And we knew that it would be more powerful than pre-training because, in fact, your thoughts are inside your head. They are not something that the model has access to. And so almost all the data that’s out there is actually just the final product. You don’t get to see the chain of thought behind it, and so the model had to figure it out for itself. That’s why reasoning mattered.
Stephanie Zhan: You alluded to earlier that we probably still have to uncover more things in reasoning. Do you think we have a good sense of what those things are today? Or are we still early in that R&D stage?
Bob McGrew: I think at this point with reasoning, if you are at the coalface, then you’re seeing a lot of ideas and refinements of things that you can do. I think we’ve gone past the point where if you’re on the outside, if you’re not at a frontier lab, you’re probably not seeing them anymore. And this is the same situation we saw where at one point academic labs could make huge amounts of progress. And then later, you know, I would begin to see academic papers and I’d think, “Oh, they rediscovered this thing that we found a long time ago.” And so now the level of effort that’s being put into this, I think, is actually quite intense. So there are definitely things to be discovered, but they’re not sort of simple ideas that you and I could talk about.
The commoditization of agents
Stephanie Zhan: Cool! Switching gears a little bit, you tweeted recently about agents—I think a very, very interesting take—that agents will be incredibly powerful, but priced at the cost of compute due to competition. If that’s the case, where do you see the opportunities in new startups and companies that are now building agents?
Bob McGrew: Yeah. So I mean, I think the thing about agents is people think, “Well, you know, I’m going to go develop an agent.” And they look at how much the job is worth when a human does it. So, you know, you want to develop an AI lawyer, and you think lawyers get paid a lot of money, so with an AI lawyer you’re going to be able to charge huge amounts of money.
Stephanie Zhan: Tens of thousands of dollars a month. [laughs]
Bob McGrew: Exactly. Exactly, right? But the reason lawyers are expensive is because their time is scarce, because there’s only so many people who have undergone that training. But by the time you’ve made an AI model out of it, well, now there’s effectively an infinite number of lawyers, and so it’s not scarce at all. And maybe you, with your AI lawyer startup, will be able to have a lead over other people, but it’s the same frontier model underneath and, you know, some other startup can come in and compete that away. And so we should expect to see it priced at some opportunity cost over the cost of compute.
Stephanie Zhan: Interesting. Because you’re changing—you now have a lot more supply, infinite supply of the highest capability intelligence in whatever domain you now have.
Bob McGrew: And on the one hand, there’s a story where they say, “Oh, this is bad because startups can’t make money.” But this is actually the future we want, right? We want the services that don’t require people to be extremely cheap. You want everyone to have access to a lawyer. What you want to be expensive and scarce are things that are actually about personal relationships. So, you know, maybe we won’t be asking the human lawyers to write contracts because agents will be doing that for us, but we’ll be asking them for deep advice on how legal issues affect the detailed challenges I’m facing in my business. And I think that’s the world we want to live in.
How do startups sell agents?
Sonya Huang: You think application companies will make any money selling agents, though? Like, where would you tell us to invest?
Bob McGrew: Yes and no. So just to back up for a second. People often talk about where does the value accrue in the stack, right? Is it at the model layer? Is it at the application layer? And if you look at the model layer, it’s very competitive. Every company has a frontier model. Some of the frontier models can do things other frontier models can’t, but by and large, they’re all really very good. And if you’re an enterprise, you can swap them out very easily. And beyond the frontier, you know, all the models that are answering the bulk of the questions are distilled, they’re very competitive.
And so this isn’t a very good business to be in when you consider the cost of training the models. So what’s the point of training models in the first place? It’s to give you an option. It’s to give the frontier labs an option on the valuable places in the application layer that are coming up. So, you know, ChatGPT, that’s a great business, right? There’s a lot of competition over that. I think probably it’s too late to replace ChatGPT—maybe not. You’d have to do something very different. Coding is another place where all the frontier labs are eager to compete right now. I think you can compete with the frontier labs, but you want to do something that’s different, something that involves more than just you talking to your computer or doing some sort of personal productivity task on your computer; something that involves other people, something that involves an enterprise. I think that the moats that you have for your business are going to be the same moats they always were: network effects, brand, economies of scale. And so you want to find an agent that allows you to have those network effects, not just something that, you know, would be high priced out in the world.
Stephanie Zhan: Are there particular domains that you think are maybe outside of the scope of what frontier labs want to innovate in and build in that you think are interesting and that you’ve been mulling about? We’ve got scientists, lawyers, research analysts, agentic software engineers. What other domains have you been thinking about?
A good time for robotics
Bob McGrew: So personally I’m very interested in robotics, because I think robotics is something—I wouldn’t actually say it’s off the roadmap of the frontier labs right now, but I think it’s something that’s far enough away that to me it feels like where AI was a few years ago. And so I think this is a very good time to be a company like Skild or a company like Physical Intelligence. Or to start a new robotics company, maybe not one that’s competing with those two, but somebody that’s doing something different, something on its own. I think it’s at the end stages of being a research challenge, and a matter of months or a small number of years away from being commercialized. So I think that’s really fun.
Sonya Huang: Why now? What do you think has changed? Like, OpenAI famously had a robotics effort for a long time. What do you think has changed?
Bob McGrew: Well, you know, so in between Palantir and OpenAI, I actually wanted to start a robotics company myself. And I got to the point of teaching a robot to play checkers from vision back in 2016.
Sonya Huang: Wow!
Bob McGrew: Yeah, it was very …
Sonya Huang: It could pick up the checkers pieces?
Bob McGrew: It could pick up the checkers pieces, and it could move them to a different place on the board.
Sonya Huang: Nice!
Bob McGrew: And my conclusion from this was that it was very fun and super cool, and really far away from any form of commercialization. And when we pursued robotics at OpenAI, we didn’t pursue it for commercial motives. It was really a demonstration of the power of machine learning. And some of the ideas we had there later played into large language models. But I think what’s really changed is that now that you have LLMs, you have this language interface to the robot so that now you can describe the tasks much more cheaply, and you have really strong vision encoders, you know, that are tied into that intelligence. So that gives the robots really a head start at doing generic tasks. So we spent years solving one specific problem: teaching a robot to manipulate a Rubik’s Cube. And now a company like, let’s say, Physical Intelligence can spend months solving a huge variety of problems like laundry folding and cardboard and packing egg crates. And that’s something that they can only have because they’re building on top of existing frontier models and, you know, the entire tech and research stack that we’ve built over the last 10 years.
Sonya Huang: Yeah. I’m going to go back to this point you had on where is the value? And I really liked your framing that the foundation models kind of have an option on whichever parts of the application stack they want to own. How much of the application market do you think the foundation models will win?
Where can startups play?
Bob McGrew: I think I would look at this in a slightly different direction, which is: if you’re a startup, where is it safe to play, and where are you going to get steamrolled by the frontier labs? And the areas that I think are safe to play in are areas where you have to understand something very deeply outside the model. A lot of enterprise really has this flavor. So for example, you know, Palantir AIP actually really fits this, where, you know, it’s not a model company, but it’s something that sits outside the model.
Stephanie Zhan: Yeah.
Bob McGrew: That interacts with the rest of the business. There’s another company I’m an investor and advisor in, called Distyl, that builds AI systems that allow a business to sort of extract the context from within the business, feed that to the models, and then use that to make decisions. And so these are things that the frontier labs don’t want to do. The frontier labs see business problems as “how do I train a model to do something new?” And if you look at all these enterprises, each one of those is a very small problem. It’s not worth OpenAI or Anthropic’s time to train a model specifically for each one of them. If you flip the problem and you think about what is the system that goes around the models, and how do I feed the context in and get the outputs out, then suddenly that’s one problem, and I think it’s a big opportunity.
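A minimal sketch of that “system around the model” pattern, with hypothetical names for the pieces: extract context from the business, feed it to a general-purpose frontier model, and turn the output into an actionable decision.

```python
# Sketch of the "system around the model" pattern: extract business context,
# feed it to a general-purpose model, and turn the output into a decision.
# All function and type names are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class Decision:
    action: str
    rationale: str

def extract_context(business_systems: dict[str, str]) -> str:
    """Gather relevant records, policies, and workflow state into prompt-ready text."""
    return "\n".join(f"{name}: {value}" for name, value in business_systems.items())

def call_frontier_model(prompt: str) -> str:
    """Placeholder for a call to any frontier model; the models are interchangeable here."""
    raise NotImplementedError

def decide(business_systems: dict[str, str], task: str) -> Decision:
    context = extract_context(business_systems)
    output = call_frontier_model(f"{task}\n\nContext:\n{context}")
    # Turn free-form model output into something the business can act on.
    first_line = output.splitlines()[0] if output else ""
    return Decision(action=first_line, rationale=output)
```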
Stephanie Zhan: What are the specific use cases and problems that Distyl and Palantir’s efforts solve for those enterprise companies?
Bob McGrew: So a lot of times right now, what you see is you’re trying to automate some existing piece of work. And the easy cases are where that piece of work is in a regulated industry, and you’re working on something like healthcare, maybe you’re interacting with insurance companies. And you have a workflow that is extremely scripted, where the company cares a lot about fidelity to that workflow. And that doesn’t mean you can just say, “Hey, AI, go read the clinical guidelines and make these decisions.” But with a process of transformation, you can get it to the point where the AI can do that. And that’s sort of the low-hanging fruit.
And then the next level up, though, is: imagine you’re working on something that isn’t in a regulated industry or that isn’t extremely scripted, and you want to automate some labor-intensive process. Well, the first thing you have to do is make that process legible. And if you go to someone and you ask them to describe their job, a lot of times their manager doesn’t know what they do. They don’t even really know what they do. They can give you examples, but they can’t say, like, “This is the workflow that I follow,” because in practice they don’t follow a single workflow.
Stephanie Zhan: Right.
Bob McGrew: Right? And so I think that is what a lot of these problems look like. And that’s actually what Distyl does: you know, work with companies, help them take the data they have, interview the people with AI, systematize all of that, and have it be something that an AI model can actually execute.
Is proprietary data overrated?
Stephanie Zhan: That’s really interesting. Huh! So this is also somewhat related to this other question I wanted to ask you about proprietary data. I was surprised to actually see you tweet this, but I’m very intrigued by this question that you posed, which was: how valuable will your proprietary data be compared to what your competitors’ infinitely smart, infinitely patient agents can estimate from public data? Can you unpack that for us a little bit?
Bob McGrew: Yeah. So, you know, a starting point for this is a few years ago, there was a lot of interest in training industry vertical-specific models. You know, that finance companies would say, “We’ve got all of this data that no one else has, and we’re going to train a finance model on top of GPTs or on top of Llama and it’s going to be so much better.” And actually, all of those were worse than the next generation of GPT, because the power of intelligence and the ability to synthesize new information was bigger than the power of sort of memorizing the old information that you have.
So that’s, I think, what this theme looked like a couple years ago. But fast forward a year or two. Now the story is: I have all of this proprietary data. I’ve accumulated it over years. And in some sense, a lot of the time, if that data is teaching the model a skill, or if it’s meant to teach the model a skill, that data is sort of embodied labor, right? Someone worked through all these case studies.
Stephanie Zhan: Yeah.
Bob McGrew: Someone called all of these customers and found out all of this information. Well, that embodied labor is now free. AI can do all those things. And so now there’s an opportunity. You can have AI call all those customers, do a big survey, find out what they know. You can have AI work through all the case studies, a lot of chats with o3, right? And then now you’ve replicated that proprietary data, but without needing all of that work.
Stephanie Zhan: How do you square that with the value of real world proprietary data? Say something like what Cursor gets from its developer community constantly, or Tesla Autopilot over the last handful of years.
Bob McGrew: So I think those are in the middle because they’re really huge amounts of data.
Stephanie Zhan: Yeah.
Bob McGrew: I think there are challenges sometimes to training on the data that you get from your users. You know, one risk is that if you train on that data and the model memorizes something about a specific person, maybe that leaks out to the next person. So that’s a real challenge to using these kinds of proprietary data.
Stephanie Zhan: Right.
Bob McGrew: I think there is a kind of real world proprietary data that’s very useful, which is data—very specific data about very specific customers that they trust you to use on their behalf. So to give an example, my financial advisor knows a lot about me. She knows my entire portfolio, she knows the kinds of objectives that I have.
Stephanie Zhan: Your risk tolerance. [laughs]
Bob McGrew: My risk tolerance. Right. And she uses all of that information to give me a better outcome, which is what is the next asset I should buy? And she doesn’t do that—like, the data doesn’t make her a better financial advisor. It doesn’t teach her skills, but it allows her an opportunity to use the skills she already has. And so that’s the place where I think proprietary data is really useful.
AI coding passes a threshold
Sonya Huang: I want to switch gears a bit to coding. It feels like, you know, software engineering has just gone through this fast takeoff moment. And, you know, just judging from the pace of how quickly things are changing, you know, I think there’s at least a certain subset of the market that thinks, you know, the super intelligence takeoff probability is a lot higher than folks thought it was, just given how quickly coding has taken off. What’s your view of what’s happened in the coding space?
Bob McGrew: So I think, you know, on the one hand, coding has taken off very quickly. On the other hand, way back in January 2020, as soon as we saw GPT-3, we launched a project to train GPT-3 how to code.
Sonya Huang: Yep.
Bob McGrew: And so, you know, when you look at an exponential curve, the progress is actually the same the whole time, but the impact of that progress can become very nonlinear when it passes a threshold. And that’s what’s happened with coding in the last couple years. And so my take on where coding will go is that you’re going to continue to see a mix of coding with the user in an IDE, you know, traditional Cursor-style work, and coding in the background as an agent, something like Devin-style work. And this is going to continue for a long time, where a year or two maybe is a long time in AI adoption.
Sonya Huang: That’s forever in AI years.
Bob McGrew: [laughs] But, you know, think about something like vibe coding, right? The story you hear with vibe coding is that if you have a PM and you want to create a demonstration project, you’re going to see PMs vibe coding really cool prototypes, really cool demos that they can use to get user feedback. But then those things are going to get thrown away, and they’re going to get rebuilt by professional software engineers. Because, you know, if you are given a code base that you don’t understand—this is a classic software engineering question—is that a liability or is it an asset? Right? And the classic answer is that it’s a liability. Like, you have to maintain this thing. You don’t know how it works, no one knows how it works. That’s terrible. Usually the answer is it’s actually cheaper to rewrite it from scratch.
And so we don’t yet have a way that we’re comfortable with agents being the ones that understand the code base right now. I think the liability has gone down, but it’s still net a liability. You need humans to do the design, to understand the code base at a high level, so that when something breaks, when the project itself becomes too complicated for the AI to understand, you can have a human do a problem decomposition and break it down into problems that are small enough for the AI.
Sonya Huang: What do you think happens after that one or two years, though?
Bob McGrew: Oh, I don’t know. We’re gonna have to find out.
Sonya Huang: [laughs]
Stephanie Zhan: I love your bifurcation, though, of on one side, agentic software engineers that handle tasks autonomously in the background, and on the other side, human programmers who code in an IDE with the help of AI. I don’t think that most of the mainstream actually believe that, realize that. Can you maybe unpack that for us a little bit? What would the agentic software engineers who handle these tasks autonomously handle? And then where do you see this other end of the spectrum go? Do they collide at some point? Do you think they remain separate things over time in the long term?
Bob McGrew: I think it is already a spectrum where, you know, the things that your agentic software engineers can do, you can say, “Well, fix a bug, do a refactor.” Something that, you know, requires relatively little taste.
Stephanie Zhan: Yeah.
Bob McGrew: And has a clear outcome.
Stephanie Zhan: Yeah.
Bob McGrew: Another great use case I’ve heard is, “Translate software from COBOL into Python.”
Stephanie Zhan: Yes.
Bob McGrew: Right? It’s very clear when you’ve done this correctly, but it’s a lot of work, it’s very boring, and you can’t get smart people who want to work on this and do a good job on it. On the flip side, if you’re doing something that requires a lot of taste, taste in how it’s implemented, where there will be non-obvious consequences to how the implementation works, maybe there’s non-obvious performance consequences, maybe there’s non-obvious consequences in how the user interface is going to evolve, and therefore how that needs to change the abstractions deeper in the system, those are places where right now we have no alternative but to have humans do that work.
Stephanie Zhan: Yes.
Bob McGrew: And I do think this is very interesting. Is there a way—you know, is there a sufficiently detailed spec or a sufficiently detailed architecture diagram that the agents can be writing for us that means that when you take work from one agent and you put it into another agent—which could just be the same agent the next day with a different context window—that it’s able to actually make progress on the code base? So these are the kinds of questions I want to see the answers to over the next couple of years.
Stephanie Zhan: Love it. It’s exactly what we’re working on at Reflection. [laughs]
Bob McGrew: Perfect.
Why “member of the technical staff” at OpenAI?
Sonya Huang: Why is it called “member of the technical staff?”
Bob McGrew: That’s a great—yeah, that’s a great question. So for a long time—this was true even before I joined OpenAI, by the way. I believe this was Greg Brockman’s idea. But we really wanted there not to be a distinction between engineers and researchers. If you look at a classic lab, a place like Google Brain, for example, where a lot of the people who started OpenAI came from, at the time, and maybe still today, there was a big differentiation between whether you had a PhD and you were a researcher or whether you were a software engineer and you did data, you did implementation.
And it was bad, because the researchers didn’t feel like they could get their hands dirty writing data code or writing implementation code. And you can’t understand the systems aspects of your research unless you’re writing the code. If you think about what makes Alec Radford the genius researcher that he is, it’s that each time he does something, he looks very closely at the data and thinks, “What are the possibilities of this data?” He wrote his own data scraping code from the very beginning.
And so if you want to have someone who really understands the full stack, I think Paul Graham has this great analogy to painting where the resistance of the medium dictates the kind of painting that you’re able to make. Research is very much like that. It’s very much an artistic endeavor, and researchers themselves are artists and should act like artists. And so by not having that distinction, just by calling everyone “member of the technical staff,” we were able to have a much more level playing field. And later that really came in handy when we had people who didn’t have PhDs. Many of the great researchers at OpenAI, you know, Aditya Ramesh, Alec Radford, you know, many of these people don’t have PhDs, and in fact, learned their trade by working at OpenAI.
Sonya Huang: That’s a great answer. It’s a random, throwaway question. I love that answer.
Bob McGrew: [laughs]
ChatGPT equals everything you don’t want to do yourself
Stephanie Zhan: So at AI Ascent recently—Sam Altman left us with some interesting fodder, which was how different generations use ChatGPT. He said if you’re old, you tend to use it as a Google replacement. If you’re in your 20s and 30s, you use ChatGPT as a life coach or a life advisor, and if you’re in high school or younger, then you’re using it as your operating system. How do you see people use ChatGPT around you? How do you have your kids use ChatGPT?
Bob McGrew: Yeah. So look, think about that operating system comment for a second. At the very highest level, the total addressable market for ChatGPT is every user intent that requires thought or action that you don’t want to do yourself. Anything that you wish got done but you didn’t have to do is something that you might want to use AI for. And so there’s—I mean, if you think about that, there’s a version of that that feels very scary, right? It’s like people don’t do anything for themselves anymore. There’s a de-skilling. No one learns how to do hard things, we’re all just zombies in our VR headsets, you know, watching movies. But I don’t think that’s actually what people want out of AI. And I don’t just mean that’s not the world we want to live in, though I think that’s true. It’s also not what I want out of my relationship with AI.
And that’s not what I see people doing now. And partly this is because the technology for ChatGPT as an operating system isn’t actually there yet. Pretty famously, you cannot use ChatGPT to control your iPhone. But it’s also not what people want. And so I see this with my son. He’s eight years old. He’s been using ChatGPT from a pretty young age. I used to ask him to test the models before they were publicly released, and he always gave pretty good feedback, actually. And he spends a lot of time with ChatGPT. He knows it is not his friend, it is not his companion. It is an expert, someone he can talk to who will explain things to him, and if you are eight years old, having someone who can explain things to you correctly, in great detail and with a lot of patience, is a very valuable thing.
Stephanie Zhan: Yeah.
The curiosity of an eight-year-old
Bob McGrew: And so he has, like, curiosity, he has enthusiasms. And one day he decided he wanted to be a coin collector. And so he collected all the coins in the house, sorted through all the ones that were from before 1970.
Stephanie Zhan: Wow!
Bob McGrew: Went to ChatGPT, started typing and just asked—took pictures and just asked questions about every single one of the coins from before 1970. And he’s, you know, “What’s this worth? Well, what would make this worth more? You know, how can I test—what is a mint mark?” You know, all these different questions. And if you think about this, this is something, you know, when I was a kid, I probably could have learned this. Like, maybe there were books, there were magazines, maybe I could have looked at an encyclopedia.
Stephanie Zhan: Yeah.
Bob McGrew: But all of this is just so accessible now.
Stephanie Zhan: Yeah.
Bob McGrew: And it’s accessible to an eight-year-old. And so when we went on vacation, we took him to a coin shop, and the staff at the coin shop were just shocked at how much this eight-year-old knew. He said, very specifically, “Show me all your coins. No, I don’t want that one. I want one that has a San Francisco mint mark. I want one from this year. This is the year that they were all made out of silver.” And the coin shop owner was just very surprised. He doesn’t deal with kids who have that level of detail—at least not until now. And so this is, I think, actually what we want out of AI: AI should make you an expert at the things you want to do, and it should remove the burden of doing the boring things that you don’t want to have to do.
How to prepare the next generation
Stephanie Zhan: Yeah. On the topic of the next generation, how else are you preparing that next generation for all the capabilities to come in AI?
Bob McGrew: I think this is a super, super tough question. If you think about any particular field, you know, should you teach your son how to code, right? Like, I think about my eight year old. My daughter is writing essays, my eldest son is really excited about math. All of those things are going to be automated. And so it’s clearly not some specific skill that you have to teach them. I think there’s really two things that I want my kids to understand. The first is the process of learning and figuring things out.
Stephanie Zhan: Yeah.
Bob McGrew: So that’s the value in the math and the essay writing and the coding. It’s sort of this process of learning how to learn.
Stephanie Zhan: Yes.
Bob McGrew: The second thing is, you know, having ideas and projects, and the belief that you can do it and the ability to use whatever tools are at your disposal to figure it out. So this is agency.
Stephanie Zhan: Yes.
Bob McGrew: Right? And so that’s where I think that’s the right way to have kids use AI right now.
Stephanie Zhan: Yeah.
Bob McGrew: And there’s always a trade-off. I’m often very torn, because my eight-year-old uses ChatGPT for a lot of things, but I don’t let him use it to code. Because he’s trying to learn to code, and if he sees that he doesn’t have to do it himself, then it’s going to be very hard for him to do the work to get all the way there. I don’t let my other kids use it to do their school assignments, of course. Why would you do that? But I want them to have those basics, and then once they have the basics, once they understand things one level down, to then be able to use it to extend their capabilities.
And, you know, here’s another fun story about my eight-year-old. Last week he decided he wanted to build a project where the grandparents who are coming to visit could press a button that would ring a buzzer in a different room, and he could bring them breakfast in bed.
Sonya Huang: [laughs]
Bob McGrew: And he asked ChatGPT for help. It said, “Okay, you need jumper wires, you need, you know, two Arduino boards.” And, you know, just a sort of list of things. And he asked a lot of questions, you know, “How is this going to work?” He asked it to give us a list of Amazon links for us to buy.
Sonya Huang: Wow!
Bob McGrew: I reviewed this, made sure he wouldn’t get electrocuted, bought the items on Amazon for him, and now we’re putting it together. And my approach on this is I’m going to let him put it together, everything he can. I’m going to install the software because, you know, his computer’s locked down, he can’t install software. And this is going to be his project.
Stephanie Zhan: That’s amazing!
Bob McGrew: Who could have—none of us could have done that at eight years old. And he has learned so much in doing this. It’s not just that he outsourced it all to ChatGPT. Now he understands what Arduino is, he understands what the circuit boards do—you know, what happens when I hit this pin? Why is this pin named, you know, GRP1? These are all—I mean, I don’t know the answers to these things either. So it’s really—you know, it’s just this huge help that ChatGPT is able to do all these things for him.
Stephanie Zhan: That’s amazing. Sparking curiosity and then agency. I love it. And it’s also just the time to impact, and that just feeds more and more curiosity and agency.
Bob McGrew: Yeah, that’s right. I mean, if you think back, you know, well, you want to do this project. Well, here’s a book on Arduino and, you know, you’re going to have to write the code yourself. And, you know, what circuit boards am I supposed to use? I don’t even know how to do that. You know, probably this project just dies on the vine.
Stephanie Zhan: Yeah.
Bob McGrew: And, you know, there’s a truism in education theory that when someone asks a question, that’s the time when they’re ready to learn the thing that they’re asking the question about. And so you want to—you know, it’s worth going off script to answer someone’s question because you’re doing a huge service to them and teaching them that thing right then. And that’s—you know, now you have that, you have the ability to get your questions answered on demand at the right time for you, when you are mentally ready to do it, not when maybe you’re tired and you’re in school and you’re thinking about all sorts of other things, just right then when you actually want to know the answer. And I think that’s hugely powerful.
Favorite AI apps
Stephanie Zhan: So how else are you using AI in your daily life? ChatGPT, Deep Research, I’m sure Howie.ai for scheduling. Maybe Autopilot. [laughs] What else?
Bob McGrew: Yeah, so I pretty much exclusively use o3 at this point. Once you use a good model, I think it’s very hard to go back. I think I could probably use Gemini 2.5. I hear it’s really good. But of course, as we’ve talked about, if what you have is good enough, why switch? And I use Deep Research about five times a week. And it’s hugely helpful. Even one time that it saves you a few hours of work sort of repays the cost. Absolutely makes sense.
Stephanie Zhan: What do you use Deep Research for?
Bob McGrew: It’s a mix. One answer is I’m batting around something with my kids, and it’s a question that no one has ever asked before, probably, and I want to know the answer. For example, what happens when you compress wood? You know, it starts off as elastic compression, and then it starts deforming, and then you go a little further and it becomes diamond. And then you go a little further than that and it becomes a black hole. But actually, there are, like, a dozen steps. And so that’s a really fun topic to dive into and just—you know, this is the kind of thing that would have been an XKCD comic 15 years ago and would have taken him weeks to figure out. And now you can get an answer in just a few seconds. Also, I use it when I’m thinking about a new domain or a new startup opportunity. You know, well, if I was interested in robotics, tell me everything there is to know about a particular company or about a particular market.
Stephanie Zhan: That’s our daily life. [laughs]
Bob McGrew: Yeah. Yeah, yeah, yeah.
Stephanie Zhan: Any other new products?
Bob McGrew: Well, like you mentioned, I use an AI assistant for scheduling, which is great. I mean, I’m solo right now. You know, I could hire an assistant, but it’s just actually more fun to do things myself. But calendaring, it’s really boring, and it’s just very nice and very pleasant to be able to CC, you know, an AI agent and have it do the calendaring for me.
Lessons for engineering managers
Sonya Huang: I’d love to hear a little bit about managing, you know, OpenAI, the research org. It’s such a, you know, collection of insanely smart individuals. Creative, I’m sure. And, you know, the feedback we have on you is exceptional in terms of, you know, what a fair and what a great manager and leader you’ve been for the organization. I guess what have been some of your lessons leading an organization like that?
Bob McGrew: So this sounds sort of boring, but, like, the core thing that you have to do as a manager is you have to really care about the people you’re managing. And this maybe isn’t relevant a lot of the time. A lot of the time as a manager, in your day-to-day job, you know, you’re coordinating, you’re helping people understand things, and, you know, loyalty doesn’t really matter that much. But there comes a time as a manager when you have to ask someone to do something hard. Early on in your career, this is when you have to ask someone to come in and work on Sunday when they’d rather be playing basketball. But later in your career, it’s, you know, working with someone and having to ask them to give up a project they really care about and give it to someone else, or share credit for a research breakthrough that they know they could get to by themselves, but that a team of people, not just this one talented person but two or three very talented people working together, could get done even faster.
And one thing I learned from working with Alex Karp at Palantir is that very talented people have superpowers, but they also have debilitating weaknesses. And for people who are at the very edge of these capabilities, they often don’t even understand what their weaknesses are, but it’s extremely apparent to everyone around them. And, you know, for me as a manager, it’s something that I could see very easily. And at this level of capability, when people fail, it’s almost always a form of self destruction.
Stephanie Zhan: Wow!
Bob McGrew: That there’s a choice that they could have made. And I don’t mean little failures. I don’t mean like, “Oh, I had a bad day.” I mean, you know, when someone makes a career-altering choice in a bad way, it’s almost always a matter of self destruction because they had to do something that was very difficult for them. They had to confront something that was extremely scary for them to do, that to everyone else is kind of obviously the right answer. It’s obviously the right thing for the company, but it’s emotionally extremely hard for them.
And going back to being a manager: if people know that you’re in it for yourself, when you tell them to do something, they won’t trust you. But if they know that you are doing what’s best for them, then when you tell them to do that thing that is super hard and extremely scary for them, sometimes you can help them across the chasm, and you can solve the problem and prevent them from doing something really stupid, and end up with something that works out really well.
And I hold this bar even for firing people. For me, whenever I am talking to someone, I have to be giving them advice, helping them do the thing that is best for them and for the company, even if I’m firing them. If they’re not going to succeed in this role, and I have invested enough time to be sure that they won’t succeed in this role, then it is in their own best interest for me to tell them that they’re not succeeding and give them the opportunity to find somewhere else. Loyalty, in the end, is the thing that I think unlocks all of the other things that you want in management.
Stephanie Zhan: I really, really love that. There was a nuance there that you said in the middle around working with a ton of high-performing individuals who are really excited about a particular research direction that they want to break through. They know they can get there potentially by themselves, potentially with one or two others. They all have a good dose of confidence, maybe sometimes ego. How do you actually get them or convince them to embrace that, you know, effort of working together to get there?
Bob McGrew: Yeah, it’s very hard. And I think this is actually one of the things that’s very different about a research lab from an engineering culture. Because in an engineering culture, it’s sort of an assumption that we’re all working together, we’re all building one product.
Stephanie Zhan: Yeah.
Bob McGrew: But research often comes out of academia, which has this very negative culture of: it’s a PI and it’s their team. Who’s going to be the first author? Who’s going to be the last author? None of the other people in the middle matter.
Stephanie Zhan: [laughs]
Bob McGrew: And we struggled with this a lot. And I don’t think there is any one answer. One thing we tried, which worked well for a time: we published some papers where we actually had OpenAI be the first author so that there wouldn’t be a fight over who’s the first author. That was, you know, one technique. We couldn’t always do that; it didn’t always make sense. But, you know, in the end, the key is really that when you work with people, you understand there’s something they want, and you have to find a way to give them the thing they want and let them do the thing they want to do, the art that they’re trying to create, while also letting all the other people do that, and having it all add up to one big whole, and just spend time over and over again making sure you’re solving that problem.
Stephanie Zhan: Yeah. Security, I know, is an interesting topic to you. In an increasingly agentic world, what kinds of security issues do you think we should be aware of, and where do you see potential opportunities?
Bob McGrew: When I think about how AI impacts security, for me, the first-order effect is that it’s much easier to do offensive work than it was previously. And so the number of threats has gone up, and the time to execute on a threat has gone down. And so that pushes the defense to be much more agentic. So there’s a company I’m an investor in called Outtake. I met the team; they’re a group of ex-Palantir folks. And we also ended up using them very successfully at OpenAI. And what they’ve done is that they have made an agentic stack for doing cybersecurity that uses very little human input.
And I think this is—right now we’re at a place where the models can actually do all of these things. If there’s something that a human could do that’s sort of one of these bulk operations, and you can’t make the model do it, that’s your fault. It’s not the model’s fault. But the barrier then is that businesses and organizations aren’t set up to do this. They have to go change their business processes in order to make this happen. And so I think that’s an opportunity for startups, similar to the shift from web to mobile, to disrupt the existing businesses, because it may be faster for you to replicate their technology and their distribution than for them to change the way they operate to get rid of or reduce the number of humans they need.
Stephanie Zhan: Awesome. Bob, thank you so much for joining us.
Bob McGrew: This has been really fun.
Stephanie Zhan: This has been a pleasure to have you here.
Mentioned in this episode:
- Solving Rubik’s Cube with a robot hand: OpenAI’s original robotics research
- Computer Use and Operator: Anthropic and OpenAI computer-use agents that originated with OpenAI researchers
- Skild and Physical Intelligence: Robotics-oriented companies Bob sees as well-positioned now
- Distyl: AI company founded by ex-Palantir alums to create enterprise workflows driven by proprietary data
- Member of the technical staff: Title at OpenAI designed to break down barriers between AI researchers and engineers
- Howie.ai: Scheduling app that Bob uses