OpenAI Codex Team: From Coding Autocomplete to Asynchronous Autonomous Agents
Training Data: Ep49
Hanson Wang and Alexander Embiricos from OpenAI’s Codex team discuss their latest AI coding agent that works independently in its own environment for up to 30 minutes, generating full pull requests from simple task descriptions. They explain how they trained the model beyond competitive programming to match real-world software engineering needs, the shift from pairing with AI to delegating to autonomous agents, and their vision for a future where the majority of code is written by agents working on their own computers. The conversation covers the technical challenges of long-running inference, the importance of creating realistic training environments, and how developers are already using Codex to fix bugs and implement features at OpenAI.
Summary
OpenAI researcher Hanson Wang and product lead Alexander Embiricos discuss their work on Codex, an agentic coding assistant that operates in its own containerized environment to generate complete pull requests from task descriptions. Their insights emphasize the critical shift from AI models that assist developers to autonomous agents that can work independently, requiring new approaches to training, user interaction patterns and product design.
Specialized training beyond technical capability is essential for professional-grade AI products: While o3 excels at competitive programming, transforming it into Codex required extensive reinforcement learning to align with professional software engineering practices. The model needed to learn the “taste and preferences” of professional developers—writing proper PR descriptions, matching code style, creating comprehensive tests and producing mergeable code. This professional alignment training proved as crucial as the underlying technical capabilities, highlighting that raw model performance on benchmarks doesn’t automatically translate to real-world utility.
Delegation-based workflows require fundamentally different interaction patterns and product design: Codex succeeds best when users adopt an “abundance mindset”—running multiple tasks in parallel rather than trying to perfect single requests. The most successful users generate 10+ PRs daily, treating the agent as an independent teammate rather than a sophisticated autocomplete tool. This shift from pairing to delegation requires rethinking user onboarding, interface design and success metrics. Products must actively guide users toward this new interaction paradigm rather than assuming they’ll discover it naturally.
Environment design and infrastructure consistency are critical for agent reliability: Creating realistic training environments proved as challenging as model development itself. Real-world codebases lack consistent testing frameworks, documentation standards and development practices for agents to rely on. OpenAI solved this by using identical containerized environments for both training and production, eliminating the “works on my machine” problem. This infrastructure consistency—where agents train and deploy in the same environments—is essential for reliable agentic products.
Long-running agent tasks create new challenges: Thirty-minute agent sessions revealed that users often can’t precisely specify complex tasks upfront, requiring new interaction patterns where agents first generate plans for user approval before execution. The models also developed human-like limitations—occasionally giving up on overly complex tasks with messages like “sorry, I don’t have enough time to do this.” This suggests that effective agentic products need sophisticated planning phases and graceful failure modes rather than just execution capability.
The future belongs to generalized agents with specialized interfaces: Rather than building separate agents for different functions, OpenAI’s vision involves one general assistant (ChatGPT) that can access specialized tools and interfaces when needed. For developers, this means having the same underlying agent accessible through IDEs, terminals, Slack and other tools where they work. This approach leverages the broader context and memory capabilities of a unified system while providing function-specific interfaces for power users, suggesting that successful AI companies should focus on building versatile foundation capabilities rather than narrow point solutions.
Transcript
Chapters
- Meet the Codex team
- The evolution of Codex
- Real-world applications
- Internal use at OpenAI
- The abundance mindset
- Codex use beyond engineering teams
- Technical insights
- Challenges in long-running AI tasks
- Future of Codex and AI integrations
- Developer tools and market evolution
- Practical tips for AI-enhanced coding
- Future UI and agent interactions
- Lightning round
- Mentioned in this episode:
Alexander Embiricos: In my opinion, the easier it is to write software, then the more software we can have. Right now, if we think of—like, I bet you if we pull up our phones—well, you folks are investors. But if you’re not an investor, I bet you if you pull up your phone, most of the apps on it are apps that are built by large teams for millions of users. And there’s very few apps that are built, like, just for us and the specific thing that we need. And so I think as it becomes more and more practical to build bespoke software for people or teams, we’ll end up having higher and higher demand for software.
Sonya Huang: Welcome to Training Data. Today we’re joined by Hanson Wang and Alexander Embiricos from OpenAI’s Codex team for a fascinating look at the future of software development. Codex is OpenAI’s series of AI coding tools that helps developers delegate tasks to cloud and local coding agents. Unlike the original OpenAI Codex, which was developed in 2021 to auto-complete lines of code, the latest evolution of Codex can complete entire tasks for you autonomously in the background. The key difference between o3 and Codex is that while o3 is great at competitive programming, Codex has been RL-tuned to be great at day-to-day enterprise development tasks.
Alexander and Hanson share more about the backstory for Codex, and the broader paradigm shift from snappy autocomplete to longer-running background agents. Plus, they share their surprising vision for how developers will interact with AI in the future as sync and async experiences merge. Hint: it might look more like TikTok than your current IDE.
Lauren Reeder: Thank you guys for joining us. It’s wonderful to have you here.
Alexander Embiricos: Hey, thanks for having us.
Hanson Wang: Great to be here.
Meet the Codex team
Lauren Reeder: We’d love to hear a little bit more about what you guys work on. Tell us about the Codex team and your story.
Hanson Wang: Well, yeah. I’m Hanson. I’m one of the researchers that helped train the codex-1 model.
Alexander Embiricos: And I’m Alex, the product lead.
Hanson Wang: I think for me, the name “Codex” is such a great callback to the original Codex model. That was kind of like an a-ha moment for me when it first came out because I think GPT-3 was really cool, but then Codex was the first moment where I felt like, wow, this can really do something that is going to change the world. And that’s actually kind of like how I got into the whole startup space. Like, one of the first couple demos I did was using Codex to do data analysis. I think it’s actually, like, a funny story. Like, I was here as part of Sequoia’s Arc program. That’s how I met Lauren. And then one of the demos we did, we actually used OpenAI Codex to do data analysis. And that’s how I started in the startup space. And I think as time went on, as the later versions of GPT came out, it became super clear that using AI for agentic use cases was going to be the future. And so I joined the company to work on agentic coding efforts.
Alexander Embiricos: Yeah, and this was, like, per standard OpenAI style where we like the naming to be as easy to follow as possible. This is the Codex of, I think it was 2021?
Sonya Huang: Yeah, this is pre ChatGPT, right?
Alexander Embiricos: Exactly, yeah. So it was actually the model powering GitHub Copilot. And then recently, you know, as we were working on this product, which we’ll talk about, we thought, you know, this is like a super fun brand, also a very apt name, you know, Code, Codex, Code Execution. So, like, we decided to sort of like, resuscitate the brand and, like, keep using it.
Sonya Huang: You said resuscitate. So was Codex dormant for a while, and then you all resuscitated it for …
Alexander Embiricos: Yeah, we haven’t used the brand, like, recently.
The evolution of Codex
Sonya Huang: Okay. Okay. Really cool. Can you tell us a little bit about Codex the agent and what it does?
Hanson Wang: Yeah, I think so basically Codex is a coding agent that has its own container and its own terminal, kind of like fully in the cloud. You give it a task, and it comes back to you with a PR in this sort of like one-shot style. And we actually experimented with a lot of different form factors along the way, but in the end decided to settle on this one.
Alexander Embiricos: Yeah, so we’ve been working on a bunch of agents and we’ve been working on a bunch of coding products as well. And basically, in our mind, Codex is like this thought experiment for how would it work to code with AI, but where we sort of put all our effort into thinking about what that would feel like if the AI is working on its own computer, you know, independently from you? And so you’re delegating to it rather than pairing with it. And so some of the things that we’re really proud of with this Codex launch are thinking about the compute environment and how do we set it up so that the agent can actually work on its own but be productive, and creating the model, which Han can talk more about, like, that basically that isn’t just good at writing code that looks good or is functional, but also is really good at writing code that, like, is useful for professional software engineers and mergeable ideally without even touching their own computer.
Lauren Reeder: So what is the difference between Codex and Codex CLI?
Alexander Embiricos: Yeah, we’ve definitely gotten some questions about that. I promise this is all going to make even more sense over time. So basically, Codex for us is our brand for agentic coding. And we have this vision of like, you know, like, we’re going to have this agent and mostly the agent will work on its own computer, but it should be able to meet you in, like, any of the tools that you use wherever you work, be that your terminal or, you know, your IDE or your issue management tool. So Codex CLI is basically like Codex in your terminal. So, like, CLI stands for “command line interface,” right? So it’s like in your terminal you can work with Codex. That’s like your environment. And then Codex or Codex in ChatGPT is basically like Codex working on its own computer. Today, those are just distinct things. As a brief aside, one of my favorite things about working at OpenAI is how willing we are to cut scope and just launch things quickly. But over time, we’ll actually bring those things closer together. So you can really think of it as it’s just like Codex and it can be in ChatGPT or it can be in your CLI.
Lauren Reeder: And so what did you have to do differently for the model to make it useful beyond just writing the next line of code?
Hanson Wang: Yeah. So I think one of the most interesting progressions, so if you go back to o1, the first reasoning model that we launched, we highlighted how good it is at math and even coding competitions. Like, I used to be a competitive coder and, like, it’s better than me at competitive coding now. It’s better than almost all people at OpenAI at that. But I think one of the things that we saw was that, you know, despite being good at these programming competitions, it wasn’t actually that good at producing mergeable code. And so we even highlighted this in the blog post with models like o3: the code that it generates often isn’t quite to the taste or style that a professional software engineer would expect. So a lot of the effort that we spent on training this model was aligning the model to basically, like, the taste or the preferences of professional software engineers. And that’s something that took a lot of, I guess, specialized training.
Alexander Embiricos: Yeah, I have this, like, very product-y analogy that I like, which is like, if you take our, like, recent models, which are great at coding, they’re great at coding, but it’s kind of like this, like, really precocious, like, competitive programmer, like, college grad who doesn’t have many years of job experience being a professional software engineer at, like, on a team, right? And so a lot of the work we did to go from, like o3 to, like, codex-1 was actually, like, the equivalent of, like, those first three years of job experience where it’s like, hey, like, what does a good PR description look like? PR titles, like, how do you read the style of the code base and then make sure your code is in the same style? How do you, like, test well? How do you show that you tested well? Stuff like that.
Real-world applications
Sonya Huang: What’s typically the “a-ha” moment for when somebody uses Codex?
Hanson Wang: Yeah, I think one of the things we have in the onboarding is, like, find and fix a bug in the code base. I think that’s one of the areas where Codex really shines is, like, specifically, like, bug fixing. Just because it can actually, like, independently try not just to see if something looks a bit off, but it can actually go and then verify that okay, like, I can try and reproduce a particular issue. And so I think even leading up to the Codex launch, there were a couple of bugs where we were sitting there kind of wondering what’s going on. And honestly, like, sometimes the easiest thing to do is just, like, paste in a description of the issue into Codex, and we were surprised how frequently it actually ended up with a usable fix.
Alexander Embiricos: Yeah, like, fun story here. Hopefully this doesn’t give away too much, but at 1:00 am the night before launch or the morning of launch, at 1:00 am we were looking at a bug with an animation, a Lottie animation. And this is the kind of thing, like, okay, I guess we could cut it from launch scope. It’d be okay to launch without it. But we really wanted to get it in, and we just couldn’t figure this out. And so an engineer ended up, like, describing what the bug was and putting it into Codex. And actually, like, a fun pro tip for anyone who’s using Codex is that if there’s a really hard task, it can be useful to ask Codex to take multiple cracks at it. So they pasted that description in and ran it four times. Like, “Hey, there’s this bug. We can’t figure out what’s going on.” And three of those rollouts did not work. And then one of the four was just, they fixed the bug that we were stuck on for hours at 1:00 am before launch. And so landed the fix, deployed the code, and the animation was in for launch.
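The “run it four times” pro tip above is a best-of-N pattern, and it can be sketched in a few lines. Everything here is illustrative: `best_of_n`, `fake_attempt`, and the result fields are made-up names standing in for real agent API calls, not Codex’s actual interface.

```python
import concurrent.futures

def best_of_n(task, attempt_fn, n=4):
    # Launch n independent rollouts of the same task and keep only
    # the attempts whose own verification (e.g. tests) passed.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(lambda i: attempt_fn(task, i), range(n)))
    return [r for r in results if r["tests_passed"]]

def fake_attempt(task, rollout):
    # Stand-in for a real agent call; here only rollout 3 finds the
    # fix, mirroring the one-in-four story above.
    return {"rollout": rollout, "tests_passed": rollout == 3}

winners = best_of_n("fix the animation bug", fake_attempt, n=4)
# winners contains just the single successful rollout
```

The point is that failed rollouts cost nothing but compute, so with an abundance mindset the expected time-to-fix drops even when any single attempt is unreliable.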
Internal use at OpenAI
Sonya Huang: That’s awesome. Maybe tell us more about how you all are using it internally at OpenAI. Is every engineer, is every researcher using Codex now in their workflows?
Alexander Embiricos: Yeah, and actually, can I give you the other kind of magic moment?
Sonya Huang: Oh, yeah. Please do.
Lauren Reeder: Definitely.
Alexander Embiricos: One of the interesting things about Codex is that it’s a very different form factor from maybe what people are used to, right? Like, a lot of the AI products that people are used to, especially in software, maybe GitHub Copilot was the first really good one, they’re really things that work with you in flow, and you’re just seamlessly going back and forth. You’re kind of pairing, right? And flavors on pairing. And we think that’s awesome, and the Codex CLI is a tool that you can use in that way. But for Codex, we really wanted to push this idea of, like, you’re delegating. Because, like, in the future, we imagine that actually the vast majority of coding is actually going to be done independently from the human working on their computer, who can only do one thing at a time. And so it’ll be done by agents working on their own computers.
And so that is a very different thing to delegate to an agent than it is to pair with sort of an AI model that’s in your tooling. And so you have to kind of use it differently. And so when we actually were working on an alpha before launch, we would just give this agent to people and be like, “Hey, like, just use this however you want.” And we noticed that many, many of the people trying to use our alpha of Codex were just, like, not really finding it super useful. And then we’re like, “Huh, that’s interesting. Let’s look at how people, like, at OpenAI are using internal tooling like Codex.” And we realized there was a big difference, which is the mindset of using it.
Sonya Huang: Hmm.
The abundance mindset
Alexander Embiricos: The mindset that works really well for Codex is, like, kind of like this abundance mindset and, like, hey, let’s try anything. Let’s try anything even multiple times and see what works. It saves me time. And so we’ve kind of shifted the way that we even onboard people into the product to try to create this “a-ha” moment, which is running many tasks in parallel. So, like, for us, if we see someone like trying it out and, like, they’ve run, like, 20 tasks in, like, a day or an hour, that’s amazing. And, like, they’re probably going to—like, they’ve understood basically how to use the tool.
Lauren Reeder: Fascinating. How does that change the role of the human when you have to review all of this code? Like, if two of the three work, then what do you do?
Hanson Wang: Yeah, I think we’ve put a lot of focus on also making the outputs easy for people to review. So, like, one of the things that we’re proud of is, like—we haven’t seen this in too many other tools, is the ability for the model to cite its own work. So not just the files that have changed, but also even the terminal outputs. So, like, if it ran a test and for some reason the test wouldn’t work, it actually tells you that and it tells you, like, here’s the exact kind of terminal command I ran, here’s the output. It makes it much easier to verify the outputs. But it is a great point. I think we’re shifting to a world where, like, a lot of the time that we spend, you know, normally coding, a lot of that is going to shift to actually reviewing the code.
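The citation behavior Hanson describes can be modeled as a simple result schema: the final answer travels with the exact commands the agent ran and their output. The function and field names below are hypothetical, a sketch of the idea rather than the product’s real format.

```python
def cited_result(summary, commands):
    # Bundle the agent's summary with the exact commands it ran and
    # what they printed, so a reviewer can check claims without
    # rerunning anything themselves.
    return {
        "summary": summary,
        "citations": [
            {"cmd": cmd, "exit_code": code, "output": out}
            for (cmd, code, out) in commands
        ],
    }

result = cited_result(
    "Handle empty input in mean(); added a regression test.",
    [("pytest tests/test_stats.py -q", 0, "2 passed")],
)
```

Crucially, a nonzero `exit_code` would still be reported, which is what lets the agent say “I ran the test and it didn’t work” instead of silently claiming success.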
Sonya Huang: Do you need humans to review the code? Because I think of code as one of those things where, you know, it compiles or it doesn’t. And once it compiles, you can go and check if it does the thing it was supposed to do. Like, do you even need humans to do the code review?
Hanson Wang: I think, yeah. I mean, for the foreseeable future, at least, I do see that to be the case. I mean, I think a lot of it’s also just, like, building trust with the early users. I think people really need to have a feeling for, like, you know, what things are working well, what things are not. And I think there’s always just some external context about, like, you know, what makes this code correct that might be beyond what you initially provided as context.
Alexander Embiricos: Yeah. If you think of what a developer does, and this is obviously oversimplifying, but there’s coming up with what things maybe should be done, discussing them with the team maybe, deciding what to do. You call that ideation. You know, maybe then there’s design, like, okay, what are we actually doing? And then, like, planning how are we going to do it? Then there’s implementing and then validating, you know, testing those changes. And that’s basically a loop. And that small loop of, like, implementing and then testing is what Codex is great at right now, although we can talk about how you can use it for planning, too.
And then there’s actually deploying the code, and then maybe maintaining the code, writing documentation, et cetera. And so, like, I forget the exact stat, but I feel like a stat I remember recently is, like, engineers spend, like, maybe, like, 35 percent of their time coding. It’s not actually the majority of even what engineers do. And so the future that we’re trying to build towards is one where, you know, if you’re a software developer or even, like, in any profession, all the work that is easily automatable, that’s usually the grungier type of work, you’re not doing. You’re delegating that. And then the work that is more interesting because maybe it’s ambiguous or maybe because it’s really hard, that’s the work that you’re driving.
So we’re trying to build towards that work, that world. And I think we have to get there iteratively. So for example, right now, if you’re a human and you write code, another human is going to review that code, right? And so we’re not going to come in and just, like, try to change that. We’re like, okay, let’s plug into that. So the way the product works right now is, like, you, the developer, are being accelerated by the tool. You ask for some code to be written, you decide if it’s good and you want to push it out to your team, and then your team can review it. And then over time, we’ll basically kind of expand what we can do, so we’ll help more and more with, like, planning, maybe even designing, maybe even thinking about what to do in response to things that are happening in your app or at work. And then we’ll push to, like, make review easier and easier, as Hanson was describing.
Hanson Wang: Yeah. And I do think I see a future where you have, you know, like, multiple agents collaborating together. So you have Codex. The Codex agent writes the code, and then maybe the Operator agent’s the one that’s testing it, and all of the different agents that we’ve been working on at the company can kind of come together.
Codex use beyond engineering teams
Lauren Reeder: That’s awesome. Have you seen people beyond engineering teams start to use Codex, now that you can delegate writing code? And as we get into the world of vibe coding, you guys are helping bring us further down that rabbit hole.
Alexander Embiricos: Yeah, this is actually super funny. So the answer is yes, but I’ll tell you a story. We were working on our launch blog post with Lindsey here, and we were talking about, like, what quotes to quote from customers. And we had a customer that wanted to say, “Yeah, like, we on the engineering team love this, and also, like, it’s a power tool for PMs.” And I remember looking at that quote and being like, “This is a really cool quote.” Because I’m on the product team, and I use it to just, like, avoid having to bug an engineer about things or to answer questions. But I remember looking at that quote and being like, “Do we want that in the launch blog post?” Because the target audience for what we’re building is, like, specifically professional software engineers, not vibe coders. So I think we ended up not including that exact line, but I think over time, like, as we have agents that can help us code, I would expect more and more people to be able to contribute to codebases, yeah.
Sonya Huang: Do you think the number of professional software developers goes up or down over time?
Alexander Embiricos: This is just my opinion, but I think it goes way up.
Sonya Huang: Huh!
Alexander Embiricos: I think …
Sonya Huang: And not vibes coders, professional software developers?
Alexander Embiricos: Yeah, I think so. But in my opinion, the easier it is to write software, then the more software we can have. Right now, I bet you if we pull up our phones—well, you folks are investors, but if you’re not an investor, I bet you if you pull up your phone, most of the apps on it are apps that are built by large teams for millions of users. And there’s very few apps that are built, like, just for us and the specific things that we need. And so I think as it becomes more and more practical to build bespoke software for people or teams, we’ll end up having higher and higher demand for software.
Hanson Wang: Yeah, as I think about how I use it, I think it just really is a multiplicative factor right now rather than any sort of replacement. Especially looking at the patterns of our internal power users, there’s a really dramatic difference: the top users of Codex are, like, doing, you know, 10-plus PRs every day. It’s just really such a multiplicative factor that I can’t see, like, a world where it replaces engineers; it’s lowering the bar to creating software so much.
Alexander Embiricos: That said, I think this is a really important question, and to be completely honest, like, we don’t know. And so this is something that we as a company pay a lot of attention to.
Technical insights
Sonya Huang: I want to talk a little bit about what’s happening under the hood on the technology side. So you mentioned that with the model itself, one of the things that makes it different from competitive programming is that you’ve made it good at the things a professional software developer would do. Is that the biggest difference on the model side, or should we think of it as a close cousin of o3?
Hanson Wang: Yeah. So it’s definitely the same model as o3 with additional reinforcement fine-tuning. But that said, yeah, I think so part of it is kind of like these more, like, qualitative aspects of what makes a good software engineer versus simply like a good, let’s say, like, coder, you know, like style, even like how it writes comments. I think that’s like one of the things that people have noticed with other models. And then on top of that, I also want to highlight one of the big challenges was, like, making good environments for the agent to kind of learn in. And so if you think about, like, real world software repositories, it’s, like, so varied and complicated. Like, think about, like, how much DevOps has to go into, like, setting up a repository. And that’s something we’re kind of learning the hard way with our environment setups.
Alexander Embiricos: Should we talk about the multi-repo I was showing you yesterday?
Hanson Wang: Oh, yeah.
Alexander Embiricos: I was showing Hanson the repo for the startup that OpenAI acquired, and so we joined. And so we were looking at that repo together, thinking about it for use as an environment. And Hanson’s like, “So, like, where are the unit tests?”
Hanson Wang: [laughs]
Alexander Embiricos: You know, because the agent uses unit tests to verify. And I was like, “This is a real startup that has no unit tests.”
Hanson Wang: I mean, I can’t complain. So yeah, like, you have all these, like, really messy environments. So we had to—over the course of training, like, we had to basically generate these really realistic environments for the agent to learn from. And I think, like, one of the reasons that we’re able to make such an end-to-end product work is that we have, like, the same environments that we use during training and the same, like, basically the containerization infrastructure that we’re using to serve in production. So our users are—you know, like, we’re running our own computer environments. When users use Codex, they’re running in the exact same environments that we’re using for training.
Sonya Huang: So you don’t have the agent saying, “But it works on my machine.”
Hanson Wang: Exactly. [laughs]
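The “same environments for training and serving” idea Hanson describes can be sketched as a single spec builder that both paths consume, so they can never drift apart. The image name, field names, and commands below are all invented for illustration; this is a sketch of the design principle, not OpenAI’s actual infrastructure.

```python
def environment_spec(repo, setup_cmds, verify_cmd):
    # One declarative container spec built by a single shared function.
    # The RL training harness and the production service both call
    # this, so a task always runs in an identical environment.
    return {
        "image": "agent-base:latest",   # hypothetical base image
        "repo": repo,
        "setup": list(setup_cmds),      # e.g. dependency install
        "verify": verify_cmd,           # how the agent checks its work
        "network": "off_after_setup",   # sandboxed once deps are in
    }

train_env = environment_spec("repo.git", ["pip install -e ."], "pytest -q")
serve_env = environment_spec("repo.git", ["pip install -e ."], "pytest -q")
assert train_env == serve_env  # no "works on my machine" gap, by construction
```

Because the spec is data produced by one function rather than two hand-maintained configs, equality between the training and serving environments is enforced structurally instead of by convention.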
Challenges in long-running AI tasks
Sonya Huang: Okay. I think these are also the longest-running agents I’ve seen out of OpenAI. Deep Research maybe was the previous one that was longest running. And my understanding is, you know, Codex can, you know, sometimes spend 30 minutes on different tasks. Are there any kind of surprising challenges and things you’ve encountered just getting inference time to scale up on, you know, a query for so long?
Alexander Embiricos: Maybe I’ll start with the product side, and then there’s many on the modeling side. But on the product side, actually the thing that I think the most about is, like, user intent. It’s actually, you know, if you imagine someone using autocomplete in their IDE, it’s not super hard necessarily. I mean, obviously it’s difficult, but it’s not super hard to predict, like, what are they trying to do right now for the next microsecond? But for doing a task that takes 30 minutes, it’s actually, like, fairly difficult to help a user describe the task. Like, they may not even know exactly what they want for 30 minutes worth of work.
And so something that we spent a while debating, and it’s still a thing we debate, is what is the right granularity of a task for someone to give to Codex? And, like, how can we make it easy so that Codex can, like, be really flexible, where you can use it for one-line changes, you can use it for, like, big refactors that you know exactly what you want, or, like, larger features where you know what you want. Or maybe can you use Codex when you don’t know exactly what you want? And so maybe you should ask Codex for a plan, and then you can, like, have Codex suggest tasks and then, like, do those tasks afterwards. So that’s still a topic of debate and iteration for us.
Sonya Huang: Hmm.
Hanson Wang: I think that’s actually, like, a good pro tip for using it. It’s actually, like, really good at coming up with its own plans. And sometimes it’s really tedious to specify everything you want upfront. That’s kind of, like, one of the unique challenges of longer-running work: if you want it to work for an hour at a time, then you kind of do have to specify a lot upfront, which means that you have to spend, like, I don’t know, 10-20 minutes coming up with that. But you can actually use, like, the ask mode to first generate a high-level plan of what you want to do, and then iterate on that with the model before you send it off for an hour.
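The plan-first workflow Hanson describes reduces to a two-phase pattern: a cheap planning call, a human approval step, then the long autonomous run over only the approved steps. All names below are stand-ins; the lambdas simulate what real agent API calls would return.

```python
def plan_then_execute(goal, ask, approve, execute):
    # Phase 1: cheap call to get a step-by-step plan for the goal.
    plan = ask(f"Propose a step-by-step plan for: {goal}")
    # Human-in-the-loop: the user trims, reorders, or edits the plan.
    approved = approve(plan)
    # Phase 2: the long run executes only the approved steps.
    return [execute(step) for step in approved]

# Simulated agent calls (stand-ins for real API calls).
fake_ask = lambda prompt: ["reproduce bug", "add failing test", "fix", "open PR"]
fake_approve = lambda plan: plan[:3]         # user drops the last step
fake_execute = lambda step: f"done: {step}"

results = plan_then_execute("fix flaky login test",
                            fake_ask, fake_approve, fake_execute)
```

The design choice is that intent gets pinned down at the cheap planning stage, so a misunderstanding costs a minute of iteration on the plan instead of an hour of agent time executing the wrong thing.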
Sonya Huang: It really is like working with an intern.
Hanson Wang: [laughs] Yeah.
Sonya Huang: What about on the model side? Anything that’s surprising in terms of model behavior as it starts to run for so long?
Hanson Wang: Yeah, I think our models have gotten a lot better at, kind of, sticking on task, especially, like, with these longer rollouts. I will say, like, there are cases where, you know, there is a limit to the model’s patience, even though it’s quite high. So it can be frustrating sometimes, you know? It’s like it goes off for, like, 30 minutes and then, you know, this is a case that we’re working to get better at, where it’s kind of like just like a human: it comes back to you and it’s like, “Sorry, I don’t—this is too much. I don’t have enough time to do this, actually.” Like, that’s one of the things it says.
Lauren Reeder: Just like an intern.
Hanson Wang: Very human-like in many ways.
Future of Codex and AI integrations
Lauren Reeder: Yeah. I’m curious how you think about the right interaction patterns and how they evolve, and how the suite of products around this evolve over time. We have Codex, we have Codex CLI. What else do you think is out there in the design space for engineering and building products?
Alexander Embiricos: Yeah. So the Codex as we launched it is really just like, you know, it’s a research preview. It’s a thought experiment—a useful one, but it’s still very early. And what we’re most proud of with Codex is the model and the beginning of this foundation for computer environments. And the UI we shipped is one that we iterated towards, and there’s some fun stories there, but it’s definitely not the final form factor.
And for those listening, basically the UI we shipped is an interface in ChatGPT where you can, like, submit a task and ask Codex to either answer your question or write code. And then you kind of have something that looks a little bit like a to-do list of things that you can go look at merging. Really, I think—so we built that to really lean hard into this idea of an asynchronous agent that you delegate to.
But what we want to build towards is a setup where you don’t have to think about whether you’re delegating or whether you’re pairing with an agent. And it should just feel like working with a teammate where that teammate is, like, ubiquitously present in all the tools you work with. So you should be able to pull up any tool that you’re working in, be it your terminal, your IDE, your issue management tool, maybe your alerting tool or the tool that shows you your errors, and just ask for help. Maybe Codex has even already taken a look before you got there and it has, like, an opinion. And you should be able to ask something, be it a short question or a long one, and it’ll just appropriately decide how much time to spend before answering you and help you land those changes. So basically we want to blend this idea of, like, pairing and delegation, but the first thing we shipped was just the purest thought experiment.
The other thing I’ll add to this is, like, one of the unique things about working at OpenAI is that we are the makers of ChatGPT, which is the AI system that most people use. And so we don’t actually see a future where as you go about your day, you’re deciding whether to use, like, the Codex agent or, I don’t know, your shopping agent or taxi ordering agent. By the way, I’m just, like, naming random things here. Or your, like, marketing agent. Actually, the way we think this should work is you should just have, like, one assistant that you talk to, and you can ask it anything about anything and it can just, like, do the things you need. And so that’s ChatGPT. That will become our assistant. And then if you’re a power user of a certain type of tool, so let’s say you’re a software developer, you spend a lot of time in certain functional tools, then you can go into that tool and have a bespoke interface with buttons, with lists that you can use to, like, efficiently go about your day.
Developer tools and market evolution
Sonya Huang: Do you think we’ll still use IDEs?
Alexander Embiricos: Yeah, for sure. But they’ll evolve. Right now they’re very focused on writing code. And as Hanson was saying, like, probably agents will be writing more and more code. And so it’s going to become—like, there’ll be a shift in emphasis towards, like, landing code or reviewing code or, like, validating them or maybe even like, a shift in emphasis towards, like, planning bigger arcs.
Hanson Wang: Yeah, I think we’re already seeing a lot of people on the team, first thing in the morning, they come in, they make coffee, and then they kick off a few tasks just to get a starting point. And then they come back after their breakfast, they look at the PRs that got generated, and then they’ll take those. And the IDE is kind of like the place where you finish the job. Maybe Codex gets you 80 percent of the way there, hopefully, or even more, but then there’s always this, like, last mile where you go in and really fine-tune based on kind of like your own vibe.
Lauren Reeder: How do you see the broader market evolving? Within OpenAI, you have so many different strategies here. And as you think about async tasks, as you think about some of the things that you mentioned moving into ChatGPT, we’re seeing an explosion of other tools and specialized models. You obviously are biased, but I’m curious what your read is of the broader market.
Alexander Embiricos: Yeah, it’s a crazy time to be a developer right now. Like, there are just so many new tools that are just so helpful. Like, a fun story recently is I was on an airplane with no Wi-Fi, and I had thought that I was gonna maybe write some code and, like, build a thing. And I was like, “You know what? Screw it. Like, it’s just not worth my time to, like, even try to write code anymore.” Whereas, you know, part of the genesis of the startup that I was working on, like, many years ago was me writing some code without Wi-Fi on an airplane. And I just wouldn’t even do that anymore because the market has just changed so much.
And I think we’re going to see, like, an equivalent shift in an equivalent amount of time. So, like, in the next two years, coding will look completely different. I think right now, most of the tools that people find the most value from are tools that work really closely with you, like in your development environment, basically pairing. And the shift that we’re going to see—we have to figure out how this will happen, but the shift that we’re going to see is that actually the majority of code will be written by agents. And those agents won’t be working in your environment where you can do one thing at a time, but they’ll be working in their own environments. And they won’t just be triggered by you thinking of specific tasks, but they’ll be connected into the tools you use, doing work there.
And so I think we’ll see basically that shift towards agents. I think we’re going to have to figure out a lot about code review, as you were asking about. Like, personally, I don’t exactly know how that’s going to work, but I do know that even already at OpenAI, we’re seeing much more code is merged by agents, but actually also even more code is generated by agents as folks are, like, say, kicking off tasks four times to, like, choose their favorite implementation. And so it’s not a hundred percent clear how we should even manage all this code that is being written.
Practical tips for AI-enhanced coding
Some things that I will say though, in case it’s useful to the audience, is that there are definitely things you can do to your code base to make it more addressable for agents. This isn’t necessarily particularly novel, but obviously using typed languages is really helpful. Another thing that’s very helpful is, like, having, like, smaller modules that are, like, better tested. Like, we joke about …
Hanson Wang: Having good tests at all, yeah. [laughs]
Alexander Embiricos: Yeah, like, we joke about my startup’s repo, but I bet you we would have written it differently if we were writing it today. And there are even small things like the code name for this project, WHAM. This is the code name for Codex. It’s like W-H-A-M. And when we named it, we were very intentional in doing so because we knew we would have code, like, in the server, like, for the website, in various other places. And we wanted it to be really easy for the agent to, like, search for WHAM-related code and find it. And so we named the project WHAM, and we grepped the code base first to figure out how often it was there. Like, if we had called it something like “code” or “Codex” or “agent,” you can imagine it would have been really hard for the agent to …
Lauren Reeder: Would have gotten very confused.
Sonya Huang: And now you called it Codex, so now the agent’s going to be confused. [laughs]
Alexander Embiricos: Well, so in the code, this is kind of my point, right? Like, intentional design. In the code, we use the term WHAM, like, a lot, because that’s actually much easier for the agent to find. Obviously, if we didn’t use a word like that, the agent could still find its way, but it would have to spend much more time to find the right files.
Hanson Wang: Yeah, it is cool that, you know, a lot of the things that actually make the code base easier for humans too, also tends to make it easier for the agents. Like good tests, for example, writing good docs is another great example, where now, I think there’s even more of an incentive to do that because not only does it make your life easier, it makes the agent’s life easier.
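The naming trick Alexander describes can be made concrete with a quick sketch. This is purely illustrative (the /tmp/demo paths, file names, and contents are invented for the example, not taken from the actual Codex repo): a distinctive codename lets a single search surface exactly the relevant files, where a generic name like “agent” would also match unrelated code.

```shell
# Illustrative only: build a tiny repo where the codename "WHAM"
# tags every file that belongs to the project.
mkdir -p /tmp/demo/src /tmp/demo/web

cat > /tmp/demo/src/wham_server.py <<'EOF'
# WHAM: server-side task execution
def handle_wham_task(task):
    return f"running {task}"
EOF

cat > /tmp/demo/web/app.js <<'EOF'
// WHAM front end: submit a task to the server
fetch("/wham/tasks", { method: "POST" });
EOF

# An unrelated file that would pollute a search for a generic
# name like "agent", but never matches the codename.
echo "# user agent parsing helpers" > /tmp/demo/src/useragent.py

# One recursive search finds exactly the project files.
grep -rl "WHAM" /tmp/demo
```

The same property helps a human doing a repo-wide search; an agent just runs that search far more often, so the payoff compounds.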
Sonya Huang: Okay, sorry to be the annoying VC, but Claude Code and Jules are also, I think, agentic coding experiences from others. I’m curious how you think your experiences compare today. And then do you think the market is probably going to converge towards the same vision of what sync and async coding look like? And in that version of the future, what do you think OpenAI wins on?
Alexander Embiricos: I think we’re going to see a little bit of everything, right? Like, even in what you mentioned, there are tools that work on your computer and tools that work on their own computer. As I mentioned, I think we’re going to see the majority of code being written where the agent has its own computer, but it will still be really important for us to invest in accelerating developers who are doing work on their own computer, too. So ideally we get the best of both worlds there, but most work is done in agent compute.
Hanson Wang: I think the way I see it as well is, like, I think one of the hardest parts of software engineering really is, like, taking all the context from the world and, like, encoding it in these requirements, these design docs. And then the implementation, I think as we alluded to earlier, is not actually that much of the lifecycle is spent on physical coding. And so I think where ChatGPT shines is it is this assistant that has memories now, it has access to a lot of different connectors, to all the different tools you use. We have Operator, Deep Research that have all these, like, different capabilities. And so I think the vision where that all comes together is where a tool like Codex can really shine once it has access to all that knowledge, it’s able to make use of that. And I think with that, it should be able to do a much more effective job at just the coding part.
Alexander Embiricos: Yeah, imagine, like, hiring a software engineer, and the only thing that that software engineer can do is take a task from you and produce a PR, right? Or it has, like, these very well-defined features and it can exactly do those things. And then you ask for a random thing, like, “Oh, hey, like, the team is getting together. Do you mind also, I don’t know, getting a meeting room and leading a brainstorming?” It would just be so frustrating if you hired a teammate and they refused to do that kind of work, right? And so similarly, we’re building towards a future where, like, agents that you’re working with are a little bit more generalized. Like, you know, to reference what Hanson was talking about with Operator and Deep Research: if you think about it, Operator has a web browser, Deep Research has a different flavor of a web browser, Codex has a terminal. Really, your teammate has pretty similar tools to a human teammate, right? And so the goal for us eventually is to pick places where we want to really invest in a specific audience to make rapid progress. So obviously we’re doing that with coding, with Codex or GPT-4.1, where we generated specific evals for that audience and then, like, made a better model for them, for developers. But then over time, like, generalize these things into simple things that everyone can use. So I think again with us, with OpenAI and ChatGPT, that’s a place where the products we build will look very different from something that’s only for coding.
Future UI and agent interactions
Sonya Huang: What do you think will be the primary UI that developers use to interact with Codex? Do you think it’ll be ChatGPT, the CLI, the IDE, all the above?
Hanson Wang: Yeah, I think a mix of all the above. I think we just kind of want to meet developers where they are in that moment. So it might not even be in the editor or in the terminal. It might be on Slack. Someone messages you, like, “Hey, there’s a bug.” And you’re just like, “Hey, like, go fix it.”
Alexander Embiricos: I’ll give you my fun future UI that is not at all serious. But maybe the future of working with agents, if you’re a startup founder in the future and you have a team of just you or you and a couple co-founders and many agents, actually looks like TikTok. You know, maybe you have, like, vertical feed and it’s basically an agent has produced video that you can watch with, like, an idea, like, “Hey, a customer wrote in with this request. I think we should fix it.” And then you swipe right to say, like, “Yeah, let’s fix this. Let’s do this.” You swipe left to say, “No, we can’t do it.”
Sonya Huang: Tinder or TikTok?
Alexander Embiricos: Sorry, it’s a hybrid.
Sonya Huang: [laughs]
Alexander Embiricos: I didn’t say this was going to make a lot of sense.
Sonya Huang: I like it.
Alexander Embiricos: And then you press and hold to provide feedback. So you go, like, “Yes, do it, but, you know, make sure the font is in italic.” And so basically you have all these agents who are, like, subscribed to information at your company or on your team, and they’re proactively coming up with ideas and doing them, and then giving you updates, and you’re just curating the work that is being done.
Sonya Huang: I love that.
Lauren Reeder: And they show you little previews of what the world could look like.
Alexander Embiricos: Yeah. Obviously that’s a half joke, though. I think that’ll be kind of the arms-length working with agents, and then it’s definitely going to be really important for people to be able to, like, go do the work themselves and pair with agents in …
Sonya Huang: I get that it’s a half joke, but it’s a really cool visual because I think everyone agrees conceptually with this idea of, you know, collaborating and reviewing all the different changes an agent makes is going to look very different from how we code today. But nobody’s actually given me a visual of what that might look like. So that’s a really cool idea.
Lauren Reeder: I love it.
Lightning round
Sonya Huang: Awesome. Should we wrap with a lightning round?
Alexander Embiricos: Let’s do it.
Sonya Huang: Okay. Recommended piece of content or reading for AI fans?
Alexander Embiricos: For me, that’s immediate. It’s, like, The Culture by Iain Banks. Have you read it?
Lauren Reeder: Yes, it’s amazing.
Alexander Embiricos: Yeah. It is a science fiction series, started being written in the ‘80s, and it is unusually positive in its view of, like, how a future spacefaring human and non-human race could kind of look. And there’s a lot of questioning about, like, what is the purpose and meaning of life when we have AGI.
Hanson Wang: Yeah, I think for me it’s anything by Richard Sutton. I think that was my introduction to reinforcement learning. And I think it’s kind of a joke here that we read The Bitter Lesson every single day. That’s, like, kind of the philosophy of OpenAI. I think even with Codex, like, we give it a terminal and it literally uses POSIX tools. That’s the most Bitter Lesson way of working with a computer.
Lauren Reeder: And your favorite AI apps?
Hanson Wang: Gotta be ChatGPT.
Alexander Embiricos: Yeah.
Sonya Huang: Not ChatGPT. Come on.
Alexander Embiricos: We’re super boring. We’re so boring.
Lauren Reeder: Okay, either it could be a new feature that you guys have released other than Codex, or something outside of OpenAI.
Alexander Embiricos: Okay, so I guess it’s funny, I don’t really think of AI apps, right? But I do like it when my life gets easier. So some things that I like are when you’re using AI, but it’s kind of invisible. I work in product, so I often, like, file bugs, and Linear has a really elegant integration where, when you file a bug from a Slack conversation, it just generates the bug from that conversation. But they never say “AI” anywhere. You actually kind of don’t even notice that it’s using AI. Oh, wait, I came up with an answer for favorite AI app: Waymo.
Sonya Huang: Ah.
Alexander Embiricos: There we go.
Hanson Wang: Yeah, I think for me, like, Copilot has definitely been the thing that, you know, keeps delivering value every single day for me. [laughs]
Sonya Huang: Okay, robotics. Bullish? Bearish?
Alexander Embiricos: Bullish?
Hanson Wang: Yeah.
Lauren Reeder: Which new application or application category do you think will break out in 2025?
Sonya Huang: Other than coding.
Hanson Wang: Yeah, I mean, I think when you had Isa and Josh on, it’s kind of the same answer, but 2025 is definitely the year of agents. I think we’re going to see agents take off in a lot of different categories.
Alexander Embiricos: Yeah, I have to agree with that.
Sonya Huang: What type of agents are you most excited about?
Hanson Wang: Aside from coding agents?
Sonya Huang: Yeah.
Hanson Wang: That’s a good question.
Alexander Embiricos: Well, I mean, so my take would be, like, you know, if we—I know this is meant to be rapid fire, right? But, like, kind of the way we think of agents is you have reasoning models, right? And then you give those reasoning models like access to tools of the trade. And then you figure out how to train that agent to, like, do the sort of specific function, right? So it’s, like, not just about writing, it’s about journalism, or it’s not just about coding, it’s about software engineering, right? So that’s kind of what we’re doing. And in my mind, the reason I’m so excited about agents this year is because we now have a few agents shipped from OpenAI and other companies are shipping agents, too. And so we’re starting to see what kind of the shape of this is and starting to identify the primitives. And so specifically what I’ve been excited about is, like, as we bring this together and you come up with an agent, that you don’t have to provision separately for every single function, but it’s an agent with a computer that has a browser and it has a terminal and it can do multiple things without you having to, like, exactly specify, like, you are my coding agent or something.
Sonya Huang: Really cool. Thank you so much for joining us. Congratulations on what you’ve built at Codex. And thank you for giving us a preview of how you think the coding market will evolve, and also giving us a peek into how long-running async agentic experiences will play out. Really appreciate it.
Lauren Reeder: Thank you.
Alexander Embiricos: Thanks so much.
Hanson Wang: Thanks for having us.
Sonya Huang: Thank you.
Mentioned in this episode:
- The Culture: Sci-Fi series by Iain Banks portraying an optimistic view of AI
- The Bitter Lesson: Influential paper by Rich Sutton on the importance of scale as a strategic unlock for AI.
- POSIX: Family of standards defining a portable operating system interface, including the shell and command-line utilities Codex uses when it works in a terminal.
- Linear: Project management and issue tracking software for development teams that is one of Alexander’s favorite AI apps—because it doesn’t say it’s using AI.