Factory’s Matan Grinberg and Eno Reyes Unleash the Droids on Software Development
Training Data: Ep2
Matan Grinberg and Eno Reyes, co-founders of Factory, have bucked conventional wisdom to deploy a fleet of “Droids,” purpose-built dev agents that accomplish different tasks in the software development lifecycle. Their advice to founders: “The only way you can win is by executing faster and being more obsessed.”
Summary
Archimedes said that with a large enough lever, you can move the world. For decades, software engineering has been that lever. And now, AI is compounding it. How will we use AI to apply 100 or 1000x leverage to move the world? Matan and Eno of Factory decided to build something immediately useful for engineering orgs on top of today’s rapidly improving models, aligning with the developer and evolving with them. They are optimistic about the effects of autonomy in software development and on building a company in the application layer.
- Leverage existing foundation models to focus on product development. Factory’s approach is to build on top of rapidly improving foundation models rather than training their own. This allows them to concentrate on creating useful products for engineering organizations and earn the right to do more advanced work like fine-tuning and training later. As Matan puts it, “If you’re unable to build a product that people are actually using with these incredible models, then chances are fine tuning and training will not save you.”
- Prioritize system-wide optimization over individual productivity. While many AI coding tools focus on individual developer productivity, Factory targets organization-wide metrics such as code churn, end-to-end open-to-merge time and overall engineering velocity. This approach allows engineering leaders to demonstrate tangible business value to executives beyond metrics like autocomplete acceptance rates.
- Design AI tools to complement rather than replace human engineers. Factory’s droids are built to automate tasks that developers typically don’t enjoy, such as code review, documentation and testing. This positioning aligns the AI tools with developers rather than antagonizing them, allowing for a collaborative relationship as the role of software engineering evolves.
- Develop specialized cognitive architectures for different tasks. Each Factory droid has its own cognitive architecture designed to mirror the human cognitive process for its specific task. This approach balances flexibility with rigidity in the workflow, allowing for consistent error recovery while maintaining adaptability in problem-solving.
- Focus on real-world applications to drive innovation. Factory’s impressive performance on the SWE-bench coding benchmark (19% vs. the previous high of 14%) is attributed to their focus on solving real-world customer problems rather than optimizing for benchmarks alone. This customer-centric approach has led to innovations in planning, task decomposition, environmental grounding and codebase understanding.
Inference essay:
- The Compound Lever: AI for Software Engineering, by Sonya Huang and Pat Grady
Transcript
Chapters
- Personal backgrounds (1:26)
- The compound lever (10:54)
- What is Factory? (12:41)
- Cognitive architectures (16:29)
- 800 engineers at OpenAI are working on my margins (21:13)
- Jeff Dean doesn’t understand your code base (24:00)
- Individual dev productivity vs system-wide optimization (25:40)
- Results: Factory in action (30:04)
- Learnings along the way (32:54)
- Fully autonomous Jeff Deans (35:36)
- Beacons of the upcoming age (37:56)
- How far are we? (40:04)
- Competition (43:02)
- Lightning round (45:32)
- Bonus round: Factory’s SWE-bench results (49:34)
- Mentioned in this episode
Matan Grinberg: I would have thought this would be different 13 months later, but this is still very much the case where agent is synonymous with unreliable, stochastic, demoware, vaporware, and I think something very important for us is we want to build these systems that aren’t just like cool examples of what is to come, but rather valuable today and not just valuable for like a hacker on a side project but valuable to enterprise engineers today.
Sonya Huang: Hi, and welcome to Training Data. We have with us today Matan Grinberg and Eno Reyes, founders of Factory. Factory is building autonomous software engineering agents, or “droids,” that can automate everything from the drudgery of maintaining your documentation to actually writing code for you. In doing so, they are building the ultimate compound lever. Last week, Factory also announced some impressive results on the key AI coding benchmark SWE-bench, beating state of the art by a wide margin. Stay tuned at the end of the episode for more context on how they did it.
Pat Grady: We’re here with Matan Grinberg and Eno Reyes, founders of Factory. Gentlemen, thank you for joining us.
Matan Grinberg: Thank you so much for having us.
Eno Reyes: Yeah, thanks for having us.
Personal backgrounds
Pat Grady: Let’s start with a little bit of personal background. And Matan, maybe we’ll start with you and then go to Eno. So Matan, one thing that I believe you and I have in common is an affinity for a well-executed cold call. I know at least two cold calls that have had some bearing on your life. Why don’t we start with the one that you did, as an undergrad at Princeton, to somebody who my partner Shaun Maguire tells me is quite a famous physicist. Can we start with that cold call?
Matan Grinberg: Yeah, absolutely. So while I was at Princeton, I was studying string theory. And the most famous string theorist happened to be working at the Institute for Advanced Study, which is an academic institution, right next to Princeton University, but not technically affiliated with it. Part of the allure of going to the IAS is that you don’t have to take on graduate students, much less undergrads. That said, you know, there’s a professor there, Juan Maldacena, who is, you know, by far the kind of leader of the string theory movement.
And, you know, being a young, ambitious undergrad, I decided, you know, might as well see if I could snag him as an advisor. And so with some advice from some graduate students, I sent him an email and asked if we could meet. And the thing about Juan is the way he works with people: he’ll take a meeting with anyone, basically, and will spend about two hours at the chalkboard with you. And in this two-hour chalkboard session, he’ll subtly drop a problem that you basically have 24 hours to solve and get back to him with the solution. And then you’ll officially, you know, be a student of his. Luckily, I was warned about this rite of passage. So yeah, I was paying close attention to any hints he was dropping.
Pat Grady: You had one?
Matan Grinberg: Indeed, yes, yes. So I found the problem, ended up spending, you know, basically the entire night working on it, and, you know, luckily ended up having him as an advisor. We were able to then publish a paper together, which was very exciting. So I had a typical undergrad experience. Yes, yes, exactly.
Pat Grady: So there’s a second cold call that I want to ask you about. Before we get to that, why don’t we go to Eno. So Eno, you similarly went to Princeton, you have a CS degree from there, you spent some time as a machine learning engineer at Hugging Face, which is where we first intersected, and spent some time at Microsoft. But like a lot of great founders, your story before then started with some humble beginnings. Could you say a word about the stuff that doesn’t appear on LinkedIn that has helped to shape who you are today?
Eno Reyes: Yeah, absolutely. You know, my family on my dad’s side came from Mexico in the late 60s to San Francisco. And my grandparents were both working for a bit, but when my dad was born, they started a Mexican restaurant in Los Altos. And that was in the 70s. They moved it to Haight and Cole in the 80s, and were a very kind of San Francisco immigrant story. They actually ended up leaving, to Georgia, where I grew up. But really, I think it’s the drive that they had to give my dad a successful life in America. And it was my dad and my mom that drove that same kind of mentality into me growing up. And I think it’s really cool, because this story is one that I think a lot of Americans share, and something that makes it really exciting to be back in San Francisco, kind of building something to potentially make the world a better place for everyone.
Pat Grady: Very cool. That is the dream. Matan, I want to get back to that other cold call, because I think it leads directly into the founding or the forming of Factory. So our partner Shaun Maguire, who I mentioned earlier and who I believe shares a similar academic background to your own, received an email from you a year or so ago that led to a walk. And very shortly thereafter, Factory was formed. So I’m curious, what caused you to cold call Shaun Maguire? And this is less of a Shaun Maguire question, because we know plenty about Shaun Maguire. This is more of like, you’re on a very good path, you do really good research, you’re on track to get a PhD in physics. And something inspired you to go in a different direction. And I’m curious, what was it that inspired you that led to that cold call? And maybe tell us a quick story about what happened shortly thereafter?
Matan Grinberg: Yeah, absolutely. So, you know, I was like you said, I was doing my PhD at Berkeley. About a year in though I realized that I was only doing theoretical physics, and string theory, because it was hard. And not because I actually loved it, which is obviously a bad reason, bad reason to do anything. And, you know, I had such tunnel vision on this path that, you know, when I came to this realization, it was kind of earth shattering, and I looked at the paths ahead of me.
And there were basically three options that seemed realistic. And so it was either going into quantitative finance, going into big tech, or going into startups. And by this time, I’d already kind of switched my research at Berkeley from being purely physics to ML in physics, and then slowly more ML, and then mostly AI. So it was kind of quickly cascading there. At the time, I think I saw a video of Shaun speaking, I think over Zoom to some founders at Stanford or something. And I recognize his name from string theory research, because I had read his papers way back in the day.
And it was particularly shocking to me, because I’m not sure how much time you’ve spent with string theorists, but normally they’re quite introverted. Not the most socialized. Yeah, and so Shaun is, you know, this very different example. And so I looked at his background, and it was just shocking to see someone who was so deep, a bonafide string theorist, then go and start his own companies, invest in some of the best companies, join Sequoia and be a partner there. And to me, it was just like, oh my God, this seems like someone who is of my kind of background, of my nurturing, I guess.
And so I sent him an email. And I was just like, hey, you know, we both were string theorists, I don’t want to do string theory anymore, I’m thinking about AI, I’d love to get your advice. Like you mentioned, that then turned into a walk. It actually was supposed to be a 30-minute walk. We ended up going from the Sequoia offices in Menlo Park all the way to Stanford, and then back. And so it ended up being three hours. He missed a lot of his meetings that day. So it was pretty amusing.
And basically at the conclusion of the walk, he said one thing was for sure: you must drop out of your PhD, there are way too many exciting things to do. And he kind of left me with the advice of: one, you should join Twitter right now, because this was just after Elon took over, and he was saying, you know, only the most badass people are going to join Twitter right now. Two, you could join a company of mine as just, like, general glue, is what he said. This is Foundry, by the way. Yep. Or three, if there are some ideas that you’ve been thinking about, you should start a company. And I was, you know, very grateful for all the time that he spent, and we kind of left off there.
Beautifully, in parallel, Eno and I had just reconnected at a LangChain hackathon. And he was in Atlanta the weekend prior, and he basically got back the next day. So that next day, Eno and I got coffee, and I think we got coffee at noon. And then basically every hour since then, until now, Eno and I’ve been working together, talking constantly about code generation and what became Factory.
Pat Grady: Did you and Eno know each other in undergrad?
Matan Grinberg: We had like the maximal overlap of mutual friends without ever having had a one-to-one conversation.
Eno Reyes: Yeah, it’s pretty funny. We were in eating clubs at the time opposite from each other. And we had just so many mutual friends and it really wasn’t until I moved to the Bay Area that we had a face to face convo and it was a very fruitful conversation for sure.
Matan Grinberg: It was intellectual love at first sight you could say.
Eno Reyes: Absolutely.
Sonya Huang: I love that, and it’s so serendipitous about the LangChain connection. I’m curious, you’re both brilliant, and I think for a lot of founders starting out in AI right now, a lot of them find it hard to resist the siren call of training a foundation model. So how did you decide to, you know, build in the application layer? And then why software engineering?
Matan Grinberg: Yeah. So I think from my perspective, like going deep from academia, I think, throughout all the years of spending time on math and physics, the theory of beauty that I learned to be drawn to was things being fundamental. And, you know, spending, you know, time doing AI research, it was so clear that code is fundamental to machine intelligence. And so I was just naturally attracted to the role that it plays there. And I think that kind of joined quite well with Eno’s attraction to the space.
The compound lever
Pat Grady: You’ve referred to it a couple of times as a compound lever. Can you unpack that for us? And let us know what that means?
Matan Grinberg: Yeah, I mean, so there’s the famous Archimedes quote, or rather his quote is, you know, if you have a large enough lever, you can move the world. And then I think that’s been co-opted for software engineering, right, that software is a lever upon the world. And for us, we see AI, and in particular AI code generation, as a lever on software, the impacts of that being, you know, compounding, exponential.
Pat Grady: And sorry, Eno, I think I cut you off, I think you were maybe mentioning how you got to the founding inspiration for Factory?
Eno Reyes: Oh, yeah, absolutely. I mean, I think Matan’s story is really indicative of kind of the energy at the time. I was at Hugging Face working on, you know, training, optimizing, deploying LLMs for enterprise use cases, and was actually working with Harrison on early LangChain integrations. And it was so clear that the work that was happening in open source was directionally moving towards modeling human cognition with LLMs, where the LLM was just one piece of the system.
The idea of chains, or cognitive architectures as I think Harrison and the LangChain folks call them, was taking hold. And seeing that happening, and seeing that, within the code gen space, the most advanced players were basically looking at autocomplete, it felt like there was a huge opportunity to take that to the next step, and take some of those lessons that were happening in the kind of fringe research and open source communities and apply them towards a kind of massive organization.
What is Factory?
Pat Grady: I realize we haven’t said explicitly yet what Factory is. So Matan, what is Factory? And then maybe what are a couple of the key decisions that you’ve made about the way Factory is built? You know, for example, one of them is to start by benefiting from all the ongoing improvements in the foundation model layer. One of them might be the product itself. But can you just say what is Factory, and what are some of the key decisions you’ve made that have shaped Factory today?
Matan Grinberg: Yeah, absolutely. So Factory is a cutting edge AI startup. Our mission is to bring autonomy to software engineering. What that means more concretely, we are automating tasks in the software development lifecycle. And in particular tasks like code review, documentation, testing, debugging, refactoring. And, you know, as I list these off, you’ll kind of hear quickly that these are the tasks that engineers don’t particularly enjoy doing.
And that’s very much intentional, right? Obviously, we are doing code generation, and that’s really important. But I think, equally important to generating some inspirational and forward-looking demos, it’s also important to understand what engineers are actually spending their time on. And in most organizations, it’s not actually fun development work. They’re spending a lot of their time on things like review and testing and documentation. Normally, they’ll do these things way too late, and they’re suffering because they’re missing deadlines, right?
And so our approach is, we want these tools to be useful in the enterprise. And so to do that, we need to kind of meet engineers where they are, with the tasks that they are very eager to automate away. We call these autonomous systems “droids,” and like Eno was alluding to earlier, there’s a droid for each category of task. And in this kind of paradigm, where we want to frame these problems as games, it’s very convenient that software development has a clearly defined software development lifecycle. And so for each category of task, or each step in the software development lifecycle, we have a corresponding droid. So that’s kind of a first pass there. I guess there was a second part of your question that I missed.
Pat Grady: Oh, we’ll get into the rest of it. Where did the name droid come from? It’s a pretty catchy name. It’s very memorable and distinct to Factory. Where’d that come from?
Matan Grinberg: Yeah, yeah. So I mean, keep in mind, you know, when Factory started, this was, like you mentioned, about a year and a month ago. And you know, actually, I would have thought this would be different 13 months later, but this is still very much the case where agent is synonymous with unreliable, stochastic, demoware, vaporware. And I think something very important for us is we want to build these systems that aren’t just, like, cool examples of what is to come, but rather valuable today. And not just valuable for, like, a hacker on a side project, but valuable to enterprise engineers today. We felt very strongly that agents just don’t really capture what we’re trying to deliver. And so, fun fact, we were originally incorporated as the San Francisco Droid Company. But upon legal advice, and given, I guess, the eagerness with which Lucasfilm pursues its trademarks, we changed our name to Factory.
Pat Grady: Fair enough. So is it fair to say that a droid is sort of like a job-specific autonomous agent that actually works? Is that a reasonable way to think about it?
Eno Reyes: Yeah, exactly.
Cognitive architectures
Pat Grady: You just said the words cognitive architecture. And I know my partner Sonya Huang well enough to know that this is her love language. So I’m sure that Sonya’s mind just lit up with a whole bunch of questions for you. So I don’t want to get in the way. Sonya, have at it.
Sonya Huang: We just had Harrison on the podcast, who talked about custom cognitive architectures as well, I guess, what are you doing on that front? And how do your implementations dovetail with the multi-droid strategy that you’re talking about?
Eno Reyes: Yeah, absolutely. I mean, it’s a great question. And the way that we think about reasoning and cognition within these systems, there are clearly huge innovations happening on both layers: the foundation model layer, as well as the kind of orchestration or application layer.
The way that you can kind of think of our technical approach on this is that, traditionally, labs like DeepMind, and some of these orgs that are really focused on solving problems that you can model like a game, where you have rules and an environment and feedback loops, can build out systems which model the reasoning of humans and even outperform them. They did this with the Alpha series of models: protein folding, Go, code. And for us, most of the reasoning work we do is similarly focused on inference-time reasoning, search through decisions, and what we kind of think of as, you know, maybe something of intuition, maybe something of planning. But we aren’t training foundation models yet.
And I think that a lot of the innovation that’s going to happen at the foundation model layer will be things like latency and context window and kind of performance on some subset of tasks. But anytime that you need action and environmental feedback, and kind of long-term planning, it’s going to be really difficult to build a single foundation model that does that. And I think it’s really the application layer where those types of innovations are going to happen.
Sonya Huang: Yeah, I thought the Princeton SWE-agent paper that came out last week or so was really interesting as an example that you can get incredible agentic reasoning performance on code tasks from small open source models. I thought that was a really nice proof point of what you’re saying.
Eno Reyes: We love the whole team that put that together. And the SWE-bench work, I think, is a popular benchmark in the space. I think it’s clear that a lot of the effort towards building these systems relies on not just any one benchmark or eval or set of tasks, but rather collaboration across a bunch of different areas, whether it’s the model layer, whether it’s the tasks themselves, what data you are using to evaluate, and ultimately the overall architecture. And yeah, they’re a really great team. We’re super pumped to see their work.
Sonya Huang: Okay, last question on this, and then I will pause myself. Any favorite cognitive architectures? Like, is it the tree of thought stuff, chain of thought stuff? Any favorite cognitive architectures that you think are especially promising or fruitful in your field?
Eno Reyes: Yeah, I think that’s a great question. I mean, kind of what I alluded to previously, when you have almost like a game-like problem space, where there are simulatable, analyzable and optimizable boundaries, then that means that you can search through those decisions. And there are a bunch of techniques, like Monte Carlo tree search and language agent tree search, that people have talked about in research papers that I think are interesting approaches here.
I think that, in my mind, there isn’t a singular cognitive architecture that makes sense for all tasks. And a lot of the benefit of breaking down the software development lifecycle into kind of semantically meaningful segments is that developers, when they have these workflows that move from one step to the next, they’ve kind of defined the boundaries of the game, so to speak. And so a lot of the work we do is figuring out which cognitive architecture or what design makes sense for a given task.
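To make the search idea concrete: the kind of decision search Eno describes can be illustrated with a minimal best-first search over candidate agent actions. This is a hedged sketch under invented assumptions, not Factory's actual implementation; the function names, the scoring heuristic, and the toy "reach a target" task are all made up for illustration.

```python
import heapq

def best_first_search(initial_state, expand, score, is_goal, budget=50):
    """Generic best-first search over agent decisions.

    expand(state)  -> iterable of candidate next states (e.g. tool calls)
    score(state)   -> heuristic value; higher is more promising
    is_goal(state) -> True when the task's success check passes
    """
    # Min-heap with negated scores, so the most promising state pops first.
    # A counter breaks ties between equal scores.
    frontier = [(-score(initial_state), 0, initial_state)]
    counter = 1
    while frontier and budget > 0:
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state
        for nxt in expand(state):
            heapq.heappush(frontier, (-score(nxt), counter, nxt))
            counter += 1
        budget -= 1
    return None  # budget exhausted without reaching the goal

# Toy stand-in for a droid task: reach a target value by choosing actions.
target = 7
found = best_first_search(
    0,
    expand=lambda s: [s + 1, s + 2],   # two candidate "actions" per state
    score=lambda s: -abs(target - s),  # closer to the target is better
    is_goal=lambda s: s == target,
)
print(found)  # 7
```

In a real coding agent, `expand` would propose tool calls or edits, `score` would be a learned or heuristic value signal, and `is_goal` would be an environmental check such as passing tests, which is what makes the "game-like boundaries" framing above workable.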
Sonya Huang: You’re reminding me of Rich Sutton’s Bitter Lesson: search and learning are the two techniques that scale.
Eno Reyes: Yeah, absolutely. And I think you definitely need both.
800 engineers at OpenAI are working on my margins
Pat Grady: And then, Eno, you were talking about this a bit, how the reasoning layer on top of the foundation model is really the focus for a lot of the fundamental research and a lot of the fundamental work that you guys are doing. Matan, you had a line a couple of months ago when we were talking. And hopefully this doesn’t come across as snarky, because it’s not meant to, but it was something to the effect of “there are 800 engineers at OpenAI working on my margins for me.” Can you say a word about that? Because first, I thought it was incredibly well put. And second, it’s a pretty good insight in terms of how you’re building the business and really benefiting from the work of the foundation models. Can you just say a couple words about that?
Matan Grinberg: Yeah, absolutely. So you know, there are a lot of companies, a lot of startups, that are pursuing training foundational models or fine-tuning models. And then there are a lot of huge research labs like OpenAI and Anthropic that are also putting a ton of resources behind making these foundational models better, cheaper, faster. And from our perspective, we don’t want to run a race that’s not suited to our abilities, right? We don’t want to fight a battle that we know we won’t win. Training foundational models, we’re not going to win that battle. And similarly, I also don’t think it’s a particularly unique battle at this point. I think these companies were incredibly unique and innovative, clearly, based on what they’re delivering. But now, I think the stage is set in terms of training foundation models. And I think similarly with a lot of the infrastructure for fine-tuning and that sort of thing.
What has not really come to fruition yet is actually making products with AI that people are using. There’s so much talk about all these foundational models, all this infrastructure, and there are still very few real products that use this AI. You know, in the analogy that VCs like to talk about a lot, we have a ton of picks and shovels, and no one’s actually going for gold. And so the thesis behind how we’re building this company is, let’s first use these beautiful black boxes that OpenAI, Anthropic and Google are spending billions of dollars and hundreds of engineers to make, and build a product that people are actually using.
And once we do that, then we can earn the right to do the fancier things like fine-tuning and training. If you’re unable to build a product that people are actually using with these incredible models, then chances are fine-tuning and training will not save you. And it’s probably just not a good product. And so that’s kind of the approach that we’re taking there. And so, you know, we do get a lot of improvements when new models come out. But we are very much grateful for the work that’s being done at these cutting-edge research labs.
Jeff Dean doesn’t understand your code base
Sonya Huang: You’ve said a lot about how what you’re doing is kind of making AI immediately practical for engineers in an enterprise setting. And so I want to throw out another quote. I think it’s a Matan quote, and I’m not sure if you were quoting somebody else, but you told us last time that if Jeff Dean shows up in your office and he doesn’t understand your code base, he won’t be productive. Unpack that for us. What does it take to make a coding agent good, not just for anybody that boots up a computer, but for somebody that’s a full-time engineer at a real software company?
Matan Grinberg: Yeah, totally. So the analogy here is that Jeff Dean is the analog of a really, really good foundational model, let’s say, like, GPT-6, with incredible reasoning, right? But if it comes into your engineering organization, with all your nuances and all your engineering best practices, just having that good inference and good reasoning is not enough to actually contribute and automate these tasks reliably. Sure, it can solve some given isolated task. Give Jeff Dean a LeetCode problem, and I’m sure he will solve it.
But if you have some, you know, 20-year-old legacy code base, some part of it is dead code. The other part of it, the last person who was contributing to it just retired, and so no one else knows what’s going on there. You need a deep understanding of the engineering system, not just the code base, but why you made certain decisions, how things are being communicated, what top-of-mind priorities are for the engineering organization. And it’s these less sexy but incredibly important details that we’re really focused on in order to deliver this to the enterprise.
Individual dev productivity vs system-wide optimization
Sonya Huang: A lot of these AI coding companies are kind of focused on the individual developer’s productivity. How do you think about the individual-level optimization versus the system-wide optimization?
Eno Reyes: I think the important thing to think about with respect to the whole org is, when a VP of engineering comes into the room, they’re not really focused on whether or not an individual completed one task an hour faster. They’re concerned about how many tasks are being completed, and aggregate metrics of speed. What if that person completed that task an hour faster, but it’s 40% worse code, right? It’s churning code, where people are going to rewrite on top of it. Or that person took that task and, you know, they did it in an hour, but it took them four hours to plan it, and they were blocking five other engineers.
And so when you start to actually add the nuance of what it means to be successful when measuring an engineering org, you start to bump into a lot of challenges with understanding what needs to be improved, what is a bottleneck, and what is just kind of a secondary metric. I think a lot of the initial attempts at making AI coding tools are really focused on first-order effects: how quickly is somebody tabbing to autocomplete a statement, or, you know, how quickly is somebody completing an individual task?
But I think that at Factory, a lot of what we’re trying to do is understand, from an engineering leader’s perspective, how are you measuring performance? And what are the metrics that you look at to understand, hey, we’re doing really well as an org, or, hey, we need to make improvements, and targeting those. And I think metrics like code churn, end-to-end open-to-merge time, you know, time to first answer within the eng org, all these things are much more impactful to an organization’s speed of shipping code. And so that’s kind of how we think about it.
Matan Grinberg: I think this really ties into what Eno was just saying quite well. We were talking about products earlier as well, and clearly the AI product that has penetrated the enterprise the most is Copilot, right? Unfortunately, with a tool like Copilot, the metrics that are really held up as success are things like autocomplete acceptance rate.
And the problem is, exactly to your point, if you’re a CTO or VP of engineering, how do you then go to the executive team and say, hey, look, our autocomplete acceptance rate is this high? They don’t know what that means. They don’t understand how that translates into business objectives, right? And also, Eno was alluding to this, there’s kind of a hidden danger to some of these autocomplete tools, which is, orgs that use tools like this end up increasing their code churn by anywhere from 20 to 40%. There are some studies that look into this. There are some problems with these studies. But, you know, directionally what’s clear is that as the percentage of AI-generated code increases, if you’re not doing anything different in your review process, code churn is going to go up.
And so our reason for focusing on org-wide metrics is that it kind of divides out all of these concerns. If we look at things like how fast you’re completing your cycles, or what your code churn is across the org, or across these different repos, that divides out these smaller, intermediate metrics and gives you a sense of, hey, we are shipping faster, and we’re churning less code. So that’s really how we talk about this with these engineering leaders. At the end of the day, the three main axes we look at are saving engineering time, increasing speed, and improving code quality. Again, there’s a different complexity of metrics for different parts of the org, but these are the three that we discuss with engineering leaders. We also want to arm them with information when they’re talking to, let’s say, their CFO. And so we really break that down into one main metric, which is engineering velocity. That’s what all of these droids are targeted towards: increasing engineering velocity.
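To make those org-wide metrics concrete, here is a minimal sketch of how code churn and open-to-merge time might be computed from pull request records. The field names, the sample data and the 30-day churn window are illustrative assumptions, not Factory’s actual definitions:

```python
from datetime import datetime

# Hypothetical PR records; the schema here is invented for illustration.
prs = [
    {"opened": datetime(2024, 6, 1, 9), "merged": datetime(2024, 6, 2, 15),
     "lines_added": 120, "lines_reverted_within_30d": 30},
    {"opened": datetime(2024, 6, 3, 10), "merged": datetime(2024, 6, 3, 18),
     "lines_added": 40, "lines_reverted_within_30d": 0},
]

# Code churn: the share of newly merged lines that get rewritten or
# reverted shortly after landing (here, within 30 days).
total_added = sum(pr["lines_added"] for pr in prs)
total_churned = sum(pr["lines_reverted_within_30d"] for pr in prs)
churn_rate = total_churned / total_added

# Open-to-merge time: average wall-clock hours from PR open to merge.
open_to_merge_hours = sum(
    (pr["merged"] - pr["opened"]).total_seconds() for pr in prs
) / len(prs) / 3600

print(f"churn rate: {churn_rate:.1%}")
print(f"avg open-to-merge: {open_to_merge_hours:.1f}h")
```

The point of metrics like these is exactly what’s described above: they summarize the whole org’s shipping behavior rather than any one developer’s autocomplete usage.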
Results: Factory in action
Pat Grady: Let me try to recap a couple parts of the story thus far. In some ways, this is a compound lever: AI is a lever on software, and software is a lever on the world. And so building an autonomous software development system is one of the most impactful things you can possibly do with your lives, which is pretty cool. There are a few unique angles to the approach that you guys have taken, or maybe not unique, but distinctive. One of which is the decision to ride on top of the foundation models, which means that you get to benefit from all their ongoing innovation. It also frees you up to really focus on the reasoning and the agentic behavior on top of those foundation models, which is part of the reason why you can deploy your product as a series of droids, which are basically job-specific, autonomous agents that do something like test or review end to end in a way that is practically useful to an engineering organization. And instead of focusing on just producing more code, you’re actually focused on the system-wide output, which requires you to have really detailed context around not just the codebase, but all of the systems and processes and nuance around the entire environment. And having done so, you can increase velocity for an organization. I think that’s a bunch of the story that we’ve talked about so far.
Let’s talk a bit about the results. Are there any good sort of customer examples you can share of, you know, Factory in action and the results that you’ve been able to have for people?
Matan Grinberg: Yeah, so we’re not super public on case studies just yet, but some of the main things that we’re seeing across the board: I think our average cycle time improvement is around 22%. On average, we are lowering code churn by 13%. And tools like the test droid (I guess we haven’t even gotten into the specific droids) end up saving engineers around 40 minutes a day, which is pretty exciting.
And yeah, going back to what we were talking about in terms of benchmarks, one of the most exciting things about having thousands of developers actually using these tools is that we get this live set of benchmarks, and we get evals and feedback from these developers about how these droids are performing. Like I mentioned, we’re huge fans of SWE-bench and what it’s done for the general community, giving people an open-source benchmark to really compare these models. But strategically for us, having this deployed in the real world has allowed us to dramatically increase our iteration speed on quality for these droids.
Learnings along the way
Pat Grady: Since you have a bunch of people using this in the real world, what have you learned along the way? And have there been any big surprises?
Matan Grinberg: Engineers love ownership.
Pat Grady: Yeah, alright, say more.
Eno Reyes: Absolutely. I mean, I think it really is that, when you’re building an autonomous product, and the goal is to take over a task, you have to deal with developers who are fickle, for good reason: they’re constantly bombarded with developer tools and automations. And anything that’s being enforced from a top-down perspective needs to be very flexible.
And so when we’re building these products, we think about the different preferences or ideas people have about how a task should be done, and then build as much flexibility into that as we can. I think a great example of this is the review process. Everybody has a different idea of what they want code review to look like. Some people want superhuman linters. Some people want really deep analysis of the code change. Some people don’t even like code review; they get annoyed by it entirely. Matan has a great quote about what code review is like. I don’t know if you want to share that.
Matan Grinberg: Yeah, yeah. So in general, we’ve internally realized that the code review process is very much like going to the DMV, in that no matter how clean the DMV is, no matter how fast the line is, no one loves it. No one loves code review, right? Because at the end of the day, someone is criticizing you. Someone’s going in and looking at what you did and pointing out better ways you could have done it. So the review process is the type of thing where, as an engineering leader, it’s great to see it moving the needle on these organization-wide metrics. As a developer, it’s maybe not the most fun thing.
Whereas something like the test droid, right, which is generating tests for you so you don’t spend hours writing your unit tests. That’s incredible as a developer, but you know, for the engineering leader, it’s slightly less obvious how that connects directly to business metrics. So I think this is part of why it’s important for us to have this fleet of droids. Because we’re not just building this for the engineering leader, nor are we just building this for the developer, but rather for the engineering organization as a whole.
Pat Grady: Part of what I heard there was that I don’t have to go to the DMV anymore. You can just send me my driver’s license in the mail.
Matan Grinberg: Basically, yeah. Well, that’s a good way to sum it up.
Sonya Huang: Have you guys seen Pat drive? I don’t think they should be sending him a driver’s license.
Matan Grinberg: There’s Waymo for that.
Fully autonomous Jeff Deans
Sonya Huang: Speaking of Waymo, how far out do you think we are from having fully autonomous software engineers? Like if you thought about Waymo, like it felt like it was gonna come really fast. And then it felt like we went through a valley of despair. And now the future is coming out super fast again, like, which inning are we in for the kind of fully autonomous software engineer cycle? And when do you think we’ll have fully autonomous Jeff Deans?
Eno Reyes: This is a great question, and one that we get a lot. I think one thing that’s worth doing is reframing what a fully autonomous software engineer will do. There have been many moments where technical progress has led to labor dynamic changes and increases in the level of abstraction at which people work. And historically, enabling people to operate on, or impact, the construction of software at a higher level of abstraction, with less domain knowledge, has generally led to huge increases in demand for software.
And I think what we’re seeing with the customers we’re working with today is that when you free people up from these types of secondary tasks, like generating unit tests that map to a pull request, or writing and maintaining documentation on a codebase that 95% of people know but that comes into play for the 5% that doesn’t, they start to shift their attention to higher-level thinking. They think about orchestration, they think about architecture, they think about what this PR is actually trying to do, and less about whether it followed the style guide.
I think this is happening today, already, because of AI tools. And over time, as they get better and better, we’ll see that shift towards software engineers becoming a little bit more like architects or designers of software. And so in the future, I think there are going to be 10 times more people involved in software creation, where every individual has the impact of maybe 100 or 1,000 people. It just may not look exactly like the individual steps of the development lifecycle that we see today.
Beacons of the upcoming age
Pat Grady: You know, that reminded me of a quote that you guys have on your website, and I’m gonna read this: “we hope to be a beacon of the coming age of creativity and freedom that on-demand intelligence will unlock.” And that really resonated with me when I read it, because it implies a very positive and optimistic view of the world that we’re heading into. I wonder if you guys want to say a couple more words on that, or on what you think the relationship between man and machine will be in the fullness of time.
Matan Grinberg: This kind of goes back to our original approach. It’s very tempting to go after the sexiest parts of software development, like building an app from scratch, right? But that’s also the sort of thing that will make a developer defensive, because that’s the part that they enjoy, right? And so in a world where you automate the development, an engineer is just left reviewing, testing and documenting, which is like a depressing hellscape, if you were to ask any software engineer, right?
So for us, it’s very important that we position ourselves aligned with the developer instead of going into these organizations and being antagonistic with them. By going in and automating the things that developers don’t want to do, we are positioning ourselves with them, right? Five years from now, I don’t think anyone really knows what software engineering will be, or even if it’ll be called that anymore. To this point, you might be, you know, a software curator, or cultivator, or orchestrator.
But by positioning ourselves this way, with the developer, wherever that role goes, we will be there side by side, to allow them to have this higher leverage. And so yeah, I completely agree: this is one of the most incredible things that is going to happen to our ability as humans to create. And for us, it’s just incredibly important that we are aligned with the users of this product, and not antagonistic, not trying to replace them.
How far are we?
Sonya Huang: How far do you think we are from having these reliable, kind of maybe call it intern level engineers? Is it a year out? Is it already here today? Is it a decade out?
Eno Reyes: I think it depends on the task. For things like code review and testing, we’re already there, where we’re able to operate at a high level. There’s feedback from one organization in particular, pretty early on, where we brought them the review droid, and they said the review droid is the best reviewer on our team. Every once in a while you hear something like that, and it gives you a lot of confidence that directionally we’re definitely moving towards something that is valuable.
And for tasks like, hey, we’ve got to decompose our monorepo into a ton of microservices, the type of thing where you might arm a staff-level engineer with a team of engineers under them, I don’t think we’ll see a binary moment of, oh, well, now this is done by an AI. I think their responsibilities will slowly start to get decomposed into the tasks of planning and implementing the refactor, going one file at a time.
And when they start handing off those subtasks to AI, I think that role will start to be called something different, because when you’re no longer as focused on what individual line of code you’re writing tomorrow, and more focused on your mission or your goal as an engineering team, you really are more of an architect and less of an implementer.
Matan Grinberg: And a concrete example of us eating the food that we’re creating: we were dreading, for months, creating a GitLab integration. Some of our customers use GitLab. We want to build cool AI stuff; we didn’t want to spend time building a GitLab integration. So we had our code droid fully spec out what the steps of building a GitLab integration would look like, and then it actually implemented every one of those sub-tickets. We were, of course, monitoring it just to make sure it wasn’t breaking anything. And we now have a GitLab integration. This is something we genuinely were considering getting an intern to do, because we really didn’t want to do a GitLab integration.
But materially, the droid saved us hours of time. None of us had built a GitLab integration before, and it’s relatively complicated to abstract away the source code manager. So that was materially intern work that we did today. To answer your question: it is now. It’s just slowly climbing up the level of complexity of these tasks.
Competition
Sonya Huang: The future’s really here. I have a question about competition, not specifically the competition in your space, but how you more generally think about navigating competition. I think you guys are the type of founders that a lot of companies in the application layer really look up to, because you’re insanely ambitious, building a real company of meaning, and you’re making a lot of smart decisions, like riding on other people’s models. The obvious, kind of scary other side of that is that every other competitor in the space has access to the same models as you. So I’m curious how, mentally and overall, you think about approaching competition in this space. Do you think it’s more elevated in the application-layer AI market than in other startup markets historically? And how do you think about navigating that?
Matan Grinberg: Totally. Yeah, I think that’s a great question, and our approach to it has defined how we’ve built out this team. There are a lot of ways you can respond to competition and mentally justify your existence versus competitors. For us, on the team side, we’re just a team of people who are more obsessed than anyone else out there. And that has a compounding benefit: I am willing to bet everything that the people we have assembled are just more obsessed than everyone else working in this space. A corollary to that is that the only way you can win is by executing faster. Everything else is just sprinkles on top. The only way you can really win is by executing faster and being more obsessed.
And that is what our team is. One last thing is having a group of people who respond to external pressures as motivation, and who are very mission driven, right? If a competitor does something big and suddenly you’re deflated? Well, if you’re truly obsessed with a mission, it’s irrelevant, right? If you’re truly obsessed with our goal of bringing autonomy to software engineering, all of that is noise. What we need to do is execute as fast as possible in the direction that we’ve set, and the rest will sort itself out.
Lightning round
Sonya Huang: Love it. Really well said. Maybe a few final questions to close this out. If you weren’t solving the kind of autonomous software engineering problem, what problem would you be solving?
Eno Reyes: I guess I’d have to be banned from coding agents for this. Perhaps robotics? I find robotics very interesting. A lot of the team here comes from backgrounds working on autonomy and robotics, and we’ve talked about how what we’re building really resembles that in many ways. Multimodal function-calling LLMs are here, and the robotics companies coming out now, with decreasing hardware costs, are clearly making progress. So it feels like a fun area.
Sonya Huang: So you’d be making physical droids?
Eno Reyes: Exactly.
Matan Grinberg: It’s on the roadmap.
Sonya Huang: How about you, Matan?
Matan Grinberg: Yeah, I think this is one of my blind spots where I just suffer from severe tunnel vision. I genuinely cannot fathom working on anything else. I’m just genuinely obsessed with our mission to bring autonomy to software engineering. If I wasn’t working on this, I’d figure out a way to work on this. I know that’s a cop out answer. But I genuinely can’t, it does not compute.
Pat Grady: So that is, in fact, a cop out answer, but it is a fantastic cop out answer, so we will take it. One of the questions that I always like to ask is: who do you admire most in the world of AI? And I’ll tell you what, Matan, because of your background we’ll let you look at the superset of AI and physics, if you like.
Eno Reyes: I would say the name that comes to mind when you say that is Jeff Dean. I think we mentioned him earlier already, actually. His impact in research is one huge side of it: TensorFlow, and the work that whole team has done at Google DeepMind and related teams. But I’ve also heard he’s a nice guy. And having responsible leadership in the AI community is really important. There are a lot of folks who are on Twitter all the time, clashing, and seeing folks who stay outside of that side of it is pretty great.
Matan Grinberg: Yeah. And not to give you guys a double cop out, but at Factory we very highly emphasize collaboration. In AI in particular, everything has been done by groups of people, so it’s hard to really think about one individual. In physics there are a lot more solo geniuses doing something crazy.
But a team that we really admire at Factory recently is Mistral, and how quickly they came into open source and brought those models basically to the cutting edge in a super short amount of time. I speak not just for myself; I think all of our team really admires both the mission that they have and the speed with which they executed on it. So yeah, I would say Mistral.
Pat Grady: Awesome. Alright, last question: if you had to offer one piece of advice to founders, or would-be founders, hoping to build in AI, what would it be?
Matan Grinberg: We are in a land of picks and shovels. And no one has struck gold yet clearly. So I’d say go for gold.
Eno Reyes: I would say try to build something that you think is going to get 10x better if OpenAI releases GPT-6 or 7. Internally, we think of our product as something that will multiply in value and uniqueness when new models are released. For us, it’s like, we were listening to the OpenAI announcement yesterday, and everyone is excited, everyone’s pumped, when a new model comes out or when open source does something great. If you’re stressing about new model releases or demos, it might mean it’s worth adjusting your product strategy.
Bonus round: Factory’s SWE-bench results
Sonya Huang: Congratulations on launching Factory and beating state of the art on SWE-bench by such a wide margin last week. It’s incredible. Just for our audience, can you maybe quickly recap what SWE-bench is?
Eno Reyes: Yeah, absolutely. And thank you; all credit goes to the Factory team for making it happen. SWE-bench is a benchmark designed to test an AI system’s ability to solve real-world software engineering tasks. It’s around 2,300 issues taken from contributions made to 12 popular open-source Python projects. Typically, these issues are bug reports or unexpected behavior that people reported on these open-source projects. The idea is that all of these real-world issues were addressed by humans, so you have the ground truth of what a human software engineer did when faced with an issue. And the benchmark is trying to test: can your AI system go through each of these issues and generate a code change that properly addresses it, comparing it to the human solution with tests that a human wrote? There are a lot of asterisks, but it is a somewhat useful approximation of your system’s ability to take natural language and turn that into code.
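As a rough illustration of the evaluation Eno describes, here is a stubbed-out sketch of a SWE-bench-style loop. The field names (`problem_statement`, `FAIL_TO_PASS`, `PASS_TO_PASS`) mirror the published dataset, but the agent and the test runner below are toy stand-ins, not the official harness:

```python
# Minimal sketch of a SWE-bench-style evaluation loop.

def generate_patch(issue_text, repo_snapshot):
    # In a real system this is the AI agent producing a diff; here it's a stub.
    return "diff --git a/foo.py b/foo.py\n..."

def apply_and_test(patch, fail_to_pass, pass_to_pass):
    # Real harness: git-apply the patch, then run the human-written tests.
    # Stub: optimistically pretend every test passes.
    return {t: True for t in fail_to_pass + pass_to_pass}

def evaluate(instances):
    resolved = 0
    for inst in instances:
        patch = generate_patch(inst["problem_statement"], inst["repo"])
        results = apply_and_test(patch, inst["FAIL_TO_PASS"], inst["PASS_TO_PASS"])
        # An instance counts as resolved only if the previously failing tests
        # now pass AND no previously passing test broke.
        if all(results.values()):
            resolved += 1
    return resolved / len(instances)

instances = [
    {"problem_statement": "TypeError in foo()", "repo": "proj@abc123",
     "FAIL_TO_PASS": ["test_foo_fix"], "PASS_TO_PASS": ["test_bar"]},
]
print(f"resolve rate: {evaluate(instances):.0%}")  # 100% with the optimistic stub
```

The reported SWE-bench percentages are exactly this resolve rate, computed with the real agent and real test suites instead of the stubs.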
Sonya Huang: And I think the previous high watermark on SWE-bench was 14% or so, from Cognition’s Devin, until last week, when you put up a really impressive new result at 19%, which is such a wide margin. This is such a competitive field right now, and such a competitive benchmark that everyone is trying to beat, which makes your result even more impressive. Could you share a little bit about your approach, and how you got there?
Eno Reyes: Definitely. One of the main reasons we were interested in SWE-bench is that a lot of companies and research labs have made submissions: you can see Microsoft Research, Amazon, IBM, ByteDance. I think that’s a testament to the SWE-bench team’s effort in making this benchmark a household name, which is great. One of the reasons we were able to out-compete well-funded tech giants and other AI codegen startups is that we’re honestly not building the code droid for a benchmark, but rather to support real-world customers. And we’ve always said customers are the best benchmark.
And I think there’s some great evidence for the success of that approach. There are a few areas our technical report goes into, around planning and task decomposition, environmental grounding, and codebase understanding. But overall, the thing that matters most when your team is working on these types of general software problems is: what is the north star? What are you iterating against? Having a real-world data set can make a huge difference.
Sonya Huang: And we just had Harrison on the podcast last week, actually, talking about cognitive architectures. To what extent did prompt engineering and cognitive architectures play a role in your results?
Eno Reyes: I would characterize our research as continuously pushing the question of: how can we model each droid’s architecture to more closely resemble the human cognitive process that takes place during the task? It’s funny: we’ve actually been referring internally to the flow of data and LLM calls as the droid architecture basically since the first droid.
And when Harrison first wrote about cognitive architectures, it became apparent that that concept is a great mental model for characterizing systems that have complex LLM interactions and data flow. For us, the meta-problem of designing a good cognitive architecture is balancing flexibility with rigidity in the actual workflow. You want very rigid entry points, and certain common trajectories, like error recovery, need to be really consistent. But then you want flexibility in the dynamics during the majority of the problem-solving process. It’s a challenging balance, but I think it’s one of the most interesting problems in building the droids: how do you know when to add structure, and when to let the droid, so to speak, handle it?
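The rigid-shell, flexible-middle idea can be sketched as a toy control loop: a fixed entry point and a fixed recovery path wrap an open-ended solving loop. All the names here are invented for illustration; this is not Factory’s actual droid code:

```python
def rigid_intake(task):
    # Rigid entry point: normalize every task into the same state structure.
    return {"goal": task.strip(), "steps": [], "done": False}

def flexible_solve(state, llm_step, max_steps=5):
    # Flexible middle: let the model choose its own trajectory, step by step.
    for _ in range(max_steps):
        action = llm_step(state)
        state["steps"].append(action)
        if action == "finish":
            state["done"] = True
            break
    return state

def rigid_recover(state):
    # Rigid error recovery: a consistent fallback when the loop stalls.
    if not state["done"]:
        state["steps"].append("escalate-to-human")
    return state

def run_droid(task, llm_step):
    return rigid_recover(flexible_solve(rigid_intake(task), llm_step))

# A scripted stand-in for the LLM: plan, edit, then finish.
script = iter(["plan", "edit", "finish"])
state = run_droid("Fix the failing login test", lambda s: next(script))
print(state["steps"])  # ['plan', 'edit', 'finish']
```

The design choice is that only the middle function is open-ended; intake and recovery stay deterministic, which is one way to keep the common trajectories consistent.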
Sonya Huang: Really cool. So every droid has its own cognitive architecture that mirrors, as closely as possible, what the human equivalent doing that task would be doing?
Eno Reyes: Yeah, exactly.
Sonya Huang: 19% is amazing compared to prior state of the art. It also still feels quite far away from, you know, reliable code droids that people will just trust to run wild in their codebase. What do you think is the threshold at which engineers will actually start to use these code droids reliably and just, you know, let them run? Are we there yet? Or what is the threshold?
Eno Reyes: Yeah, for sure. One thing to keep in mind is that the percentage on a benchmark like SWE-bench is just one of many possible measures, because the answer really is that people are already using it in production. But the use cases that highlight what the code droid is designed for may not have a ton of overlap with what is tested in a given benchmark. If you take HumanEval or some of the other coding benchmarks, those maybe test your ability to pass a coding interview, but they don’t really test real-world software engineering.
SWE-bench, I think, actually does test a lot of real-world software engineering, but in the particular context of debugging or unexpected-behavior identification. There are some feature requests, and some problems that aren’t explicitly debugging-style. But tasks like a migration, a refactor, or a modernization that takes place over multiple changes, which oftentimes have humans very heavily collaborating, are really a pretty different problem.
And our internal evaluations are much more focused on those customer tasks, so we have way higher reliability rates for those styles of tasks. I also think a huge part of human-AI interaction design is acknowledging where the systems are currently falling short, and building accommodation for the weak points of the AI system into your interaction pattern. It isn’t going to perfectly capture the intent of what you were doing 100% of the time. So how do you have failure-trajectory handling? How do you introduce the ability to edit midway as the code droid is working, to observe it, and to have some interpretability into why it’s making a decision, so that when it does something, the human can actually step in, or at least understand what went wrong?
And so I think those allow you to say, well, we may not be at 100% on something like SWE-bench, but we can still use this and get real productivity gains in the meantime.
Sonya Huang: Totally makes sense. And I hear you that SWE-bench is not the be-all, end-all. But since you have a good crystal ball into this space, do you have a prediction, at what point we’ll get to 80 or 90% on SWE-bench?
Eno Reyes: I think the pace right now is really, really fast. There’s an interesting question of whether we’ll get to 80 to 90% on SWE-bench, or whether a better benchmark will come out before we can meaningfully hill-climb past 50 to 60%. There are honestly a lot of tasks in SWE-bench which, I wouldn’t say are impossible, but it almost feels like getting them right would only indicate that you’re cheating. They test for really, really specific things, like an exact string match.
And so before we see 80 to 90% on SWE-bench, I think what we’ll actually see is something like a SWE-bench 2 and 3 that try to think deeply about how we can evaluate when a piece of code is not just correct, but also ideal or useful for a given codebase. The SWE-bench folks actually have a lot of really great thoughts about how to make these benchmarks better, and I think we’ll see that in the next two or three years.
Sonya Huang: Yeah, and they’re Princeton guys as well. Right?
Eno Reyes: Yeah, yeah, they are. We actually shared a thesis advisor.
Sonya Huang: No way. That’s very cool. Well, Eno, Matan, thank you so much for the conversation. Congratulations again on these results and on launching Factory. We are so excited.
Eno Reyes: Thank you. Thank you very much.
Mentioned in this episode
- Juan Maldacena, Institute for Advanced Study, the string theorist Matan cold-called as an undergrad
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering, small-model open-source software engineering agent
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, an evaluation framework for GitHub issues
- Monte Carlo tree search, a 2006 algorithm for solving decision making in games (and used in AlphaGo)
- Language agent tree search, a framework for LLM planning, acting and reasoning
- The Bitter Lesson, Rich Sutton’s essay on scaling in search and learning
- Code churn, time to merge, cycle time, metrics Factory thinks are important to eng orgs
Inference essay:
- The Compound Lever: AI for Software Engineering, by Sonya Huang and Pat Grady