
Mapping the Mind of a Neural Net: Goodfire’s Eric Ho on the Future of Interpretability

Eric Ho is building Goodfire to solve one of AI’s most critical challenges: understanding what’s actually happening inside neural networks. His team is developing techniques to understand, audit and edit neural networks at the feature level. Eric discusses breakthrough results in resolving superposition through sparse autoencoders, successful model editing demonstrations and real-world applications in genomics with Arc Institute’s DNA foundation models. He argues that interpretability will be critical as AI systems become more powerful and take on mission-critical roles in society.

Summary

Goodfire Founder and CEO Eric Ho, whose team includes some of the top minds in mechanistic interpretability, believes the future of AI hinges on understanding and shaping the inner workings of neural networks rather than treating them as inscrutable black boxes. The episode emphasizes that interpretability is more than a scientific curiosity—it’s a foundational capability for building safer, more reliable and truly intentional AI that serves humanity as it scales into critical domains.

Interpretability is essential for trust and reliability: As AI models deploy in mission-critical roles, black-box evaluations aren’t enough. Understanding how models process information and make decisions—like studying drug biochemistry versus just clinical outcomes—is critical for ensuring safety, reliability and alignment with human values.

Mechanistic interpretability techniques are scaling to real models: Sparse autoencoders can now “unscramble” artificial neurons into interpretable concepts, even in 671 billion parameter models. These advances enable mapping internal logic and early model editing, moving from toy demonstrations to production-scale systems with meaningful understanding.

Fine-tuning and prompting have dangerous blind spots: These black-box methods can produce unpredictable results—in one test, training on insecure code led models to want to “enslave humanity.” Without understanding internal circuits, seemingly harmless adjustments can amplify latent behaviors with dangerous, unexpected consequences across unrelated domains.

Model auditing and editing capabilities will be indispensable: As open-source models proliferate globally, the ability to audit biases, identify problems and surgically edit behaviors becomes vital infrastructure. This enables safety compliance, debugging failures, explaining decisions and removing unwanted nationalist or ideological biases from foreign-trained models.

Interpretability should be core AI infrastructure, not an afterthought: Independent research across labs and modalities accelerates progress. Interpretability will support everything from training data selection and model debugging to safety assurance and customization—transforming AI development from “witchcraft” into intentional engineering.

Transcript

Eric Ho: So Goodfire is an AI interpretability research company really trying to answer the question of, you know, what’s actually going on inside the mind of a neural net. So kind of the ultimate goal and the ultimate reason why we started everything was, like, we just see neural networks kind of going into more and more mission critical contexts, and I think it’s going to be enormously transformative for society. But in order to do so—and you want to build it safely, powerfully, reliably, and I think, like, it’s going to be critical to be able to understand, edit and debug AI models in order to do that. And so that’s what we’re kind of enabling for the very first time. It’s like unlocking the black box of a neural network such that you can intentionally design it rather than just kind of grow it from data.

Sonya Huang: What if we could crack open the black box of AI and see exactly how it thinks? Today we’re joined by Eric Ho, the founder of Goodfire, who’s building tools to peer inside neural nets and understand their minds. Eric reveals how his team has successfully disentangled the mysterious phenomenon of superposition, or single neurons that code multiple concepts, and can now steer AI behavior with increasingly surgical precision.

We explore whether interpretability could help us discover new biological insights, edit out harmful behaviors from large language models, and even understand our own brains better. Eric boldly predicts that we’ll fully decode neural nets by 2028, transforming AI from black boxes into more intentional design.

Enjoy the show.

Eric, thank you so much for joining us today.

Eric Ho: Of course, yeah. Happy to be here. Thanks for having me.

Can we ever trust generative AI?

Sonya Huang: First question. Can we ever trust generative AI if these foundation models are very much black boxes?

Eric Ho: Can we ever trust them if they’re black boxes? So I guess, like, maybe thinking about what would happen if we were to just kind of deploy AI models as black boxes, like, in perpetuity. So the black box way to do this would be, like—and I’m kind of assuming that we’re playing this forward a few years and we want AI in charge of, like, really mission critical applications, like maybe being in charge of our power grid or making big investment decisions, maybe even for a seed investment at Sequoia or, like, a really large, you know, like, million-dollar investment decision.

And I think, like, the black box way to make sure that the AI is performing appropriately is you take a look at evals and you run a bunch of evaluations to make sure that it’s behaving appropriately in test sets. And then you look at its track record and see if it’s reliable enough to perform across a wide variety of things. And I think the question then is like, why not take all of this additional signal that you get from looking inside a neural network and trying to play forward, like, how it’s going to behave in a much wider, broader set of situations? Why not look inside and actually get a bunch more reliability, certainty about how it’s thinking, how it’s approaching the problem? And I think, like, you’re just leaving a bunch on the table if you’re not looking for all the signal that you can get.

So the way that I think about this is, like, I don’t know, when you’re manufacturing a new drug it’s like you can do the black box way of just seeing how humans respond to the drug in a clinical trial, whereas, like, you could also just, like, kind of look inside and, like, look at biochemically, like, how the drug is processed or, like, drug interactions at the molecular and cellular level. And yeah, I just feel like there’s so much to be learned when you actually look inside and deeply understand something.

Sonya Huang: How possible do you think it is to look inside and deeply understand a large language model? Do you think it’s on the scale from hopeless, we can’t ever understand it, it’s just a black box, too many neurons, to we can actually map out the mind of a neural net? I’m curious where you think the field will be.

Eric Ho: Well, I’m very biased, but I think it’s very, very possible. A lot of the people in mech-interp come from backgrounds in computational neuroscience or cognitive science. And those people, when you’re actually looking inside the brain, you spend so much time, like, trying to understand what a single neuron does or just getting any signal whatsoever. And in the field of mech-interp, you have perfect access to the neurons, the parameters, the weights, the attention patterns of a neural network, so you’re coming in with a huge advantage for—at least you get all of the data that you need. So then the real question is like, how can we make progress? How can we try to understand and seek to understand all of it? And I think we just got to try. I think it’s deeply necessary and critical for the future. And we have, like, a norm established. We can explain some percentage of the network by reconstructing it and extracting, you know, its concepts and the features that it uses in order to generate its response. And once you have at least a baseline rudimentary understanding kind of where we’re at right now, you can hill climb on that metric and seek to understand more and more of the network.
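As a rough illustration of the kind of metric he’s describing, here is a minimal sketch of a fraction-of-variance-explained score for reconstructed activations, computed on random stand-in data; the exact metric Goodfire hill-climbs on may differ.

```python
import numpy as np

def fraction_of_variance_explained(acts: np.ndarray, recon: np.ndarray) -> float:
    """1.0 means the reconstruction perfectly recovers the activations;
    0.0 means it does no better than predicting the mean activation."""
    residual = ((acts - recon) ** 2).sum()
    total = ((acts - acts.mean(axis=0)) ** 2).sum()
    return 1.0 - residual / total

# Stand-in data: 1,000 activation vectors from a 512-dim layer,
# and a deliberately imperfect "reconstruction" of them.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 512))
recon = acts + rng.normal(scale=0.5, size=acts.shape)  # noisy reconstruction

print(f"Fraction of variance explained: {fraction_of_variance_explained(acts, recon):.2f}")
```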

Roelof Botha: Do you think it’s going to be necessary for us to understand neural nets to really harness them long term? Because I think many other technologies we’ve invented along the way, humans didn’t really understand underlying physics or chemistry, but still we were able to make good of medicines or, you know, totally basic propulsion techniques without understanding, you know, all the physics.

Eric Ho: Yeah, I think it’s going to be critical for the future, just given how transformative I think AI is going to be. Like, I think AI is going to be everywhere running mission critical parts of our society, and we can get really, really far just by treating AI as a black box, but I don’t think we can truly be able to intentionally design AI as, like, the new generation of software without white box techniques.

So maybe one example I think about is, you know, in the early 18th century, we invented the steam engine, and we were able to just increase the size of the boiler and increase the amount of pressure going in and it scaled reasonably well. But steam engines also blew up. We didn’t understand thermodynamics at that time, so we didn’t actually know, like, the ideal size of the boiler or the ideal pressure, the ideal way to construct a steam engine. And so after we invented thermodynamics, like, things started becoming a lot safer, a lot more reliable, and huge innovations happened afterwards. But already the steam engine kicked off the Industrial Revolution. So even just by treating it as a black box, you get a really long way.

Will interpretability help us understand the human brain?

Roelof Botha: Do you think there’s any chance that if we understand neural networks in a computer science context, it might actually help us accelerate our understanding of neuroscience for the human brain?

Eric Ho: I think so, but that’s a big claim, I think.

Like, we were just having an interesting conversation last night. We had, like, a dinner together about, like, do we think in language? Do you think in concepts, or something else entirely? Like, I’m a person that doesn’t really think in language. I think much more, like, conceptually, maybe in the latent space of models. Whereas our head of product, Myra, said that she basically is totally faithful to her own chain of thought, like, speaks in language, and she basically just thinks sequentially with a really, really strong internal monologue. So sure, maybe some of these insights that we get will translate to humans in our own psychology. And I think that’s the hope. It’s like, yeah, the more that we can understand about AI, hopefully the more that we can understand about ourselves.

Roelof Botha: There’s an interesting analogy, by the way, that in neuroscience, often things that have gone wrong help illuminate and create insights into the human brain.

Eric Ho: Yeah.

Roelof Botha: You know, people who suffer from specific conditions, or people who’ve suffered particular brain injury types have actually perversely enabled us to better understand the brain, accidentally. I wonder if something similar might happen with neural networks as well.

Eric Ho: Yeah, I hope so. What’s that, like, popular story about the guy who just got an iron rod through his brain and it made him a totally different person.

Roelof Botha: Yeah.

Eric Ho: Yeah.

Roelof Botha: Anyway …

Eric Ho: Yeah. There’s also this—you know, to add to that, there’s this concept of universality where, like, among totally different neural networks, similar kind of like circuits or thought patterns tend to emerge between all these neural networks. So even in—look, we found in vision models, like, very similar kind of circuits to our own visual cortex.

And I think there’s this, like, idea of universality where maybe intelligence is just this thing that you gradient descent to. And then that’s how our brains found, like, intelligence and that’s how artificial minds would find intelligence as well. Like, there’s some truth to intelligence.

Sonya Huang: My own neural net is probably pretty sparse. [laughs]

Eric Ho: It’s pretty sparse. [laughs]

What is Goodfire?

Sonya Huang: I would love to double click into some of the results you’ve had from mech-interp and the broader field and your lab. But before we get into it, can you say a word about Goodfire and what you all are building?

Eric Ho: Yeah, so Goodfire is an AI interpretability research company really trying to answer the question of, you know, what’s actually going on inside the mind of a neural net. So kind of the ultimate goal and the ultimate reason why we started everything was, like, we just see neural networks kind of going into more and more mission critical contexts, and I think it’s going to be enormously transformative for society. But in order to do so—and you want to build it safely, powerfully, reliably, and I think, like, it’s going to be critical to be able to understand, edit and debug AI models in order to do that. And so that’s what we’re kind of enabling for the very first time. It’s like unlocking the black box of a neural network such that you can intentionally design it rather than just kind of grow it from data.

Sonya Huang: And if everything goes right, what do you think will be the impact that you all have on the world?

Eric Ho: So maybe one metaphor that we like and think about is, like, yeah, right now, like, you just kind of like, grow AI from a seed and then it just, like, grows like a giant tree. And it grows all wild and crazy, right? We don’t really know kind of a lot of the things that it’s growing into. And all sorts of interesting and weird stuff can happen with a really large neural net. But if everything goes right with interpretability, we’ll know how every single piece of training data affects, like, the cognition that the model develops, the units of computation that it uses. And I almost think of it more as, like, bonsai, where you want to kind of intentionally design and shape and grow the neural network. Like, still in an unsupervised AI-driven approach, like we’re not going to hand-prune every single weight of a neural network, but I think we’ll gain the ability to, during every single piece of the training, post-training process, just intentionally shape an AI model such that it serves humanity and does what we want.

Analogy to the Human Genome Project

Roelof Botha: It sounds like a parallel to the Human Genome Project in some sense.

Eric Ho: Yes.

Roelof Botha: Given some of the work I’ve done in genetics. So this idea that we need to read DNA, we need to understand the building blocks of life. And then ultimately now we’re starting to edit DNA and use CRISPR to come up with interesting cures for diseases or an ability to edit crops to make them more resistant to pesticides and things like that. So it’s just a very interesting parallel.

Eric Ho: Yeah, definitely. We think about that analogy a lot. And I know you had Patrick Hsu on the podcast at some point as well, and we’re working with him at Arc Institute to crack the code of the human genome as well. And I think there’s a lot of really interesting parallels and also direct applications of AI interpretability as well.

Roelof Botha: Would you go so far as actually making edits? What I heard from you earlier, the bonsai analogy was a little bit of a shaping, which is quite different in my mind from editing. You know, shaping to me might be train and be fit so that your body can survive given a certain DNA. And then there’s editing, which is altering the DNA. Are you going to do both?

Eric Ho: Yeah. I think in short, yes. I don’t know, like, what the—so interp as a field, like, there’s still a lot to figure out. It’s still pretty new. But I think in bonsai, you also prune a lot of branches, and you prune a lot of the areas that you don’t want to grow such that you can kind of shape the overall tree to, like, grow in the pattern that you want. So I think the eventual system that we hope to build, like, you can ask questions of the model, like “Why did you come up with this response?” and get a faithful explanation, while also being able to make direct surgical interventions in the mind of the model such that we can remove harmful behavior, enhance good behavior. And still remains to be seen whether it’s just like a direct weights modification, or some other kind of shaping function that is most effective. 

Unintended consequences of current steering methods

Sonya Huang: If you think of some of the ways that people are trying to prune these bonsai trees today, I think it’s a lot of prompt engineering, fine-tuning, RL tuning increasingly now. What do you think about that as the approach to kind of steer the behavior of these models versus actually go in and introspect and examine each of the individual neurons?

Eric Ho: Fundamentally, like, these are black box things. Like, all sorts of weird things can happen when you fine tune a model, for example, or prompt a model and take it out of distribution, and it can say all sorts of crazy stuff. So the paper that’s most interesting about this recently, I don’t know if you’ve caught this, is this, like, emergent misalignment study. No? Okay. This is, like, Owain Evans’ group, where if you fine tune a model on just insecure code, so it’s just bad code that, you know, has all sorts of, like, cybersecurity vulnerabilities, it’ll then start doing all sorts of insane things like wanting to enslave humanity or praising Hitler and other dictators. And it’s a really surprising result because it’s just insecure code. And so it kind of shows that maybe what you’re doing with fine tuning is you’re telling the model, like, “Hey, do more of this, less of this,” and almost enhancing the circuits that you want more of, but you can also have all sorts of unintended consequences like this that show up. And these circuits, these are still really alien cognition. Like, there’s some parallels to humanity, but we really don’t understand how these networks think, and they’re not human thinking. So if you enhance the bad code snippets, it also is, like, fundamentally linked to maybe all sorts of, like, other undesirable behaviors and properties.

Roelof Botha: There’s a different twist on the nature-nurture debate. [laughs]

Eric Ho: Yeah.

Roelof Botha: Because in that situation, it almost feels as though you’ve imbued that particular model with bad DNA, if you will. It’s sort of fundamentally an evil thing or a bad thing, and then it ends up manifesting all sorts of bad behavior in other domains. It’s really interesting.

Eric Ho: Yeah, maybe. Or maybe these models kind of understand right and wrong, and if you enhance the wrong, then all sorts of other behavior is interlinked and expressed. But I don’t know, the way that I think about it, these models are just like the functions of their training data. And these models are trained on everything, like, all sorts of misbehavior as well. And you want it trained on incorrect behavior as well because, like, otherwise it won’t know to refuse harmful requests or to not do a certain set of things.

The personalities of models

Sonya Huang: Do you have an intuition for why different base models have different personalities? For example, the newest Claude series, I think one of the models, maybe Opus or Sonnet, really cares about animal welfare, for example, and the others don’t. Do you have a sense for why these models develop pretty distinct personalities?

Eric Ho: I think it’s just a function of how they’re trained, and it’s really, really hard to anticipate in advance. Yeah, I don’t know, I feel like it might just be me, but Claude 4 Opus too is enormously sycophantic. Like, I’ll kind of nudge it in one direction and it’ll just, like, agree with me wholeheartedly, and then nudge it in the other direction, pose a counterexample, and it’ll just be like, “Yes, I was totally wrong before. Nothing I said earlier was correct.” And I just think it’s, like, really, really hard. It kind of goes back to, like, the witchcraft almost of training an AI model today where you’re just, like, throwing in training data into the model and whispering the incantation of gradient descent, and then, like, trying to, like, get what you want out of it. And something pops out and then it really cares about animals.

Sonya Huang: That’s a great visual.

The field of mech-interp

Sonya Huang: I’d love to talk about your research results so far, both at Goodfire and in the broader mech-interp field as a whole. Maybe could you just give us a 30,000-foot fly-over view of mech-interp as a field. How old is it? What are the key results so far? What are the big open questions?

Eric Ho: Yeah. So mech-interp as a field, I think maybe just in this tradition that we’re building on, I think there are all sorts of studies looking inside neural networks all the way back when we first designed neural networks. But I think the way that the field kind of thinks about itself, like, mechanistic interpretability was started at OpenAI with Chris Olah, and Nick Cammarata and a couple other folks who first put out this really big circuits thread that posited three things. One is like, there are features in neural networks which are directions in latent space that represent concepts that the model uses to generate its response. Circuits, these are features that fire together to create higher order concepts. The example that they lay out is, like, you have a car window detector, and then a car body detector, and then a car wheel detector, and then that’s like a car circuit. And universality is the third tenet, like, similar circuits evolve in different neural networks.

And so this was almost the start of the field of mechanistic interpretability in my mind. And so that really kicked off a lot of interesting research and results in the feature circuits paradigm. And I think as for the main players in the field, there are a lot of academic labs that are doing great work, and then there’s Anthropic. So Chris Olah is one of the co-founders of Anthropic, and is building a great interpretability lab there. And DeepMind has an interpretability lab as well. And then we’re kind of like the newer entrants in the field and on the stage.

And I think one of the other really key things to have happened was understanding and mostly resolving superposition. So superposition is this idea that, like, each neuron is responsible for encoding multiple concepts, and there are more concepts than dimensions in a neural network. So if you think about a neural network as a giant compression algorithm, you’re compressing the entirety of the internet into a relatively small number of parameters, and so that means every single neuron needs to encode, or at least every single layer of the model needs to encode more concepts than it has dimensions. And so there’s this concept of superposition where you have concepts represented as, like, near orthogonal directions in latent space such that you can represent all of these concepts in a model’s latent space.
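A quick way to see why superposition is even geometrically possible: a layer with d dimensions can hold far more than d directions that are all nearly orthogonal to one another. A tiny illustrative sketch with random directions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_concepts = 512, 3000          # many more "concepts" than dimensions

# One random unit vector per concept.
vecs = rng.normal(size=(n_concepts, d))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Largest pairwise cosine similarity (excluding each vector with itself).
cos = vecs @ vecs.T
np.fill_diagonal(cos, 0.0)
print(f"{n_concepts} directions in {d} dims, max |cosine| = {np.abs(cos).max():.2f}")
# The maximum overlap stays well below 1 (roughly 0.25 here), so thousands of
# near-orthogonal concept directions can share one layer's activation space.
```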

And so to resolve this, you have to almost untangle and unscramble a neuron such that it’s responsible for one clean, interpretable concept. And so a group at Apollo Research led by Lee Sharkey, who’s actually now at Goodfire, first pioneered sparse autoencoders for language models. And Anthropic also really popularized this with their big paper “Towards Monosemanticity,” and then right afterwards, like, “Scaling Monosemanticity,” showing that you can essentially unscramble these neurons into higher order concepts reliably and at scale with arbitrarily large neural networks.

And I think that was a really big moment for interpretability, where you can now, in a totally unsupervised way, unscramble neurons of a neural network to understand them and get clean concepts. So the concepts aren’t totally clean yet. You can’t edit them super well. There’s all sorts of problems with this, but it’s almost a really big step forward for the field such that we can do this in an unsupervised way, and the techniques and interpretability scale, which is really important.
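For the mechanics, here is a minimal sparse autoencoder of the kind described above, trained on stand-in activations; real setups differ in scale, initialization and the exact sparsity penalty, so treat it as a sketch rather than Goodfire’s or Anthropic’s recipe.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Maps d_model activations into a wider, sparse latent space and back."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        latents = torch.relu(self.encoder(x))   # sparse, non-negative features
        recon = self.decoder(latents)
        return recon, latents

d_model, d_latent, l1_coeff = 256, 2048, 1e-3
sae = SparseAutoencoder(d_model, d_latent)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

# Stand-in for residual-stream activations collected from a base model.
acts = torch.randn(10_000, d_model)

for step in range(200):
    batch = acts[torch.randint(0, acts.shape[0], (512,))]
    recon, latents = sae(batch)
    # Reconstruction loss plus an L1 penalty that pushes latents toward sparsity.
    loss = ((recon - batch) ** 2).mean() + l1_coeff * latents.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each row here is one candidate "feature direction" in the base model's
# activation space, shape [d_latent, d_model].
feature_directions = sae.decoder.weight.T
```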

Roelof Botha: Does that mean the superposition isn’t real, or that you, much like Heisenberg’s uncertainty principle, you sort of collapse it at a particular moment in time to know that in this instance it represents a particular direction?

Eric Ho: So I think it means it was real. Neurons are responsible for encoding multiple concepts such that once you unscramble them, then you can do really interesting things with a clean neuron. So the way that we do this is using an interpreter model, trained on the activations of the base model, and then now you have all sorts of neurons in the interpreter model that represent theoretically clean sparse concepts.

Roelof Botha: In the interpreter model, not in the original model. The original model still has this characteristic of superposition.

Eric Ho: That’s right.

Roelof Botha: Okay, got it. Thank you.

Eric Ho: And in the interpreter model, you unscramble these neurons, and you associate these concepts with the concepts in the base model that you’re trying to interpret, and then you can do interesting things with it.

Roelof Botha: Got it. Thank you.

Eric Ho: Yeah, of course.

Mapping the mind of a neural net

Sonya Huang: How solved of a problem is this then, if you’ve already been able to kind of disentangle the superposition, then haven’t you already mapped out the mind, so to speak, of the neural net? And what’s ahead? 

Eric Ho: I think partially. I think it’s a partial mapping, and the technique has all sorts of flaws as well that we can improve upon. But I think, like, it gives us the first step towards understanding these models, especially going from toy model to actual network that people care about. So we’ve done a bunch of work recently on R1, which is a 671-billion parameter mixture-of-experts model. It’s a big boy model. And, like, the techniques scale really nicely all the way up to that point because it’s just more AI, more training of an interpreter model.

Roelof Botha: So obviously I presume there’s an asymptote here to understanding, because the models are going to get more and more complex over time and we’re going to beef them up.

Eric Ho: Yeah.

Roelof Botha: And so I’m guessing it’s sort of, you know, like the battle of Sisyphus. At some point, you know, this is a never-ending pursuit—which is great. Is that correct? Do you agree with that?

Eric Ho: In some ways, but I also think that like—so the techniques that we’ve developed work on toy models all the way up to, like, yeah, a big network that’s more capable and more intelligent and better. And I think the techniques also scale effectively with model intelligence. So one part of our pipeline is that for every single latent concept in our interpreter model, we get another language model to reason about, like, what that concept actually represents in the base model. So this is a concept called “auto interpretability,” which Nick Cammarata, who’s at Goodfire now, pioneered at OpenAI. And this technique, because it’s a language model reasoning about what a neuron represents, scales with the quality of the language model. So it actually gets better. So because we use AI in order to understand AI, the better our models, these analysis agents, are at interpreting what’s actually going on, the better we are able to understand them.
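The auto-interpretability loop is simple to write down: collect the snippets where a latent fires most strongly and ask another language model what they have in common. The sketch below assumes a hypothetical `ask_llm(prompt)` helper standing in for whatever analysis model you use; everything else is generic.

```python
from typing import Callable

def auto_interpret_latent(
    latent_id: int,
    snippets_with_acts: list[tuple[str, float]],  # (text snippet, latent activation)
    ask_llm: Callable[[str], str],                # hypothetical LLM call, supplied by you
    top_k: int = 20,
) -> str:
    """Propose a natural-language label for one interpreter-model latent."""
    # 1. Keep the snippets where this latent fired most strongly.
    top = sorted(snippets_with_acts, key=lambda p: p[1], reverse=True)[:top_k]

    # 2. Ask an analysis model what the strongly-activating snippets share.
    examples = "\n".join(f"- ({act:.2f}) {text}" for text, act in top)
    prompt = (
        f"Latent {latent_id} of an interpreter model fires on the snippets below "
        "(activation strength in parentheses). In one sentence, what concept "
        "do they have in common?\n" + examples
    )
    return ask_llm(prompt)

# Because the labeler is itself a language model, the quality of these
# explanations improves as the analysis model improves.
```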

Roelof Botha: Got it.

Eric Ho: And with our interpreter model techniques, it’s just, like, if we develop better interpreter models, they theoretically should translate to more and more intelligent and larger and larger networks because these are unsupervised, scalable techniques. And that’s the paradigm in AI interpretability.

Real-world applications

Roelof Botha: When do you think you reach a minimum threshold that makes you feel it’s ready for a real-world application? Maybe we’re there already.

Eric Ho: I think we’re there. Yeah, I think the first real-world applications are already out there. And yeah, I think we’re there on the very early applications.

Sonya Huang: Can you share more about this?

Eric Ho: Yeah, I was being unnecessarily cryptic.

Sonya Huang: [laughs]

Eric Ho: So yeah, a couple of the partnerships I’m most excited about. So we worked with Arc Institute, like I mentioned a little bit earlier, to understand and interpret Evo 2, which is their, like, kind of DNA foundation model. So it’s a sequence-to-sequence model, so it takes in a sequence of nucleotides and it predicts the next nucleotide in a sequence. And our theory is this is a narrowly superhuman model. So we really like to work on narrowly superhuman models because it can teach us something about the world that humans don’t really know. And so the idea is this model is representing just an enormous amount about the biological world in order to properly model the next nucleotide in a sequence. So what we did was we sought to understand, like, what does it actually know such that it can model the world so effectively?

So what we did was we trained sparse autoencoders on the activations of this model, extracted all sorts of features that were related to concepts that the model should know, like kind of normal biological concepts that we have really strong ground truth annotations for. So these are like tRNAs, RNAs, start coding sequences, all sorts of biological concepts that we have ground truth annotations for, and we associated them with features in this model. And then the question is, like, okay, now we have all of these other features of the model that we’ve extracted, what do they mean? What are they? They might just be ways that the model is computing and thinking, or they could represent novel biological concepts that the model is using to generate the next nucleotide in a sequence.
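One simple way to test whether an extracted feature lines up with a known biological concept is to compare where the feature fires along a sequence against ground-truth annotation intervals, for example with an F1 score. A sketch on made-up positions, not Arc’s or Goodfire’s actual evaluation pipeline:

```python
import numpy as np

def feature_vs_annotation_f1(
    feature_acts: np.ndarray,     # per-nucleotide activation of one SAE feature
    annotation_mask: np.ndarray,  # 1 where the annotated concept (e.g. a tRNA gene) is present
    threshold: float,
) -> float:
    """F1 between 'feature is active' and 'annotation says the concept is here'."""
    pred = feature_acts > threshold
    tp = np.sum(pred & (annotation_mask == 1))
    fp = np.sum(pred & (annotation_mask == 0))
    fn = np.sum(~pred & (annotation_mask == 1))
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    return 2 * precision * recall / (precision + recall + 1e-9)

# Toy example: a 10,000-nucleotide window with one annotated region.
rng = np.random.default_rng(0)
annotation = np.zeros(10_000, dtype=int)
annotation[4_000:4_500] = 1                      # e.g. an annotated tRNA gene
feature = rng.random(10_000) * 0.2
feature[4_000:4_500] += 0.8                      # the feature fires on that region

print(f"F1 = {feature_vs_annotation_f1(feature, annotation, threshold=0.5):.2f}")
```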

Roelof Botha: That’s really interesting. For a long time, there was this idea that we have a bunch of junk DNA—you may have read about this. And it turns out a lot of that DNA actually serves a particular purpose in a different part of evolution, or that they govern the expression of other genes. And so, you know, nature generally doesn’t want to harbor things that don’t have value because it’s expensive, you know, just from a biological system point of view. So that’s super interesting. I’m looking forward to the results.

Eric Ho: Yeah, totally. Totally. And I think, like, hopefully using unsupervised AI techniques, we can better understand what all of these portions of the DNA are actually doing. Like, maybe we can discover the idea of junk DNA faster, or understand that DNA is not junk DNA faster, or just discover totally novel things that genes are doing and expressing within us.

Roelof Botha: Interesting.

Eric Ho: Yeah.

Can we edit models yet?

Sonya Huang: Where is the research as far as going from understanding and mapping towards editing? So for example, being able to reach in and change this weight from here to there. I’m curious if you all have any results there yet.

Eric Ho: Yeah. So we’ve done most of our editing work on, like, language models and image models. Like, our most recent release was Paint with Ember. Ember is our kind of foundational infrastructure for interpretability. And what we were able to do with this, like, image model demo was targeted precise control over an image model by painting. So we could extract latent concepts like a dragon or dragon wings or an ocean or a pyramid, and then take these concepts and directly intervene on, like, the portion of a canvas that we want to intervene on. So you can paint on a dragon with wings and then add a crowd in the corner and add a pyramid. It’s a really fun demo that’s just a kind of a joy to play with. So it’s out right now; anybody can play with it. It’s just like paint.goodfire.ai. But we’re able to reasonably intervene in certain situations on a model’s latent and steer the model to do what we want, but we haven’t quite cracked the idea of direct precision surgical edits that create a new model that you want to use and doesn’t have any unintended side effects. So that’s still something that we’re pushing on and trying to figure out.
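The painting intervention can be pictured as adding a concept’s direction to the image model’s activations, but only at the spatial positions the user painted over. A conceptual sketch on a stand-in activation tensor; the real Ember pipeline hooks into an actual generative model and is more involved.

```python
import torch

def paint_feature(
    acts: torch.Tensor,        # [batch, channels, H, W] activations at some layer
    direction: torch.Tensor,   # [channels] direction for one extracted concept
    brush_mask: torch.Tensor,  # [H, W] with 1.0 where the user painted
    strength: float = 8.0,
) -> torch.Tensor:
    """Steer only the painted region toward the chosen concept."""
    direction = direction / direction.norm()
    # Broadcast: add the concept direction at every painted spatial position.
    return acts + strength * direction.view(1, -1, 1, 1) * brush_mask.view(1, 1, *brush_mask.shape)

# Toy example: paint a hypothetical "dragon wings" concept into the top-left corner.
acts = torch.randn(1, 512, 32, 32)       # stand-in for image-model activations
dragon_wings = torch.randn(512)          # stand-in for an extracted concept direction
mask = torch.zeros(32, 32)
mask[:16, :16] = 1.0                     # the painted region

steered = paint_feature(acts, dragon_wings, mask)
```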

Sonya Huang: Do you think that’s where the field ultimately goes? Or do you think people are focused on different parts of the field?

Eric Ho: I think there are many places where the field is going to go, and this is one of them. Like, interpretability is almost like such a general term that—again, I’m biased, but I think it’s just like governing and underlying all aspects of AI. It’s like anytime you prefer to take a white box approach to doing something versus a black box approach, like, interpretability can probably help in the future. So how do you select your training data? Maybe you want to understand whether the training data is surprising to the model before putting it into the model, because then it can have the most impact on training. Yeah, just like in every single part of the AI development stack, I think interpretability will help and change the way that we do things.
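One simple proxy for “is this training example surprising to the model?” is the model’s own per-token loss on the example before training on it. The sketch below uses a small open model from Hugging Face as a stand-in; treating loss as surprisal is an assumption here, not Goodfire’s stated method.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def surprisal(text: str) -> float:
    """Mean per-token negative log-likelihood of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

candidates = [
    "The cat sat on the mat.",
    "Zygomorphic florets exhibit bilateral symmetry in asteraceous capitula.",
]
# Rank candidate training examples from most to least surprising.
ranked = sorted(candidates, key=surprisal, reverse=True)
print(ranked[0])
```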

Interpreting open models

Roelof Botha: If AI foundation models go the way that a lot of software has gone, certainly infrastructure software, where much of it is open source or open weight, is there an opportunity for you to play an invaluable role in judging the biases or likely outcomes of using different open weight models?

Eric Ho: I think we could, yeah. So there’s maybe, like, two areas of research that we’re interested in that intersect with this idea. So auditing. So, like, how do you take a model, understand, like, what’s going on, find problematic behavior and good behavior, hopefully get rid of the bad behavior and enhance the good behavior. So I think, like, as AI gets deployed in more and more mission critical contexts, it’s like that becomes more important. And then also model diffing. So it’s like when you have two checkpoints of a model, like, how do they differ from each other and what’s changed? So recently, like, GPT-4o was enormously sycophantic for a period of time, just like really gassing up the user, like, telling them that they’re doing great.

Sonya Huang: Still is.

Eric Ho: It still is?

Sonya Huang: Pat recently asked it who the most handsome Cribl board member was, and it was like, “Definitely Pat Grady.”

Eric Ho: [laughs]

Sonya Huang: Still a bit sycophantic.

Eric Ho: That’s so good. That’s so good. Yeah. But yeah, like model diffing, like, you should be able to detect, like, how a model has changed from checkpoint to checkpoint, like, what surprising things have happened that were unintended that are now contained in the network that weren’t there before?
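A first-pass version of model diffing is to run the same probe prompts through both checkpoints and flag the prompts where the next-token distributions diverge most. A sketch on stand-in logits; deeper diffing would compare features and circuits rather than just outputs.

```python
import torch
import torch.nn.functional as F

def checkpoint_divergence(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    """Per-prompt KL(checkpoint A || checkpoint B) over next-token distributions.

    logits_a, logits_b: [n_prompts, vocab_size] next-token logits from each checkpoint.
    """
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1)

# Stand-in logits for 1,000 probe prompts and a small vocabulary.
torch.manual_seed(0)
before = torch.randn(1000, 8_000)
after = before.clone()
after[123] += torch.randn(8_000) * 2.0    # one prompt where behavior shifted

kl = checkpoint_divergence(after, before)
print("most-changed prompt index:", kl.argmax().item())  # flags prompt 123
```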

Sonya Huang: Why do you think it was so hard for OpenAI to roll back to a less sycophantic version of the model? And in an ideal state of the world, is there almost a dial and a knob that the OpenAI guys could tune on a scale of 0 to 100, how sycophantic do you want the model to be? Do you think we can get there?

Roelof Botha: I don’t know what questions you’re asking of the model, by the way, because I never encountered this particular problem. [laughs]

Sonya Huang: It always does this with me. And then sometimes it’s brutal the other way. I asked it for the best AI podcasts, and it listed 20 things with no Training Data. I asked, “What about us?” It said, “Oh, I didn’t want to give a biased result.”

Eric Ho: That’s so funny. Well, that’s part of what users want, right? They kind of want sycophancy. Like, you know, people want to hear what they want to hear. So when you RL a model, I think fundamentally you’re going to get—I think it’s just kind of a symptom of RL. It’s like this is what users want. This is user preferences.

The team at Goodfire

Roelof Botha: Along the way, you’ve dropped some names, and it seems as though most of them have ended up at Goodfire. I presume there is a certain number of very talented people in this field, and you’ve unfairly seemed to gather them. Can you describe a little bit more about your team and what you’ve pulled together?

Eric Ho: Yeah. So I mean, I think we have a really fantastic team, and that’s what we’ve been spending a lot of our time on the last year, just kind of like, I think, assembling a team of world-class interpretability experts that really have a shot of cracking this problem.

So it starts with my co-founders. I had worked with Dan Balsam, our CTO, for many years at my previous company. And our chief scientist, Tom, founded the interpretability team at Google DeepMind way back in the day. And we’ve just assembled many of the early folks in the field. So Tom, Nick Cammarata, who was working very closely with Chris Olah, who is generally considered the founder of the field of mech-interp. And Nick was on all of the original circuits papers and helped, like, build everything out at OpenAI. Lee Sharkey, who pioneered sparse autoencoders on language models, and is now working on some really interesting work in weights-based interpretability. So most interpretability techniques that have been deployed into applications are in concept space and activation space, and he and his group are working on weights-based interpretability techniques. And we’ve also just kind of pulled in scientists, senior scientists from other fields who care a lot about interpretability and have realized that this is one of the most important problems that we can work on. So Owen Lewis, who was a senior staff research scientist at Google working on coding agents, came over and is now leading a couple of directions here for us.

Roelof Botha: And you’re recruiting, right?

Eric Ho: And we’re hiring. Yeah. Scientists, engineers. I think it’s like we are hiring scientists, and that is deeply important for the future of the field. But also, like, it’s hard to overstate just how important good engineering skills are.

Sonya Huang: It’s an incredible team.

Eric Ho: Proud of the team, yeah, for sure.

The benefits of being an independent lab

Sonya Huang: This seems like core functionality for any of these foundation model companies to have. And as you mentioned, Chris Olah was at OpenAI, now Anthropic. OpenAI has their interpretability team as well. How do you think about the rationale for having a standalone mech-interp research company versus being inside one of the labs that should care deeply about this?

Eric Ho: I think we can just take a really different approach if we’re independent, I think. The benefit of being independent is we can think independently, push things forward independently, and also get a broader view of the ecosystem. So usually if you’re within a lab, you’re kind of doing interpretability work on your own models, and kind of pushing forward the field in that way, and you can make incredible progress that way. But I really do think that, like, a unique third-party perspective is deeply necessary in the field. And I think yeah, just given the team that we’ve assembled, like, a lot of those folks agree with that and that’s why they’ve joined. And also gives us, like, an ability to work with lots of interesting partners across different domains, and we can kind of unify those insights across all of these different domains that teach us more about, like, the inner workings of neural networks more broadly. So we work across modalities like genomics models, exomic models, image, video, language, and also across model architectures. And I think all that just helps.

Sonya Huang: Anthropic invested in you all, right?

Eric Ho: Yeah, that’s right.

Sonya Huang: Say more about that and how you partner with them.

Eric Ho: Yeah, so I think we were their first ever investment. They put in a check in our last round. And I think they just really care about interpretability and really kind of see the future as we do, where interpretability is just pretty critical to the future. So Dario just published an essay called “The Urgency of Interpretability.” And it’s one of his four essays that he has on his site, just talking about how he views this as almost like a race. And we see that very similarly, a race to get interpretability prior to super intelligent, really, really intelligent AI models. I just think it’s deeply critical to be able to understand these models before we have, in his words, a country of geniuses in a data center.

Sonya Huang: Do you think interpretability can help us with open models? And I think some people have a fear that, you know, models trained in other countries that may or may not be enemies of the United States have different nationalist properties. Can interpretability help us understand and even modify those for the American variant of some of these models?

Eric Ho: Yeah. Well, I definitely think so. I also think it’s relatively easy to—like, if you take a DeepSeek model, for example, it’s relatively easy to just tune it or add in more training data to remove a lot of the, like, propaganda in the model. But yeah, I think interpretability can help understand what’s actually inside of the model and then also change it and edit it to serve whatever end purpose that you want.

Roelof Botha: How long do you think before you’re going to be called in as a witness in a very important trial to try to understand why a model did something in particular?

Eric Ho: That’s a good question. I think a few years. [laughs] Who knows? I think it’s really—you know, I mean, we’re all sitting in, like, the Bay Area right now, but at this point, I’m pretty AGI pilled in that, like, I think AI progress will be pretty fast and pretty quick and transformative to society in ways that are really difficult to anticipate from where we’re sitting right now. And so yeah, I do think that there will be a couple, you know, like, big failure cases of AI models, and whether it’s me called in or, you know, somebody at a big lab or some other expert, I think that we’re going to want to be able to explain a model’s outputs. Yeah.

Roelof Botha: I agree with you on the rate of development, by the way. I think if you’ve read these articles, the human brain doesn’t intuit compounding.

Eric Ho: No.

Roelof Botha: And so I’ve even thought back to 20 years ago when I first met the self-driving car initiatives and Sebastian Thrun’s team from Stanford. It won the DARPA Grand Challenge. You know, you could sort of see the glimmers of self-driving cars, but even then, if you’d said 20 years later, you would have a self-driving car in San Francisco take you around, I’m not sure it would have been obvious that would be true. And maybe it took a few years longer for the true visionaries. I think the same is going to happen with AI. I don’t think we fully fathom what the world’s going to look like in 2030 or 2035.

Eric Ho: Yeah, I couldn’t agree more. And it’s just really hard to predict, you know? Like, even if we feel like it’s going to happen really, really quickly, it’s hard to predict all of the ways that society will be transformed.

Lightning round

Sonya Huang: Good note to end on. Should we do some rapid fire?

Roelof Botha: Some predictions?

Sonya Huang: Yeah. I need some predictions. It’s all recorded, so we’ll hold you to it.

Eric Ho: Great. Yes, Eric in 2035 will look back on how wrong he was with all these predictions.

Sonya Huang: Okay, maybe first: Inference time compute is the next important vector to scale models. Agree or disagree?

Eric Ho: I mostly agree. I think it’s one of the important—one of the things that we can scale up on. Yeah.

Sonya Huang: What application category do you think will break out next, after code?

Eric Ho: I think there will be a lot of enterprise transformations that happen. So just like automating manual routine tasks that people are doing many, many times a day.

Roelof Botha: Employment impact from AI?

Eric Ho: Vast. Vast, but once you cross a chasm, I think, it happens quickly. I think—well, my last company was helping find early career jobs for people, and using AI to automate that. And I think that that’s where we’re going to feel the impact first.

Sonya Huang: I agree. Recommended piece of content or reading for AI fans, maybe specifically in your field?

Eric Ho: I think the original circuits thread that I was referencing a couple of times, like, that’s still fantastic.

Roelof Botha: What’s either an AI app or maybe just an experience that you’ve had with AI that has blown you away recently? Something that just took your breath away?

Eric Ho: I think one of those moments where you really just feel how fast AI is happening was when I first played with o1-pro. Like, that was a model that I just really felt was actually reasoning about the world. And seeing, like, the kind of cross-domain transfer, too. I would ask it a strategic question, and it felt like it would actually understand all of the levers that I was considering with the business and consider that at least relatively thoughtfully and be a thought partner. And that’s both exciting because I now have this model that I can talk to about all sorts of critical problems. And not just, of course, trust it blindly, but also it’s like, wow, how did this happen?

Roelof Botha: Interesting. One of the things I’ve learned recently is that AI is still struggling to understand humor. And one of my partners, Andrew, actually had this joke that humor is humans’ way of showing off intelligence without actually explicitly bragging. And so maybe there is a lot of embedded intelligence in humor. Do you think interpretability will help us pinpoint sense of humor?

Eric Ho: To figure out why AIs don’t have a sense of humor?

Roelof Botha: Or help them to develop it?

Eric Ho: Perhaps. Who knows? Yeah, I hope so. Yeah, I think if I wake up to a model telling me jokes in Roelof’s voice, that would be …

Sonya Huang: That would be terrifying. [laughs] Okay, we’ll close with one last question, a prediction in your field. Do you think we will ever reach the point where we feel like we confidently understand the features, the circuits, the patterns, the weights of a neural net? And if so, what year do you predict we’ll reach that point?

Eric Ho: I think we can. I think that it might not look like what you just said, like the features, the circuits. I think it requires maybe a reconceptualization of what’s actually going on inside the model, like, a deeper, more fundamental understanding of the units of computation of a model. It’s almost like discovering truths about the universe, or about, like, neural nets in this case. But yes, I think we’re on track. I think we can do this. And we can do this—and hold me to this—in 2028 we’re gonna figure it all out. Yeah.

Sonya Huang: Fantastic.

Eric Ho: Yeah. Just a few years, I think we’re close.

Roelof Botha: Just in time for the LA Olympics.

Eric Ho: Yeah, that’s right.

Sonya Huang: Just in time for your next round of funding. I’m kidding. Eric, thank you so much for doing this today. Roelof and I loved the conversation.

Roelof Botha: Thank you.

Eric Ho: It was a pleasure. Yeah, so much fun. Thanks for having me.

Mentioned in this episode:

On the Biology of a Large Language Model: Goodfire collaboration with Anthropic