
From DevOps ‘Heart Attacks’ to AI-Powered Diagnostics With Traversal’s AI Agents

Anish Agarwal and Raj Agrawal, co-founders of Traversal, are transforming how enterprises handle critical system failures. Their AI agents can perform root cause analysis in 2-4 minutes instead of the hours typically spent by teams of engineers scrambling in Slack channels. Drawing from their academic research in causal inference and gene regulatory networks, they’ve built agents that systematically traverse complex dependency maps to identify the smoking gun logs and problematic code changes. As AI-generated code becomes more prevalent, Traversal addresses a growing challenge: debugging systems where humans didn’t write the original code, making AI-powered troubleshooting essential for maintaining reliable software at scale.

Summary

Traversal founders Anish Agarwal and Raj Agrawal bring deep AI research expertise in machine learning and causal inference to solve one of enterprise software’s most complex problems—automated root cause analysis for large-scale systems. This episode emphasizes how AI-native companies must continuously adapt their architectures while leveraging inference-time compute to handle enterprise complexity that traditional observability tools cannot address.

AI-native companies must architect for constant evolution and bet on emerging capabilities: The founders highlight that companies building in the current AI landscape face a fundamental challenge—they must constantly make six-month bets on where AI capabilities are heading and be willing to reevaluate their entire architecture. Traversal’s success came from making prescient bets on reasoning models in September, architecting their system to leverage these capabilities before they became widely available. This approach of surfing technological waves rather than building static solutions separates winning AI companies from those that get left behind.

Scale breaks traditional agentic approaches—inference-time compute is the solution: When Traversal moved from small companies to enterprise-scale customers with thousands of microservices, their accuracy dropped to zero percent and remained stuck despite prompt engineering efforts. The breakthrough came from shifting complexity from hard-coded workflows to inference-time compute, allowing their agents to spend tokens systematically exploring the problem space rather than relying on predetermined paths. This architectural decision enabled them to achieve over 90% accuracy in 2-4 minutes for incidents where the root cause exists in the data.

Enterprise fragmentation creates unexpected advantages for AI-native solutions: Counter to typical startup wisdom of starting with smaller customers, Traversal found they add more value at large enterprises where observability is mature but teams are fragmented. The proliferation of tools—Datadog, Splunk, Dynatrace, Elastic, Grafana, ServiceNow—creates data silos that no single human can synthesize, but AI agents can traverse. The pricing models of incumbent observability companies, based on data storage volume, disincentivize cross-platform insights, creating an opening for platform-agnostic AI solutions.

Domain expertise is less critical than AI-first thinking and rapid iteration: Despite having no observability industry background, Traversal’s team successfully tackles complex SRE problems that typically require armies of domain experts. The founders emphasize that success comes from experimental mindset and willingness to quickly iterate on AI-driven hypotheses rather than traditional credentials or domain knowledge. They built a team that’s 90% engineers with AI fluency, focusing on people excited about generative AI who can adapt quickly rather than observability veterans set in their ways.

The future of software reliability requires hybrid human-AI collaboration: As AI-generated code becomes prevalent in mission-critical systems, traditional debugging approaches break down because engineers lose the system knowledge that comes from writing code themselves. The founders predict this creates a “tale of two worlds”—fast-fashion vibe coding for disposable applications versus AI-powered maintenance tools for production systems. Future SRE teams will need fluency in both traditional reliability principles and AI system failure modes, while logging and instrumentation practices must evolve to serve AI consumers rather than human readers.

Transcript


Introduction

Anish Agarwal: If you’re in product, if you’re in design, if you’re in core engineering, you constantly have to make bets as to where AI is going to be six months from now, and that you’re willing to reevaluate everything six months from now. The good news is that it’s only going to get better. So it’s not like your product’s going to get worse six months from now, right? And so for example, one really interesting bet that has paid off is Raj and Raj were very prescient about the fact that reasoning models can get better, right? They made this bet in September, right? Just seeing where the world was going. And we architected our system such that the reasoning models would get to shine. And that has really paid dividends right now in the way our entire architecture is set up.

The future of DevOps

Bogomill Balkansky: Anish and Raj, welcome to Training Data. So wonderful to have you today. We have been working together for a year, and we can’t wait to share with our audience how AI and AI agents can transform the world of site reliability engineering and DevOps. And somehow we need to weave Negronis in the conversation someplace later on, but we’ll get to that. Let’s start with some quick hot takes. Will DevOps or SRE, as we know them today, even exist five years from now?

Anish Agarwal: Yeah, it’s a great question. I think it will, but I think it’ll look fundamentally different from the way it looks now.

Bogomill Balkansky: How so?

Anish Agarwal: I think in this world of DevOps and SRE, I think healthcare analogies generally fall quite naturally, I find. And in this world, if you think about healthcare and Maslow’s hierarchy of needs there, imagine stage one is where, let’s say, you’re having a heart attack. You have to solve that right now. Nothing else matters. Nothing matters five minutes from now other than solving the heart attack that you’re having, right? And to me, that’s analogous to dealing with a high severity incident.

And then stage two is where, let’s say you’re dealing with some sort of chronic issue that you face. You know, you have a sprained ankle or whatever illness that might be affecting you. And it’s very hard to plan three months in advance because that’s something that’s just affecting you every day, right? And I think that’s the analogous thing you do in DevOps is dealing with streams of alerts, streams of checking whether deployment was safe or not safe, right?

And then stage three is where I think of it as, like, life hacking, where you’re thinking about how do I optimize my sleep and my nutrients so that I have a high quality life and fulfilling life, right? And I think that’s the equivalent of, like, planning out what your next five years of infrastructure look like and how you invest in the right places. Now if I think of what unfortunately people in the DevOps, SRE, or on-call engineering space live like right now, it’s like having a heart attack twice a week, and dealing with a debilitating chronic condition that you have to handle every day.

Bogomill Balkansky: As I’m listening to you, I’m just trying to picture all the people who I know in DevOps and SRE in scrubs or wearing CGM monitors to hack their infrastructure.

Anish Agarwal: And I think people don’t realize it, but when you make this connection to healthcare, you also realize what we’ve gotten used to and what life really should be like in this world as we think about the healthcare of large scale software systems. And so I think if Traversal does its job right, then that first stage and second stage of needs, which is the really high severity incidents and that constant pain of death by a thousand cuts, should be something that AI and AI agents take care of, and DevOps engineers actually get to deal with the creative, fun parts of what their infrastructure should look like for the next year, for the next five years. And it becomes a much more fulfilling job.

Bogomill Balkansky: And many engineering teams are nowadays adopting autonomous coding, things like Cursor, Windsurf. Do you think that’s going to have a profound impact on how people maintain the reliability of their infrastructure, or will it be the case that AI will have a major role to play in fulfilling that vision, moving people away from being the intensive care unit surgeons to being more thoughtful planners?

Anish Agarwal: So I think it’s going to be—there’s a short-term answer and there’s a long-term answer. And I think it becomes, in the short term at least, it’s a tale of two worlds, I think. So because of everything happening in the world of Cursor and Windsurf and so on and so forth, I think the idea of vibe coding obviously has become very popular, right? And we can write a prompt or a few prompts and have something stood up that people can use and play with.

But I think the analogy here is of fast fashion where you try something on, you like it, you don’t like it, you discard it, you know, wear the next thing, right? And I think in that world actually, reliability doesn’t really matter because you don’t really need to take care of what you’ve created. There’s no craft to it—you create it, you throw it away.

But there’s another world which is where you start applying these AI-powered software engineering techniques to mission-critical systems, in payments, in financial institutions, in security, infrastructure, streaming. And in that world, I think it’s going to lead to a major issue. It probably already is and it’s only going to get worse, in my opinion. Because everyone—we have seen this with the enterprises we work with, everyone is using AI-powered software engineering tools to guide them as they write code. And what you actually find is that it actually even passes their local unit test.

So in that local piece of code, everything looks perfect, right? The problem is that in large-scale enterprises, things break when different pieces of your system interact in a way you just didn’t realize or you couldn’t foresee. And when that happens, because all of this code is being written by an AI system, it’s very hard to debug it because you just don’t have the context anymore. You didn’t write it anymore. And I think in that world, unless we find ways of using AI systems to do software maintenance, we’re going to be throttled, and either people will disallow AI software engineering tools to be used because there’s too much downtime or something of that form. And I think in that world, we’re going to need new tools and new software to help maintenance of such systems. I think that’s where Traversal can play a part.

Bogomill Balkansky: You definitely have your way with words and analogy. So I love the tale of two worlds. You either have Louis Vuitton high fashion or you have fast fashion with Shein or something like that. But I think we jumped into the deep end maybe a bit too quickly. Let’s go back a few steps and maybe explain for our audience, like, what actually root cause analysis stands for. Like, what does it do? Tell us a bit more about this wonderful world of troubleshooting and root cause analysis.

Deep-dive into Root Cause Analysis

Anish Agarwal: Yeah, so I think if I just think about the tale of the life cycle of an incident, the story of an incident, it always surprisingly looks the same across companies. A customer will log onto a platform, things are not working exactly right. Either it’s already been recorded or they make a complaint to customer support. Customer support looks at it and says, “Okay, it’s not a user error, it’s an actual issue.” It escalates in this game of telephone up, like, five ladders of different sophistication of engineering organizations within an enterprise. At some point it hits the DevOps SRE team who look at the issue and decide, is this worthy of an incident based on the impact of it, based on the severity of it, the immediacy of it? And if they decide it is worthy of an incident, then suddenly chaos ensues, right? You’ll have a Slack incident channel that’s created. There’s 30 to 50 people in a channel. Everyone’s kind of implicitly blaming each other for what happened, and it’s a bit of a whodunnit-type thing. And almost without fail, the same thing happens, which is there’ll be this 10X engineer who’s been at the company for … 

Bogomill Balkansky: —who’s the Agatha Christie of the whodunnit—

Anish Agarwal: Yeah, or the Sherlock Holmes of the situation. And, like, they’ll figure out exactly what happened, right? And not clear how they got there, but they got there. You roll back the—you come up with some sort of hot fix, and then everyone kind of looks for a longer term fix over time, right? And that’s typically how it all plays out.

And that whole flow is what we call the lifecycle of an incident, and root cause analysis in particular, right? And I think what I find incredible is that all of observability, these incredible companies have been created, they’re typically the second largest software spend companies have after your cloud spend, and yet we still have this state of root cause analysis. But anyways, that’s my take on what RCA looks like nowadays.

Bogomill Balkansky: And tell us maybe a bit more color on where root cause analysis lives in relation to the plethora of observability tools that companies use today. If this is still the number two spend in technology, why is it that you have 50 people in a Slack channel with a whodunnit kind of plot?

Anish Agarwal: It speaks to the importance of the problem, right? And what observability has done is the best it could have done given the technology that was available even, like, six months ago, right? So I think what observability is, it’s fundamentally about the creation of telemetry data. It’s called MELT data, so metrics and events and logs and traces. So MELT for short. And it’s about the creation of this data, storing it, and then providing a nice visualization layer on top of it so people can create the right slices to create little eyeballs in different parts of their system, right? And I think that’s where observability ends, which is a storage and visualization layer, because that’s all that could have been done, right? And the complex workflow of troubleshooting remains super manual. And that’s just because that’s all we could have done so far.
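For readers less familiar with the jargon, here is a minimal illustration of the four MELT telemetry types as structured records. The field names are generic stand-ins, not tied to any particular vendor’s schema:

```python
# The four MELT telemetry types, as minimal illustrative records.
melt_samples = {
    "metric": {"name": "checkout.latency_p99_ms", "value": 412, "ts": 1718000000},
    "event":  {"type": "deploy", "service": "checkout", "sha": "abc1234", "ts": 1717999800},
    "log":    {"level": "ERROR", "service": "checkout",
               "message": "timeout calling fraud-scoring after 800ms", "ts": 1718000005},
    "trace":  {"trace_id": "7f3a...", "spans": [
        {"service": "api-gateway",   "duration_ms": 430},
        {"service": "checkout",      "duration_ms": 415},
        {"service": "fraud-scoring", "duration_ms": 401},
    ]},
}
```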

And I think the promise of what these AI agent companies can do is that they automate complex workflows that we do on top of software. And to me, this problem of troubleshooting and root cause analysis is one of the most complex workflows that humans do in software. And I think now there’s an opportunity to kind of move up in terms of sophistication, where we use AI systems to actually automate this workflow versus just stopping at the storage and visualization layer.

What Traversal is building

Sonya Huang: Can you talk about your product? What is the agent or the agents that you’re building to kind of take on this problem, and how does it work?

Anish Agarwal: Yeah.

Bogomill Balkansky: And are they double agents? [laughs]

Raj Agrawal: So maybe stepping back, what is an agent? And it’s really an LLM orchestration of tools. And in our world, those tools might be data-fetching tools, how to get logs, how to get metrics back, data processing, how to format it in a way that the agent can really process, or statistical tools like running anomaly detection. And I think really what we’re trying to do here is define the tools richly enough so that an RCA can be expressed as some combination or some sequence of these tool calls. And that’s really where a lot of the complexity and the multi-agent work comes in: how do you piece together these tools to solve these complex tasks? And I think, at least in our world, the data is so big that it needs to be sequential, because a single trace might not even fit in an LLM context. So you need to know how to slowly build up the context of the LLM to get to the root cause. And that’s why it’s such a challenging problem.
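A minimal sketch of the kind of tool-calling loop Raj describes: the LLM repeatedly picks a tool, the result is folded into its context, and it stops when it believes it has the root cause. The tool names, data shapes, and the `call_llm` helper are illustrative stand-ins under assumed interfaces, not Traversal’s actual implementation:

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative tool implementations -- stubs, not real observability integrations.
def fetch_logs(service: str, window_minutes: int = 15) -> str:
    """Pull recent logs for a service from the observability backend (stubbed)."""
    return f"<logs for {service}, last {window_minutes}m>"

def detect_anomalies(metric: str) -> str:
    """Run a simple statistical anomaly check on a metric series (stubbed)."""
    return f"<anomaly report for {metric}>"

TOOLS: dict[str, Callable[..., str]] = {
    "fetch_logs": fetch_logs,
    "detect_anomalies": detect_anomalies,
}

@dataclass
class AgentState:
    # Context is built up incrementally because raw telemetry is far too
    # large to hand to the model in one shot.
    context: list[str] = field(default_factory=list)

def call_llm(context: list[str]) -> dict:
    """Hypothetical LLM call: given the evidence gathered so far, returns either
    {'tool': name, 'args': {...}} or {'answer': '...'}."""
    raise NotImplementedError("wire up your model provider here")

def run_rca_agent(trigger: str, max_steps: int = 20) -> str:
    state = AgentState(context=[f"trigger: {trigger}"])
    for _ in range(max_steps):
        decision = call_llm(state.context)
        if "answer" in decision:                    # agent believes it found the root cause
            return decision["answer"]
        tool = TOOLS[decision["tool"]]              # otherwise, execute the requested tool
        observation = tool(**decision.get("args", {}))
        state.context.append(observation)           # fold the new evidence into context
    return "no root cause found within budget"
```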

Sonya Huang: Are you mirroring the cognitive architecture of a human troubleshooter? Is it a map of operations very similar to what a human would be doing, or is it very different for an agent?

Raj Agrawal: Yeah, I think that’s where, at least when we were first thinking of the problem, we tried to mimic how an SRE would debug. And it was very manual, very sequential, where an SRE typically might look at a piece of evidence and then, okay, figure out what’s the next piece of evidence to look at. But a lot of times they make these hops with system knowledge, which the agent might not know. And that’s really where we used a lot more scale to figure out how to bypass some of this issue of system knowledge.

So this agent will basically look in a more systematic fashion: there’s this Google SRE handbook, which tells you the key golden signals; you should be looking at latency, error rate. So it uses this kind of health monitoring to piece together how to sequentially search this rich space. And rather than having holes in the reasoning where a human can just immediately make that hop, the agent is making these sequential steps that get to the answer.
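One way to picture that systematic search: a walk over the service dependency graph that only expands nodes whose golden signals look unhealthy. This is a simplified sketch under assumed helpers (`is_unhealthy` is a stub you would back with a real baseline comparison), not the production algorithm:

```python
from collections import deque

# Google SRE "golden signals" checked per service.
GOLDEN_SIGNALS = ("latency_p99", "error_rate", "traffic", "saturation")

def is_unhealthy(service: str, signal: str) -> bool:
    """Compare the signal against its recent baseline (stubbed)."""
    raise NotImplementedError("back this with real metric queries")

def systematic_search(entry_service: str, dependencies: dict[str, list[str]]) -> list[str]:
    """Walk the dependency graph from the alerting service, expanding only
    services that show an unhealthy golden signal."""
    suspects: list[str] = []
    queue, seen = deque([entry_service]), {entry_service}
    while queue:
        svc = queue.popleft()
        if any(is_unhealthy(svc, s) for s in GOLDEN_SIGNALS):
            suspects.append(svc)
            for dep in dependencies.get(svc, []):   # only follow edges from unhealthy nodes
                if dep not in seen:
                    seen.add(dep)
                    queue.append(dep)
    return suspects  # ordered roughly from symptom toward likely cause
```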

Bogomill Balkansky: Are there particular conditions or environments where this agentic approach works better or worse?

Anish Agarwal: Yeah, I think fundamentally it’s about data access. So what we have found—and surprisingly what we have found is that we sometimes can add less value in, like, a series A startup versus a large enterprise. Because when you become a large enterprise, your observability is quite mature. So everything is being instrumented in a way where the fundamental data is in place, but your teams are very fragmented, so no one team or no one person has enough context to piece together everything about how to debug, right? And that’s why you have these 30 or 50 people in a Slack incident room.

And so what we fundamentally found is that the place where we add the most value is when the reasoning steps for this agent to go from a high-level trigger to the root cause, those steps can be found in the data, right? It just is too much data for any one human to keep in their head. When that data is fundamentally not there, then it’s when the agent suffers. And that tends to not be the case at the enterprise, is what we have found.

Bogomill Balkansky: Very interesting, and really somewhat counterintuitive, because it’s usually the case with startups that the way you choose your design partners or customers, you start at smaller companies or mid-market and then you try to graduate your way up to enterprise. And if we’re going to use, like, the L1 to L5 analogy from self-driving cars, how close are you to having agents correctly get to the root cause of incidents? Or maybe more broadly, like, what counts as success? Are we trying to get to 100 percent correct resolution, or what even counts as correct resolution in this context?

Raj Agrawal: Yeah, that’s a great question. So I think there’s really two cases here. There’s the first case where the root cause fundamentally belongs in the data. So it’s in some log, it’s in some PR. And in that case, I would say with Traversal, we’re at L4. And I won’t say L5 because for us, we might be able to flag that problematic PR or that smoking gun log, but then there’s still that last mile of the fix and the remediation. And in cases where the fix is not so localized to a specific file or code change, where you need to do basically a bigger systemwide change, we haven’t gotten there, but I think that’s where it’s really exciting to see all the developments with code agents because that’s where we can get to that L5.

I think for places where the root cause isn’t in the data, that’s where we’re more at a L2. I think Traversal finds a lot of the important symptoms that really help people debug, but sometimes it’s just not in that data, and this human kind of needs to make those additional hops to get to the actual root cause, but still the symptoms really help figure out how to make those additional hops. And sometimes for us when we notice that, we can tell customers, “Well, maybe you should instrument in this way to make the system more observable.”

How companies are evolving observability in the AI era

Sonya Huang: Across the AI landscape right now, there’s really interesting kind of AI-native companies being formed like yourselves. There’s also the incumbents that are, you know, in many cases not asleep at the wheel. How do you think about the incumbent risk here and why your customers are choosing to go with you?

Anish Agarwal: Yeah, I think it’s a great question. I think the heart of it is observability is expensive. It’s such an expensive product that people pay for. As a result, it’s extremely fragmented, right? You go to any big enterprise, they’re using Datadog, they’re using Splunk, they’re using Dynatrace, they’re using Elastic, they’re using Grafana, they’re using ServiceNow. I mean, they’re using everything, right? And they’re all encroaching on each other in this game of attrition.

And the problem is that if you just think about the pricing models in this world, it’s all based on the amount of data they store, right? And so as a result, company A is not incentivized to give you better insights from anything that’s stored in company B, right? And to debug something, if you just look at any of the SRE or on-call engineers, they’re calling upon all five, six tools that they have access to. And it’s that fragmentation of this historical industry, I think, which is ideally going to lead to companies such as ourselves that are somewhat agnostic to where the data is stored. At least for now, it gives us a chance.

Sonya Huang: Do you have customers deployed on Traversal today? And what have you found is the difference between what you kind of expected even coming from an academic background versus what you’re actually finding in real-world environments?

Anish Agarwal: Yeah, it’s been quite a journey. I think when we started, as one should typically do, we started with, like, very small companies and built something that worked for them. And that typically took the form of we went through the last hundred incidents that they had in the Slack channels and kind of tried to use our brains to figure out what is the meta workflow of how they always debug an incident in that particular company, right?

And in that situation, a very popular framework for agents is something called the ReAct framework. And that, as a general idea, worked. We could somehow imbue the meta workflow into a ReAct agent system, and it was able to do a really good job.

And then at some point, we started working towards larger companies that had an actual large-scale observability system, thousands of microservices, that kind of situation. And I remember very clearly this one week where it was the first time we dealt with that kind of scale. And the second we tried to apply our system on just some historical incident that had happened, our accuracy was at zero percent, and it would not move.

Sonya Huang: [laughs]

Anish Agarwal: Whatever we did to the prompts, whatever we did to anything, it was stubbornly at zero percent. And that was a rude awakening. I remember Raj and I had a Negroni.

Bogomill Balkansky: [laughs]

Anish Agarwal: And I think at that point we kind of had to think through ways of—we made some interesting decisions which played out in our favor, which is we said, “Okay, nothing about a specific company will be hard coded into the prompts.” Right? And nothing about a workflow that humans do there will be hard coded into the agentic workflow. And that complexity has to go somewhere, right? And eventually where it went is computation, which in our world takes the form of spending tokens on the problem at inference time. And once we were able to kind of find an architecture that exploited inference-time compute, which is something now everyone is finding to be important, accuracy started shooting up.

And what we found then is if the fundamental answer lies in the data, we get to the answer more than 90 percent of the time, and we get it within two to four minutes, right? Which is amazing because now you just look at these Slack channels, and humans are spending most of the time just verifying the answer to what we find versus actually root causing it. And month over month, the time to resolution has dropped, I’d say, as has the number of people on average in the Slack channel, which are the two things that I think any enterprise cares about.
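Traversal hasn’t published the details of its architecture, but one common way to “spend tokens on the problem” at inference time is to run many independent exploratory rollouts and keep the answer they converge on, rather than hard-coding a single workflow. A rough sketch of that general pattern, with a stubbed `investigate_once`:

```python
from collections import Counter

def investigate_once(trigger: str, seed: int) -> str:
    """One independent agent rollout over the telemetry (stubbed); returns its
    best root-cause candidate. In practice each rollout burns many LLM calls."""
    raise NotImplementedError("run a full agent investigation here")

def investigate_with_compute_budget(trigger: str, n_rollouts: int = 16) -> str:
    """Spend inference-time compute by running many exploratory rollouts and
    keeping the candidate they agree on (a simple self-consistency vote),
    instead of relying on one predetermined path."""
    candidates = [investigate_once(trigger, seed) for seed in range(n_rollouts)]
    best, _count = Counter(candidates).most_common(1)[0]
    return best
```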

Bogomill Balkansky: And how do you measure accuracy, like, in this context? Like, for example, there are a lot of companies that focus on LLM evals. Does such a thing exist in the world of SRE and root cause analysis? Or how do you know that you’re correctly identifying a root cause?

Raj Agrawal: The gold standard is honestly trying it on live incidents. I think when you onboard to a customer, oftentimes incidents can happen, like, two to three times a week. And in those scenarios, you get the best feedback. Maybe it takes a couple of hours for that incident to complete. You look at the postmortem and you can really evaluate it. So I think that’s one definitive source. There’s other subtasks we do for evaluation, such as when people are trying to search for specific information from their observability. There’s smaller chunks of tasks you need to do to do RCA, so we evaluate on those tasks as well, which are just higher volume. But ultimately, real live incidents are the best way for us to evaluate.

Traversal’s agentic approach

Bogomill Balkansky: Awesome. It sounds like you had this rude awakening at some point where you deployed the product and it was stuck at zero percent accuracy before you kind of went back to the drawing board and really rethought the entire architecture. So maybe can you give us, like, a very quick tour under the covers of how the product works today, and kind of what is the magic that enables precise root cause analysis?

Raj Agrawal: Well, one important decision we made was we only require read-only access to the data. And I think that was a decision basically based on enterprises not wanting to just have yet another tool to generate more data.

So how do we actually do it? Well, I think there’s two phases. There’s an offline phase and an online phase. So during this offline phase, we’re really trying to learn this rich dependency map. How do different functions, how do different logs relate to each other? And one way to do that is through LLMs. So LLMs go traverse and really understand semantically how these logs, different tags within the logs, all relate to each other. And then we also use statistics. And statistics comes in when, for example, there’s natural variation in time series. And that turns out to leave traces of causality, which is basically what Anish and I worked on in grad school: how do you pull causal relationships out of this data? And that’s really key to build this rich map.

And we also use self-play to basically prioritize certain paths that are very promising for RCA. So now once we’ve constructed this rich dependency map during the online phase when an actual incident comes to us, what this agent is doing is it’s using that real-time information and this dependency map to basically figure out what hops to make to do the root cause analysis.
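As a toy illustration of using natural variation in time series to propose dependency edges, here is a lagged-correlation scorer. It is a crude stand-in for the causal-structure-learning methods the founders allude to, not their actual estimator; the threshold and lag window are arbitrary:

```python
import numpy as np

def lagged_dependency_score(upstream: np.ndarray, downstream: np.ndarray,
                            max_lag: int = 5) -> float:
    """Score how strongly natural variation in one service's metric predicts
    later variation in another's (illustrative only)."""
    best = 0.0
    for lag in range(1, max_lag + 1):
        a, b = upstream[:-lag], downstream[lag:]   # shift one series against the other
        if a.std() == 0 or b.std() == 0:
            continue
        corr = abs(np.corrcoef(a, b)[0, 1])
        best = max(best, corr)
    return best

def build_dependency_map(metrics: dict[str, np.ndarray], threshold: float = 0.6) -> dict:
    """Offline-phase sketch: keep an edge A -> B when A's fluctuations
    consistently lead B's."""
    edges = {}
    names = list(metrics)
    for a in names:
        for b in names:
            if a == b:
                continue
            score = lagged_dependency_score(metrics[a], metrics[b])
            if score >= threshold:
                edges[(a, b)] = score
    return edges
```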

Bogomill Balkansky: How long does it take for the offline part to become effective? In other words, between deploying your solution and the first incident that you can actually troubleshoot, how long is that gestation period?

Raj Agrawal: It kind of depends on if they want to troubleshoot a live incident or a historical incident. In general, I would say we take about five to ten hours to kind of look through all of their code base, look through their observability, really have that system understanding. For larger customers, it can take a day, but generally five to ten hours.

Sonya Huang: So you’ve mentioned reasoning and inference time compute a couple of times in there. Are you using kind of foundation models and fine tuning them for your purposes? Are you building a lot of that kind of architecture yourself? Maybe just talk a little bit about, you know, the architectural decisions you’ve made.

Anish Agarwal: Yeah. One really interesting thing we have learned is that if you work with enterprises, they typically have an existing relationship with an LLM provider. So they might have an enterprise contract with OpenAI or Anthropic. And if you try to bring your own model or your own fine-tuned model to them, you’re going to be stuck in security hell for about a year.

Sonya Huang: Hmm.

Anish Agarwal: And so you have to kind of tie your hands where you say you have to be able to use whichever model they give you. Typically, OpenAI is a pretty safe bet, Anthropic as well. And so then most of the complexity is really about how do you get the right set of tools that this LLM has access to to orchestrate the RCA itself. And the other thing you can do is fine-tune within the company’s environment. Let’s say they point to their Azure OpenAI instance, then you can fine-tune that: every time an incident happens, you see how far you got, you see whatever the pinned root cause was, you can see how far away you were, and then the system can fine-tune on making sure those gaps get smaller and smaller over time. That’s generally how we’ve seen it play out.
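The data-preparation side of that feedback loop might look something like the sketch below: each resolved incident becomes a supervised example pairing the incident context with the root cause the team pinned. The chat-style JSONL format is an assumption for illustration; the exact schema depends on the provider’s fine-tuning API:

```python
import json

def build_finetune_example(incident_context: str, agent_answer: str,
                           pinned_root_cause: str) -> dict:
    """Turn one resolved incident into a supervised example: the gap between
    what the agent found and what the team pinned as the root cause is the
    training signal. Hypothetical schema for illustration."""
    return {
        "messages": [
            {"role": "user", "content": incident_context},
            {"role": "assistant", "content": pinned_root_cause},
        ],
        "metadata": {"agent_answer": agent_answer},
    }

def write_training_file(examples: list[dict], path: str = "rca_finetune.jsonl") -> None:
    # One JSON object per line, the common format for fine-tuning uploads.
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```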

Bogomill Balkansky: And you also mentioned that you guys had—part of the architecture here is based on years of your academic research. Can you share a bit more about kind of this interplay between your PhD dissertations and translating that now into a company?

Raj Agrawal: Yeah. Actually, so one thing that’s kind of interesting is, at least during grad school, I worked closely with the Broad Institute to basically understand what these CRISPR interventions—you have these gene regulatory networks, you do a CRISPR intervention, and you’re trying to understand what is the effect of this drug or this knockout experiment on how these genes express.

And the techniques we developed there for learning that causal structure between genes turned out to be pretty related to this problem we’re facing with production systems, where if you swap out the nodes, genes become microservices, and then you’re learning, okay, what happens when I make a PR change or I break this part of the system, how does that percolate? It becomes almost the identical problem. And I think that was honestly sheer luck. We didn’t know that was going to be the case. And it was only once we got into the weeds of the problem that we realized, like, oh, wow, we got really lucky that our grad school research played out well here.

Bogomill Balkansky: What have been some other surprises on this journey over the past year, year and a half?

Anish Agarwal: Well, for one, I think, which is part of why it’s been such a joyous thing for us is just how it’s like the industrial age of AI. And so I think all of the most interesting innovation I feel is happening at small research-focused startups such as the ones that we are part of or have the privilege to be a part of.

And so I think just having seen the best of research in some of the best universities, and now actually seeing how it gets played out in a company, it just feels like this is where all the magical special work is happening. And I think that was surprising to me coming from a world of academia, becoming a professor. I thought that’s where all the innovation would be. And so that’s been quite surprising, honestly.

I think the other thing that’s been quite surprising is just how hungry enterprises are for using generative AI to solve real problems. And the pace at which you can move with them: what typically would take a year or two years to close can be rapidly done in a couple of months. And so I think the hunger of the market and also the pace of innovation in the industry has been, I think, quite surprising.

Staying ahead of rapid AI innovation

Bogomill Balkansky: Speaking of the pace of innovation, things are moving so quickly, and you have been working on this product for slightly more than a year now. Have you been in situations where you have to go back and rework something because now there’s MCP or something new on the market that previously you had to engineer yourself, but now it’s readily available off the shelf, so to speak. Or in some ways, like, how do you future proof your architecture and what you’re working on?

Anish Agarwal: Honestly, I think that is going to be a constant challenge for companies of this generation. If you’re in product, if you’re in design, if you’re in core engineering, you constantly have to make bets as to where AI’s going to be six months from now, and are you willing to reevaluate everything six months from now?

The good news is that it’s only going to get better. So it’s not like your product’s going to get worse six months from now, right? And so for example, one really interesting bet that has paid off is Raj and Raj were very prescient about the fact that reasoning models are going to get better. They made this bet in September, just seeing where the world was going. And we architected our system such that the reasoning models would get to shine. And that has really paid dividends right now in the way our entire architecture is set up.

And so I think you have to keep making these kinds of six-month bets about where AI is going to be and surf it just right. And I think the companies that are able to string together a few of these bets just right are the ones that are going to win. I think that’s why, in my opinion, if you ask me would you rather be an AI team taking this problem or an observability team? I’m biased, but I’d say I’d rather be an AI team because you just have to—you get more fluent in making these kinds of bets about where things are going.

Sonya Huang: What is your team composition? Like, how many are researchers, how many come from the domain? I’m curious how this—what the shape of an AI-native agent company will look like more broadly.

Raj Agrawal: Yeah. I mean, right now I would say we’re 90 percent engineers. I mean, a lot of people—you know, there’s a few of us with PhDs who have done PhDs in machine learning. A large fraction come from, like, traditional software engineering backgrounds, who have been very interested in gen AI and have used it in their own life.

But I think for some of these things, at least, the barrier is lower, and you can kind of get—everyone’s trying to learn how to make an agent. And I think it’s not like the old days where you might have needed a PhD to, you know, write out these gradient updates on this crazy LLM. I think it’s a lot more democratized, and that reflects in the way we’ve made the team composition.

We also have people working with infrastructure. So how do we scale these agents? And I think that’s a really interesting problem to Anish’s point of inference-time compute, is how do you really get this AI agent to just do a lot more work than a human can possibly do? And that involves, for us, making thousands of network calls per investigation. So that’s reflected in hiring the best infrastructure engineers. And then I think the other part is we have product managers, and some of those folks are coming in the next month. So we’re really excited just to bring all of these different people with different backgrounds together to solve this hard problem.

Anish Agarwal: Yeah. I think one thing that unifies everyone in the company is an excitement for generative AI. And I think because of that, they’re willing to learn and adapt and throw away things that didn’t work, versus kind of being stuck in your way. Because I think those are the people that are not—that’s the type of culture that is not going to survive in this world because it’s moving too quickly.

And if I think about an AI researcher versus an AI engineer versus an engineer, like, how do I organize that? I think it’s nothing to do with whether you have a PhD or any sort of credential. It’s about how experimental are you in the way you think? Like, how willing are you to make little bets and hypotheses about how things can be, and then quickly put that to the test? I think that mentality of quick iteration with AI and data is, I think, what makes someone a good AI researcher or an AI engineer. And so it’s more a mindset than a credential.

Bogomill Balkansky: Speaking of the composition of your team, I find it very interesting that neither one of the two of you or your two other co-founders are actually observability industry insiders. And your entire company is very heavy on AI and engineering prowess, and honestly, like, you know, quite short on observability, domain knowledge. And yet you are making waves, like, you know, in this domain and successfully resolving these very complex issues that otherwise take little armies of people scurrying around and dumpster diving to go figure out, like, what is the smoking gun log?

It almost begs the question of what do you think a future observability or SRE team at one of your customers might look like? Today you mentioned that there are people who have all the insider knowledge, like, all the tribal knowledge, the intuition, they can make these hops, you know, from here to there without data. Is that still a valuable skill set, like, you know, five, ten years from now, or would the observability team look very, very different?

Anish Agarwal: A couple of answers. One is I think there will be a lot of work to be done about the reliability of AI systems themselves. And I think the way you reason about reliability of AI systems will require, obviously, first principles—good SRE principles—but also some sort of fluency with AI as well, because they will break in interesting, unique ways that we’re just not used to.

And so I think modern SRE teams will have to kind of be fluent in both in some ways, which is how do LLMs and AI systems break, and also how regular systems break. And marrying those two things is what I think will make you a really good DevOps and SRE engineer.

And also even simple things like I think one really interesting thing that’s going to happen is that the data layer of observability is going to fundamentally change as well. Like, what a log looks like is going to be fundamentally different. And a part of being a good SRE or DevOps person is knowing how to write a good log, right? So that over time when an incident happens, you have the right instrumentation. There’s an art to instrumenting your system in the right way where you’re not flooding it with too many logs, but just the right amount, right? And I think the way we will write logs will look fundamentally different as well because it’s no longer meant to be scrolled by a human but meant to be consumed by an AI system.

Bogomill Balkansky: That’s a very interesting point. Like, how do you rate Cursor today on their ability to write good logs?

Raj Agrawal: I think it does a pretty good job. But I mean, I think it’s …

Bogomill Balkansky: Better than a human?

Raj Agrawal: I think the way it can format—like, humans just make typos and they don’t—I mean, I make—I’m really bad at spelling.

Sonya Huang: [laughs]

Raj Agrawal: I think Cursor does a great job there, but I think a lot of it is still mimicking how traditional logging looks. And I think to Anish’s point, you need less of that and you really, for example, want to bury as much information in the message field because that’s the meat of what is going on. And now when the LLM sees that, it can kind of have a better understanding of how to make those hops. But a lot of times people don’t like to add so much content because a human can’t read such a long error stack trace in a log.

Sonya Huang: We have shorter context windows than these LLMs.

Anish Agarwal: I think also then connecting with the business logic a lot of times, because typically it may not be—your engineering system is fine, but it depends on what “fine” even means for an engineering system. It has to connect to some sort of business logic for it to be healthy or unhealthy. And I think logging in a way that connects these two things in the right way is an art still. And I think that will be tough for Cursor or any other AI system to do unless they have full understanding of your business logic. Which might happen, I don’t know. Yeah.
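To make that concrete, here is a hedged sketch of a log line written for an AI consumer rather than a human scroller: one structured record that carries the technical details, the business context, and the full stack trace together. All field names are illustrative, not a recommended schema:

```python
import json
import logging
import time

logger = logging.getLogger("payments.checkout")

def log_for_ai(event: str, **context) -> None:
    """Emit a single structured line that packs far more context into the
    message than a human scroller would tolerate -- the consumer is an agent,
    not a person."""
    record = {
        "ts": time.time(),
        "event": event,
        # Business-level context sits alongside the technical details, so an
        # agent can connect "unhealthy" to what the system is supposed to do.
        **context,
    }
    logger.error(json.dumps(record))

# Usage sketch: everything the on-call engineer (or the agent) would otherwise
# have to hunt down across tools is carried in the log line itself.
log_for_ai(
    "charge_failed",
    order_id="ord_123",
    upstream_service="fraud-scoring",
    upstream_latency_ms=4180,
    sla_ms=800,
    deploy_sha="abc1234",
    full_stack_trace="...",   # left elided here
)
```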

Sonya Huang: As you dream about the future of the software engineering market and the role that observability and Traversal will play in that future, you know, you started this podcast episode talking about how people aren’t vibe coding a payment system or banking system today. Do you think that if everything that Traversal’s building goes right, if observability changes in the way you think it might, engineers at the largest banks and payments companies and healthcare institutions will actually be vibe coding their software programs?

Anish Agarwal: To some extent, yes. But I think what we’ll have is just a much better sense of, like, did it fulfill some function? And that will typically take the form of unit tests in some form, right? And just much better testing. And so as long as it fulfills some tests, that code is fine, and who cares what was written in that code?

I think that, in my opinion, is how software engineering is going to evolve: it will be much more about did it fulfill some function versus the way it was written. Which will be good, but as we talked about earlier, I think also bad, because when systems fail, it’s because they interact in ways you didn’t expect, right? A lot of times you are failing because a third party that you call upon is not meeting its SLA, right? And that’s just going to happen a lot more, I think. And it’s going to be a lot harder to debug because you don’t actually know what was the content of a particular code file. And so I think that’s the problem that we’re going to face. And that’s why I think products like Traversal will have a big part to play in solving those kinds of issues, which are a lot more subtle. Yeah.

Raj Agrawal: Because I think that’s where humans shine, right? Because right now humans mostly write all of their code, so they have that rich system knowledge. So when they see an incident, they can short circuit a lot of paths because they already kind of have seen something in the past or they kind of already understand this error message they wrote. But if an AI generated that, they don’t have that structural advantage, and then you really need an AI troubleshooter.

Lightning round

Sonya Huang: Okay, we’re going to close it out with some rapid-fire questions. You guys ready?

Anish Agarwal: Yeah.

Raj Agrawal: Yeah.

Sonya Huang: One-sentence or one-word answers. Maybe first, what application category do you think will break out for AI in the next 12 months?

Anish Agarwal: I’m going to try to say this in two sentences, which is: I think anywhere you can use reasoning models well is where applications are going to shine. And I think diagnostics is one of them, healthcare being, I think, a prime example of that.

Bogomill Balkansky: Another startup or a founder that you admire?

Raj Agrawal: I would say Demis Hassabis—hopefully I didn’t butcher the name, but I think just amazing researcher, so much foundational work, and just always trying to solve the hardest problems.

Bogomill Balkansky: Recommended pieces of content, how to make yourself smart on AI.

Anish Agarwal: My favorite article, which you can read in five minutes, is “The Bitter Lesson” by Rich Sutton, who won the Turing Award last year for his work on reinforcement learning.

Sonya Huang: Any other AI agent companies that you admire? You’re one of the very first cohort of truly agent native companies. Any others?

Anish Agarwal: Aside from the obvious, like Glean and I mean, I’d say Perplexity is one for sure that I think is just doing a fantastic job. Yeah.

Bogomill Balkansky: Favorite new AI applications that you’re using in your personal life?

Anish Agarwal: I honestly don’t use anything other than ChatGPT. [laughs]

Raj Agrawal: Unfortunately, the same for me. It’s already—it’s my best friend at this point. [laughs]

Sonya Huang: Wow! I love it.

Bogomill Balkansky: My best friend is Granola for the dyslexic me who cannot type to save my life.

Anish Agarwal: I do love Granola. I use it every day. I use Granola every day.

Sonya Huang: Maybe one last question. Will we be vibe coding banking apps, payment systems, healthcare systems, in five years’ time?

Anish Agarwal: To some extent, yes.

Sonya Huang: Wonderful. Thank you so much for joining us today. We really love this conversation, and congratulations on what you’ve built so far at Traversal.

Anish Agarwal: Thank you.

Raj Agrawal: Thanks so much.

Bogomill Balkansky: Thank you, guys. Now time for a Negroni.

Anish Agarwal: Yes.