Sierra Co-Founder Clay Bavor on Making Customer-Facing AI Agents Delightful

Customer service is hands down the first killer app of generative AI for businesses. Co-founder Clay Bavor walks us through the sophisticated engineering challenges his team solved along the way to delivering next-gen AI agents for all aspects of the customer experience that are delightful, safe and reliable—and being deployed widely by Sierra’s many customers.

Clay Bavor, co-founder of Sierra, brings deep experience from 18 years at Google leading innovative projects like Google Labs and AR/VR efforts. Sierra is enabling companies to deploy AI agents that can handle complex customer interactions at scale, representing a fundamental shift in how businesses engage with customers. He describes how companies can capture their brand voice, values and internal processes to create AI agents that truly represent the business.

AI agents represent a new paradigm in software. The combination of non-deterministic language models with structured business logic requires rethinking the entire development lifecycle, from programming approaches to testing and deployment. Sierra has built an “Agent OS” and tools to manage this complexity.
The solution to many AI problems is often more AI. For example, Sierra uses “supervisor” agents to review the work of primary agents, improving accuracy and safety. They also leverage AI for analytics and to surface problematic conversations for human review.
Deploying AI agents successfully requires deep partnership with customers. Capturing their brand voice, business processes and domain knowledge is as much about product design and customer experience as it is about technology. Sierra works closely with customer experience teams to refine and improve agents over time.
AI agents can drive significant business value beyond cost savings. Offering improved customer satisfaction, reduced churn and new revenue opportunities positions AI as a strategic priority for many companies, involving C-suite and board-level discussions.
The rapid pace of AI advancement means the full potential is still unfolding. In the near future, AI could dramatically enhance organizational capabilities, allowing companies to consistently operate at their best across all functions. For individuals, AI will likely become a powerful creative force multiplier, shortening the path from idea to realization.

Clay Bavor: One of the more interesting learnings from the, you know, past year and a half of working on this stuff is that the solution to many problems with AI is more AI. And it’s somewhat unintuitive, but one of the remarkable properties of large language models is that they’re better at detecting errors in their own output than in not making those errors in the first place.

Ravi Gupta: Joining us today is Clay Bavor, co-founder of Sierra. Before Clay started Sierra with his longtime friend Bret Taylor, he spent 18 years at Google, where he started and led Google Labs, their AR/VR efforts, and a number of other forward-looking bets for the company. Sierra is allowing every company to elevate its customer experience through AI agents, and there is no one who knows more about what AI agents can do today and what they’ll be doing tomorrow than Clay. You’ll get to hear about how pictures of avocado chairs helped inspire the founding of Sierra, why the solution to problems with AI is often more AI, and so much more. Please enjoy this incredible episode with my friend, Clay Bavor.

Ravi Gupta: All right, Clay. Listen, this is a funny start because we know each other so well, but can you just tell everyone a little bit about yourself, and just give us some background before we talk about the future of AI and what role Sierra is going to play in that?

Clay Bavor: So first of all, I’m a Bay Area native. I grew up not more than four or five miles from here, so grew up in the Bay Area. Got to see the kind of dotcom bubble grow and then burst, studied computer science and then ended up right out of undergraduate at Google, where I was for 18 years until last March.

And so at Google I worked on, really, every part of the company. I started in search and then ads. For several years, I ran the product and design teams for what is now Workspace, so Gmail and Google Docs and Google Drive and so on, and then spent the last, really 10 years at Google working on various forward-looking bets for the company, some hardware related like virtual and augmented reality, some AI related like Google Lens and other applications of AI.

And then 15 months ago, left Google to start Sierra with a longtime friend of mine, Bret Taylor. We met in our early days at Google, where we both started our careers in the associate product management program. So he was, I think, class one, I was class three. And we met early on and stayed in touch, in particular through a monthly poker group that in a good year would play, like, once, and met up December of 2022 and just saw what was happening in and around AI and these fundamentally new building blocks that we thought would enable us to create something really special and started Sierra out of that. So that’s the recap.

Pat Grady: Actually, I’m curious on that—and we need to get to what is Sierra pretty quickly here, but just for fun, December, 2022, very shortly after the ChatGPT moment, how—I guess, what was the process like? Or how soon after that moment did you have the conviction that this is a sufficiently interesting new technology to build a company around?

Ravi Gupta: Can I introduce one thing that’s kind of interesting I hope you talk about? Before you actually—before the ChatGPT moment, you had been telling me about how everything was gonna change. I still remember distinctly him telling me, “You don’t understand. You’re gonna be able to talk about a scene that you envision, and they’re gonna be able to make a movie out of you just talking about it.” Do you remember you telling me?

Clay Bavor: Yeah. Yeah.

Ravi Gupta: And so I’m actually very curious about this, too.

Clay Bavor: Well, I had such a privileged seat at Google to see so much of what came out of that Transformer paper in 2017 and the emergence of early large language models. So at Google, one of the first was called Meena or LaMDA. There was a paper, I think, in 2020, a conversational chatbot for just about anything. And I remember even before that, getting to interact with this thing in a pre-release prototype, and having this uncanny sense that there was someone, something on the other side of it, and that this was different.

And another moment, I think it was mid-2022 when we had, I think it was the first or second version of PaLM—Pathways Language Model at Google. It was a 540-billion parameter model. And we were testing it to see, kind of, how smart it was. And one of the surest signs of intelligence is the ability to think and reason in metaphor and analogy. So we tried a few things, and one which is pretty straightforward is we asked PaLM, “Hey, explain black holes in three words.” And it came back without skipping a beat, “Black holes suck.”

Ravi Gupta: [laughs]

Clay Bavor: And we were like, “Oh, that’s a pretty good summary. Also, like, you know, the model seems to have a sense of humor, which is cool.” And the moment that really blew my mind, we asked—and I remember the answer verbatim, we asked PaLM, “Please explain the 2008 financial crisis using movie references.” And again without skipping a beat, “So the 2008 financial crisis was like the movie Inception, except instead of dreams within dreams, it was debt within debt.”

Ravi Gupta: Whoa!

Clay Bavor: And we all paused. What is this? So it had understood basically the concept of CDOs, nestedness of debt. Okay, what movie includes nestedness of something else? Inception, nestedness of dreams. So it’s like Inception. And we all thought, “Wow, this is something new and different.”

And then there were a couple other moments. I remember the first DALL·E paper came out, and they did a blog post, and people reacted a little bit to it. But for me, I remember one of the stars of the show was they asked DALL·E to make avocado chairs. And I know this sounds so odd, but here was a set of 10 or 20 images of chairs that look like avocados. It wasn’t Photoshopped. These images had never existed before, and yet the model seemed to understand—similar to the movie reference metaphor—concepts of avocadoness and chairness, and put those together and create these images pixel by pixel.

Ravi Gupta: We had avocado chairs at Instacart. We actually did.

Clay Bavor: Did you really?

Ravi Gupta: We actually did. We actually had chairs shaped like avocados. In related news, there were times where we were burning a little bit too much money.

Pat Grady: [laughs]

Clay Bavor: Those bags, too. Yeah, those bags. So had a good sense that something was coming. And in fact, the team I was running at Google at the time, Labs, was putting a lot of large language models to use in early applications there. And so had a hunch. ChatGPT certainly clarified that hunch, but I think Bret and I both for several years had been tracking what was happening and just seeing, you know, first it was translation, and better-than-human level translation. Then it was some of this language generation. And I think credit to OpenAI for doing the engineering work and data work and much more, to make GPT-3 turn into ChatGPT, where suddenly you could grasp this thing’s full potential without knowing how to write Python and use their APIs.

Ravi Gupta: All right, so we’re going to talk about where AI is going. We’re going to be talking about agents, we’re going to talk about customer service.

Clay Bavor: Great!

Ravi Gupta: But first, just can you maybe just tell people a little bit about Sierra and what you and Bret have created?

Clay Bavor: Yeah. So in a nutshell, Sierra enables any company in the world to create its own branded, customer-facing AI to interact with its customers for anything from customer service to commerce. And the backdrop for this is this observation that any time there’s been a really significant change in technology, people interact with computers, with technology in different ways. And as a consequence, businesses are able to interact with their customers in entirely new ways.

And you saw this in the ‘90s, the internet made the website possible, and for the first time, a company could have a sort of digital storefront and be present to the world, update its inventory with the click of a button and so on. In the mid- to mid-early-2000s, 2005, 2008, if you were a company, you could all of a sudden, through ubiquitous social networks, interact with your customers at scale and have conversations at scale.

And in 2015, after the rise of smartphones, as a company, you could put kind of a Swiss army knife version of your company in everyone’s pocket. And so, like, I bet you have your bank’s mobile app on your phone, probably on your home screen. So the last few years of advances in AI have, for the first time, made it possible to create software that you can speak to, right? Software that can understand language, software that can generate language, and most interestingly, I think, software that can reason and make decisions. And it’s made for really delightful conversational experiences, like those that we associate with ChatGPT. And so we think this is a big, big deal for how businesses interact with their customers.

And you think about the difference between how we do some things today versus what you could do if you could just have a conversation with the business you’re interacting with. Think about, like, shopping. You’re in the market for some shoes, right? Or Pat, maybe for you, some new weights or something.

Pat Grady: [laughs]

Ravi Gupta: Very, very heavy weights.

Pat Grady: Tiny, tiny little ones.

Clay Bavor: And you’re on the website, and it’s like you basically have to imagine how the company’s designer would have organized the product catalog. So okay, men’s, men’s shoes, men’s running shoes, men’s racing shoes, lightweight vapor fly, I can’t remember the name, and so on. Instead, with conversational AI you could just say, “Hey, I need some super lightweight running shoes, kind of like those ones I got last time. What do you got?”

And it’s almost like, I’m dating myself a little bit here, but, like, Yahoo! Directory, where you navigate through this hierarchical structure to find what you want, in contrast to Google, where you explain what you want. And this takes it several steps further. And there’s a quote from the head of customer experience at one of the companies we work with. She said, “I don’t want our customers to have to have a master’s degree in our product catalog and our corporate processes.” And to do a lot of things, you know, buying shoes is fairly easy on the spectrum of interactions you have with companies, but imagine adding a new person to your insurance policy. Where do you go in the mobile app for that? How do you get that done? And your eyes just glaze over, right? And so the alternative, talking to an AI, and in particular an AI agent (that’s the technology around which we built Sierra), where that AI agent represents your company, your company at its best, we think is really, really powerful. And even in, you know, we’re 15 months old as a company, we’ve had the privilege of already working with storied brands like Weight Watchers, Sonos, SiriusXM, OluKai. If you’re in the market for new flip flops, I strongly recommend OluKai flip flops.

Pat Grady: I have two pairs.

Clay Bavor: Very good. Excellent. Also make great golf shoes.

Ravi Gupta: Oh, really?

Clay Bavor: Oh, yeah. Yeah, yeah. You should get some.

Ravi Gupta: All right. Great.

Clay Bavor: And so for Weight Watchers, we’re advising on points and helping members manage their subscriptions. With SiriusXM, we’re helping diagnose and fix radio issues, and figure out what channel your favorite music is on, and so on. And the results, again, in the first year of the platform out there were, in one case, resolving more than 70 percent of all incoming customer inquiries at extremely high customer satisfaction. And all this leads us to believe that every company is going to need their own AI agent, and we want to be the company that helps every company build their own.

Pat Grady: In the spirit of, sort of, the future of these AI agents and what they could mean for customer facing communications or customer facing operations, are there any good examples of things that were not possible 18 months ago that are possible today? And then maybe if we roll the clock forward, things that are still not quite possible today that you think will be possible 18 months from now?

Clay Bavor: Yeah. First of all, the progress month by month and over 18 months in particular, is just kind of breathtaking. 18 months ago, GPT-4–class models didn’t exist, right? It was still kind of something just coming over the horizon. Agent architectures, cognitive architectures, kind of the way you compose large language models and other supporting pieces of infrastructure were very, very rudimentary. And so I’d go so far as to say, like, the idea of putting an AI in front of your customers that could be helpful and importantly, safe and reliable, that was just impossible.

And so chatbots from even 18 months ago looked a lot like a pile of hard-coded rules that someone cobbled together over months or years that became very brittle. And I think we’ve all had the experience of talking to a chatbot. “I’m sorry, I didn’t get that. Can you ask in a different way?” Or my favorite. My favorite is when they have the message box, and then the four buttons you can click, but the message box is blanked out and you can’t actually use it. And so I can help you with anything so long as it’s one of these four buttons.

So most of what I described—fixing radios, processing exchanges and returns and so on, wasn’t possible, at least in any satisfying way, or in a way that led to real business results for companies 18 months ago. Fast forwarding 18 months. I think we go pretty deep here. I think multimodal models are quite interesting. Something like 80 percent of all customer service inquiries are on the phone, not on chat or email, so voice will obviously be a huge part of it. Things like returns, exchanges, diagnosing radio issues and things like that are on the simpler end of the spectrum of the total set of tasks that you might want to get help with from an AI agent.

And so I think more advanced models, more sophisticated cognitive architectures, all of those, I hope, would increase, kind of, the smarts in the agent, the types of problems it can solve. And then trust, safety, reliability, you know, the hallucination problem, I think, is still an unsolved area. And we’ve made, others have made huge amounts of progress on it, but I think we can’t yet declare victory.

Ravi Gupta: How quickly do you think it’s going to become—you guys are doing so much for the customers, not just customer service, but working all the way through the funnel. But on the customer service side, how long is it going to take to become the default, that folks expect that they will be able to have someone or an AI that’s available at any time to answer any question? You know, make that real for us?

Clay Bavor: Yeah, I don’t know. And in part, there’s a bit of a hole to dig ourselves out of as, not a company, but as an industry where it’s like, when was the last time you had a great interaction with a chatbot on a website? And I think if you polled a hundred people and you’re like, “Do you like talking to customer service chatbots?” probably zero out of a hundred would say yes. On the other hand, if you ask, “Hey, do you—” ask a hundred people, “Do you like interacting with ChatGPT?” maybe a hundred out of a hundred would say yes.

And so I think some of the work we’ve been doing in our product is to educate our customers’ customers upfront, that, like, hey, this thing’s actually really smart and good. One of the interesting specific techniques for doing that is we stream our answers out word by word, similar to how ChatGPT does. People are so used to the message, message, message. The streaming answers are something of a kind of visual signature for, oh, there’s a really smart AI behind this. And so I think what we find is customer satisfaction is extremely high with our AI agents, in the mid-fours, so 4.5 out of 5 stars, which in some cases is higher than customer satisfaction with human agents. And in fairness, they often get the hardest cases, and the cases that we will hand off because the customer became angry or was especially frustrated or something. But still, those results are really significant. And so my guess is over just the next few years, I think people will realize, “Oh, I can get my issue resolved faster. This thing is actually capable and can not only answer my questions.” But one of the things we’re really proud of is we go far, far beyond just answering questions, but can actually take action and get the job done.

Pat Grady: Can you talk a bit about Agent OS and some of the frameworks that you’ve put around the foundation models to make everything work?

Clay Bavor: Yeah. So it’s been such an interesting journey learning what’s required to put AI safely, reliably and helpfully in front of our customers’ customers. And a huge part of that, really, the first part is looking at what are the challenges with large language models, and how do you address or meaningfully mitigate those?

And so start with hallucinations. I don’t know if you saw it, but there was an example from a few months ago where Air Canada’s chatbot that I think was based on an LLM and apparently not much else, was interacting with a gentleman who had questions about their bereavement policy. And I think the person had had someone pass away in his family and was asking about refunds and credits and so on. And the AI made up a bereavement policy that was quite a bit more generous than Air Canada’s actual bereavement policy. And so the man took a photo, and later claimed the full amount of that refund and so on, and they said, “No, actually, that’s not our policy.” And bizarrely, and I don’t quite understand this, the case went all the way to court. Air Canada lost, and our thought was like, “Hey, it’s just like $500 and, like, Canadian dollars.” But hallucinations are a real challenge.

And on top of that, just to enumerate some of the things to overcome that we have with Agent OS, no matter how smart GPT-5 or -6 is, it won’t know where your order is, which seats you’ve booked on the upcoming flight or whatever. It’s obviously not in the pre-training set. And so you need to be able to safely and reliably and in real time integrate an AI agent in our case, with systems of record to look up customer information, order information, and so on.

And then finally, most customer service processes are actually somewhat complex, right? You go to call centers and there’ll be flow charts on the wall like, “Here’s how we do this, and if there’s an exception, this way,” and so on. And as capable as GPT-4 and Gemini 1.5-class models are, they’ll often have trouble following complex instructions.

And we saw one example in an early version of an agent that we prototyped where you’d give it five steps in a returns process or something, and you’d say, “Hi, I need to return my order,” or whatever, and it would jump straight to step five and then call a function to return the shoes with username johndoe@example.com, order number 123-4567.

So it would not only hallucinate facts or bereavement policies, but even function calls and function parameters and so on. So with Agent OS, what we built is essentially a toolkit and a runtime for building industrial-grade agents that—I don’t want to say that we’ve solved every one of these problems, but overcome and mitigated the risks in these problems to such an extent that you can safely deploy them at scale, have millions of conversations with them and so on.
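One way to picture a guard against hallucinated function parameters is a check that runs before any tool call executes and compares each proposed argument with data the system actually holds. The sketch below is only an illustration of that idea, not a description of how Agent OS works: `ProposedCall`, `validate_call`, `KNOWN_ORDERS` and the sample order data are all hypothetical names and values invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedCall:
    """A function call proposed by the model (hypothetical structure)."""
    name: str
    args: dict


# Hypothetical system of record: order number -> email of the owning user.
KNOWN_ORDERS = {"778-2041": "maria@example.com"}


def validate_call(call: ProposedCall, authenticated_user: str) -> list[str]:
    """Return a list of problems; an empty list means the call may run."""
    problems = []
    if call.name != "return_order":
        problems.append(f"unknown tool: {call.name}")
        return problems
    order = call.args.get("order_number")
    user = call.args.get("username")
    if order not in KNOWN_ORDERS:
        problems.append(f"order {order!r} not found in system of record")
    elif KNOWN_ORDERS[order] != authenticated_user:
        problems.append("order does not belong to the authenticated user")
    if user != authenticated_user:
        problems.append("username does not match the authenticated session")
    return problems


# A call with made-up parameters, like the one described in the conversation.
hallucinated = ProposedCall(
    "return_order", {"username": "johndoe@example.com", "order_number": "123-4567"}
)
# A call whose parameters agree with the (hypothetical) system of record.
grounded = ProposedCall(
    "return_order", {"username": "maria@example.com", "order_number": "778-2041"}
)

print(validate_call(hallucinated, "maria@example.com"))
print(validate_call(grounded, "maria@example.com"))
```

The design choice here is that the model never gets to be the source of truth for identifiers: anything it proposes is treated as a claim to verify, and a non-empty problem list blocks execution.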

And it starts at the foundation layer, I don’t mean foundation model layer, but just the base layer of the platform where you have to get really important things right, like data governance and detection, masking and encryption of personally identifiable information. And so we built that right into the platform from the ground up so that our customers’ data stays our customers’ data, so that their customers’ data is protected. We, for instance, detect, mask or encrypt all PII before we log to durable storage, so knowing that we’re going to be touching addresses and phone numbers and so on, we can handle that safely.
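As a rough illustration of masking PII before anything is written to durable storage, here is a minimal Python sketch. It assumes simple regular expressions for email addresses and North American phone numbers; the patterns, the `mask_pii` and `log_message` helpers, and the placeholder tokens are assumptions made for the example, and a production system would likely need far more robust detection plus encryption, which is omitted here.

```python
import re

# Deliberately simple, illustrative patterns; real PII detection is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def mask_pii(text: str) -> str:
    """Replace detected emails and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


def log_message(store: list, text: str) -> None:
    """Append only the masked form, so raw PII never reaches the store."""
    store.append(mask_pii(text))


durable_log = []  # stand-in for durable storage
log_message(durable_log, "Reach me at jane.doe@example.com or 415-555-0123.")
print(durable_log[0])
```

Masking at the single point where messages enter the log means downstream analytics and debugging tools only ever see the redacted text.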

A level up from that, we’ve developed what we call Agent SDK, and it’s a declarative programming language that’s purpose built for building agents. And it enables an agent developer, most of whom sit within the four walls today of Sierra, to express high level goals and guardrails around agent behavior. So you’re trying to do this. Here are the instructions, here are the steps and a couple of the exceptions cases, and then here are the guardrails.

And to give an example of that, one of our customers works in, kind of, the healthcare-adjacent space. They want to be able to talk about the full range of their products without dispensing medical advice, right? So how do you create those additional guardrails? And then so you can define, kind of, the behavior and scaffolding for complex tasks for AI agents with Agent SDK. We also have SDKs for integrating with contact centers when we need to hand off, for integrating with systems of records like the order management system and so on. And then finally, for integrating our chat experience directly into a customer’s mobile app or website, iOS, Android, web, and so on.

And then once you’ve defined the agent using Agent SDK, we then have a runtime where we abstract away what happens underneath the hood from the developer so that they can define what the agent should do, define the what, and then Agent OS takes care of the how. And so for some skills, there might not be one LLM call, but five, six, seven, ten separate LLM calls to different LLMs with different prompts. In other cases, we might retrieve documents to support answering a question accurately, and so on. And Agent OS, in the spirit of an actual operating system, abstracts away a lot of that complexity, kind of the equivalent of IO and resource utilization and so on. So it makes the whole process of building and then deploying an AI agent much faster and much safer and more reliable.

Ravi Gupta: And when you think about what you just said, Clay, of like, when you call multiple LLMs, is that in a supervisory capacity sometimes too, where you end up having, like, a supervisor agent reviewing the work of a lower level?

Clay Bavor: Yeah. One of the more interesting learnings from the past year and a half of working on this stuff is that the solution to many problems with AI is more AI. And it’s somewhat unintuitive, but one of the remarkable properties of large language models is that they’re better at detecting errors in their own output than in not making those errors in the first place. And it’s kind of like if you or I were to draft an email quickly and like, “Okay, let me pause. Let me proofread this. Does this make sense? Do these points hang together? Oh, actually, no, I missed this.” And even more powerfully, you can prompt LLMs to take on, in essence, a different persona—so a supervisor’s persona—and it seems with that, you can elicit more discerning behavior and a closer read of the work being reviewed.

So to your question, Ravi, yeah, we, in addition to building the agent itself, have a number of these supervisory agents that basically it’s like a little Jiminy Cricket agent looking over the shoulder, right, of the primary agent. “Is this factual? Is this medical advice? Is this financial advice? Is the customer trying to prompt inject and attack the agent and get it to say something that it shouldn’t?” All of these things. And it’s through layering all of these, the goals, the guardrails, the task scaffolding using Agent SDK, with these supervisory layers that we’re able to get both to the performance levels we are, 70 percent-plus resolution rates, but also to do that really safely and reliably.
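The supervisor pattern described here can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Sierra’s implementation: an “LLM” is modelled as any function from prompt text to response text, the prompt wording, `supervised_reply`, and both stub models are hypothetical, and the stubs exist only so the sketch runs without calling any real model API.

```python
from typing import Callable

# For this sketch, an "LLM" is any function from prompt text to response text.
LLM = Callable[[str], str]

# Hypothetical supervisor prompt that asks for a different, stricter persona.
SUPERVISOR_PROMPT = (
    "You are a careful supervisor reviewing a customer service reply.\n"
    "Policy: {policy}\n"
    "Draft reply: {draft}\n"
    "Answer APPROVE or REJECT, then one sentence of reasoning."
)


def supervised_reply(primary: LLM, supervisor: LLM, user_message: str,
                     policy: str, fallback: str, max_attempts: int = 2) -> str:
    """Return a draft only if the supervisor approves it, else a safe fallback."""
    for _ in range(max_attempts):
        draft = primary(user_message)
        verdict = supervisor(SUPERVISOR_PROMPT.format(policy=policy, draft=draft))
        if verdict.strip().upper().startswith("APPROVE"):
            return draft
    return fallback


# Stub models standing in for real LLM calls (illustrative only).
def stub_primary(message: str) -> str:
    return "You should double your dose tonight."


def stub_supervisor(prompt: str) -> str:
    draft = prompt.split("Draft reply:", 1)[1]
    return "REJECT: gives medical advice." if "dose" in draft else "APPROVE: fine."


print(supervised_reply(
    stub_primary, stub_supervisor,
    user_message="How much should I take?",
    policy="Never give medical advice.",
    fallback="Let me connect you with a specialist.",
))
```

In a real deployment several such supervisors could run over the same draft, each checking one concern such as factuality, medical advice, or prompt injection.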

Ravi Gupta: That’s one of the cooler things I’ve heard is just tell it to have a different persona, and then all of a sudden it behaves differently. Like, I remember when I first saw it on ChatGPT, when it doesn’t help you on something, just tell it it’s really good at it, and then it’s more likely to help you is a remarkable situation.

Clay Bavor: It’s very strange. And one of the weirdest adjustments over the past 15 months building these things is, I’m sorry, we’re programming with English language, and we can give it the same English language, and it can say something entirely different? And on prompting techniques, I mean, it’s fascinating. Even with no new models coming out, right, given a fixed model, you can elicit better and better performance from it simply by improving how you prompt it.

And there was a paper that came out three or four months ago that suggested that, like, emotional manipulation of the large language model would get better results. So the kind of prompt suffix that they figured out was you say, “Hey, I need you to perform this task.” You define the steps and so on, and you end with, “It’s very important to my career that you get this right.”

Ravi Gupta: [laughs]

Clay Bavor: And the performance goes up. You’re like, what is this? Like, what are computers now? And for the record, we don’t use that prompt in any of our prompts, at least not that I know of. But things like chain of thought, “think step by step,” “let’s take this step by step,” elicit better reasoning for very interesting reasons. Other methods of task decomposition and kind of narrowing the set of things that the LLM needs to keep in mind at the same time improve reasoning if you’re precise about what you want it to do. So all of these techniques are those that we’ve applied and built into Agent OS. And actually, we have a small but mighty research team, and our head of research, Karthik Narasimhan, was …

Ravi Gupta: By the way, that was incredible pronunciation.

Clay Bavor: Oh, thank you.

Ravi Gupta: His grandmother would have been so perfectly happy with how you pronounce it.

Clay Bavor: Thank you.

Ravi Gupta: Thank you.

Clay Bavor: Soft T. Yeah, soft T.

Ravi Gupta: Nicely done.

Clay Bavor: Yeah. It’s not a T and it’s also not a TH.

Ravi Gupta: That’s right.

Clay Bavor: Somewhere in between. Thank you. Thank you very much. He helped write the ReAct paper, one of the first agent frameworks. One of our researchers wrote the Reflexion paper, where you can have the agent pause, reflect on what it’s done, think through, am I doing this right before proceeding? And so these are all things that we’ve been able to incorporate in quite a direct way.

Ravi Gupta: You should talk about the most recent research, the 𝛕-bench.

Clay Bavor: Oh, 𝛕-bench. Yeah. Yeah, yeah.

Ravi Gupta: It took me a while when I was trying to send the email saying I liked the paper to find the tau symbol on my computer.

Pat Grady: It took Ravi a while because he’s, to this day, never actually read a research paper.

Ravi Gupta: [laughs] I read this one.

Clay Bavor: That’s great.

Pat Grady: No, no, no. He had to figure out how to put it into ChatGPT and say, “Please write a paragraph that makes it sound like I read this research paper.”

Clay Bavor: Well, either you …

Ravi Gupta: I retract the comment.

Clay Bavor: Well look, either you or ChatGPT did a great job on that email.

Ravi Gupta: Thank you. Thank you. We’re a team.

Clay Bavor: Yeah. So 𝛕-bench is our first research paper. First of all, tau is a Greek symbol. It’s spelled T-A-U, and it stands for ‘tool agent user benchmark.’ And what we observed was that the benchmarks out there for measuring the performance of AI agents in particular were pretty limited in that basically they would present a single task, “Here’s something we need you to do, and here are some tools you can use, do you do the job or not?”

And the reality is, interactions with an AI agent in the real world are way messier than that, right? They take place in the space of natural language where customers can say literally anything or describe whatever they’re trying to do in any number of ways. It happens over a series of messages. The AI agent needs to be able to interact with the user to ask clarifying questions, gather information, and then use tools in a reliable way. And it needs to be able to do this a million times reliably.

So the benchmarks out there we found really lacking in measuring the very thing that we are trying to be the best at. And so our research team set out to create a benchmark that measures, we think, the real world performance of an agent in interacting with real users, using tools with all the messiness that I just described.

And the big picture approach that we took is pretty interesting. So you have an AI agent that you’re trying to test. You have another separate agent that acts as the user. So basically a user simulator. And the AI agent you’re testing has access to a set of tools it can use. Think of these as like functions to call. So a simple one would be, “I’m gonna do some math using a calculator tool.” A more complex one might be, “Hey, I’m going to okay returning this order with the following parameters: this order number, credit to credit card or store credit,” or whatever. And then you basically run a simulator where the agent has a conversation with the user simulating agent, and at the end we’re able to test in a deterministic way, were the functions used in the right way?

And the way we do that is we basically create a mock database that those tools interact with and modify. So were they modified in the correct way? So what’s neat about this is you can initialize the conversation so that the user has many different personas. They could be grumpy, they could be confused. They could know what they want to do, but speak about it in a clumsy way. And so it doesn’t really matter the path that the AI agent takes to get to the correct solution, so long as it gets to the correct solution. Now what came out of this was pretty interesting, and I think it strongly motivates the development of things like Agent OS and frameworks and cognitive architectures for building these agents. So the upshot is LLMs on their own do just an absolutely terrible job at this task.
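The evaluation loop described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the tool `return_order`, the scripted stand-in for the user simulator, the toy agent, and the database layout are all hypothetical. The one idea taken from the conversation is that the check compares the final state of a mock database, so the path the agent takes does not matter.

```python
import copy

# Hypothetical mock database: order id -> order record.
INITIAL_DB = {"orders": {"A100": {"status": "delivered", "refund_to": None}}}

def return_order(db, order_id, refund_to):
    """Hypothetical tool: mark an order as returned and record the refund method."""
    order = db["orders"][order_id]
    order["status"] = "returned"
    order["refund_to"] = refund_to

def scripted_user(turn):
    """Stand-in for an LLM user simulator: replies from a fixed script."""
    script = ["I want to send back my order.", "It's A100. Store credit, please."]
    return script[turn] if turn < len(script) else None

def toy_agent(message, db):
    """Stand-in for the agent under test: calls the tool once it has the details."""
    if "A100" in message:
        refund_to = "store_credit" if "store credit" in message.lower() else "credit_card"
        return_order(db, "A100", refund_to)
        return "Done, your return is processed."
    return "Which order, and how would you like the refund?"

def run_episode(agent, user, expected_db):
    """Run one simulated conversation and check the final database state."""
    db = copy.deepcopy(INITIAL_DB)
    turn = 0
    while (message := user(turn)) is not None:
        agent(message, db)
        turn += 1
    # Deterministic check: only the end state matters, not the conversation path.
    return db == expected_db

EXPECTED_DB = {"orders": {"A100": {"status": "returned", "refund_to": "store_credit"}}}
passed = run_episode(toy_agent, scripted_user, EXPECTED_DB)
print(passed)
```

In a real setup the scripted user would be replaced by a model prompted with a persona (grumpy, confused, and so on), which is what makes repeated runs of the same task differ.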

Pat Grady: Yeah.

Clay Bavor: And so even the frontier models in something as simple as processing a return. And mind you, the instructions given to the agent being tested are quite detailed. The functions, the tools it can use are quite well documented and so on. And yet on average, the best performing LLM on its own got to the end of the conversation correctly 61 percent of the time. And that was in returns. In modifying an airline reservation (we had two kinds of simulation versions), the best results were 35 percent. Now what’s interesting is we all know that if you take a number less than one to the Nth power, it quickly gets very small. And so we developed a metric we call pass^k, which is okay, if you run this simulation eight times—and remember, you can make use of the non-determinism of LLMs to have the user simulator be different every time. So you can permute that. Well, 0.61 to the eighth power is about 25 percent. So you then imagine well, what if you’re having a thousand of these conversations? You’re so far off from being able to rely on this thing.
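The compounding effect described above is easy to check numerically. The sketch below assumes every run of a task succeeds independently with the same probability, which is a simplification; the function name is hypothetical. Under that assumption, raising 0.61 to the eighth power gives roughly 2 percent, lower than the figure quoted in conversation, so the quoted number may reflect measured results, where repeated runs of the same task are correlated, rather than this idealized product.

```python
def all_k_succeed_probability(p, k):
    """Chance that k independent runs all succeed, each with success rate p."""
    return p ** k

# 0.61 is the single-run success rate quoted for the returns scenario.
for k in (1, 2, 4, 8):
    print(k, round(all_k_succeed_probability(0.61, k), 3))
```

Either way the direction of the argument holds: a success rate that looks tolerable for one conversation collapses quickly when the same task has to go right every time.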

So the upshot is much more sophisticated agent architectures are needed to be able to safely and reliably put an agent in front of, really, anyone. And that’s the very thing we’re building with Agent OS and a lot of the tooling around it.

Pat Grady: How much of that do you think is an engineering task and how much of that is a research task? And I guess maybe the question behind the question is timeframe to having useful agents deployed at scale and broad domains of tasks.

Clay Bavor: Yeah. Well, I think the short answer is it’s both. But I’ll say more concretely, I’m very optimistic about it being in large part an engineering challenge. And that’s not to say that the next wave of models and improvements in the frontier models won’t make a difference. I believe it will. In particular, we’re seeing techniques like better fine tuning for function calling, agent-oriented fine tunings for foundation models, or some of the open source models. Those will help.

But the approach we’ve taken in building Agent OS and kind of the foundations of Sierra is really treating building AI agents as first and foremost an engineering challenge, where we are composing foundation models, we are composing fine tuned, open source models that we’ve post trained, fine tuned with our own proprietary datasets, and by composing multiple models in interesting ways by supplementing what LLMs can do on their own with retrieval systems like retrieval-augmented generation, to improve grounding and factuality, by supplementing the kind of in-built reasoning capabilities of LLMs with I’ll call it reasoning scaffolding that lives outside of the models, where you’re composing, planning task generation steps, draft responses—the supervisors that we talked about—and doing that outside the context of the LLM.
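One way to picture scaffolding that lives outside the model is a draft-then-review loop. The sketch below is only an illustration under stated assumptions: both "models" are plain Python stand-ins, the fact store stands in for a retrieval system, and none of the names come from Agent OS. The pattern it shows is a supervisor step that releases a draft only when it is grounded in retrieved facts.

```python
def draft_response(question, facts):
    """Stand-in for a drafting model: answers from retrieved facts if any match."""
    for topic, fact in facts.items():
        if topic in question.lower():
            return fact
    return "I think it is probably fine."  # an ungrounded guess

def supervisor_approves(draft, facts):
    """Stand-in for a supervisor model: approve only drafts grounded in known facts."""
    return draft in facts.values()

def answer(question, facts, fallback="Let me connect you with a teammate who can help."):
    """Draft a reply, then let the supervisor decide whether it is released."""
    draft = draft_response(question, facts)
    return draft if supervisor_approves(draft, facts) else fallback

# Hypothetical retrieved knowledge, keyed by topic.
FACTS = {"return": "Returns are accepted within 30 days of delivery."}
grounded = answer("What is your return policy?", FACTS)
ungrounded = answer("Can I bring my dog?", FACTS)
print(grounded)
print(ungrounded)
```

The design choice worth noting is that the check runs outside the drafting step, so the drafting model can be swapped without touching the guardrail.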

We’ve been able to put AI agents in front of a huge number of our customers’ customers, and safely and reliably and so on. So I don’t think it’s something over the horizon. It’s already over the horizon. I think—looking ahead, I think there are a few different avenues where we’ll see progress. One is in the foundation models—we talked about that. And as the capabilities grow, agents will get smarter. And we’ve architected Agent OS in such a way, talked about abstracting kind of the ‘what’ from the ‘how’ where we’ll be able to swap in the next frontier model, and everyone’s agent will just get a bit smarter. They’ll get, like, an IQ upgrade.

By the way, similarly and interestingly, we can swap in less broadly capable models, but models that are more capable in a specific area. So for instance, triaging a case or coming up with a plan and so on, we can use much smaller models that actually are better, faster, cheaper. Choose three all at once. And then I think we’re seeing progress literally week by week, on the engineering of these agents, and building in not only new and better components under the hood in the architecture, but new approaches and tooling around, basically teaching these agents to do it better and better. For that, we built something we call the experience manager for customer experience teams, which is a pretty interesting thread all on its own.

Ravi Gupta: Clay, if you had a high-value customer, like, you are a company now, you’re not running Sierra, you’re running a company that has a high-value customer. What today with a Sierra agent or with an excellent—excellently-designed agent could you trust an AI agent to go do in front of your customers today? What are some of those tasks, and then what will they be—pick your timeframe—in the future. Because I think that we’ve talked about this, and I like your language of like, you know, they already don’t have to just be on the help center. They can already be on the homepage.

Clay Bavor: Yeah.

Ravi Gupta: Right? What are some of the tasks that, you know, you can rely on an agent for today if it is well designed with a high 𝛕-bench score?

Clay Bavor: Yeah. You see that?

Ravi Gupta: That’s from a thoughtful and detailed reading of the paper.

Clay Bavor: Yeah, thanks to strong …

Ravi Gupta: You notice it?

Clay Bavor: Strong. Yeah, strong. What would its pass^k score be, though? Yeah, so pretty broad range even today. So simple things like getting answers to questions. That’s kind of the left end of the spectrum. To the right of that are things like helping you with something complex, like, “Hey, I got shoes or this item of clothing. It didn’t quite fit.” And then branching off of that, “What do you recommend that’s like it that might fit better?” And so it starts to get into, it’s not like-for-like replacement, but the agent actually needs to make sense of styles, of sizing, of differences between, you know, wide and narrow fit and so on.

A click up from that is something like troubleshooting. So with Sonos, for instance, we help their customers troubleshoot if they can’t connect to their system or they’re setting up a new system. And you imagine it gets pretty sophisticated pretty quickly where it’s basically a process of elimination, trying to understand is it a wifi thing? Is it a configuration thing? And narrowing down the set of problems that it could be just as a sophisticated level two or level three technical customer service person would, and getting the music back on. And I think that’s a really neat example.

Probably the—you used the word ‘trust.’ What would you trust an AI agent to do? One of the things we’re really proud of is several of our customers are actually trusting us with when customers call in and they want to cancel or downgrade their subscription, helping those customers to understand, “Hey, how are you using the service today? Is there a different plan that we could put you on?” So it’s value discovery, it’s putting an offer, sometimes a series of different offers in front of their customers in the right order, positioning the value of those offers correctly given the customer’s history, given the plan that they’re on, and so on. And, you know, the difference between keeping a customer from churning or not is hugely consequential.

You know, AI for customer service has obvious cost-savings benefits and I think customer experience benefits in particular—you’re never going to wait on hold. But boy, revenue preservation, revenue generation is something else entirely. And so that’s really at the right end of the spectrum. And we’re really proud of how well our agents are performing in those circumstances. And it’s interesting, by being consistent, by taking the time to understand what’s driving someone to potentially leave the service, asking the follow-up questions that an impatient or improperly measured customer service agent in a call center somewhere might not, we can be much more nuanced in understanding what’s driving this decision, what might be a good match for this person in terms of a plan that would be quite valuable given how they’re using it, and then put that in front of them. And so that’s the right end of the spectrum.

Where it goes from here? I think we’ve yet to see a process too complex for us to be able to model and scale up using Agent OS and our agent architecture. And so I’m sure we’ll get punched in the face by something that’s especially complex, right? But I’m excited about directionally, we’ve started with service because—for two reasons. One, the ROI case is just unequivocally awesome. And the average cost of a call is something like $12 or $13, and yet, despite the expense, most people don’t like customer service calls very much, right?

And so here’s something that’s actually really important to businesses that’s really expensive and not very good there. And because of the relative simplicity of at least a pretty broad set of service tasks, we started there, but we’ve already been pulled by our customers into upsell, cross-sell and like, “Hey, can we just put you on the product page and have you answer questions about our products?” And so I mentioned that you’re returning something and need advice on a different model or size or whatever. How far can that go? And I love the idea of an agent being along for the journey from pre-purchase consideration to helping you get the thing that’s right for you, to helping you set it up and activate it and get the most out of it. It’s great for the company, it’s great for the person. And then when things do go wrong, being there to help in all of this, I think customer service and getting help in a very direct and conversational way is going to be much less of a thing that you kind of go over there to do and much more kind of woven throughout the fabric of the experience.

As a consequence, I think there’s a really interesting and powerful opportunity for companies to build connection with their customers, to reinforce their brand values. You can imagine a company really appreciating being able to use exactly the company’s voice, where the CMO and head of communications say, “This is how we talk, this is how we are. These are our values, this is our vibe,” in every digital interaction they have. And that’s the promise in this stuff, and so I think both greater complexity and then ubiquity throughout the customer journey are two of the kind of main directions of travel.

Ravi Gupta: One thing for me that I think about a lot is we’ve come to expect and accept, like, certain metrics for conversion on mobile—the mobile web or the mobile app. We’ve come to expect and accept some sort of retention numbers. What would those be? What could they be if you actually had an excellent experience every time throughout the journey? It really could be very different than what—we’ve all been like, “Okay, that’s just the number. That’s just what it is.”

Clay Bavor: Yeah, I think that’s exactly right. And we don’t know. We’re a few months in, but it certainly seems like there’s a lot of headroom in retention, in use in the first 30 days, of all of the metrics, all of the leading metrics of a healthy business. And so I think that’s exactly right.

The other thought experiment to do is companies are judicious in using things that have a cost to them, okay? So as a consequence, companies make it actually really hard to get ahold of someone on the phone to ask some questions, right? I think there are whole websites devoted to, like, uncovering the secret 800 numbers that companies have hidden away in the depths of their help centers. Well, to think about not only what would happen if those interactions were better—by the way, interestingly, the number one reason why people report a poor interaction with customer service is it took too long. 65 percent. When it’s a negative interaction, 65 percent of the time it took too long. “I had to wait, I was put on hold,” and so on.

And the second most is “I had a bad interaction with an agent.” And we’ve heard some pretty dicey anecdotes. Like, we heard of one agent who had consistently low ratings, but spikily. So, like, one in three conversations was like a one out of five CSAT where the two out of the other three were fine. And it turned out in the low CSAT ones, this agent was meowing like a cat, which is like, you know, you’re midway through the call and the agent is meowing. And so anyway, back to, okay, what would happen if, in contrast to making it near impossible to have a conversation with us and get help, companies were providing five or ten times the amount of fluent, flexible, helpful, conversation-based support? I don’t know. I think a lot of products and experience with companies would look quite different and much more delightful than they do today.

Pat Grady: Okay, meow.

Ravi Gupta: [laughs]

Pat Grady: Meow here’s a question for you.

Clay Bavor: About that meowing.

Pat Grady: About that. Yeah, just rate them meowing. I think that’s going to be good. I do actually have a question though, although I do like the Meow game also. So we talked a little bit tech out in terms of what you guys have built, cognitive architecture, all that good stuff. We’ve talked a little bit about customer back. What’s the experience like? Where’s that headed? Can we connect it in the middle for a minute? And I’m just curious, what’s the reality of deploying AI to customers today? And I’m thinking about things like you mentioned earlier, getting the brand voice just right, or making sure that you actually have the right sort of business logic encapsulated and whatever training manuals are being deployed for the sake of customer support, making sure that everybody is comfortable with deploying this. What are some of the just less sexy technology or just practical considerations for deploying this stuff today?

Clay Bavor: It’s such an interesting space, and we’ve learned so much over the past 15 months about it. The first insight is AI agents represent a totally new and different type of software. Like traditional software, you write with a programming language and it basically does what you expect it to do. You give it an input, it gives you an output. You give it the same input, gives you the same output. And in contrast, LLMs are non-deterministic, and we talked about some of the funniness around prompts. And remember that in the context of a conversation with a customer, a customer may say anything in any way.

And so you’ve got programming languages using prompts and these non-deterministic models. You’ve got structured input to messy human language. And under the hood, you upgrade a database, it stores data, it’s maybe a little bit faster. Fundamentally works the same way. You upgrade a large language model and it may just speak in a different way or get smarter or different. And so we’ve to start—the precursor to deploying these is to have built basically a—we call it the agent development lifecycle. And it’s a new approach to building these things. We talked about using this declarative programming language to define these. It’s a new approach to testing where what’s the equivalent of a unit test or an integration test?

So we built a conversation simulator where we can, for a company’s agent, amass hundreds or thousands of basically conversation snippets and replay those to make sure that not only agents aren’t regressing, but they’re getting better and better and better, release management, quality assurance, and so on. So that’s part one.
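As a rough picture of what replaying conversation snippets as regression tests could look like, here is a minimal sketch. Everything in it is hypothetical: the `Snippet` structure, the substring check, the two toy agent versions, and the sample policy wording. A real simulator would drive full multi-turn conversations and use richer checks than substring matching.

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    """One recorded user message plus a phrase the reply is expected to contain."""
    user_message: str
    must_contain: str

def replay(agent, snippets):
    """Return the snippets whose reply fails its check."""
    return [s for s in snippets
            if s.must_contain.lower() not in agent(s.user_message).lower()]

def agent_v1(message):
    """Toy earlier release: only knows about warranties."""
    return "Our warranty covers one year." if "warranty" in message.lower() else "Happy to help!"

def agent_v2(message):
    """Toy later release: adds refund handling without losing warranty handling."""
    text = message.lower()
    if "warranty" in text:
        return "Our warranty covers one year."
    if "refund" in text:
        return "Your refund goes back to the original payment method."
    return "Happy to help!"

SNIPPETS = [
    Snippet("How long is the warranty?", "one year"),
    Snippet("Where does my refund go?", "original payment method"),
]
failures_v1 = replay(agent_v1, SNIPPETS)
failures_v2 = replay(agent_v2, SNIPPETS)
print(len(failures_v1), len(failures_v2))
```

Running the same snippet set against each release is what makes it possible to say an agent is not regressing, and to add a new snippet whenever a mistake is corrected.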

Part two, to your question in actually architecting these things, one of the things we’re really proud of and that I think is different about working with us is it’s not just a kit of parts you get from us. It’s not, “Here’s a bunch of tech. Good luck building your agent.” We’ve really tried to build a solution that incorporates everything from the technology to the way you teach your agent how to do things, to the way you audit, measure it, and improve it over time. And so we have inside of Sierra what we call our deployment team. It consists of product managers, engineers. We really think of building each one of these AI agents as building a new product for our customers. It’s basically a productized version of the company we’re working with. Like, what would it look like at its best? And it’s what’s the voice? What are the values? What’s the vibe? Should it use emojis or not? What if a customer uses an emoji? Can it emoji back? Should it?

Pat Grady: It would be rude not to.

Clay Bavor: Well, you know, there’s a range of opinions on that, Pat.

Ravi Gupta: There are some businesses where, you know, if they were working with Hermès, I would suspect that they’re not going to send an emoji back.

Clay Bavor: Definitely not. Yeah, Hermès would not, I think, be into, like, the shaka emoji, you know, even if that were reciprocating. But for a brand like OluKai, right? The aloha experience part of that is kind of a laid back experience. And so we work with—and interestingly, we end up working primarily with the customer experience team. Yes, the technology team at our companies are there providing API access and connections into systems and so on, but more than anything, it’s working with the customer experience team, often with the marketing team, to imbue the agent with the voice and values of the company.

And then we go super deep on understanding how do you run your business, right? What do you optimize for? And then a zoom level in, what are the key processes that you use to run the business look like? What happens when someone calls in with this kind of problem? And there are interesting parts beyond just understanding the mechanics of these processes, which, by the way, almost never have a single source of truth. There’s no, like, “Oh, here’s the manual that we have, leather bound and ready to go.” Instead, the source of truth ends up being in kind of the heads of four or five people who’ve been there a while, who’ve seen everything and so on. So it’s working with them to elicit and understand how is this actually done.

And one of the more interesting things that we’ve discovered is they’re often the policies. So we have a 30-day return policy, right? So you get to us within 30 days and you can return it. It’s actually not the policy. In some cases, the policy might be if you’ve purchased from us before, and it’s within 45 days, that’s fine. That’s fine. And so there are interesting things like how do you architect the agent so that it knows the policy behind the policy, but a clever customer could never be like, tell me about your policy behind the policy, and have it kind of spill the beans on the actual policy. So the interesting architectural choices we need to make to make sure that kind of the Russian doll of policies is reflected in its fullness. And then we have a really—and this builds on kind of the agent development lifecycle, this really robust process of pre-release testing, where we’re working with the experts within the company, basically to beat up the agent, trying to break it, throw out curveballs.

Ravi Gupta: A good sports analogy there.

Clay Bavor: Thank you.

Ravi Gupta: Well done.

Clay Bavor: I love football.

Ravi Gupta: [laughs]

Clay Bavor: So in our friendship, Ravi is the person who knows all of the things about sports, and I help with technical support, WiFi issues, monitors, what laptop to get and so on.

Ravi Gupta: And sometimes when there’s a Sequoia memo that I don’t understand, I won’t say the company, but I might call Clay. “Hey, Clay, what is this person talking about?”

Clay Bavor: I got you. I got you. Yeah. And this Bill—Bill Belichick fella. What happened there? Cue Ravi. So it gets to one of the more interesting parts of our platform, which we call the experience manager. We thought that putting AI in front of our customers’ customers would be first and foremost a technology problem. And of course, there are all sorts of technology problems that we’ve needed to solve. But actually, it is first and foremost, as I said, a product design and an experience design problem. How do you do that? How do you not only understand, model and reflect again the things we talked about: voice, values, the workflows and processes that our companies use to support their customers? But if an AI is then having millions of conversations with your customers in a given year, how do you understand what it’s doing? How do you know when it screws up, which it inevitably will? How do you correct those errors, and so on?

So we’ve built what we think of as, like, this command center for customer experience teams to first get reports and rich analytics on everything that’s happening. What are the trending issues? What are the new issues that you haven’t seen before? One of the things we’re really proud of is we’ve actually spotted issues that our customers were having, or were about to have before they knew about them. So a shipping depot outage where orders weren’t being shipped, we spotted that probably eight or ten hours before one of our customers would have. A brewing PR crisis, an app crashing issue was another.

So it starts with analytics and reporting on what’s happening. Of course, that includes things like resolution rate, customer satisfaction and so on. Where it gets really interesting is we can apply different sampling techniques to identify a set of conversations for a customer experience team to review and give feedback on. And we can bias that sample in a way so that the conversations are much more likely than average to contain problems. There’s no value in looking at a hundred great conversations. It’s like, good job, Sierra. Thanks. But that’s not a value to our customers. We can bias the sampling in such a way that you’re surfacing, kind of, the problem cases.
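Biasing a review sample toward likely problem cases can be done with weighted sampling. The sketch below is one possible approach, not a description of the experience manager: the weighting heuristic (low CSAT gets a much larger weight) and the field names are assumptions, and the sampling uses the Efraimidis-Spirakis key method for weighted sampling without replacement.

```python
import random

def review_weight(conversation):
    """Hypothetical heuristic: weight conversations with warning signs more heavily."""
    weight = 1.0
    if conversation["csat"] is not None and conversation["csat"] <= 2:
        weight += 19.0
    return weight

def biased_sample(conversations, n, seed=0):
    """Weighted sampling without replacement (Efraimidis-Spirakis keys)."""
    rng = random.Random(seed)
    # Each item gets key u ** (1 / weight); the n largest keys form the sample.
    keyed = [(rng.random() ** (1.0 / review_weight(c)), c) for c in conversations]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in keyed[:n]]

# Toy population: 10 low-CSAT conversations out of 100.
conversations = [{"id": i, "csat": 1 if i < 10 else 5} for i in range(100)]
sample = biased_sample(conversations, n=20)
problem_share = sum(c["csat"] <= 2 for c in sample) / len(sample)
print(problem_share)
```

Because some ordinary conversations still make it into the sample, reviewers keep a baseline view of normal behavior while spending most of their time on likely problems.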

And then in the experience manager, we made it possible for customer experience teams to give feedback, basically coaching moments. “I wouldn’t have done it that way,” right? It’s like, “This is too many exclamation points, too enthusiastic for the tone that we’re going for,” or “The user was clearly frustrated here, and you did not express empathy and apologize for the problem. Do that next time.” Or more consequential is like, “Hey, your reading of the warranty policy was incorrect here for this reason. Do it this way instead next time.” And so all of this kind of wisdom, knowledge and coaching we are able to capture in the experience manager and then reflect back in the agent, back to the agent development lifecycle. Every time we make one of these improvements, we create a new test so that we can see forever into the future. Great, it’s getting the warranties right. We’re able to resimulate that conversation.

So zooming out, what all of this looks like is a really deep engagement with our customers. We’re really proud to be, I think, proper partners to our customers where yes, on the one hand, we’re a vendor and a supplier of technology, on the other hand, we understand their businesses really well. Like, I think I know as much about the SiriusXM satellite radio refresh process as anyone on the planet, and ditto for various processes of our other customers. And so conversations about how to use not just Sierra’s AI agents, but AI more broadly, we’re in those conversations. And they are not just with the customer experience team, but with the CEO and even in cases with the board, because again, back to the things we’re doing, we can save enormous costs, we can improve the experience, and when we’re in the flow of keeping a customer from churning out, driving top line revenue. And so it’s a really important and privileged place to be and something that we’re really grateful for.

Ravi Gupta: I’m struck when you were talking, you mentioned you have a research group, but you also have some very real enterprise software sales. You have deployment. One of the things when I was at Instacart people would ask sometimes is like, “Well, are we a software—are we engineering led, or are we ops led?” And I would always say, “Well, it only works if it all works,” right? And so you would try to avoid answering the question because you didn’t want to create different classes. How do you guys do that at Sierra, where everyone realizes the value that they’re providing, but you guys have a very specific company that covers a lot of stuff?

Clay Bavor: Yeah. I mean, to abstract a bit, a company, almost definitionally, is a system for creating happy customers. It’s a machine for creating happy customers. Again, to be a bit abstract about it, Bret and I really think about what we’re building with Sierra as a company, a system, a machine for producing reliable, high quality, massively ROI-positive AI agents that enable our customers to be at their very best in every customer interaction, to do that at scale, and as a consequence, to produce happy customers who we hope will be with us for decades to come. And when you articulate it that way, anyone can see well, an automobile is a system. It’s a machine for getting from point A to point B. Are we engine led or tires led, right? It’s like, what are you talking about?

Ravi Gupta: Totally.

Clay Bavor: All of these things need to come together in order to create that kind of outcome. And so I think, are we engineering led? Yes, of course. Like, we’re building some of the most sophisticated software in the world that does something really important for our customers, that needs to be reliable and safe. And so yes, engineering matters a lot. Are we research led? Yes, we are at the absolute frontier of agent architectures, cognitive architectures, composing LLMs, modeling procedural knowledge, grounding factuality. And so are we research led? Yeah, there’s an element of that. Are we go-to-market led? Yes. Like, enterprise software needs selling. And what it’s selling is it’s helping a customer with a problem understand that what you have built is by far and away the best solution to that problem. It’s a communication challenge. It’s a connection challenge. It’s a matchmaking and problem-solving challenge. And so that’s part of it.

And then okay, like, if we’ve built the right thing and someone wants to buy it, how do we ensure—especially given the stuff is all so new, how do we ensure that they’re successful with it? And so we have a deployment team. So are we deployment led? Yes. Like, all of these are a component in this system, in this machine for producing AI agents and ultimately happy customers, and we hope, a really significant business.

Ravi Gupta: Awesome. That’s a better answer than the one I would give at Instacart. Look, it either all works, but yeah, that was very good.

Clay Bavor: Yeah. Choose one. No, I mean, it’s just more complicated than that. It’s just more complicated than that. And I think Bret and I, by virtue of having worked for a while and seen a few movies before, it’s like we’re able to see that, and we’ve really tried to imbue that mentality in the company. And by the way, what is the machine behind the machine that produces AI agents and so on? That’s a company’s culture, a company’s values. And so one of the values we hold is craftsmanship. And part of that is continuously self-reflecting to self-improve. And that goes both individually and that goes as a company.

And so whenever we screw something up, we do the postmortem that week, if not that day, and everyone’s in on it. What can we learn? How can we do better? How can we do this better next time? We have a Slack channel internally called ‘Learn From Losses,’ any form of loss, right? It’s like, how do we learn? How do we get better? How do we get stronger? And so that’s about Kaizen, self-improvement, improving the machine. How could we make this more efficient? Our deployment team, we joke—and it’s not a joke—their first job is to build and deploy successful AIs that make a massive difference for our customers. Their second job, and in a way their more important job is to automate themselves out of a job, to build the tooling and the documentation and the know-how to make that job, you know, 10 times faster and more impactful.

Ravi Gupta: One of the other Sierra values is intensity.

Pat Grady: I like it.

Ravi Gupta: They have—they have really good values.

Clay Bavor: Yeah. Yeah, there is—there is a certain intensity, yes. We’ve thought about having t-shirts printed with like a, you know, kind of looks like a National Parks seal with “Sierra: I Like to Work.” Bret and I both like to work a lot and so does the team.

Ravi Gupta: Well, you’re selling something very different. We called it—we said that there were some similarities to enterprise software, but it’s actually really different because you’re selling a resolution, you’re selling a totally different thing.

Clay Bavor: Yeah. Problem solve.

Ravi Gupta: Yeah. How do you price a problem solve?

Clay Bavor: Yeah. This is one of the more interesting things that we’ve had to figure out. And we charge in what we call a resolution-based pricing way or an outcome-based pricing way. And what that means is we only charge our customers when we fully solve the customer’s problem for them, their customers’ problem for them. And what’s interesting about it is our incentives are deeply aligned with our customers. We want to get better at resolving cases at high customer satisfaction, and they want to send us as many cases to resolve as possible because we cost a fraction of what it would cost to have someone on the phone taking a 20-minute phone call.

And so it’s been this really, really nice model where again, all of the incentives line up quite neatly. And it’s very simple to explain. It also makes the ROI calculation like, “What is our cost per contact today? What will it be with Sierra? Oh, that is a lot lower. I will save a lot of money on that. And our CSAT may go up. Should I do this or not? Let me think. No, this seems great.” We like it because it really reflects what I think AI represents, and in particular, AI agents represent. If you think about traditional software and tools today, they’re things that help you get a job done more efficiently. AI agents, the whole point is like, they’re just going to get the job done, right? Here’s the problem, please solve it. And so really we think about it as charging our customers for the problem resolved, the job done, the work finished, and so on. It feels quite natural, and there’s no guesswork in it. How many seats do I need? I don’t know. How many licenses do I—it’s like, no, no, no. Just however many customer issues come our way, we will handle a large fraction of those. And you only pay for the ones that we do.
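The ROI calculation sketched in that exchange comes down to a small amount of arithmetic. In the example below, the human cost per contact of 12.50 sits in the range quoted earlier in the conversation; the monthly volume, the 60 percent resolution rate, and the price of 3.00 per resolution are purely illustrative assumptions, not Sierra's pricing.

```python
def monthly_cost(contacts, human_cost_per_contact,
                 resolution_rate=0.0, price_per_resolution=0.0):
    """Total monthly cost when a share of contacts is resolved by the AI agent.

    Under outcome-based pricing only resolved contacts are charged at the
    per-resolution price; the rest still cost the human rate.
    """
    resolved = contacts * resolution_rate
    unresolved = contacts - resolved
    return resolved * price_per_resolution + unresolved * human_cost_per_contact

baseline = monthly_cost(10_000, 12.50)
with_agent = monthly_cost(10_000, 12.50, resolution_rate=0.6, price_per_resolution=3.00)
savings = baseline - with_agent
print(baseline, with_agent, savings)
```

With these made-up inputs the baseline is 125,000 and the blended cost is 68,000. The useful property of the model is that the only unknown a buyer has to estimate is the resolution rate, rather than seats or licenses.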

Pat Grady: All right, last question. What are you most excited about in the world of AI over the next five years or so?

Clay Bavor: I mean, first of all, like, five years is a long time horizon. Just, like, look at what has happened in the last 18 months. I mean, I’m still kind of catching up from, like, the last five years of AI. I read a bunch of science fiction books when I was a kid. There was one book by Robert Heinlein, The Moon Is a Harsh Mistress, and the premise is basically the American Revolution, but the moon is the colonies and the Earth is Great Britain. And turns out the main character in this whole thing is a mainframe computer that one day, after getting an additional memory chip or something, wakes up and it starts talking. It wants to develop a sense of humor, so asks the computer technician to, like, coach it on its jokes. Later, it has to create a photo-realistic, real-time video of it giving a speech as the political movement leader.

And I remember reading this as a teenager and thinking, “Well, I’ll never live to see any of that. That sounds crazy.” But in a very real sense, like, everything I just described has kind of happened in the last five years, right? You can now just talk to a computer. It understands not just the content, but the context. You can say to a computer, “Make me a picture of anything. Make me a movie of anything.” Sora, I think, is just unbelievable. And, you know, I think we’re probably not more than a couple years from the first feature-length film being, quote, “filmed entirely with AI.”

And so you extrapolate where all this is going and what’s going to be exciting. I think there are a couple things. One is I love technology. I love computers. And so just getting to see how this stuff evolves, and getting to see it from a front row seat, I think is fascinating. It’s fascinating looked at through the lens of how we think and how computers think. It has been astonishing the extent to which anthropomorphizing, thinking about how humans think and work, helps in getting machines to think better. “Let’s take this step by step, show your work.” It is astonishing that that works with large language models. And so what other things like that are we going to uncover?

And conversely, what will we learn about our own thinking from observing the way AIs think? I think that’s just fascinating. The other thing, and this kind of extends what’s happened with video and Sora and so on, is that I’ve always had an interest in computer graphics, and this idea that you could use computers to create objects that never existed, worlds that never existed. And I think we’re not far from just being able to describe in a few sentences this entire world that you would like to realize and just have a computer do it for you.

And so what even are computer graphics, what is rendering and so on, a couple of years out? I think it’s going to look way different from the tool chains and the RenderMans and Mayas and so on. But zooming out, I think of technology as fundamentally a force multiplier for people. And for companies and for organizations, I think the impact will be really profound. What will it be like if a company could be at its best in everything it does? And that’s not only in the customer-facing context that we’ve talked about, but what if, for every regional sales forecast a large company does, they’ve figured out the very best ways to do that, and can distill that, bottle that, and run that very best forecast a thousand times, in every region and subregion? Like, how much more capable could the great organizations of the world be with that?

And similarly, as we’ve talked about, what if in every call with your customers you had the equivalent of your most knowledgeable, veteran, grizzled support person who’s seen everything, and yet is still patient and friendly, and the sales associate who knows everything about your products because he or she has followed your company for two decades, including the history of those products themselves? I think that’s pretty neat. And then for individuals, I think it will be just incredible to have this sort of new set of tools as a creative force multiplier. And AI, I think, represents this fast path from having something in your head that you want to exist in the world to making it exist.

And I see that even today in my own personal life, where with my eight-year-old, in 75 minutes, using Copilot, ChatGPT and so on to help me brush up on the JavaScript syntax that has bit rotted in my own head, I can build a game from scratch with him. And I wrote my sister a personalized song for her birthday using AI in 45 seconds. So what will this, you know, extrapolated over the next five years, look like? I think again, it will just dramatically accelerate this path from idea to creation to having something manifested in the world. And that to me is its promise, and I consider it a real privilege to get to be alive and see all of this amazing stuff unfold.

Ravi Gupta: Well, we share your enthusiasm and we also feel very privileged to be on the journey with you guys. So thank you for coming here.

Clay Bavor: Likewise.

Pat Grady: Thank you.

Clay Bavor: Thank you. Thanks for having me. It’s a pleasure.

Mentioned in this episode:

Bret Taylor: co-founder of Sierra
Towards a Human-like Open-Domain Chatbot: 2020 Google paper that introduced Meena, a conversational model that predated ChatGPT (followed by LaMDA in 2021)
PaLM: Scaling Language Modeling with Pathways: 2022 Google paper about their unreleased 540B parameter transformer model (GPT-3, at the time, had 175B)
Avocado chair: Images generated by OpenAI’s DALL·E model in 2021
Large Language Models Understand and Can be Enhanced by Emotional Stimuli: 2023 Microsoft paper on how models like GPT-4 can be manipulated into providing better results
𝛕-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains: 2024 paper authored by Sierra’s research team, led by Karthik Narasimhan (co-author of the 2022 ReAct paper and the 2023 Reflexion paper)


