ElevenLabs’ Mati Staniszewski: Why Voice Will Be the Fundamental Interface for Tech
Training Data: Ep52
Mati Staniszewski, co-founder and CEO of ElevenLabs, explains how staying laser-focused on audio innovation has allowed his company to thrive despite the push into multimodality from foundation models. From a high school friendship in Poland to building one of the fastest-growing AI companies, Mati shares how ElevenLabs transformed text-to-speech with contextual understanding and emotional delivery. He discusses the company’s viral moments (from Harry Potter by Balenciaga to powering Darth Vader in Fortnite), and explains how ElevenLabs is creating the infrastructure for voice agents and real-time translation that could eliminate language barriers worldwide.
Summary
Mati Staniszewski, co-founder and CEO of ElevenLabs, built one of the leading AI voice generation companies by staying laser-focused on audio while major foundation model labs pursued broader multimodal approaches. His journey from a Polish high school friendship to creating technology that powers everything from audiobook narration to Darth Vader’s voice in Fortnite demonstrates how specialized focus, global talent acquisition and prosumer-first distribution can carve out a defensible position against well-funded competitors.
Stay focused on your domain while foundation models go broad: The temptation to expand into adjacent areas is strong, especially when competing against well-funded foundation model companies. ElevenLabs survived the “roadkill” predictions by maintaining relentless focus on audio research and products. While others chased general-purpose models, they built the best text-to-speech technology by applying transformers and diffusion models specifically to audio—a domain that had received far less research attention than text and images. This specialization allowed them to consistently outperform larger competitors in their chosen field.
Hire the best talent globally, regardless of location: Audio AI expertise is scarce—perhaps only 50 to 100 truly exceptional researchers worldwide. ElevenLabs went fully remote from day one to access this limited talent pool wherever it existed, rather than constraining themselves to a single geography. They also structured their organization so researchers work closely with deployment, creating tight feedback loops between innovation and real-world application. This combination of global talent access and rapid iteration cycles helped them stay ahead of competitors with larger budgets but more constrained hiring.
Prosumer distribution creates unexpected use cases and enterprise demand: Rather than focusing solely on enterprise sales, ElevenLabs consistently released new models to prosumers first. This strategy served dual purposes: discovering unexpected use cases they never would have imagined (like the viral Harry Potter Balenciaga video) and creating bottom-up demand that made enterprise adoption easier. Each viral moment—from book authors copying entire manuscripts into their text box to creators building “no face” YouTube channels—revealed new market opportunities and validated the technology’s capabilities to potential enterprise customers.
Build the full stack around your core innovation: Having breakthrough AI models isn’t enough—you need the entire product ecosystem to make that technology usable. ElevenLabs built voice coaching teams, specialized data labeling processes and comprehensive developer tools around their core audio models. They also created integrations with everything from Twilio to CRM systems, recognizing that enterprise customers need their AI to work within existing workflows. This full-stack approach creates switching costs and makes it harder for competitors to replicate their offering with models alone.
Solve the data problem before it becomes your bottleneck: Audio AI requires fundamentally different data than text models—high-quality recordings with accurate transcriptions, emotional labeling, and contextual understanding of how things should be said, not just what was said. ElevenLabs invested heavily in creating proprietary data pipelines, working with voice coaches to label emotional content, and building speech-to-text systems that could capture nuanced delivery. They recognized early that data quality and labeling methodology would be more defensible than model architecture alone, especially as foundational models commoditized basic capabilities.
Transcript
Chapters
- How to not be roadkill
- Building with your best friend
- The audio breakthrough
- Hiring talent
- The history of viral moments
- The rise of voice agents
- What are the bottlenecks?
- Competing with foundation models
- What do customers care about most?
- The Turing test for voice
- Voice as the new default interaction mode
- Protecting against impersonation
- Pros and cons of building in Europe
- Lightning round
- Mentioned in this episode
Mati Staniszewski: Late 2021, the inspiration came when Piotr was about to watch a movie with his girlfriend. She didn’t speak English, so they turned it on in Polish. And that kind of brought us back to something we grew up with, where every foreign movie you watch in Polish has all the voices—whether it’s a male voice or a female voice—still narrated by one single character.
Pat Grady: [laughs]
Mati Staniszewski: Like, monotonous narration. It’s a horrible experience. And it still happens today. And it was like, wow, we think this will change. This will change. We think the technology and what will happen with some of the innovations will allow us to enjoy that content in the original delivery, in the original incredible voice. And let’s make it happen and change it.
Pat Grady: Greetings. Today we’re talking with Mati Staniszewski from ElevenLabs about how they’ve carved out a defensible position in AI audio even as the big foundation model labs expand into voice as part of their push to multimodality.
We dig into the technical differences between building voice AI versus text. It turns out they’re surprisingly different in terms of the data and the architectures. Mati walks us through how ElevenLabs has stayed competitive by focusing narrowly on audio, including some of the specific engineering hurdles they’ve had to overcome, and what enterprise customers actually care about beyond the benchmarks.
We also explore the future of voice as an interface, the challenges of building AI agents that can handle real conversations, and AI’s potential to break language barriers. Mati shares his thoughts on building a company in Europe and why he thinks we might hit human-level voice interaction sooner than expected.
We hope you enjoy the show.
How to not be roadkill
Pat Grady: Mati, welcome to the show.
Mati Staniszewski: Thank you for having me.
Pat Grady: All right, first question: There was a school of thought a few years ago when ElevenLabs really started ripping that you guys were going to be roadkill for the foundation models. And yet here you are still doing pretty well. What happened? Like, how were you able to stave off the multimodality, you know, big foundation model labs and kind of carve out this really interesting position for yourselves?
Mati Staniszewski: It’s been an exciting few years, and it’s definitely true we still need to stay on our toes to keep winning the fight against the foundation models. But I think the usual, and definitely true, advice is staying focused, which in our case means staying focused on audio. Both as a company, of course, and in the research and the product, we ultimately stayed focused on audio, which really helped.
But, you know, probably the biggest question under that question is how, through the years, we’ve been able to build some of the best research models and outcompete the big labs. And here, you know, credit to my co-founder Piotr, who I think is a genius, who has been able to both make some of the first innovations in the space and then assemble the rock star team that we have today at the company, which is continually pushing what’s possible in audio.
And, you know, when we started, there was very little research done in audio. Most people focused on LLMs. Some people focused on image. It was a lot easier to see the results there, and frequently more exciting for people doing research to work in those fields. So there was a lot less focus put on audio. And the set of innovations that happened in the years prior, the diffusion models and the transformer models, weren’t really applied to that domain in an efficient way. And we were able to bring that in during those first years, where for the first time the text-to-speech models were able to understand the context of the text and deliver the audio experience with just much better tonality and emotion.
So that was the starting point that really differentiated our work from others, which was the true research innovation. But then, fast following that first piece, was building all the product around it to be able to actually use that research. As we’ve seen so many times, it’s not only the model that matters; it also matters how you deliver that experience to the user.
Pat Grady: Yeah.
Mati Staniszewski: And in our case, whether it’s narrating and creating audiobooks, whether it’s voiceovers, whether it’s turning movies to other languages, whether it’s adding text to speech in the agents or building the entire conversational experience, that layer keeps helping us win against the foundation models and hyperscalers.
Building with your best friend
Pat Grady: Okay, there’s a lot here and we’re going to come back and dig in on a bunch of aspects of that. But you mentioned your co-founder, Piotr. I believe you guys met in high school in Poland, is that right? Can you kind of tell us the origin story of how you two got to know each other, and then maybe the origin story of how this business came together?
Mati Staniszewski: I’m probably in the luckiest position ever. We met 15 years ago in high school. We started an IB class in Poland, in Warsaw, and took all the same classes. We hit it off pretty quickly in some of the mathematics classes. We both loved mathematics, so we started sitting together, spending a lot of time together, and that kind of morphed into time together outside of school as well. And then over the years, we kind of did it all: living together, studying together, working together, traveling together. And now, 15 years in, we are still best friends. Time is on our side, which is helpful.
Pat Grady: Has building a company together strengthened the relationship, or …
Mati Staniszewski: There were ups and downs for sure, but I think it did. I think it did. I think it battle tested it. Definitely battle tested it. And, you know, when the company started taking off, it was hard to know how long the horizon of this intense work would last. Initially it was like, okay, this is the next four weeks. We just need to push, trust each other that we’ll do well on different aspects and just continue pushing. And then there was another four weeks, another four weeks. And then we realized, like, actually this is going to be the next 10 years. And there was just no real time for anything else. We were just like, just do ElevenLabs and nothing else.
And over time—and I think this happened organically, but looking back it definitely helped—we still try to stay in close touch with what’s happening in our personal lives and where we are in the world, and to spend some time together speaking, but outside of the work context. And I think this was very healthy. I’ve known Piotr for so long and I’ve kind of seen him evolve personally through those years, and I still stay in close touch on that side as well.
Pat Grady: It’s important to make sure that your co-founder and your executives and your team are able to bring their best selves to work, and not just completely ignore everything that’s happened on the personal front.
Mati Staniszewski: Exactly. And then to your second question, part of the inspiration for ElevenLabs came later, so maybe the longer story: there are two parts. First, through the years, when he was at Google and I was at Palantir, we would do hack weekend projects together.
Pat Grady: Okay.
Mati Staniszewski: So, like, trying to explore new technology for fun, and that was everything from building recommendation algorithms. So we tried to build this model where you would be presented with a few different things, and if you select one of those, the next set of things you’re presented with gets closer and optimizes closer to your previous selection. Deployed it, had a lot of fun. Then we did the same with crypto. We tried to understand the risk in crypto and build, like, a risk analyzer for crypto.
Pat Grady: Hmm!
Mati Staniszewski: Very hard. Didn’t fully work, but it was a good attempt, in one of the first crypto hype cycles, to provide, like, analytics around it. And then we created a project in audio. So we created a project which analyzed how we speak and gave you tips on how to improve.
Pat Grady: When was this?
Mati Staniszewski: Early 2021.
Pat Grady: Okay.
Mati Staniszewski: Early 2021. That was kind of the first opening: this is what’s possible across the audio space, this is the state of the art, these are the models that do diarization and understanding of speech, this is what speech generation looks like. And then late 2021, the inspiration came—like, more of the a-ha moment—from Poland, where we’re from. In this case, Piotr was about to watch a movie with his girlfriend. She didn’t speak English, so they turned it on in Polish. And that kind of brought us back to something we grew up with, where every foreign movie you watch in Polish has all the voices—whether it’s a male voice or a female voice—still narrated by one single character.
Pat Grady: [laughs]
Mati Staniszewski: Like, monotonous narration. It’s a horrible experience. And it still happens today. And it was like, wow, we think this will change. This will change. We think the technology and what will happen with some of the innovations will allow us to enjoy that content in the original delivery, in the original incredible voice. And let’s make it happen and change it.
Of course, it has expanded since then. It’s not only that: we realized the same problem exists across most content not being accessible in audio, and not just in English; how the dynamic interactions will evolve; and of course, how audio will transcend the language barrier, too.
The audio breakthrough
Pat Grady: Was there any particular paper or capability that you saw that made you think, “Okay, now is the time for this to change?”
Mati Staniszewski: Well, “Attention Is All You Need.” It’s definitely one which, you know, was so crisp and clear in terms of what’s possible. But maybe to give a different angle to the answer, I think the interesting piece was less the paper. There was this incredible open source repo. That was, like, slightly later, as we started discovering, is it even possible? There was Tortoise-TTS, effectively, an open source model that was created at the time. It provided incredible results in replicating a voice and generating speech. It wasn’t very stable, but it gave some glimpse of, like, wow, this is incredible.
And that was already as we were deeper into the company, maybe the first year in, so in 2022. But that was another element of, like, okay, this is possible, some great ideas there. And then of course, we’ve spent most of our time on what other things we can innovate through: start from scratch, bring the transformers and diffusion into the audio space. And that yielded just another level of human quality, where you could actually feel like it’s a human voice.
Pat Grady: Yeah, let’s talk a bit about how you’ve actually built what you’ve built as far as the product goes. What aspects of what works in text port directly over to audio, and what’s completely different—different skill set, different techniques? I’m curious how similar the two are, and where some of the real differences are.
Mati Staniszewski: The first thing is, you know, that there are kind of those three components that come into the model: there is the compute, there is the data, there is the model architecture. And the model architecture shares some ideas, but it’s very different. Then the data is also quite different, both in terms of what’s accessible and how you need that data to be able to train the models. And on compute, the models are smaller, so you don’t need as much compute, which, given a lot of the innovations need to happen on the model side or the data side, means you can still outcompete foundation models rather than just the …
[CROSSTALK]
Pat Grady: … compute disadvantage.
Mati Staniszewski: Exactly.
Pat Grady: Yeah.
Mati Staniszewski: But the data was, I think, the first piece which is different, where in text you can reliably take the text that exists and it will work. In audio, first of all, there’s much less of the high-quality audio that would actually get you the result you need. And second, it frequently doesn’t come with a transcription or with highly accurate text of what was spoken. That’s lacking in the space, and it’s where you need to spend a lot of time.
And then there’s a third component, something we keep coming across in the current generation of models, which is not only what was said, so the transcript of the audio, but also how it was said.
Pat Grady: Yeah.
Mati Staniszewski: What emotions did you use? Who said it? What are some of the non-verbal elements? That almost doesn’t exist, especially at high quality. And that’s where you need to spend a lot of time. That’s where we spent a lot of time in the early days, too: being able to create effectively more of a speech-to-text model and, like, a pipeline with an additional set of manual labelers to do that work. That’s very different from text; you just need to spend a lot more cycles.
And then at the model level, effectively, you have this step, in the first generation of text-to-speech models, of understanding the context and bringing that to emotion. But of course, you need to predict the next sounds rather than predict the next text token.
Pat Grady: Yeah.
Mati Staniszewski: And that both depends on what came before, but can also depend on what happens after. Like, an easy example is “What a wonderful day.” Let’s say it’s a passage of a book. Then you kind of think, “Okay, this is positive emotion, I should read it in a positive way.” But if you have “‘What a wonderful day,’ I said sarcastically,” then suddenly it changes the entire meaning, and you need to adjust that in the audio delivery as well, put the punchline in a different spot. So that was definitely different, where that contextual understanding was a tricky thing.
And then the other model thing that’s very different: you have the text-to-speech element, but then you also have the voice element. So the other innovation that we spent a lot of time working on is how you can create and represent voices in a way that’s more accurate to the original. And we found this encoding and decoding approach, which was slightly different for the space. We weren’t hard coding or predicting any specific features, so we weren’t trying to optimize, is the voice male or is the voice female, or what’s the age of the voice. Instead, we effectively let the model decide what the characteristics should be, and then we found a way to bring that into the speech. So now, of course, the text-to-speech model takes the context of the text as one input and the voice as a second input. And based on the voice delivery, if it’s more calm or dynamic, both of those merge together and give the end output, which was, of course, a very different type of work than the text models.
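To make that two-input idea concrete, here is a minimal sketch of what a contextual text-to-speech interface of that shape could look like. All names and signatures here are illustrative assumptions for this write-up, not ElevenLabs’ actual API:

```python
# Sketch of a two-input contextual TTS interface (illustrative only;
# not ElevenLabs' actual API).
from dataclasses import dataclass

@dataclass
class VoiceEmbedding:
    # Learned voice characteristics. Nothing is hand-coded: no explicit
    # age or gender features; the model decides what to encode.
    vector: list

def encode_voice(reference_audio: bytes) -> VoiceEmbedding:
    # Encoder stub: compress a reference recording into an embedding.
    return VoiceEmbedding(vector=[0.0] * 256)

def synthesize(text: str, context: str, voice: VoiceEmbedding) -> bytes:
    # Decoder stub: condition generation on BOTH the surrounding text
    # (emotion, emphasis) and the voice embedding (timbre, pace).
    return b""

voice = encode_voice(b"\x00" * 16)  # placeholder reference audio

# Same sentence, opposite delivery, purely because the context differs:
positive = synthesize("What a wonderful day.",
                      context="A cheerful passage from a book.",
                      voice=voice)
sarcastic = synthesize("What a wonderful day.",
                       context="'What a wonderful day,' I said sarcastically.",
                       voice=voice)
```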
Hiring talent
Pat Grady: Amazing. What sort of people have you needed to hire to be able to build this? I imagine it’s a different skill set than most AI companies.
Mati Staniszewski: It kind of changed over time, but I think the first difference, and this is probably less a skill-set difference and more an approach difference: we started fully remote. We wanted to hire the best researchers wherever they are. We knew where they were. There are probably, like, 50 to 100 great people in audio, based at least on the open source work or the papers they release or the companies they worked at that we admire. So the top of the funnel is pretty limited, because so many fewer people worked on the research. So we decided, let’s attract them and get them into the company wherever they are, and that really helped.
The second thing was, given we want to make it exciting for people to work here, and also we think this is the best way to run a lot of the research, we tried to keep the researchers extremely close to deployment, to actually seeing the results of their work. So the cycle from researching something to bringing it in front of all the people is super short.
Pat Grady: Yeah.
Mati Staniszewski: And you get that immediate feedback of how is it working. And then we have a kind of separate from research, we have research engineers that focus less on, like, the innovation of the entire kind of new architecture of the models, but taking existing models, improving them, changing them, deploying them at scale. And here, frequently you’ve seen other companies call our research engineers “researchers,” given that the work would be as complex in those companies. But that kind of really helped us to create a new innovation, bring that innovation, extend it and deploy it.
And then the layer around the research that we’ve created is probably very different, where we now effectively have a group of voice coaches, and data labelers that are trained by the voice coaches in how to understand the audio data, how to label it, how to label the emotions. Then their work gets re-reviewed by the voice coaches, whether it’s good or bad, because most of the traditional labeling companies didn’t really support audio in that same way.
But I think the biggest difference is you needed to be excited about some part of the audio work to really be able to create and dedicate yourself to the level we want. And, especially at the time, as a small company, you had to be willing to embrace the independence and high ownership it takes, where you are effectively working on a specific research theme yourself. Of course, there’s some interaction, some guidance from others, but a lot of the heavy lifting is individual, and creating that work takes a different mindset. And now we have a team of almost 15 researchers and research engineers, and they are incredible.
The history of viral moments
Pat Grady: What have some of the major kind of step function changes in the quality of the product or the applicability of the product been over the last few years? I remember kind of early, I think it was early 2023-ish when you guys started to explode. Or maybe late 2023, I forget. And it seemed like some of it was on the heels of the Harry Potter Balenciaga video that went viral where it was an ElevenLabs voice that was doing it. It seems like you’ve had these moments in the consumer world where something goes viral and it traces back to you, but beyond that, from a product standpoint, what have been kind of the major inflection points that have opened up new markets or spurred more developer enthusiasm?
Mati Staniszewski: You know, what you’ve mentioned is probably one of the key things we are trying to do, and continuously even now we see this as one of the key things to really get the adoption out there: have the prosumer deployment, actually bringing it to everyone out there when we create new technology, showing the world that it’s possible, and then kind of supplementing that with the top-down motion, bringing it to the specific companies we work with.
And the reason for this is twofold. One is these groups of people are just so much more eager and quick to adopt and create with that technology. And the second is, frequently when we create the product and research work, we have of course some predictions about the set of use cases that might be created, but there are just so many more that we wouldn’t expect, like the example you gave, that wouldn’t have come to our mind as something people might try to do.
And that was definitely something where, continuously, even now when we create new models, we try to bring it to the entirety of the user base, learn from them and increase that. It kind of goes in those waves where we have a new model release, we bring it broad, the prosumer adoption is there, and then the enterprise adoption follows with the additional product and reliability that need to happen. And then once again, we have a new step-function release, and the cycle repeats.
So we tried to really embrace it. And through the history, the very first one was when we had our beta model. So you were right, we released it publicly in early 2023. In late 2022, we were iterating in the beta with a subset of users. We had a lot of book authors in that subset, and we had, like, literally a small text box in our product where you could input text and get speech out. It was a tweet length, effectively.
And we had one of those book authors copy-paste his entire book into this box and download it. And at the time, most of the platforms banned AI content.
Pat Grady: Okay.
Mati Staniszewski: But he managed to upload it. They thought it was human. He started getting great reviews on that platform, and then came back to us with a set of his friends and other book authors saying, like, “Hey, we really need it. This is incredible.” And that kind of triggered this first, like, mini-virality moment with book authors, very, very keen.
Then we had another similar moment around the same period where there was one of the first models that could laugh.
Pat Grady: Okay.
Mati Staniszewski: And we released this blog post: the first AI that can laugh. And people picked it up like, “Wow, this is incredible. This is really working.” We got a lot of the early users. Then, of course, the theme that you mentioned, which was a lot of the creators. And I think there’s a completely new trend that started around this time, where it shifted into “no face” channels. Effectively, you don’t have the creator in the frame, and then you have narration from that creator over something that’s happening. And that started spreading like wildfire in the first six months, where of course we were providing the narration and the speech and the voices for a lot of those use cases. And that was great to see.
Then late 2023, early 2024, we released our work in other languages. That was one of the first moments where you could really create narration across the most popular European languages, and our dubbing product. So that’s back to the original vision: we finally created a way for you to take audio and bring it to another language while still sounding the same.
And that kind of triggered this other small virality moment of people creating the videos. And there was like this—you know, the expected ones, which is just the traditional content, but also unexpected ones where we had someone trying to dub singing videos.
Pat Grady: Okay.
Mati Staniszewski: Which we didn’t know the model would work on. And it kind of didn’t work, but it gave you, like, a drunken singing result.
Pat Grady: [laughs]
Mati Staniszewski: So then it went viral a few times for that result, too, which was fun to see. And then in early 2025, and we’re seeing this currently, everybody is creating an agent. We started adding voice to all of those agents, and it became very easy for a lot of people to make their entire orchestration seamless: speech to text, the LLM responses, text to speech. And we now have a few use cases which started getting a lot of traction, a lot of adoption. Most recently, we worked with Epic Games to recreate the voice of Darth Vader in Fortnite.
Pat Grady: I saw that.
Mati Staniszewski: There are just so many people using it and trying to get a conversation with Darth Vader in Fortnite, which is just immense scale. And of course, you know, most of the users are trying to have a great conversation, use him as a companion in the game. Some people are trying to, like, stretch whether he will say something that he shouldn’t be able to say. So you see all those attempts as well. But luckily, the product is holding up, and it’s keeping him both relatively performant and safe, keeping him on the rails.
I think about some of the dubbing use cases. One of the viral ones was when we worked with Lex Fridman. He interviewed Prime Minister Narendra Modi, and in the conversation Lex spoke English and Narendra Modi spoke Hindi. We turned the conversation into English so you could actually listen to both of them speaking together, and then similarly we turned both of them to Hindi, so you heard Lex speaking Hindi. And that went extremely viral in India, where people were watching both of those versions, and in the US people were watching the English version. So that was a nice way of tying it back to the beginning. But especially as you think about the future, the agents, just seeing them pop up in new ways, is going to be so frequent. From early developers building everything from Stripe integrations to process refunds, through the companion use cases, all the way to the true enterprise, there are probably a few viral moments ahead.
The rise of voice agents
Pat Grady: Yeah, say more about what you’re seeing in voice agents right now. It seems like that’s quickly become a pretty popular interaction pattern. What’s working, what’s not working? You know, where are your customers really having success? Where are some of your customers kind of getting stuck?
Mati Staniszewski: And before I answer, maybe a question back to you: Do you see a lot more companies building agents across the companies that are coming through Sequoia?
Pat Grady: Yeah, we absolutely do. And I think most people have this long-term vision that it’s sort of a HeyGen-style avatar powered by an ElevenLabs voice, where it’s this human-like agent that you’re interacting with. And I think most people start with simpler modalities and kind of work their way up. So we see a lot of text-based agents sort of proliferating throughout the enterprise stack. And I imagine there are lots of consumer applications for that as well, but we tend to see a lot of the enterprise stuff.
Mati Staniszewski: It’s similar to what we are seeing, definitely, both in the new startups being created, where it’s like everybody is building an agent, and then on the enterprise side, too. It can be so helpful for the processes internally. And, like, taking a step back, what we’ve thought and believed from the start is that voice will fundamentally be the interface for interacting with technology. It’s probably the modality we’ve known the longest, the first way humans interacted, and it carries just so much more than text does. Like, it carries the emotions, the intonation, the imperfections. We can understand each other. We can, based on the emotional cues, respond in very different ways.
So that’s where our start happened, where we think voice will be that interface. And we built not just the text-to-speech element: seeing our clients try to use the text-to-speech to build the whole conversational application, we asked, can we provide them a solution that abstracts this away?
Pat Grady: Yeah.
Mati Staniszewski: And we’ve seen it across the traditional domains. To speak to a few: in the healthcare space, we’ve seen people try to automate some of the work they cannot get done. With nurses as an example, a company like Hippocratic will automate the calls that nurses need to make to patients to remind them about taking medicine, ask how they are feeling, and capture that information back so the doctors can process it in a much more efficient way. And voice became critical, because a lot of those people cannot be reached otherwise, and a voice call is just the easiest thing to do.
Then very traditional, probably the quickest moving one is customer support. So many companies, both from the call center and the traditional customer support, trying to build the voice internally in the companies, whether it’s companies like Deutsche Telekom all the way through to the new companies, everybody is trying to find a way to deliver better experience, and now voice is possible.
And then what is probably one of the most exciting for me is education: could you be learning through having that voice delivery in a new way? I used to be a chess player, or, like, an amateur chess player. And we work with Chess.com, where you can—I don’t know if you’re a user of Chess.com.
Pat Grady: I am, but I’m a very bad chess player. [laughs]
Mati Staniszewski: Okay. So that’s a great cue. One of the things we’re trying to build is effectively a narration which guides you through the game so you can learn how to play better. And there’s a version of that where hopefully we will be able to work with some of the iconic chess players, where you can have the delivery from Magnus Carlsen or Garry Kasparov or Hikaru Nakamura to guide you through the game and get even better while you play it, which would be phenomenal. And I think this will be one of the common things we’ll see, where everybody will have their personal tutor for the subject that they want, with a voice that they relate to and can get closer to.
And that’s on the enterprise side, but then on the consumer side, too, we’ve seen completely new ways of augmenting how you can deliver content. Like the work with Time magazine, where you can read the article, you can listen to the article, but you can also speak to the article. It ran during the “Person of the Year” release, where you could ask questions about how they became Person of the Year, tell me more about other People of the Year, and dive into that a little bit deeper.
And then we as a company every so often try to build an agent that people can interact with to see the art of the possible. Most recently we created an agent for my favorite physicist, or one of my two favorites, Richard Feynman, working with his family, where you can actually …
Pat Grady: He’s my favorite too. [laughs]
Mati Staniszewski: Okay, great, great. He’s, I mean …
Pat Grady: He’s amazing.
Mati Staniszewski: He has such an amazing way of delivering knowledge in a simple, educational and humorous way, and just the way he speaks is amazing, and the way he writes is amazing. So that was amazing. And I think this will evolve to where maybe in the future you will have, like, his Caltech lectures or one of his books, where you can listen to it in his voice and then dive into some of his background and understand it a bit better. Like, “Surely You’re Joking, Mr. Feynman!” And dive into this.
What are the bottlenecks?
Pat Grady: I would love to hear a reading of that book in his voice. That’d be amazing. For some of the enterprise applications or maybe the consumer applications as well, it seems like there are a lot of situations where the interface is not—the interface might be the enabler, but it’s not the bottleneck. The bottleneck is sort of the underlying business logic or the underlying context that’s required to actually have the right sort of conversation with your customer or whoever the user is. How often do you run into that? What’s your sense for where those bottlenecks are getting removed, you know, and where they might still be a little bit sticky at the moment?
Mati Staniszewski: The benefit of us working so closely with a lot of companies, where we bring our engineers to work directly with them, is that we frequently end up diving in and seeing some of the common bottlenecks. When you think about a conversational AI stack, you have the speech-to-text element of understanding what you say, you have the LLM piece of generating the response, and then text-to-speech to narrate it back. And then you have the entire turn-taking model to deliver that experience in a good way. But really, that’s just the enabler.
But then, like you said, to be able to deliver the right response, you need both the knowledge base, the business information about how you want to actually generate that response and what’s relevant in a specific context, and then the functions and integrations to trigger the right set of actions.
Pat Grady: Mm-hmm.
Mati Staniszewski: And in our case, we’ve built that stack around the product, so companies we work with can bring that knowledge base relatively easily, have access to RAG if they want to enable this, are able to do that on the fly if they need to, and then, of course, build the functions around it.
And some very common themes definitely come across, where the deeper into the enterprise you go, the more important integrations become: whether it’s simple things like Twilio or SIP trunking to make the phone call, or connecting to the CRM system of choice that they have, or working with the providers those companies already have deployed, like Genesys. That’s definitely a common theme, and probably what takes the most time: how do you have an entire suite of integrations that works reliably, so the business can easily connect its logic? In our case, of course, this is compounding, and every next company we work with already benefits from a lot of the integrations that were built.
So that’s probably the most frequent one, the integrations themselves. The knowledge base isn’t as big of an issue, but that depends on the company. We’ve seen it all in terms of how well organized the knowledge is inside a company. If it’s a company that has already been spending a lot of effort on digitizing and creating, like, some version of a source of truth for where that information lives and how it’s structured, it’s relatively easy to onboard them. And as we go to more complex ones, and I don’t know if I can mention anyone, it can get pretty gnarly. Then we work with them on, like, okay, that’s what we need to do as the first step. Some of the protocols that are being developed to standardize this, like MCP, are definitely helpful, something that we are also bringing into the fold. As you know, you don’t want to spend the time on all the integrations if the services can provide that in an easy, standard way.
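For readers who want the shape of that stack in code, here is a minimal sketch of one turn of the cascaded voice agent described above: speech to text, an LLM consulting a knowledge base and triggering a function, then text to speech. Every name here is a hypothetical stub for illustration, not any vendor’s actual API:

```python
# Sketch of one turn of a cascaded voice agent:
# speech-to-text -> LLM (+ knowledge base / function calls) -> text-to-speech.
# All functions are illustrative stubs, not a real vendor API.

def speech_to_text(audio: bytes) -> str:
    return "I'd like a refund for order 1234."   # stubbed transcription

def retrieve_context(query: str, knowledge_base: dict) -> str:
    # Simplified RAG step: fetch the policy text relevant to the query.
    return knowledge_base.get("refund_policy", "")

def llm_respond(transcript: str, context: str) -> dict:
    # The LLM decides between answering directly and triggering an action.
    return {"action": "process_refund",
            "args": {"order_id": "1234"},
            "reply": "Done! Your refund for order 1234 is on its way."}

def call_tool(action: str, args: dict) -> None:
    # Integration layer: CRM update, payments, Twilio/SIP telephony, etc.
    print(f"calling integration {action} with {args}")

def text_to_speech(reply: str) -> bytes:
    return b""   # stubbed audio

def handle_turn(audio: bytes, knowledge_base: dict) -> bytes:
    transcript = speech_to_text(audio)
    context = retrieve_context(transcript, knowledge_base)
    decision = llm_respond(transcript, context)
    if decision.get("action"):
        call_tool(decision["action"], decision["args"])
    return text_to_speech(decision["reply"])

handle_turn(b"...", {"refund_policy": "Refunds are allowed within 30 days."})
```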
Competing with foundation models
Pat Grady: Well, and you mentioned MCP, which comes from Anthropic. One of the things that you plug into is the foundation models themselves. And I imagine there’s a bit of a co-opetition dynamic where sometimes you’re competing with their voice functionality, sometimes you’re working with them to provide a solution for a customer. How do you manage that? I imagine there are a bunch of founders listening who are in similar positions where they work with foundation models, but they kind of compete with foundation models. I’m just curious, how do you manage that?
Mati Staniszewski: I think the main thing that we’ve realized is most of them are complementary to work like conversational AI.
Pat Grady: Yeah.
Mati Staniszewski: And we’re trying to stay agnostic rather than using just one provider. But I think the main thing, and this happened especially over the last year, now that I think about it, is that we are not trying to rely on only one. We are trying to have many of them together in the fold. And that goes both ways.
Like, one: what if they develop into closer competition, where maybe they won’t be able to provide the service to us, or the relationship becomes too blurry? We are, of course, not sending any of the data back to them, but could that be a concern in the future? So that’s one piece.
But also the second piece: when you develop a product like conversational AI, which allows you to deploy your voice AI agent, all our customers will have a different preference for which LLM to use. And frequently you want a cascading mechanism, so that if one LLM isn’t working at a given time, you fall through to a second or third layer of support and still perform pretty well. And we’ve seen this work extremely successfully. So to a large extent, we treat them as partners. Happy to be partners with many of them. And hopefully that continues, and if we are competing, that’ll be a good competition too.
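That fallback cascade is simple to picture in code. A minimal sketch, with the provider names and the `ask` helper as assumptions for illustration:

```python
# Sketch of an LLM fallback cascade: try providers in preference order,
# fall through on outage or timeout. All names are illustrative.

PROVIDERS = ["primary-llm", "secondary-llm", "tertiary-llm"]

def ask(provider: str, prompt: str) -> str:
    # Stub for a real client call; raises when the provider is degraded.
    if provider == "primary-llm":
        raise TimeoutError("provider degraded")
    return f"{provider} answered: {prompt!r}"

def respond_with_fallback(prompt: str) -> str:
    last_error = None
    for provider in PROVIDERS:            # preference order
        try:
            return ask(provider, prompt)
        except Exception as err:          # outage, timeout, rate limit...
            last_error = err              # ...try the next layer
    raise RuntimeError("all providers failed") from last_error

print(respond_with_fallback("What's the status of my order?"))
```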
What do customers care about most?
Pat Grady: Let me ask you on the product, what do your customers care the most about? One sort of meme over the last year or so has been people who keep touting benchmarks are kind of missing the point. You know, there are a lot of things beyond the benchmarks that customers really care about. What is it your customers really care about?
Mati Staniszewski: Very true on the benchmark side, especially in audio. But our customers care about three things. The first is quality: how expressive it is, in both English and other languages. And that’s probably the top one. If you don’t have quality, everything else doesn’t matter. Of course, the threshold for quality will depend on the use case. It’s a different threshold for narration, for delivery in the agentic space, and for dubbing.
The second one is latency. You won’t be able to deliver a conversational agent if the latency isn’t good enough. And that’s where the interesting trade-off happens between the quality benchmark and the latency benchmark. And then the third one, which matters especially at scale, is reliability. Can I deploy at scale, like the Epic Games example, where millions of players are interacting with the system, and it holds up? It’s still performant, still works extremely well. Time and time again, we’ve seen that being able to scale and reliably deliver that infrastructure is critical.
The Turing test for voice
Pat Grady: Can I ask you, how far do you think we are from highly or fully reliable, human or superhuman quality, effectively zero-latency voice interaction? And maybe the related question is: how does the nature of the engineering challenges you face change as we get closer to and inevitably surpass that threshold?
Mati Staniszewski: The ideal—like we would love to prove that it’s possible this year.
Pat Grady: This year?
Mati Staniszewski: Like, we can cross the Turing test of speaking with an agent and you just would say, like, this is speaking another human. I think it’s a very ambitious goal, but I think it’s possible.
Pat Grady: Yeah.
Mati Staniszewski: I think it’s possible if not this year, then hopefully early in 2026. But I think we can do it. I think we can do it. You know, you probably have different groups of users, too, where some people will kind of be very attuned and it will be much harder to pass the Turing test for them. But for the majority of people, I hope we are able to get it to that level this year.
I think the biggest question, and that’s where the timeline is a little bit more dependent, is: will it be the model that we have today, which is a cascading model where you have the speech to text, the LLM, and the text to speech, so kind of three separate pieces, that can be performant? Or do you have one model where you train them together, truly duplex style, where the delivery is much better?
And that’s effectively what we are trying to assess. We are doing both. The one in production is the cascading model. Soon, the one we’ll deploy will be a truly duplex model. And I think the main thing you will see is the reliability-versus-expressivity trade-off. I think on latency we can get pretty good on both sides, but there might be some trade-off where the true duplex model will always be quicker and a little bit more expressive, but less reliable. And the cascaded model is definitely more reliable and can be extremely expressive, but it may not be as contextually responsive, and the latency will be a little bit harder. So that will be a huge engineering challenge. And I think no company has been able to do it well, like, fuse the modality of LLMs with audio well.
Pat Grady: Yeah.
Mati Staniszewski: So I hope we’ll be the first one, which is the internal big goal. But we’ve seen the OpenAI work and the Meta work doubling down there. I don’t think it’s passed the Turing test yet, so hopefully we’ll be the first.
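As a rough illustration of why the cascaded-versus-duplex question matters for latency, here is a back-of-the-envelope budget. The numbers are assumptions for illustration, not measurements from ElevenLabs or anyone else:

```python
# Back-of-the-envelope time-to-first-audio budget (illustrative numbers).
# A cascaded pipeline pays each stage in sequence; a fused duplex model
# produces audio in a single decode pass.
cascaded_ms = {
    "speech_to_text": 150,   # finalize the user's transcript
    "llm_first_token": 300,  # first token from the LLM
    "tts_first_audio": 150,  # first audio chunk from TTS
}
duplex_ms = 250              # single fused model, one pass (assumed)

print(f"cascaded: ~{sum(cascaded_ms.values())} ms to first audio")
print(f"duplex:   ~{duplex_ms} ms to first audio")
```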
Voice as the new default interaction mode
Pat Grady: Awesome. And then you mentioned earlier that you think of, and you have thought of voice as sort of a new default interaction mode for a lot of technology. Can you paint that picture a little bit more? Let’s say we’re five or ten years down the road, how do you imagine just the way people live with technology, the way people interact with technology changes as a result of your model getting so good?
Mati Staniszewski: I think, first, there will be this beautiful part where technology recedes into the background, so you can really focus on learning, on human interaction, and you will have it accessible through voice rather than through the screen.
I think the first piece will be the education. I think there will be an entire change where all of us will have the guiding voice, whether we are learning mathematics and are going through the notes, or whether we are trying to learn a new language and interact with a native speaker to guide you through how to pronounce things. And I think this will be the first theme where in the next five, ten years, it will be the default that you will have the agents, voice agents, to help you through that learning.
The second thing which will be interesting is how this affects the whole cultural exchange around the world. I think you will be able to go to another country and interact with another person while still carrying your own voice, your own emotion and intonation, and the person can understand you. There will be an interesting question of how that technology is delivered. Is it a headphone? Is it Neuralink? Is it another technology? But it will happen. And I think we hopefully can make it happen.
If you read Hitchhiker’s Guide to the Galaxy, there’s this concept of Babel Fish. I think Babel Fish will be there and the technology will make it possible. So that’ll be a second huge, huge theme.
And I think generally, we’ve spoken about the personal tutor example, but there will be other sorts of assistants and agents that all of us have that can be sent to perform tasks on our behalf. And to perform a lot of those tasks you will need voice: whether it’s booking a restaurant, or jumping into a specific meeting to take notes and summarize them in the style that you need, or calling customer support and having the customer support agent respond. So that’ll be an interesting theme of, like, agent-to-agent interaction, and how do you authenticate it, how do you know whether it’s real or not? But of course, voice will play a big role in all three. Education, I think, and generally how we learn things, will be so dependent on it; the universal translator piece will have voice at the forefront; and then the general services around life will be so crucially voice driven.
Protecting against impersonation
Pat Grady: Very cool. And you mentioned authentication. I was going to ask you about that. So one of the fears that always comes up is impersonation. Can you talk about how you’ve handled that to date, and maybe how it’s evolved to date and where you see it headed from here?
Mati Staniszewski: Yeah. The way we’ve started, and this was a big piece for us from the start, is that all the content generated in ElevenLabs can be traced back to the specific account that generated it. So you have a pretty robust mechanism tying the audio output to the account, and we can take action. So that provenance is extremely important.
And I think it will be increasingly important in the future, where you want to be able to understand what is AI content and what is not. Or maybe it will shift even a step deeper, where rather than authenticating AI, you will also authenticate humans. So you’ll have on-device authentication that, okay, this is Mati calling another person.
The second thing is the wider set of moderation: is this a call trying to commit fraud or a scam, or is this a voice that might not be authorized? Which we do as a company, and that has evolved over time, to what extent we do it and how we do it, moderating on both the voice and the text level.
And then the third thing, stretching what we’ve started ourselves on the provenance component, is: how can we train detection models and work with other companies so the detection covers not only ElevenLabs, but also open source technology, which is of course prevalent in that space, and other commercial models? And it’s possible. Of course, as open source develops, it will always be a cat-and-mouse game whether you can actually catch it. But we’ve worked a lot with other companies and with academia, like UC Berkeley, to actually deliver those models and be able to detect it.
And that’s the guiding principle: especially now that we take more of a leading position in deploying new technology, like conversational AI or a new model, we try to spend even more time trying to understand what safety mechanisms we can bring in, to make it as useful as possible for good actors and to minimize the bad actors. So that’s the usual trade-off there.
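To make the provenance idea concrete, here is a minimal sketch of tracing generated audio back to an account, assuming a simple fingerprint registry. Real systems use perceptual fingerprints or watermarks that survive compression and editing; the plain hash here is a deliberate simplification:

```python
# Sketch of audio provenance: map a fingerprint of each generated clip
# to the account that generated it. A plain SHA-256 is a simplification;
# production systems need fingerprints robust to re-encoding and edits.
import hashlib

REGISTRY: dict[str, str] = {}   # fingerprint -> account id

def fingerprint(audio: bytes) -> str:
    # Stand-in for a perceptual hash / watermark extractor.
    return hashlib.sha256(audio).hexdigest()

def register_generation(audio: bytes, account_id: str) -> None:
    # Called at generation time, so every output is attributable.
    REGISTRY[fingerprint(audio)] = account_id

def trace(audio: bytes):
    # Given suspect audio, recover which account generated it (or None).
    return REGISTRY.get(fingerprint(audio))

clip = b"generated-audio-bytes"
register_generation(clip, account_id="acct_42")
print(trace(clip))   # -> acct_42
```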
Pat Grady: Can we talk about Europe for a minute?
Mati Staniszewski: Let’s do it.
Pros and cons of building in Europe
Pat Grady: Okay, so you’re a remote company, but you’re based in London. What have been the advantages of being based in Europe? What have been some of the disadvantages of being based in Europe?
Mati Staniszewski: That’s a great question. I think the advantage for us was the talent, being able to attract some of the best talent. And frequently people say that there’s a lack of drive in the people in Europe. We haven’t felt that at all. We feel like these people are so passionate. We have, I think, such an incredible team. We try to run it with small teams, but everybody is just pushing all the time, so excited about what we can do, and some of the most hardworking people I’ve had the pleasure to work with, and it’s such a high caliber of people, too.
So talent was an extremely positive surprise for us of, like, how the team kind of got constructed. And especially now as we continue hiring people, whether it’s people across broader Europe, Central-Eastern Europe, just that caliber is super high.
Second thing, which I think is true: there’s this wider feeling that Europe is behind. And likely in many ways it’s true. Like, AI innovation is being led in the U.S., countries in Asia are closely following, and Europe is behind. But the energy of the people is to really change that. And I think it’s shifted over the last years. It was a little bit more cautious when we started the company; now we feel the keenness, and people want to be at the forefront of that. And getting that energy and that drive from people was a lot easier.
So that’s probably an advantage where we can just move quicker. The companies are actually keen to adopt increasingly, which is helping, and as a company in Europe—really as a global company, but with a lot of people in Europe—it helps us deploy with those companies, too.
And maybe there’s another flavor and last flavor of that, which is Europe specific, but also global specific. So when we started the company, we didn’t really think about any specific region. Like, you know, we are a Polish company or British company or U.S. company. But one thing was true where we wanted to be a global solution.
Pat Grady: Yeah.
Mati Staniszewski: And not only from a deployment perspective, but also from the core of what we are trying to achieve, where it’s like, how do we bring audio and make it accessible in all those different languages? So it kind of was through the spine of the company from the start, from the core of the company. And that definitely helped us where now when we have a lot of people in all the different regions, they speak the language, they can work with the clients. And that, I think, likely helped that we were in Europe at the time, because we were able to bring out people and optimize for that local experience.
On the other side, what was definitely harder is, you know, in the US, there’s this incredible community. You have people with the drive, but you also have people that have been through this journey a few times, and you can learn from those people so much more easily. There are just so many people that created companies, exited companies, led a function at a different scale than most of the companies in Europe. So it’s almost taken for granted that you can learn from those people just by being around them and being able to ask the questions. That was much harder, I think, especially in the early days: to be able to ask those questions, or not even ask the questions, but know what questions to ask.
Pat Grady: Yeah.
Mati Staniszewski: Of course, we’ve been lucky to partner with incredible investors to help us through those questions. But that was harder, I think, in Europe.
And then the second is probably the flip side. While I’m positive there is enthusiasm now in Europe, I think it was lacking over the last years. The U.S. was excitingly taking the approach of leading, especially over the last year, and creating the ecosystem to let it flourish. Europe is still figuring it out. And whether it’s the regulatory things like the EU AI Act, which I think will not contribute to us accelerating, people are still trying to figure it out. There’s enthusiasm, but I think it’s slowing things down. But the first one is definitely the bigger disadvantage.
Lightning round
Pat Grady: Yeah. Should we do a quick-fire round?
Mati Staniszewski: Let’s do it.
Pat Grady: Okay. What is your favorite AI application that you personally use? And it can’t be ElevenLabs or ElevenReader. [laughs]
Mati Staniszewski: It really changes over time, but Perplexity was, I think, and is one of my favorites.
Pat Grady: Really? And for you, what does Perplexity give you that ChatGPT or Google doesn’t give you?
Mati Staniszewski: Yeah, ChatGPT is also amazing. I think for a long time it was being able to go deeper and understand the sources. I guess I hesitated a little bit over the “was/is” because I think ChatGPT now has a lot more of that component, so I tend to use both in many of those cases. And for a long time, a non-AI application, though I think they are trying to build AI in, my favorite app would be Google Maps.
Pat Grady: [laughs]
Mati Staniszewski: I think it’s incredible. It’s such a powerful application. Let me pull up my screen. What other applications do I have?
Pat Grady: [laughs] Well, while you’re doing that, I will go to Google Maps and just browse. I’ll just go to Google Maps and explore some location that I’ve never been to before.
Mati Staniszewski: A hundred percent. I mean, it’s great as a search function for the area, too. It’s great. There’s a niche application I like: FYI. This is a will.i.am startup.
Pat Grady: Oh, okay.
Mati Staniszewski: Which is like a combination of—well, it started as a communication app, but now it’s more of a radio app.
Pat Grady: Okay.
Mati Staniszewski: Like, Curiosity is there. Claude is great, too. I use Claude for very different things than ChatGPT. Any deeper coding elements, prototyping, I always use Claude. And I love it. Actually, no, I do have a more recent answer, which is Lovable. Lovable was …
Pat Grady: Do you use it at all for ElevenLabs, or do you just use it personally to …
Mati Staniszewski: No, that’s true. I think like, you know, my life is ElevenLabs.
Pat Grady: One and the same.
Mati Staniszewski: Yes, all of these I use big time for ElevenLabs, too. But yeah, Lovable I use for ElevenLabs, and every so often for exploring new things, which ultimately is tied to ElevenLabs anyway. It’s great for prototyping.
Pat Grady: Very cool.
Mati Staniszewski: And, like, pulling up a quick demo for a client, it’s great.
Pat Grady: Very cool.
Mati Staniszewski: They’re not related, I guess. What was your favorite one?
Pat Grady: My favorite one? You know, it’s funny. Yesterday we had a team meeting, and everybody checked ChatGPT to see how many queries they’d submitted in the last 30 days. I’d done, like, 300, and I was like, “Oh, yeah. That’s pretty good. Pretty good user.” And Andrew similarly had done about 300. Some of the younger folks on our team, it was 1,000-plus. So I’m a big DAU of ChatGPT and I thought I was a power user, but apparently not compared to what some other people are doing. I know it’s a very generic answer, but it’s unbelievable how much you can do in one app at this point.
Mati Staniszewski: Do you use Claude as well?
Pat Grady: I use Claude a little bit, but not nearly as much. The other app that I use every single day, which I’m very contrarian on, is Quip, which is Bret Taylor’s company from years ago that got sold to Salesforce. And I’m pretty sure that I’m the only DAU at this point, but I’m just hoping Salesforce doesn’t shut it down because my whole life is in Quip.
Mati Staniszewski: We used it at Palantir. I like Quip. Quip is good.
Pat Grady: It’s really good. Yeah. They nailed the basics. Didn’t get bogged down in bells and whistles. Just nailed the basics. Great experience. All right, who in the world of AI do you admire most?
Mati Staniszewski: These are hard, not rapid-fire questions, but I think I really like Demis Hassabis.
Pat Grady: Tell me more.
Mati Staniszewski: I think he is always straight to the point. He can speak very deeply about the research, but he has also created so many incredible works himself through the years. He was, of course, leading a lot of the research work, and I like that combination: he has been doing the research and is now leading it. Whether it was AlphaFold, which I think everybody agrees is a true frontier for the world, or how, while most people focus on one part of the AI work, he is trying to bring it to biology.
I mean, Dario Amodei is, of course, trying to do that, too. So it’s going to be incredible what this evolves into. And then he was creating games in the early days, was an incredible chess player, and has been trying to find a way for AI to win across all those games. It’s the versatility: he can lead the deployment of research, and he is probably one of the best researchers himself.
Pat Grady: Yeah.
Mati Staniszewski: Stays extremely humble and just, like, intellectually honest. I feel like if you were speaking with Demis here, or Sir Demis, you would get an honest answer. Yeah, that’s it. He’s amazing.
Pat Grady: Very cool. All right, last one. Hot take on the future of AI: some belief you hold medium to strongly that you think is underhyped or maybe contrarian.
Mati Staniszewski: I feel like it’s an answer that you would expect maybe to some extent.
Pat Grady: [laughs]
Mati Staniszewski: But I do think the whole cross-lingual aspect is still, like, totally underhyped. Like, if you will be able to go any place and speak that language and people can truly speak with yourself, and whether this will be initially the delivery of content and then future delivery of communication, I think this will, like, change the world of how we see it. Like, I think one of the biggest barriers in those conversations is that you cannot really understand the other person. Of course, it has a textual component to it, like, be able to translate it well, but then also the voice delivery. And I feel like this is completely underhyped.
Pat Grady: Do you think the device that enables that exists yet?
Mati Staniszewski: No, I don’t think so.
Pat Grady: Okay. It won’t be the phone, won’t be glasses. Might be some other form factor?
Mati Staniszewski: I think it will have many forms. I think people will have glasses. I think headphones will be one of the first, which will be the easiest. And glasses for sure will be there, too, but I don’t think everybody will wear the glasses. And then, you know, like, is there some version of a non-invasive neural link that people can have while they travel? That would be an interesting attachment to the body that actually works. Do you think it’s underhyped, or do you think it’s hyped enough, this use case?
Pat Grady: I would probably bundle that into the overall idea of ambient computing, where you’re able to focus on human beings, technology fades into the background, it’s passively absorbing what’s happening around you, and it uses that context to help make you smarter, help you do things, help translate, whatever the case might be. Yeah, that absolutely fits into my mental model of where the world is headed. But I do wonder what the form factor will be that enables it. The enabling technologies that allow the business logic and that sort of thing to work are starting to come into focus; what the form factor is, is still to be determined. I absolutely agree with that.
Mati Staniszewski: Yeah, maybe that’s the reason it’s not hyped enough that you don’t have …
[CROSSTALK]
Pat Grady: Yeah, people can’t picture it.
Mati Staniszewski: Yeah.
Pat Grady: Awesome. Mati, thanks so much.
Mati Staniszewski: Pat, thank you so much for having me. That was a great conversation.
Pat Grady: It’s been a pleasure.
Mentioned in this episode:
- Attention Is All You Need: The original Transformer paper
- Tortoise-tts: Open-source text-to-speech model that was a starting point for ElevenLabs (which now maintains a v2)
- Harry Potter by Balenciaga: ElevenLabs’ first big viral moment from 2023
- The first AI that can laugh: 2022 blog post backing up ElevenLabs’ claim of laughter (it got better in v3)
- Darth Vader’s voice in Fortnite: ElevenLabs used actual voice recordings provided by James Earl Jones before his death
- Lex Fridman interviews Prime Minister Modi: ElevenLabs enabled Fridman to speak in Hindi and Modi to speak in English.
- Time Person of the Year 2024: ElevenLabs-powered experiment with “conversational journalism”
- Iconic Voices: Richard Feynman, Deepak Chopra, Maya Angelou and more, available in the ElevenReader app
- SIP trunking: a method of delivering voice, video, and other unified communications over the internet using the Session Initiation Protocol (SIP); see the sketch after this list
- Genesys: Leading enterprise CX platform for agentic AI
- Hitchhiker’s Guide to the Galaxy: Comedy/science-fiction series by Douglas Adams that contains the concept of the Babel Fish instantaneous translator, cited by Mati
- FYI: communication and productivity app for creatives that Mati uses, founded by will.i.am
- Lovable: prototyping app that Mati loves
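For readers curious about the SIP trunking entry above, here is a minimal sketch of what a trunk client sends on the wire to start a call. This is an illustrative toy under assumptions, not anything ElevenLabs-specific: the IP addresses and phone number are hypothetical placeholders, and a real client would also authenticate, handle provisional and final responses, send an ACK, and negotiate RTP media through an SDP body.

```python
# Minimal, hypothetical sketch of a SIP INVITE sent over UDP to a trunk.
# Hosts use RFC 5737 documentation addresses; nothing here is a real endpoint.
import socket

TRUNK = ("192.0.2.1", 5060)  # placeholder SIP trunk endpoint

# A bare-bones INVITE: the request line plus the headers SIP requires
# (Via, Max-Forwards, From, To, Call-ID, CSeq, Contact, Content-Length).
invite = (
    "INVITE sip:+15551234567@192.0.2.1 SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 203.0.113.10:5060;branch=z9hG4bK776asdhds\r\n"
    "Max-Forwards: 70\r\n"
    "From: <sip:agent@203.0.113.10>;tag=1928301774\r\n"
    "To: <sip:+15551234567@192.0.2.1>\r\n"
    "Call-ID: a84b4c76e66710@203.0.113.10\r\n"
    "CSeq: 1 INVITE\r\n"
    "Contact: <sip:agent@203.0.113.10:5060>\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

# Fire-and-forget for illustration only; a real client waits for 100/180/200
# responses and then exchanges audio as RTP packets negotiated via SDP.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(invite.encode("ascii"), TRUNK)
sock.close()
```

In practice, voice-agent platforms sit behind a provider (Twilio and Genesys are both mentioned in the episode) that handles this signaling layer, so application code deals with call events rather than raw SIP messages.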