Better Agents Need Better Documentation
Using AI effectively requires a new kind of structured data that AI itself cannot yet produce.
In February, Klarna boldly announced that its new OpenAI-powered assistant handled two-thirds of the Swedish fintech’s customer service chats in its first month. Customer satisfaction metrics improved across the board, but what got everyone’s attention was the $40M boost to its bottom line from replacing 700 full-time contract agents.
Since then, every company we talk to wants to know, “How do we get the Klarna customer support thing?” Against the backdrop of flashy demoware and eye-popping pre-revenue valuations, Klarna’s practical application of GPT-4 was a welcome relief. But as Klarna founder and CEO Sebastian Siemiatkowski explains in this week’s Training Data podcast, the reality of this accomplishment is both more prosaic and more profound than it first appears.
To explain the insight that fueled this breakthrough, Siemiatkowski invokes the classic computer science concept of “garbage in, garbage out” (GIGO). As far back as Babbage, computer scientists have understood that “if you put into the machine wrong figures,” the right answers will not come out. For Klarna, this meant the team needed to pay special attention to the language they gave the model as context for specific customer service tasks: in other words, the documentation and training manuals they made for the agents themselves. “That helped us a lot to think about it that way, that we just needed to make sure that the documentation and the manuals were clear enough and of quality enough, and then it can actually execute,” says Siemiatkowski.
Klarna has not disclosed “the secret sauce” for how they built their assistant beyond using a form of RAG (retrieval augmented generation), so it’s not clear how much of its successful performance is due to data quality alone versus “custom cognitive architectures”1 adapted to specific tasks. Either way, this is a prime example of what we call “Goldilocks agents,” which benefit both from guardrails (in this case, the specific ways that Klarna handles customer service requests) and from the ability of LLMs to fluidly merge information in a coherent manner.
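To make the Goldilocks idea concrete, here is a minimal sketch of that middle ground: deterministic handlers for the request types a company has fixed procedures for, with an LLM fallback for everything else. The intent labels, handlers and stub functions are hypothetical illustrations, not Klarna’s implementation.

```python
# Illustrative "Goldilocks agent" skeleton: scripted guardrails where the
# business has fixed procedures, an LLM fallback elsewhere. All names here
# are hypothetical, not Klarna's implementation.

def handle_refund_status(message: str) -> str:
    # Guardrailed flow: a fixed, auditable response path.
    return "Your refund was issued and should arrive within 5-7 business days."

GUARDRAILED_FLOWS = {"refund_status": handle_refund_status}

def classify_intent(message: str) -> str:
    # Placeholder classifier; in practice a fine-tuned model or an LLM call
    # constrained to a fixed label set.
    return "refund_status" if "refund" in message.lower() else "other"

def llm_with_context(message: str) -> str:
    # Stand-in for a real LLM call grounded in retrieved documentation.
    return f"[LLM reply grounded in retrieved documentation for: {message!r}]"

def respond(message: str) -> str:
    intent = classify_intent(message)
    if intent in GUARDRAILED_FLOWS:
        return GUARDRAILED_FLOWS[intent](message)
    # Outside the guardrails, let the LLM fluidly merge retrieved
    # documentation into a coherent reply (see the RAG sketch below).
    return llm_with_context(message)

print(respond("Where is my refund?"))
```

The design point is that the guardrails make the common, high-stakes paths predictable and auditable, while the LLM handles the long tail of phrasing and edge cases.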
The needs of large language models bring what has long been a niche concern, documentation quality, front and center. Since it was formalized in 2020, RAG has become the industry standard for wrapping the expressive elasticity of LLMs around the key facts needed to reliably execute a task, particularly in mission-critical business settings. Curating the quality of the corpus the model retrieves from is key to success.
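As a rough illustration of the retrieve-then-generate loop, here is a sketch with a toy bag-of-words similarity standing in for a real embedding model and a stub in place of the LLM call; none of this is Klarna’s stack, and the documentation snippets are invented.

```python
# Minimal RAG loop: embed the query, retrieve the closest documentation
# snippets, and pass them to the model as context. The embedding and LLM
# below are toy stand-ins for real models.
import math
from collections import Counter

DOCS = [
    "Refunds are issued to the original payment method within 5-7 business days.",
    "To dispute a charge, select the purchase in the app and report a problem.",
    "Late fees can be waived once per twelve months on request.",
]

def embed(text: str) -> Counter:
    # Toy "embedding": a term-frequency vector. Real systems use a trained model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def llm(prompt: str) -> str:
    # Stub; a production system would call an actual LLM here.
    return f"[LLM response conditioned on]\n{prompt}"

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this documentation:\n{context}\n\nQuestion: {query}"
    return llm(prompt)

print(answer("How long do refunds take?"))
```

The GIGO point lands in `DOCS`: if those snippets are wrong or ambiguous, the model will fluently repeat the error, which is why curation matters more than any cleverness in the loop itself.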
There are people in the world who love writing documentation, but not many. Most of us bemoan bad documentation without contributing to making it better. However, high-performance companies like Stripe have pioneered a “writing culture” in which social norms encourage participation. GitLab, one of the first remote-only companies, has made a handbook-first approach to communication a pillar of its operational excellence2.
In Klarna’s case, the company has built an elaborate knowledge graph3 to power not only its external-facing customer service reps (human and AI) but also its internal chatbot, Kiki. This focus on the quality of documentation has dual benefits, Siemiatkowski explains: “So actually our agents have better tools today to be successful in helping the customers as does the AI. So both experiences are improving as a consequence of that.”
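Footnote 3 notes that Kiki is built on Neo4j’s vector search. For flavor, here is a hedged sketch of what a hybrid graph-plus-vector query might look like using the official neo4j Python driver, assuming Neo4j 5.x vector indexes. The index name, node labels, relationship types and credentials are invented for illustration, and the question embedding is assumed to be computed elsewhere; this is not Klarna’s actual schema.

```python
# Hypothetical sketch: query a Neo4j vector index to find documentation
# nodes related to a support question, then follow graph edges to pull in
# the policies those documents reference. Assumes Neo4j 5.x and the
# official `neo4j` driver; all names are illustrative.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

def related_docs(question_embedding: list[float], k: int = 5):
    # db.index.vector.queryNodes returns the k nearest nodes by similarity.
    records, _, _ = driver.execute_query(
        """
        CALL db.index.vector.queryNodes('doc_embeddings', $k, $embedding)
        YIELD node, score
        // Graph hop: enrich each hit with the policies it references.
        OPTIONAL MATCH (node)-[:REFERENCES]->(policy:Policy)
        RETURN node.text AS text, collect(policy.name) AS policies, score
        """,
        k=k, embedding=question_embedding,
    )
    return records
```

The appeal of the graph layer over plain vector search is that retrieval can follow explicit relationships (document to policy, product to process) rather than relying on embedding similarity alone.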
Siemiatkowski believes Klarna has derived a lot of value from experimenting and building internally with AI. But in both of the cases he discussed (LLM-powered internal knowledge management and external-facing customer service agents), buying rather than building is now a very viable option, with solutions like Glean and Sierra. However, buyer beware: GIGO still applies.
This focus on high-quality data is a large theme in contemporary AI. Microsoft released a paper last year, Textbooks Are All You Need, showing that a significantly smaller model trained on better data could outperform far larger models on coding exercises. On the large model side, Microsoft CTO Kevin Scott told us in Episode 4 that it’s “a good thing that quality of data matters more than quantity of data, because it gives you an economic framework to go do the partnerships that you need to go do to make sure that you are feeding your AI training algorithm a curriculum that is going to result in smarter models. And, honestly, not wasting a whole bunch of compute feeding it a bunch of things that are not.” When it comes to enterprise use cases, particularly consumer-facing ones like Klarna’s, the quality bar is high because customers are comparing against human agents.
What Klarna’s experience suggests is that AI will work best when paired with human effort to understand other humans (something that LLMs seem a long way from achieving). This is actually the same customer-centric approach that is at the heart of all great businesses.
1. See Training Data Ep 1 with Harrison Chase of LangChain and Ep 2 with Matan Grinberg and Eno Reyes of Factory, as well as Andrew Ng’s talk at AI Ascent 2024.
2. Also see Open Org.
3. Siemiatkowski calls out Klarna’s use of Neo4j, a graph database with vector search capabilities, for the Kiki project.