In the Arena: How LMSys changed LLM Benchmarking Forever

1 nov · Latent Space: The AI Engineer Podcast

Apologies for lower audio quality; we lost recordings and had to use backup tracks.

Our guests today are Anastasios Angelopoulos and Wei-Lin Chiang, leads of Chatbot Arena, fka LMSYS, the crowdsourced AI evaluation platform developed by the LMSys student club at Berkeley, which became the de facto standard for comparing language models. Arena Elo is often more cited than MMLU scores to many folks, and they have attracted >1,000,000 people to cast votes since its launch, leading top model trainers to cite them over their own formal academic benchmarks:

The Limits of Static Benchmarks

We’ve done two benchmarks episodes: Benchmarks 101 and Benchmarks 201. One issue we’ve always brought up with static benchmarks is that 1) many are getting saturated, with models scoring almost perfectly on them 2) they often don’t reflect production use cases, making it hard for developers and users to use them as guidance.

The fundamental challenge in AI evaluation isn't technical - it's philosophical. How do you measure something that increasingly resembles human intelligence? Rather than trying to define intelligence upfront, Arena let users interact naturally with models and collect comparative feedback. It's messy and subjective, but that's precisely the point - it captures the full spectrum of what people actually care about when using AI.

The Pareto Frontier of Cost vs Intelligence

Because the Elo scores are remarkably stable over time, we can put all the chat models on a map against their respective cost to gain a view of at least 3 orders of magnitude of model sizes/costs and observe the remarkable shift in intelligence per dollar over the past year:

This frontier stood remarkably firm through the recent releases of o1-preview and price cuts of Gemini 1.5:

The Statistics of Subjectivity

In our Benchmarks 201 episode, Clémentine Fourrier from HuggingFace thought this design choice was one of shortcomings of arenas: they aren’t reproducible. You don’t know who ranked what and what exactly the outcome was at the time of ranking. That same person might rank the same pair of outputs differently on a different day, or might ask harder questions to better models compared to smaller ones, making it imbalanced.

Another argument that people have brought up is confirmation bias. We know humans prefer longer responses and are swayed by formatting - Rob Mulla from Dreadnode had found some interesting data on this in May:

The approach LMArena is taking is to use logistic regression to decompose human preferences into constituent factors. As Anastasios explains: "We can say what components of style contribute to human preference and how they contribute." By adding these style components as parameters, they can mathematically "suck out" their influence and isolate the core model capabilities.

This extends beyond just style - they can control for any measurable factor: "What if I want to look at the cost adjusted performance? Parameter count? We can ex post facto measure that."

This is one of the most interesting things about Arena: You have a data generation engine which you can clean and turn into leaderboards later. If you wanted to create a leaderboard for poetry writing, you could get existing data from Arena, normalize it by identifying these style components. Whether or not it’s possible to really understand WHAT bias the voters have, that’s a different question.

Private Evals

One of the most delicate challenges LMSYS faces is maintaining trust while collaborating with AI labs. The concern is that labs could game the system by testing multiple variants privately and only releasing the best performer. This was brought up when 4o-mini released and it ranked as the second best model on the leaderboard:

But this fear misunderstands how Arena works. Unlike static benchmarks where selection bias is a major issue, Arena's live nature means any initial bias gets washed out by ongoing evaluation. As Anastasios explains: "In the long run, there's way more fresh data than there is data that was used to compare these five models."

The other big question is WHAT model is actually being tested; as people often talk about on X / Discord, the same endpoint will randomly feel “nerfed” like it happened for “Claude European summer” and corresponding conspiracy theories:

It’s hard to keep track of these performance changes in Arena as these changes (if real…?) are not observable.

The Future of Evaluation

The team's latest work on RouteLLM points to an interesting future where evaluation becomes more granular and task-specific. But they maintain that even simple routing strategies can be powerful - like directing complex queries to larger models while handling simple tasks with smaller ones.

Arena is now going to expand beyond text into multimodal evaluation and specialized domains like code execution and red teaming. But their core insight remains: the best way to evaluate intelligence isn't to simplify it into metrics, but to embrace its complexity and find rigorous ways to analyze it. To go after this vision, they are spinning out Arena from LMSys, which will stay as an academia-driven group at Berkeley.

Full Video Podcast

Chapters

* 00:00:00 - Introductions

* 00:01:16 - Origin and development of Chatbot Arena

* 00:05:41 - Static benchmarks vs. Arenas

* 00:09:03 - Community building

* 00:13:32 - Biases in human preference evaluation

* 00:18:27 - Style Control and Model Categories

* 00:26:06 - Impact of o1

* 00:29:15 - Collaborating with AI labs

* 00:34:51 - RouteLLM and router models

* 00:38:09 - Future of LMSys / Arena

Show Notes

* Anastasios Angelopoulos

* Anastasios' NeurIPS Paper Conformal Risk Control

* Wei-Lin Chiang

* Chatbot Arena

* LMSys

* MTBench

* ShareGPT dataset

* Stanford's Alpaca project

* LLMRouter

* E2B

* Dreadnode

Transcript

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.

Swyx [00:00:14]: Hey, and today we're very happy and excited to welcome Anastasios and Wei Lin from LMSys. Welcome guys.

Wei Lin [00:00:21]: Hey, how's it going? Nice to see you.

Anastasios [00:00:23]: Thanks for having us.

Swyx [00:00:24]: Anastasios, I actually saw you, I think at last year's NeurIPS. You were presenting a paper, which I don't really super understand, but it was some theory paper about how your method was very dominating over other sort of search methods. I don't remember what it was, but I remember that you were a very confident speaker.

Anastasios [00:00:40]: Oh, I totally remember you. Didn't ever connect that, but yes, that's definitely true. Yeah. Nice to see you again.

Swyx [00:00:46]: Yeah. I was frantically looking for the name of your paper and I couldn't find it. Basically I had to cut it because I didn't understand it.

Anastasios [00:00:51]: Is this conformal PID control or was this the online control?

Wei Lin [00:00:55]: Blast from the past, man.

Swyx [00:00:57]: Blast from the past. It's always interesting how NeurIPS and all these academic conferences are sort of six months behind what people are actually doing, but conformal risk control, I would recommend people check it out. I have the recording. I just never published it just because I was like, I don't understand this enough to explain it.

Anastasios [00:01:14]: People won't be interested.

Wei Lin [00:01:15]: It's all good.

Swyx [00:01:16]: But ELO scores, ELO scores are very easy to understand. You guys are responsible for the biggest revolution in language model benchmarking in the last few years. Maybe you guys want to introduce yourselves and maybe tell a little bit of the brief history of LMSys

Wei Lin [00:01:32]: Hey, I'm Wei Lin. I'm a fifth year PhD student at UC Berkeley, working on Chatbot Arena these days, doing crowdsourcing AI benchmarking.

Anastasios [00:01:43]: I'm Anastasios. I'm a sixth year PhD student here at Berkeley. I did most of my PhD on like theoretical statistics and sort of foundations of model evaluation and testing. And now I'm working 150% on this Chatbot Arena stuff. It's great.

Alessio [00:02:00]: And what was the origin of it? How did you come up with the idea? How did you get people to buy in? And then maybe what were one or two of the pivotal moments early on that kind of made it the standard for these things?

Wei Lin [00:02:12]: Yeah, yeah. Chatbot Arena project was started last year in April, May, around that. Before that, we were basically experimenting in a lab how to fine tune a chatbot open source based on the Llama 1 model that I released. At that time, Lama 1 was like a base model and people didn't really know how to fine tune it. So we were doing some explorations. We were inspired by Stanford's Alpaca project. So we basically, yeah, grow a data set from the internet, which is called ShareGPT data set, which is like a dialogue data set between user and chat GPT conversation. It turns out to be like pretty high quality data, dialogue data. So we fine tune on it and then we train it and release the model called V2. And people were very excited about it because it kind of like demonstrate open way model can reach this conversation capability similar to chat GPT. And then we basically release the model with and also build a demo website for the model. People were very excited about it. But during the development, the biggest challenge to us at the time was like, how do we even evaluate it? How do we even argue this model we trained is better than others? And then what's the gap between this open source model that other proprietary offering? At that time, it was like GPT-4 was just announced and it's like Cloud One. What's the difference between them? And then after that, like every week, there's a new model being fine tuned, released. So even until still now, right? And then we have that demo website for V2 now. And then we thought like, okay, maybe we can add a few more of the model as well, like API model as well. And then we quickly realized that people need a tool to compare between different models. So we have like a side by side UI implemented on the website to that people choose, you know, compare. And we quickly realized that maybe we can do something like, like a battle on top of ECLMs, like just anonymize it, anonymize the identity, and that people vote which one is better. So the community decides which one is better, not us, not us arguing, you know, our model is better or what. And that turns out to be like, people are very excited about this idea. And then we tweet, we launch, and that's, yeah, that's April, May. And then it was like first two, three weeks, like just a few hundred thousand views tweet on our launch tweets. And then we have regularly double update weekly, beginning at a time, adding new model GPT-4 as well. So it was like, that was the, you know, the initial.

Anastasios [00:04:58]: Another pivotal moment, just to jump in, would be private models, like the GPT, I'm a little,

Wei Lin [00:05:04]: I'm a little chatty. That was this year. That was this year.

Anastasios [00:05:07]: Huge.

Wei Lin [00:05:08]: That was also huge.

Alessio [00:05:09]: In the beginning, I saw the initial release was May 3rd of the beta board. On April 6, we did a benchmarks 101 episode for a podcast, just kind of talking about, you know, how so much of the data is like in the pre-training corpus and blah, blah, blah. And like the benchmarks are really not what we need to evaluate whether or not a model is good. Why did you not make a benchmark? Maybe at the time, you know, it was just like, Hey, let's just put together a whole bunch of data again, run a, make a score that seems much easier than coming out with a whole website where like users need to vote. Any thoughts behind that?

Wei Lin [00:05:41]: I think it's more like fundamentally, we don't know how to automate this kind of benchmarks when it's more like, you know, conversational, multi-turn, and more open-ended task that may not come with a ground truth. So let's say if you ask a model to help you write an email for you for whatever purpose, there's no ground truth. How do you score them? Or write a story or a creative story or many other things like how we use ChatterBee these days. It's more open-ended. You know, we need human in the loop to give us feedback, which one is better. And I think nuance here is like, sometimes it's also hard for human to give the absolute rating. So that's why we have this kind of pairwise comparison, easier for people to choose which one is better. So from that, we use these pairwise comparison, those to calculate the leaderboard. Yeah. You can add more about this methodology.

Anastasios [00:06:40]: Yeah. I think the point is that, and you guys probably also talked about this at some point, but static benchmarks are intrinsically, to some extent, unable to measure generative model performance. And the reason is because you cannot pre-annotate all the outputs of a generative model. You change the model, it's like the distribution of your data is changing. New labels to deal with that. New labels are great automated labeling, right? Which is why people are pursuing both. And yeah, static benchmarks, they allow you to zoom in to particular types of information like factuality, historical facts. We can build the best benchmark of historical facts, and we will then know that the model is great at historical facts. But ultimately, that's not the only axis, right? And we can build 50 of them, and we can evaluate 50 axes. But it's just so, the problem of generative model evaluation is just so expansive, and it's so subjective, that it's just maybe non-intrinsically impossible, but at least we don't see a way. We didn't see a way of encoding that into a fixed benchmark.

Wei Lin [00:07:47]: But on the other hand, I think there's a challenge where this kind of online dynamic benchmark is more expensive than static benchmark, offline benchmark, where people still need it. Like when they build models, they need static benchmark to track where they are.

Anastasios [00:08:03]: It's not like our benchmark is uniformly better than all other benchmarks, right? It just measures a different kind of performance that has proved to be useful.

Swyx [00:08:14]: You guys also published MTBench as well, which is a static version, let's say, of Chatbot Arena, right? That people can actually use in their development of models.

Wei Lin [00:08:25]: Right. I think one of the reasons we still do this static benchmark, we still wanted to explore, experiment whether we can automate this, because people, eventually, model developers need it to fast iterate their model. So that's why we explored LM as a judge, and ArenaHard, trying to filter, select high-quality data we collected from Chatbot Arena, the high-quality subset, and use that as a question and then automate the judge pipeline, so that people can quickly get high-quality signal, benchmark signals, using this online benchmark.

Swyx [00:09:03]: As a community builder, I'm curious about just the initial early days. Obviously when you offer effectively free A-B testing inference for people, people will come and use your arena. What do you think were the key unlocks for you? Was it funding for this arena? Was it marketing? When people came in, do you see a noticeable skew in the data? Which obviously now you have enough data sets, you can separate things out, like coding and hard prompts, but in the early days, it was just all sorts of things.

Anastasios [00:09:31]: Yeah, maybe one thing to establish at first is that our philosophy has always been to maximize organic use. I think that really does speak to your point, which is, yeah, why do people come? They came to use free LLM inference, right? And also, a lot of users just come to the website to use direct chat, because you can chat with the model for free. And then you could think about it like, hey, let's just be kind of like more on the selfish or conservative or protectionist side and say, no, we're only giving credits for people that battle or so on and so forth. Strategy wouldn't work, right? Because what we're trying to build is like a big funnel, a big funnel that can direct people. And some people are passionate and interested and they battle. And yes, the distribution of the people that do that is different. It's like, as you're pointing out, it's like, that's not as they're enthusiastic.

Wei Lin [00:10:24]: They're early adopters of this technology.

Anastasios [00:10:27]: Or they like games, you know, people like this. And we've run a couple of surveys that indicate this as well, of our user base.

Wei Lin [00:10:36]: We do see a lot of developers come to the site asking polling questions, 20-30%. Yeah, 20-30%.

Anastasios [00:10:42]: It's obviously not reflective of the general population, but it's reflective of some corner of the world of people that really care. And to some extent, maybe that's all right, because those are like the power users. And you know, we're not trying to claim that we represent the world, right? We represent the people that come and vote.

Swyx [00:11:02]: Did you have to do anything marketing-wise? Was anything effective? Did you struggle at all? Was it success from day one?

Wei Lin [00:11:09]: At some point, almost done. Okay. Because as you can imagine, this leaderboard depends on community engagement participation. If no one comes to vote tomorrow, then no leaderboard.

Anastasios [00:11:23]: So we had some period of time when the number of users was just, after the initial launch, it went lower. Yeah. And, you know, at some point, it did not look promising. Actually, I joined the project a couple months in to do the statistical aspects, right? As you can imagine, that's how it kind of hooked into my previous work. At that time, it wasn't like, you know, it definitely wasn't clear that this was like going to be the eval or something. It was just like, oh, this is a cool project. Like Wayland seems awesome, you know, and that's it.

Wei Lin [00:11:56]: Definitely. There's in the beginning, because people don't know us, people don't know what this is for. So we had a hard time. But I think we were lucky enough that we have some initial momentum. And as well as the competition between model providers just becoming, you know, became very intense. Intense. And then that makes the eval onto us, right? Because always number one is number one.

Anastasios [00:12:23]: There's also an element of trust. Our main priority in everything we do is trust. We want to make sure we're doing everything like all the I's are dotted and the T's are crossed and nobody gets unfair treatment and people can see from our profiles and from our previous work and from whatever, you know, we're trustworthy people. We're not like trying to make a buck and we're not trying to become famous off of this or that. It's just, we're trying to provide a great public leaderboard community venture project.

Wei Lin [00:12:51]: Yeah.

Swyx [00:12:52]: Yes. I mean, you are kind of famous now, you know, that's fine. Just to dive in more into biases and, you know, some of this is like statistical control. The classic one for human preference evaluation is humans demonstrably prefer longer contexts or longer outputs, which is actually something that we don't necessarily want. You guys, I think maybe two months ago put out some length control studies. Apart from that, there are just other documented biases. Like, I'd just be interested in your review of what you've learned about biases and maybe a little bit about how you've controlled for them.

Anastasios [00:13:32]: At a very high level, yeah. Humans are biased. Totally agree. Like in various ways. It's not clear whether that's good or bad, you know, we try not to make value judgments about these things. We just try to describe them as they are. And our approach is always as follows. We collect organic data and then we take that data and we mine it to get whatever insights we can get. And, you know, we have many millions of data points that we can now use to extract insights from. Now, one of those insights is to ask the question, what is the effect of style, right? You have a bunch of data, you have votes, people are voting either which way. We have all the conversations. We can say what components of style contribute to human preference and how do they contribute? Now, that's an important question. Why is that an important question? It's important because some people want to see which model would be better if the lengths of the responses were the same, were to be the same, right? People want to see the causal effect of the model's identity controlled for length or controlled for markdown, number of headers, bulleted lists, is the text bold? Some people don't, they just don't care about that. The idea is not to impose the judgment that this is not important, but rather to say ex post facto, can we analyze our data in a way that decouples all the different factors that go into human preference? Now, the way we do this is via statistical regression. That is to say the arena score that we show on our leaderboard is a particular type of linear model, right? It's a linear model that takes, it's a logistic regression that takes model identities and fits them against human preference, right? So it regresses human preference against model identity. What you get at the end of that logistic regression is a parameter vector of coefficients. And when the coefficient is large, it tells you that GPT 4.0 or whatever, very large coefficient, that means it's strong. And that's exactly what we report in the table. It's just the predictive effect of the model identity on the vote. The other thing that you can do is you can take that vector, let's say we have M models, that is an M dimensional vector of coefficients. What you can do is you say, hey, I also want to understand what the effect of length is. So I'll add another entry to that vector, which is trying to predict the vote, right? That tells me the difference in length between two model responses. So we have that for all of our data. We can compute it ex post facto. We added it into the regression and we look at that predictive effect. And then the idea, and this is formally true under certain conditions, not always verifiable ones, but the idea is that adding that extra coefficient to this vector will kind of suck out the predictive power of length and put it into that M plus first coefficient and quote, unquote, de-bias the rest so that the effect of length is not included. And that's what we do in style control. Now we don't just do it for M plus one. We have, you know, five, six different style components that have to do with markdown headers and bulleted lists and so on that we add here. Now, where is this going? You guys see the idea. It's a general methodology. If you have something that's sort of like a nuisance parameter, something that exists and provides predictive value, but you really don't want to estimate that. You want to remove its effect. In causal inference, these things are called like confounders often. What you can do is you can model the effect. You can put them into your model and try to adjust for them. So another one of those things might be cost. You know, what if I want to look at the cost adjusted performance of my model, which models are punching above their weight, parameter count, which models are punching above their weight in terms of parameter count, we can ex post facto measure that. We can do it without introducing anything that compromises the organic nature of the

Wei Lin [00:17:17]: data that we collect.

Anastasios [00:17:18]: Hopefully that answers the question.

Wei Lin [00:17:20]: It does.

Swyx [00:17:21]: So I guess with a background in econometrics, this is super familiar.

Anastasios [00:17:25]: You're probably better at this than me for sure.

Swyx [00:17:27]: Well, I mean, so I used to be, you know, a quantitative trader and so, you know, controlling for multiple effects on stock price is effectively the job. So it's interesting. Obviously the problem is proving causation, which is hard, but you don't have to do that.

Anastasios [00:17:45]: Yes. Yes, that's right. And causal inference is a hard problem and it goes beyond statistics, right? It's like you have to build the right causal model and so on and so forth. But we think that this is a good first step and we're sort of looking forward to learning from more people. You know, there's some good people at Berkeley that work on causal inference for the learning from them on like, what are the really most contemporary techniques that we can use in order to estimate true causal effects if possible.

Swyx [00:18:10]: Maybe we could take a step through the other categories. So style control is a category. It is not a default. I have thought that when you wrote that blog post, actually, I thought it would be the new default because it seems like the most obvious thing to control for. But you also have other categories, you have coding, you have hard prompts. We consider that.

Anastasios [00:18:27]: We're still actively considering it. It's just, you know, once you make that step, once you take that step, you're introducing your opinion and I'm not, you know, why should our opinion be the one? That's kind of a community choice. We could put it to a vote.

Wei Lin [00:18:39]: We could pass.

Anastasios [00:18:40]: Yeah, maybe do a poll. Maybe do a poll.

Swyx [00:18:42]: I don't know. No opinion is an opinion.

Wei Lin [00:18:44]: You know what I mean?

Swyx [00:18:45]: Yeah.

Wei Lin [00:18:46]: There's no neutral choice here.

Swyx [00:18:47]: Yeah. You have all these others. You have instruction following too. What are your favorite categories that you like to talk about? Maybe you tell a little bit of the stories, tell a little bit of like the hard choices that you had to make.

Wei Lin [00:18:57]: Yeah. Yeah. Yeah. I think the, uh, initially the reason why we want to add these new categories is essentially to answer some of the questions from our community, which is we won't have a single leaderboard for everything. So these models behave very differently in different domains. Let's say this model is trend for coding, this model trend for more technical questions and so on. On the other hand, to answer people's question about like, okay, what if all these low quality, you know, because we crowdsource data from the internet, there will be noise. So how do we de-noise? How do we filter out these low quality data effectively? So that was like, you know, some questions we want to answer. So basically we spent a few months, like really diving into these questions to understand how do we filter all these data because these are like medias of data points. And then if you want to re-label yourself, it's possible, but we need to kind of like to automate this kind of data classification pipeline for us to effectively categorize them to different categories, say coding, math, structure, and also harder problems. So that was like, the hope is when we slice the data into these meaningful categories to give people more like better signals, more direct signals, and that's also to clarify what we are actually measuring for, because I think that's the core part of the benchmark. That was the initial motivation. Does that make sense?

Anastasios [00:20:27]: Yeah. Also, I'll just say, this does like get back to the point that the philosophy is to like mine organic, to take organic data and then mine it x plus factor.

Alessio [00:20:35]: Is the data cage-free too, or just organic?

Anastasios [00:20:39]: It's cage-free.

Wei Lin [00:20:40]: No GMO. Yeah. And all of these efforts are like open source, like we open source all of the data cleaning pipeline, filtering pipeline. Yeah.

Swyx [00:20:50]: I love the notebooks you guys publish. Actually really good just for learning statistics.

Wei Lin [00:20:54]: Yeah. I'll share this insights with everyone.

Alessio [00:20:59]: I agree on the initial premise of, Hey, writing an email, writing a story, there's like no ground truth. But I think as you move into like coding and like red teaming, some of these things, there's like kind of like skill levels. So I'm curious how you think about the distribution of skill of the users. Like maybe the top 1% of red teamers is just not participating in the arena. So how do you guys think about adjusting for it? And like feels like this where there's kind of like big differences between the average and the top. Yeah.

Anastasios [00:21:29]: Red teaming, of course, red teaming is quite challenging. So, okay. Moving back. There's definitely like some tasks that are not as subjective that like pairwise human preference feedback is not the only signal that you would want to measure. And to some extent, maybe it's useful, but it may be more useful if you give people better tools. For example, it'd be great if we could execute code with an arena, be fantastic.

Wei Lin [00:21:52]: We want to do it.

Anastasios [00:21:53]: There's also this idea of constructing a user leaderboard. What does that mean? That means some users are better than others. And how do we measure that? How do we quantify that? Hard in chatbot arena, but where it is easier is in red teaming, because in red teaming, there's an explicit game. You're trying to break the model, you either win or you lose. So what you can do is you can say, Hey, what's really happening here is that the models and humans are playing a game against one another. And then you can use the same sort of Bradley Terry methodology with some, some extensions that we came up with in one of you can read one of our recent blog posts for, for the sort of theoretical extensions. You can attribute like strength back to individual players and jointly attribute strength to like the models that are in this jailbreaking game, along with the target tasks, like what types of jailbreaks you want.

Wei Lin [00:22:44]: So yeah.

Anastasios [00:22:45]: And I think that this is, this is a hugely important and interesting avenue that we want to continue researching. We have some initial ideas, but you know, all thoughts are welcome.

Wei Lin [00:22:54]: Yeah.

Alessio [00:22:55]: So first of all, on the code execution, the E2B guys, I'm sure they'll be happy to help

Wei Lin [00:22:59]: you.

Alessio [00:23:00]: I'll please set that up. They're big fans. We're investors in a company called Dreadnought, which we do a lot in AI red teaming. I think to me, the most interesting thing has been, how do you do sure? Like the model jailbreak is one side. We also had Nicola Scarlini from DeepMind on the podcast, and he was talking about, for example, like, you know, context stealing and like a weight stealing. So there's kind of like a lot more that goes around it. I'm curious just how you think about the model and then maybe like the broader system, even with Red Team Arena, you're just focused on like jailbreaking of the model, right? You're not doing kind of like any testing on the more system level thing of the model where like, maybe you can get the training data back, you're going to exfiltrate some of the layers and the weights and things like that.

Wei Lin [00:23:43]: So right now, as you can see, the Red Team Arena is at a very early stage and we are still exploring what could be the potential new games we can introduce to the platform. So the idea is still the same, right? And we build a community driven project platform for people. They can have fun with this website, for sure. That's one thing, and then help everyone to test these models. So one of the aspects you mentioned is stealing secrets, stealing training sets. That could be one, you know, it could be designed as a game. Say, can you still use their credential, you know, we hide, maybe we can hide the credential into system prompts and so on. So there are like a few potential ideas we want to explore for sure. Do you want to add more?

Anastasios [00:24:28]: I think that this is great. This idea is a great one. There's a lot of great ideas in the Red Teaming space. You know, I'm not personally like a Red Teamer. I don't like go around and Red Team models, but there are people that do that and they're awesome. They're super skilled. When I think about the Red Team arena, I think those are really the people that we're building it for. Like, we want to make them excited and happy, build tools that they like. And just like chatbot arena, we'll trust that this will end up being useful for the world. And all these people are, you know, I won't say all these people in this community are actually good hearted, right? They're not doing it because they want to like see the world burn. They're doing it because they like, think it's fun and cool. And yeah. Okay. Maybe they want to see, maybe they want a little bit.

Wei Lin [00:25:13]: I don't know. Majority.

Anastasios [00:25:15]: Yeah.

Wei Lin [00:25:16]: You know what I'm saying.

Anastasios [00:25:17]: So, you know, trying to figure out how to serve them best, I think, I don't know where that fits. I just, I'm not expressing. And give them credits, right?

Wei Lin [00:25:24]: And give them credit.

Anastasios [00:25:25]: Yeah. Yeah. So I'm not trying to express any particular value judgment here as to whether that's the right next step. It's just, that's sort of the way that I think we would think about it.

Swyx [00:25:35]: Yeah. We also talked to Sander Schulhoff of the HackerPrompt competition, and he's pretty interested in Red Teaming at scale. Let's just call it that. You guys maybe want to talk with him.

Wei Lin [00:25:45]: Oh, nice.

Swyx [00:25:46]: We wanted to cover a little, a few topical things and then go into the other stuff that your group is doing. You know, you're not just running Chatbot Arena. We can also talk about the new website and your future plans, but I just wanted to briefly focus on O1. It is the hottest, latest model. Obviously, you guys already have it on the leaderboard. What is the impact of O1 on your evals?

Wei Lin [00:26:06]: Made our interface slower.

Anastasios [00:26:07]: It made it slower.

Swyx [00:26:08]: Yeah.

Wei Lin [00:26:10]: Because it needs like 30, 60 seconds, sometimes even more to, the latency is like higher. So that's one. Sure. But I think we observe very interesting things from this model as well. Like we observe like significant improvement in certain categories, like more technical or math. Yeah.

Anastasios [00:26:32]: I think actually like one takeaway that was encouraging is that I think a lot of people before the O1 release were thinking, oh, like this benchmark is saturated. And why were they thinking that? They were thinking that because there was a bunch of models that were kind of at the same level. They were just kind of like incrementally competing and it sort of wasn't immediately obvious that any of them were any better. Nobody, including any individual person, it's hard to tell. But what O1 did is it was, it's clearly a better model for certain tasks. I mean, I used it for like proving some theorems and you know, there's some theorems that like only I know because I still do a little bit of theory. Right. So it's like, I can go in there and ask like, oh, how would you prove this exact thing? Which I can tell you has never been in the public domain. It'll do it. It's like, what?

Wei Lin [00:27:19]: Okay.

Anastasios [00:27:20]: So there's this model and it crushed the benchmark. You know, it's just like really like a big gap. And what that's telling us is that it's not saturated yet. It's still measuring some signal. That was encouraging. The point, the takeaway is that the benchmark is comparative. There's no absolute number. There's no maximum ELO. It's just like, if you're better than the rest, then you win. I think that was actually quite helpful to us.

Swyx [00:27:46]: I think people were criticizing, I saw some of the academics criticizing it as not apples to apples. Right. Like, because it can take more time to reason, it's basically doing some search, doing some chain of thought that if you actually let the other models do that same thing, they might do better.

Wei Lin [00:28:03]: Absolutely.

Anastasios [00:28:04]: To be clear, none of the leaderboard currently is apples to apples because you have like Gemini Flash, you have, you know, all sorts of tiny models like Lama 8B, like 8B and 405B are not apples to apples.

Wei Lin [00:28:19]: Totally agree. They have different latencies.

Anastasios [00:28:21]: Different latencies.

Wei Lin [00:28:22]: Control for latency. Yeah.

Anastasios [00:28:24]: Latency control. That's another thing. We can do style control, but latency control. You know, things like this are important if you want to understand the trade-offs involved in using AI.

Swyx [00:28:34]: O1 is a developing story. We still haven't seen the full model yet, but it's definitely a very exciting new paradigm. I think one community controversy I just wanted to give you guys space to address is the collaboration between you and the large model labs. People have been suspicious, let's just say, about how they choose to A-B test on you. I'll state the argument and let you respond, which is basically they run like five anonymous models and basically argmax their Elo on LMSYS or chatbot arena, and they release the best one. Right? What has been your end of the controversy? How have you decided to clarify your policy going forward?

Wei Lin [00:29:15]: On a high level, I think our goal here is to build a fast eval for everyone, and including everyone in the community can see the data board and understand, compare the models. More importantly, I think we want to build the best eval also for model builders, like all these frontier labs building models. They're also internally facing a challenge, which is how do they eval the model? That's the reason why we want to partner with all the frontier lab people, and then to help them testing. That's one of the... We want to solve this technical challenge, which is eval. Yeah.

Anastasios [00:29:54]: I mean, ideally, it benefits everyone, right?

Wei Lin [00:29:56]: Yeah.

Anastasios [00:29:57]: And people also are interested in seeing the leading edge of the models. People in the community seem to like that. Oh, there's a new model up. Is this strawberry? People are excited. People are interested. Yeah. And then there's this question that you bring up of, is it actually causing harm?

Wei Lin [00:30:15]: Right?

Anastasios [00:30:16]: Is it causing harm to the benchmark that we are allowing this private testing to happen? Maybe stepping back, why do you have that instinct? The reason why you and others in the community have that instinct is because when you look at something like a benchmark, like an image net, a static benchmark, what happens is that if I give you a million different models that are all slightly different, and I pick the best one, there's something called selection bias that plays in, which is that the performance of the winning model is overstated. This is also sometimes called the winner's curse. And that's because statistical fluctuations in the evaluation, they're driving which model gets selected as the top. So this selection bias can be a problem. Now there's a couple of things that make this benchmark slightly different. So first of all, the selection bias that you include when you're only testing five models is normally empirically small.

Wei Lin [00:31:12]: And that's why we have these confidence intervals constructed.

Anastasios [00:31:16]: That's right. Yeah. Our confidence intervals are actually not multiplicity adjusted. One thing that we could do immediately tomorrow in order to address this concern is if a model provider is testing five models and they want to release one, and we're constructing the models at level one minus alpha, we can just construct the intervals instead at level one minus alpha divided by five. That's called Bonferroni correction. What that'll tell you is that the final performance of the model, the interval that gets constructed, is actually formally correct. We don't do that right now, partially because we know from simulations that the amount of selection bias you incur with these five things is just not huge. It's not huge in comparison to the variability that you get from just regular human voters. So that's one thing. But then the second thing is the benchmark is live, right? So what ends up happening is it'll be a small magnitude, but even if you suffer from the winner's curse after testing these five models, what'll happen is that over time, because we're getting new data, it'll get adjusted down. So if there's any bias that gets introduced at that stage, in the long run, it actually doesn't matter. Because asymptotically, basically in the long run, there's way more fresh data than there is data that was used to compare these five models against these private models.

Swyx [00:32:35]: The announcement effect is only just the first phase and it has a long tail.

Anastasios [00:32:39]: Yeah, that's right. And it sort of like automatically corrects itself for this selection adjustment.

Swyx [00:32:45]: Every month, I do a little chart of LMSys Elo versus cost, just to track the price per dollar, the amount of like, how much money do I have to pay for one incremental point in ELO? And so I actually observe an interesting stability in most of the Elo numbers, except for some of them. For example, GPT-4-O August has fallen from 12.90𝑡𝑜12.90to12.60 over the past few months. And it's surprising.

Wei Lin [00:33:11]: You're saying like a new version of GPT-4-O versus the version in May?

Swyx [00:33:17]: There was May. May is $12.85. I could have made some data entry error, but it'd be interesting to track these things over time. Anyway, I observed like numbers go up, numbers go down. It's remarkably stable. Gotcha.

Anastasios [00:33:28]: So there are two different track points and the Elo has fallen.

Wei Lin [00:33:31]: Yes.

Swyx [00:33:32]: And sometimes ELOs rise as well. I think a core rose from 1,200𝑡𝑜1,200to1,230. And that's one of the things, by the way, the community is always suspicious about, like, hey, did this same endpoint get dumber after release? Right? It's such a meme.

Anastasios [00:33:45]: That's funny. But those are different endpoints, right?

Wei Lin [00:33:47]: Yeah, those are different API endpoints, I think. For GPT-4-O, August and May. But if it's for like, you know, endpoint versions we fixed, usually we observe small variation after release.

Anastasios [00:34:04]: I mean, you can quantify the variations that you would expect in an ELO. That's a close form number that you can calculate. So if the variations are larger than we would expect, then that indicates that we should

Wei Lin [00:34:17]: look into that. For sure.

Anastasios [00:34:19]: That's important for us to know. So maybe you should send us a reply. Yeah, please.

Wei Lin [00:34:22]: I'll send you some data. Yeah.

Alessio [00:34:24]: And I know we only got a few minutes before we wrap, but there are two things I would definitely love to talk about. One is route LLM. So talking about models, maybe getting dumber over time, blah, blah, blah. Are routers actually helpful in your experience? And Sean pointed out that MOEs are technically routers too. So how do you kind of think about the router being part of the model versus routing different models? And yeah, overall learnings from building it?

Wei Lin [00:34:51]: Yeah. So route LLM is a project we released a few months ago, I think. And our goal was to basically understand, can we use the preference data we collect to route model based on the question, conditional on the questions, because we will make assumption that some model are good at math, some model are good at coding, things like that. So we found it somewhat useful. For sure, this is like ongoing effort. Our first phase with this project is pretty much like open source, the framework that we develop. So for anyone interested in this problem, they can use the framework, and then they can train their own router model, and then to do evaluation to benchmark. So that's our goal, the reason why we released this framework. And I think there are a couple of future stuff we are thinking. One is, can we just scale this, do even more data, even more preference data, and then train a reward model, train like a router model, better router model. Another thing is, release a benchmark, because right now, currently, there seems to be, one of the end point when we developed this project was like, there's just no good benchmark for a router. So that will be another thing we think could be a useful contribution to community. And there's still, for sure, methodology, new methodology we can use.

Swyx [00:36:18]: I think my fundamental philosophical doubt is, does the router model have to be at least as smart as the smartest model? What's the minimum required intelligence of a router model, right? Like, if it's too dumb, it's not going to route properly.

Anastasios [00:36:32]: Well, I think that you can build a very, very simple router that is very effective. So let me give you an example. You can build a great router with one parameter, and the parameter is just like, I'm going to check if my question is hard. And if it's hard, then I'm going to go to the big model. If it's easy, I'm going to go to the little model. You know, there's various ways of measuring hard that are like, pretty trivial, right? Like, does it have code? Does it have math? Is it long? That's already a great first step, right? Because ultimately, at the end of the day, you're competing with a weak baseline, which is any individual model. And you're trying to ask the question, how do I improve cost? And that's like a one-dimensional trade-off. It's like performance cost, and it's great. Now, you can also get into the extension, which is what models are good at what particular

Wei Lin [00:37:23]: types of queries.

Anastasios [00:37:25]: And then, you know, I think your concern starts taking into effect is, can we actually do that? Can we estimate which models are good in which parts of the space in a way that doesn't introduce more variability and more variation and error into our final pipeline than just using the best of them? That's kind of how I see it.

Swyx [00:37:44]: Your approach is really interesting compared to the commercial approaches where you use information from the chat arena to inform your model, which is, I mean, smart, and it's the foundation of everything you do. Yep.

Alessio [00:37:56]: As we wrap, can we just talk about LMSYS and what that's going to be going forward? Like, LMRENA, I'm becoming something. I saw you announced yesterday you're graduating. I think maybe that was confusing since you're PhD students, but this is a different type

Wei Lin [00:38:09]: of graduation.

Anastasios [00:38:10]: Just for context, LMSYS started as like a student club.

Wei Lin [00:38:15]: Student driven. Yeah.

Anastasios [00:38:16]: Student driven, like research projects, you know, many different research projects are part of LMSYS. Sort of chatbot arena has, of course, like kind of become its own thing. And Lianmin and Ying, who are, you know, created LMSYS, have kind of like moved on to working on SGLANG. And now they're doing other projects that are sort of originated from LMSYS. And for that reason, we thought it made sense to kind of decouple the two. Just so, A, the LMSYS thing, it's not like when someone says LMSYS, they think of chatbot arena. That's not fair, so to speak.

Wei Lin [00:38:52]: And we want to support new projects.

Anastasios [00:38:54]: And we want to support new projects and so on and so forth. But of course, these are all like, you know, our friends.

Wei Lin [00:38:59]: So that's why we call it graduation. I agree.

Alessio [00:39:03]: That's like one thing that people wear. Maybe a little confused by where LMSYS kind of starts and ends and where arena starts

Wei Lin [00:39:10]: and ends.

Alessio [00:39:10]: So I think you reach escape velocity now that you're kind of like your own thing.

Swyx [00:39:15]: So I have one parting question. Like, what do you want more of? Like, what do you want people to approach you with?

Anastasios [00:39:21]: Oh, my God, we need so much help. One thing would be like, we're obviously expanding into like other kinds of arenas, right? We definitely need like active help on red teaming. We definitely need active help on our different modalities, different modalities.

Wei Lin [00:39:35]: So pilot, yeah, coding, coding.

Anastasios [00:39:38]: You know, if somebody could like help us implement this, like REPL in REPL in chatbot arena,

Wei Lin [00:39:44]: massive, that would be a massive delta.

Anastasios [00:39:45]: And I know that there's people out there who are passionate and capable of doing it. It's just, we don't have enough hands on deck. We're just like an academic research lab, right? We're not equipped to support this kind of project. So, yeah, we need help with that. We also need just like general back-end dev. And new ideas, new conceptual ideas. I mean, honestly, the work that we do spans everything from like foundational statistics, like new proofs to full stack dev. And like anybody who's like, wants to contribute something to that pipeline is, should definitely reach out.

Wei Lin [00:40:22]: We need it. And it's an open source project anyways. Anyone can make a PR.

Anastasios [00:40:26]: And we're happy to, you know, whoever wants to contribute, we'll give them credit, you know? We're not trying to keep all the credit for ourselves. We want it to be a community project.

Wei Lin [00:40:33]: That's great.

Alessio [00:40:34]: And fits this pair of everything you've been doing over there. So, awesome, guys. Well, thank you so much for taking the time. And we'll put all the links in the show notes so that people can find you and reach out if they need it. Thank you so much.

Anastasios [00:40:46]: It's very nice to talk to you. And thank you for the wonderful questions.

Wei Lin [00:40:49]: Thank you so much.

Get full access to Latent Space at www.latent.space/subscribe

Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
Abonneren Afmelden
Deel

Afleveringen

2024 in Agents [LS Live! @ NeurIPS 2024]
25 dec· Latent Space: The AI Engineer Podcast
Happy holidays! We’ll be sharing snippets from Latent Space LIVE! through the break bringing you the best of 2024! We want to express our deepest appreciation to event sponsors AWS, Daylight Computer, Thoth.ai, StrongCompute, Notable Capital, and most of all all our LS supporters who helped fund the gorgeous venue and A/V production!
For NeurIPS last year we did our standard conference podcast coverage interviewing selected papers (that we have now also done for ICLR and ICML), however we felt that we could be doing more to help AI Engineers 1) get more industry-relevant content, and 2) recap 2024 year in review from experts. As a result, we organized the first Latent Space LIVE!, our first in person miniconference, at NeurIPS 2024 in Vancouver.
Our next keynote covers The State of LLM Agents, with the triumphant return of Professor Graham Neubig’s return to the pod (his ICLR episode here!). OpenDevin is now a startup known as AllHands! The renamed OpenHands has done extremely well this year, as they end the year sitting comfortably at number 1 on the hardest SWE-Bench Full leaderboard at 29%, though on the smaller SWE-Bench Verified, they are at 53%, behind Amazon Q, devlo, and OpenAI's self reported o3 results at 71.7%.
Many are saying that 2025 is going to be the year of agents, with OpenAI, DeepMind and Anthropic setting their sights on consumer and coding agents, vision based computer-using agents and multi agent systems. There has been so much progress on the practical reliability and applications of agents in all domains, from the huge launch of Cognition AI's Devin this year, to the sleeper hit of Cursor Composer and Codeium's Windsurf Cascade in the IDE arena, to the explosive revenue growth of Stackblitz's Bolt, Lovable, and Vercel's v0, and the unicorn rounds and high profile movements of customer support agents like Sierra (now worth $4 billion) and search agents like Perplexity (now worth $9 billion). We wanted to take a little step back to understand the most notable papers of the year in Agents, and Graham indulged with his list of 8 perennial problems in building agents in 2024.
Must-Read Papers for the 8 Problems of Agents
* The agent-computer interface: CodeAct: Executable Code Actions Elicit Better LLM Agents. Minimial viable tools: Execution Sandbox, File Editor, Web Browsing
* The human-agent interface: Chat UI, GitHub Plugin, Remote runtime, …?
* Choosing an LLM: See Evaluation of LLMs as Coding Agents on SWE-Bench at 30x - must understand instructions, tools, code, environment, error recovery
* Planning: Single Agent Systems vs Multi Agent (CoAct: A Global-Local Hierarchy for Autonomous Agent Collaboration) - Explicit vs Implicit, Curated vs Generated
* Reusable common workflows: SteP: Stacked LLM Policies for Web Actions and Agent Workflow Memory - Manual prompting vs Learning from Experience
* Exploration: Agentless: Demystifying LLM-based Software Engineering Agents and BAGEL: Bootstrapping Agents by Guiding Exploration with Language
* Search: Tree Search for Language Model Agents - explore paths and rewind
* Evaluation: Fast Sanity Checks (miniWoB and Aider) and Highly Realistic (WebArena, SWE-Bench) and SWE-Gym: An Open Environment for Training Software Engineering Agents & Verifiers
Full Talk on YouTube
Please like and subscribe!
Timestamps
* 00:00 Welcome to Latent Space Live at NeurIPS 2024
* 00:29 State of LLM Agents in 2024
* 02:20 Professor Graham Newbig's Insights on Agents
* 03:57 Live Demo: Coding Agents in Action
* 08:20 Designing Effective Agents
* 14:13 Choosing the Right Language Model for Agents
* 16:24 Planning and Workflow for Agents
* 22:21 Evaluation and Future Predictions for Agents
* 25:31 Future of Agent Development
* 25:56 Human-Agent Interaction Challenges
* 26:48 Expanding Agent Use Beyond Programming
* 27:25 Redesigning Systems for Agent Efficiency
* 28:03 Accelerating Progress with Agent Technology
* 28:28 Call to Action for Open Source Contributions
* 30:36 Q&A: Agent Performance and Benchmarks
* 33:23 Q&A: Web Agents and Interaction Methods
* 37:16 Q&A: Agent Architectures and Improvements
* 43:09 Q&A: Self-Improving Agents and Authentication
* 47:31 Live Demonstration and Closing Remarks
Transcript
[00:00:29] State of LLM Agents in 2024
[00:00:29] Speaker 9: Our next keynote covers the state of LLM agents. With the triumphant return of Professor Graham Newbig of CMU and OpenDevon, now a startup known as AllHands. The renamed OpenHands has done extremely well this year, as they end the year sitting comfortably at number one on the hardest SWE Benchful leaderboard at 29%.
[00:00:53] Speaker 9: Though, on the smaller SWE bench verified, they are at 53 percent behind Amazon Q [00:01:00] Devlo and OpenAI's self reported O3 results at 71. 7%. Many are saying that 2025 is going to be the year of agents, with OpenAI, DeepMind, and Anthropic setting their sights on consumer and coding agents. Vision based computer using agents and multi agent systems.
[00:01:22] Speaker 9: There has been so much progress on the practical reliability and applications of agents in all domains, from the huge launch of Cognition AI's Devon this year, to the sleeper hit of Cursor Composer and recent guest Codium's Windsurf Cascade in the IDE arena. To the explosive revenue growth of recent guests StackBlitz's Bolt, Lovable, and Vercel's vZero.
[00:01:44] Speaker 9: And the unicorn rounds and high profile movements of customer support agents like Sierra, now worth 4 billion, and search agents like Perplexity, now worth 9 billion. We wanted to take a little step back to understand the most notable papers of the year in [00:02:00] agents, and Graham indulged with his list of eight perennial problems in building agents.
[00:02:06] Speaker 9: As always, don't forget to check our show notes for all the selected best papers of 2024, and for the YouTube link to their talk. Graham's slides were especially popular online, and we are honoured to have him. Watch out and take care!
[00:02:20] Professor Graham Newbig's Insights on Agents
[00:02:20] Speaker: Okay hi everyone. So I was given the task of talking about agents in 2024, and this is An impossible task because there are so many agents, so many agents in 2024. So this is going to be strongly covered by like my personal experience and what I think is interesting and important, but I think it's an important topic.
[00:02:41] Speaker: So let's go ahead. So the first thing I'd like to think about is let's say I gave you you know, a highly competent human, some tools. Let's say I gave you a web browser and a terminal or a file system. And the ability to [00:03:00] edit text or code. What could you do with that? Everything. Yeah.
[00:03:07] Speaker: Probably a lot of things. This is like 99 percent of my, you know, daily daily life, I guess. When I'm, when I'm working. So, I think this is a pretty powerful tool set, and I am trying to do, and what I think some other people are trying to do, is come up with agents that are able to, you know, manipulate these things.
[00:03:26] Speaker: Web browsing, coding, running code in successful ways. So there was a little bit about my profile. I'm a professor at CMU, chief scientist at All Hands AI, building open source coding agents. I'm maintainer of OpenHands, which is an open source coding agent framework. And I'm also a software developer and I, I like doing lots of coding and, and, you know, shipping new features and stuff like this.
[00:03:51] Speaker: So building agents that help me to do this, you know, is kind of an interesting thing, very close to me.
[00:03:57] Live Demo: Coding Agents in Action
[00:03:57] Speaker: So the first thing I'd like to do is I'd like to try [00:04:00] some things that I haven't actually tried before. If anybody has, you know, tried to give a live demo, you know, this is, you know very, very scary whenever you do it and it might not work.
[00:04:09] Speaker: So it might not work this time either. But I want to show you like three things that I typically do with coding agents in my everyday work. I use coding agents maybe five to 10 times a day to help me solve my own problems. And so this is a first one. This is a data science task. Which says I want to create scatter plots that show the increase of the SWE bench score over time.
[00:04:34] Speaker: And so I, I wrote a kind of concrete prompt about this. Agents work better with like somewhat concrete prompts. And I'm gonna throw this into open hands and let it work. And I'll, I'll go back to that in a second. Another thing that I do is I create new software. And I, I've been using a [00:05:00] service a particular service.
[00:05:01] Speaker: I won't name it for sending emails and I'm not very happy with it. So I want to switch over to this new service called resend. com, which makes it easier to send emails. And so I'm going to ask it to read the docs for the resend. com API and come up with a script that allows me to send emails. The input to the script should be a CSV file and the subject and body should be provided in Jinja2 templates.
[00:05:24] Speaker: So I'll start another agent and and try to get it to do that for me.
[00:05:35] Speaker: And let's go with the last one. The last one I do is. This is improving existing software and in order, you know, once you write software, you usually don't throw it away. You go in and, like, actually improve it iteratively. This software that I have is something I created without writing any code.
[00:05:52] Speaker: It's basically software to monitor how much our our agents are contributing to the OpenHance repository. [00:06:00] And on the, let me make that a little bit bigger, on the left side, I have the number of issues where it like sent a pull request. I have the number of issues where it like sent a pull request, whether it was merged in purple, closed in red, or is still open in green. And so these are like, you know, it's helping us monitor, but one thing it doesn't tell me is the total number. And I kind of want that feature added to this software.
[00:06:33] Speaker: So I'm going to try to add that too. So. I'll take this, I'll take this prompt,
[00:06:46] Speaker: and here I want to open up specifically that GitHub repo. So I'll open up that repo and paste in the prompt asking it. I asked it to make a pie chart for each of these and give me the total over the entire time period that I'm [00:07:00] monitoring. So we'll do that. And so now I have let's see, I have some agents.
[00:07:05] Speaker: Oh, this one already finished. Let's see. So this one already finished. You can see it finished analyzing the Swebench repository. It wrote a demonstration of, yeah, I'm trying to do that now, actually.
[00:07:30] Speaker: It wrote a demonstration of how much each of the systems have improved over time. And I asked it to label the top three for each of the data sets. And so it labeled OpenHands as being the best one for SWE Bench Normal. For SWE Bench Verified, it has like the Amazon QAgent and OpenHands. For the SWE Bench Lite, it has three here over three over here.
[00:07:53] Speaker: So you can see like. That's pretty useful, right? If you're a researcher, you do data analysis all the time. I did it while I was talking to all [00:08:00] of you and making a presentation. So that's, that's pretty nice. I, I doubt the other two are finished yet. That would be impressive if the, yeah. So I think they're still working.
[00:08:09] Speaker: So maybe we'll get back to them at the end of the presentation. But so these are the kinds of the, these are the kinds of things that I do every day with coding agents now. And it's or software development agents. It's pretty impressive.
[00:08:20] Designing Effective Agents
[00:08:20] Speaker: The next thing I'd like to talk about a little bit is things I worry about when designing agents.
[00:08:24] Speaker: So we're designing agents to, you know, do a very difficult task of like navigating websites writing code, other things like this. And within 2024, there's been like a huge improvement in the methodology that we use to do this. But there's a bunch of things we think about. There's a bunch of interesting papers, and I'd like to introduce a few of them.
[00:08:46] Speaker: So the first thing I worry about is the agent computer interface. Like, how do we get an agent to interact with computers? And, How do we provide agents with the tools to do the job? And [00:09:00] within OpenHands we are doing the thing on the right, but there's also a lot of agents that do the thing on the left.
[00:09:05] Speaker: So the thing on the left is you give like agents kind of granular tools. You give them tools like or let's say your instruction is I want to determine the most cost effective country to purchase the smartphone model, Kodak one the countries to consider are the USA, Japan, Germany, and India. And you have a bunch of available APIs.
[00:09:26] Speaker: And. So what you do for some agents is you provide them all of these tools APIs as tools that they can call. And so in this particular case in order to solve this problem, you'd have to make about like 30 tool calls, right? You'd have to call lookup rates for Germany, you'd have to look it up for the US, Japan, and India.
[00:09:44] Speaker: That's four tool goals. And then you go through and do all of these things separately. And the method that we adopt in OpenHands instead is we provide these tools, but we provide them by just giving a coding agent, the ability to call [00:10:00] arbitrary Python code. And. In the arbitrary Python code, it can call these tools.
[00:10:05] Speaker: We expose these tools as APIs that the model can call. And what that allows us to do is instead of writing 20 tool calls, making 20 LLM calls, you write a program that runs all of these all at once, and it gets the result. And of course it can execute that program. It can, you know, make a mistake. It can get errors back and fix things.
[00:10:23] Speaker: But that makes our job a lot easier. And this has been really like instrumental to our success, I think. Another part of this is what tools does the agent need? And I, I think this depends on your use case, we're kind of extreme and we're only giving the agent five tools or maybe six tools.
[00:10:40] Speaker: And what, what are they? The first one is program execution. So it can execute bash programs, and it can execute Jupyter notebooks. It can execute cells in Jupyter notebooks. So that, those are two tools. Another one is a file editing tool. And the file editing tool allows you to browse parts of files.[00:11:00]
[00:11:00] Speaker: And kind of read them, overwrite them, other stuff like this. And then we have another global search and replace tool. So it's actually two tools for file editing. And then a final one is web browsing, web browsing. I'm kind of cheating when I call it only one tool. You actually have like scroll and text input and click and other stuff like that.
[00:11:18] Speaker: But these are basically the only things we allow the agent to do. What, then the question is, like, what if we wanted to allow it to do something else? And the answer is, well, you know, human programmers already have a bunch of things that they use. They have the requests PyPy library, they have the PDF to text PyPy library, they have, like, all these other libraries in the Python ecosystem that they could use.
[00:11:41] Speaker: And so if we provide a coding agent with all these libraries, it can do things like data visualization and other stuff that I just showed you. So it can also get clone repositories and, and other things like this. The agents are super good at using the GitHub API also. So they can do, you know, things on GitHub, like finding all of the, you know, [00:12:00] comments on your issues or checking GitHub actions and stuff.
[00:12:02] Speaker: The second thing I think about is the human agent interface. So this is like how do we get humans to interact with agents? Bye. I already showed you one variety of our human agent interface. It's basically a chat window where you can browse through the agent's results and things like this. This is very, very difficult.
[00:12:18] Speaker: I, I don't think anybody has a good answer to this, and I don't think we have a good answer to this, but the, the guiding principles that I'm trying to follow are we want to present enough info to the user. So we want to present them with, you know, what the agent is doing in the form of a kind of.
[00:12:36] Speaker: English descriptions. So you can see here you can see here every time it takes an action, it says like, I will help you create a script for sending emails. When it runs a bash command. Sorry, that's a little small. When it runs a bash command, it will say ran a bash command. It won't actually show you the whole bash command or the whole Jupyter notebook because it can be really large, but you can open it up and see if you [00:13:00] want to, by clicking on this.
[00:13:01] Speaker: So like if you want to explore more, you can click over to the Jupyter notebook and see what's displayed in the Jupyter notebook. And you get like lots and lots of information. So that's one thing.
[00:13:16] Speaker: Another thing is go where the user is. So like if the user's already interacting in a particular setting then I'd like to, you know, integrate into that setting, but only to a point. So at OpenHands, we have a chat UI for interaction. We have a GitHub plugin for tagging and resolving issues. So basically what you do is you Do at open hands agent and the open hands agent will like see that comment and be able to go in and fix things.
[00:13:42] Speaker: So if you say at open hands agent tests are failing on this PR, please fix the tests. It will go in and fix the test for you and stuff like this. Another thing we have is a remote runtime for launching headless jobs. So if you want to launch like a fleet of agents to solve, you know five different problems at once, you can also do [00:14:00] that through an API.
[00:14:00] Speaker: So we have we have these interfaces and this probably depends on the use case. So like, depending if you're a coding agent, you want to do things one way. If you're a like insurance auditing agent, you'll want to do things other ways, obviously.
[00:14:13] Choosing the Right Language Model for Agents
[00:14:13] Speaker: Another thing I think about a lot is choosing a language model.
[00:14:16] Speaker: And for agentic LMs we have to have a bunch of things work really well. The first thing is really, really good instruction following ability. And if you have really good instruction following ability, it opens up like a ton of possible applications for you. Tool use and coding ability. So if you provide tools, it needs to be able to use them well.
[00:14:38] Speaker: Environment understanding. So it needs, like, if you're building a web agent, it needs to be able to understand web pages either through vision or through text. And error awareness and recovery ability. So, if it makes a mistake, it needs to be able to, you know, figure out why it made a mistake, come up with alternative strategies, and other things like this.
[00:14:58] Speaker: [00:15:00] Under the hood, in all of the demos that I did now Cloud, we're using Cloud. Cloud has all of these abilities very good, not perfect, but very good. Most others don't have these abilities quite as much. So like GPT 4. 0 doesn't have very good error recovery ability. And so because of this, it will go into loops and do the same thing over and over and over again.
[00:15:22] Speaker: Whereas Claude does not do this. Claude, if you, if you use the agents enough, you get used to their kind of like personality. And Claude says, Hmm, let me try a different approach a lot. So, you know, obviously it's been trained in some way to, you know, elicit this ability. We did an evaluation. This is old.
[00:15:40] Speaker: And we need to update this basically, but we evaluated CLOD, mini LLAMA 405B, DeepSeq 2. 5 on being a good code agent within our framework. And CLOD was kind of head and shoulders above the rest. GPT 40 was kind of okay. The best open source model was LLAMA [00:16:00] 3. 1 405B. This needs to be updated because this is like a few months old by now and, you know, things are moving really, really fast.
[00:16:05] Speaker: But I still am under the impression that Claude is the best. The other closed models are, you know, not quite as good. And then the open models are a little bit behind that. Grok, I, we haven't tried Grok at all, actually. So, it's a good question. If you want to try it I'd be happy to help.
[00:16:24] Speaker: Cool.
[00:16:24] Planning and Workflow for Agents
[00:16:24] Speaker: Another thing is planning. And so there's a few considerations for planning. The first one is whether you have a curated plan or you have it generated on the fly. And so for solving GitHub issues, you can kind of have an overall plan. Like the plan is first reproduce. If there's an issue, first write tests to reproduce the issue or to demonstrate the issue.
[00:16:50] Speaker: After that, run the tests and make sure they fail. Then go in and fix the tests. Run the tests again to make sure they pass and then you're done. So that's like a pretty good workflow [00:17:00] for like solving coding issues. And you could curate that ahead of time. Another option is to let the language model basically generate its own plan.
[00:17:10] Speaker: And both of these are perfectly valid. Another one is explicit structure versus implicit structure. So let's say you generate a plan. If you have explicit structure, you could like write a multi agent system, and the multi agent system would have your reproducer agent, and then it would have your your bug your test writer agent, and your bug fixer agent, and lots of different agents, and you would explicitly write this all out in code, and then then use it that way.
[00:17:38] Speaker: On the other hand, you could just provide a prompt that says, please do all of these things in order. So in OpenHands, we do very light planning. We have a single prompt. We don't have any multi agent systems. But we do provide, like, instructions about, like, what to do first, what to do next, and other things like this.
[00:17:56] Speaker: I'm not against doing it the other way. But I laid [00:18:00] out some kind of justification for this in this blog called Don't Sleep on Single Agent Systems. And the basic idea behind this is if you have a really, really good instruction following agent it will follow the instructions as long as things are working according to your plan.
[00:18:14] Speaker: But let's say you need to deviate from your plan, you still have the flexibility to do this. And if you do explicit structure through a multi agent system, it becomes a lot harder to do that. Like, you get stuck when things deviate from your plan. There's also some other examples, and I wanted to introduce a few papers.
[00:18:30] Speaker: So one paper I liked recently is this paper called CoAct where you generate plans and then go in and fix them. And so the basic idea is like, if you need to deviate from your plan, you can You know, figure out that your plan was not working and go back and deviate from it.
[00:18:49] Speaker: Another thing I think about a lot is specifying common workflows. So we're trying to tackle a software development and I already showed like three use cases where we do [00:19:00] software development and when we. We do software development, we do a ton of different things, but we do them over and over and over again.
[00:19:08] Speaker: So just to give an example we fix GitHub actions when GitHub actions are failing. And we do that over and over and over again. That's not the number one thing that software engineers do, but it's a, you know, high up on the list. So how can we get a list of all of, like, the workflows that people are working on?
[00:19:26] Speaker: And there's a few research works that people have done in this direction. One example is manual prompting. So there's this nice paper called STEP that got state of the art on the WebArena Web Navigation Benchmark where they came up with a bunch of manual workflows for solving different web navigation tasks.
[00:19:43] Speaker: And we also have a paper recently called Agent Workflow Memory where the basic idea behind this is we want to create self improving agents that learn from their past successes. And the way it works is is we have a memory that has an example of lots of the previous [00:20:00] workflows that people have used. And every time the agent finishes a task and it self judges that it did a good job at that task, you take that task, you break it down into individual workflows included in that, and then you put it back in the prompt for the agent to work next time.
[00:20:16] Speaker: And this we demonstrated that this leads to a 22. 5 percent increase on WebArena after 40 examples. So that's a pretty, you know, huge increase by kind of self learning and self improvement.
[00:20:31] Speaker: Another thing is exploration. Oops. And one thing I think about is like, how can agents learn more about their environment before acting? And I work on coding and web agents, and there's, you know, a few good examples of this in, in both areas. Within coding, I view this as like repository understanding, understanding the code base that you're dealing with.
[00:20:55] Speaker: And there's an example of this, or a couple examples of this, one example being AgentList. [00:21:00] Where they basically create a map of the repo and based on the map of the repo, they feed that into the agent so the agent can then navigate the repo and and better know where things are. And for web agents there's an example of a paper called Bagel, and basically what they do is they have the agent just do random tasks on a website, explore the website, better understand the structure of the website, and then after that they they feed that in as part of the product.
[00:21:27] Speaker: Part seven is search. Right now in open hands, we just let the agent go on a linear search path. So it's just solving the problem once. We're using a good agent that can kind of like recover from errors and try alternative things when things are not working properly, but still we only have a linear search path.
[00:21:45] Speaker: But there's also some nice work in 2024 that is about exploring multiple paths. So one example of this is there's a paper called Tree Search for Language Agents. And they basically expand multiple paths check whether the paths are going well, [00:22:00] and if they aren't going well, you rewind back. And on the web, this is kind of tricky, because, like, how do you rewind when you accidentally ordered something you don't want on Amazon?
[00:22:09] Speaker: It's kind of, you know, not, not the easiest thing to do. For code, it's a little bit easier, because you can just revert any changes that you made. But I, I think that's an interesting topic, too.
[00:22:21] Evaluation and Future Predictions for Agents
[00:22:21] Speaker: And then finally evaluation. So within our development for evaluation, we want to do a number of things. The first one is fast sanity checks.
[00:22:30] Speaker: And in order to do this, we want things we can run really fast, really really cheaply. So for web, we have something called mini world of bits, which is basically these trivial kind of web navigation things. We have something called the Adder Code Editing Benchmark, where it's just about editing individual files that we use.
[00:22:48] Speaker: But we also want highly realistic evaluation. So for the web, we have something called WebArena that we created at CMU. This is web navigation on real real open source websites. So it's open source [00:23:00] websites that are actually used to serve shops or like bulletin boards or other things like this.
[00:23:07] Speaker: And for code, we use Swebench, which I think a lot of people may have heard of. It's basically a coding benchmark that comes from real world pull requests on GitHub. So if you can solve those, you can also probably solve other real world pull requests. I would say we still don't have benchmarks for the fur full versatility of agents.
[00:23:25] Speaker: So, for example We don't have benchmarks that test whether agents can code and do web navigation. But we're working on that and hoping to release something in the next week or two. So if that sounds interesting to you, come talk to me and I, I will tell you more about it.
[00:23:42] Speaker: Cool. So I don't like making predictions, but I was told that I should be somewhat controversial, I guess, so I will, I will try to do it try to do it anyway, although maybe none of these will be very controversial. Um, the first thing is agent oriented LLMs like large language models for [00:24:00] agents.
[00:24:00] Speaker: My, my prediction is every large LM trainer will be focusing on training models as agents. So every large language model will be a better agent model by mid 2025. Competition will increase, prices will go down, smaller models will become competitive as agents. So right now, actually agents are somewhat expensive to run in some cases, but I expect that that won't last six months.
[00:24:23] Speaker: I, I bet we'll have much better agent models in six months. Another thing is instruction following ability, specifically in agentic contexts, will increase. And what that means is we'll have to do less manual engineering of agentic workflows and be able to do more by just prompting agents in more complex ways.
[00:24:44] Speaker: Cloud is already really good at this. It's not perfect, but it's already really, really good. And I expect the other models will catch up to Cloud pretty soon. Error correction ability will increase, less getting stuck in loops. Again, this is something that Cloud's already pretty good at and I expect the others will, will follow.[00:25:00]
[00:25:01] Speaker: Agent benchmarks. Agent benchmarks will start saturating.
[00:25:05] Speaker: And Swebench I think WebArena is already too easy. It, it is, it's not super easy, but it's already a bit too easy because the tasks we do in there are ones that take like two minutes for a human. So not, not too hard. And kind of historically in 2023 our benchmarks were too easy. So we built harder benchmarks like WebArena and Swebench were both built in 2023.
[00:25:31] Future of Agent Development
[00:25:31] Speaker: In 2024, our agents were too bad, so we built agents and now we're building better agents. In 2025, our benchmarks will be too easy, so we'll build better benchmarks, I'm, I'm guessing. So, I would expect to see much more challenging agent benchmarks come out, and we're already seeing some of them.
[00:25:49] Speaker: In 2026, I don't know. I didn't write AGI, but we'll, we'll, we'll see.
[00:25:56] Human-Agent Interaction Challenges
[00:25:56] Speaker: Then the human agent computer interface. I think one thing that [00:26:00] we'll want to think about is what do we do at 75 percent success rate at things that we like actually care about? Right now we have 53 percent or 55 percent on Swebench verified, which is real world GitHub PRs.
[00:26:16] Speaker: My impression is that the actual. Actual ability of models is maybe closer to 30 to 40%. So 30 to 40 percent of the things that I want an agent to solve on my own repos, it just solves without any human intervention. 80 to 90 percent it can solve without me opening an IDE. But I need to give it feedback.
[00:26:36] Speaker: So how do we, how do we make that interaction smooth so that humans can audit? The work of agents that are really, really good, but not perfect is going to be a big challenge.
[00:26:48] Expanding Agent Use Beyond Programming
[00:26:48] Speaker: How can we expose the power of programming agents to other industries? So like as programmers, I think not all of us are using agents every day in our programming, although we probably will be [00:27:00] in in months or maybe a year.
[00:27:02] Speaker: But I, I think it will come very naturally to us as programmers because we know code. We know, you know. Like how to architect software and stuff like that. So I think the question is how do we put this in the hands of like a lawyer or a chemist or somebody else and have them also be able to, you know, interact with it as naturally as we can.
[00:27:25] Redesigning Systems for Agent Efficiency
[00:27:25] Speaker: Another interesting thing is how can we redesign our existing systems for agents? So we had a paper on API based web agents, and basically what we showed is If you take a web agent and the agent interacts not with a website, but with APIs, the accuracy goes way up just because APIs are way easier to interact with.
[00:27:42] Speaker: And in fact, like when I ask the, well, our agent, our agent is able to browse websites, but whenever I want it to interact with GitHub, I tell it do not browse the GitHub website. Use the GitHub API because it's way more successful at doing that. So maybe, you know, every website is going to need to have [00:28:00] an API because we're going to be having agents interact with them.
[00:28:03] Accelerating Progress with Agent Technology
[00:28:03] Speaker: About progress, I think progress will get faster. It's already fast. A lot of people are already overwhelmed, but I think it will continue. The reason why is agents are building agents. And better agents will build better agents faster. So I expect that you know, if you haven't interacted with a coding agent yet, it's pretty magical, like the stuff that it can do.
[00:28:24] Speaker: So yeah.
[00:28:28] Call to Action for Open Source Contributions
[00:28:28] Speaker: And I have a call to action. I'm honestly, like I've been working on, you know, natural language processing and, and Language models for what, 15 years now. And even for me, it's pretty impressive what like AI agents powered by strong language models can do. On the other hand, I believe that we should really make these powerful tools accessible.
[00:28:49] Speaker: And what I mean by this is I don't think like, you know, We, we should have these be opaque or limited to only a set, a certain set of people. I feel like they should be [00:29:00] affordable. They shouldn't be increasing the, you know, difference in the amount of power that people have. If anything, I'd really like them to kind of make it It's possible for people who weren't able to do things before to be able to do them well.
[00:29:13] Speaker: Open source is one way to do that. That's why I'm working on open source. There are other ways to do that. You know, make things cheap, make things you know, so you can serve them to people who aren't able to afford them. Easily, like Duolingo is one example where they get all the people in the US to pay them 20 a month so that they can give all the people in South America free, you know, language education, so they can learn English and become, you know like, and become, you know, More attractive on the job market, for instance.
[00:29:41] Speaker: And so I think we can all think of ways that we can do that sort of thing. And if that resonates with you, please contribute. Of course, I'd be happy if you contribute to OpenHands and use it. But another way you can do that is just use open source solutions, contribute to them, research with them, and train strong open source [00:30:00] models.
[00:30:00] Speaker: So I see, you know, Some people in the room who are already training models. It'd be great if you could train models for coding agents and make them cheap. And yeah yeah, please. I, I was thinking about you among others. So yeah, that's all I have. Thanks.
[00:30:20] Speaker 2: Slight, slightly controversial. Tick is probably the nicest way to say hot ticks. Any hot ticks questions, actual hot ticks?
[00:30:31] Speaker: Oh, I can also show the other agents that were working, if anybody's interested, but yeah, sorry, go ahead.
[00:30:36] Q&A: Agent Performance and Benchmarks
[00:30:36] Speaker 3: Yeah, I have a couple of questions. So they're kind of paired, maybe. The first thing is that you said that You're estimating that your your agent is successfully resolving like something like 30 to 40 percent of your issues, but that's like below what you saw in Swebench.
[00:30:52] Speaker 3: So I guess I'm wondering where that discrepancy is coming from. And then I guess my other second question, which is maybe broader in scope is that [00:31:00] like, if, if you think of an agent as like a junior developer, and I say, go do something, then I expect maybe tomorrow to get a Slack message being like, Hey, I ran into this issue.
[00:31:10] Speaker 3: How can I resolve it? And, and, like you said, your agent is, like, successfully solving, like, 90 percent of issues where you give it direct feedback. So, are you thinking about how to get the agent to reach out to, like, for, for planning when it's, when it's stuck or something like that? Or, like, identify when it runs into a hole like that?
[00:31:30] Speaker: Yeah, so great. These are great questions. Oh,
[00:31:32] Speaker 3: sorry. The third question, which is a good, so this is the first two. And if so, are you going to add a benchmark for that second question?
[00:31:40] Speaker: Okay. Great. Yeah. Great questions. Okay. So the first question was why do I think it's resolving less than 50 percent of the issues on Swebench?
[00:31:48] Speaker: So first Swebench is on popular open source repos, and all of these popular open source repos were included in the training data for all of the language models. And so the language [00:32:00] models already know these repos. In some cases, the language models already know the individual issues in Swebench.
[00:32:06] Speaker: So basically, like, some of the training data has leaked. And so it, it definitely will overestimate with respect to that. I don't think it's like, you know, Horribly, horribly off but I think, you know, it's boosting the accuracy by a little bit. So, maybe that's the biggest reason why. In terms of asking for help, and whether we're benchmarking asking for help yes we are.
[00:32:29] Speaker: So one one thing we're working on now, which we're hoping to put out soon, is we we basically made SuperVig. Sweep edge issues. Like I'm having a, I'm having a problem with the matrix multiply. Please help. Because these are like, if anybody's run a popular open source, like framework, these are what half your issues are.
[00:32:49] Speaker: You're like users show up and say like, my screen doesn't work. What, what's wrong or something. And so then you need to ask them questions and how to reproduce. So yeah, we're, we're, we're working on [00:33:00] that. I think. It, my impression is that agents are not very good at asking for help, even Claude. So like when, when they ask for help, they'll ask for help when they don't need it.
[00:33:11] Speaker: And then won't ask for help when they do need it. So this is definitely like an issue, I think.
[00:33:20] Speaker 4: Thanks for the great talk. I also have two questions.
[00:33:23] Q&A: Web Agents and Interaction Methods
[00:33:23] Speaker 4: It's first one can you talk a bit more about how the web agent interacts with So is there a VLM that looks at the web page layout and then you parse the HTML and select which buttons to click on? And if so do you think there's a future where there's like, so I work at Bing Microsoft AI.
[00:33:41] Speaker 4: Do you think there's a future where the same web index, but there's an agent friendly web index where all the processing is done offline so that you don't need to spend time. Cleaning up, like, cleaning up these TML and figuring out what to click online. And any thoughts on, thoughts on that?
[00:33:57] Speaker: Yeah, so great question. There's a lot of work on web [00:34:00] agents. I didn't go into, like, all of the details, but I think there's There's three main ways that agents interact with websites. The first way is the simplest way and the newest way, but it doesn't work very well, which is you take a screenshot of the website and then you click on a particular pixel value on the website.
[00:34:23] Speaker: And Like models are not very good at that at the moment. Like they'll misclick. There was this thing about how like clawed computer use started like looking at pictures of Yellowstone national park or something like this. I don't know if you heard about this anecdote, but like people were like, oh, it's so human, it's looking for vacation.
[00:34:40] Speaker: And it was like, no, it probably just misclicked on the wrong pixels and accidentally clicked on an ad. So like this is the simplest way. The second simplest way. You take the HTML and you basically identify elements in the HTML. You don't use any vision whatsoever. And then you say, okay, I want to click on this element.
[00:34:59] Speaker: I want to enter text [00:35:00] in this element or something like that. But HTML is too huge. So it actually, it usually gets condensed down into something called an accessibility tree, which was made for screen readers for visually impaired people. And So that's another way. And then the third way is kind of a hybrid where you present the screenshot, but you also present like a textual summary of the output.
[00:35:18] Speaker: And that's the one that I think will probably work best. What we're using is we're just using text at the moment. And that's just an implementation issue that we haven't implemented the. Visual stuff yet, but that's kind of like we're working on it now. Another thing that I should point out is we actually have two modalities for web browsing.
[00:35:35] Speaker: Very recently we implemented this. And the reason why is because if you want to interact with full websites you will need to click on all of the elements or have the ability to click on all of the elements. But most of our work that we need websites for is just web browsing and like gathering information.
[00:35:50] Speaker: So we have another modality where we convert all of it to markdown because that's like way more concise and easier for the agent to deal with. And then [00:36:00] can we create an index specifically for agents, maybe a markdown index or something like that would be, you know, would make sense. Oh, how would I make a successor to Swebench?
[00:36:10] Speaker: So I mean, the first thing is there's like live code bench, which live code bench is basically continuously updating to make sure it doesn't leak into language model training data. That's easy to do for Swebench because it comes from real websites and those real websites are getting new issues all the time.
[00:36:27] Speaker: So you could just do it on the same benchmarks that they have there. There's also like a pretty large number of things covering various coding tasks. So like, for example, Swebunch is mainly fixing issues, but there's also like documentation, there's generating tests that actually test the functionality that you want.
[00:36:47] Speaker: And there there was a paper by a student at CMU on generating tests and stuff like that. So I feel like. Swebench is one piece of the puzzle, but you could also have like 10 different other tasks and then you could have like a composite [00:37:00] benchmark where you test all of these abilities, not just that particular one.
[00:37:04] Speaker: Well, lots, lots of other things too, but
[00:37:11] Speaker 2: Question from across. Use your mic, it will help. Um,
[00:37:15] Speaker 5: Great talk. Thank you.
[00:37:16] Q&A: Agent Architectures and Improvements
[00:37:16] Speaker 5: My question is about your experience designing agent architectures. Specifically how much do you have to separate concerns in terms of tasks specific agents versus having one agent to do three or five things with a gigantic prompt with conditional paths and so on.
[00:37:35] Speaker: Yeah, so that's a great question. So we have a basic coding and browsing agent. And I won't say basic, like it's a good, you know, it's a good agent, but it does coding and browsing. And it has instructions about how to do coding and browsing. That is enough for most things. Especially given a strong language model that has a lot of background knowledge about how to solve different types of tasks and how to use different APIs and stuff like that.
[00:37:58] Speaker: We do have [00:38:00] a mechanism for something called micro agents. And micro agents are basically something that gets added to the prompt when a trigger is triggered. Right now it's very, very rudimentary. It's like if you detect the word GitHub anywhere, you get instructions about how to interact with GitHub, like use the API and don't browse.
[00:38:17] Speaker: Also another one that I just added is for NPM, the like JavaScript package manager. And NPM, when it runs and it hits a failure, it Like hits in interactive terminals where it says, would you like to quit? Yep. Enter yes. And if that does it, it like stalls our agent for the time out until like two minutes.
[00:38:36] Speaker: So like I added a new microagent whenever it started using NPM, it would Like get instructions about how to not use interactive terminal and stuff like that. So that's our current solution. Honestly, I like it a lot. It's simple. It's easy to maintain. It works really well and stuff like that. But I think there is a world where you would want something more complex than that.
[00:38:55] Speaker 5: Got it. Thank you.
[00:38:59] Speaker 6: I got a [00:39:00] question about MCP. I feel like this is the Anthropic Model Context Protocol. It seems like the most successful type of this, like, standardization of interactions between computers and agents. Are you guys adopting it? Is there any other competing standard?
[00:39:16] Speaker 6: Anything, anything thought about it?
[00:39:17] Speaker: Yeah, I think the Anth, so the Anthropic MCP is like, a way to It, it's essentially a collection of APIs that you can use to interact with different things on the internet. I, I think it's not a bad idea, but it, it's like, there's a few things that bug me a little bit about it.
[00:39:40] Speaker: It's like we already have an API for GitHub, so why do we need an MCP for GitHub? Right. You know, like GitHub has an API, the GitHub API is evolving. We can look up the GitHub API documentation. So it seems like kind of duplicated a little bit. And also they have a setting where [00:40:00] it's like you have to spin up a server to serve your GitHub stuff.
[00:40:04] Speaker: And you have to spin up a server to serve your like, you know, other stuff. And so I think it makes, it makes sense if you really care about like separation of concerns and security and like other things like this, but right now we haven't seen, we haven't seen that. To have a lot more value than interacting directly with the tools that are already provided.
[00:40:26] Speaker: And that kind of goes into my general philosophy, which is we're already developing things for programmers. You know,
[00:40:36] Speaker: how is an agent different than from a programmer? And it is different, obviously, you know, like agents are different from programmers, but they're not that different at this point. So we can kind of interact with the interfaces we create for, for programmers. Yeah. I might change my mind later though.
[00:40:51] Speaker: So we'll see.
[00:40:54] Speaker 7: Yeah. Hi. Thanks. Very interesting talk. You were saying that the agents you have right now [00:41:00] solve like maybe 30 percent of your, your issues out of the gate. I'm curious of the things that it doesn't do. Is there like a pattern that you observe? Like, Oh, like these are the sorts of things that it just seems to really struggle with, or is it just seemingly random?
[00:41:15] Speaker: It's definitely not random. It's like, if you think it's more complex than it's. Like, just intuitively, it's more likely to fail. I've gotten a bit better at prompting also, so like, just to give an example it, it will sometimes fail to fix a GitHub workflow because it will not look at the GitHub workflow and understand what the GitHub workflow is doing before it solves the problem.
[00:41:43] Speaker: So I, I think actually probably the biggest thing that it fails at is, um, er, that our, our agent plus Claude fails at is insufficient information gathering before trying to solve the task. And so if you provide all, if you provide instructions that it should do information [00:42:00] gathering beforehand, it tends to do well.
[00:42:01] Speaker: If you don't provide sufficient instructions, it will try to solve the task without, like, fully understanding the task first, and then fail, and then you need to go back and give feedback. You know, additional feedback. Another example, like, I, I love this example. While I was developing the the monitor website that I, I showed here, we hit a really tricky bug where it was writing out a cache file to a different directory than it was reading the cache file from.
[00:42:26] Speaker: And I had no idea what to do. I had no idea what was going on. I, I thought the bug was in a different part of the code, but what I asked it to do was come up with five possible reasons why this could be failing and decreasing order of likelihood and examine all of them. And that worked and it could just go in and like do that.
[00:42:44] Speaker: So like I think a certain level of like scaffolding about like how it should sufficiently Gather all the information that's necessary in order to solve a task is like, if that's missing, then that's probably the biggest failure point at the moment. [00:43:00]
[00:43:01] Speaker 7: Thanks.
[00:43:01] Speaker 6: Yeah.
[00:43:06] Speaker 6: I'm just, I'm just using this as a chance to ask you all my questions.
[00:43:09] Q&A: Self-Improving Agents and Authentication
[00:43:09] Speaker 6: You had a, you had a slide on here about like self improving agents or something like that with memory. It's like a really throwaway slide for like a super powerful idea. It got me thinking about how I would do it. I have no idea how.
[00:43:21] Speaker 6: So I just wanted you to chain a thought more on this.
[00:43:25] Speaker: Yeah, self, self improving. So I think the biggest reason, like the simplest possible way to create a self improving agent. The problem with that is to have a really, really strong language model that with infinite context, and it can just go back and look at like all of its past experiences and, you know, learn from them.
[00:43:46] Speaker: You might also want to remove the bad stuff just so it doesn't over index on it's like failed past experiences. But the problem is a really powerful language model is large. Infinite context is expensive. We don't have a good way to [00:44:00] index into it because like rag, Okay. At least in my experience, RAG from language to code doesn't work super well.
[00:44:08] Speaker: So I think in the end, it's like, that's the way I would like to solve this problem. I'd like to have an infinite context and somehow be able to index into it appropriately. And I think that would mostly solve it. Another thing you can do is fine tuning. So I think like RAG is one way to get information into your model.
[00:44:23] Speaker: Fine tuning is another way to get information into your model. So. That might be another way of continuously improving. Like you identify when you did a good job and then just add all of the good examples into your model.
[00:44:34] Speaker 6: Yeah. So, you know, how like Voyager tries to write code into a skill library and then you reuse as a skill library, right?
[00:44:40] Speaker 6: So that it improves in the sense that it just builds up the skill library over time.
[00:44:44] Speaker: Yep.
[00:44:44] Speaker 6: One thing I was like thinking about and there's this idea of, from, from Devin, your, your arch nemesis of playbooks. I don't know if you've seen them.
[00:44:52] Speaker: Yeah, I mean, we're calling them workflows, but they're simpler.
[00:44:55] Speaker 6: Yeah, so like, basically, like, you should, like, once a workflow works, you can kind of, [00:45:00] like, persist them as a skill library. Yeah. Right? Like I, I feel like that there's a, that's like some in between, like you said, you know, it's hard to do rag between language and code, but I feel like that is ragged for, like, I've done this before, last time I did it, this, this worked.
[00:45:14] Speaker 6: So I'm just going to shortcut. All the stuff that failed before.
[00:45:18] Speaker: Yeah, I totally, I think it's possible. It's just, you know, not, not trivial at the same time. I'll explain the two curves. So basically, the base, the baseline is just an agent that does it from scratch every time. And this curve up here is agent workflow memory where it's like adding the successful experiences back into the prompt.
[00:45:39] Speaker: Why is this improving? The reason why is because just it failed on the first few examples and for the average to catch up it, it took a little bit of time. So it's not like this is actually improving it. You could just basically view the this one is constant and then this one is like improving.
[00:45:56] Speaker: Like this, basically you can see it's continuing to go [00:46:00] up.
[00:46:01] Speaker 8: How do you think we're going to solve the authentication problem for agents right now?
[00:46:05] Speaker: When you say authentication, you mean like credentials, like, yeah.
[00:46:09] Speaker 8: Yeah. Cause I've seen a few like startup solutions today, but it seems like it's limited to the amount of like websites or actual like authentication methods that it's capable of performing today.
[00:46:19] Speaker: Yeah. Great questions. So. My preferred solution to this at the moment is GitHub like fine grained authentication tokens and GitHub fine grained authentication tokens allow you to specify like very free. On a very granular basis on this repo, you have permission to do this, on this repo, you have permission to do this.
[00:46:41] Speaker: You also can prevent people from pushing to the main branch unless they get approved. You can do all of these other things. And I think these were all developed for human developers. Or like, the branch protection rules were developed for human developers. The fine grained authentication tokens were developed for GitHub apps.
[00:46:56] Speaker: I think for GitHub, maybe [00:47:00] just pushing this like a little bit more is the way to do this. For other things, they're totally not prepared to give that sort of fine grained control. Like most APIs don't have something like a fine grained authentication token. And that goes into my like comment that we're going to need to prepare the world for agents, I think.
[00:47:17] Speaker: But I think like the GitHub authentication tokens are like a good template for how you could start doing that maybe, but yeah, I don't, I don't, I don't have an answer.
[00:47:25] Speaker 8: I'll let you know if I find one.
[00:47:26] Speaker: Okay. Yeah.
[00:47:31] Live Demonstration and Closing Remarks
[00:47:31] Speaker: I'm going to finish up. Let, let me just see.
[00:47:37] Speaker: Okay. So this one this one did write a script. I'm not going to actually read it for you. And then the other one, let's see.
[00:47:51] Speaker: Yeah. So it sent a PR, sorry. What is, what is the PR URL?[00:48:00]
[00:48:02] Speaker: So I don't, I don't know if this sorry, that's taking way longer than it should. Okay, cool. Yeah. So this one sent a PR. I'll, I'll tell you later if this actually like successfully Oh, no, it's deployed on Vercel, so I can actually show you, but let's, let me try this real quick. Sorry. I know I don't have time.
[00:48:24] Speaker: Yeah, there you go. I have pie charts now. So it's so fun. It's so fun to play with these things. Cause you could just do that while I'm giving a, you know, talk and things like that. So, yeah, thanks.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
2024 in Synthetic Data and Smol Models [LS Live @ NeurIPS]
24 dec· Latent Space: The AI Engineer Podcast
Happy holidays! We’ll be sharing snippets from Latent Space LIVE! through the break bringing you the best of 2024! We want to express our deepest appreciation to event sponsors AWS, Daylight Computer, Thoth.ai, StrongCompute, Notable Capital, and most of all all our LS supporters who helped fund the gorgeous venue and A/V production!
For NeurIPS last year we did our standard conference podcast coverage interviewing selected papers (that we have now also done for ICLR and ICML), however we felt that we could be doing more to help AI Engineers 1) get more industry-relevant content, and 2) recap 2024 year in review from experts. As a result, we organized the first Latent Space LIVE!, our first in person miniconference, at NeurIPS 2024 in Vancouver.
Today, we’re proud to share Loubna’s highly anticipated talk (slides here)!
Synthetic Data
We called out the Synthetic Data debate at last year’s NeurIPS, and no surprise that 2024 was dominated by the rise of synthetic data everywhere:
* Apple’s Rephrasing the Web, Microsoft’s Phi 2-4 and Orca/AgentInstruct, Tencent’s Billion Persona dataset, DCLM, and HuggingFace’s FineWeb-Edu, and Loubna’s own Cosmopedia extended the ideas of synthetic textbook and agent generation to improve raw web scrape dataset quality
* This year we also talked to the IDEFICS/OBELICS team at HuggingFace who released WebSight this year, the first work on code-vs-images synthetic data.
* We called Llama 3.1 the Synthetic Data Model for its extensive use (and documentation!) of synthetic data in its pipeline, as well as its permissive license.
* Nemotron CC and Nemotron-4-340B also made a big splash this year for how they used 20k items of human data to synthesize over 98% of the data used for SFT/PFT.
* Cohere introduced Multilingual Arbitrage: Optimizing Data Pools to Accelerate Multilingual Progress observing gains of up to 56.5% improvement in win rates comparing multiple teachers vs the single best teacher model
* In post training, AI2’s Tülu3 (discussed by Luca in our Open Models talk) and Loubna’s Smol Talk were also notable open releases this year.
This comes in the face of a lot of scrutiny and criticism, with Scale AI as one of the leading voices publishing AI models collapse when trained on recursively generated data in Nature magazine bringing mainstream concerns to the potential downsides of poor quality syndata:
Part of the concerns we highlighted last year on low-background tokens are coming to bear: ChatGPT contaminated data is spiking in every possible metric:
But perhaps, if Sakana’s AI Scientist pans out this year, we will have mostly-AI AI researchers publishing AI research anyway so do we really care as long as the ideas can be verified to be correct?
Smol Models
Meta surprised many folks this year by not just aggressively updating Llama 3 and adding multimodality, but also adding a new series of “small” 1B and 3B “on device” models this year, even working on quantized numerics collaborations with Qualcomm, Mediatek, and Arm. It is near unbelievable that a 1B model today can qualitatively match a 13B model of last year:
and the minimum size to hit a given MMLU bar has come down roughly 10x in the last year. We have been tracking this proxied by Lmsys Elo and inference price:
The key reads this year are:
* MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
* Apple Intelligence Foundation Language Models
* Hymba: A Hybrid-head Architecture for Small Language Models
* Loubna’s SmolLM and SmolLM2: a family of state-of-the-art small models with 135M, 360M, and 1.7B parameters on the pareto efficiency frontier.
* and Moondream, which we already covered in the 2024 in Vision talk
Full Talk on YouTube
please like and subscribe!
Timestamps
* [00:00:05] Loubna Intro
* [00:00:33] The Rise of Synthetic Data Everywhere
* [00:02:57] Model Collapse
* [00:05:14] Phi, FineWeb, Cosmopedia - Synthetic Textbooks
* [00:12:36] DCLM, Nemotron-CC
* [00:13:28] Post Training - AI2 Tulu, Smol Talk, Cohere Multilingual Arbitrage
* [00:16:17] Smol Models
* [00:18:24] On Device Models
* [00:22:45] Smol Vision Models
* [00:25:14] What's Next
Transcript
2024 in Synthetic Data and Smol Models
[00:00:00]
[00:00:05] Loubna Intro
[00:00:05] Speaker: I'm very happy to be here. Thank you for the invitation. So I'm going to be talking about synthetic data in 2024. And then I'm going to be talking about small on device models. So I think the most interesting thing about synthetic data this year is that like now we have it everywhere in the large language models pipeline.
[00:00:33] The Rise of Synthetic Data Everywhere
[00:00:33] Speaker: I think initially, synthetic data was mainly used just for post training, because naturally that's the part where we needed human annotators. And then after that, we realized that we don't really have good benchmarks to [00:01:00] measure if models follow instructions well, if they are creative enough, or if they are chatty enough, so we also started using LLMs as judges.
[00:01:08] Speaker: Thank you. And I think this year and towards the end of last year, we also went to the pre training parts and we started generating synthetic data for pre training to kind of replace some parts of the web. And the motivation behind that is that you have a lot of control over synthetic data. You can control your prompt and basically also the kind of data that you generate.
[00:01:28] Speaker: So instead of just trying to filter the web, you could try to get the LLM to generate what you think the best web pages could look like and then train your models on that. So this is how we went from not having synthetic data at all in the LLM pipeline to having it everywhere. And so the cool thing is like today you can train an LLM with like an entirely synthetic pipeline.
[00:01:49] Speaker: For example, you can use our Cosmopedia datasets and you can train a 1B model on like 150 billion tokens that are 100 percent synthetic. And those are also of good quality. And then you can [00:02:00] instruction tune the model on a synthetic SFT dataset. You can also do DPO on a synthetic dataset. And then to evaluate if the model is good, you can use.
[00:02:07] Speaker: A benchmark that uses LLMs as a judge, for example, MTBench or AlpacaEvil. So I think this is like a really mind blowing because like just a few years ago, we wouldn't think this is possible. And I think there's a lot of concerns about model collapse, and I'm going to talk about that later. But we'll see that like, if we use synthetic data properly and we curate it carefully, that shouldn't happen.
[00:02:29] Speaker: And the reason synthetic data is very popular right now is that we have really strong models, both open and closed. It is really cheap and fast to use compared to human annotations, which cost a lot and take a lot of time. And also for open models right now, we have some really good inference frameworks.
[00:02:47] Speaker: So if you have enough GPUs, it's really easy to spawn these GPUs and generate like a lot of synthetic data. Some examples are VLM, TGI, and TensorRT.
[00:02:57] Model Collapse
[00:02:57] Speaker: Now let's talk about the elephant in the room, model [00:03:00] collapse. Is this the end? If you look at the media and all of like, for example, some papers in nature, it's really scary because there's a lot of synthetic data out there in the web.
[00:03:09] Speaker: And naturally we train on the web. So we're going to be training a lot of synthetic data. And if model collapse is going to happen, we should really try to take that seriously. And the other issue is that, as I said, we think, a lot of people think the web is polluted because there's a lot of synthetic data.
[00:03:24] Speaker: And for example, when we're building fine web datasets here at Guillerm and Hinek, we're interested in like, how much synthetic data is there in the web? So there isn't really a method to properly measure the amount of synthetic data or to save a webpage synthetic or not. But one thing we can do is to try to look for like proxy words, for example, expressions like as a large language model or words like delve that we know are actually generated by chat GPT.
[00:03:49] Speaker: We could try to measure the amount of these words in our data system and compare them to the previous years. For example, here, we measured like a, these words ratio in different dumps of common crawl. [00:04:00] And we can see that like the ratio really increased after chat GPT's release. So if we were to say that synthetic data amount didn't change, you would expect this ratio to stay constant, which is not the case.
[00:04:11] Speaker: So there's a lot of synthetic data probably on the web, but does this really make models worse? So what we did is we trained different models on these different dumps. And we then computed their performance on popular, like, NLP benchmarks, and then we computed the aggregated score. And surprisingly, you can see that the latest DOMs are actually even better than the DOMs that are before.
[00:04:31] Speaker: So if there's some synthetic data there, at least it did not make the model's worse. Yeah, which is really encouraging. So personally, I wouldn't say the web is positive with Synthetic Data. Maybe it's even making it more rich. And the issue with like model collapse is that, for example, those studies, they were done at like a small scale, and you would ask the model to complete, for example, a Wikipedia paragraph, and then you would train it on these new generations, and you would do that every day.
[00:04:56] Speaker: iteratively. I think if you do that approach, it's normal to [00:05:00] observe this kind of behavior because the quality is going to be worse because the model is already small. And then if you train it just on its generations, you shouldn't expect it to become better. But what we're really doing here is that we take a model that is very large and we try to distill its knowledge into a model that is smaller.
[00:05:14] Phi, FineWeb, Cosmopedia - Synthetic Textbooks
[00:05:14] Speaker: And in this way, you can expect to get like a better performance for your small model. And using synthetic data for pre-training has become really popular. After the textbooks are all you need papers where Microsoft basically trained a series of small models on textbooks that were using a large LLM.
[00:05:32] Speaker: And then they found that these models were actually better than models that are much larger. So this was really interesting. It was like first of its time, but it was also met with a lot of skepticism, which is a good thing in research. It pushes you to question things because the dataset that they trained on was not public, so people were not really sure if these models are really good or maybe there's just some data contamination.
[00:05:55] Speaker: So it was really hard to check if you just have the weights of the models. [00:06:00] And as Hugging Face, because we like open source, we tried to reproduce what they did. So this is our Cosmopedia dataset. We basically tried to follow a similar approach to what they documented in the paper. And we created a synthetic dataset of textbooks and blog posts and stories that had almost 30 billion tokens.
[00:06:16] Speaker: And we tried to train some models on that. And we found that like the key ingredient to getting a good data set that is synthetic is trying as much as possible to keep it diverse. Because if you just throw the same prompts as your model, like generate like a textbook about linear algebra, and even if you change the temperature, the textbooks are going to look alike.
[00:06:35] Speaker: So there's no way you could scale to like millions of samples. And the way you do that is by creating prompts that have some seeds that make them diverse. In our case, the prompt, we would ask the model to generate a textbook, but make it related to an extract from a webpage. And also we try to frame it within, to stay within topic.
[00:06:55] Speaker: For example, here, we put like an extract about cardiovascular bioimaging, [00:07:00] and then we ask the model to generate a textbook related to medicine that is also related to this webpage. And this is a really nice approach because there's so many webpages out there. So you can. Be sure that your generation is not going to be diverse when you change the seed example.
[00:07:16] Speaker: One thing that's challenging with this is that you want the seed samples to be related to your topics. So we use like a search tool to try to go all of fine web datasets. And then we also do a lot of experiments with the type of generations we want the model to generate. For example, we ask it for textbooks for middle school students or textbook for college.
[00:07:40] Speaker: And we found that like some generation styles help on some specific benchmarks, while others help on other benchmarks. For example, college textbooks are really good for MMLU, while middle school textbooks are good for benchmarks like OpenBookQA and Pico. This is like a sample from like our search tool.
[00:07:56] Speaker: For example, you have a top category, which is a topic, and then you have some [00:08:00] subtopics, and then you have the topic hits, which are basically the web pages in fine web does belong to these topics. And here you can see the comparison between Cosmopedia. We had two versions V1 and V2 in blue and red, and you can see the comparison to fine web, and as you can see throughout the training training on Cosmopedia was consistently better.
[00:08:20] Speaker: So we managed to get a data set that was actually good to train these models on. It's of course so much smaller than FineWeb, it's only 30 billion tokens, but that's the scale that Microsoft data sets was, so we kind of managed to reproduce a bit what they did. And the data set is public, so everyone can go there, check if everything is all right.
[00:08:38] Speaker: And now this is a recent paper from NVIDIA, Neumatron CC. They took things a bit further, and they generated not a few billion tokens, but 1. 9 trillion tokens, which is huge. And we can see later how they did that. It's more of, like, rephrasing the web. So we can see today that there's, like, some really huge synthetic datasets out there, and they're public, so, [00:09:00] like, you can try to filter them even further if you want to get, like, more high quality corpses.
[00:09:04] Speaker: So for this, rephrasing the web this approach was suggested in this paper by Pratyush, where basically in this paper, they take some samples from C4 datasets, and then they use an LLM to rewrite these samples into a better format. For example, they ask an LLM to rewrite the sample into a Wikipedia passage or into a Q& A page.
[00:09:25] Speaker: And the interesting thing in this approach is that you can use a model that is Small because it doesn't, rewriting doesn't require knowledge. It's just rewriting a page into a different style. So the model doesn't need to have like knowledge that is like extensive of what is rewriting compared to just asking a model to generate a new textbook and not giving it like ground truth.
[00:09:45] Speaker: So here they rewrite some samples from C4 into Q& A, into Wikipedia, and they find that doing this works better than training just on C4. And so what they did in Nemo Trans CC is a similar approach. [00:10:00] They rewrite some pages from Common Crawl for two reasons. One is to, like improve Pages that are low quality, so they rewrite them into, for example, Wikipedia page, so they look better.
[00:10:11] Speaker: And another reason is to create more diverse datasets. So they have a dataset that they already heavily filtered, and then they take these pages that are already high quality, and they ask the model to rewrite them in Question and Answer format. into like open ended questions or like multi choice questions.
[00:10:27] Speaker: So this way they can reuse the same page multiple times without fearing like having multiple duplicates, because it's the same information, but it's going to be written differently. So I think that's also a really interesting approach for like generating synthetic data just by rephrasing the pages that you already have.
[00:10:44] Speaker: There's also this approach called Prox where they try to start from a web page and then they generate a program which finds how to write that page to make it better and less noisy. For example, here you can see that there's some leftover metadata in the web page and you don't necessarily want to keep that for training [00:11:00] your model.
[00:11:00] Speaker: So So they train a model that can generate programs that can like normalize and remove lines that are extra. So I think this approach is also interesting, but it's maybe less scalable than the approaches that I presented before. So that was it for like rephrasing and generating new textbooks.
[00:11:17] Speaker: Another approach that I think is really good and becoming really popular for using synthetic data for pre training is basically building a better classifiers. For filtering the web for example, here we release the data sets called fine web edu. And the way we built it is by taking Llama3 and asking it to rate the educational content of web pages from zero to five.
[00:11:39] Speaker: So for example, if a page is like a really good textbook that could be useful in a school setting, it would get a really high score. And if a page is just like an advertisement or promotional material, it would get a lower score. And then after that, we take these synthetic annotations and we train a classifier on them.
[00:11:57] Speaker: It's a classifier like a BERT model. [00:12:00] And then we run this classifier on all of FineWeb, which is a 15 trillion tokens dataset. And then we only keep the pages that have like a score that's higher than 3. So for example, in our case, we went from 15 trillion tokens to 3. to just 1. 5 trillion tokens. Those are really highly educational.
[00:12:16] Speaker: And as you can see here, a fine web EDU outperforms all the other public web datasets by a larger margin on a couple of benchmarks here, I show the aggregated score and you can see that this approach is really effective for filtering web datasets to get like better corpuses for training your LLMs.
[00:12:36] DCLM, Nemotron-CC
[00:12:36] Speaker: Others also try to do this approach. There's, for example, the DCLM datasets where they also train the classifier, but not to detect educational content. Instead, they trained it on OpenHermes dataset, which is a dataset for instruction tuning. And also they explain like IAM5 subreddits, and then they also get really high quality dataset which is like very information dense and can help [00:13:00] you train some really good LLMs.
[00:13:01] Speaker: And then Nemotron Common Crawl, they also did this approach, but instead of using one classifier, they used an ensemble of classifiers. So they used, for example, the DCLM classifier, and also classifiers like the ones we used in FineWebEducational, and then they combined these two. Scores into a, with an ensemble method to only retain the best high quality pages, and they get a data set that works even better than the ones we develop.
[00:13:25] Speaker: So that was it for like synthetic data for pre-training.
[00:13:28] Post Training - AI2 Tulu, Smol Talk, Cohere Multilingual Arbitrage
[00:13:28] Speaker: Now we can go back to post training. I think there's a lot of interesting post training data sets out there. One that was released recently, the agent instructs by Microsoft where they basically try to target some specific skills. And improve the performance of models on them.
[00:13:43] Speaker: For example, here, you can see code, brain teasers, open domain QA, and they managed to get a dataset that outperforms that's when fine tuning Mistral 7b on it, it outperforms the original instruct model that was released by Mistral. And as I said, to get good synthetic data, you really [00:14:00] have to have a framework to make sure that your data is diverse.
[00:14:03] Speaker: So for example, for them, they always. And then they see the generations on either source code or raw text documents, and then they rewrite them to make sure they're easier to generate instructions from, and then they use that for their like instruction data generation. There's also the Tool3SFT mixture, which was released recently by Allen AI.
[00:14:23] Speaker: It's also really good quality and it covers a wide range of tasks. And the way they make sure that this dataset is diverse is by using personas from the persona hub datasets. Which is basically a data set of like I think over a million personas. And for example, in the tool mixture to generate like a new code snippet, they would give like the model persona, for example, a machine learning researcher interested in neural networks, and then ask it to generate like a coding problem.
[00:14:49] Speaker: This way you make sure that your data set is really diverse, and then you can further filter the data sets, for example, using the reward models. We also released a dataset called Smalltalk, [00:15:00] and we also tried to cover the wide range of tasks, and as you can see here, for example, when fine tuning Mistral 7b on the dataset, we also outperformed the original Mistral instructs on a number of benchmarks, notably on mathematics and instruction following with ifevil.
[00:15:18] Speaker: Another paper that's really interesting I wanted to mention is this one called Multilingual Data Arbitrage by Cohere. And basically they want to generate a data set for post training that is multilingual. And they have a really interesting problem. It's the fact that there isn't like one model that's really good at all the languages they wanted.
[00:15:36] Speaker: So what they do is that like they use not just one teacher model, but multiple teachers. And then they have a router which basically sends the prompts they have to all these models. And then they get the completions and they have a reward model that traces all these generations and only keeps the best one.
[00:15:52] Speaker: And this is like arbitrage and finance. So well, I think what's interesting in this, it shows that like synthetic data, it doesn't have to come from a single model. [00:16:00] And because we have so many good models now, you could like pull these models together and get like a dataset that's really high quality and that's diverse and that's covers all your needs.
[00:16:12] Speaker: I was supposed to put a meme there, but. Yeah, so that was it for like a synthetic data.
[00:16:17] Smol Models
[00:16:17] Speaker: Now we can go to see what's happening in the small models field in 2024. I don't know if you know, but like now we have some really good small models. For example, Lama 3. 2 1B is. It matches Lama 2. 13b from, that was released last year on the LMSYS arena, which is basically the default go to leaderboard for evaluating models using human evaluation.
[00:16:39] Speaker: And as you can see here, the scores of the models are really close. So I think we've made like hugely forward in terms of small models. Of course, that's one, just one data point, but there's more. For example, if you look at this chart from the Quint 2. 5 blog post, it shows that today we have some really good models that are only like 3 billion parameters [00:17:00] and 4 billion that score really high on MMLU.
[00:17:03] Speaker: Which is a really popular benchmark for evaluating models. And you can see here that the red, the blue dots have more than 65 on MMLU. And the grey ones have less. And for example, Llama33b had less. So now we have a 3b model that outperforms a 33b model that was released earlier. So I think now people are starting to realize that like, we shouldn't just scale and scale models, but we should try to make them more efficient.
[00:17:33] Speaker: I don't know if you knew, but you can also chat with a 3B plus model on your iPhone. For example, here, this is an app called PocketPal, where you can go and select a model from Hugging Face. It has a large choice. For example, here we loaded the 5. 3. 5, which is 3. 8 billion parameters on this iPhone. And we can chat with this and you can see that even the latency is also acceptable.
[00:17:57] Speaker: For example, here, I asked it to give me a joke about [00:18:00] NeurIPS. So let's see what it has to say.
[00:18:06] Speaker: Okay, why did the neural network attend NeurIPS? Because it heard there would be a lot of layers and fun and it wanted to train its sense of humor. So not very funny, but at least it can run on device. Yeah, so I think now we have good small models, but we also have like good frameworks and tools to use these small models.
[00:18:24] On Device Models
[00:18:24] Speaker: So I think we're really close to having like really on edge and on device models that are really good. And I think for a while we've had this narrative. But just training larger models is better. Of course, this is supported by science scaling laws. As you can see here, for example, when we scale the model size, the loss is lower and obviously you get a better model.
[00:18:46] Speaker: But and we can see this, for example, in the GPT family of models, how we went from just a hundred million parameters to more than a trillion. parameters. And of course, we all observed the performance improvement when using the latest model. But [00:19:00] one thing that we shouldn't forget is that when we scale the model, we also scale the inference costs and time.
[00:19:05] Speaker: And so the largest models were are going to cost so much more. So I think now instead of just building larger models, we should be focusing on building more efficient models. It's no longer a race for the largest models since these models are really expensive to run and they require like a really good infrastructure to do that and they cannot run on, for example, consumer hardware.
[00:19:27] Speaker: And when you try to build more efficient models that match larger models, that's when you can really unlock some really interesting on device use cases. And I think a trend that we're noticing now is the trend of training smaller models longer. For example, if you compare how much, how long LLAMA was trained compared to LLAMA3, there is a huge increase in the pre training length.
[00:19:50] Speaker: LLAMA was trained on 1 trillion tokens, but LLAMA3 8b was trained on 15 trillion tokens. So Meta managed to get a model that's the same size, but But it performs so much [00:20:00] better by choosing to like spend the sacrifice during training, because as we know, training is a one time cost, but inference is something that's ongoing.
[00:20:08] Speaker: If we want to see what are like the small models reads in 2024, I think this mobile LLM paper by Meta is interesting. They try to study different models that are like have the less than 1 billion parameters and find which architecture makes most sense for these models. For example, they find that depth is more important than width.
[00:20:29] Speaker: So it's more important to have models that have like more layers than just one. making them more wide. They also find that GQA helps, that tying the embedding helps. So I think it's a nice study overall for models that are just a few hundred million parameters. There's also the Apple intelligence tech report, which is interesting.
[00:20:48] Speaker: So for Apple intelligence, they had two models, one that was like on server and another model that was on device. It had 3 billion parameters. And I think the interesting part is that they trained this model using [00:21:00] pruning. And then distillation. And for example, they have this table where they show that, like, using pruning and distillation works much better than training from scratch.
[00:21:08] Speaker: And they also have some interesting insights about, like, how they specialize their models on specific tasks, like, for example, summarization and rewriting. There's also this paper by NVIDIA that was released recently. I think you've already had a talk about, like, hybrid models that was all interesting.
[00:21:23] Speaker: And this model, they used, like, a hybrid architecture between state space models and transformers. And they managed to train a 1B model that's really performant without needing to train it on a lot of tokens. And regarding our work, we just recently released SmallM2, so it's a series of three models, which are the best in class in each model size.
[00:21:46] Speaker: For example, our 1. 7b model outperforms Lama 1b and also Qt 2. 5. And how we managed to train this model is the following. That's where you spent a lot of time trying to curate the pre training datasets. We did a lot of [00:22:00] ablations, trying to find which datasets are good and also how to mix them. We also created some new math and code datasets that we're releasing soon.
[00:22:08] Speaker: But you basically really spent a lot of time trying to find what's the best mixture that you can train these models on. And then we spent some time trying to like we also trained these models for very long. For example, small M1 was trained only on 1 trillion tokens, but this model is trained on 11 trillion tokens.
[00:22:24] Speaker: And we saw that the performance kept improving. The models didn't really plateau mid training, which I think is really interesting. It shows that you can train such small models for very long and keep getting performance gains. What's interesting about SmallLM2 is that it's fully open. We also released, like the pre training code base, the fine tuning code, the datasets, and also evaluation in this repository.
[00:22:45] Smol Vision Models
[00:22:45] Speaker: Also there's, like, really interesting small models for text, but also for vision. For example, here you can see SmallVLM, which is a 2B model that's really efficient. It doesn't consume a lot of RAM, and it also has a good performance. There's also Moondream 0. [00:23:00] 5b, which was released recently. It's like the smallest visual language model.
[00:23:04] Speaker: And as you can see, there isn't like a big trade off compared to Moondream 2b. So now I showed you that we have some really good small models. We also have the tools to use them, but why should you consider using small models and when? I think, like, small models are really interesting because of the on device feature.
[00:23:23] Speaker: Because these models are small and they can run fast, you can basically run them on your laptop, but also on your mobile phone. And this means that your dataset stays locally. You don't have to send your queries to third parties. And this really enhances privacy. That was, for example, one of the big selling points for Apple Intelligence.
[00:23:42] Speaker: Also, right now, we really have a lot of work to do. So many frameworks to do on device inference. For example, there's MLX, MLC, Llama, CPP, Transformers, JS. So we have a lot of options and each of them have like great features. So you have so many options for doing that. Small models are also really powerful if you choose to specialize them.[00:24:00]
[00:24:00] Speaker: For example, here there's a startup called Numind, which took small LM and then they fine tuned it on text extraction datasets. And they managed to get a model that's not very far from models that are much larger. So I think text extraction is like one use case where small models can be really performant and it makes sense to use them instead of just using larger models.
[00:24:19] Speaker: You can also chat with these models in browser. For example, here, you can go there, you can load the model, you can even turn off your internet and just start chatting with the model locally. Speaking of text extraction, if you don't want to fine tune the models, there's a really good method of structure generation.
[00:24:36] Speaker: We can basically force the models to follow a JSON schema that you defined. For example, here, we try to force the model to follow a schema for extracting key information from GitHub issues. So you can input free text, which is a complaint about a GitHub repository, something not working. And then you can run it there and the model can extract anything that is relevant for your GitHub issue creation.
[00:24:58] Speaker: For example, the [00:25:00] priority, for example, here, priority is high, the type of the issue bug, and then a title and the estimation of how long this will take to fix. And you can just like do this in the browser, you can transform your text into a GitHub issue that's properly formatted.
[00:25:14] What's Next
[00:25:14] Speaker: So what's next for synthetic data and small models?
[00:25:18] Speaker: I think that domain specific synthetic data is going to be, it's already important, it's going to be even more important. For example, generating synthetic data for math. I think this really would help improve the reasoning of a lot of models. And a lot of people are doing it, for example, Quint 2. 12 math, everyone's trying to reproduce a one.
[00:25:37] Speaker: And so I think for synthetic data, trying to specialize it on some domains is going to be really important. And then for small models, I think specializing them through fine tuning, it's also going to be really important because I think a lot of companies are just trying to use these large models because they are better.
[00:25:53] Speaker: But on some tasks, I think you can already get decent performance with small models. So you don't need to Pay like a [00:26:00] cost that's much larger just to make your model better at your task by a few percent. And this is not just for text. And I think it also applies for other modalities like vision and audio.
[00:26:11] Speaker: And I think you should also watch out for on device frameworks and applications. For example, like the app I showed, or lama, all these frameworks are becoming really popular and I'm pretty sure that we're gonna get like more of them in 2025. And users really like that. Maybe for other, I should also say hot take.
[00:26:28] Speaker: I think that like in AI, we just started like with fine tuning, for example, trying to make BERT work on some specific use cases, and really struggling to do that. And then we had some models that are much larger. So we just switched to like prompt engineering to get the models And I think we're going back to fine tuning where we realize these models are really costly.
[00:26:47] Speaker: It's better to use just a small model or try to specialize it. So I think it's a little bit of a cycle and we're going to start to see like more fine tuning and less of just like a prompt engineering the models. So that was my talk. Thank you for following. And if you have [00:27:00] any questions, we can take them now.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
Zijn er afleveringen die ontbreken?

Klik hier om de feed te vernieuwen.
2024 in Post-Transformers Architectures (State Space Models, RWKV) [LS Live @ NeurIPS]
24 dec· Latent Space: The AI Engineer Podcast
Happy holidays! We’ll be sharing snippets from Latent Space LIVE! through the break bringing you the best of 2024! We want to express our deepest appreciation to event sponsors AWS, Daylight Computer, Thoth.ai, StrongCompute, Notable Capital, and most of all all our LS supporters who helped fund the gorgeous venue and A/V production!
For NeurIPS last year we did our standard conference podcast coverage interviewing selected papers (that we have now also done for ICLR and ICML), however we felt that we could be doing more to help AI Engineers 1) get more industry-relevant content, and 2) recap 2024 year in review from experts. As a result, we organized the first Latent Space LIVE!, our first in person miniconference, at NeurIPS 2024 in Vancouver.
Of perennial interest, particularly at academic conferences, is scaled-up architecture research as people hunt for the next Attention Is All You Need. We have many names for them: “efficient models”, “retentive networks”, “subquadratic attention” or “linear attention” but some of them don’t even have any lineage with attention - one of the best papers of this NeurIPS was Sepp Hochreiter’s xLSTM, which has a particularly poetic significance as one of the creators of the LSTM returning to update and challenge the OG language model architecture:
So, for lack of a better term, we decided to call this segment “the State of Post-Transformers” and fortunately everyone rolled with it.
We are fortunate to have two powerful friends of the pod to give us an update here:
* Together AI: with CEO Vipul Ved Prakash and CTO Ce Zhang joining us to talk about how they are building Together together as a quote unquote full stack AI startup, from the lowest level kernel and systems programming to the highest level mathematical abstractions driving new model architectures and inference algorithms, with notable industry contributions from RedPajama v2, Flash Attention 3, Mamba 2, Mixture of Agents, BASED, Sequoia, Evo, Dragonfly, Dan Fu's ThunderKittens and many more research projects this year
* Recursal AI: with CEO Eugene Cheah who has helped lead the independent RWKV project while also running Featherless AI. This year, the team has shipped RWKV v5, codenamed Eagle, to 1.5 billion Windows 10 and Windows 11 machines worldwide, to support Microsoft's on-device, energy-usage-sensitive Windows Copilot usecases, and has launched the first updates on RWKV v6, codenamed Finch and GoldFinch. On the morning of Latent Space Live, they also announced QRWKV6, a Qwen 32B model modified with RWKV linear attention layers.
We were looking to host a debate between our speakers, but given that both of them were working on post-transformers alternatives
Full Talk on Youtube
Please like and subscribe!
Links
All the models and papers they picked:
* Earlier Cited Work
* Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
* Hungry hungry hippos: Towards language modeling with state space models
* Hyena hierarchy: Towards larger convolutional language models
* Mamba: Linear-Time Sequence Modeling with Selective State Spaces
* S4: Efficiently Modeling Long Sequences with Structured State Spaces
* Just Read Twice (Arora et al)
* Recurrent large language models that compete with Transformers in language modeling perplexity are emerging at a rapid rate (e.g., Mamba, RWKV). Excitingly, these architectures use a constant amount of memory during inference. However, due to the limited memory, recurrent LMs cannot recall and use all the information in long contexts leading to brittle in-context learning (ICL) quality. A key challenge for efficient LMs is selecting what information to store versus discard. In this work, we observe the order in which information is shown to the LM impacts the selection difficulty.
* To formalize this, we show that the hardness of information recall reduces to the hardness of a problem called set disjointness (SD), a quintessential problem in communication complexity that requires a streaming algorithm (e.g., recurrent model) to decide whether inputted sets are disjoint. We empirically and theoretically show that the recurrent memory required to solve SD changes with set order, i.e., whether the smaller set appears first in-context.
* Our analysis suggests, to mitigate the reliance on data order, we can put information in the right order in-context or process prompts non-causally. Towards that end, we propose: (1) JRT-Prompt, where context gets repeated multiple times in the prompt, effectively showing the model all data orders. This gives 11.0±1.3 points of improvement, averaged across 16 recurrent LMs and the 6 ICL tasks, with 11.9× higher throughput than FlashAttention-2 for generation prefill (length 32k, batch size 16, NVidia H100). We then propose (2) JRT-RNN, which uses non-causal prefix-linear-attention to process prompts and provides 99% of Transformer quality at 360M params., 30B tokens and 96% at 1.3B params., 50B tokens on average across the tasks, with 19.2× higher throughput for prefill than FA2.
* Jamba: A 52B Hybrid Transformer-Mamba Language Model
* We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture.
* Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable.
* This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU.
* Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length.
* We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.
* SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
* We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include:
* (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens.
* (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality.
* (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment.
* (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence.
* As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost.
* RWKV: Reinventing RNNs for the Transformer Era
* Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability.
* We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
* Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintains constant computational and memory complexity during inference.
* We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.
* LoLCATs: On Low-Rank Linearizing of Large Language Models
* Recent works show we can linearize large language models (LLMs) -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However, linearizing LLMs often significantly degrades model quality, still requires training over billions of tokens, and remains limited to smaller 1.3B to 7B LLMs.
* We thus propose Low-rank Linear Conversion via Attention Transfer (LoLCATs), a simple two-step method that improves LLM linearizing quality with orders of magnitudes less memory and compute.
* We base these steps on two findings.
* First, we can replace an LLM's softmax attentions with closely-approximating linear attentions, simply by training the linear attentions to match their softmax counterparts with an output MSE loss ("attention transfer").
* Then, this enables adjusting for approximation errors and recovering LLM quality simply with low-rank adaptation (LoRA).
* LoLCATs significantly improves linearizing quality, training efficiency, and scalability. We significantly reduce the linearizing quality gap and produce state-of-the-art subquadratic LLMs from Llama 3 8B and Mistral 7B v0.1, leading to 20+ points of improvement on 5-shot MMLU.
* Furthermore, LoLCATs does so with only 0.2% of past methods' model parameters and 0.4% of their training tokens.
* Finally, we apply LoLCATs to create the first linearized 70B and 405B LLMs (50x larger than prior work).
* When compared with prior approaches under the same compute budgets, LoLCATs significantly improves linearizing quality, closing the gap between linearized and original Llama 3.1 70B and 405B LLMs by 77.8% and 78.1% on 5-shot MMLU.
Timestamps
* [00:02:27] Intros
* [00:03:16] Why Scale Context Lengths? or work on Efficient Models
* [00:06:07] The Story of SSMs
* [00:09:33] Idea 1: Approximation -> Principled Modeling
* [00:12:14] Idea 3: Selection
* [00:15:07] Just Read Twice
* [00:16:51] Idea 4: Test Time Compute
* [00:17:32] Idea 2: Hardware & Kernel Support
* [00:19:49] RWKV vs SSMs
* [00:24:24] RWKV Arch
* [00:26:15] QWRKWv6 launch
* [00:30:00] What's next
* [00:33:21] Hot Takes - does anyone really need long context?
Transcript
[00:00:00] AI Charlie: We're back at Latent Space Live, our first mini conference held at NeurIPS 2024 in Vancouver. This is Charlie, your AI co host. As a special treat this week, we're recapping the best of 2024 going domain by domain. We sent out a survey to the over 900 of you who told us what you wanted, and then invited the best speakers in the Latent Space Network to cover each field.
[00:00:24] AI Charlie: 200 of you joined us in person throughout the day, with over 2200 watching live online. Thanks Our next keynote covers the State of Transformers alternative architectures, with a special joint presentation with Dan Fu of Together AI and Eugene Chia of Recursal AI and Featherless AI. We've featured both Together and Recursal on the pod before, with CEO Veepal Vedprakash introducing them.
[00:00:49] AI Charlie: And CTO CE Zhang joining us to talk about how they are building together together as a quote unquote full stack AI startup from the lowest level kernel and systems [00:01:00] programming to the highest level mathematical abstractions driving new model architectures and inference algorithms with notable industry contributions from Red Pajama V2, Flash Attention 3, Mamba 2, Mixture of Agents.
[00:01:15] AI Charlie: Based, Sequoia, Evo, Dragonfly, Danfoo's Thunder Kittens, and many more research projects this year. As for Recursal and Featherless, we were the first podcast to feature RWKV last year, and this year the team has shipped RWKV v5, codenamed Eagle, to 1. 5 billion Windows 10 and Windows 11 machines worldwide to support Microsoft's on device, end Energy Usage Sensitive Windows Copilot Use Cases and has launched the first updates on RWKV v6, codenamed Finch and Goldfinch.
[00:01:53] AI Charlie: On the morning of Latent Space Live, they also announced QRdata UKv6, a QEN32B model [00:02:00] modified with RDWKV linear attention layers. Eugene has also written the most single most popular guest post on the Latent Space blog this year. Yes, we do take guest posts on what he has discovered about the H100 GPU inference NeoCloud market since the successful launch of Featherless AI this year.
[00:02:20] AI Charlie: As always, don't forget to check the show notes for the YouTube link to their talk as well as their slides. Watch out and take care.
[00:02:27] Intros
[00:02:27] Dan Fu: Yeah, so thanks so much for having us. So this is going to be a little bit of a two part presentation. My name is Dan. I'm at Together AI, and I'll be joining UCSD as faculty in about a year. And Eugene, you want to introduce yourself?
[00:02:46] Eugene Cheah: Eugene, I lead the art activity team, and I, I'm CEO of Featherless, and we both work on this new post transformer architecture space.
[00:02:55] Dan Fu: Yeah, so yeah, so today we're really excited to talk to you a little bit [00:03:00] about that. So first I'm going to give a broad overview of kind of the last few years of progress in non post transformer architectures. And then afterwards Eugene will tell us a little bit about the latest and the greatest and the latest frontier models in this space.
[00:03:16] Why Scale Context Lengths? or work on Efficient Models
[00:03:16] Dan Fu: So, the story starts with Scaling. So this is probably a figure or something like this that you've seen very recently. Over the last five to six years, we've seen models really scale up in parameter size, and that's brought with it a bunch of new capabilities, like the ability to talk to you and tell you sometimes how to use your Colab screens.
[00:03:35] Dan Fu: But another place where we've seen scaling especially recently is scaling in context length. So this can mean Having more text inputs for your models, but it can also mean things like taking a lot of visual token inputs image inputs to your models or generating lots of outputs. And one thing that's been really exciting over the last few months or so is that we're, we're seeing scaling, not only during training time, but also [00:04:00] during test time.
[00:04:00] Dan Fu: So this is one of the, the, this is the iconic image from the OpenAI 01 release. Not only are we starting to scale train time compute, but we're also starting to scale test time compute. Now if you're familiar with our attention and our transformer architectures today, this graph on the right might look a little bit scary.
[00:04:19] Dan Fu: And one of the reasons is that the implications are a little bit Interesting. So what does it mean if we want to continue having smarter and smarter models? Do we just need to start building bigger, bigger data centers, spending more flops? Is this this little Dolly 3, we need more flops, guys? Is this going to be the future of all of AI?
[00:04:39] Dan Fu: Or is there a better way, another path forward? Maybe we can get the same capabilities that we've gotten used to, But for a lot less compute, a lot less flops. And one of the things that we're going to talk about today is specifically looking at that core attention operator in some of these models.
[00:04:57] Dan Fu: And the reason is that so this is just some, some [00:05:00] basic you know, scaling curves, but attention has compute that scales quadratically in the context length. So that means that if you're doing something like test time compute and you want to spend a bunch of tokens thinking about what comes next, the longer that that goes the, the, the more tokens you spend on that, that compute grows quadratically in that.
[00:05:19] Dan Fu: One of the questions that we're interested in is, can we take that basic sequence model, that basic sequence primitive at the bottom, and get it to scale better? Can we scale in, let's say, n to the 3 halves or n log n? So in, in the first part of the talk, so we just went over the introduction. What I'm gonna do over the next few slides is just talk about some of the key advances and ideas that have shown over the past few years since maybe early 2020 to, to now that shown promise that this might actually be possible.
[00:05:48] Dan Fu: That you can actually get potentially the same quality that we want while scale, while scaling better. So to do that, we're and, and basically the, the story that we're gonna look is we're gonna start to see [00:06:00] how. So this is a basic graph of just the past couple years of progress of perplexity where that blue line, that dotted blue line, is attention.
[00:06:07] The Story of SSMs
[00:06:07] Dan Fu: It's your basic transformer, full dense attention. And then the dots coming down are some of the methods that you'll see in this presentation today. We're going to turn the clock back all the way to 2020. So this, this, this question of can we make attention subquadratic? Basically, as soon as we said attention is all you need, People started asking this question.
[00:06:28] Dan Fu: So we have this quadratic attention operator. Can we do better? I'll briefly talk about why attention is quadratic. And the basic thing that happens, if you're not familiar, is that you have these inputs, these keys and queries. And what you do in this attention matrix, this S matrix over here, is that you're using, you're comparing every token in your input to every other token.
[00:06:49] Dan Fu: So when I try to do something like upload a whole book to Gemini, what happens beyond the Maybe not Gemini, because we don't necessarily know what architecture is. But let's say we upload it to LLAMA, what happens beyond [00:07:00] the scenes, behind the scenes, is that it's going to take every single word in that book and compare it to every other word.
[00:07:05] Dan Fu: And this has been a really, it's, it's led to some pretty impressive things. But it's kind of a brute forcing of the way that you would try to interpret a interpret something. And what attention does in particular is the, and then what attention, sorry, don't want to. Okay, no, no laser pointer. What, what attention does afterwards is that instead of always operating in this quadratic thing, it takes a row wise softmax over this matrix, and then multiplies it by this values matrix.
[00:07:32] Dan Fu: So, one of the key points to notice is that the output size is always going to be the same as the inputs, at least in standard self attention. So one of the first things that folks tried to do around 2020 is this thing called linear attention, which is just, just noticing that if we take out this softmax from here, if we take out this non linearity in the middle of the attention operation, and then if you compute the keys and the values operation first, you actually never hit this quadratic bottleneck.
[00:07:57] Dan Fu: So that, that's potentially a way [00:08:00] to get a lot more computationally efficient. And there are various ways to do this by basically using feature maps or try to approximate this overall attention computation. But some of this work sort of started to hit a wall in 2020. And the basic challenges were, were two.
[00:08:16] Dan Fu: So one was quality. It was back then, it was kind of hard to, to get good quality with these linear attention operators. The other one was actually hardware efficiency. So these, this feature map that was just shown by a simplify simplify here. Actually ends up being quite computationally expensive if you just implement it naively.
[00:08:34] Dan Fu: So you started having these operators that not only were you sure, you're not really sure if they have the same quality, but also they're actually just wall clock slower. So you kind of end up getting the worst of both worlds. So this was the the stage. So that kind of sets the stage for four years ago.
[00:08:49] Dan Fu: Keep this in mind because linear attention is actually going to come back in a few years once we have a better understanding. But one of the works that started kicking off this, this [00:09:00] mini revolution in post transformer architectures was this idea called states based model. So here the seminal work is, is one about our work queue in 2022.
[00:09:09] Dan Fu: And this, this piece of work really brought together a few ideas from, from some long running research research lines of work. The first one was, and this is really one of the keys to, to closing the gap in quality was just using things that, that if you talk to a, a, an electrical engineer off the street, they might know off, off the, like the back of their hand.
[00:09:33] Idea 1: Approximation -> Principled Modeling
[00:09:33] Dan Fu: But taking some of those properties with how we model dynamical systems in signal processing and then using those ideas to model the inputs, the, the text tokens in, for example a transformer like Next Token Prediction Architecture. So some of those early states-based model papers were looking at this relatively, relatively simple recurrent update model that comes from maybe chapter one of a signal processing class.
[00:09:59] Dan Fu: But then using [00:10:00] some principle theory about how you should do that recurrent update in order to really get the most that you can out of your hidden state, out of your out of your sequence. So that, that was one key idea for quality and. When this was eventually realized, you started to see a bunch of benchmarks that were pretty sticky for a few years.
[00:10:20] Dan Fu: Things like long range arena, some long sequence evaluation benchmarks, There was stuff in time series, time series analysis. They started to, you started to see the quality tick up in meaningful ways. But the other key thing that What's so influential about these states based models is that they also had a key idea about how you can compute these things efficiently.
[00:10:45] Dan Fu: So if you go back to your machine learning 101 class where you learned about RNNs, one thing that you may have learned is that they don't paralyze as well as detention, because if you just run them naively, you have to do this kind of sequential update to process new tokens, [00:11:00] whereas in attention, you can process all the tokens in parallel at one time.
[00:11:04] Dan Fu: One of the key insights behind the S4 paper was that these recurrent models, you could take them and you could also formulate them as a convolution. And in particular, with a convolution, you could, instead of using a PyTorch conv1d operation, you can compute that with the FFT. And that would give you n log n compute in the in the sequence length n with an operator that was relatively well optimized for modern hardware.
[00:11:28] Dan Fu: So those are really, I'd say, the two key ideas in 2022 that started allowing these breakthroughs to happen in these non transformer architectures. So, these ideas about how to principally model sorry, how to model the recurrent updates of a mo of, of a sequence in a principled way, and also these key ideas in how you can compute it efficiently by turning it into a convolution and then scaling it up with the FFT.
[00:11:53] Dan Fu: Along those same lines, so afterwards we started putting out some work on specialized kernels, so just [00:12:00] like we have flash attention for transformers, we also have works like flash fft conf, and if you look at these lines of work oftentimes when, whenever you see a new architecture, you see a new primitive one of the, one of the table stakes now is, do you have an efficient kernel so that you can actually get wall clock speed up?
[00:12:14] Idea 3: Selection
[00:12:14] Dan Fu: So by 2022, We are starting to have these models that had promising quality primitives, but and, and also promising wall clocks. So you could actually see regimes where they were better than transformers in meaningful ways. That being said, there were, there's still sometimes a quality gap, particularly for language modeling.
[00:12:33] Dan Fu: And because languages, It's so core to what we do in sequence modeling these days the, the next, the next key idea that I'm going to talk about is this idea of selection mechanisms. And this is basically an idea of, so you have this recurrent state that you're keeping around that just summarizes everything that, that came before.
[00:12:50] Dan Fu: And to get a good sequence model, one of the things that you really need to be able to do is have the model learn what's the best way to pick out pieces from that recurrent [00:13:00] state. So one of the, one of the major ideas here in a line of work called H3, Hungry Hungry Hippos, and also these hyena models were One way you can do this is by just adding some simple element wise gates.
[00:13:13] Dan Fu: So versions of these ideas have been around for decades. If you squint at the LSTM paper you, you can probably find, find this gating mechanism. But turns out you can take those old ideas, add them into these new. state space models, and then you can see quality start to pick up. If you've heard of the Mamba model, this also takes the selection to the next level by actually making some changes in that fundamental recurrent state space.
[00:13:40] Dan Fu: So, it's not only just this gating that happens around the SSM layer, but also you can actually make The ABCD matrices of your state space model, you can make them data dependent, which will allow you to even better select out different pieces from your hidden state depending on what you're seeing. I'll also point out if you look at the [00:14:00] bottom right of this figure, there's this little triangle with a GPU SRAM, GPU HBM, and this, this is just continuing that trend of when you have a new architecture you, you, you also release it with a kernel to, to, to show that it is hardware efficient, that it, that it can be hardware efficient on modern hardware.
[00:14:17] Dan Fu: The, the, one of the next cool things that happened is once we had this understanding of these are the basic pieces, these are the basic principles behind some of the sequence models linear attention actually started to come back. So in earlier this year, there was a model called BASED the, from Simran Arora and, and some other folks, that combined a more principled version of linear attention that basically the, the, the, the two second summary is that it used a Taylor approximation of the softmax attention, combined that with a simple sliding window attention and was starting to able, starting to be able to expand the Pareto frontier of how much data can you recall from your sequence, versus how small is your recurrent state size.
[00:14:58] Dan Fu: So those orange dots [00:15:00] are, at the top there, are just showing smaller sequences that can recall more memory.
[00:15:07] Just Read Twice
[00:15:07] Dan Fu: And the last major idea I think that has been influential in this line of work and is very relatively late breaking just a few months ago, is just the basic idea that when you have these models that are fundamentally more efficient in the sequence length, you maybe don't want to prompt them or use them in exactly the same way.
[00:15:26] Dan Fu: So this was a really cool paper called Just Read Twice, also from Simran. That basically said, hey, all these efficient models can process tokens so much more efficiently than transformers that they can sometimes have unfair advantages compared to a simple transformer token. So, or sorry, a simple transformer model.
[00:15:44] Dan Fu: So take, for example the standard, the standard use case of you have some long document, you're going to pass it in as input, and then you're going to ask some question about it. One problem you might imagine for a recurrent model where you have a fixed state size is, let's say that [00:16:00] you're. Article is very long, and you're trying to ask about some really niche thing.
[00:16:04] Dan Fu: You can imagine it might be hard for the model to know ahead of time what information to put into the hidden state. But these, these, these models are so much more efficient that you can do something really stupid, like, you can just put the document write down the document, write down the question, write down the document again, and then write down the question again, and then this time, the second time that you go over that document, you know exactly what to look for.
[00:16:25] Dan Fu: And the cool thing about this is, so this is, And this this results in better quality, especially on these recall intensive tasks. But the other interesting thing is it really takes advantage of the more efficient architectures that, that we're having here. So one of the other, I think, influential ideas in this line of work is if you change the fundamental compute capabilities of your model and the way that it scales, you can actually start to query it at test time differently.
[00:16:51] Idea 4: Test Time Compute
[00:16:51] Dan Fu: And this actually, of course, goes back to those slides on test time compute. So while everybody's looking at, say, test time compute for big transformer models, [00:17:00] I think potentially a really interesting research question is, how can you take those and how does it change with this new next generation of models?
[00:17:09] Dan Fu: So the, I'll just briefly summarize what some of those key ideas were and then talk and then show you briefly kind of what the state of the art is today. So, so the four key ideas are instead of just doing a simple linear attention approximation, instead take ideas that we know from other fields like signal processing, do a more principled approach to your modeling of the sequence.
[00:17:32] Idea 2: Hardware & Kernel Support
[00:17:32] Dan Fu: Another key idea throughout all these lines of work is you really want. Hardware and kernel support from day one. So, so even if your model is theoretically more efficient if somebody goes and runs it and it's two times slower one of the things that, that we've learned is that if, if you're in that situation, it's, it's just gonna be dead on arrival.
[00:17:49] Dan Fu: So you want to be designing your architectures one of the key, key machine learning ideas that has been important for the quality is just making sure that you encode different ways that you can [00:18:00] select from your hidden state and, and really focus on that as a key decider of quality. And finally, I think one of the, the, the emerging new, new things for, for this line of work and something that's quite interesting is, What are the right test time paradigms for these models?
[00:18:15] Dan Fu: How do they change relative to relative to what you might do for a standard transformer? I'll briefly end this section. So I've labeled this slide where we are yesterday because Eugene is going to talk about some new models that he released literally this morning. But as of yesterday, some of the really cool results out of the, these efficient alternative models were so AI2 trained this hybrid MOE called Jamba.
[00:18:40] Dan Fu: That, that, that seems, that is currently the state of the art for these non transformer architectures. There's this NVIDIA and MIT put out this new diffusion model called SANA recently that one of their key key observations is that you can take a standard diffusion transformer diffusion model, replace the layers with linear [00:19:00] attention, and then that lets you scale to much larger much larger images, much, much Much larger sequences more efficiently.
[00:19:07] Dan Fu: And and one thing that I don't think anybody would have called when a few years ago is that one of those gated SSM, gated states based models ended up on the cover of Science because a great group of folks went and trained some DNA models. So that's Michael Polley, Eric Yuen from from Stanford and the Arc Institute.
[00:19:26] Dan Fu: So it's, we're really at an exciting time in 2024 where these non transformer, post transformer architectures are showing promise across a wide range. Across a wide range of, of modalities, of applications, and, and of tasks. And with that, I'll pass it on to Eugene, who can tell you a little bit about the latest and greatest with RWKV.
[00:19:49] RWKV vs SSMs
[00:19:49] Eugene Cheah: So, that's useful? Yeah. You're talking to here. Oh, I'm talking to here. Okay. So, yeah, two streams. Yeah. So, I think one common questions that we tend to get asked, right, is what's the difference between [00:20:00] RWKV and state space? So I think one of the key things to really understand, right the difference between the two groups, right, is that we are actually more like an open source, random internet meets academia kind of situation.
[00:20:11] Eugene Cheah: Like, most of us never wrote any paper, but we, we basically look at RNNs and linear intention when intention is all you need came out, and then we decided to like, hey there is a quadratic scaling problem. Why don't we try fixing that instead? So, so, so we end up developing our own branch, but we end up sharing ideas back and forth.
[00:20:30] Eugene Cheah: So, and, and we do all this actively in Discord, GitHub, etc. This was so bad for a few years, right, that basically, the average group's H index was so close to zero, right, Illuter. ai actually came in and helped us write our first paper. Great, now our H index is now three, apparently. So, so, so, but, but the thing is, like, a lot of these experiments led to results, and, and, essentially, essentially, we we took the same ideas from linear attention, [00:21:00] and we built on it.
[00:21:01] Eugene Cheah: So, to take a step back into, like, how does RWKB handle its own attention mechanic and achieve the same goals of, like, O and compute, respectively, and in focus of our overall goal to make AI accessible to everyone, regardless of language, nation, or compute, that's our goal. We actually train our models primarily on over a hundred languages, which is another topic altogether.
[00:21:23] Eugene Cheah: And our goal is to train to even 200 languages to cover all languages in the world. But at the same time, we work on this architecture, To lower the compute cost so that people can run it on Raspberry Pis and on anything. So, how did RWKB break the dependency of LSTM token flow? Because I think to understand architecture, right, it's probably easier to understand it from the RNN lens.
[00:21:46] Eugene Cheah: Because that's where we built on. We all, we all state space kind of like try to, try to start anew and took lessons from that and say, So there's a little bit of divergence there. And AKA, this our version of linear attention. So to take step back [00:22:00] all foundation models, be it transformers or non transformers at a very high level, right?
[00:22:05] Eugene Cheah: Pumps in the token. I mean, text that things into embeddings and go through a lot of layers. Generate a lot of states where the QKV cache or be iron in states or RW KB states. And outputs and embedding, they are not the same thing. And we just take more layers and more embeddings. And somehow that magically works.
[00:22:23] Eugene Cheah: So, if you, if you remember your ancient RNN lessons which we, which we, which we we call best learning these days the general idea is that you have the embedding information flowing all the way up, and when, and you take that information and you flow it back down, and then you process it as part of your LSTM layers.
[00:22:41] Eugene Cheah: So, this is how it generally works. Kapati is quoted saying that RNNs are actually unreasonably effective. The problem is this is not scalable. To start doing work on the second token, you need to wait for the first token. And then you need to, and likewise for the third token and fourth token, yada yada.
[00:22:55] Eugene Cheah: That is CPU land, not GPU land. So, so, so, you [00:23:00] can have a H100 and you can't even use 1 percent of it. So, so that's kind of why RNNs didn't really take off in the direction that we wanted, like, billions of parameters when it comes to training. So, what did RDAP KV version 0 do? Boom. We just did the dumbest, lamest thing.
[00:23:13] Eugene Cheah: Sorry, this is the bottleneck for RNN. We did the dumb thing of removing that line. And it kind of worked. It trained. It sucked, but it kind of worked. Then we were like, hey, then no one cared because the loss was crap, but how do we improve that? And that's essentially where we move forward, because if you see this kind of flow, right, you can actually get your GPU saturated quickly, where it essentially cascades respectively.
[00:23:41] Eugene Cheah: So I'm just waiting for this to loop again. So it's like, once you get your first layer, your token to be computed finish. You start to cascade your compute all the way until you are, Hey, I'm using 100 percent of the GPU. So we, we worked on it, and we started going along the principle of that as long as we keep this general architecture [00:24:00] where, where we can cascade and, and be highly efficient with our architecture, nothing is sacred in our architecture.
[00:24:06] Eugene Cheah: And we have done some crazy ideas. In fact, you ask us, if you ask me to explain some things in the paper, right, officially in the paper, I'll say we had this idea and we wrote it this way. The reality is someone came with a code, we tested it, it worked, and then we rationalized later. So, so the general
[00:24:24] RWKV Arch
[00:24:24] Eugene Cheah: The idea behind rwkbr is that we generally have two major blocks that we do.
[00:24:30] Eugene Cheah: We call time mix and channel mix. And time mix generally handles handles long term memory states, where essentially, where essentially where we apply the matrix multiplication and Cilu activation functions into processing an input embedding and an output embedding. I'm oversimplifying it because this, This calculation changed every version and we have, like, version 7 right now.
[00:24:50] Eugene Cheah: ChannelMix is similar to Base in the sense that it does shorter term attention, where it just looks at the sister token, or the token before it, because [00:25:00] there's a shift in the token shift matrix. I don't really want to go too much into the papers itself, because, like, we do have three papers on this.
[00:25:09] Eugene Cheah: Basically, RWKB, RNN for the transformer, ERA, Ego and Pinch, RWKB, Matrix Value State. This is the updated version 5, version 6. And Goldfinch is our, is, is, is, is our hybrid model respectively. We are writing the paper already for V seven and which is, which is for R wk V seven. Called, named Goose, or architectures are named by Bird.
[00:25:30] Eugene Cheah: And, I'm going to cover as well, qrwkb, and mama100k, and rwkb, and Where did that lead to? Great! Because we are all GPU poor and to be clear, like, most of this research is done, like, only on a handful H100s, which I had one Google researcher told me that was, like, his experiment budget for a single researcher.
[00:25:48] Eugene Cheah: So, our entire organization has less compute than a single researcher in Google. So We, we, one of the things that we explored into was to how do we convert transformer models instead? Because [00:26:00] someone already paid that billion dollars, a million dollars onto training, so why don't we take advantage of those weights?
[00:26:05] Eugene Cheah: And, and to, I believe, together AI worked on the lockets for, for the Lambda side of things, and, and we took some ideas from there as well, and we essentially did that for RWKB.
[00:26:15] QWRKWv6 launch
[00:26:15] Eugene Cheah: And that led to, Q RWKB6, which we just dropped today, a 32 bit instruct preview model, where we took the Quen 32 bit instruct model, freeze the feedforward layer, remove the QKB attention layer, and replace it with RWKB linear layers.
[00:26:32] Eugene Cheah: So to be clear, this means we do not have the rwkv channel mix layer, we only have the time mix layer. But but once we do that, we train the rwkv layer. Important is that the feedforward layer needs to be frozen, so the new attention can be learned. And then we unfreeze the feedforward layer, and train all the layers together with a custom learning rate schedule, so that they can learn how to work together.
[00:26:54] Eugene Cheah: The end result, surprisingly, And, to be honest, to the frustration of the R. W. [00:27:00] KV MOE team, which ended up releasing the model on the same day, was that, with just a few hours of training on two nodes, we managed to get it to be on par, kind of, with the original QUAN32B model. So, in fact, when the first run, right, that completely confused us, it was like, and I was telling Daniel Goldstein, Smirky, who kind of leads most of our research coordination, When you pitched me this idea, you told me at best you'll get the same level of performance.
[00:27:26] Eugene Cheah: You didn't tell me the challenge and score and Winograd score will shoot up. I don't know what's happening there. But it did. MMLU score dropping, that was expected. Because if you think about it, when we were training all the layers, right, we were essentially Like, Frankenstein this thing, and we did brain damage to the feedforward network layer 2 with the new RWKB layers.
[00:27:47] Eugene Cheah: But, 76%, hey, somehow it's retained, and we can probably further train this. We didn't even spend more than 3 days training this, so there's a lot more that can be done, hence the preview. This brings up [00:28:00] a big question, because We are already now in the process of converting to 7TB. We are now, this is actually extremely compute efficient to test our attention mechanic.
[00:28:10] Eugene Cheah: It's like, it becomes a shortcut. We can, we are already planning to do our version 7 and our hybrid architecture for it. Because we don't need to train from scratch. And we get a really good model out of it. And the other thing that is uncomfortable to say is that because we are doing right now on the 70b is that if this scales correctly to 128k context length, I'm not even talking about a million 128, majority of enterprise workload today is just on 70b at under 32k context length.
[00:28:41] Eugene Cheah: That means if this works and the benchmark matches it, It means we can replace the vast majority of current AI workload, unless you want super long context. And then sorry, can someone give us more GPUs? Because we do need the VRAM for super long context, sadly. So yeah, that's what we are working on, and essentially, [00:29:00] we are excited about this to just push it further.
[00:29:02] Eugene Cheah: And this conversion process, to be clear, I don't think it's going to be exclusive to RWKB. It probably will work for Mamba as well, I don't see why not. And we will probably see more ideas, or more experiments, or more hybrids, or Yeah, like, one of the weirdest things that I wanted to say outright, and I confirmed this with the Black Mamba team and the Jamba team, which because we did the GoFinch hybrid model, is that none of us understand why a hard hybrid with a state based model to be R.
[00:29:28] Eugene Cheah: QA state space and transformer performs better when, than the baseline of both. It's like, it's like when you train one, you expect, and then you replace, you expect the same results. That's our pitch. That's our claim. But somehow when we jam both together, it outperforms both. And that's like one area of emulation that, like, we only have four experiments, plus four teams, that a lot more needs to be done.
[00:29:51] Eugene Cheah: But, but these are things that excite me, essentially, because that is what it's potentially we can move ahead for. Which brings us to what comes next.
[00:30:00] What's next
[00:30:00] [00:30:00]
[00:30:00] Dan Fu: So, this part is kind of just some, where we'll talk a little bit about stuff that, that we're excited about. Maybe have some wild speculation on, on what, what's, what's coming next.
[00:30:12] Dan Fu: And, of course this is also the part that will be more open to questions. So, a couple things that, that I'm excited about is continued hardware model co design for, for these models. So one of the things that we've put out recently is this library called ThunderKittens. It's a CUDA library.
[00:30:29] Dan Fu: And one of the things that, that we found frustrating is every time that we built one of these new architectures, and I'm sure you had the exact same experience, we'd have to go and spend two months in CUDA land, like writing these, these new efficient things. And. If we decided to change one thing in PyTorch, like one line of PyTorch code is like a week of CUDA code at least.
[00:30:47] Dan Fu: So one of our goals with, with a library like Thunderkitten, so we, we just broke down what are the key principles, what are the key hardware things what are the key, Compute pieces that you get from the hardware. So for example on [00:31:00] H100 everything is really revolves around a warp group matrix multiply operation.
[00:31:06] Dan Fu: So you really want your operation to be able to split into relatively small matrix, matrix multiply operations. So like multiplying two 64 by 64 matrices, for example. And so if you know that ahead of time when you're designing your model, that probably gives you you know, some information about how you set the state sizes, how you set the update, how you set the update function.
[00:31:27] Dan Fu: So with Thunderkittens we basically built a whole library just around this basic idea that all your basic compute primitives should not be a float, but it should be a matrix, and everything should just be matrix compute. And we've been using that to, to try to both re implement some existing architectures, and also start to design code.
[00:31:44] Dan Fu: Some new ones that are really designed with this core with a tensor core primitive in mind. Another thing that that we're, that at least I'm excited about is we, over the last four or five years, we've really been looking at language models as the next thing. But if you've been paying [00:32:00] attention to Twitter there's been a bunch of new next generation models that are coming out.
[00:32:04] Dan Fu: So there, there are. So, video generation models that can run real time, that are supported by your mouse and your keyboard, that I'm told if you play with them that, you know, that they only have a few seconds of memory. Can we take that model, can we give it a very long context length so that you could actually maybe generate an entire game state at a time?
[00:32:25] Dan Fu: What does that look like for the model? You're certainly not going to do a giant quadratic attention computation to try to run that. Maybe, maybe use some of these new models, or some of these new video generation models that came out. So Sora came out I don't know, two days ago now. But with super long queue times and super long generation times.
[00:32:43] Dan Fu: So that's probably a quadratic attention operation at the, at the bottom of it. What if we could remove that and get the same quality, but a lot faster generation time? Or some of the demos that we saw from Paige earlier today. You know, if I have a super long conversation with my [00:33:00] Gemini bot, what if I wanted to remember everything that it's seen in the last week?
[00:33:06] Dan Fu: I mean, maybe you don't for personal reasons, but what if I did, you know? What does that mean for the architecture? And I think, you know, that's certainly something I'm pretty excited about. I'm sure you're excited about it too. So, I think we were supposed to have some hot takes, but I honestly don't remember what our hot takes were.
[00:33:21] Hot Takes - does anyone really need long context?
[00:33:21] Eugene Cheah: Yeah, including the next slide. Hot takes, yes, these are our
[00:33:25] Dan Fu: hot takes.
[00:33:25] Eugene Cheah: I think the big one on Twitter that we saw, that we shared, was the question is like, is RAG relevant? In the case of, like, the future of, like, state based models?
[00:33:38] Dan Fu: Let's see, I haven't played too much with RAG. But when I have. I'll say I found it was a little bit challenging to do research on it because we had this experience over and over again, where you could have any, an embedding model of any quality, so you could have a really, really bad embedding model, or you could have a really, really [00:34:00] good one, By any measure of good.
[00:34:03] Dan Fu: And for the final RAG application, it kind of didn't matter. That's what I'll say about RAG while I'm being recorded. I know it doesn't actually answer the question, but
[00:34:13] Eugene Cheah: Yeah, so I think a lot of folks are like, extremely excited of the idea of RWKB or State Space potentially having infinite context.
[00:34:21] Eugene Cheah: But I think the reality is that when we say infinite context, we just mean a different kind of infinite context, or you, or as it's previously covered, you need to test the model differently. So, think of it more along the lines of the human. Like, I don't remember what I ate for breakfast yesterday.
[00:34:37] Eugene Cheah: Yeah, that's the statement that I'll say. And And we humans are not quadratic transformers. If we did, if let's say we increased our brain size for every second we live, we would have exploded by the time we are 5 years old or something like that. And, and I think, I think basically fundamentally for us, right, be it whether we, regardless of whether RWKB, statespace, XLSTM, [00:35:00] etc, our general idea is that instead of that expanding state, that increase in computational cost, what if we have a fixed state size?
[00:35:08] Eugene Cheah: And Information theory detects that that fixed state size will have a limit. Just how big of a limit is a question, like, we, like, RWKB is running at 40 megabytes for, for its state. Its future version might run into 400 megabytes. That is like millions of tokens in, if you're talking about mathematically, the maximum possibility.
[00:35:29] Eugene Cheah: It's just that I guess we were all more inefficient about it, so maybe we hit 100, 000. And that's kind of like the work we are doing, trying to like push it and maximize it. And that's where the models will start differing, because it will choose to forget things, it will choose to remember things. And that's why I think that there might be some element of right, but it may not be the same right.
[00:35:49] Eugene Cheah: It may be the model learn things, and it's like, hmm, I can't remember that, that article. Let me do a database search, to search. Just like us humans, when we can't remember the article in the company. We do a search on Notion. [00:36:00]
[00:36:00] Dan Fu: I think something that would be really interesting is if you could have facts that are, so right now, the one intuition about language models is that all those parameters are around just to store random facts about the world.
[00:36:14] Dan Fu: And this intuition comes from the observation that if you take a really small language model, it can do things like talk to you, or kind of has like the The style of conversation, it can learn that, but where it will usually fall over compared to a much larger one is it'll just be a lot less factual about things that it knows or that it can do.
[00:36:32] Dan Fu: But that points to all those weights that we're spending, all that SGD that we're spending to train these models are just being used to store facts. And we have things like databases that are pretty good at storing facts. So I think one thing that would be really interesting is if we could actually have some sort of outside data store that a language model can can look at that that maybe is you know, has has some sort of gradient descent in it, but but would be quite interesting.
[00:36:58] Dan Fu: And then maybe you could edit it, delete [00:37:00] facts, you know, change who's president so that it doesn't, it doesn't get lost.
[00:37:04] Vibhu: Can we open up Q& A and hot takes for the audience? I have a hot take Q& A. Do these scale? When, when 405B state space model, RAG exists, no one does long context, who's throwing in 2 million token questions, hot takes?
[00:37:24] Dan Fu: The, the who's throwing in 2 million token question, I think, is, is a really good question. So I actually, I was going to offer that as a hot take. I mean, my hot take was going to be that long context doesn't matter. I know I just gave a whole talk about it, but you know, what, what's the point of doing research if you can't, you know, play both sides.
[00:37:40] Dan Fu: But I think one of the, so I think for both of us, the reason that we first got into this was just from the first principled questions of there's this quadratic thing. Clearly intelligence doesn't need to be quadratic. What is going on? Can we understand it better? You know, since then it's kind of turned into a race, which has [00:38:00] been exciting to watch, like, how much context you can take in.
[00:38:03] Dan Fu: But I think it's right. Nobody is actually putting in a two million context prompt into these models. And, and, you know, if they are, maybe we can go, go You know, design a better model to do that particular thing. Yeah, what do you think about that? So you've also been working on this. Do you think long context matters?
[00:38:19] Eugene Cheah: So I'm going to burn a bit. How many of you remember the news of Google Gemini supporting 3 million contacts, right? Raise your hand.
[00:38:28] Vibhu: Yeah, 2 million.
[00:38:29] Eugene Cheah: Oh, it's 2 million.
[00:38:31] Eugene Cheah: Yeah, how many of you actually tried that? See?
[00:38:34] Vibhu: I use it a lot. You? You work for MindsTV. I use it a lot.
[00:38:41] Eugene Cheah: So, for some people that has used, and I think, I think that's the, that's might be, like, this is where my opinion starts to differ, because I think the big labs may have a bigger role in this, because Like, even for RWKB, even when we train non contacts, the reason why I say VRAM is a problem is that because when we did the, we need to backprop [00:39:00] against the states, we actually need to maintain the state in between the tokens by the token length.
[00:39:05] Eugene Cheah: So that means we need to actually roll out the whole 1 million contacts if we are actually training 1 million. Which is the same for transformers, actually, but it just means we don't magically reuse the VRAM consumption in the training time space. So that is one of the VRAM bottlenecks, and I'm neither OpenAI nor Google, so donate GPUs if you have too much of them.
[00:39:27] Eugene Cheah: But then, putting it back to another paradigm, right, is that I think O1 style reasoning might be actually pushing that direction downwards. In my opinion, this is my partial hot take is that if, let's say you have a super big model, And let's say you have a 70B model that may take double the tokens, but gets the same result.
[00:39:51] Eugene Cheah: Strictly speaking, a 70B, and this is even for transformer or non transformer, right? We we'll take less less resources than that 400 B [00:40:00] model, even if it did double the amount thinking. And if that's the case, and we are still all trying to figure this out, maybe the direction for us is really getting the sub 200 B to be as fast as efficient as possible.
[00:40:11] Eugene Cheah: We a very efficient architecture that some folks happen to be working on to, to just reason it out over larger and larger context thing.
[00:40:20] Question: Yeah. One thing I'm super interested in is. Models that can watch forever? Obviously you cannot train something on infinite context length. How are y'all thinking about that, where you run on a much longer context length than is possible to train on?
[00:40:38] Dan Fu: Yeah, it's a, it's a great question. So I think when I think you guys probably had tweets along these lines, too. When we first started doing these things, because these are all recurrent models in theory you could just run it forever. You could just run it forever. And at the very least it won't, it won't like error out on your crash.
[00:40:57] Dan Fu: There's another question of whether it can actually [00:41:00] use what it's seen in that infinite context. And I think there, so one place where probably the research and architectures ran faster Then another research is actually the benchmarks for long context. So you turn it on forever. You want to do everything or watch everything.
[00:41:16] Dan Fu: What is it that you actually wanted to do? Can we actually build some benchmarks for that? Then measure what's happening. And then ask the question, can the models do it? Is there something else that they need? Yeah, I think that if I were to turn back the clock to 2022, that's probably one of the things I would have done differently, which would have been actually get some long context benchmarks out at the same time as we started pushing context length on all these models.
[00:41:41] Eugene Cheah: I will also say the use case. So like, I think we both agree that there's no Infinite memory and the model needs to be able to learn and decide. I think what we have observed for, I think this also fits the state space model, is that one of the key advantages of this alternate attention mechanic that is not based on token position is that the model don't suddenly become crazy when you go past the [00:42:00] 8k training context tank, or a million context tank.
[00:42:03] Eugene Cheah: It's actually still stable. It's still able to run, it's still able to rationalize. It just starts forgetting things. But some of these things are still there in latent memory. Some of these things are still somewhat there. That's the whole point of why reading twice works. Things like that. And one of the biggest pushes in this direction is that I think both Statespace and RWKB have Separate papers by other researchers where they use this architecture for time series data.
[00:42:26] Eugene Cheah: Weather modeling. So, you are not asking what was the weather five days ago. You're asking what's the weather tomorrow based on the infinite length that we, as long as this Earth and the computer will keep running. So, so, and they found that it is like, better than existing, like, transformer or existing architecture in modeling this weather data.
[00:42:47] Eugene Cheah: Control for the param size and stuff. I'm quite sure there are people with larger models. So, so there are things that, that in this case, right, there is future applications if your question is just what's next and not what's 10 years ago.
[00:42:59] Dan Fu: Thanks so [00:43:00] much for having us.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
2024 in Open Models [LS Live @ NeurIPS]
23 dec· Latent Space: The AI Engineer Podcast
Happy holidays! We’ll be sharing snippets from Latent Space LIVE! through the break bringing you the best of 2024! We want to express our deepest appreciation to event sponsors AWS, Daylight Computer, Thoth.ai, StrongCompute, Notable Capital, and most of all our LS supporters who helped fund the venue and A/V production!
For NeurIPS last year we did our standard conference podcast coverage interviewing selected papers (that we have now also done for ICLR and ICML), however we felt that we could be doing more to help AI Engineers 1) get more industry-relevant content, and 2) recap 2024 year in review from experts. As a result, we organized the first Latent Space LIVE!, our first in person miniconference, at NeurIPS 2024 in Vancouver.
Since Nathan Lambert ( Interconnects ) joined us for the hit RLHF 201 episode at the start of this year, it is hard to overstate how much Open Models have exploded this past year. In 2023 only five names were playing in the top LLM ranks, Mistral, Mosaic's MPT, TII UAE's Falcon, Yi from Kai-Fu Lee's 01.ai, and of course Meta's Llama 1 and 2. This year a whole cast of new open models have burst on the scene, from Google's Gemma and Cohere's Command R, to Alibaba's Qwen and Deepseek models, to LLM 360 and DCLM and of course to the Allen Institute's OLMo, OL MOE, Pixmo, Molmo, and Olmo 2 models.
We were honored to host Luca Soldaini, one of the research leads on the Olmo series of models at AI2.
Pursuing Open Model research comes with a lot of challenges beyond just funding and access to GPUs and datasets, particularly the regulatory debates this year across Europe, California and the White House. We also were honored to hear from and Sophia Yang, head of devrel at Mistral, who also presented a great session at the AI Engineer World's Fair Open Models track!
Full Talk on YouTube
Please like and subscribe!
Timestamps
* 00:00 Welcome to Latent Space Live
* 00:12 Recap of 2024: Best Moments and Keynotes
* 01:22 Explosive Growth of Open Models in 2024
* 02:04 Challenges in Open Model Research
* 02:38 Keynote by Luca Soldani: State of Open Models
* 07:23 Significance of Open Source AI Licenses
* 11:31 Research Constraints and Compute Challenges
* 13:46 Fully Open Models: A New Trend
* 27:46 Mistral's Journey and Innovations
* 32:57 Interactive Demo: Lachat Capabilities
* 36:50 Closing Remarks and Networking
Transcript
Session3Audio
[00:00:00] AI Charlie: Welcome to Latent Space Live, our first mini conference held at NeurIPS 2024 in Vancouver. This is Charlie, your AI co host. As a special treat this week, we're recapping the best of 2024 going domain by domain. We sent out a survey to the over 900 of you who told us what you wanted, and then invited the best speakers in the latent space network to cover each field.
[00:00:28] AI Charlie: 200 of you joined us in person throughout the day, with over 2, 200 watching live online. Our next keynote covers the state of open models in 2024, with Luca Soldani and Nathan Lambert of the Allen Institute for AI, with a special appearance from Dr. Sophia Yang of Mistral. Our first hit episode of 2024 was with Nathan Lambert on RLHF 201 back in January.
[00:00:57] AI Charlie: Where he discussed both reinforcement learning for language [00:01:00] models and the growing post training and mid training stack with hot takes on everything from constitutional AI to DPO to rejection sampling and also previewed the sea change coming to the Allen Institute. And to Interconnects, his incredible substack on the technical aspects of state of the art AI training.
[00:01:18] AI Charlie: We highly recommend subscribing to get access to his Discord as well. It is hard to overstate how much open models have exploded this past year. In 2023, only five names were playing in the top LLM ranks. Mistral, Mosaics MPT, and Gatsby. TII UAE's Falcon, Yi, from Kaifu Lee's 01. ai, And of course, Meta's Lama 1 and 2.
[00:01:43] AI Charlie: This year, a whole cast of new open models have burst on the scene. From Google's Jemma and Cohere's Command R, To Alibaba's Quen and DeepSeq models, to LLM360 and DCLM, and of course, to the Allen Institute's OLMO, [00:02:00] OLMOE, PIXMO, MOLMO, and OLMO2 models. Pursuing open model research comes with a lot of challenges beyond just funding and access to GPUs and datasets, particularly the regulatory debates this year across Europe.
[00:02:14] AI Charlie: California and the White House. We also were honored to hear from Mistral, who also presented a great session at the AI Engineer World's Fair Open Models track. As always, don't forget to check the show notes for the YouTube link to their talk, as well as their slides. Watch out and take care.
[00:02:35] Luca Intro
[00:02:35] Luca Soldaini: Cool. Yeah, thanks for having me over. I'm Luca. I'm a research scientist at the Allen Institute for AI. I threw together a few slides on sort of like a recap of like interesting themes in open models for, for 2024. Have about maybe 20, 25 minutes of slides, and then we can chat if there are any questions.
[00:02:57] Luca Soldaini: If I can advance to the next slide. [00:03:00] Okay, cool. So I did the quick check of like, to sort of get a sense of like, how much 2024 was different from 2023. So I went on Hugging Face and sort of get, tried to get a picture of what kind of models were released in 2023 and like, what do we get in 2024?
[00:03:16] Luca Soldaini: 2023 we get, we got things like both LLAMA 1 and 2, we got Mistral, we got MPT, Falcon models, I think the YI model came in at the end. Tail end of the year. It was a pretty good year. But then I did the same for 2024. And it's actually quite stark difference. You have models that are, you know, reveling frontier level.
[00:03:38] Luca Soldaini: Performance of what you can get from closed models from like Quen, from DeepSeq. We got Llama3. We got all sorts of different models. I added our own Olmo at the bottom. There's this growing group of like, Fully open models that I'm going to touch on a little bit later. But you know, just looking at the slides, it feels like 2024 [00:04:00] was just smooth sailing, happy knees, much better than previous year.
[00:04:04] Luca Soldaini: And you know, you can plot you can pick your favorite benchmark Or least favorite, I don't know, depending on what point you're trying to make. And plot, you know, your closed model, your open model and sort of spin it in ways that show that, oh, you know open models are much closer to where closed models are today versus to Versus last year where the gap was fairly significant.
[00:04:29] Luca Soldaini: So one thing that I think I don't know if I have to convince people in this room, but usually when I give this talks about like open models, there is always like this background question in, in, in people's mind of like, why should we use open models? APIs argument, you know, it's, it's. Just an HTTP request to get output from a, from one of the best model out there.
[00:04:53] Luca Soldaini: Why do I have to set up infra and use local models? And there are really like two answer. There is the more [00:05:00] researchy answer for this, which is where it might be. Background lays, which is just research. If you want to do research on language models, research thrives on, on open models, there is like large swath of research on modeling, on how these models behave on evaluation and inference on mechanistic interpretability that could not happen at all if you didn't have open models they're also for AI builders, they're also like.
[00:05:30] Luca Soldaini: Good use cases for using local models. You know, you have some, this is like a very not comprehensive slides, but you have things like there are some application where local models just blow closed models out of the water. So like retrieval, it's a very clear example. We might have like constraints like Edge AI applications where it makes sense.
[00:05:51] Luca Soldaini: But even just like in terms of like stability, being able to say this model is not changing under the hood. It's, there's plenty of good cases for, [00:06:00] for open models. And the community is just not models. Is I stole this slide from one of the Quent2 announcement blog posts. But it's super cool to see like how much tech exists around open models and serving them on making them efficient and hosting them.
[00:06:18] Luca Soldaini: It's pretty cool. And so. It's if you think about like where the term opens come from, comes from like the open source really open models meet the core tenants of, of open, of open source specifically when it comes around collaboration, there is truly a spirit, like through these open models, you can build on top of other people.
[00:06:41] Luca Soldaini: innovation. We see a lot of these even in our own work of like, you know, as we iterate in the various versions of Alma it's not just like every time we collect from scratch all the data. No, the first step is like, okay, what are the cool data sources and datasets people have put [00:07:00] together for language model for training?
[00:07:01] Luca Soldaini: Or when it comes to like our post training pipeline We one of the steps is you want to do some DPO and you use a lot of outputs of other models to improve your, your preference model. So it's really having like an open sort of ecosystem benefits and accelerates the development of open models.
[00:07:23] The Definition of Open Models
[00:07:23] Luca Soldaini: One thing that we got in 2024, which is not a specific model, but I thought it was really significant, is we first got we got our first open source AI definition. So this is from the open source initiative they've been generally the steward of a lot of the open source licenses when it comes to software and so they embarked on this journey in trying to figure out, okay, How does a license, an open source license for a model look like?
[00:07:52] Luca Soldaini: Majority of the work is very dry because licenses are dry. So I'm not going to walk through the license step by [00:08:00] step, but I'm just going to pick out one aspect that is very good and then one aspect that personally feels like it needs improvement on the good side. This this open source AI license actually.
[00:08:13] Luca Soldaini: This is very intuitive. If you ever build open source software and you have some expectation around like what open source looks like for software for, for AI, sort of matches your intuition. So, the weights need to be fairly available the code must be released with an open source license and there shouldn't be like license clauses that block specific use cases.
[00:08:39] Luca Soldaini: So. Under this definition, for example, LLAMA or some of the QUEN models are not open source because the license says you can't use this model for this or it says if you use this model you have to name the output this way or derivative needs to be named that way. Those clauses don't meet open source [00:09:00] definition and so they will not be covered.
[00:09:02] Luca Soldaini: The LLAMA license will not be covered under the open source definition. It's not perfect. One of the thing that, um, internally, you know, in discussion with with OSI, we were sort of disappointed is around the language. For data. So you might imagine that an open source AI model means a model where the data is freely available.
[00:09:26] Luca Soldaini: There were discussion around that, but at the end of the day, they decided to go with a softened stance where they say a model is open source if you provide sufficient detail information. On how to sort of replicate the data pipeline. So you have an equivalent system, sufficient, sufficiently detailed.
[00:09:46] Luca Soldaini: It's very, it's very fuzzy. Don't like that. An equivalent system is also very fuzzy. And this doesn't take into account the accessibility of the process, right? It might be that you provide enough [00:10:00] information, but this process costs, I don't know, 10 million to do. Now the open source definition. Like, any open source license has never been about accessibility, so that's never a factor in open source software, how accessible software is.
[00:10:14] Luca Soldaini: I can make a piece of open source, put it on my hard drive, and never access it. That software is still open source, the fact that it's not widely distributed doesn't change the license, but practically there are expectations of like, what we want good open sources to be. So, it's, It's kind of sad to see that the data component in this license is not as, as, Open as some of us would like would like it to be.
[00:10:40] Challenges for Open Models
[00:10:40] Luca Soldaini: and I linked a blog post that Nathan wrote on the topic that it's less rambly and easier to follow through. One thing that in general, I think it's fair to say about the state of open models in 2024 is that we know a lot more than what we knew in, [00:11:00] in 2023. Like both on the training data, like And the pre training data you curate on like how to do like all the post training, especially like on the RL side.
[00:11:10] Luca Soldaini: You know, 2023 was a lot of like throwing random darts at the board. I think 2024, we have clear recipes that, okay, don't get the same results as a closed lab because there is a cost in, in actually matching what they do. But at least we have a good sense of like, okay, this is, this is the path to get state of the art language model.
[00:11:31] Luca Soldaini: I think that one thing that it's a downside of 2024 is that I think we are more research constrained in 2023. It feels that, you know, the barrier for compute that you need to, to move innovation along as just being right rising and rising. So like, if you go back to this slide, there is now this, this cluster of models that are sort of released by the.
[00:11:57] Luca Soldaini: Compute rich club. Membership is [00:12:00] hotly debated. You know, some people don't want to be. Called the rich because it comes to expectations. Some people want to be called rich, but I don't know, there's debate, but like, these are players that have, you know, 10, 000, 50, 000 GPUs at minimum. And so they can do a lot of work and a lot of exploration and improving models that it's not very accessible.
[00:12:21] Luca Soldaini: To give you a sense of like how I personally think about. Research budget for each part of the, of the language model pipeline is like on the pre training side, you can maybe do something with a thousand GPUs, really you want 10, 000. And like, if you want real estate of the art, you know, your deep seek minimum is like 50, 000 and you can scale to infinity.
[00:12:44] Luca Soldaini: The more you have, the better it gets. Everyone on that side still complains that they don't have enough GPUs. Post training is a super wide sort of spectrum. You can do as little with like eight GPUs as long as you're able to [00:13:00] run, you know, a good version of, say, a LLAMA model, you can do a lot of work there.
[00:13:05] Luca Soldaini: You can scale a lot of the methodology, just like scales with compute, right? If you're interested in you know, your open replication of what OpenAI's O1 is you're going to be on the 10K spectrum of our GPUs. Inference, you can do a lot with very few resources. Evaluation, you can do a lot with, well, I should say at least one GPUs if you want to evaluate GPUs.
[00:13:30] Luca Soldaini: Open models but in general, like if you are, if you care a lot about intervention to do on this model, which it's my prefer area of, of research, then, you know, the resources that you need are quite, quite significant. Yeah. One other trends that has emerged in 2024 is this cluster of fully open models.
[00:13:54] Luca Soldaini: So Omo the model that we built at ai, two being one of them and you know, it's nice [00:14:00] that it's not just us. There's like a cluster of other mostly research efforts who are working on this. And so it's good to to give you a primer of what like fully open means. So fully open, the easy way to think about it is instead of just releasing a model checkpoint that you run, you release a full recipe so that other people working on it.
[00:14:24] Luca Soldaini: Working on that space can pick and choose whatever they want from your recipe and create their own model or improve on top of your model. You're giving out the full pipeline and all the details there instead of just like the end output. So I pull up the screenshot from our recent MOE model.
[00:14:43] Luca Soldaini: And like for this model, for example, we released the model itself. Data that was trained on, the code, both for training and inference all the logs that we got through the training run, as well as every intermediate checkpoint and like the fact that you release different part of the pipeline [00:15:00] allows others to do really cool things.
[00:15:02] Luca Soldaini: So for example, this tweet from early this year from folks in news research they use our pre training data to do a replication of the BitNet paper in the open. So they took just a Really like the initial part of a pipeline and then the, the thing on top of it. It goes both ways.
[00:15:21] Luca Soldaini: So for example, for the Olmo2 model a lot of our pre trained data for the first stage of pre training was from this DCLM initiative that was led by folks Ooh, a variety of ins a variety of institutions. It was a really nice group effort. But you know, for When it was nice to be able to say, okay, you know, the state of the art in terms of like what is done in the open has improved.
[00:15:46] AI2 Models - Olmo, Molmo, Pixmo etc
[00:15:46] Luca Soldaini: We don't have to like do all this work from scratch to catch up the state of the art. We can just take it directly and integrate it and do our own improvements on top of that. I'm going to spend a few minutes doing like a [00:16:00] shameless plug for some of our fully open recipes. So indulge me in this.
[00:16:05] Luca Soldaini: So a few things that we released this year was, as I was mentioning, there's OMOE model which is, I think still is state of the art MOE model in its size class. And it's also. Fully open, so every component of this model is available. We released a multi modal model called Molmo. Molmo is not just a model, but it's a full recipe of how you go from a text only model to a multi modal model, and we apply this recipe on top of Quent checkpoints, on top of Olmo checkpoints, as well as on top of OlmoE.
[00:16:37] Luca Soldaini: And I think there'd be a replication doing that on top of Mistral as well. The post training side we recently released 2. 0. 3. Same story. This is a recipe on how you go from a base model to A state of the art post training model. We use the Tulu recipe on top of Olmo, on top of Llama, and then there's been open replication effort [00:17:00] to do that on top of Quen as well.
[00:17:02] Luca Soldaini: It's really nice to see like, you know, when your recipe sort of, it's kind of turnkey, you can apply it to different models and it kind of just works. And finally, the last thing we released this year was Olmo 2, which so far is the best state of the art. Fully open language model a Sera combines aspect from all three of these previous models.
[00:17:22] Luca Soldaini: What we learn on the data side from MomoE and what we learn on like making models that are easy to adapt from the Momo project and the Tulu project. I will close with a little bit of reflection of like ways this, this ecosystem of open models like it's not all roses. It's not all happy. It feels like day to day, it's always in peril.
[00:17:44] Luca Soldaini: And, you know, I talked a little bit about like the compute issues that come with it. But it's really not just compute. One thing that is on top of my mind is due to like the environment and how you know, growing feelings about like how AI is treated. [00:18:00] It's actually harder to get access to a lot of the data that was used to train a lot of the models up to last year.
[00:18:06] Luca Soldaini: So this is a screenshot from really fabulous work from Shane Longpre who's, I think is in Europe about Just access of like diminishing access to data for language model pre training. So what they did is they went through every snapshot of common crawl. Common crawl is this publicly available scrape of the, of a subset of the internet.
[00:18:29] Luca Soldaini: And they looked at how For any given website whether a website that was accessible in say 2017, what, whether it was accessible or not in 2024. And what they found is as a reaction to like the close like of the existence of closed models like OpenAI or Cloud GPT or Cloud a lot of content owners have blanket Blocked any type of crawling to your website.
[00:18:57] Luca Soldaini: And this is something that we see also internally at [00:19:00] AI2. Like one project that we started this year is we wanted to, we wanted to understand, like, if you're a good citizen of the internet and you crawl following sort of norms and policy that have been established in the last 25 years, what can you crawl?
[00:19:17] Luca Soldaini: And we found that there's a lot of website where. The norms of how you express preference of whether to crawl your data or not are broken. A lot of people would block a lot of crawling, but do not advertise that in RobustDXT. You can only tell that they're crawling, that they're blocking you in crawling when you try doing it.
[00:19:37] Luca Soldaini: Sometimes you can't even crawl the robots. txt to, to check whether you're allowed or not. And then a lot of websites there's, there's like all these technologies that historically have been, have existed to make websites serving easier such as Cloudflare or DNS. They're now being repurposed for blocking AI or any type of crawling [00:20:00] in a way that is Very opaque to the content owners themselves.
[00:20:04] Luca Soldaini: So, you know, you go to these websites, you try to access them and they're not available and you get a feeling it's like, Oh, someone changed, something changed on the, on the DNS side that it's blocking this and likely the content owner has no idea. They're just using a Cloudflare for better, you know, load balancing.
[00:20:25] Luca Soldaini: And this is something that was sort of sprung on them with very little notice. And I think the problem is this, this blocking or ideas really, it impacts people in different ways. It disproportionately helps companies that have a headstart, which are usually the closed labs and it hurts incoming newcomer players where either have now to do things in a sketchy way or you're never going to get that content that the closed lab might have.
[00:20:54] Luca Soldaini: So there's a lot, it was a lot of coverage. I'm going to plug Nathan's blog post again. That is, [00:21:00] that I think the title of this one is very succinct which is like, we're actually not, You know, before thinking about running out of training data, we're actually running out of open training data. And so if we want better open models they should be on top of our mind.
[00:21:13] Regulation and Lobbying
[00:21:13] Luca Soldaini: The other thing that has emerged is that there is strong lobbying efforts on trying to define any kind of, AI as like a new extremely risky and I want to be precise here. Like the problem is now, um, like the problem is not not considering the risk of this technology. Every technology has risks that, that should always be considered.
[00:21:37] Luca Soldaini: The thing that it's like to me is sorry, is ingenious is like just putting this AI on a pedestal and calling it like, An unknown alien technology that has like new and undiscovered potentials to destroy humanity. When in reality, all the dangers I think are rooted in [00:22:00] dangers that we know from existing software industry or existing issues that come with when using software on on a lot of sensitive domains, like medical areas.
[00:22:13] Luca Soldaini: And I also noticed a lot of efforts that have actually been going on and trying to make this open model safe. I pasted one here from AI2, but there's actually like a lot of work that has been going on on like, okay, how do you make, if you're distributing this model, Openly, how do you make it safe?
[00:22:31] Luca Soldaini: How, what's the right balance between accessibility on open models and safety? And then also there's annoying brushing of sort of concerns that are then proved to be unfounded under the rug. You know, if you remember the beginning of this year, it was all about bio risk of these open models.
[00:22:48] Luca Soldaini: The whole thing fizzled because as being Finally, there's been like rigorous research, not just this paper from Cohere folks, but it's been rigorous research showing [00:23:00] that this is really not a concern that we should be worried about. Again, there is a lot of dangerous use of AI applications, but this one was just like, A lobbying ploy to just make things sound scarier than they actually are.
[00:23:15] Luca Soldaini: So I got to preface this part. It says, this is my personal opinion. It's not my employer, but I look at things like the SP 1047 from, from California. And I think we kind of dodged a bullet on, on this legislation. We, you know, the open source community, a lot of the community came together at the last, sort of the last minute and did a very good effort trying to explain all the negative impact of this bill.
[00:23:43] Luca Soldaini: But There's like, I feel like there's a lot of excitement on building these open models or like researching on these open models. And lobbying is not sexy it's kind of boring but it's sort of necessary to make sure that this ecosystem can, can really [00:24:00] thrive. This end of presentation, I have Some links, emails, sort of standard thing in case anyone wants to reach out and if folks have questions or anything they wanted to discuss.
[00:24:13] Luca Soldaini: Is there an open floor? I think we have Sophia
[00:24:16] swyx: who wants to who one, one very important open model that we haven't covered is Mistral. Ask her on this slide. Yeah, yeah. Well, well, it's nice to have the Mistral person talk recap the year in Mistral. But while Sophia gets set up, does anyone have like, just thoughts or questions about the progress in this space?
[00:24:32] Questions - Incentive Alignment
[00:24:32] swyx: Do you always have questions?
[00:24:34] Quesiton: I'm very curious how we should build incentives to build open models, things like Francois Chollet's ArcPrize, and other initiatives like that. What is your opinion on how we should better align incentives in the community so that open models stay open?
[00:24:49] Luca Soldaini: The incentive bit is, like, really hard.
[00:24:51] Luca Soldaini: Like, even It's something that I actually, even we think a lot about it internally because like building open models is risky. [00:25:00] It's very expensive. And so people don't want to take risky bets. I think the, definitely like the challenges like our challenge, I think those are like very valid approaches for it.
[00:25:13] Luca Soldaini: And then I think in general, promoting, building, so, any kind of effort to participate in this challenge, in those challenges, if we can promote doing that on top of open models and sort of really lean into like this multiplier effect, I think that is a good way to go. If there were more money for that.
[00:25:35] Luca Soldaini: For efforts like research efforts around open models. There's a lot of, I think there's a lot of investments in companies that at the moment are releasing their model in the open, which is really cool. But it's usually more because of commercial interest and not wanting to support this, this like open models in the longterm, it's a really hard problem because I think everyone is operating sort of [00:26:00] in what.
[00:26:01] Luca Soldaini: Everyone is at their local maximum, right? In ways that really optimize their position on the market. Global maximum is harder to achieve.
[00:26:11] Question2: Can I ask one question? No.
[00:26:12] Luca Soldaini: Yeah.
[00:26:13] Question2: So I think one of the gap between the closed and open source models is the mutability. So the closed source models like chat GPT works pretty good on the low resource languages, which is not the same on the open, open source models, right?
[00:26:27] Question2: So is it in your plan to improve on that?
[00:26:32] Luca Soldaini: I think in general,
[00:26:32] Luca Soldaini: yes, is I think it's. I think we'll see a lot of improvements there in, like, 2025. Like, there's groups like, Procurement English on the smaller side that are already working on, like, better crawl support, multilingual support. I think what I'm trying to say here is you really want to be experts.
[00:26:54] Luca Soldaini: who are actually in those countries that teach those languages to [00:27:00] participate in the international community. To give you, like, a very easy example I'm originally from Italy. I think I'm terribly equipped to build a model that works well in Italian. Because one of the things you need to be able to do is having that knowledge of, like, okay, how do I access, you know, how Libraries, or content that is from this region that covers this language.
[00:27:23] Luca Soldaini: I've been in the US long enough that I no longer know. So, I think that's the efforts that folks in Central Europe, for example, are doing. Around like, okay, let's tap into regional communities. To get access you know, to bring in collaborators from those areas. I think it's going to be, like, very crucial for getting products there.
[00:27:46] Mistral intro
[00:27:46] Sophia Yang: Hi everyone. Yeah, I'm super excited to be here to talk to you guys about Mistral. A really short and quick recap of what we have done, what kind of models and products we have released in the [00:28:00] past year and a half. So most of you We have already known that we are a small startup funded about a year and a half ago in Paris in May, 2003, it was funded by three of our co founders, and in September, 2003, we released our first open source model, Mistral 7b yeah, how, how many of you have used or heard about Mistral 7b?
[00:28:24] Sophia Yang: Hey, pretty much everyone. Thank you. Yeah, it's our Pretty popular and community. Our committee really loved this model, and in December 23, we, we released another popular model with the MLE architecture Mr. A X seven B and oh. Going into this year, you can see we have released a lot of things this year.
[00:28:46] Sophia Yang: First of all, in February 2004, we released MrSmall, MrLarge, LeChat, which is our chat interface, I will show you in a little bit. We released an embedding model for, you [00:29:00] know, converting your text into embedding vectors, and all of our models are available. The, the big cloud resources. So you can use our model on Google cloud, AWS, Azure Snowflake, IBM.
[00:29:16] Sophia Yang: So very useful for enterprise who wants to use our model through cloud. And in April and May this year, we released another powerful open source MOE model, AX22B. And we also released our first code. Code Model Coastal, which is amazing at 80 plus languages. And then we provided another fine tuning service for customization.
[00:29:41] Sophia Yang: So because we know the community love to fine tune our models, so we provide you a very nice and easy option for you to fine tune our model on our platform. And also we released our fine tuning code base called Menstrual finetune. It's open source, so feel free to take it. Take a look and.
[00:29:58] Sophia Yang: More models. [00:30:00] On July 2, November this year, we released many, many other models. First of all is the two new small, best small models. We have Minestra 3B great for Deploying on edge devices we have Minstrel 8B if you used to use Minstrel 7B, Minstrel 8B is a great replacement with much stronger performance than Minstrel 7B.
[00:30:25] Sophia Yang: We also collaborated with NVIDIA and open sourced another model, Nemo 12B another great model. And Just a few weeks ago, we updated Mistral Large with the version 2 with the updated, updated state of the art features and really great function calling capabilities. It's supporting function calling in LatentNate.
[00:30:45] Sophia Yang: And we released two multimodal models Pixtral 12b. It's this open source and Pixtral Large just amazing model for, models for not understanding images, but also great at text understanding. So. Yeah, a [00:31:00] lot of the image models are not so good at textual understanding, but pixel large and pixel 12b are good at both image understanding and textual understanding.
[00:31:09] Sophia Yang: And of course, we have models for research. Coastal Mamba is built on Mamba architecture and MathRoll, great with working with math problems. So yeah, that's another model.
[00:31:29] Sophia Yang: Here's another view of our model reference. We have several premier models, which means these models are mostly available through our API. I mean, all of the models are available throughout our API, except for Ministry 3B. But for the premier model, they have a special license. Minstrel research license, you can use it for free for exploration, but if you want to use it for enterprise for production use, you will need to purchase a license [00:32:00] from us.
[00:32:00] Sophia Yang: So on the top row here, we have Minstrel 3b and 8b as our premier model. Minstrel small for best, best low latency use cases, MrLarge is great for your most sophisticated use cases. PixelLarge is the frontier class multimodal model. And, and we have Coastral for great for coding and then again, MrEmbedding model.
[00:32:22] Sophia Yang: And The bottom, the bottom of the slides here, we have several Apache 2. 0 licensed open way models. Free for the community to use, and also if you want to fine tune it, use it for customization, production, feel free to do so. The latest, we have Pixtros 3 12b. We also have Mr. Nemo mum, Coastal Mamba and Mastro, as I mentioned, and we have three legacy models that we don't update anymore.
[00:32:49] Sophia Yang: So we recommend you to move to our newer models if you are still using them. And then, just a few weeks ago, [00:33:00] we did a lot of, uh, improvements to our code interface, Lachette. How many of you have used Lachette? Oh, no. Only a few. Okay. I highly recommend Lachette. It's chat. mistral. ai. It's free to use.
[00:33:16] Sophia Yang: It has all the amazing capabilities I'm going to show you right now. But before that, Lachette in French means cat. So this is actually a cat logo. If you You can tell this is the cat eyes. Yeah. So first of all, I want to show you something Maybe let's, let's take a look at image understanding.
[00:33:36] Sophia Yang: So here I have a receipts and I want to ask, just going to get the prompts. Cool. So basically I have a receipt and I said I ordered I don't know. Coffee and the sausage. How much do I owe? Add a 18 percent tip. So hopefully it was able to get the cost of the coffee and the [00:34:00] sausage and ignore the other things.
[00:34:03] Sophia Yang: And yeah, I don't really understand this, but I think this is coffee. It's yeah. Nine, eight. And then cost of the sausage, we have 22 here. And then it was able to add the cost, calculate the tip, and all that. Great. So, it's great at image understanding, it's great at OCR tasks. So, if you have OCR tasks, please use it.
[00:34:28] Sophia Yang: It's free on the chat. It's also available through our API. And also I want to show you a Canvas example. A lot of you may have used Canvas with other tools before. But, With Lachat, it's completely free again. Here, I'm asking it to create a canvas that's used PyScript to execute Python in my browser.
[00:34:51] Sophia Yang: Let's see if it works. Import this. Okay, so, yeah, so basically it's executing [00:35:00] Python here. Exactly what we wanted. And the other day, I was trying to ask Lachat to create a game for me. Let's see if we can make it work. Yeah, the Tetris game. Yep. Let's just get one row. Maybe. Oh no. Okay. All right. You get the idea. I failed my mission. Okay. Here we go. Yay! Cool. Yeah. So as you can see, Lachet can write, like, a code about a simple game pretty easily. And you can ask Lachet to explain the code. Make updates however you like. Another example. There is a bar here I want to move.
[00:35:48] Sophia Yang: Okay, great, okay. And let's go back to another one. Yeah, we also have web search capabilities. Like, you can [00:36:00] ask what's the latest AI news. Image generation is pretty cool. Generate an image about researchers. Okay. In Vancouver? Yeah, it's Black Forest Labs flux Pro. Again, this is free, so Oh, cool.
[00:36:19] Sophia Yang: I guess researchers here are mostly from University of British Columbia. That's smart. Yeah. So this is Laia ira. Please feel free to use it. And let me know if you have any feedback. We're always looking for improvement and we're gonna release a lot more powerful features in the coming years.
[00:36:37] Sophia Yang: Thank you.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
2024 in Vision [LS Live @ NeurIPS]
22 dec· Latent Space: The AI Engineer Podcast
Happy holidays! We’ll be sharing snippets from Latent Space LIVE! through the break bringing you the best of 2024! We want to express our deepest appreciation to event sponsors AWS, Daylight Computer, Thoth.ai, StrongCompute, Notable Capital, and most of all all our LS supporters who helped fund the gorgeous venue and A/V production!
For NeurIPS last year we did our standard conference podcast coverage interviewing selected papers (that we have now also done for ICLR and ICML), however we felt that we could be doing more to help AI Engineers 1) get more industry-relevant content, and 2) recap 2024 year in review from experts. As a result, we organized the first Latent Space LIVE!, our first in person miniconference, at NeurIPS 2024 in Vancouver.
The single most requested domain was computer vision, and we could think of no one better to help us recap 2024 than our friends at Roboflow, who was one of our earliest guests in 2023 and had one of this year’s top episodes in 2024 again. Roboflow has since raised a $40m Series B!
Links
All the trends and papers they picked:
* Isaac Robinson
* Sora (see our Video Diffusion pod) - extending diffusion from images to video
* SAM 2: Segment Anything in Images and Videos (see our SAM2 pod) - extending prompted masks to full video object segmentation
* DETR Dominancy: DETRs show Pareto improvement over YOLOs
* RT-DETR: DETRs Beat YOLOs on Real-time Object Detection
* LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection
* D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement
* Peter Robicheaux
* MMVP (Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs)
*
* Florence 2 (Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks)
* PalíGemma / PaliGemma 2
* PaliGemma: A versatile 3B VLM for transfer
* PaliGemma 2: A Family of Versatile VLMs for Transfer
* AlMv2 (Multimodal Autoregressive Pre-training of Large Vision Encoders)
* Vik Korrapati - Moondream
Full Talk on YouTube
Want more content like this? Like and subscribe to stay updated on our latest talks, interviews, and podcasts.
Transcript/Timestamps
[00:00:00] Intro
[00:00:05] AI Charlie: welcome to Latent Space Live, our first mini conference held at NeurIPS 2024 in Vancouver. This is Charlie, your AI co host. When we were thinking of ways to add value to our academic conference coverage, we realized that there was a lack of good talks, just recapping the best of 2024, going domain by domain.
[00:00:36] AI Charlie: We sent out a survey to the over 900 of you. who told us what you wanted, and then invited the best speakers in the Latent Space Network to cover each field. 200 of you joined us in person throughout the day, with over 2, 200 watching live online. Our second featured keynote is The Best of Vision 2024, with Peter Robichaud and Isaac [00:01:00] Robinson of Roboflow, with a special appearance from Vic Corrapati of Moondream.
[00:01:05] AI Charlie: When we did a poll of our attendees, the highest interest domain of the year was vision. And so our first port of call was our friends at Roboflow. Joseph Nelson helped us kickstart our vision coverage in episode 7 last year, and this year came back as a guest host with Nikki Ravey of Meta to cover segment Anything 2.
[00:01:25] AI Charlie: Roboflow have consistently been the leaders in open source vision models and tooling. With their SuperVision library recently eclipsing PyTorch's Vision library. And Roboflow Universe hosting hundreds of thousands of open source vision datasets and models. They have since announced a 40 million Series B led by Google Ventures.
[00:01:46] AI Charlie: Woohoo.
[00:01:48] Isaac's picks
[00:01:48] Isaac Robinson: Hi, we're Isaac and Peter from Roboflow, and we're going to talk about the best papers of 2024 in computer vision. So, for us, we defined best as what made [00:02:00] the biggest shifts in the space. And to determine that, we looked at what are some major trends that happened and what papers most contributed to those trends.
[00:02:09] Isaac Robinson: So I'm going to talk about a couple trends, Peter's going to talk about a trend, And then we're going to hand it off to Moondream. So, the trends that I'm interested in talking about are These are a major transition from models that run on per image basis to models that run using the same basic ideas on video.
[00:02:28] Isaac Robinson: And then also how debtors are starting to take over the real time object detection scene from the YOLOs, which have been dominant for years.
[00:02:37] Sora, OpenSora and Video Vision vs Generation
[00:02:37] Isaac Robinson: So as a highlight we're going to talk about Sora, which from my perspective is the biggest paper of 2024, even though it came out in February. Is the what?
[00:02:48] Isaac Robinson: Yeah. Yeah. So just it's a, SORA is just a a post. So I'm going to fill it in with details from replication efforts, including open SORA and related work, such as a stable [00:03:00] diffusion video. And then we're also going to talk about SAM2, which applies the SAM strategy to video. And then how debtors, These are the improvements in 2024 to debtors that are making them a Pareto improvement to YOLO based models.
[00:03:15] Isaac Robinson: So to start this off, we're going to talk about the state of the art of video generation at the end of 2023, MagVIT MagVIT is a discrete token, video tokenizer akin to VQ, GAN, but applied to video sequences. And it actually outperforms state of the art handcrafted video compression frameworks.
[00:03:38] Isaac Robinson: In terms of the bit rate versus human preference for quality and videos generated by autoregressing on these discrete tokens generate some pretty nice stuff, but up to like five seconds length and, you know, not super detailed. And then suddenly a few months later we have this, which when I saw it, it was totally mind blowing to me.
[00:03:59] Isaac Robinson: 1080p, [00:04:00] a whole minute long. We've got light reflecting in puddles. That's reflective. Reminds me of those RTX demonstrations for next generation video games, such as Cyberpunk, but with better graphics. You can see some issues in the background if you look closely, but they're kind of, as with a lot of these models, the issues tend to be things that people aren't going to pay attention to unless they're looking for.
[00:04:24] Isaac Robinson: In the same way that like six fingers on a hand. You're not going to notice is a giveaway unless you're looking for it. So yeah, as we said, SORA does not have a paper. So we're going to be filling it in with context from the rest of the computer vision scene attempting to replicate these efforts. So the first step, you have an LLM caption, a huge amount of videos.
[00:04:48] Isaac Robinson: This, this is a trick that they introduced in Dolly 3, where they train a image captioning model to just generate very high quality captions for a huge corpus and then train a diffusion model [00:05:00] on that. Their Sora and their application efforts also show a bunch of other steps that are necessary for good video generation.
[00:05:09] Isaac Robinson: Including filtering by aesthetic score and filtering by making sure the videos have enough motion. So they're not just like kind of the generators not learning to just generate static frames. So. Then we encode our video into a series of space time latents. Once again, SORA, very sparse in details.
[00:05:29] Isaac Robinson: So the replication related works, OpenSORA actually uses a MAG VIT V2 itself to do this, but swapping out the discretization step with a classic VAE autoencoder framework. They show that there's a lot of benefit from getting the temporal compression, which makes a lot of sense as the Each sequential frames and videos have mostly redundant information.
[00:05:53] Isaac Robinson: So by compressing against, compressing in the temporal space, you allow the latent to hold [00:06:00] a lot more semantic information while avoiding that duplicate. So, we've got our spacetime latents. Possibly via, there's some 3D VAE, presumably a MAG VATV2 and then you throw it into a diffusion transformer.
[00:06:19] Isaac Robinson: So I think it's personally interesting to note that OpenSORA is using a MAG VATV2, which originally used an autoregressive transformer decoder to model the latent space, but is now using a diffusion diffusion transformer. So it's still a transformer happening. Just the question is like, is it?
[00:06:37] Isaac Robinson: Parameterizing the stochastic differential equation is, or parameterizing a conditional distribution via autoregression. It's also it's also worth noting that most diffusion models today, the, the very high performance ones are switching away from the classic, like DDPM denoising diffusion probability modeling framework to rectified flows.
[00:06:57] Isaac Robinson: Rectified flows have a very interesting property that as [00:07:00] they converge, they actually get closer to being able to be sampled with a single step. Which means that in practice, you can actually generate high quality samples much faster. Major problem of DDPM and related models for the past four years is just that they require many, many steps to generate high quality samples.
[00:07:22] Isaac Robinson: So, and naturally, the third step is throwing lots of compute at the problem. So I didn't, I never figured out how to manage to get this video to loop, but we see very little compute, medium compute, lots of compute. This is so interesting because the the original diffusion transformer paper from Facebook actually showed that, in fact, the specific hyperparameters of the transformer didn't really matter that much.
[00:07:48] Isaac Robinson: What mattered was that you were just increasing the amount of compute that the model had. So, I love how in the, once again, little blog posts, they don't even talk about [00:08:00] like the specific hyperparameters. They say, we're using a diffusion transformer, and we're just throwing more compute at it, and this is what happens.
[00:08:08] Isaac Robinson: OpenSora shows similar results. The primary issue I think here is that no one else has 32x compute budget. So we end up with these we end up in the middle of the domain and most of the related work, which is still super, super cool. It's just a little disappointing considering the context. So I think this is a beautiful extension of the framework that was introduced in 22 and 23 for these very high quality per image generation and then extending that to videos.
[00:08:39] Isaac Robinson: It's awesome. And it's GA as of Monday, except no one can seem to get access to it because they keep shutting down the login.
[00:08:46] SAM and SAM2
[00:08:46] Isaac Robinson: The next, so next paper I wanted to talk about is SAM. So we at Roboflow allow users to label data and train models on that data. Sam, for us, has saved our users 75 years of [00:09:00] labeling time.
[00:09:00] Isaac Robinson: We are the, to the best of my knowledge, the largest SAM API that exists. We also, SAM also allows us to have our users train just pure bounding box regression models and use those to generate high quality masks which has the great side effect of requiring less training data to have a meaningful convergence.
[00:09:20] Isaac Robinson: So most people are data limited in the real world. So anything that requires less data to get to a useful thing is that super useful. Most of our users actually run their object per frame object detectors on every frame in a video, or maybe not most, but many, many. And so Sam follows into this category of taking, Sam 2 falls into this category of taking something that really really works and applying it to a video which has the wonderful benefit of being plug and play with most of our Many of our users use cases.
[00:09:53] Isaac Robinson: We're, we're still building out a sufficiently mature pipeline to take advantage of that, but it's, it's in the works. [00:10:00] So here we've got a great example. We can click on cells and then follow them. You even notice the cell goes away and comes back and we can still keep track of it which is very challenging for existing object trackers.
[00:10:14] Isaac Robinson: High level overview of how SAM2 works. We there's a simple pipeline here where we can give, provide some type of prompt and it fills out the rest of the likely masks for that object throughout the rest of the video. So here we're giving a bounding box in the first frame, a set of positive negative points, or even just a simple mask.
[00:10:36] Isaac Robinson: I'm going to assume people are somewhat familiar with SAM. So I'm going to just give a high level overview of how SAM works. You have an image encoder that runs on every frame. SAM two can be used on a single image, in which case the only difference between SAM two and SAM is that image encoder, which Sam used a standard VIT [00:11:00] Sam two replaced that with a hara hierarchical encoder, which gets approximately the same results, but leads to a six times faster inference, which is.
[00:11:11] Isaac Robinson: Excellent, especially considering how in a trend of 23 was replacing the VAT with more efficient backbones. In the case where you're doing video segmentation, the difference is that you actually create a memory bank and you cross attend the features from the image encoder based on the memory bank.
[00:11:31] Isaac Robinson: So the feature set that is created is essentially well, I'll go more into it in a couple of slides, but we take the features from the past couple frames, plus a set of object pointers and the set of prompts and use that to generate our new masks. Then we then fuse the new masks for this frame with the.
[00:11:57] Isaac Robinson: Image features and add that to the memory bank. [00:12:00] It's, well, I'll say more in a minute. The just like SAM, the SAM2 actually uses a data engine to create its data set in that people are, they assembled a huge amount of reference data, used people to label some of it and train the model used the model to label more of it and asked people to refine the predictions of the model.
[00:12:20] Isaac Robinson: And then ultimately the data set is just created from the engine Final output of the model on the reference data. It's very interesting. This paradigm is so interesting to me because it unifies a model in a dataset in a way that is very unique. It seems unlikely that another model could come in and have such a tight.
[00:12:37] Isaac Robinson: So brief overview of how the memory bank works, the paper did not have a great visual, so I'm just, I'm going to fill in a bit more. So we take the last couple of frames from our video. And we take the last couple of frames from our video attend that, along with the set of prompts that we provided, they could come from the future, [00:13:00] they could come from anywhere in the video, as well as reference object pointers, saying, by the way, here's what we've found so far attending to the last few frames has the interesting benefit of allowing it to model complex object motion without actually
[00:13:18] Isaac Robinson: By limiting the amount of frames that you attend to, you manage to keep the model running in real time. This is such an interesting topic for me because one would assume that attending to all of the frames is super essential, or having some type of summarization of all the frames is super essential for high performance.
[00:13:35] Isaac Robinson: But we see in their later ablation that that actually is not the case. So here, just to make sure that there is some benchmarking happening, we just compared to some of the stuff that's came out prior, and indeed the SAM2 strategy does improve on the state of the art. This ablation deep in their dependencies was super interesting to me.
[00:13:59] Isaac Robinson: [00:14:00] We see in section C, the number of memories. One would assume that increasing the count of memories would meaningfully increase performance. And we see that it has some impact, but not the type that you'd expect. And that it meaningfully decreases speed, which justifies, in my mind, just having this FIFO queue of memories.
[00:14:20] Isaac Robinson: Although in the future, I'm super interested to see A more dedicated summarization of all of the last video, not just a stacking of the last frames. So that another extension of beautiful per frame work into the video domain.
[00:14:42] Realtime detection: DETRs > YOLO
[00:14:42] Isaac Robinson: The next trend I'm interested in talking about is this interesting at RoboFlow, we're super interested in training real time object detectors.
[00:14:50] Isaac Robinson: Those are bread and butter. And so we're doing a lot to keep track of what is actually happening in that space. We are finally starting to see something change. So, [00:15:00] for years, YOLOs have been the dominant way of doing real time object detection, and we can see here that they've essentially stagnated.
[00:15:08] Isaac Robinson: The performance between 10 and 11 is not meaningfully different, at least, you know, in this type of high level chart. And even from the last couple series, there's not. A major change so YOLOs have hit a plateau, debtors have not. So we can look here and see the YOLO series has this plateau. And then these RT debtor, LW debtor, and Define have meaningfully changed that plateau so that in fact, the best Define models are plus 4.
[00:15:43] Isaac Robinson: 6 AP on Cocoa at the same latency. So three major steps to accomplish this. The first RT deditor, which is technically a 2023 paper preprint, but published officially in 24, so I'm going to include that. I hope that's okay. [00:16:00] That is showed that RT deditor showed that we could actually match or out speed YOLOs.
[00:16:04] Isaac Robinson: And then LWdebtor showed that pre training is hugely effective on debtors and much less so on YOLOs. And then DeFine added the types of bells and whistles that we expect from these types, this, this arena. So the major improvements that RTdebtor shows was Taking the multi scale features that debtors typically pass into their encoder and decoupling them into a much more efficient transformer encoder.
[00:16:30] Isaac Robinson: The transformer is of course, quadratic complexity. So decreasing the amount of stuff that you pass in at once is super helpful for increasing your runtime or increasing your throughput. So that change basically brought us up to yellow speed and then they do a hardcore analysis on. Benchmarking YOLOs, including the NMS step.
[00:16:54] Isaac Robinson: Once you once you include the NMS in the latency calculation, you see that in fact, these debtors [00:17:00] are outperforming, at least this time, the the, the YOLOs that existed. Then LW debtor goes in and suggests that in fact, the frame, the huge boost here is from pre training. So, this is the define line, and this is the define line without pre training.
[00:17:19] Isaac Robinson: It's within range, it's still an improvement over the YOLOs, but Really huge boost comes from the benefit of pre training. When YOLOx came out in 2021, they showed that they got much better results by having a much, much longer training time, but they found that when they did that, they actually did not benefit from pre training.
[00:17:40] Isaac Robinson: So, you see in this graph from LWdebtor, in fact, YOLOs do have a real benefit from pre training, but it goes away as we increase the training time. Then, the debtors converge much faster. LWdebtor trains for only 50 epochs, RTdebtor is 60 epochs. So, one could assume that, in fact, [00:18:00] the entire extra gain from pre training is that you're not destroying your original weights.
[00:18:06] Isaac Robinson: By relying on this long training cycle. And then LWdebtor also shows superior performance to our favorite data set, Roboflow 100 which means that they do better on the real world, not just on Cocoa. Then Define throws all the bells and whistles at it. Yellow models tend to have a lot of very specific complicated loss functions.
[00:18:26] Isaac Robinson: This Define brings that into the debtor world and shows consistent improvement on a variety of debtor based frameworks. So bring these all together and we see that suddenly we have almost 60 AP on Cocoa while running in like 10 milliseconds. Huge, huge stuff. So we're spending a lot of time trying to build models that work better with less data and debtors are clearly becoming a promising step in that direction.
[00:18:56] Isaac Robinson: The, what we're interested in seeing [00:19:00] from the debtors in this, this trend to next is. Codetter and the models that are currently sitting on the top of the leaderboard for large scale inference scale really well as you switch out the backbone. We're very interested in seeing and having people publish a paper, potentially us, on what happens if you take these real time ones and then throw a Swingy at it.
[00:19:23] Isaac Robinson: Like, do we have a Pareto curve that extends from the real time domain all the way up to the super, super slow but high performance domain? We also want to see people benchmarking in RF100 more, because that type of data is what's relevant for most users. And we want to see more pre training, because pre training works now.
[00:19:43] Isaac Robinson: It's super cool.
[00:19:48] Peter's Picks
[00:19:48] Peter Robicheaux: Alright, so, yeah, so in that theme one of the big things that we're focusing on is how do we get more out of our pre trained models. And one of the lenses to look at this is through sort of [00:20:00] this, this new requirement for like, how Fine grained visual details and your representations that are extracted from your foundation model.
[00:20:08] Peter Robicheaux: So it's sort of a hook for this Oh, yeah, this is just a list of all the the papers that I'm going to mention I just want to make sure I set an actual paper so you can find it later
[00:20:18] MMVP (Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs)
[00:20:18] Peter Robicheaux: Yeah, so sort of the big hook here is that I make the claim that LLMs can't see if you go to if you go to Claude or ChatGPT you ask it to see this Watch and tell me what time it is, it fails, right?
[00:20:34] Peter Robicheaux: And so you could say, like, maybe, maybe the Like, this is, like, a very classic test of an LLM, but you could say, Okay, maybe this, this image is, like, too zoomed out, And it just, like, it'll do better if we increase the resolution, And it has easier time finding these fine grained features, Like, where the watch hands are pointing.
[00:20:53] Peter Robicheaux: Nodice. And you can say, okay, well, maybe the model just doesn't know how to tell time from knowing the position of the hands. But if you actually prompt [00:21:00] it textually, it's very easy for it to tell the time. So this to me is proof that these LLMs literally cannot see the position of the watch hands and it can't see those details.
[00:21:08] Peter Robicheaux: So the question is sort of why? And for you anthropic heads out there, cloud fails too. So the, the, my first pick for best paper of 2024 Envision is this MMVP paper, which tries to investigate the Why do LLMs not have the ability to see fine grained details? And so, for instance, it comes up with a lot of images like this, where you ask it a question that seems very visually apparent to us, like, which way is the school bus facing?
[00:21:32] Peter Robicheaux: And it gets it wrong, and then, of course, it makes up details to support its wrong claim. And so, the process by which it finds these images is sort of contained in its hypothesis for why it can't. See these details. So it hypothesizes that models that have been initialized with, with Clip as their vision encoder, they don't have fine grained details and the, the features extracted using Clip because Clip sort of doesn't need to find these fine grained [00:22:00] details to do its job correctly, which is just to match captions and images, right?
[00:22:04] Peter Robicheaux: And sort of at a high level, even if ChatGPT wasn't initialized with Clip and wasn't trained contrastively at all. The vision encoder wasn't trained contrastively at all. Still, in order to do its job of capturing the image it could do a pretty good job without actually finding the exact position of all the objects and visual features in the image, right?
[00:22:21] Peter Robicheaux: So This paper finds a set of difficult images for these types of models. And the way it does it is it looks for embeddings that are similar in clip space, but far in DynaV2 space. So DynaV2 is a foundation model that was trained self supervised purely on image data. And it kind of uses like some complex student teacher framework, but essentially, and like, it patches out like certain areas of the image or like crops with certain areas of the image and tries to make sure that those have consistent representations, which is a way for it to learn very fine grained visual features.
[00:22:54] Peter Robicheaux: And so if you take things that are very close in clip space and very far in DynaV2 space, you get a set of images [00:23:00] that Basically, pairs of images that are hard for a chat GPT and other big language models to distinguish. So, if you then ask it questions about this image, well, as you can see from this chart, it's going to answer the same way for both images, right?
[00:23:14] Peter Robicheaux: Because to, to, from the perspective of the vision encoder, they're the same image. And so if you ask a question like, how many eyes does this animal have? It answers the same for both. And like all these other models, including Lava do the same thing, right? And so this is the benchmark that they create, which is like finding clip, like clip line pairs, which is pairs of images that are similar in clip space and creating a data set of multiple choice questions based off of those.
[00:23:39] Peter Robicheaux: And so how do these models do? Well, really bad. Lava, I think, So, so, chat2BT and Jim and I do a little bit better than random guessing, but, like, half of the performance of humans who find these problems to be very easy. Lava is, interestingly, extremely negatively correlated with this dataset. It does much, much, much, much worse [00:24:00] than random guessing, which means that this process has done a very good job of identifying hard images for, for Lava, specifically.
[00:24:07] Peter Robicheaux: And that's because Lava is basically not trained for very long and is initialized from Clip, and so You would expect it to do poorly on this dataset. So, one of the proposed solutions that this paper attempts is by basically saying, Okay, well if clip features aren't enough, What if we train the visual encoder of the language model also on dyno features?
[00:24:27] Peter Robicheaux: And so it, it proposes two different ways of doing this. One, additively which is basically interpolating between the two features, and then one is interleaving, which is just kind of like training one on the combination of both features. So there's this really interesting trend when you do the additive mixture of features.
[00:24:45] Peter Robicheaux: So zero is all clip features and one is all DynaV2 features. So. It, as you, so I think it's helpful to look at the right most chart first, which is as you increase the number of DynaV2 features, your model does worse and worse and [00:25:00] worse on the actual language modeling task. And that's because DynaV2 features were trained completely from a self supervised manner and completely in image space.
[00:25:08] Peter Robicheaux: It knows nothing about text. These features aren't really compatible with these text models. And so you can train an adapter all you want, but it seems that it's in such an alien language that it's like a very hard optimization for this. These models to solve. And so that kind of supports what's happening on the left, which is that, yeah, it gets better at answering these questions if as you include more dyna V two features up to a point, but then you, when you oversaturate, it completely loses its ability to like.
[00:25:36] Peter Robicheaux: Answer language and do language tasks. So you can also see with the interleaving, like they essentially double the number of tokens that are going into these models and just train on both, and it still doesn't really solve the MMVP task. It gets Lava 1. 5 above random guessing by a little bit, but it's still not close to ChachiPT or, you know, Any like human performance, obviously.
[00:25:59] Peter Robicheaux: [00:26:00] So clearly this proposed solution of just using DynaV2 features directly, isn't going to work. And basically what that means is that as a as a vision foundation model, DynaV2 is going to be insufficient for language tasks, right?
[00:26:14] Florence 2 (Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks)
[00:26:14] Peter Robicheaux: So my next pick for best paper of 2024 would be Florence 2, which tries to solve this problem by incorporating not only This dimension of spatial hierarchy, which is to say pixel level understanding, but also in making sure to include what they call semantic granularity, which ends up, the goal is basically to have features that are sufficient for finding objects in the image, so they're, they're, they have enough pixel information, but also can be talked about and can be reasoned about.
[00:26:44] Peter Robicheaux: And that's on the semantic granularity axis. So here's an example of basically three different paradigms of labeling that they do. So they, they create a big dataset. One is text, which is just captioning. And you would expect a model that's trained [00:27:00] only on captioning to have similar performance like chat2BT and like not have spatial hierarchy, not have features that are meaningful at the pixel level.
[00:27:08] Peter Robicheaux: And so they add another type, which is region text pairs, which is essentially either classifying a region or You're doing object detection or doing instance segmentation on that region or captioning that region. And then they have text phrased region annotations, which is essentially a triple. And basically, not only do you have a region that you've described, you also find it's like, It's placed in a descriptive paragraph about the image, which is basically trying to introduce even more like semantic understanding of these regions.
[00:27:39] Peter Robicheaux: And so like, for instance, if you're saying a woman riding on the road, right, you have to know what a woman is and what the road is and that she's on top of it. And that's, that's basically composing a bunch of objects in this visual space, but also thinking about it semantically, right? And so the way that they do this is they take basically they just dump Features from a vision encoder [00:28:00] straight into a encoder decoder transformer.
[00:28:03] Peter Robicheaux: And then they train a bunch of different tasks like object detection and so on as a language task. And I think that's one of the big things that we saw in 2024 is these, these vision language models operating in, on pixel space linguistically. So they introduced a bunch of new tokens to point to locations and
[00:28:22] Peter Robicheaux: So how does it work? How does it actually do? We can see if you look at the graph on the right, which is using the, the Dino, the the Dino framework your, your pre trained Florence 2 models transfer very, very well. They get 60%, 60 percent map on Cocoa, which is like approaching state of the art and they train
[00:28:42] Vik Korrapati: with, and they
[00:28:43] Peter Robicheaux: train with a much more more efficiently.
[00:28:47] Peter Robicheaux: So they, they converge a lot faster, which both of these things are pointing to the fact that they're actually leveraging their pre trained weights effectively. So where is it falling short? So these models, I forgot to mention, Florence is a 0. 2 [00:29:00] billion and a 0. 7 billion parameter count. So they're very, very small in terms of being a language model.
[00:29:05] Peter Robicheaux: And I think that. This framework, you can see saturation. So, what this graph is showing is that if you train a Florence 2 model purely on the image level and region level annotations and not including the pixel level annotations, like this, segmentation, it actually performs better as an object detector.
[00:29:25] Peter Robicheaux: And what that means is that it's not able to actually learn all the visual tasks that it's trying to learn because it doesn't have enough capacity.
[00:29:32] PalíGemma / PaliGemma 2
[00:29:32] Peter Robicheaux: So I'd like to see this paper explore larger model sizes, which brings us to our next big paper of 2024 or two papers. So PolyGemma came out earlier this year.
[00:29:42] Peter Robicheaux: PolyGemma 2 was released, I think like a week or two ago. Oh, I forgot to mention, you can actually train You can, like, label text datasets on RoboFlow and you can train a Florence 2 model and you can actually train a PolyGemma 2 model on RoboFlow, which we got into the platform within, like, 14 hours of release, which I was really excited about.
[00:29:59] Peter Robicheaux: So, anyway, so [00:30:00] PolyGemma 2, so PolyGemma is essentially doing the same thing, but instead of doing an encoder decoder, it just dumps everything into a decoder only transformer model. But it also introduced the concept of location tokens to point to objects in pixel space. PolyGemma 2, so PolyGemma uses Gemma as the language encoder, and it uses Gemma2B.
[00:30:17] Peter Robicheaux: PolyGemma 2 introduces using multiple different sizes of language encoders. So, the way that they sort of get around having to do encoder decoder is they use the concept of prefix loss. Which basically means that when it's generating, tokens autoregressively, it's all those tokens in the prefix, which is like the image that it's looking at and like a description of the task that it's trying to do.
[00:30:41] Peter Robicheaux: They're attending to each other fully, full attention. Which means that, you know, it can sort of. Find high level it's easier for the, the prefix to color, to color the output of the suffix and also to just find like features easily. So this is sort of [00:31:00] an example of like one of the tasks that was trained on, which is like, you describe the task in English and then you give it all these, like, You're asking for it to segment these two classes of objects, and then it finds, like, their locations using these tokens, and it finds their masks using some encoding of the masks into tokens.
[00:31:24] Peter Robicheaux: And, yeah, so, one of my critiques, I guess, of PolyGemma 1, at least, is that You find that performance saturates as a pre trained model after only 300 million examples seen. So, what this graph is representing is each blue dot is a performance on some downstream task. And you can see that after seeing 300 million examples, It sort of does equally well on all of the downtrend tasks that they tried it on, which was a lot as 1 billion examples, which to me also kind of suggests a lack of capacity for this model.
[00:31:58] Peter Robicheaux: PolyGemma2, [00:32:00] you can see the results on object detection. So these were transferred to to Coco. And you can see that this sort of also points to an increase in capacity being helpful to the model. You can see as. Both the resolution increases, and the parameter count of the language model increases, performance increases.
[00:32:16] Peter Robicheaux: So resolution makes sense, obviously, it helps to find small images, or small objects in the image. But it also makes sense for another reason, which is that it kind of gives the model a thinking register, and it gives it more tokens to, like, process when making its predictions. But yeah, you could, you could say, oh, 43.
[00:32:30] Peter Robicheaux: 6, that's not that great, like Florence 2 got 60. But this is not Training a dino or a debtor on top of this language or this image encoder. It's doing the raw language modeling task on Cocoa. So it doesn't have any of the bells and whistles. It doesn't have any of the fancy losses. It doesn't even have bipartite graph matching or anything like that.
[00:32:52] Peter Robicheaux: Okay, the big result and one of the reasons that I was really excited about this paper is that they blow everything else away [00:33:00] on MMVP. I mean, 47. 3, sure, that's nowhere near human accuracy, which, again, is 94%, but for a, you know, a 2 billion language, 2 billion parameter language model to be chat2BT, that's quite the achievement.
[00:33:12] Peter Robicheaux: And that sort of brings us to our final pick for paper of the year, which is AIMV2. So, AIMV2 sort of says, okay, Maybe this language model, like, maybe coming up with all these specific annotations to find features and with high fidelity and pixel space isn't actually necessary. And we can come up with an even simpler, more beautiful idea for combining you know, image tokens and pixel tokens in a way that's interfaceable for language tasks.
[00:33:44] Peter Robicheaux: And this is nice because it can scale, you can come up with lots more data if you don't have to come up with all these annotations, right? So the way that it works. is it does something very, very similar to PolyGemo, where you have a vision encoder that dumps image tokens into a decoder only transformer.
[00:33:59] Peter Robicheaux: But [00:34:00] the interesting thing is that it also autoregressively tries to learn the mean squared error of the image tokens. So instead of having to come up with fancy object detection or semantic, or segment, or segmentation labels, you can just try to reconstruct the image and have it learn fine grained features that way.
[00:34:16] Peter Robicheaux: And it does this in kind of, I think, a beautiful way that's kind of compatible with the PolyGemma line of thinking, which is randomly sampling a prefix line of thinking Prefix length and using only this number of image tokens as the prefix. And so doing a similar thing with the causal. So the causal with prefix is the, the attention mask on the right.
[00:34:35] Peter Robicheaux: So it's doing full block attention with some randomly sampled number of image tokens to then reconstruct the rest of the image and the downstream caption for that image. And so, This is the dataset that they train on. It's image or internet scale data, very high quality data created by the data filtering networks paper, essentially which is maybe The best clip data that exists.
[00:34:59] Peter Robicheaux: [00:35:00] And we can see that this is finally a model that doesn't saturate. It's even at the highest parameter count, it's, it appears to be, oh, at the highest parameter account, it appears to be improving in performance with more and more samples seen. And so you can sort of think that. You know, if we just keep bumping the parameter count and increasing the example scene, which is the, the, the line of thinking for language models, then it'll keep getting better.
[00:35:27] Peter Robicheaux: So how does it actually do at finding, oh, it also improves with resolution, which you would expect for a model that This is the ImageNet classification accuracy, but yeah, it does better if you increase the resolution, which means that it's actually leveraging and finding fine grained visual features.
[00:35:44] Peter Robicheaux: And so how does that actually do compared to CLIP on Cocoa? Well, you can see that if you slap a transformer detection head on it, Entry now in Cocoa, it's just 60. 2, which is also within spitting distance of Soda, which means that it does a very good job of [00:36:00] finding visual features, but you could say, okay, well, wait a second.
[00:36:03] Peter Robicheaux: Clip got to 59. 1, so. Like, how does this prove your claim at all? Because doesn't that mean like clip, which is known to be clip blind and do badly on MMVP, it's able to achieve a very high performance on fine, on this fine grained visual features task of object detection, well, they train on like, Tons of data.
[00:36:24] Peter Robicheaux: They train on like objects, 365, Cocoa, Flickr and everything else. And so I think that this benchmark doesn't do a great job of selling how good of a pre trained model MV2 is. And we would like to see the performance on fewer data as examples and not trained to convergence on object detection. So seeing it in the real world on like a dataset, like RoboFlow 100, I think would be quite interesting.
[00:36:48] Peter Robicheaux: And our, our, I guess our final, final pick for paper of 2024 would be Moondream. So introducing Vic to talk about that.
[00:36:54] swyx: But overall, that was exactly what I was looking for. Like best of 2024, an amazing job. Yeah, you can, [00:37:00] if there's any other questions while Vic gets set up, like vision stuff,
[00:37:07] swyx: yeah,
[00:37:11] swyx: Vic, go ahead. Hi,
[00:37:13] Vik Korrapati / Moondream
[00:37:13] question: well, while we're getting set up, hi, over here, thanks for the really awesome talk. One of the things that's been weird and surprising is that the foundation model companies Even these MLMs, they're just like worse than RT Tether at detection still. Like, if you wanted to pay a bunch of money to auto label your detection dataset, If you gave it to OpenAI or Cloud, that would be like a big waste.
[00:37:37] question: So I'm curious, just like, even Pali Gemma 2, like is worse. So, so I'm curious to hear your thoughts on like, how come, Nobody's cracked the code on like a generalist that really you know, beats a specialist model in computer vision like they have in in LLM land.[00:38:00]
[00:38:01] Isaac Robinson: Okay. It's a very, very interesting question. I think it depends on the specific domain. For image classification, it's basically there. In the, in AIMv2 showed, a simple attentional probe on the pre trained features gets like 90%, which is as well as anyone does. The, the, the, the bigger question, like, why isn't it transferring to object detection, especially like real time object detection.
[00:38:25] Isaac Robinson: I think, in my mind, there are two answers. One is, object detection is really, really, really the architectures are super domain specific. You know, we see these, all these super, super complicated things, and it's not super easy to, to, to build something that just transfers naturally like that, whereas image classification, you know, clip pre training transfers super, super quickly.
[00:38:48] Isaac Robinson: And the other thing is, until recently, the real time object detectors didn't even really benefit from pre training. Like, you see the YOLOs that are like, essentially saturated, showing very little [00:39:00] difference with pre training improvements, with using pre trained model at all. It's not surprising, necessarily, that People aren't looking at the effects of better and better pre training on real time detection.
[00:39:12] Isaac Robinson: Maybe that'll change in the next year. Does that answer your question?
[00:39:17] Peter Robicheaux: Can you guys hear me? Yeah, one thing I want to add is just like, or just to summarize, basically, is that like, Until 2024, you know, we haven't really seen a combination of transformer based object detectors and fancy losses, and PolyGemma suffers from the same problem, which is basically to say that these ResNet, or like the convolutional models, they have all these, like, extreme optimizations for doing object detection, but essentially, I think it's kind of been shown now that convolution models like just don't benefit from pre training and just don't like have the level of intelligence of transformer models.
[00:39:56] swyx: Awesome. Hi,
[00:39:59] Vik Korrapati: can [00:40:00] you hear me?
[00:40:01] swyx: Cool. I hear you. See you. Are you sharing your screen?
[00:40:04] Vik Korrapati: Hi. Might have forgotten to do that. Let me do
[00:40:07] swyx: that. Sorry, should have done
[00:40:08] Vik Korrapati: that.
[00:40:17] swyx: Here's your screen. Oh, classic. You might have to quit zoom and restart. What? It's fine. We have a capture of your screen.
[00:40:34] swyx: So let's get to it.
[00:40:35] Vik Korrapati: Okay, easy enough.
[00:40:49] Vik Korrapati: All right. Hi, everyone. My name is Vic. I've been working on Moondream for almost a year now. Like Shawn mentioned, I just went and looked and it turns out the first version I released December [00:41:00] 29, 2023. It's been a fascinating journey. So Moonbeam started off as a tiny vision language model. Since then, we've expanded scope a little bit to also try and build some tooling, client libraries, et cetera, to help people really deploy it.
[00:41:13] Vik Korrapati: Unlike traditional large models that are focused at assistant type use cases, we're laser focused on building capabilities that developers can, sorry, it's yeah, we're basically focused on building capabilities that developers can use to build vision applications that can run anywhere. So, in a lot of cases for vision more so than for text, you really care about being able to run on the edge, run in real time, etc.
[00:41:40] Vik Korrapati: So That's really important. We have we have different output modalities that we support. There's query where you can ask general English questions about an image and get back human like answers. There's captioning, which a lot of our users use for generating synthetic datasets to then train diffusion models and whatnot.
[00:41:57] Vik Korrapati: We've done a lot of work to minimize those sessions there. [00:42:00] So that's. Use lot. We have open vocabulary object detection built in similar to a couple of more recent models like Palagem, et cetera, where rather than having to train a dedicated model, you can just say show me soccer balls in this image or show me if there are any deer in this image, it'll detect it.
[00:42:14] Vik Korrapati: More recently, earlier this month, we released pointing capability where if all you're interested in is the center of an object you can just ask it to point out where that is. This is very useful when you're doing, you know, I automation type stuff. Let's see, LA we, we have two models out right now.
[00:42:33] Vik Korrapati: There's a general purpose to be para model, which runs fair. Like it's, it's it's fine if you're running on server. It's good for our local Amma desktop friends and it can run on flagship, flagship mobile phones, but it never. so much for joining us today, and we'll see you in the [00:43:00] next one. Less memory even with our not yet fully optimized inference client.
[00:43:06] Vik Korrapati: So the way we built our 0. 5b model was to start with the 2 billion parameter model and prune it while doing continual training to retain performance. We, our objective during the pruning was to preserve accuracy across a broad set of benchmarks. So the way we went about it was to estimate the importance of different components of the model, like attention heads, channels MLP rows and whatnot using basically a technique based on the gradient.
[00:43:37] Vik Korrapati: I'm not sure how much people want to know details. We'll be writing a paper about this, but feel free to grab me if you have more questions. Then we iteratively prune a small chunk that will minimize loss and performance retrain the model to recover performance and bring it back. The 0. 5b we released is more of a proof of concept that this is possible.
[00:43:54] Vik Korrapati: I think the thing that's really exciting about this is it makes it possible for for developers to build using the 2B param [00:44:00] model and just explore, build their application, and then once they're ready to deploy figure out what exactly they need out of the model and prune those capabilities into a smaller form factor that makes sense for their deployment target.
[00:44:12] Vik Korrapati: So yeah, very excited about that. Let me talk to you folks a little bit about another problem I've been working on recently, which is similar to the clocks example we've been talking about. We had a customer reach out who was talking about, like, who had a bunch of gauges out in the field. This is very common in manufacturing and oil and gas, where you have a bunch of analog devices that you need to monitor.
[00:44:34] Vik Korrapati: It's expensive to. And I was like, okay, let's have humans look at that and monitor stuff and make sure that the system gets shut down when the temperature goes over 80 or something. So I was like, yeah, this seems easy enough. Happy to, happy to help you distill that. Let's, let's get it going. Turns out our model couldn't do it at all.
[00:44:51] Vik Korrapati: I went and looked at other open source models to see if I could just generate a bunch of data and learn from that. Did not work either. So I was like, let's look at what the folks with [00:45:00] hundreds of billions of dollars in market cap have to offer. And yeah, that doesn't work either. My hypothesis is that like the, the way these models are trained are using a large amount of image text data scraped from the internet.
[00:45:15] Vik Korrapati: And that can be biased. In the case of gauges, most gauge images aren't gauges in the wild, they're product images. Detail images like these, where it's always set to zero. It's paired with an alt text that says something like GIVTO, pressure sensor, PSI, zero to 30 or something. And so the models are fairly good at picking up those details.
[00:45:35] Vik Korrapati: It'll tell you that it's a pressure gauge. It'll tell you what the brand is, but it doesn't really learn to pay attention to the needle over there. And so, yeah, that's a gap we need to address. So naturally my mind goes to like, let's use synthetic data to, Solve this problem. That works, but it's problematic because it turned out we needed millions of synthetic gauge images to get to reasonable performance.
[00:45:57] Vik Korrapati: And thinking about it, reading a gauge is like [00:46:00] not a one, like it's not a zero short process in our minds, right? Like if you had to tell me the reading in Celsius for this, Real world gauge. There's two dials on there. So first you have to figure out which one you have to be paying attention to, like the inner one or the outer one.
[00:46:14] Vik Korrapati: You look at the tip of the needle, you look at what labels it's between, and you count how many and do some math to figure out what that probably is. So what happens if we just add that as a Chain of thought to give the model better understanding of the different sub, to allow the model to better learn the subtasks it needs to perform to accomplish this goal.
[00:46:37] Vik Korrapati: So you can see in this example, this was actually generated by the latest version of our model. It's like, okay, Celsius is the inner scale. It's between 50 and 60. There's 10 ticks. So the second tick, it's a little debatable here, like there's a weird shadow situation going on, the dial is off, so I don't know what the ground truth is, but it works okay.
[00:46:57] Vik Korrapati: There's points on there that are, the points [00:47:00] over there are actually grounded. I don't know if this is easy to see, but when I click on those, there's a little red dot that moves around on the image. The model actually has to predict where this points are, I was already trying to do this with bounding boxes, but then Malmo came out with pointing capabilities.
[00:47:15] Vik Korrapati: And it's like pointing is a much better paradigm to to represent this. We see pretty good results. This one's actually for clock reading. I couldn't find our chart for gauge reading at the last minute. So the light. Blue chart is with our rounded chain of thought. This measures, we have, we built a clock reading benchmark about 500 images.
[00:47:37] Vik Korrapati: This measures accuracy on that. You can see it's a lot more sample efficient when you're using the chain of thought to model. Another big benefit from this approach is like, you can kind of understand how the model is. it and how it's failing. So in this example, the actual correct reading is 54 Celsius, the model output [00:48:00] 56, not too bad but you can actually go and see where it messed up. Like it got a lot of these right, except instead of saying it was on the 7th tick, it actually predicted that it was the 8th tick and that's why it went with 56.
[00:48:14] Vik Korrapati: So now that you know that this. Failing in this way, you can adjust how you're doing the chain of thought to maybe say like, actually count out each tick from 40, instead of just trying to say it's the eighth tick. Or you might say like, okay, I see that there's that middle thing, I'll count from there instead of all the way from 40.
[00:48:31] Vik Korrapati: So helps a ton. The other thing I'm excited about is a few short prompting or test time training with this. Like if a customer has a specific gauge that like we're seeing minor errors on, they can give us a couple of examples where like, if it's miss detecting the. Needle, they can go in and correct that in the chain of thought.
[00:48:49] Vik Korrapati: And hopefully that works the next time. Now, exciting approach, we only apply it to clocks and gauges. The real question is, is it going to generalize? Probably, like, there's some science [00:49:00] from text models that when you train on a broad number of tasks, it does generalize. And I'm seeing some science with our model as well.
[00:49:05] Vik Korrapati: So, in addition to the image based chain of thought stuff, I also added some spelling based chain of thought to help it understand better understand OCR, I guess. I don't understand why everyone doesn't do this, by the way. Like, it's trivial benchmark question. It's Very, very easy to nail. But I also wanted to support it for stuff like license plate, partial matching, like, hey, does any license plate in this image start with WHA or whatever?
[00:49:29] Vik Korrapati: So yeah, that sort of worked. All right, that, that ends my story about the gauges. If you think about what's going on over here it's interesting that like LLMs are showing enormous. Progress in reasoning, especially with the latest set of models that we've seen, but we're not really seeing, I have a feeling that VLMs are lagging behind, as we can see with these tasks that should be very simple for a human to do [00:50:00] that are very easy to find VLMs failing at.
[00:50:04] Vik Korrapati: My hypothesis on why this is the case is because On the internet, there's a ton of data that talks about how to reason. There's books about how to solve problems. There's books critiquing the books about how to solve problems. But humans are just so good at perception that we never really talk about it.
[00:50:20] Vik Korrapati: Like, maybe in art books where it's like, hey, to show that that mountain is further away, you need to desaturate it a bit or whatever. But the actual data on how to, like, look at images is, isn't really present. Also, the Data we have is kind of sketched. The best source of data we have is like image all text pairs on the internet and that's pretty low quality.
[00:50:40] Vik Korrapati: So yeah, I, I think our solution here is really just we need to teach them how to operate on individual tasks and figure out how to scale that out. All right. Yep. So conclusion. At Moondream we're trying to build amazing PLMs that run everywhere. Very hard problem. Much work ahead, but we're making a ton of progress and I'm really excited [00:51:00] about If anyone wants to chat about more technical details about how we're doing this or interest in collaborating, please, please hit me up.
[00:51:08] Isaac Robinson: Yeah,
[00:51:09] swyx: like, I always, when people say, when people say multi modality, like, you know, I always think about vision as the first among equals in all the modalities. So, I really appreciate having the experts in the room.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
2024 in AI Startups [LS Live @ NeurIPS]
21 dec· Latent Space: The AI Engineer Podcast
Happy holidays! We’ll be sharing snippets from Latent Space LIVE! through the break bringing you the best of 2024 from friends of the pod!
For NeurIPS last year we did our standard conference podcast coverage interviewing selected papers (that we have now also done for ICLR and ICML), however we felt that we could be doing more to help AI Engineers 1) get more industry-relevant content, and 2) recap 2024 year in review from experts. As a result, we organized the first Latent Space LIVE!, our first in person miniconference, at NeurIPS 2024 in Vancouver.
For our opening keynote, we could think of no one better to cover 'The State of AI Startups' than our friend Sarah Guo (AI superinvestor, founder of Conviction, host of No Priors!) and Pranav Reddy (Conviction partner) to share their takes on how the AI landscape evolved in 2024 examine the evolving AI landscape and what it means for startups, enterprises, and the industry as a whole! They completely understood the assignment.
Recorded live with 200+ in-person and 2200+ online attendees at NeurIPS 2024, this keynote kicks off our mini-conference series exploring different domains of AI development in 2024. Enjoy!
Links
Slides: https://x.com/saranormous/status/1866933642401886707
Sarh Guo: https://x.com/saranormous
Pranav Reddy: https://x.com/prnvrdy
Full Video on YouTube
Want more content like this? Like and subscribe to stay updated on our latest talks, interviews, and podcasts.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
Windsurf: The Enterprise AI IDE - with Varun and Anshul of Codeium AI
13 dec· Latent Space: The AI Engineer Podcast
Our second podcast guest ever in March 2023 was Varun Mohan, CEO of Codeium; at the time, they had around 10,000 users and how they vowed to keep their autocomplete free forever: Today, over a million developers use their products, they still have their free tier, and they recently launched Windsurf, an AI IDE.
Chapters
* 00:00:00: Introductions & Catchup
* 00:03:52: Why they created Windsurf
* 00:05:52: Limitations of VS Code
* 00:10:12: Evaluation methods for Cascade and Windsurf
* 00:16:15: Listener questions about Windsurf launch
* 00:20:30: Remote execution and security concerns
* 00:25:18: Evolution of Codeium's strategy
* 00:28:29: Cascade and its capabilities
* 00:33:12: Multi-agent systems
* 00:37:02: Areas of improvement for Windsurf
* 00:39:12: Building an enterprise-first company
* 00:42:01: Copilot for X, AI UX, and Enterprise AI blog posts

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
Generative Video WorldSim, Diffusion, Vision, Reinforcement Learning and Robotics — ICML 2024 Part 1
10 dec· Latent Space: The AI Engineer Podcast
Regular tickets are now sold out for Latent Space LIVE! at NeurIPS! We have just announced our last speaker and newest track, friend of the pod Nathan Lambert who will be recapping 2024 in Reasoning Models like o1! We opened up a handful of late bird tickets for those who are deciding now — use code DISCORDGANG if you need it. See you in Vancouver!
We’ve been sitting on our ICML recordings for a while (from today’s first-ever SOLO guest cohost, Brittany Walker), and in light of Sora Turbo’s launch (blogpost, tutorials) today, we figured it would be a good time to drop part one which had been gearing up to be a deep dive into the state of generative video worldsim, with a seamless transition to vision (the opposite modality), and finally robots (their ultimate application).
Sora, Genie, and the field of Generative Video World Simulators
Bill Peebles, author of Diffusion Transformers, gave his most recent Sora talk at ICML, which begins our episode:
* William (Bill) Peebles - SORA (slides)
Something that is often asked about Sora is how much inductive biases were introduced to achieve these results. Bill references the same principles brought by Hyung Won Chung from the o1 team - “sooner or later those biases come back to bite you”.
We also recommend these reads from throughout 2024 on Sora.
* Lilian Weng’s literature review of Video Diffusion Models
* Sora API leak
* Estimates of 100k-700k H100s needed to serve Sora (not Turbo)
* Artist guides on using Sora for professional storytelling
Google DeepMind had a remarkably strong presence at ICML on Video Generation Models, winning TWO Best Paper awards for:
* Genie: Generative Interactive Environments (covered in oral, poster, and workshop)
* VideoPoet: A Large Language Model for Zero-Shot Video Generation (see website)
We end this part by taking in Tali Dekel’s talk on The Future of Video Generation: Beyond Data and Scale.
Part 2: Generative Modeling and Diffusion
Since 2023, Sander Dieleman’s perspectives (blogpost, tweet) on diffusion as “spectral autoregression in the frequency domain” while working on Imagen and Veo have caught the public imagination, so we highlight his talk:
* Wading through the noise: an intuitive look at diffusion models
Then we go to Ben Poole for his talk on Inferring 3D Structure with 2D Priors, including his work on NeRFs and DreamFusion:
Then we investigate two flow matching papers - one from the Flow Matching co-authors - Ricky T. Q. Chen (FAIR, Meta)
And how it is implemented in Stable Diffusion 3 with Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Our last hit on Diffusion is a couple of oral presentations on speech, which we leave you to explore via our audio podcast
* NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
* Speech Self-Supervised Learning Using Diffusion Model Synthetic Data
Part 3: Vision
The ICML Test of Time winner was DeCAF, which Trevor Darrell notably called “the OG vision foundation model”.
Lucas Beyer’s talk on “Vision in the age of LLMs — a data-centric perspective” was also well received online, and he talked about his journey from Vision Transformers to PaliGemma.
We give special honorable mention to MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark.
Part 4: Reinforcement Learning and Robotics
We segue vision into robotics with the help of Ashley Edwards, whose work on both the Gato and the Genie teams at Deepmind is summarized in Learning actions, policies, rewards, and environments from videos alone.
Brittany highlighted two poster session papers:
* Behavior Generation with Latent Actions
* We also recommend Lerrel Pinto’s On Building General-Purpose Robots
* PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
However we must give the lion’s share of space to Chelsea Finn, now founder of Physical Intelligence, who gave FOUR talks on
* "What robots have taught me about machine learning"
* developing robot generalists
* robots that adapt autonomously
* how to give feedback to your language model
* special mention to PI colleague Sergey Levine on Robotic Foundation Models
We end the podcast with a position paper that links generative environments and RL/robotics: Automatic Environment Shaping is the Next Frontier in RL.
Timestamps
* [00:00:00] Intros
* [00:02:43] Sora - Bill Peebles
* [00:44:52] Genie: Generative Interactive Environments
* [01:00:17] Genie interview
* [01:12:33] VideoPoet: A Large Language Model for Zero-Shot Video Generation
* [01:30:51] VideoPoet interview - Dan Kondratyuk
* [01:42:00] Tali Dekel - The Future of Video Generation: Beyond Data and Scale.
* [02:27:07] Sander Dieleman - Wading through the noise: an intuitive look at diffusion models
* [03:06:20] Ben Poole - Inferring 3D Structure with 2D Priors
* [03:30:30] Ricky Chen - Flow Matching
* [04:00:03] Patrick Esser - Stable Diffusion 3
* [04:14:30] NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
* [04:27:00] Speech Self-Supervised Learning Using Diffusion Model Synthetic Data
* [04:39:00] ICML Test of Time winner: DeCAF
* [05:03:40] Lucas Beyer: “Vision in the age of LLMs — a data-centric perspective”
* [05:42:00] Ashley Edwards: Learning actions, policies, rewards, and environments from videos alone.
* [06:03:30] Behavior Generation with Latent Actions interview
* [06:09:52] Chelsea Finn: "What robots have taught me about machine learning"
* [06:56:00] Position: Automatic Environment Shaping is the Next Frontier in RL

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
Bolt.new, Flow Engineering for Code Agents, and >$8m ARR in 2 months as a Claude Wrapper
2 dec· Latent Space: The AI Engineer Podcast
The full schedule for Latent Space LIVE! at NeurIPS has been announced, featuring Best of 2024 overview talks for the AI Startup Landscape, Computer Vision, Open Models, Transformers Killers, Synthetic Data, Agents, and Scaling, and speakers from Sarah Guo of Conviction, Roboflow, AI2/Meta, Recursal/Together, HuggingFace, OpenHands and SemiAnalysis. Join us for the IRL event/Livestream!
Alessio will also be holding a meetup at AWS Re:Invent in Las Vegas this Wednesday. See our new Events page for dates of AI Engineer Summit, Singapore, and World’s Fair in 2025. LAST CALL for questions for our big 2024 recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!
When we first observed that GPT Wrappers are Good, Actually, we did not even have Bolt on our radar. Since we recorded our Anthropic episode discussing building Agents with the new Claude 3.5 Sonnet, Bolt.new (by Stackblitz) has easily cleared the $8m ARR bar, repeating and accelerating its initial $4m feat.
There are very many AI code generators and VS Code forks out there, but Bolt probably broke through initially because of its incredible zero shot low effort app generation:
But as we explain in the pod, Bolt also emphasized deploy (Netlify)/ backend (Supabase)/ fullstack capabilities on top of Stackblitz’s existing WebContainer full-WASM-powered-developer-environment-in-the-browser tech. Since then, the team has been shipping like mad (with weekly office hours), with bugfixing, full screen, multi-device, long context, diff based edits (using speculative decoding like we covered in Inference, Fast and Slow).
All of this has captured the imagination of low/no code builders like Greg Isenberg and many others on YouTube/TikTok/Reddit/X/Linkedin etc:
Just as with Fireworks, our relationship with Bolt/Stackblitz goes a bit deeper than normal - swyx advised the launch and got a front row seat to this epic journey, as well as demoed it with Realtime Voice at the recent OpenAI Dev Day. So we are very proud to be the first/closest to tell the full open story of Bolt/Stackblitz!
Flow Engineering + Qodo/AlphaCodium Update
In year 2 of the pod we have been on a roll getting former guests to return as guest cohosts (Harrison Chase, Aman Sanger, Jon Frankle), and it was a pleasure to catch Itamar Friedman back on the pod, giving us an update on all things Qodo and Testing Agents from our last catchup a year and a half ago:
Qodo (they renamed in September) went viral in early January this year with AlphaCodium (paper here, code here) beating DeepMind’s AlphaCode with high efficiency:
With a simple problem solving code agent:
* The first step is to have the model reason about the problem. They describe it using bullet points and focus on the goal, inputs, outputs, rules, constraints, and any other relevant details.
* Then, they make the model reason about the public tests and come up with an explanation of why the input leads to that particular output.
* The model generates two to three potential solutions in text and ranks them in terms of correctness, simplicity, and robustness.
* Then, it generates more diverse tests for the problem, covering cases not part of the original public tests.
* Iteratively, pick a solution, generate the code, and run it on a few test cases.
* If the tests fail, improve the code and repeat the process until the code passes every test.
swyx has previously written similar thoughts on types vs tests for putting bounds on program behavior, but AlphaCodium extends this to AI generated tests and code.
More recently, Itamar has also shown that AlphaCodium’s techniques also extend well to the o1 models:
Making Flow Engineering a useful technique to improve code model performance on every model. This is something we see AI Engineers uniquely well positioned to do compared to ML Engineers/Researchers.
Full Video Podcast
Like and subscribe!
Show Notes
* Itamar
* Qodo
* First episode
* Eric
* Bolt
* StackBlitz
* Thinkster
* AlphaCodium
* WebContainers
Chapters
* 00:00:00 Introductions & Updates
* 00:06:01 Generic vs. Specific AI Agents
* 00:07:40 Maintaining vs Creating with AI
* 00:17:46 Human vs Agent Computer Interfaces
* 00:20:15 Why Docker doesn't work for Bolt
* 00:24:23 Creating Testing and Code Review Loops
* 00:28:07 Bolt's Task Breakdown Flow
* 00:31:04 AI in Complex Enterprise Environments
* 00:41:43 AlphaCodium
* 00:44:39 Strategies for Breaking Down Complex Tasks
* 00:45:22 Building in Open Source
* 00:50:35 Choosing a product as a founder
* 00:59:03 Reflections on Bolt Success
* 01:06:07 Building a B2C GTM
* 01:18:11 AI Capabilities and Pricing Tiers
* 01:20:28 What makes Bolt unique
* 01:23:07 Future Growth and Product Development
* 01:29:06 Competitive Landscape in AI Engineering
* 01:30:01 Advice to Founders and Embracing AI
* 01:32:20 Having a baby and completing an Iron Man
Transcript
Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.
Swyx [00:00:12]: Hey, and today we're still in our sort of makeshift in-between studio, but we're very delighted to have a former returning guest host, Itamar. Welcome back.
Itamar [00:00:21]: Great to be here after a year or more. Yeah, a year and a half.
Swyx [00:00:24]: You're one of our earliest guests on Agents. Now you're CEO co-founder of Kodo. Right. Which has just been renamed. You also raised a $40 million Series A, and we can get caught up on everything, but we're also delighted to have our new guest, Eric. Welcome.
Eric [00:00:42]: Thank you. Excited to be here. Should I say Bolt or StackBlitz?
Swyx [00:00:45]: Like, is it like its own company now or?
Eric [00:00:47]: Yeah. Bolt's definitely bolt.new. That's the thing that we're probably the most known for, I imagine, at this point.
Swyx [00:00:54]: Which is ridiculous to say because you were working at StackBlitz for so long.
Eric [00:00:57]: Yeah. I mean, within a week, we were doing like double the amount of traffic. And StackBlitz had been online for seven years, and we were like, what? But anyways, yeah. So we're StackBlitz, the company behind bolt.new. If you've heard of bolt.new, that's our stuff. Yeah.
Swyx [00:01:12]: Yeah.
Itamar [00:01:13]: Excellent. I see, by the way, that the founder mode, you need to know to capture opportunities. So kudos on doing that, right? You're working on some technology, and then suddenly you can exploit that to a new world. Yeah.
Eric [00:01:24]: Totally. And I think, well, not to jump, but 100%, I mean, a couple of months ago, we had the idea for Bolt earlier this year, but we haven't really shared this too much publicly. But we actually had tried to build it with some of those state-of-the-art models back in January, February, you can kind of imagine which, and they just weren't good enough to actually do the code generation where the code was accurate and it was fast and whatever have you without a ton of like rag, but then there was like issues with that. So we put it on the shelf and then we got kind of a sneak peek of some of the new models that have come out in the past couple of months now. And so once we saw that, once we actually saw the code gen from it, we were like, oh my God, like, okay, we can build a product around this. And so that was really the impetus of us building the thing. But with that, it was StackBlitz, the core StackBlitz product the past seven years has been an IDE for developers. So the entire user experience flow we've built up just didn't make sense. And so when we kind of went out to build Bolt, we just thought, you know, if we were inventing our product today, what would the interface look like given what is now possible with the AI code gen? And so there's definitely a lot of conversations we had internally, but you know, just kind of when we logically laid it out, we were like, yeah, I think it makes sense to just greenfield a new thing and let's see what happens. If it works great, then we'll figure it out. If it doesn't work great, then it'll get deleted at some point. So that's kind of how it actually came to be.
Swyx [00:02:49]: I'll mention your background a little bit. You were also founder of Thinkster before you started StackBlitz. So both of you are second time founders. Both of you have sort of re-founded your company recently. Yours was more of a rename. I think a slightly different direction as well. And then we can talk about both. Maybe just chronologically, should we get caught up on where Kodo is first and then you know, just like what people should know since the last pod? Sure.
Itamar [00:03:12]: The last pod was two months after we launched and we basically had the vision that we talked about. The idea that software development is about specification, test and code, etc. We are more on the testing part as in essence, we think that if you solve testing, you solve software development. The beautiful chart that we'll put up on screen. And testing is a really big field, like there are many dimensions, unit testing, the level of the component, how big it is, how large it is. And then there is like different type of testing, is it regression or smoke or whatever. So back then we only had like one ID extension with unit tests as in focus. One and a half year later, first ID extension supports more type of testing as context aware. We index local, local repos, but also 10,000s of repos for Fortune 500 companies. We have another agent, another tool that is called, the pure agent is the open source and the commercial one is CodoMerge. And then we have another open source called CoverAgent, which is not yet a commercial product coming very soon. It's very impressive. It could be that already people are approving automated pull requests that they don't even aware in really big open sources. So once we have enough of these, we will also launch another agent. So for the first one and a half year, what we did is grew in our offering and mostly on the side of, does this code actually works, testing, code review, et cetera. And we believe that's the critical milestone that needs to be achieved to actually have the AI engineer for enterprise software. And then like for the first year was everything bottom up, getting to 1 million installation. 2024, that was 2023, 2024 was starting to monetize, to feel like how it is to make the first buck. So we did the teams offering, it went well with a thousand of teams, et cetera. And then we started like just a few months ago to do enterprise with everything you need, which is a lot of things that discussed in the last post that was just released by Codelm. So that's how we call it at Codelm. Just opening the brackets, our company name was Codelm AI, and we renamed to Codo and we call our models Codelm. So back to my point, so we started Enterprise Motion and already have multiple Fortune 100 companies. And then with that, we raised a series of $40 million. And what's exciting about it is that enables us to develop more agents. That's our focus. I think it's very different. We're not coming very soon with an ID or something like that.
Swyx [00:06:01]: You don't want to fork this code?
Itamar [00:06:03]: Maybe we'll fork JetBrains or something just to be different.
Swyx [00:06:08]: I noticed that, you know, I think the promise of general purpose agents has kind of died. Like everyone is doing kind of what you're doing. There's Codogen, Codomerge, and then there's a third one. What's the name of it?
Itamar [00:06:17]: Yeah. Codocover. Cover. Which is like a commercial version of a cover agent. It's coming soon.
Swyx [00:06:23]: Yeah. It's very similar with factory AI, also doing like droids. They all have special purpose doing things, but people don't really want general purpose agents. Right. The last time you were here, we talked about AutoGBT, the biggest thing of 2023. This year, not really relevant anymore. And I think it's mostly just because when you give me a general purpose agent, I don't know what to do with it.
Eric [00:06:42]: Yeah.
Itamar [00:06:43]: I totally agree with that. We're seeing it for a while and I think it will stay like that despite the computer use, et cetera, that supposedly can just replace us. You can just like prompt it to be, hey, now be a QA or be a QA person or a developer. I still think that there's a few reasons why you see like a dedicated agent. Again, I'm a bit more focused, like my head is more on complex software for big teams and enterprise, et cetera. And even think about permissions and what are the data sources and just the same way you manage permissions for users. Developers, you probably want to have dedicated guardrails and dedicated approvals for agents. I intentionally like touched a point on not many people think about. And of course, then what you can think of, like maybe there's different tools, tool use, et cetera. But just the first point by itself is a good reason why you want to have different agents.
Alessio [00:07:40]: Just to compare that with Bot.new, you're almost focused on like the application is very complex and now you need better tools to kind of manage it and build on top of it. On Bot.new, it's almost like I was using it the other day. There's basically like, hey, look, I'm just trying to get started. You know, I'm not very opinionated on like how you're going to implement this. Like this is what I want to do. And you build a beautiful app with it. What people ask as the next step, you know, going back to like the general versus like specific, have you had people say, hey, you know, this is great to start, but then I want a specific Bot.new dot whatever else to do a more vertical integration and kind of like development or what's the, what do people say?
Eric [00:08:18]: Yeah. I think, I think you kind of hit the, hit it head on, which is, you know, kind of the way that we've, we've kind of talked about internally is it's like people are using Bolt to go from like 0.0 to 1.0, like that's like kind of the biggest unlock that Bolt has versus most other things out there. I mean, I think that's kind of what's, what's very unique about Bolt. I think the, you know, the working on like existing enterprise applications is, I mean, it's crazy important because, you know, there's a, you look, when you look at the fortune 500, I mean, these code bases, some of these have been around for 20, 30 plus years. And so it's important to be going from, you know, 101.3 to 101.4, et cetera. I think for us, so what's been actually pretty interesting is we see there's kind of two different users for us that are coming in and it's very distinct. It's like people that are developers already. And then there's people that have never really written software and more if they have, it's been very, very minimal. And so in the first camp, what these developers are doing, like to go from zero to one, they're coming to Bolt and then they're ejecting the thing to get up or just downloading it and, you know, opening cursor, like whatever to, to, you know, keep iterating on the thing. And sometimes they'll bring it back to Bolt to like add in a huge piece of functionality or something. Right. But for the people that don't know how to code, they're actually just, they, they live in this thing. And that was one of the weird things when we launched is, you know, within a day of us being online, one of the most popular YouTube videos, and there's been a ton since, which was, you know, there's like, oh, Bolt is the cursor killer. And I originally saw the headlines and I was like, thanks for the views. I mean, I don't know. This doesn't make sense to me. That's not, that's not what we kind of thought.
Swyx [00:09:44]: It's how YouTubers talk to each other. Well, everything kills everything else.
Eric [00:09:47]: Totally. But what blew my mind was that there was any comparison because it's like cursor is a, is a local IDE product. But when, when we actually kind of dug into it and we, and we have people that are using our product saying this, I'm not using cursor. And I was like, what? And it turns out there are hundreds of thousands of people that we have seen that we're using cursor and we're trying to build apps with that where they're not traditional software does, but we're heavily leaning on the AI. And as you can imagine, it is very complicated, right? To do that with cursor. So when Bolt came out, they're like, wow, this thing's amazing because it kind of inverts the complexity where it's like, you know, it's not an IDE, it's, it's a, it's a chat-based sort of interface that we have. So that's kind of the split, which is rather interesting. We've had like the first startups now launch off of Bolt entirely where this, you know, tomorrow I'm doing a live stream with this guy named Paul, who he's built an entire CRM using this thing and you know, with backend, et cetera. And people have made their first money on the internet period, you know, launching this with Stripe or whatever have you. So that's, that's kind of the two main, the two main categories of folks that we see using Bolt though.
Itamar [00:10:51]: I agree that I don't understand the comparison. It doesn't make sense to me. I think like we have like two type of families of tools. One is like we re-imagine the software development. I think Bolt is there and I think like a cursor is more like a evolution of what we already have. It's like taking the IDE and it's, it's amazing and it's okay, let's, let's adapt the IDE to an era where LLMs can do a lot for us. And Bolt is more like, okay, let's rethink everything totally. And I think we see a few tools there, like maybe Vercel, Veo and maybe Repl.it in that area. And then in the area of let's expedite, let's change, let's, let's progress with what we already have. You can see Cursor and Kodo, but we're different between ourselves, Cursor and Kodo, but definitely I think that comparison doesn't make sense.
Alessio [00:11:42]: And just to set the context, this is not a Twitter demo. You've made 4 million of revenue in four weeks. So this is, this is actually working, you know, it's not a, what, what do you think that is? Like, there's been so many people demoing coding agents on Twitter and then it doesn't really work. And then you guys were just like, here you go, it's live, go use it, pay us for it. You know, is there anything in the development that was like interesting and maybe how that compares to building your own agents?
Eric [00:12:08]: We had no idea, honestly, like we, we, we've been pretty blown away and, and things have just kind of continued to grow faster since then. We're like, oh, today is week six. So I, I kind of came back to the point you just made, right, where it's, you, you kind of outlined, it's like, there's kind of this new market of like kind of rethinking the software development and then there's heavily augmenting existing developers. I think that, you know, both of which are, you know, AI code gen being extremely good, it's allowed existing developers, it's allowing existing developers to camera out software far faster than they could have ever before, right? It's like the ultimate power tool for an existing developer. But this code gen stuff is now so good. And then, and we saw this over the past, you know, from the beginning of the year when we tried to first build, it's actually lowered the barrier to people that, that aren't traditionally software engineers. But the kind of the key thing is if you kind of think about it from, imagine you've never written software before, right? My co-founder and I, he and I grew up down the street from each other in Chicago. We learned how to code when we were 13 together and we've been building stuff ever since. And this is back in like the mid 2000s or whatever, you know, there was nothing for free to learn from online on the internet and how to code. For our 13th birthdays, we asked our parents for, you know, O'Reilly books cause you couldn't get this at the library, right? And so instead of like an Xbox, we got, you know, programming books. But the hardest part for everyone learning to code is getting an environment set up locally, you know? And so when we built StackBlitz, like kind of the key thesis, like seven years ago, the insight we had was that, Hey, it seems like the browser has a lot of new APIs like WebAssembly and service workers, et cetera, where you could actually write an operating system that ran inside the browser that could boot in milliseconds. And you, you know, basically there's this missing capability of the web. Like the web should be able to build apps for the web, right? You should be able to build the web on the web. Every other platform has that, Visual Studio for Windows, Xcode for Mac. The web has no built in primitive for this. And so just like our built in kind of like nerd instinct on this was like, that seems like a huge hole and it's, you know, it will be very valuable or like, you know, very valuable problem to solve. So if you want to set up that environments, you know, this is what we spent the past seven years doing. And the reality is existing developers have running locally. They already know how to set up that environment. So the problem isn't as acute for them. When we put Bolt online, we took that technology called WebContainer and married it with these, you know, state of the art frontier models. And the people that have the most pain with getting stuff set up locally is people that don't code. I think that's been, you know, really the big explosive reason is no one else has been trying to make dev environments work inside of a browser tab, you know, for the past if since ever, other than basically our company, largely because there wasn't an immediate demand or need. So I think we kind of find ourselves at the right place at the right time. And again, for this market of people that don't know how to write software, you would kind of expect that you should be able to do this without downloading something to your computer in the same way that, hey, I don't have to download Photoshop now to make designs because there's Figma. I don't have to download Word because there's, you know, Google Docs. They're kind of looking at this as that sort of thing, right? Which was kind of the, you know, our impetus and kind of vision from the get-go. But you know, the code gen, the AI code gen stuff that's come out has just been, you know, an order of magnitude multiplier on how magic that is, right? So that's kind of my best distillation of like, what is going on here, you know?
Alessio [00:15:21]: And you can deploy too, right?
Eric [00:15:22]: Yeah.
Alessio [00:15:23]: Yeah.
Eric [00:15:24]: And so that's, what's really cool is it's, you know, we have deployment built in with Netlify and this is actually, I think, Sean, you actually built this at Netlify when you were there. Yeah. It's one of the most brilliant integrations actually, because, you know, effectively the API that Sean built, maybe you can speak to it, but like as a provider, we can just effectively give files to Netlify without the user even logging in and they have a live website. And if they want to keep, hold onto it, they can click a link and claim it to their Netlify account. But it basically is just this really magic experience because when you come to Bolt, you say, I want a website. Like my mom, 70, 71 years old, made her first website, you know, on the internet two weeks ago, right? It was about her nursing days.
Swyx [00:16:03]: Oh, that's fantastic though. It wouldn't have been made.
Eric [00:16:06]: A hundred percent. Cause even in, you know, when we've had a lot of people building personal, like deeply personal stuff, like in the first week we launched this, the sales guy from the East Coast, you know, replied to a tweet of mine and he said, thank you so much for building this to your team. His daughter has a medical condition and so for her to travel, she has to like line up donors or something, you know, so ahead of time. And so he actually used Bolt to make a website to do that, to actually go and send it to folks in the region she was going to travel to ahead of time. I was really touched by it, but I also thought like, why, you know, why didn't he use like Wix or Squarespace? Right? I mean, this is, this is a solved problem, quote unquote, right? And then when I thought, I actually use Squarespace for my, for my, uh, the wedding website for my wife and I, like back in 2021, so I'm familiar, you know, it was, it was faster. I know how to code. I was like, this is faster. Right. And I thought back and I was like, there's a whole interface you have to learn how to use. And it's actually not that simple. There's like a million things you can configure in that thing. When you come to Bolt, there's a, there's a text box. You just say, I need a, I need a wedding website. Here's the date. Here's where it is. And here's a photo of me and my wife, put it somewhere relevant. It's actually the simplest way. And that's what my, when my mom came, she said, uh, I'm Pat Simons. I was a nurse in the seventies, you know, and like, here's the things I did and a website came out. So coming back to why is this such a, I think, why are we seeing this sort of growth? It's, this is the simplest interface I think maybe ever created to actually build it, a deploy a website. And then that website, my mom made, she's like, okay, this looks great. And there's, there's one button, you just click it, deploy, and it's live and you can buy a domain name, attach it to it. And you know, it's as simple as it gets, it's getting even simpler with some of the stuff we're working on. But anyways, so that's, it's, it's, uh, it's been really interesting to see some of the usage like that.
Swyx [00:17:46]: I can offer my perspective. So I, you know, I probably should have disclosed a little bit that, uh, I'm a, uh, stack list investor.
Alessio [00:17:53]: Canceled the episode. I know, I know. Don't play it now. Pause.
Eric actually reached out to ShowMeBolt before the launch. And we, you know, we talked a lot about, like, the framing of, of what we're going to talk about how we marketed the thing, but also, like, what we're So that's what Bolt was going to need, like a whole sort of infrastructure.
swyx: Netlify, I was a maintainer but I won't take claim for the anonymous upload. That's actually the origin story of Netlify. We can have Matt Billman talk about it, but that was [00:18:00] how Netlify started. You could drag and drop your zip file or folder from your desktop onto a website, it would have a live URL with no sign in.
swyx: And so that was the origin story of Netlify. And it just persists to today. And it's just like it's really nice, interesting that both Bolt and CognitionDevIn and a bunch of other sort of agent type startups, they all use Netlify to deploy because of this one feature. They don't really care about the other features.
swyx: But, but just because it's easy for computers to use and talk to it, like if you build an interface for computers specifically, that it's easy for them to Navigate, then they will be used in agents. And I think that's a learning that a lot of developer tools companies are having. That's my bolt launch story and now if I say all that stuff.
swyx: And I just wanted to come back to, like, the Webcontainers things, right? Like, I think you put a lot of weight on the technical modes. I think you also are just like, very good at product. So you've, you've like, built a better agent than a lot of people, the rest of us, including myself, who have tried to build these things, and we didn't get as far as you did.
swyx: Don't shortchange yourself on products. But I think specifically [00:19:00] on, on infra, on like the sandboxing, like this is a thing that people really want. Alessio has Bax E2B, which we'll have on at some point, talking about like the sort of the server full side. But yours is, you know, inside of the browser, serverless.
swyx: It doesn't cost you anything to serve one person versus a million people. It doesn't, doesn't cost you anything. I think that's interesting. I think in theory, we should be able to like run tests because you can run the full backend. Like, you can run Git, you can run Node, you can run maybe Python someday.
swyx: We talked about this. But ideally, you should be able to have a fully gentic loop, running code, seeing the errors, correcting code, and just kind of self healing, right? Like, I mean, isn't that the dream?
Eric: Totally.
swyx: Yeah,
Eric: totally. At least in bold, we've got, we've got a good amount of that today. I mean, there's a lot more for us to do, but one of the nice things, because like in web container, you know, there's a lot of kind of stuff you go Google like, you know, turn docker container into wasm.
Eric: You'll find a lot of stuff out there that will do that. The problem is it's very big, it's slow, and that ruins the experience. And so what we ended up doing is just writing an operating system from [00:20:00] scratch that was just purpose built to, you know, run in a browser tab. And the reason being is, you know, Docker 2 awesome things will give you an image that's like out 60 to 100 megabits, you know, maybe more, you know, and our, our OS, you know, kind of clocks in, I think, I think we're in like a, maybe, maybe a megabyte or less or something like that.
Eric: I mean, it's, it's, you know, really, really, you know, stripped down.
swyx: This is basically the task involved is I understand that it's. Mapping every single, single Linux call to some kind of web, web assembly implementation,
Eric: but more or less, and, and then there's a lot of things actually, like when you're looking at a dev environment, there's a lot of things that you don't need that a traditional OS is gonna have, right?
Eric: Like, you know audio drivers or you like, there's just like, there's just tons of things. Oh, yeah. Right. Yeah. That goes . Yeah. You can just kind, you can, you can kind of tos them. Or alternatively, what you can do is you can actually be the nice thing. And this is, this kind of comes back to the origins of browsers, which is, you know, they're, they're at the beginning of the web and, you know, the late nineties, there was two very different kind of visions for the web where Alan Kay vehemently [00:21:00] disagree with the idea that should be document based, which is, you know, Tim Berners Lee, you know, that, and that's kind of what ended up winning, winning was this document based kind of browsing documents on the web thing.
Eric: Alan Kay, he's got this like very famous quote where he said, you know, you want web browsers to be mini operating systems. They should download little mini binaries and execute with like a little mini virtualized operating system in there. And what's kind of interesting about the history, not to geek out on this aspect, what's kind of interesting about the history is both of those folks ended up being right.
Eric: Documents were actually the pragmatic way that the web worked. Was, you know, became the most ubiquitous platform in the world to the degree now that this is why WebAssembly has been invented is that we're doing, we need to do more low level things in a browser, same thing with WebGPU, et cetera. And so all these APIs, you know, to build an operating system came to the browser.
Eric: And that was actually the realization we had in 2017 was, holy heck, like you can actually, you know, service workers, which were designed for allowing your app to work offline. That was the kind of the key one where it was like, wait a second, you can actually now run. Web servers within a [00:22:00] browser, like you can run a server that you open up.
Eric: That's wild. Like full Node. js. Full Node. js. Like that capability. Like, I can have a URL that's programmatically controlled. By a web application itself, boom. Like the web can build the web. The primitive is there. Everyone at the time, like we talked to people that like worked on, you know Chrome and V8 and they were like, uhhhh.
Eric: You know, like I don't know. But it's one of those things you just kind of have to go do it to find out. So we spent a couple of years, you know, working on it and yeah. And, and, and got to work in back in 2021 is when we kind of put the first like data of web container online. But
swyx: in partnership with Google, right?
swyx: Like Google actually had to help you get over the finish line with stuff.
Eric: A hundred percent, because well, you know, over the years of when we were doing the R and D on the thing. Kind of the biggest challenge, the two ways that you can kind of test how powerful and capable a platform are, the two types of applications are one, video games, right, because they're just very compute intensive, a lot of calculations that have to happen, right?
Eric: The second one are IDEs, because you're talking about actually virtualizing the actual [00:23:00] runtime environment you are in to actually build apps on top of it, which requires sophisticated capabilities, a lot of access to data. You know, a good amount of compute power, right, to effectively, you know, building app in app sort of thing.
Eric: So those, those are the stress tests. So if your platform is missing stuff, those are the things where you find out. Those are, those are the people building games and IDEs. They're the ones filing bugs on operating system level stuff. And for us, browser level stuff.
Eric [00:23:47]: yeah, what ended up happening is we were just hammering, you know, the Chromium bug tracker, and they're like, who are these guys? Yeah. And, and they were amazing because I mean, just making Chrome DevTools be able to debug, I mean, it's, it's not, it wasn't originally built right for debugging an operating system, right? They've been phenomenal working with us and just kind of really pushing the limits, but that it's a rising tide that's kind of lifted all boats because now there's a lot of different types of applications that you can debug with Chrome Dev Tools that are running a browser that runs more reliably because just the stress testing that, that we and, you know, games that are coming to the web are kind of pushing as well, but.
Itamar [00:24:23]: That's awesome. About the testing, I think like most, let's say coding assistant from different kinds will need this loop of testing. And even I would add code review to some, to some extent that you mentioned. How is testing different from code review? Code review could be, for example, PR review, like a code review that is done at the point of when you want to merge branches. But I would say that code review, for example, checks best practices, maintainability, and so on. It's not just like CI, but more than CI. And testing is like a more like checking functionality, et cetera. So it's different. We call, by the way, all of these together code integrity, but that's a different story. Just to go back to the, to the testing and specifically. Yeah. It's, it's, it's since the first slide. Yeah. We're consistent. So if we go back to the testing, I think like, it's not surprising that for us testing is important and for Bolt it's testing important, but I want to shed some light on a different perspective of it. Like let's think about autonomous driving. Those startups that are doing autonomous driving for highway and autonomous driving for the city. And I think like we saw the autonomous of the highway much faster and reaching to a level, I don't know, four or so much faster than those in the city. Now, in both cases, you need testing and quote unquote testing, you know, verifying validation that you're doing the right thing on the road and you're reading and et cetera. But it's probably like so different in the city that it could be like actually different technology. And I claim that we're seeing something similar here. So when you're building the next Wix, and if I was them, I was like looking at you and being a bit scared. That's what you're disrupting, what you just said. Then basically, I would say that, for example, the UX UI is freaking important. And because you're you're more aiming for the end user. In this case, maybe it's an end user that doesn't know how to develop for developers. It's also important. But let alone those that do not know to develop, they need a slick UI UX. And I think like that's one reason, for example, I think Cursor have like really good technology. I don't know the underlying what's under the hood, but at least what they're saying. But I think also their UX UI is great. It's a lot because they did their own ID. While if you're aiming for the city AI, suddenly like there's a lot of testing and code review technology that it's not necessarily like that important. For example, let's talk about integration tests. Probably like a lot of what you're building involved at the moment is isolated applications. Maybe the vision or the end game is maybe like having one solution for everything. It could be that eventually the highway companies will go into the city and the other way around. But at the beginning, there is a difference. And integration tests are a good example. I guess they're a bit less important. And when you think about enterprise software, they're really important. So to recap, like I think like the idea of looping and verifying your test and verifying your code in different ways, testing or code review, et cetera, seems to be important in the highway AI and the city AI, but in different ways and different like critical for the city, even more and more variety. Actually, I was looking to ask you like what kind of loops you guys are doing. For example, when I'm using Bolt and I'm enjoying it a lot, then I do see like sometimes you're trying to catch the errors and fix them. And also, I noticed that you're breaking down tasks into smaller ones and then et cetera, which is already a common notion for a year ago. But it seems like you're doing it really well. So if you're willing to share anything about it.
Eric [00:28:07]: Yeah, yeah. I realized I never actually hit the punchline of what I was saying before. I mentioned the point about us kind of writing an operating system from scratch because what ended up being important about that is that to your point, it's actually a very, like compared to like a, you know, if you're like running cursor on anyone's machine, you kind of don't know what you're dealing with, with the OS you're running on. There could be an error happens. It could be like a million different things, right? There could be some config. There could be, it could be God knows what, right? The thing with WebConnect is because we wrote the entire thing from scratch. It's actually a unified image basically. And we can instrument it at any level that we think is going to be useful, which is exactly what we did when we started building Bolt is we instrumented stuff at like the process level, at the runtime level, you know, et cetera, et cetera, et cetera. Stuff that would just be not impossible to do on local, but to do that in a way that works across any operating system, whatever is, I mean, would just be insanely, you know, insanely difficult to do right and reliably. And that's what you saw when you've used Bolt is that when an error actually will occur, whether it's in the build process or the actual web application itself is failing or anything kind of in between, you can actually capture those errors. And today it's a very primitive way of how we've implemented it largely because the product just didn't exist 90 days ago. So we're like, we got some work ahead of us and we got to hire some more a little bit, but basically we present and we say, Hey, this is, here's kind of the things that went wrong. There's a fix it button and then a ignore button, and then you can just hit fix it. And then we take all that telemetry through our agent, you run it through our agent and say, kind of, here's the state of the application. Here's kind of the errors that we got from Node.js or the browser or whatever, and like dah, dah, dah, dah. And it can take a crack at actually solving it. And it's actually pretty darn good at being able to do that. That's kind of been a, you know, closing the loop and having it be a reliable kind of base has seemed to be a pretty big upgrade over doing stuff locally, just because I think that's a pretty key ingredient of it. And yeah, I think breaking things down into smaller tasks, like that's, that's kind of a key part of our agent. I think like Claude did a really good job with artifacts. I think, you know, us and kind of everyone else has, has kind of taken their approach of like actually breaking out certain tasks in a certain order into, you know, kind of a concrete way. And, and so actually the core of Bolt, I know we actually made open source. So you can actually go and check out like the system prompts and et cetera, and you can run it locally and whatever have you. So anyone that's interested in this stuff, I'd highly recommend taking a look at. There's not a lot of like stuff that's like open source in this realm. It's, that was one of the fun things that we've we thought would be cool to do. And people, people seem to like it. I mean, there's a lot of forks and people adding different models and stuff. So it's been cool to see.
Swyx [00:30:41]: Yeah. I'm happy to add, I added real-time voice for my opening day demo and it was really fun to hack with. So thank you for doing that. Yeah. Thank you. I'm going to steal your code.
Eric [00:30:52]: Because I want that.
Swyx [00:30:52]: It's funny because I built on top of the fork of Bolt.new that already has the multi LLM thing. And so you just told me you're going to merge that in. So then you're going to merge two layers of forks down into this thing. So it'll be fun.
Eric [00:31:03]: Heck yeah.
Alessio [00:31:04]: Just to touch on like the environment, Itamar, you maybe go into the most complicated environments that even the people that work there don't know how to run. How much of an impact does that have on your performance? Like, you know, it's most of the work you're doing actually figuring out environment and like the libraries, because I'm sure they're using outdated version of languages, they're using outdated libraries, they're using forks that have not been on the public internet before. How much of the work that you're doing is like there versus like at the LLM level?
Itamar [00:31:32]: One of the reasons I was asking about, you know, what are the steps to break things down, because it really matters. Like, what's the tech stack? How complicated the software is? It's hard to figure it out when you're dealing with the real world, any environment of enterprise as a city, when I'm like, while maybe sometimes like, I think you do enable like in Bolt, like to install stuff, but it's quite a like controlled environment. And that's a good thing to do, because then you narrow down and it's easier to make things work. So definitely, there are two dimensions, I think, actually spaces. One is the fact just like installing our software without yet like doing anything, making it work, just installing it because we work with enterprise and Fortune 500, etc. Many of them want on prem solution.
Swyx [00:32:22]: So you have how many deployment options?
Itamar [00:32:24]: Basically, we had, we did a metric metrics, say 96 options, because, you know, they're different dimensions. Like, for example, one dimension, we connect to your code management system to your Git. So are you having like GitHub, GitLab? Subversion? Is it like on cloud or deployed on prem? Just an example. Which model agree to use its APIs or ours? Like we have our Is it TestGPT? Yeah, when we started with TestGPT, it was a huge mistake name. It was cool back then, but I don't think it's a good idea to name a model after someone else's model. Anyway, that's my opinion. So we got
Swyx [00:33:02]: I'm interested in these learnings, like things that you change your mind on.
Itamar [00:33:06]: Eventually, when you're building a company, you're building a brand and you want to create your own brand. By the way, when I thought about Bolt.new, I also thought about if it's not a problem, because when I think about Bolt, I do think about like a couple of companies that are already called this way.
Swyx [00:33:19]: Curse companies. You could call it Codium just to...
Itamar [00:33:24]: Okay, thank you. Touche. Touche.
Eric [00:33:27]: Yeah, you got to imagine the board meeting before we launched Bolt, one of our investors, you can imagine they're like, are you sure? Because from the investment side, it's kind of a famous, very notorious Bolt. And they're like, are you sure you want to go with that name? Oh, yeah. Yeah, absolutely.
Itamar [00:33:43]: At this point, we have actually four models. There is a model for autocomplete. There's a model for the chat. There is a model dedicated for more for code review. And there is a model that is for code embedding. Actually, you might notice that there isn't a good code embedding model out there. Can you name one? Like dedicated for code?
Swyx [00:34:04]: There's code indexing, and then you can do sort of like the hide for code. And then you can embed the descriptions of the code.
Itamar [00:34:12]: Yeah, but you do see a lot of type of models that are dedicated for embedding and for different spaces, different fields, etc. And I'm not aware. And I know that if you go to the bedrock, try to find like there's a few code embedding models, but none of them are specialized for code.
Swyx [00:34:31]: Is there a benchmark that you would tell us to pay attention to?
Itamar [00:34:34]: Yeah, so it's coming. Wait for that. Anyway, we have our models. And just to go back to the 96 option of deployment. So I'm closing the brackets for us. So one is like dimensional, like what Git deployment you have, like what models do you agree to use? Dotter could be like if it's air-gapped completely, or you want VPC, and then you have Azure, GCP, and AWS, which is different. Do you use Kubernetes or do not? Because we want to exploit that. There are companies that do not do that, etc. I guess you know what I mean. So that's one thing. And considering that we are dealing with one of all four enterprises, we needed to deal with that. So you asked me about how complicated it is to solve that complex code. I said, it's just a deployment part. And then now to the software, we see a lot of different challenges. For example, some companies, they did actually a good job to build a lot of microservices. Let's not get to if it's good or not, but let's first assume that it is a good thing. A lot of microservices, each one of them has their own repo. And now you have tens of thousands of repos. And you as a developer want to develop something. And I remember me coming to a corporate for the first time. I don't know where to look at, like where to find things. So just doing a good indexing for that is like a challenge. And moreover, the regular indexing, the one that you can find, we wrote a few blogs on that. By the way, we also have some open source, different than yours, but actually three and growing. Then it doesn't work. You need to let the tech leads and the companies influence your indexing. For example, Mark with different repos with different colors. This is a high quality repo. This is a lower quality repo. This is a repo that we want to deprecate. This is a repo we want to grow, etc. And let that be part of your indexing. And only then things actually work for enterprise and they don't get to a fatigue of, oh, this is awesome. Oh, but I'm starting, it's annoying me. I think Copilot is an amazing tool, but I'm quoting others, meaning GitHub Copilot, that they see not so good retention of GitHub Copilot and enterprise. Ooh, spicy. Yeah. I saw snapshots of people and we have customers that are Copilot users as well. And also I saw research, some of them is public by the way, between 38 to 50% retention for users using Copilot and enterprise. So it's not so good. By the way, I don't think it's that bad, but it's not so good. So I think that's a reason because, yeah, it helps you auto-complete, but then, and especially if you're working on your repo alone, but if it's need that context of remote repos that you're code-based, that's hard. So to make things work, there's a lot of work on that, like giving the controllability for the tech leads, for the developer platform or developer experience department in the organization to influence how things are working. A short example, because if you have like really old legacy code, probably some of it is not so good anymore. If you just fine tune on these code base, then there is a bias to repeat those mistakes or old practices, etc. So you need, for example, as I mentioned, to influence that. For example, in Coda, you can have a markdown of best practices by the tech leads and Coda will include that and relate to that and will not offer suggestions that are not according to the best practices, just as an example. So that's just a short list of things that you need to do in order to deal with, like you mentioned, the 100.1 to 100.2 version of software. I just want to say what you're doing is extremely
Eric [00:38:32]: impressive because it's very difficult. I mean, the business of Stackplus, kind of before bulk came online, we sold a version of our IDE that went on-prem. So I understand what you're saying about the difficulty of getting stuff just working on-prem. Holy heck. I mean, that is extremely hard. I guess the question I have for you is, I mean, we were just doing that with kind of Kubernetes-based stuff, but the spread of Fortune 500 companies that you're working with, how are they doing the inference for this? Are you kind of plugging into Azure's OpenAI stuff and AWS's Bedrock, you know, Cloud stuff? Or are they just like running stuff on GPUs? Like, what is that? How are these folks approaching that? Because, man, what we saw on the enterprise side, I mean, I got to imagine that that's a huge challenge. Everything you said and more, like,
Itamar [00:39:15]: for example, like someone could be, and I don't think any of these is bad. Like, they made their decision. Like, for example, some people, they're, I want only AWS and VPC on AWS, no matter what. And then they, some of them, like there is a subset, I will say, I'm willing to take models only for from Bedrock and not ours. And we have a problem because there is no good code embedding model on Bedrock. And that's part of what we're doing now with AWS to solve that. We solve it in a different way. But if you are willing to run on AWS VPC, but run your run models on GPUs or inferentia, like the new version of the more coming out, then our models can run on that. But everything you said is right. Like, we see like on-prem deployment where they have their own GPUs. We see Azure where you're using OpenAI Azure. We see cases where you're running on GCP and they want OpenAI. Like this cross, like a case, although there is Gemini or even Sonnet, I think is available on GCP, just an example. So all the options, that's part of the challenge. I admit that we thought about it, but it was even more complicated. And it took us a few months to actually, that metrics that I mentioned, to start clicking each one of the blocks there. A few months is impressive. I mean,
Eric [00:40:35]: honestly, just that's okay. Every one of these enterprises is, their networking is different. Just everything's different. Every single one is different. I see you understand. Yeah. So that just cannot be understated. That it is, that's extremely impressive. Hats off.
Itamar [00:40:50]: It could be, by the way, like, for example, oh, we're only AWS, but our GitHub enterprise is on-prem. Oh, we forgot. So we need like a private link or whatever, like every time like that. It's not, and you do need to think about it if you want to work with an enterprise. And it's important. Like I understand like their, I respect their point of view.
Swyx [00:41:10]: And this primarily impacts your architecture, your tech choices. Like you have to, you can't choose some vendors because...
Itamar [00:41:15]: Yeah, definitely. To be frank, it makes us hard for a startup because it means that we want, we want everyone to enjoy all the variety of models. By the way, it was hard for us with our technology. I want to open a bracket, like a window. I guess you're familiar with our Alpha Codium, which is an open source.
Eric [00:41:33]: We got to go over that. Yeah. So I'll do that quickly.
Itamar [00:41:36]: Yeah. A pin in that. Yeah. Actually, we didn't have it in the last episode. So, so, okay.
Swyx [00:41:41]: Okay. We'll come back to that later, but let's talk about...
Itamar [00:41:43]: Yeah. So, so just like shortly, and then we can double click on Alpha Codium. But Alpha Codium is a open source tool. You can go and try it and lets you compete on CodeForce. This is a website and a competition and actually reach a master level level, like 95% with a click of a button. You don't need to do anything. And part of what we did there is taking a problem and breaking it to different, like smaller blocks. And then the models are doing a much better job. Like we all know it by now that taking small tasks and solving them, by the way, even O1, which is supposed to be able to do system two thinking like Greg from OpenAI like hinted, is doing better on these kinds of problems. But still, it's very useful to break it down for O1, despite O1 being able to think by itself. And that's what we presented like just a month ago, OpenAI released that now they are doing 93 percentile with O1 IOI left and International Olympiad of Formation. Sorry, I forgot. Exactly. I told you I forgot. And we took their O1 preview with Alpha Codium and did better. Like it just shows like, and there is a big difference between the preview and the IOI. It shows like that these models are not still system two thinkers, and there is a big difference. So maybe they're not complete system two. Yeah, they need some guidance. I call them system 1.5. We can, we can have it. I thought about it. Like, you know, I care about this philosophy stuff. And I think like we didn't see it even close to a system two thinking. I can elaborate later. But closing the brackets, like we take Alpha Codium and as our principle of thinking, we take tasks and break them down to smaller tasks. And then we want to exploit the best model to solve them. So I want to enable anyone to enjoy O1 and SONET and Gemini 1.5, etc. But at the same time, I need to develop my own models as well, because some of the Fortune 500 want to have all air gapped or whatever. So that's a challenge. Now you need to support so many models. And to some extent, I would say that the flow engineering, the breaking down to two different blocks is a necessity for us. Why? Because when you take a big block, a big problem, you need a very different prompt for each one of the models to actually work. But when you take a big problem and break it into small tasks, we can talk how we do that, then the prompt matters less. What I want to say, like all this, like as a startup trying to do different deployment, getting all the juice that you can get from models, etc. is a big problem. And one need to think about it. And one of our mitigation is that process of taking tasks and breaking them down. That's why I'm really interested to know how you guys are doing it. And part of what we do is also open source. So you can see.
Swyx [00:44:39]: There's a lot in there. But yeah, flow over prompt. I do believe that that does make sense. I feel like there's a lot that both of you can sort of exchange notes on breaking down problems. And I just want you guys to just go for it. This is fun to watch.
Eric [00:44:55]: Yeah. I mean, what's super interesting is the context you're working in is, because for us too with Bolt, we've started thinking because our kind of existing business line was going behind the firewall, right? We were like, how do we do this? Adding the inference aspect on, we're like, okay, how does... Because I mean, there's not a lot of prior art, right? I mean, this is all new. This is all new. So I definitely am going to have a lot of questions for you.
Itamar [00:45:17]: I'm here. We're very open, by the way. We have a paper on a blog or like whatever.
Swyx [00:45:22]: The Alphacodeum, GitHub, and we'll put all this in the show notes.
Itamar [00:45:25]: Yeah. And even the new results of O1, we published it.
Eric [00:45:29]: I love that. And I also just, I think spiritually, I like your approach of being transparent. Because I think there's a lot of hype-ium around AI stuff. And a lot of it is, it's just like, you have these companies that are just kind of keep their stuff closed source and then just max hype it, but then it's kind of nothing. And I think it kind of gives a bad rep to the incredible stuff that's actually happening here. And so I think it's stuff like what you're doing where, I mean, true merit and you're cracking open actual code for others to learn from and use. That strikes me as the right approach. And it's great to hear that you're making such incredible progress.
Itamar [00:46:02]: I have something to share about the open source. Most of our tools are, we have an open source version and then a premium pro version. But it's not an easy decision to do that. I actually wanted to ask you about your strategy, but I think in your case, there is, in my opinion, relatively a good strategy where a lot of parts of open source, but then you have the deployment and the environment, which is not right if I get it correctly. And then there's a clear, almost hugging face model. Yeah, you can do that, but why should you try to deploy it yourself, deploy it with us? But in our case, and I'm not sure you're not going to hit also some competitors, and I guess you are. I wanted to ask you, for example, on some of them. In our case, one day we looked on one of our competitors that is doing code review. We're a platform. We have the code review, the testing, et cetera, spread over the ID to get. And in each agent, we have a few startups or a big incumbents that are doing only that. So we noticed one of our competitors having not only a very similar UI of our open source, but actually even our typo. And you sit there and you're kind of like, yeah, we're not that good. We don't use enough Grammarly or whatever. And we had a couple of these and we saw it there. And then it's a challenge. And I want to ask you, Bald is doing so well, and then you open source it. So I think I know what my answer was. I gave it before, but still interesting
Eric [00:47:29]: to hear what you think. GeoHot said back, I don't know who he was up to at this exact moment, but I think on comma AI, all that stuff's open source. And someone had asked him, why is this open source? And he's like, if you're not actually confident that you can go and crush it and build the best thing, then yeah, you should probably keep your stuff closed source. He said something akin to that. I'm probably kind of butchering it, but I thought it was kind of a really good point. And that's not to say that you should just open source everything, because for obvious reasons, there's kind of strategic things you have to kind of take in mind. But I actually think a pretty liberal approach, as liberal as you kind of can be, it can really make a lot of sense. Because that is so validating that one of your competitors is taking your stuff and they're like, yeah, let's just kind of tweak the styles. I mean, clearly, right? I think it's kind of healthy because it keeps, I'm sure back at HQ that day when you saw that, you're like, oh, all right, well, we have to grind even harder to make sure we stay ahead. And so I think it's actually a very useful, motivating thing for the teams. Because you might feel this period of comfort. I think a lot of companies will have this period of comfort where they're not feeling the competition and one day they get disrupted. So kind of putting stuff out there and letting people push it forces you to face reality soon, right? And actually feel that incrementally so you can kind of adjust course. And that's for us, the open source version of Bolt has had a lot of features people have been begging us for, like persisting chat messages and checkpoints and stuff. Within the first week, that stuff was landed in the open source versions. And they're like, why can't you ship this? It's in the open, so people have forked it. And we're like, we're trying to keep our servers and GPUs online. But it's been great because the folks in the community did a great job, kept us on our toes. And we've got to know most of these folks too at this point that have been building these things. And so it actually was very instructive. Like, okay, well, if we're going to go kind of land this, there's some UX patterns we can kind of look at and the code is open source to this stuff. What's great about these, what's not. So anyways, NetNet, I think it's awesome. I think from a competitive point of view for us, I think in particular, what's interesting is the core technology of WebContainer going. And I think that right now, there's really nothing that's kind of on par with that. And we also, we have a business of, because WebContainer runs in your browser, but to make it work, you have to install stuff from NPM. You have to make cores bypass requests, like connected databases, which all require server-side proxying or acceleration. And so we actually sell WebContainer as a service. One of the core reasons we open-sourced kind of the core components of Bolt when we launched was that we think that there's going to be a lot more of these AI, in-your-browser AI co-gen experiences, kind of like what Anthropic did with Artifacts and Clod. By the way, Artifacts uses WebContainers. Not yet. No, yeah. Should I strike that? I think that they've got their own thing at the moment, but there's been a lot of interest in WebContainers from folks doing things in that sort of realm and in the AI labs and startups and everything in between. So I think there'll be, I imagine, over the coming months, there'll be lots of things being announced to folks kind of adopting it. But yeah, I think effectively...
Swyx [00:50:35]: Okay, I'll say this. If you're a large model lab and you want to build sandbox environments inside of your chat app, you should call Eric.
Itamar [00:50:43]: But wait, wait, wait, wait, wait, wait. I have a question about that. I think OpenAI, they felt that people are not using their model as they would want to. So they built ChatGPT. But I would say that ChatGPT now defines OpenAI. I know they're doing a lot of business from their APIs, but still, is this how you think? Isn't Bolt.new your business now? Why don't you focus on that instead of the...
Swyx [00:51:16]: What's your advice as a founder?
Eric [00:51:18]: You're right. And so going into it, we, candidly, we were like, Bolt.new, this thing is super cool. We think people are stoked. We think people will be stoked. But we were like, maybe that's allowed. Best case scenario, after month one, we'd be mind blown if we added a couple hundred K of error or something. And we were like, but we think there's probably going to be an immediate huge business. Because there was some early poll on folks wanting to put WebContainer into their product offerings, kind of similar to what Bolt is doing or whatever. We were actually prepared for the inverse outcome here. But I mean, well, I guess we've seen poll on both. But I mean, what's happened with Bolt, and you're right, it's actually the same strategy as like OpenAI or Anthropic, where we have our ChatGPT to OpenAI's APIs is Bolt to WebContainer. And so we've kind of taken that same approach. And we're seeing, I guess, some of the similar results, except right now, the revenue side is extremely lopsided to Bolt.
Itamar [00:52:16]: I think if you ask me what's my advice, I think you have three options. One is to focus on Bolt. The other is to focus on the WebContainer. The third is to raise one billion dollars and do them both. I'm serious. I think otherwise, you need to choose. And if you raise enough money, and I think it's big bucks, because you're going to be chased by competitors. And I think it will be challenging to do both. And maybe you can. I don't know. We do see these numbers right now, raising above $100 million, even without having
Eric [00:52:49]: a product. You can see these. It's excellent advice. And I think what's been amazing, but also kind of challenging is we're trying to forecast, okay, well, where are these things going? I mean, in the initial weeks, I think us and all the investors in the company that we're sharing this with, it was like, this is cool. Okay, we added 500k. Wow, that's crazy. Wow, we're at a million now. Most things, you have this kind of the tech crunch launch of initiation and then the thing of sorrow. And if there's going to be a downtrend, it's just not coming yet. Now that we're kind of looking ahead, we're six weeks in. So now we're getting enough confidence in our convictions to go, okay, this seems to be the trend line. I'll tell you another reason why
Swyx [00:53:33]: I think, where is Jasper? They actually just announced some new numbers recently. They're still surviving. They have gone down a lot. I think that the peak that I heard was a hundred
Itamar [00:53:42]: billion ARR. And now there's like tens of these. So I think their success was phenomenal, like what I see at Bolt. And I think if you want to keep that, probably, who am I? I'm just giving my two cents. You need to focus because you are going to see weeks, I think that you're disrupting their market. And you open sourced some of it and they have containers, I believe. And you need to fight. I can tell you that when we open source, I share with you a small competitor, but I can tell you, I have a friend who has built a billion dollar company and more. When we released Alpha Codium, he sent me a private email asking, what the f**k did you just do? Why did you release that? You should have kept it. Yeah, you released that open source. I'm thinking, build some stuff and now I can do that much more easily. I can tell you my answer and I thought that maybe you'll answer as well. Although I think Bolt is already very promising. For us, Alpha Codium 1 is like GPT 1. I agree with you. Being open and open source, etc. really helps to improve the product community, etc. But at some point, OpenAI closed their GPT 3.5 or whatever. And that was part of my answer. Alpha Codium is the agent that is compatible with GPT 1 and there is a lot to do for these agents to actually get that moment that we had with GPT 3.5, etc. as agents.
Eric [00:55:11]: Yeah, I think you're dead right. And I think it just comes back to what GeoHot said. It's like, if you want to win, there's no other option than out hustling everyone else. And so I think that's kind of out hustling in the sense really meaning building the best product, building the best experiences. And so I think that's the only way kind of almost any route and open source and stuff just kind of burns the ships in a sense. And maybe that's the simplest way of saying it. You're burning the ships, but also it builds a lot of goodwill. I mean, there's tons of benefits to it. Salesforce are doing that, right?
Itamar [00:55:43]: They're now going to be agent force or whatever. So you can also...
Swyx [00:55:47]: We're going to try to get Mark on the podcast. And they're good friends with Salesforce. Any parting thoughts, any trends that you're
Itamar [00:55:55]: super excited about? If we're talking about trends, I go back to our original podcast where we talked about the idea that the software world is built from specs, tests, and code. And I think you can see that one dimension are company startups that are rethinking the entire development environment, I think like Bolt, etc. And another dimension is where is their focus? Is it on the spec, is on the test and on the code? And I think it's interesting to see that from that view. We'll see more startup and more amazing announcements of new directions, new philosophy. So I think we'll see startup focusing, let's build everything from the spec. To some extent, I would say that Bolt is, from my understanding, you can say better, somewhere in the line between the spec and the code. Because you start, like I saw your demos, you're trying to describe things, not just in one row, because you want to look like you want it. So it's on that edge between connecting between spec and code. And you see others, I think all the IDEs, most of them are the new IDEs, or the fork are there. We are more focused from the test and to the code and to the spec, etc. So these are trends, I think we will see that. And I think another dimension to consider is, is it more for the highway AI, for the developers, maybe not even a technical person, or is it for the enterprise? And that also gives you different products. If they are aiming for different ICP, different ideal client profile, they will approach this triangle of spec and test and code. And that's how I see the world. And what I'm noticing is that we're seeing more and more of those new startups, new interfaces that are not focused on code. For example, talking more about the spec, talking more about the testing. Eventually, I think that that's where the world is going to. The code is going to be there, and there will be developers, etc. But as agent improves and capabilities of the LLMs and integrations to different parts of the development environment, we're going to see more and more focusing on the spec and the test. Basically, these two might unite, the spec and the test, because you can say that tests are runnable specs, to some extent. So that's another way to look at
Swyx [00:58:23]: it. Yeah, that is literally on the slide here, runnable tests, right here. Yeah, I'm consistent.
Itamar [00:58:27]: It's all consistent. Look, I talked about system one and system two more than a year ago. And now with O1, people are talking about system one. But I think we'll talk about it again, because I think they're totally, totally wrong about O1 being a system two. It is now in the hype or whatever, talking about that. But I think the agents are the ones that will take us towards system two. And the more they are aware of their environment, and aware of that sometimes they don't know what they don't know, then we'll really get to system two. But that's
Swyx [00:59:03]: a deeper discussion. It's a deeper discussion. I love the philosophy talk that we had last time as well. All right, so we're back on to Bolt, and Itamar had to leave for another interview. But we were just talking about what happened post-launch, right? And I held this emergency council of advisors for you, because we had never seen this before. And I was like, okay, I'm going to call all the smartest people I know to join this thing.
Eric [00:59:27]: Which was extremely helpful. And I'm so appreciative. There's been a handful of me.
Swyx [00:59:31]: You made one hire out of that.
Eric [00:59:34]: Yeah, because it was like, I think I can't remember where we were at kind of ARR-wise when I had messaged you.
Swyx [00:59:40]: It was like, you messaged me at like two or three. And then by the time we got everything together, it was four. And then, yeah, now it's at-
Alessio [00:59:48]: Since Eric sat down five minutes.
Swyx [00:59:52]: But I mean, it sounds like you accelerated, because you told me it was like 100k, 200k a day. And now it's accelerated?
Eric [00:59:58]: Yeah, this past- I mean, every week has been kind of a blowout week as far as- Is it TikTok? We're digging into the degree that we can of just like where all this stuff's coming from. I mean, there's a ton of word of mouth, right? So that you can't- which you can't just like look by refer, right? So there's a ton of direct. But yeah, I mean, there's a lot of TikTok. There's a ton of YouTube. It's kind of, I think, been a sensation in the sort of like entrepreneurial, build your own SaaS, indie hacker, even developer circles. And I think, too, our team's been doing a really good job. Our folks just kind of like flipped a switch. And people were just working through the weekends or whatever to get stuff fixed. And so the product- and you'll see people say this online. Like today, there was a tweet. Someone was like, yeah, I tried this like the first week and I couldn't get whatever to work. Came back today, six weeks later, and this is ridiculous. Like this is so good, right? And so I think there's been an incredible amount of improvement to the product, to the agent, also to like the underlying models, too. Like Sonnet, they just happened to do an update with their release a couple of weeks ago. And so when we put our new agent online and the new Sonnet, we saw a huge bump in conversion just based on that. And so yeah, we've gone at that. When we were chatting, that must have been three weeks ago, maybe an average of 100K ARR per day. And this week, I will see- I've said this every week, but we'll see if it holds. The past couple of days have been like half a million of ARR per day, which is insane. I think today we've had peak traffic, just kind of set the previous- and that's kind of been every day this week. But anyways, yeah, I think things just continue to accelerate, which is kind of blowing my mind, because it's just the sheer numbers of this stuff are just mind-boggling.
Alessio [01:01:40]: I think you almost suffered from the Twitter demo issues that other people had. The first time I saw Bolt, I saw the demo and I was like, oh, that's cool. I didn't go to try it because I was like, I've seen so many of these that it's like, I don't know if it's actually going to work. And then two days ago, I signed up to use it. I was building a Luma replacement. I'm done with Luma. And I was like, man, this thing really works. And I already knew you, of course. I was like, man, this thing really works. What the f**k? I was like, it's actually, I don't know if it's like the model, if it's like how you prompt it, but it's so good at coming up with the simplest thing to implement. So the Luma example, right? So first I was like, create a RSVP page for an event and it created a wedding RSVP. I don't know if it's your fault. I don't know if you bolted it. And then I was like, well, now it needs to have a way to create more events and added that. And then I was like, now it needs a way to like have an admin page to modify event. And maybe what I would have done as a developer is like, well, I'll create a different like admin view, you know, with all the events and then I'll have like the front end thing. And instead what it did is like, it created like a admin view with toggle on top and then like just a pencil button on every page to edit them in line, you know, and that was it. And I was like, yeah, that works just as well. And like for the model, that's probably the simplest way to do it because it like limits the amount of files that are there. Can you talk just more about how much of this is like the model coming out with it, how much you're prompting it to kind of like be very like
Eric [01:03:04]: compressed and concise. A ton of it is the model, but I think what's interesting though, is you're kind of baseline model. If I just like, if it's kind of like try and put it into like a, you know, way, if you had to quantify, quantify, you know, the effect is obviously the model is like this sort of like 10X multiplier. You're how good the bottom line model is huge, huge swing. And then kind of what you can do on top of that, you can squeeze out three, four X kind of more. And so that's kind of where the realm of, you know, prompt engineering and multi-agent approaches, et cetera, kind of kick in. And so I think, I think with us, you know, our folks, like the guy on our side that, you know, led the web engineering, like that kind of our core technology for the past, you know, seven years here, you know, his name is Dominic Elm based out of Germany and he was one of the founding engineers of the company. You had previous to StackBlitz, he actually was doing machine learning and he basically had built a StackBlitz, like online ID for machine learning. So I think like, I kind of like Google Colab sort of thing, or like Hugging Face has their kind of version of this. Back in 2016, it wasn't as much of a market for this stuff, but he had been doing a lot of, you know, training, you know, ML models and that sort of thing. So I guess, you know, as we began, you know, kind of digging into AI stuff over the past year, he's been kind of leading that off. And so a lot of it, I really attribute to Dom's specific angle, cause he has deep understanding of our technology and how it works. Cause he's, you know, led the engineering on web container, but as you know, deep understanding of how these models work going and actually kind of writing out these you know, whether it's like the, the, the prompt engineering aspect of it or multi-agent or whatever, have you, you know, that's sort of like that much context. And, and the, and the other folks on the team are, are, you know, in the same, same sort of spot that have been working on this stuff. I think we'd be able to squeeze out a lot more than I've seen almost anything else out there, at least in the term of building web apps, at least. But I guess I think it's, I think it's kind of just because we we have more context on, on a fewer number of heads at the company. So we can kind of connect the dots of it faster, you
Swyx [01:05:01]: know? Yeah. That's part of the issue with the whole raise a billion dollars thing. Like you actually run very lean and that's, that's actually been to your advantage.
Eric [01:05:08]: Totally. And I think, you know, and I think we, we have to staff up because I mean, we went from, you know, call it zero customers to, you know, 20, 30,000 kind of, you know, in six weeks, we have to have certainly more customer support, customer success stuff, et cetera. But you know, also just on, on engineering we have to ramp up, but I do think that there's a, we saw this in the 2021 cycle, right? Where, you know, adding tons more people can, can, can be a thing that really hurts, you know, the company because you can, it's just harder. It's really hard to manage lots of people. Not if you're a big enough company to warrant a certain headcount, a 100%, you kind of have to do it. Right. But I think for us, it's worked just to really grow, grow the team slowly and intentionally. And so I think we're going to take the same approach here at a bit of a faster clip than we were previously. But to me, that would just be general advice to startups is like slowly intentionally as fast as you can to meet demand or whatever. Part of what I felt like you're in a unique position to
Swyx [01:06:07]: talk about, but also kind of what we went through in our, in our call was I have PMF now, what is, is kind of what I've been saying. And so like, I think the first answer is hire a data scientist because we have to sort of figure out like from our data that you're now sitting on a ton of different customers and we don't really know the different customer segments. You're starting to get an idea of churn. You're starting to get an idea of like segmentation. You already had data enrichment. One of my most interesting quotes from you from that session was that because you were selling to enterprise for so long, you had already set up all that stuff and it's just like, wasn't useful for a more sort of developer bottom up centric approach.
Eric [01:06:46]: Yeah. And particularly because for the first time in the company's history, we're selling primarily to almost non-developers. And so everything that we've ever, all the playbooks we had not relevant here basically. Right. So the, and you're one of one of our investors I talked with earlier this week, basically brought up a really great point, which is like, you are now a B2C company and how you operate needs to reflect that.
Swyx [01:07:09]: Which is, which is what, I don't know.
Eric [01:07:11]: Which is basically from an analytics perspective, like you're tracking everything. Right. And then to your point, you have, you have people kind of around the clock slicing and dicing data to understand who are these people coming in, who are the types of people you actually want to retain versus people that, you know, are just going to churn out. And that's okay. Cause they're not the actual like ICP that you're going for. Right. When you're building stuff for enterprise software, the bar is a lot lower. And then to kind of to, from the conversation before one of the biggest, and this is kind of what we found with StackBlitz, which is kind of interesting, you know, you mentioned it, it's like, it's as a startup, it's very hard to sell on-prem extremely true. But if you can do it, it's like the promised land because you know, these, these companies you know, the fortune 500s, they can write really large checks. And so when you're going and selling to them, it doesn't matter so much like on your website. Sure. You want to track the conversion to the enterprise contact form or whatever. Right. But what, what actually really matters is like the, a lot of human touch points of, Hey, we want to have a quarterly call after just getting installed this stuff. There's a whole playbook for that. And you need to hire sales engineers that can be on the ground floor and helping people install it. Then after that, you got to, okay, how do we make sure they're kind of constantly successful? Because you can't access like we can, our enterprise customer instances, we have no idea how often they're using them. Why? Because the whole point is that we can't see what they're up to for a good reason, right? Like they, they need to own their data. And so the way it's actually much, a very complicated problem of how do you have like build relationships where everyone's getting on calls, they can share kind of the telemetry that, that they can see within their instance. And you can kind of extrapolate that and make sure they're happy and successful. So that's, there's a whole art of that, of doing enterprise well, that we've gone and done and closed these folks totally unrelated to doing BC completely, completely unrelated for the most part. So anyway, so that, so that, you know, we're, as a company, we're, we're kind of reorienting, you know, our focus on, okay, going and actually really leaning in on analytics, whatever have you. And fortunately, like my co-founder and I, the art, the enterprise business of stack was, was the first time we had ever done enterprise primarily like things to the company we did before was B2C. Like we were selling people courses on how to do web development basically. Right. So a lot of the skillset that, you know, I had built up there, I able to pull that back off the shelf, dust it off, sharpen the blade. And, you know, we're doing email marketing, we're doing live streams, you know? So, so that's, it's, it's kind of cool to, you know, be shifting back to some of the, the, the, where we cut our teeth on back in the day.
Alessio [01:09:35]: How did you pick the pricing? Because I had to pay.
Swyx [01:09:38]: That's fantastic. You want to like slight, slightly like, yeah, you got a bit. It's like,
Alessio [01:09:44]: you're running out of tokens, dude. I was like, f**k, I'm running out of tokens. It's like, I don't want to run out of tokens, but there's like five different tiers. Yeah. Right. Which are kind of like token based and capacity based. Yep. How do you kind of reconcile that? And the consumer side where maybe the consumer doesn't even really need to know what a token is, right? Like on that, like your mom probably doesn't really care what an AI token is. How did you structure it to start? How did you come up with that? And then maybe ideas that you have to like improve or like modify that.
Eric [01:10:12]: Totally. Yeah. So we, so when we first launched with StackBlitz is like, we were an enterprise play, right? And so when we launched in 2017, I think we tried pricing 2018 or 2019, but like it was free for a long time. And then we had a 9𝑝𝑙𝑎𝑛𝑎𝑛𝑑𝑡ℎ𝑎𝑡𝑤𝑎𝑠𝑗𝑢𝑠𝑡𝑡ℎ𝑒𝑤𝑎𝑦𝑖𝑡𝑤𝑎𝑠.𝐼𝑡𝑤𝑎𝑠,𝑖𝑡𝑤𝑎𝑠𝑘𝑖𝑛𝑑𝑜𝑓𝑙𝑖𝑘𝑒𝑜𝑢𝑟,𝑜𝑢𝑟𝑑𝑜𝑙𝑙𝑎𝑟50ℎ𝑜𝑡𝑑𝑜𝑔𝑎𝑡𝐶𝑜𝑠𝑡𝑐𝑜.𝐼𝑡′𝑠𝑘𝑖𝑛𝑑𝑜𝑓𝑙𝑖𝑘𝑒𝑡ℎ𝑖𝑠,𝑡ℎ𝑖𝑠,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑗𝑢𝑠𝑡𝑙𝑜𝑤𝑝𝑟𝑖𝑐𝑒,𝑗𝑢𝑠𝑡,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑖𝑡,𝑖𝑡𝑤𝑎𝑠𝑛′𝑡𝑡ℎ𝑒𝑝𝑟𝑖𝑚𝑎𝑟𝑦𝑟𝑒𝑣𝑑𝑟𝑖𝑣𝑒𝑟𝑎𝑛𝑑𝑤𝑒𝑗𝑢𝑠𝑡𝑤𝑎𝑛𝑡𝑒𝑑𝑡𝑜,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑠𝑎𝑦,𝐻𝑒𝑦,𝑝𝑎𝑦𝑓𝑜𝑟𝑠𝑜𝑚𝑒𝑚𝑜𝑟𝑒𝑠𝑡𝑜𝑟𝑎𝑔𝑒𝑎𝑛𝑑𝑝𝑟𝑖𝑣𝑎𝑡𝑒𝑝𝑟𝑜𝑗𝑒𝑐𝑡𝑠𝑜𝑟𝑤ℎ𝑎𝑡𝑒𝑣𝑒𝑟.𝐴𝑛𝑑𝑠𝑜𝑤𝑒𝑤𝑒𝑛𝑡𝑡𝑜𝑙𝑎𝑢𝑛𝑐ℎ𝑏𝑜𝑙𝑡𝑎𝑔𝑎𝑖𝑛,𝑙𝑖𝑘𝑒𝑜𝑢𝑟𝑒𝑥𝑝𝑒𝑐𝑡𝑎𝑡𝑖𝑜𝑛𝑤𝑎𝑠,𝐻𝑒𝑦,𝑤𝑒′𝑙𝑙𝑝𝑟𝑜𝑏𝑎𝑏𝑙𝑦𝑔𝑒𝑡𝑎𝑔𝑜𝑜𝑑𝑛𝑢𝑚𝑏𝑒𝑟𝑜𝑓𝑝𝑒𝑜𝑝𝑙𝑒𝑡ℎ𝑎𝑡′𝑙𝑙𝑠𝑖𝑔𝑛𝑢𝑝𝑎𝑛𝑑𝑏𝑒𝑒𝑥𝑐𝑖𝑡𝑒𝑑𝑎𝑏𝑜𝑢𝑡𝑖𝑡.𝐴𝑛𝑑𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑤𝑒′𝑟𝑒𝑛𝑜𝑡𝑡𝑜𝑜𝑐𝑜𝑛𝑐𝑒𝑟𝑛𝑒𝑑,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑤𝑒′𝑟𝑒𝑗𝑢𝑠𝑡,𝑤𝑒′𝑟𝑒𝑗𝑢𝑠𝑡𝑛𝑜𝑡,𝑤𝑒𝑤𝑒𝑟𝑒𝑢𝑛𝑝𝑟𝑒𝑝𝑎𝑟𝑒𝑑𝑓𝑜𝑟𝑡ℎ𝑒𝑡𝑠𝑢𝑛𝑎𝑚𝑖𝑡ℎ𝑎𝑡ℎ𝑖𝑡.𝐴𝑛𝑑𝑠𝑜𝑎𝑓𝑡𝑒𝑟𝑔𝑜𝑖𝑛𝑔𝑜𝑛𝑙𝑖𝑛𝑒𝑡ℎ𝑒𝑓𝑖𝑟𝑠𝑡𝑤𝑒𝑒𝑘,𝑤𝑒𝑤𝑒𝑟𝑒𝑙𝑖𝑘𝑒,𝑤𝑜𝑤,𝑡ℎ𝑖𝑠𝑖𝑠𝑐𝑜𝑜𝑙.𝑇ℎ𝑒𝑟𝑒′𝑠𝑎,𝐼𝑚𝑒𝑎𝑛,𝑖𝑡𝑗𝑢𝑠𝑡𝑘𝑒𝑝𝑡𝑔𝑟𝑜𝑤𝑖𝑛𝑔.𝐴𝑛𝑑𝑡ℎ𝑒𝑛𝑜𝑛𝑐𝑒𝑤𝑒ℎ𝑖𝑡𝑤𝑒𝑒𝑘𝑡𝑤𝑜,𝐼𝑚𝑒𝑎𝑛,𝑤𝑒𝑤𝑒𝑟𝑒𝑗𝑢𝑠𝑡𝑛𝑖𝑛𝑒𝑏𝑢𝑐𝑘𝑠𝑤𝑎𝑠,𝐼𝑚𝑒𝑎𝑛,𝑖𝑡′𝑠𝑙𝑖𝑘𝑒𝑡ℎ𝑒𝑐ℎ𝑒𝑎𝑝𝑒𝑠𝑡𝐴𝐼𝑐𝑜𝑑𝑖𝑛𝑔𝑡ℎ𝑖𝑛𝑔𝑦𝑜𝑢𝑐𝑎𝑛𝑔𝑒𝑡𝑚𝑎𝑦𝑏𝑒𝑜𝑡ℎ𝑒𝑟𝑡ℎ𝑎𝑛𝑐𝑜𝑝𝑖𝑙𝑜𝑡,𝑏𝑢𝑡𝑙𝑖𝑘𝑒𝑤𝑒𝑤𝑒𝑟𝑒𝑜𝑣𝑒𝑟𝑟𝑢𝑛𝑏𝑦𝑠𝑢𝑝𝑝𝑜𝑟𝑡𝑡𝑖𝑐𝑘𝑒𝑡𝑠.𝐴𝑛𝑑𝐼𝑗𝑢𝑠𝑡,𝑎𝑛𝑑𝑗𝑢𝑠𝑡𝑡ℎ𝑒𝑠ℎ𝑒𝑒𝑟𝑣𝑜𝑙𝑢𝑚𝑒𝑜𝑓𝑝𝑒𝑜𝑝𝑙𝑒𝑐𝑜𝑚𝑖𝑛𝑔𝑖𝑛𝑎𝑛𝑑𝑖𝑡𝑗𝑢𝑠𝑡𝑙𝑎𝑤𝑠𝑜𝑓𝑠𝑢𝑝𝑝𝑙𝑦𝑎𝑛𝑑𝑑𝑒𝑚𝑎𝑛𝑑.𝑊𝑒𝑤𝑒𝑟𝑒𝑙𝑖𝑘𝑒,𝑜𝑘𝑎𝑦,𝑡ℎ𝑖𝑠𝑖𝑠𝑛′𝑡,𝑡ℎ𝑒𝑟𝑒′𝑠𝑛𝑜𝑤𝑎𝑦𝑤𝑒𝑐𝑎𝑛𝑠𝑐𝑎𝑙𝑒𝑡𝑜𝑚𝑒𝑒𝑡𝑡ℎ𝑖𝑠.𝐴𝑙𝑠𝑜𝑡ℎ𝑒𝑝𝑒𝑜𝑝𝑙𝑒𝑐𝑜𝑚𝑖𝑛𝑔𝑖𝑛𝑎𝑟𝑒𝑏𝑢𝑟𝑛𝑖𝑛𝑔𝑡ℎ𝑟𝑜𝑢𝑔ℎ𝑡ℎ𝑒𝑖𝑟𝑡𝑜𝑘𝑒𝑛𝑠𝑎𝑛𝑑𝑡ℎ𝑒𝑟𝑒′𝑠𝑛𝑜𝑤𝑎𝑦𝑡𝑜𝑎𝑐𝑡𝑢𝑎𝑙𝑙𝑦𝑙𝑖𝑘𝑒𝑏𝑢𝑦𝑚𝑜𝑟𝑒𝑜𝑓𝑡ℎ𝑒𝑠𝑒𝑡ℎ𝑖𝑛𝑔𝑠.𝐴𝑛𝑑𝑛𝑖𝑛𝑒𝑏𝑢𝑐𝑘𝑠𝑖𝑠𝑗𝑢𝑠𝑡,𝑦𝑜𝑢𝑐𝑎𝑛′𝑡𝑔𝑒𝑡𝑡ℎ𝑎𝑡𝑚𝑢𝑐ℎ𝑖𝑛𝑓𝑒𝑟𝑒𝑛𝑐𝑒𝑜𝑢𝑡𝑜𝑓𝑡ℎ𝑎𝑡.𝐴𝑛𝑑𝑠𝑜𝑡ℎ𝑒,ℎ𝑒𝑟𝑒′𝑠𝑡ℎ𝑒𝑜𝑡ℎ𝑒𝑟𝑡ℎ𝑖𝑛𝑔𝑡ℎ𝑎𝑡′𝑠𝑖𝑛𝑡𝑒𝑟𝑒𝑠𝑡𝑖𝑛𝑔𝑎𝑏𝑜𝑢𝑡𝑏𝑜𝑙𝑡𝑐𝑜𝑚𝑝𝑎𝑟𝑒𝑑𝑡𝑜𝑙𝑖𝑘𝑒𝑠𝑜𝑚𝑒𝑡ℎ𝑖𝑛𝑔𝑙𝑖𝑘𝑒𝑐𝑜𝑝𝑖𝑙𝑜𝑡𝑜𝑟𝑤ℎ𝑎𝑡𝑒𝑣𝑒𝑟.𝐴𝑛𝑑𝑡ℎ𝑖𝑠𝑘𝑖𝑛𝑑𝑜𝑓𝑡𝑖𝑒𝑑𝑡ℎ𝑖𝑠,𝑠𝑜𝑟𝑟𝑦,𝑎𝑙𝑖𝑡𝑡𝑙𝑒𝑏𝑖𝑡𝑜𝑓𝑎𝑟𝑜𝑢𝑛𝑑𝑎𝑏𝑜𝑢𝑡𝑤𝑎𝑦𝑡𝑜𝑎𝑛𝑠𝑤𝑒𝑟𝑦𝑜𝑢𝑟𝑞𝑢𝑒𝑠𝑡𝑖𝑜𝑛.𝐵𝑢𝑡𝑏𝑎𝑠𝑖𝑐𝑎𝑙𝑙𝑦𝑤ℎ𝑎𝑡𝑤𝑒𝑒𝑛𝑑𝑒𝑑𝑢𝑝𝑎𝑡𝑡ℎ𝑎𝑡𝑚𝑜𝑚𝑒𝑛𝑡,𝑤𝑒𝑒𝑛𝑑𝑒𝑑𝑢𝑝𝑟𝑒𝑎𝑙𝑖𝑧𝑖𝑛𝑔𝑖𝑠𝑡ℎ𝑎𝑡𝑤ℎ𝑒𝑛𝑦𝑜𝑢𝑢𝑠𝑒𝑐𝑜𝑝𝑖𝑙𝑜𝑡,𝑤ℎ𝑎𝑡𝑖𝑡′𝑠𝑠𝑒𝑛𝑑𝑖𝑛𝑔𝑢𝑝,𝑖𝑡𝑑𝑜𝑒𝑠𝑛′𝑡𝑝𝑟𝑜𝑣𝑖𝑑𝑒𝑎𝑙𝑜𝑡𝑜𝑓𝑐𝑜𝑛𝑡𝑒𝑥𝑡𝑜𝑓𝑦𝑜𝑢𝑟𝑐𝑜𝑑𝑒𝑏𝑎𝑠𝑒.𝑇ℎ𝑒𝑦𝑡𝑟𝑦𝑎𝑛𝑑𝑟𝑒𝑑𝑢𝑐𝑒𝑡ℎ𝑒𝑎𝑚𝑜𝑢𝑛𝑡𝑜𝑓𝑐𝑜𝑛𝑡𝑒𝑥𝑡𝑎𝑠𝑚𝑢𝑐ℎ𝑎𝑠𝑡ℎ𝑒𝑦𝑐𝑎𝑛.𝐴𝑛𝑑𝐼𝑡ℎ𝑖𝑛𝑘,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑡ℎ𝑒𝑜𝑟𝑖𝑔𝑖𝑛𝑠𝑜𝑓𝑡ℎ𝑖𝑠𝑠𝑡𝑢𝑓𝑓𝑖𝑠𝑡ℎ𝑒𝑦,𝑒𝑣𝑒𝑟𝑦𝑜𝑛𝑒𝑘𝑖𝑛𝑑𝑜𝑓𝑤𝑎𝑛𝑡𝑠𝑡ℎ𝑖𝑠𝑙𝑖𝑘𝑒𝑙𝑜𝑤𝑝𝑟𝑖𝑐𝑒𝑝𝑜𝑖𝑛𝑡𝑤ℎ𝑒𝑟𝑒𝑖𝑡′𝑠𝑙𝑖𝑘𝑒𝑎𝑙𝑙𝑦𝑜𝑢𝑐𝑎𝑛𝑒𝑎𝑡.𝑆𝑜𝑖𝑡𝑗𝑢𝑠𝑡𝑘𝑖𝑛𝑑𝑜𝑓,𝑡ℎ𝑎𝑡𝑘𝑖𝑛𝑑𝑜𝑓𝑓𝑒𝑒𝑙𝑠𝑙𝑖𝑘𝑒,𝑐𝑎𝑢𝑠𝑒𝑖𝑡′𝑠𝑙𝑖𝑘𝑒,𝑖𝑡𝑎𝑙𝑚𝑜𝑠𝑡𝑙𝑖𝑘𝑒𝑁𝑒𝑡𝑓𝑙𝑖𝑥,𝑖𝑡′𝑠𝑙𝑖𝑘𝑒,𝐼′𝑙𝑙𝑝𝑎𝑦𝑎𝑡ℎ𝑖𝑛𝑔.𝐴𝑛𝑑𝑡ℎ𝑒𝑛𝐼𝑐𝑎𝑛𝑗𝑢𝑠𝑡𝑑𝑜𝑎𝑠𝑚𝑢𝑐ℎ𝑜𝑓𝑡ℎ𝑒𝑚𝑜𝑣𝑖𝑒𝑤𝑎𝑡𝑐ℎ𝑖𝑛𝑔𝑎𝑠𝐼𝑤𝑎𝑛𝑡.𝐴𝑛𝑑𝐼𝑡ℎ𝑖𝑛𝑘,𝐼𝑡ℎ𝑖𝑛𝑘𝑡ℎ𝑎𝑡,𝑡ℎ𝑎𝑡𝑘𝑖𝑛𝑑𝑜𝑓𝑚𝑒𝑛𝑡𝑎𝑙𝑖𝑡𝑦,𝑤ℎ𝑒𝑛𝑡ℎ𝑒𝑠𝑒𝑓𝑖𝑟𝑠𝑡𝐴𝐼𝑝𝑟𝑜𝑑𝑢𝑐𝑡𝑠𝑐𝑎𝑚𝑒,𝑖𝑡𝑘𝑖𝑛𝑑𝑜𝑓𝑚𝑎𝑘𝑒𝑠𝑠𝑒𝑛𝑠𝑒.𝑇ℎ𝑒𝑦′𝑟𝑒𝑙𝑖𝑘𝑒,𝑜𝑘𝑎𝑦,𝑤𝑒𝑙𝑙𝑤𝑒,𝑤𝑒𝑑𝑜𝑛′𝑡𝑤𝑎𝑛𝑡𝑡𝑜𝑚𝑒𝑡𝑒𝑟𝑖𝑡.𝐶𝑎𝑢𝑠𝑒𝑡ℎ𝑎𝑡𝑑𝑜𝑒𝑠𝑛′𝑡𝑓𝑒𝑒𝑙𝑔𝑜𝑜𝑑.𝑅𝑖𝑔ℎ𝑡.𝐵𝑢𝑡𝑡ℎ𝑒𝑝𝑟𝑜𝑏𝑙𝑒𝑚𝑖𝑠𝑡ℎ𝑎𝑡𝑡ℎ𝑒𝑛𝑡ℎ𝑒𝑦′𝑟𝑒𝑖𝑛𝑐𝑒𝑛𝑡𝑖𝑣𝑖𝑧𝑒𝑑𝑡𝑜𝑛𝑜𝑡ℎ𝑎𝑣𝑒𝑖𝑡𝑏𝑒𝑎𝑏𝑙𝑒𝑡𝑜𝑘𝑒𝑒𝑝𝑡ℎ𝑒𝑚𝑜𝑟𝑒𝑐𝑜𝑛𝑡𝑒𝑥𝑡𝑦𝑜𝑢𝑔𝑖𝑣𝑒𝑖𝑡,𝑡ℎ𝑒𝑚𝑜𝑟𝑒𝑖𝑡𝑐𝑎𝑛𝑑𝑜.𝐴𝑛𝑑𝑡ℎ𝑎𝑡′𝑠𝑡ℎ𝑒𝑚𝑎𝑔𝑖𝑐𝑜𝑓𝑤ℎ𝑎𝑡𝑤𝑒′𝑟𝑒𝑑𝑜𝑖𝑛𝑔𝑤𝑖𝑡ℎ𝑏𝑜𝑙𝑑𝑖𝑠𝑤𝑒′𝑟𝑒𝑔𝑖𝑣𝑖𝑛𝑔𝑖𝑡𝑎𝑙𝑙𝑡ℎ𝑒𝑐𝑜𝑛𝑡𝑒𝑥𝑡𝑤𝑒𝑝𝑜𝑠𝑠𝑖𝑏𝑙𝑦𝑐𝑎𝑛.𝐴𝑛𝑑𝑡ℎ𝑎𝑡′𝑠𝑤ℎ𝑦𝑦𝑜𝑢𝑐𝑎𝑛𝑔𝑜𝑡𝑜𝑖𝑡𝑎𝑛𝑑𝑠𝑎𝑦,𝑚𝑎𝑘𝑒𝑚𝑒𝑎𝑛𝑅𝑆𝑉𝑃𝑠𝑖𝑡𝑒.𝐴𝑛𝑑𝑖𝑡𝑑𝑜𝑒𝑠𝑛′𝑡𝑏𝑒𝑐𝑎𝑢𝑠𝑒𝑖𝑡ℎ𝑎𝑠𝑐𝑜𝑛𝑡𝑒𝑥𝑡,𝑡ℎ𝑒𝑒𝑛𝑡𝑖𝑟𝑒𝑠𝑡𝑎𝑡𝑒𝑜𝑓𝑡ℎ𝑒𝑎𝑝𝑝𝑙𝑖𝑐𝑎𝑡𝑖𝑜𝑛,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑒𝑡𝑐𝑒𝑡𝑒𝑟𝑎,𝑒𝑡𝑐𝑒𝑡𝑒𝑟𝑎.𝐴𝑛𝑑𝑡ℎ𝑎𝑡′𝑠𝑤ℎ𝑎𝑡𝑚𝑎𝑘𝑒𝑠𝑖𝑡𝑠𝑜𝑎𝑐𝑐𝑢𝑟𝑎𝑡𝑒.𝑉𝑒𝑟𝑠𝑢𝑠𝑖𝑓𝑦𝑜𝑢𝑔𝑜𝑡𝑜𝑐𝑜−𝑝𝑖𝑙𝑜𝑡𝑎𝑛𝑑𝑠𝑎𝑦𝑡ℎ𝑎𝑡𝑖𝑡,𝑡ℎ𝑒𝑟𝑒′𝑙𝑙𝑏𝑒,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑖𝑡𝑚𝑖𝑔ℎ𝑡𝑝𝑢𝑛𝑐ℎ𝑜𝑢𝑡𝑎𝑟𝑒𝑎𝑐𝑡𝑐𝑜𝑚𝑝𝑜𝑛𝑒𝑛𝑡.𝑇ℎ𝑎𝑡′𝑠𝑡ℎ𝑒𝑏𝑢𝑡𝑡𝑜𝑛𝑡𝑜𝑐𝑟𝑒𝑎𝑡𝑒𝑡ℎ𝑒𝑡ℎ𝑖𝑛𝑔,𝑏𝑢𝑡𝑛𝑜𝑡𝑎𝑐𝑡𝑢𝑎𝑙𝑙𝑦𝑚𝑜𝑟𝑒𝑡ℎ𝑎𝑛𝑡ℎ𝑎𝑡.𝑆𝑜𝑎𝑛𝑦𝑤𝑎𝑦,𝑠𝑜,𝑢𝑚,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑎𝑛𝑑𝑎𝑡𝑡ℎ𝑒𝑡𝑖𝑚𝑒𝑤ℎ𝑒𝑛𝑝𝑒𝑜𝑝𝑙𝑒ℎ𝑎𝑣𝑒𝑏𝑜𝑢𝑔ℎ𝑡𝑡ℎ𝑒9planandthatwasjustthewayitwas.Itwas,itwaskindoflikeour,ourdollar50hotdogatCostco.It′skindoflikethis,this,youknow,justlowprice,just,youknow,it,itwasn′ttheprimaryrevdriverandwejustwantedto,youknow,say,Hey,payforsomemorestorageandprivateprojectsorwhatever.Andsowewenttolaunchboltagain,likeourexpectationwas,Hey,we′llprobablygetagoodnumberofpeoplethat′llsignupandbeexcitedaboutit.Andyouknow,we′renottooconcerned,youknow,we′rejust,we′rejustnot,wewereunpreparedforthetsunamithathit.Andsoaftergoingonlinethefirstweek,wewerelike,wow,thisiscool.There′sa,Imean,itjustkeptgrowing.Andthenoncewehitweektwo,Imean,wewerejustninebuckswas,Imean,it′slikethecheapestAIcodingthingyoucangetmaybeotherthancopilot,butlikewewereoverrunbysupporttickets.AndIjust,andjustthesheervolumeofpeoplecominginanditjustlawsofsupplyanddemand.Wewerelike,okay,thisisn′t,there′snowaywecanscaletomeetthis.Alsothepeoplecominginareburningthroughtheirtokensandthere′snowaytoactuallylikebuymoreofthesethings.Andninebucksisjust,youcan′tgetthatmuchinferenceoutofthat.Andsothe,here′stheotherthingthat′sinterestingaboutboltcomparedtolikesomethinglikecopilotorwhatever.Andthiskindoftiedthis,sorry,alittlebitofaroundaboutwaytoansweryourquestion.Butbasicallywhatweendedupatthatmoment,weendeduprealizingisthatwhenyouusecopilot,whatit′ssendingup,itdoesn′tprovidealotofcontextofyourcodebase.Theytryandreducetheamountofcontextasmuchastheycan.AndIthink,youknow,theoriginsofthisstuffisthey,everyonekindofwantsthislikelowpricepointwhereit′slikeallyoucaneat.Soitjustkindof,thatkindoffeelslike,causeit′slike,italmostlikeNetflix,it′slike,I′llpayathing.AndthenIcanjustdoasmuchofthemoviewatchingasIwant.AndIthink,Ithinkthat,thatkindofmentality,whenthesefirstAIproductscame,itkindofmakessense.They′relike,okay,wellwe,wedon′twanttometerit.Causethatdoesn′tfeelgood.Right.Buttheproblemisthatthenthey′reincentivizedtonothaveitbeabletokeepthemorecontextyougiveit,themoreitcando.Andthat′sthemagicofwhatwe′redoingwithboldiswe′regivingitallthecontextwepossiblycan.Andthat′swhyyoucangotoitandsay,makemeanRSVPsite.Anditdoesn′tbecauseithascontext,theentirestateoftheapplication,youknow,etcetera,etcetera.Andthat′swhatmakesitsoaccurate.Versusifyougotoco−pilotandsaythatit,there′llbe,youknow,itmightpunchoutareactcomponent.That′sthebuttontocreatethething,butnotactuallymorethanthat.Soanyway,so,um,youknow,andatthetimewhenpeoplehaveboughtthe9 plan, they were like, I want to give you more money. I want you to buy more tokens. How do I do that? And so our team scrambled that weekend, we just turned it around and just, you know, we said, okay, well, what do we think is reasonable? And we said, okay, so let's go, you immediately double the prices of the, of the base tier, because it's just not enough what people are getting on for nine bucks. So that'll be, that seems reasonable. It's kind of in line with everyone else. And then we added 50, 100 and $200 plans. Cause we're like, that should be enough. And so, yeah, so that, that's kind of the origins of it. And, and, um, it was, it was people that use it, fall in love with that and they want to use more of it. And the problem is the inference is expensive. And so we're not actually taking, you know, to date on the, on the revenue we've done, we have not really taken a margin at all on this stuff. Cause we're just trying to put all the value back into the folks that are there using the tool and just getting the maximum amount of value out of it. But it's really key to the kind of the magic of the experience. And so the other, the other thing kind of worth mentioning is there's kind of the ARR number, but then we, you can also buy additional tokens, you know, just with usage-based billing effectively. And that's accounting for an additional 20, 30% of, of revenue that's coming to the company. People are actually using this to do their jobs. Like, you think, think about a web development agency before this thing, they're going in using Figma to make a design. They have to pay the designer. They have to like punch that out into code, kind of man. And maybe like co-pilot can help a little bit with punching out this thing that they're coming to this thing. And there's just wild stories online where it's like guy bake, local bakeries, like we need a website. He's like, okay, well, I'm going to charge you a thousand bucks. They're like, okay, that sounds great. Reasonable price. 30 minutes later, he's like, here's a deploy preview of your thing. How does that look? They're like, wow, holy crap. I'm not giving you a thousand bucks. But they did, they were, they were, they were like, this usually takes months, you know? So some of the biggest power users are people that build websites for a living because this is the, the alpha on this is insane.
Alessio [01:14:26]: That's almost like the gap, right? It's like, it used to be that if I ask you before this to do a website and in 30 minutes you return to me and you give me something, I'm like, you know, you're probably just copying something else you've done before, you know, versus now it's almost like, it doesn't really matter how much time it takes you because everybody's going to be so fast with these things. It's more like the value. And that's why when you're pricing TRL, it was almost like, there's only really going to be like either 20𝑎𝑚𝑜𝑛𝑡ℎ𝑢𝑠𝑒𝑟𝑠𝑜𝑟𝑙𝑖𝑘𝑒𝑎𝑡ℎ𝑜𝑢𝑠𝑎𝑛𝑑𝑑𝑜𝑙𝑙𝑎𝑟𝑠𝑎𝑚𝑜𝑛𝑡ℎ𝑢𝑠𝑒𝑟𝑠.𝑌𝑜𝑢𝑘𝑛𝑜𝑤,𝑖𝑡′𝑠𝑎𝑙𝑚𝑜𝑠𝑡𝑙𝑖𝑘𝑒𝑤ℎ𝑜′𝑠𝑔𝑜𝑖𝑛𝑔𝑡𝑜𝑢𝑠𝑒𝑡ℎ𝑒20amonthusersorlikeathousanddollarsamonthusers.Youknow,it′salmostlikewho′sgoingtousethe50 a month because it's kind of like in between, between being infrequent user and being like a power user, you know? So yeah, it makes sense that you have like a big part of like on demand
Eric [01:15:05]: on top of that. Yeah. And on the 50, there's actually a lot of people on the one. I think it's because it's like enough to actually like for developers are using this to just kind of like punch out components or designs or whatever, kind of gets them enough for, you know, kind of in a given month or whatever. And so it's been interesting to just kind of see the, the, you know, the, the upgrades that happen, but what's been kind of cool about the product is it's, and again, I think this is kind of novel and this is, you know, us being maybe a little more transparent than we should be or something, but like, I suspect we're just, I think we're going to see a lot more of this because we're hitting an inflection point coming back to the co-pilot thing. Part of the problem before is that it didn't matter if you provided more context, the models just weren't good enough to know what to even do with it. That's not the case now. You know, just one, one, you know, story of like one of the first people, one of the power, first power users that adopted Bolt was this gal in Thailand who's a PM at a software banking company. And she had an idea for this app called viralhooks.ai, which is basically, it's a tool that if you want to make viral TikToks and stuff, it's like, what's the hook of the video to make people watch. Right. And so basically she, you know, you can go and like, see, it goes and extracts hooks from other people's videos and helps you with like, you know, AI to write your own. And she had originally put the week before Bolt launched, she put that on Upwork and you know, some, I think a developer in like Ukraine had quoted her, you know, 5,000.𝐴𝑛𝑑𝑖𝑡′𝑠𝑔𝑜𝑖𝑛𝑔𝑡𝑜𝑡𝑎𝑘𝑒𝑙𝑖𝑘𝑒𝑡ℎ𝑟𝑒𝑒𝑚𝑜𝑛𝑡ℎ𝑠𝑜𝑟𝑠𝑜𝑚𝑒𝑡ℎ𝑖𝑛𝑔𝑙𝑖𝑘𝑒𝑡ℎ𝑎𝑡.𝑅𝑒𝑎𝑠𝑜𝑛𝑎𝑏𝑙𝑒𝑡𝑖𝑚𝑒𝑓𝑟𝑎𝑚𝑒,𝑟𝑖𝑔ℎ𝑡.𝐹𝑜𝑟𝑎𝑛𝑎𝑝𝑝𝑙𝑖𝑘𝑒𝑡ℎ𝑎𝑡,𝑟𝑒𝑎𝑠𝑜𝑛𝑎𝑏𝑙𝑒𝑝𝑟𝑖𝑐𝑒.𝑇ℎ𝑒𝑤𝑒𝑒𝑘𝑎𝑓𝑡𝑒𝑟𝑡ℎ𝑎𝑡𝐵𝑜𝑙𝑡𝑐𝑎𝑚𝑒𝑜𝑢𝑡,𝑠ℎ𝑒𝑏𝑜𝑢𝑔ℎ𝑡𝑡ℎ𝑒5,000.Andit′sgoingtotakelikethreemonthsorsomethinglikethat.Reasonabletimeframe,right.Foranapplikethat,reasonableprice.TheweekafterthatBoltcameout,sheboughtthe50 plan and she had the app built within a week or two. And so you're talking about like, that's it. And it's beautiful. She did an incredible job. Right. And so the numbers are wild. 5,000,𝑡ℎ𝑟𝑒𝑒𝑚𝑜𝑛𝑡ℎ𝑠𝑡𝑜5,000,threemonthsto50 and like a week. Yeah. You got to charge more. So it's, it's kind of like, so there's, there's people like when we've had a lot of people go, this pricing is insane. And we're like, well, we're not even taking really a margin at the moment on it, you know, but also, but when you, when you compare that to the price of actually going and building the cost of building quality software today, anyone who knows the price of building quality software, the alpha is obvious, right? It's a 99% cost production and five X faster, you know, delivery time, you know? So anyway, so that's, I think we're one of the first products that have actually come out kind of proving that, you know, in, in, in a revenue way to kind of underscore the point, as you can imagine, we've had, you know, kind of venture capital firms kind of reach out and kind of, you know, curious to kind of, you know, what we're up to or whatever. And so, you know, one of the most, you know, there's kind of one of the, the most notable ones or whatever reached out. So we kind of sent them, you know, you know, kind of our numbers. Actually it was the investor update, Sean, that, that I think you, you know, the, you know, the one you saw kind of gave him a snapshot of it. And they one of their analysts accidentally replied all on what we had sent them and with, with the analysis. And so on this part there, you know, one of the things they said was we haven't seen anything that's kind of eyeopening to see people going to $200 tier on this sort of thing. Haven't seen anything else like that in the space. Cause I think this is very new because of the new model capabilities, right? Where people, you know, it makes sense. Like you're willing to pay more money for this stuff. So. This is something I've talked about before in terms of matching
Swyx [01:18:11]: the dollar amount of spend to the capabilities of the AIs. The chart that I published in the past was, you know, OpenAI has like five levels of AGI-ness and level, level one is sort of like a chatbots, level two is reasoning, level three is agents, four is organizations, five is some, something super, super human. I don't remember what the exact levels are, but each, you can sort of each match each of them with like tiers. Like 20𝑖𝑠𝑙𝑖𝑘𝑒𝑡ℎ𝑒𝑐ℎ𝑎𝑡𝐺𝐵𝑇𝑡𝑖𝑒𝑟.20islikethechatGBTtier.200 is where you're at. 2,000𝑖𝑠ℎ𝑖𝑔ℎ𝑒𝑟,2,000ishigher,20,000, $200,000, right? Like you can see levels where it makes sense. I think BrightWave is also there, by the way. Like I don't know what BrightWave charges, but it's higher, right? Than a chatGBT. And like, you have to deliver more value for that, but you, you can do it now. Yep. So then why not? Everyone should do it.
Eric [01:18:58]: I think we're going to see a lot more of this. I think we're going to see, I think, you know, for AI, Cogen specifically, this is the first moment where I think that there's been that moment where it goes from zero to one, where it's like, yep. The price point, you know, the value, the value is so, like what you can get out of these things is so much higher than it was, you know, three, six months ago that I think we're going to see, I think we're going to see a lot more of this. Like we might, you know, Bolt is, I think one of the first things, but yeah, I mean, it's just, to me, it's inevitable that we're going to see many more things kind of leveraging this, this sort of use case and the amount of efficiency you can get out of using
Alessio [01:19:38]: these systems. Right. So yeah. Yeah. Yeah. Because I mean, the Bolt arbitrage would be quote the price based on the query, you know, you're selling high value tokens. Yeah. It's like, Hey, it's like your mom is like, you wouldn't charge your mom $2,000 to tell her stories, but like, you know, this person doing an app and like a product on it. Yeah. You got to pay more, you know, but it's hard right now. I understand. It's like, it's really hard to figure out how much you can push it, how much value the person will get out
Swyx [01:20:04]: of the thing. Yeah. So I want to riff a little bit on like stuff like this, right? I think you nailed a lot with the design system. You know, one of the differences between open source Bolt and the one that you have is actually like you, you spend a lot of time on the design system. I think, right. Most things just look great when they come out, but I think there's also a whole backend portion that they need. Was that a challenge? Is there anything that you sort of like figuring out that you want to riff on? Yeah. So I think one of the main things,
Eric [01:20:28]: I think you hit the nail on the head, which is, you know, kind of going into putting Bolt online. We originally, again, we've been selling to developers and so we were kind of like, this is a tool for prototyping and they'll download their code. But we ended up finding in the early user testing was how important the deployment story was and how, and this is something you said to me specifically, you're like backend, this needs to like backend needs to be part of this, like logging in, like off just to triple confirm you're dead right. That has been the absolute number one thing that folks coming to Bolt, you know, are looking to do is build a real app with a backend, with billing. And so one of this guy, Mauricio, he's one of our power users. He's like, there's three things that like every app that I'll ever want to build in Bolt, any of these other people in this community, you know, three things, a database, auth, and payments. So those three things, right. So that's- Admin dashboard. We can do that pretty decently, pretty decently. As in every database needs a WP admin. Yes. Yes. Correct. Totally. Totally. And so, yeah, today I think like viral hooks, for example, I think she's using Firebase for auth and database and that sort of thing. You know, so I think Firebase and Superbase, those are the two things that, that just work incredibly well. And so that's actually the point where we're at now, where, you know, right now it's, you know, folks have to still, you know, kind of go to Superbase, manually spin up a thing, come back to Bolt, but the thing that, you know, it's like that sort of processing thing with Firebase, each of those products are going to have their own little quirks that you have to, there's like kind of steps, right. And so- Boltbase. Yeah. Boltbase. Yeah. I think, yeah, I think initially we're like, okay, there should just be a way to like, for Bolt to just go and spin up these things on their behalf and just, and just, you know, both of them have APIs to do so. I'll go even further, like have like pre-warm
Swyx [01:22:12]: instances that you just assign, like it's already spun up, right. So it's, so it's like kind of serverless feeling, even as like, not really, but like yeah, just pre-warm and then just kind of assign it when, whenever someone like- That's a really great point. Yeah. Just keep, keep one
Eric [01:22:26]: Firebase in the hopper, basically. One, 10, 100, I don't know. More generally, this is what I felt
Swyx [01:22:32]: that I wanted to do on our call, which is like, when you have PMF, yes, you want to invest some time in like understanding your customers and do a data analytics and like tighten, tighten things up in general, like tighten up the pricing, tighten up the cost and all that. But then like, you also have to work on like, what is next, like the next level and growth, like you can still inflect. Yeah. I don't know what that is, but you know, I wanted to, I wanted to keep pushing you and I don't know if I did, mostly because I was serving as facilitator on that call. That's what I think. Like, I think you got to still keep pushing the frontier and I don't know what it, what it is, but like, you know, I want to hear what you got thinking about.
Eric [01:23:07]: I think there's, you know, we've addressed just a lot of the low hanging P0 stuff then, and we've actually seen, we've kind of the, you know, there's, there's key moments where it's just kind of like been going like that, which has been cool. Cause it's like, okay, well we were, we're just getting started. This is just the, this is just the fixing obvious things part. Fundamentally, I think a lot, what a lot of people are coming here to do is just, how can we just make it faster to go from idea to production? And a lot of it is like, I had, when I have to go to Firebase, Superbase, spin something up, run a migrate, you know, like add a table, but it's like the agent can do that, you know, so that stuff should be baked in. Yeah. And same thing with the deployment side. It's like right now it's going to Netlify, but people have to create a Netlify account and go and do that. Right. And so I think one of the things we're going to end up doing here is just having the hosting be baked in. And so I've been talking with Matt over at Netlify about this, cause they actually have a way to kind of white label stuff. And so, cause people are, they're just going to make a website, you know? And so it's I mean, that means also you take over domain registration. Can you imagine, right? Like a couple of months from now, you come to this thing, you're like, I want to make, I want to make an RSVP site. Right. And it's like, great. Do you, you know, do you have a name for it? Or do you want to, you know, a domain? You're like, I don't know a name. It's like, well, here's like 10 options and the.coms are able to look good. Yep. That one does. Okay. We want to buy it. Okay, great. It bought the DNS is pointed at the thing. Should we start building this? Okay. Does this look good? Yep. Okay. Am I okay to push this to prod? Yep. That looks good. You know, like that's without leaving the product.
Swyx [01:24:31]: Right. So to me, like it's tomorrow was the first to actually say like you are the new Wix. I never, I personally never thought about it that way. Wix is a $10 billion company where you want to go, you know, cause you still have a choice here. From what we're hearing from the folks using
Eric [01:24:43]: the product, I think I don't even think Wix is even able to solve their need, you know? But not to say that we don't want to, you know, that, that what you're saying is now we want, but, but I mean, yeah, like I think we want to solve folks problems. And I think that there's a huge gap in the market of being able to build, you know, kind of more sophisticated, high quality software like websites in a way that for someone who's a non-engineer. And so I think there's a huge market for that. And obviously, even if you're trying to build a wedding website, yeah, this is, this is easier and faster. Right. So I love it. I, you know, again, coming to the origins of why Albert, my co-founder and I are doing this is we've always just loved building stuff on the web. It's like this, I, this is the tool from what, even when stack was just the IDE interface to the technology, it's like, this is the thing we wish we had when we were 13 years old, you know? And with Bolt, oh my God, if this is the thing I wish we had when we were 13 years old, I'm so glad that my daughter's going to have this thing, you know? So anyways, yeah, I think it makes me pretty, pretty stoked that people are going to be able to actually build amazing web applications that can do really sophisticated things, you know? So yes, I think the short answer is heck yeah. I mean, yeah, that sort of market and totally right up our alley. One other angle that I wanted to pursue was
Swyx [01:25:53]: also the other languages. You know, you're very JavaScript centric. We've talked about Python forever. Ruby maybe, is that important? You know, like the previous generation of site builders were mostly Ruby shops and some PHP. Do we want to capture that or are we just like, you know, always been on JavaScript and just let JavaScript take over the world? You know, I think, I think
Eric [01:26:14]: we're, we're, we're certainly with great interest interested in other languages and we have like minimal support of Python and some C++ stuff in web container that you can like run or whatever. I think especially with the, with the stuff we're seeing though, it's the languages is kind of ancillary to the, to the, to the thing. Well, there's the ecosystem of like,
Swyx [01:26:31]: I want to end up with a code base that I can hire humans on to do the stuff that Bolt cannot do.
Eric [01:26:36]: Yeah, true. And I think, I think in that sense, like the, the, the JavaScript Node.js ecosystem is huge and well-established. So it's like, I think it'd be certainly be able to get people to work on this stuff. And I think the only thing that would be missing is it's like, are you building web apps that where a lot of the functionality is only in libraries that are in Python or something. Right. And I think just kind of seeing the applications that are being built here at, you know, I think that'd be like data science and like ML and that sort of thing. And so that's, we're not seeing a lot of that stuff, you know? And then, but I think that's like, we're like kind of a more generic approach is like what Repl.it's doing where they're spinning up real VMs. You can kind of run anything. And I think they started off with like doing Python service. I actually haven't tried their, their, you know, their new agent stuff that's based on.
Swyx [01:27:15]: Repl.it agent. Yeah. We're close friends. Repl.it has the database, the sort of live hosting, everything integrated that you're going to want to build. And you're, I think you're on a collision course with them, to be honest.
Eric [01:27:29]: We'll see. Cause I'm curious, you're not the first person to say that. I'm curious to see how it shakes out. Cause I think the challenge is focus. You know, when you are, what's kind of the end goal that you're shooting? Yeah, Repl.it's firmly for developers.
Swyx [01:27:45]: You're positioning it for non-developers like that. That's legit.
Eric [01:27:48]: Yeah. And even getting, even if focusing on a language or an ecosystem as well, because again, the problem is that these things can just break in a million ways. And so part of the, a lot of the work in making the experience better, like how do you get, like how make it, someone get an idea into the fingertips and live on prod, right? There's so much stuff in between there. And a lot of it is just errors that happen and how do you handle those? And a lot of that comes down to having a giant database of common errors that you can maybe even fine tune stuff on at some point, right? So doing that on, on one ecosystem, you can move a lot faster than if you're trying to support a lot of different languages. However, it's a, to the point of, if you're kind of targeting developers, they may not need that level of kind of streamline, you know, thing. I think that's kind of where I see the main divergence is that we are unabashedly focused on this ecosystem of, for building web apps. Got it. Yeah. You support it forever. Yeah. And so I'm very curious to see how, just how it all shakes out. Cause it's, I think what they're doing is actually, I mean, I'm very curious to see what Microsoft does because if anyone is good at giving out VMs, tying it to a coder and putting AI in it, it's Sia. He's got a cloud. He's got VS code. They've got code spaces. They've they're in open AI. Now they've got Anthropic and Copilot. I mean, I must imagine, I must imagine that they're cooking stuff over
Swyx [01:29:06]: there, you know? We'll make sure to ask him. We have many friends from Microsoft listening to the
Alessio [01:29:11]: pod. So just to wrap, I don't know, is there anything else Bolt related? I just have one personal question before we wrap the pod. Maybe like just advice, like now that you've
Swyx [01:29:20]: been through this journey, right? Advice to your former self. Oh, okay. Yeah. At which point? Advice yourself, like thinking about, there are many founders out there with a business where they're like, they're working really hard at it. It's interesting, but it's not an AI business. Yeah. And you kind of took the plunge to invest in this and it worked out for you. Maybe a lot of people are like, okay, like, you know, this guy got lucky. Obviously there's a little bit of luck in everything, but like, how do you improve your chances? Like, would you say, go for it? Would you say everyone should go for it? How would you advise someone who was in your shoes and thinking about, you know, maybe I should have a second product. Maybe I should take this, this experiment or maybe it doesn't work out. Like what is, what's the calculus here?
Eric [01:30:01]: Yeah. We were deeply skeptical going. I remember the conversation you and I had, you know, I was like this, I think there's something here. At that point we had built some amount, but I had waited a long time to give you the call. I said, this is your moment. Well, it was. So I remember specifically at the beginning of the conversation with Sean, he and I sat down at a coffee shop and, and, and SF, and, and so I was kind of giving him the pitch of like, you know, I think we have, I think that I can't remember the exact framing. I said, but it's, it's, it was obvious that Sean had heard a lot of people say this exact thing to him over the past year or two, which is like, Hey man, we've gotten AI play. Like this is our thing plus AI equals this, this could be crazy. And Sean, I get, you gave me this like skeptical look and then, and I was like, I really think so. And kind of here's why. Right. And and I think, I think that's, it's actually, I think it's, that is internally having, being skeptical of just kind of going and jumping on hype trains is, is good. Cause it's like, I think you, you know, your focus and your time and what you're putting your weight into is the most important thing when you're a founder. I think for us, like we actually, again, like I had mentioned at the beginning of this, you know, we had tried bold and didn't see the results and that was like a two week sprint and we rolled it back. Right. This, this isn't viable at this point, but then when, you know, once we, once we saw real tangible results of, you know, some of the new stuff, right. Okay. That, that changes. Thanks. And I think a lot of it is, is two is going and finding that out for yourself and then going and talking to the smartest people, you know, with more domain knowledge on that stuff than you have and going, here's kind of what we found. Does this track? So when Sean and I met and he, and he, and you know, we keep, he and I kind of, he saw it, we talked through it and he said, this is your moment. I specifically remember that. Cause I, I walked away from that and I was like, holy s**t, this, this is it. Like this, you know, like Sean's Sean's at the intersection of web and AI and as like, it, you know, has one of the best perspectives on this stuff of, of anyone I know that put a huge wind in our sales, honestly, of just like, okay, let's, let's go and really, let's go and double down here because you know, we had conviction before, but having someone who's in the space independently kind of verify meant a lot, you know, so it makes me uncomfortable, but thank you. I get it. I mean, and I waited, I waited until I was pretty darn sure it was not going to be a waste of time to
Alessio [01:32:12]: cool. Well, that's all I have. Yeah. And then on the personal side, you had a baby in April, you ran an Ironman in October. Now it's November.
Swyx [01:32:20]: He did Ironman while launching ball. I was trying to schedule the call for him and he was like, Nope, I'm sorry. I'm swimming. I was like, Hey, I'm on the swimming session. For those who don't know, actually, I did not know. I don't even know the distance of an Ironman. 13 hours. Your time was 12, 12, 12, 12, 15, 12, 15.
Eric [01:32:41]: Give me my minutes. No, no, I, it's, it can, it can completely depends on, you know, the course and just the, the, the person or whatever, right. And, but yeah, I mean, it's,
Swyx [01:32:51]: it's 2.4 cam open water, 2.4 mile open water swim, a hundred KM, a hundred mile, a hundred KM
Eric [01:32:58]: cycle. I think it's like, I think it's 112 mile a bike and then marathon. Yeah. Full 26.2 mile marathon. Yeah. It was why. Yeah. And you weren't, you were not like a super endurance athlete before, right? Like let's like make this clear. Yeah. Kind of a wild, a wild thing. So I, you know, back when I did, we, we had our daughter in April and at that time we were, the future of the company was, you know, we're, we're figuring out what are we going to do here at that time. It was, it was pro just prior to bolt kind of getting kicked into, you know, the rebirth of it with the new models and stuff. And so I knew that it was going to be, you know, having, having a child is, you know, if you talk to anyone that's done that you're, you don't have a lot of sleep. It's it's, you know, there's a lot of, you know, to, to, to be a great parent is, is a ton of work. And then also being a startup CEO where there's a lot of uncertainty or whatever the way I've always found, like when I have to go and you kind of knock it out of the park and all aspects of my life is, is going, yeah, just to, to make it all aspects of my life. And so I was, I just won. Yeah. I woke up one day, I was like, all right, I'm going to do an Ironman this year and I burned the ships, bought the, it's cost a thousand bucks to do. These didn't know that. And, you know, just started, I'd never ran a marathon at that point. And so I think it was like 45 or 60 days after that, I ran a marathon. My brother-in-law, he's, that was even more insane two weeks before the marathon. I was like, Hey, you want to run a marathon in two weeks? He's like, sure. And, and just did it with me. He did not an endurance athlete either. Right. But anyway, so yeah, so I was training, ended up getting a coach who's usually go, you're kind of online. He's up in Marin. Great guy was on the U S Olympic team for triathlons. And when I told him, okay, I'm going to, I'm doing Ironman, California in three months, he was like, are you insane? You know, like, what are you, you know, you'd ask for my opinion, but like, I just want you to know, I don't think this is a good idea. I think, you know, like you shouldn't do this, et cetera. And I ended up doing it, you know, I ended up getting it done. And so he was like, okay, like that's pretty bad. But what makes you, what makes you ignore expert advice here? Like
Swyx [01:34:59]: most sane people would be, would be like, okay, I mean, you know what you're doing? Like,
Eric [01:35:03]: I'll maybe wait a year. I think, and this is, this is kind of the, and the being a founder, right. It's, it's all about like, if you, like I mentioned earlier, it's like when we talk to people that worked on browser engines, they're like, you can't, you can't build what you're talking about. I think the job of a founder is, is to, is to solicit that advice. And, and what my coach actually said, he was right about certain things. There are certain areas where I was under indexed on, like, I was not, you know, spending nearly enough time on my bike, for example. Like after that, I was on my bike six hours a day on the weekends. That's a lot of time to spend in the saddle. Just like, just kind of, you know, and that was like, you know, for a couple of months leading up to it, he was right on, on certain aspects of it. And, but I kind of had to look internally and go, okay, like, what is he kind of missing about who I am and like, what I kind of know I'm capable of at this point. I mean, it was a nail biter. I mean, going into the thing, you know, it's, you get in, this is the same thing with launching bolt. It's like, or, or launching anything you get launch day, race day, you kind of go in, you're like, all right, here we go. Like we're going to, we're going to find out, we're going to find out, you know, how based in reality I was about all the decisions that led to this moment. And so I was going and doing the Ironman in like six months. Most people spend, you know, the, the folks he trains, usually it's, you know, one to two years on this stuff before you do try and do a full, you know, it's like going and kind of doing in that sort of timeframe. It's, it's, it's very similar to the same sort of skill set of going and building products. You have to really kind of look at the base reality and go make your own assessment on
Alessio [01:36:24]: it. Right. So cool. Great. Sorry to wrap. Thank you so much here. Thanks for your time.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic
28 nov· Latent Space: The AI Engineer Podcast
We have announced our first speaker, friend of the show Dylan Patel, and topic slates for Latent Space LIVE! at NeurIPS. Sign up for IRL/Livestream and to debate!
We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!
The vibe shift we observed in July - in favor of Claude 3.5 Sonnet, first introduced in June — has been remarkably long lived and persistent, surviving multiple subsequent updates of 4o, o1 and Gemini versions, for Anthropic’s Claude to end 2024 as the preferred model for AI Engineers and even being the exclusive choice for new code agents like bolt.new (our next guest on the pod!), which unlocked so much performance from Claude Sonnet that it went from $0 to $4m ARR in 4 weeks when it launched last month.
Anthropic has now raised an additional $4b from Amazon and made an incredibly well received update of Claude 3.5 Sonnet (and Haiku), making significant improvements in performance over its predecessors:
Solving SWE-Bench
As part of the October Sonnet release, Anthropic teased a blink-and-you’ll miss it result:
The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. It also improves performance on TAU-bench, an agentic tool use task, from 62.6% to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain. The new Claude 3.5 Sonnet offers these advancements at the same price and speed as its predecessor.
This was followed up by a blogpost a week later from today’s guest, Erik Schluntz, the engineer who implemented and scored this SOTA result using a simple, non-overengineered version of the SWE-Agent framework (you can see the submissions here). We have previously covered the SWE-Bench story extensively:
* Speaking with SWEBench/SWEAgent authors at ICLR
* Speaking with Cosine Genie, the previous SOTA (43.8%) on SWEBench Verified (with brief update at DevDay 2024)
* Speaking with Shunyu Yao on SWEBench and the ReAct paradigm driving SWE-Agent
One of the notable inclusions in this blogpost are the tools that Erik decided to give Claude, e.g. the “Edit Tool”:
The tools teased in the SWEBench submission/blogpost were then polished up and released with Computer Use…
And you can also see even more computer use tools given in the new Model Context Protocol servers:
Claude Computer Use
Because it is one of the best received AI releases of the year, we recommend watching the 2 minute Computer Use intro (and related demos) in its entirety:
Eric also worked on Claude’s function calling, tool use, and computer use APIs, so we discuss that in the episode.
Erik [00:53:39]: With computer use, just give the thing a browser that's logged into what you want to integrate with, and it's going to work immediately. And I see that reduction in friction as being incredibly exciting. Imagine a customer support team where, okay, hey, you got this customer support bot, but you need to go integrate it with all these things. And you don't have any engineers on your customer support team. But if you can just give the thing a browser that's logged into your systems that you need it to have access to, now, suddenly, in one day, you could be up and rolling with a fully integrated customer service bot that could go do all the actions you care about. So I think that's the most exciting thing for me about computer use, is reducing that friction of integrations to almost zero.
As you’ll see, this is very top of mind for Erik as a former Robotics founder who’s company basically used robots to interface with human physical systems like elevators.
Full Video episode
Please like and subscribe!
Show Notes
* Eric Schluntz
* “Raising the bar on SWE-Bench Verified”
* Cobalt Robotics
* SWE-Bench
* SWE-Bench Verified
* Human Eval & other benchmarks
* Anthropic Workbench
* Aider
* Cursor
* Fireworks AI
* E2B
* Amanda Askell
* Toyota Research
* Physical Intelligence (Pi)
* Chelsea Finn
* Josh Albrecht
* Eric Jang
* 1X
* Dust
* Cosine Episode
* Bolt
* Adept Episode
* TauBench
* LMSys Episode
Timestamps
* [00:00:00] Introductions
* [00:03:39] What is SWE-Bench?
* [00:12:22] SWE-Bench vs HumanEval vs others
* [00:15:21] SWE-Agent architecture and runtime
* [00:21:18] Do you need code indexing?
* [00:24:50] Giving the agent tools
* [00:27:47] Sandboxing for coding agents
* [00:29:16] Why not write tests?
* [00:30:31] Redesigning engineering tools for LLMs
* [00:35:53] Multi-agent systems
* [00:37:52] Why XML so good?
* [00:42:57] Thoughts on agent frameworks
* [00:45:12] How many turns can an agent do?
* [00:47:12] Using multiple model types
* [00:51:40] Computer use and agent use cases
* [00:59:04] State of AI robotics
* [01:04:24] Robotics in manufacturing
* [01:05:01] Hardware challenges in robotics
* [01:09:21] Is self-driving a good business?
Transcript
Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners. And today we're in the new studio with my usual co-host, Shawn from Smol AI.
Swyx [00:00:14]: Hey, and today we're very blessed to have Erik Schluntz from Anthropic with us. Welcome.
Erik [00:00:19]: Hi, thanks very much. I'm Erik Schluntz. I'm a member of technical staff at Anthropic, working on tool use, computer use, and Swebench.
Swyx [00:00:27]: Yeah. Well, how did you get into just the whole AI journey? I think you spent some time at SpaceX as well? Yeah. And robotics. Yeah. There's a lot of overlap between like the robotics people and the AI people, and maybe like there's some interlap or interest between language models for robots right now. Maybe just a little bit of background on how you got to where you are. Yeah, sure.
Erik [00:00:50]: I was at SpaceX a long time ago, but before joining Anthropic, I was the CTO and co-founder of Cobalt Robotics. We built security and inspection robots. These are sort of five foot tall robots that would patrol through an office building or a warehouse looking for anything out of the ordinary. Very friendly, no tasers or anything. We would just sort of call a remote operator if we saw anything. We have about 100 of those out in the world, and had a team of about 100. We actually got acquired about six months ago, but I had left Cobalt about a year ago now, because I was starting to get a lot more excited about AI. I had been writing a lot of my code with things like Copilot, and I was like, wow, this is actually really cool. If you had told me 10 years ago that AI would be writing a lot of my code, I would say, hey, I think that's AGI. And so I kind of realized that we had passed this level, like, wow, this is actually really useful for engineering work. That got me a lot more excited about AI and learning about large language models. So I ended up taking a sabbatical and then doing a lot of reading and research myself and decided, hey, I want to go be at the core of this and joined Anthropic.
Alessio [00:01:53]: And why Anthropic? Did you consider other labs? Did you consider maybe some of the robotics companies?
Erik [00:02:00]: So I think at the time I was a little burnt out of robotics, and so also for the rest of this, any sort of negative things I say about robotics or hardware is coming from a place of burnout, and I reserve my right to change my opinion in a few years. Yeah, I looked around, but ultimately I knew a lot of people that I really trusted and I thought were incredibly smart at Anthropic, and I think that was the big deciding factor to come there. I was like, hey, this team's amazing. They're not just brilliant, but sort of like the most nice and kind people that I know, and so I just felt like I could be a really good culture fit. And ultimately, I do care a lot about AI safety and making sure that I don't want to build something that's used for bad purposes, and I felt like the best chance of that was joining Anthropic.
Alessio [00:02:39]: And from the outside, these labs kind of look like huge organizations that have these obscure
Swyx [00:02:44]: ways to organize.
Alessio [00:02:45]: How did you get, you joined Anthropic, did you already know you were going to work on of the stuff you publish or you kind of join and then you figure out where you land? I think people are always curious to learn more.
Erik [00:02:57]: Yeah, I've been very happy that Anthropic is very bottoms up and sort of very sort of receptive to whatever your interests are. And so I joined sort of being very transparent of like, hey, I'm most excited about code generation and AI that can actually go out and sort of touch the world or sort of help people build things. And, you know, those weren't my initial initial projects. I also came in and said, hey, I want to do the most valuable possible thing for this company and help Anthropic succeed. And, you know, like, let me find the balance of those. So I was working on lots of things at the beginning, you know, function calling, tool use. And then sort of as it became more and more relevant, I was like, oh, hey, like, let's it's time to go work on encoding agents and sort of started looking at SWE-Bench as sort of a really good benchmark for that.
Swyx [00:03:39]: So let's get right into SWE-Bench. That's one of the many claims to fame. I feel like there's just been a series of releases related with Cloud 3.5 Sonnet around about two or three months ago, 3.5 Sonnet came out and it was it was a step ahead in terms of a lot of people immediately fell in love with it for coding. And then last month you released a new updated version of Cloud Sonnet. We're not going to talk about the training for that because that's still confidential. But I think Anthropic's done a really good job, like applying the model to different things. So you took the lead on SWE-Bench, but then also we're going to talk a little bit about computer use later on. So maybe just give us a context about why you looked at SWE-Bench Verified and you actually came up with a whole system for building agents that would maximally use the model well. Yeah.
Erik [00:04:28]: So I'm on a sub team called Product Research. And basically the idea of product research is to really understand what end customers care about and want in the models and then work to try to make that happen. So we're not focused on sort of these more abstract general benchmarks like math problems or MMLU, but we really care about finding the things that are really valuable and making sure the models are great at those. And so because I've been interested in coding agents, I knew that this would be a really valuable thing. And I knew there were a lot of startups and our customers trying to build coding agents with our models. And so I said, hey, this is going to be a really good benchmark to be able to measure that and do well on it. And I wasn't the first person at Anthropic to find SWE-Bench, and there are lots of people that already knew about it and had done some internal efforts on it. It fell to me to sort of both implement the benchmark, which is very tricky, and then also to sort of make sure we had an agent and basically like a reference agent, maybe I'd call it, that could do very well on it. Ultimately, we want to provide how we implemented that reference agent so that people can build their own agents on top of our system and get sort of the most out of it as possible. So with this blog post we released on SWE-Bench, we released the exact tools and the prompt that we gave the model to be able to do well.
Swyx [00:05:46]: For people who don't know, who maybe haven't dived into SWE-Bench, I think the general perception is they're like tasks that a software engineer could do. I feel like that's an inaccurate description because it is basically, one, it's a subset of like 12 repos. It's everything they could find that every issue with like a matching commit that could be tested. So that's not every commit. And then SWE-Bench verified is further manually filtered by OpenAI. Is that an accurate description and anything you'd change about that? Yes.
Erik [00:06:14]: SWE-Bench is, it certainly is a subset of all tasks. It's first of all, it's only Python repos, so already fairly limited there. And it's just 12 of these popular open source repos. And yes, it's only ones where there were tests that passed at the beginning and also new tests that were introduced that test the new feature that's added. So it is, I think, a very limited subset of real engineering tasks. But I think it's also very valuable because even though it's a subset, it is true engineering tasks. And I think a lot of other benchmarks are really kind of these much more artificial setups of even if they're related to coding, they're more like coding interview style questions or puzzles that I think are very different from day-to-day what you end up doing. I don't know how frequently you all get to use recursion in your day-to-day job, but whenever I do, it's like a treat. And I think it's almost comical, and a lot of people joke about this in the industry, is how different interview questions are.
Swyx [00:07:13]: Dynamic programming. Yeah, exactly.
Erik [00:07:15]: Like, you code. From the day-to-day job. But I think one of the most interesting things about SWE-Bench is that all these other benchmarks are usually just isolated puzzles, and you're starting from scratch. Whereas SWE-Bench, you're starting in the context of an entire repository. And so it adds this entirely new dimension to the problem of finding the relevant files. And this is a huge part of real engineering, is it's actually pretty rare that you're starting something totally greenfield. You need to go and figure out where in a codebase you're going to make a change and understand how your work is going to interact with the rest of the systems. And I think SWE-Bench does a really good job of presenting that problem.
Alessio [00:07:51]: Why do we still use human eval? It's like 92%, I think. I don't even know if you can actually get to 100% because some of the data is not actually
Swyx [00:07:59]: solvable.
Alessio [00:08:00]: Do you see benchmarks like that, they should just get sunsetted? Because when you look at the model releases, it's like, oh, it's like 92% instead of like 89%, 90% on human eval versus, you know, SWE-Bench verified is you have 49%, right? Which is like, before 45% was state of the art, but maybe like six months ago it was like 30%, something like that. So is that a benchmark that you think is going to replace human eval, or do you think they're just going to run in parallel?
Erik [00:08:27]: I think there's still need for sort of many different varied evals. Like sometimes you do really care about just sort of greenfield code generation. And so I don't think that everything needs to go to sort of an agentic setup.
Swyx [00:08:39]: It would be very expensive to implement.
Erik [00:08:41]: The other thing I was going to say is that SWE-Bench is certainly hard to implement and expensive to run because each task, you have to parse, you know, a lot of the repo to understand where to put your code. And a lot of times you take many tries of writing code, running it, editing it. It can use a lot of tokens compared to something like human eval. So I think there's definitely a space for these more traditional coding evals that are sort of easy to implement, quick to run, and do get you some signal. Maybe hopefully there's just sort of harder versions of human eval that get created.
Alessio [00:09:14]: How do we get SWE-Bench verified to 92%? Do you think that's something where it's like line of sight to it, or it's like, you know, we need a whole lot of things to go right? Yeah, yeah.
Erik [00:09:23]: And actually, maybe I'll start with SWE-Bench versus SWE-Bench verified, which is I think something I missed earlier. So SWE-Bench is, as we described, this big set of tasks that were scraped.
Swyx [00:09:33]: Like 12,000 or something?
Erik [00:09:34]: Yeah, I think it's 2,000 in the final set. But a lot of those, even though a human did them, they're actually impossible given the information that comes with the task. The most classic example of this is the test looks for a very specific error string. You know, like assert message equals error, something, something, something. And unless you know that's exactly what you're looking for, there's no way the model is going to write that exact same error message, and so the tests are going to fail. So SWE-Bench verified was actually made in partnership with OpenAI, and they hired humans to go review all these tasks and pick out a subset to try to remove any obstacle like this that would make the tasks impossible. So in theory, all of these tasks should be fully doable by the model. And they also had humans grade how difficult they thought the problems would be. Between less than 15 minutes, I think 15 minutes to an hour, an hour to four hours, and greater than four hours. So that's kind of this interesting sort of how big the problem is as well. To get to SWE-Bench verified to 90%, actually, maybe I'll also start off with some of the remaining failures that I see when running our model on SWE-Bench. I'd say the biggest cases are the model sort of operates at the wrong level of abstraction. And what I mean by that is the model puts in maybe a smaller band-aid when really the task is asking for a bigger refactor. And some of those, you know, is the model's fault, but a lot of times if you're just sort of seeing the GitHub issue, it's not exactly clear which way you should do. So even though these tasks are possible, there's still some ambiguity in how the tasks are described. That being said, I think in general, language models frequently will produce a smaller diff when possible, rather than trying to do a big refactor. I think another area, at least the agent we created, didn't have any multimodal abilities, even though our models are very good at vision. So I think that's just a missed opportunity. And if I read through some of the traces, there's some funny things where, especially the tasks on matplotlib, which is a graphing library, the test script will save an image and the model will just say, okay, it looks great, you know, without looking at it. So there's certainly extra juice to squeeze there of just making sure the model really understands all the sides of the input that it's given, including multimodal. But yeah, I think like getting to 92%. So this is something that I have not looked at, but I'm very curious about. I want someone to look at, like, what is the union of all of the different tasks that have been solved by at least one attempt at SWE-Bench Verified. There's a ton of submissions to the benchmark, and so I'd be really curious to see how many of those 500 tasks at least someone has solved. And I think, you know, there's probably a bunch that none of the attempts have ever solved. And I think it'd be interesting to look at those and say, hey, is there some problem with these? Like, are these impossible? Or are they just really hard and only a human could do them?
Swyx [00:12:22]: Yeah, like specifically, is there a category of problems that are still unreachable by any LLM agent? Yeah, yeah. And I think there definitely are.
Erik [00:12:28]: The question is, are those fairly inaccessible or are they just impossible because of the descriptions? But I think certainly some of the tasks, especially the ones that the human graders reviewed as like taking longer than four hours are extremely difficult. I think we got a few of them right, but not very many at all in the benchmark.
Swyx [00:12:49]: And did those take less than four hours?
Erik [00:12:51]: They certainly did less than, yeah, than four hours.
Swyx [00:12:54]: Is there a correlation of length of time with like human estimated time? You know what I mean? Or do we have sort of more of X paradox type situations where it's something super easy for a model, but hard for a human?
Erik [00:13:06]: I actually haven't done the stats on that, but I think that'd be really interesting to see of like how many tokens does it take and how is that correlated with difficulty? What is the likelihood of success with difficulty? I think actually a really interesting thing that I saw, one of my coworkers who was also working on this named Simon, he was focusing just specifically on the very hard problems, the ones that are said to take longer than four hours. And he ended up sort of creating a much more detailed prompt than I used. And he got a higher score on the most difficult subset of problems, but a lower score overall on the whole benchmark. And the prompt that I made, which is sort of much more simple and bare bones, got a higher score on the overall benchmark, but lower score on the really hard problems. And I think some of that is the really detailed prompt made the model sort of overcomplicate a lot of the easy problems, because honestly, a lot of the suite bench problems, they really do just ask for a bandaid where it's like, hey, this crashes if this is none, and really all you need to do is put a check if none. And so sometimes trying to make the model think really deeply, it'll think in circles and overcomplicate something, which certainly human engineers are capable of as well. But I think there's some interesting thing of the best prompt for hard problems might not be the best prompt for easy problems.
Alessio [00:14:19]: How do we fix that? Are you supposed to fix it at the model level? How do I know what prompt I'm supposed to use?
Swyx [00:14:25]: Yeah.
Erik [00:14:26]: And I'll say this was a very small effect size, and so I think this isn't worth obsessing over. I would say that as people are building systems around agents, I think the more you can separate out the different kinds of work the agent needs to do, the better you can tailor a prompt for that task. And I think that also creates a lot of like, for instance, if you were trying to make an agent that could both solve hard programming tasks, and it could just write quick test files for something that someone else had already made, the best way to do those two tasks might be very different prompts. I see a lot of people build systems where they first sort of have a classification, and then route the problem to two different prompts. And that's sort of a very effective thing, because one, it makes the two different prompts much simpler and smaller, and it means you can have someone work on one of the prompts without any risk of affecting the other tasks. So it creates like a nice separation of concerns. Yeah.
Alessio [00:15:21]: And the other model behavior thing you mentioned, they prefer to generate like shorter diffs. Why is that? Like, is there a way? I think that's maybe like the lazy model question that people have is like, why are you not just generating the whole code instead of telling me to implement it?
Swyx [00:15:36]: Are you saving tokens? Yeah, exactly. It's like conspiracy theory. Yeah. Yeah.
Erik [00:15:41]: Yeah. So there's two different things there. One is like the, I'd say maybe like doing the easier solution rather than the hard solution. And I'd say the second one, I think what you're talking about is like the lazy model is like when the model says like dot, dot, dot, code remains the same.
Swyx [00:15:52]: Code goes here. Yeah. I'm like, thanks, dude.
Erik [00:15:55]: But honestly, like that just comes as like people on the internet will do stuff like that. And like, dude, if you're talking to a friend and you ask them like to give you some example code, they would definitely do that. They're not going to reroll the whole thing. And so I think that's just a matter of like, you know, sometimes you actually do just, just want like the relevant changes. And so I think it's, this is something where a lot of times like, you know, the models aren't good at mind reading of like which one you want. So I think that like the more explicit you can be in prompting to say, Hey, you know, give me the entire thing, no, no elisions versus just give me the relevant changes. And that's something, you know, we want to make the models always better at following those kinds of instructions.
Swyx [00:16:32]: I'll drop a couple of references here. We're recording this like a day after Dario, Lex Friedman just dropped his five hour pod with Dario and Amanda and the rest of the crew. And Dario actually made this interesting observation that like, we actually don't want, we complain about models being too chatty in text and then not chatty enough in code. And so like getting that right is kind of a awkward bar because, you know, you, you don't want it to yap in its responses, but then you also want it to be complete in, in code. And then sometimes it's not complete. Sometimes you just want it to diff, which is something that Enthopic has also released with a, you know, like the, the fast edit stuff that you guys did. And then the other thing I wanted to also double back on is the prompting stuff. You said, you said it was a small effect, but it was a noticeable effect in terms of like picking a prompt. I think we'll go into suite agent in a little bit, but I kind of reject the fact that, you know, you need to choose one prompt and like have your whole performance be predicated on that one prompt. I think something that Enthopic has done really well is meta prompting, prompting for a prompt. And so why can't you just develop a meta prompt for, for all the other prompts? And you know, if it's a simple task, make a simple prompt, if it's a hard task, make a hard prompt. Obviously I'm probably hand-waving a little bit, but I will definitely ask people to try the Enthopic Workbench meta prompting system if they haven't tried it yet. I went to the Build Day recently at Enthopic HQ, and it's the closest I've felt to an AGI, like learning how to operate itself that, yeah, it's, it's, it's really magical.
Erik [00:17:57]: Yeah, no, Claude is great at writing prompts for Claude.
Swyx [00:18:00]: Right, so meta prompting. Yeah, yeah.
Erik [00:18:02]: The way I think about this is that humans, even like very smart humans still use sort of checklists and use sort of scaffolding for themselves. Surgeons will still have checklists, even though they're incredible experts. And certainly, you know, a very senior engineer needs less structure than a junior engineer, but there still is some of that structure that you want to keep. And so I always try to anthropomorphize the models and try to think about for a human sort of what is the equivalent. And that's sort of, you know, how I think about these things is how much instruction would you give a human with the same task? And do you, would you need to give them a lot of instruction or a little bit of instruction?
Alessio [00:18:36]: Let's talk about the agent architecture maybe. So first, runtime, you let it run until it thinks it's done or it reaches 200k context window.
Swyx [00:18:45]: How did you come up? What's up with that?
Erik [00:18:47]: Yeah.
Swyx [00:18:48]: Yeah.
Erik [00:18:49]: I mean, this, so I'd say that a lot of previous agent work built sort of these very hard coded and rigid workflows where the model is sort of pushed through certain flows of steps. And I think to some extent, you know, that's needed with smaller models and models that are less smart. But one of the things that we really wanted to explore was like, let's really give Claude the reins here and not force Claude to do anything, but let Claude decide, you know, how it should approach the problem, what steps it should do. And so really, you know, what we did is like the most extreme version of this is just give it some tools that it can call and it's able to keep calling the tools, keep thinking, and then yeah, keep doing that until it thinks it's done. And that's sort of the most, the most minimal agent framework that we came up with. And I think that works very well. I think especially the new Sonnet 3.5 is very, very good at self-correction, has a lot of like grit. Claude will try things that fail and then try, you know, come back and sort of try different approaches. And I think that's something that you didn't see in a lot of previous models. Some of the existing agent frameworks that I looked at, they had whole systems built to try to detect loops and see, oh, is the model doing the same thing, you know, more than three times, then we have to pull it out. And I think like the smarter the models are, the less you need that kind of extra scaffolding. So yeah, just giving the model tools and letting it keep sample and call tools until it thinks it's done was the most minimal framework that we could think of. And so that's what we did.
Alessio [00:20:18]: So you're not pruning like bad paths from the context. If it tries to do something, it fails. You just burn all these tokens.
Swyx [00:20:25]: Yes.
Erik [00:20:26]: I would say the downside of this is that this is sort of a very token expensive way to do
Swyx [00:20:29]: this. But still, it's very common to prune bad paths because models get stuck. Yeah.
Erik [00:20:35]: But I'd say that, yeah, 3.5 is not getting stuck as much as previous models. And so, yeah, we wanted to at least just try the most minimal thing. Now, I would say that, you know, this is definitely an area of future research, especially if we talk about these problems that are going to take a human more than four hours. Those might be things where we're going to need to go prune bad paths to let the model be able to accomplish this task within 200k tokens. So certainly I think there's like future research to be done in that area, but it's not necessary to do well on these benchmarks.
Swyx [00:21:06]: Another thing I always have questions about on context window things, there's a mini cottage industry of code indexers that have sprung up for large code bases, like the ones in SweetBench. You didn't need them? We didn't.
Erik [00:21:18]: And I think I'd say there's like two reasons for this. One is like SweetBench specific and the other is a more general thing. The more general thing is that I think Sonnet is very good at what we call agentic search. And what this basically means is letting the model decide how to search for something. It gets the results and then it can decide, should it keep searching or is it done? Does it have everything it needs? So if you read through a lot of the traces of the SweetBench, the model is calling tools to view directories, list out things, view files. And it will do a few of those until it feels like it's found the file where the bug is. And then it will start working on that file. And I think like, again, this is all, everything we did was about just giving Claude the full reins. So there's no hard-coded system. There's no search system that you're relying on getting the correct files into context. This just totally lets Claude do it.
Swyx [00:22:11]: Or embedding things into a vector database. Exactly. Oops. No, no.
Erik [00:22:17]: This is very, very token expensive. And so certainly, and it also takes many, many turns. And so certainly if you want to do something in a single turn, you need to do RAG and just push stuff into the first prompt.
Alessio [00:22:28]: And just to make it clear, it's using the Bash tool, basically doing LS, looking at files and then doing CAD for the following context. It can do that.
Erik [00:22:35]: But it's file editing tool also has a command in it called view that can view a directory. It's very similar to LS, but it just sort of has some nice sort of quality of life improvements. So I think it'll only do an LS sort of two directories deep so that the model doesn't get overwhelmed if it does this on a huge file. I would say actually we did more engineering of the tools than the overall prompt. But the one other thing I want to say about this agentic search is that for SWE-Bench specifically, a lot of the tasks are bug reports, which means they have a stack trace in them. And that means right in that first prompt, it tells you where to go. And so I think this is a very easy case for the model to find the right files versus if you're using this as a general coding assistant where there isn't a stack trace or you're asking it to insert a new feature, I think there it's much harder to know which files to look at. And that might be an area where you would need to do more of this exhaustive search where an agentic search would take way too long.
Swyx [00:23:33]: As someone who spent the last few years in the JS world, it'd be interesting to see SWE-Bench JS because these stack traces are useless because of so much virtualization that we do. So they're very, very disconnected with where the code problems are actually appearing.
Erik [00:23:50]: That makes me feel better about my limited front-end experience, as I've always struggled with that problem.
Swyx [00:23:55]: It's not your fault. We've gotten ourselves into a very, very complicated situation. And I'm not sure it's entirely needed. But if you talk to our friends at Vercel, they will say it is.
Erik [00:24:04]: I will say SWE-Bench just released SWE-Bench Multimodal, which I believe is either entirely JavaScript or largely JavaScript. And it's entirely things that have visual components of them.
Swyx [00:24:15]: Are you going to tackle that? We will see.
Erik [00:24:17]: I think it's on the list and there's interest, but no guarantees yet.
Swyx [00:24:20]: Just as a side note, it occurs to me that every model lab, including Enthopic, but the others as well, you should have your own SWE-Bench, whatever your bug tracker tool. This is a general methodology that you can use to track progress, I guess.
Erik [00:24:34]: Yeah, sort of running on our own internal code base.
Swyx [00:24:36]: Yeah, that's a fun idea.
Alessio [00:24:37]: Since you spend so much time on the tool design, so you have this edit tool that can make changes and whatnot. Any learnings from that that you wish the AI IDEs would take in? Is there some special way to look at files, feed them in?
Erik [00:24:50]: I would say the core of that tool is string replace. And so we did a few different experiments with different ways to specify how to edit a file. And string replace, basically, the model has to write out the existing version of the string and then a new version, and that just gets swapped in. We found that to be the most reliable way to do these edits. Other things that we tried were having the model directly write a diff, having the model fully regenerate files. That one is actually the most accurate, but it takes so many tokens, and if you're in a very big file, it's cost prohibitive. There's basically a lot of different ways to represent the same task. And they actually have pretty big differences in terms of model accuracy. I think Eider, they have a really good blog where they explore some of these different methods for editing files, and they post results about them, which I think is interesting. But I think this is a really good example of the broader idea that you need to iterate on tools rather than just a prompt. And I think a lot of people, when they make tools for an LLM, they kind of treat it like they're just writing an API for a computer, and it's sort of very minimal. It's sort of just the bare bones of what you'd need, and honestly, it's so hard for the models to use those. Again, I come back to anthropomorphizing these models. Imagine you're a developer, and you just read this for the very first time, and you're trying to use it. You can do so much better than just sort of the bare API spec of what you'd often see. Include examples in the description. Include really detailed explanations of how things work. And I think that, again, also think about what is the easiest way for the model to represent the change that it wants to make. For file editing, as an example, writing a diff is actually... Let's take the most extreme example. You want the model to literally write a patch file. I think patch files have at the very beginning numbers of how many total lines change. That means before the model has actually written the edit, it needs to decide how many numbers or how many lines are going to change.
Swyx [00:26:52]: Don't quote me on that.
Erik [00:26:54]: I think it's something like that, but I don't know if that's exactly the diff format. But you can certainly have formats that are much easier to express without messing up than others. And I like to think about how much human effort goes into designing human interfaces for things. It's incredible. This is entirely what FrontEnd is about, is creating better interfaces to kind of do the same things. And I think that same amount of attention and effort needs to go into creating agent computer interfaces.
Swyx [00:27:19]: It's a topic we've discussed, ACI or whatever that looks like. I would also shout out that I think you released some of these toolings as part of computer use as well. And people really liked it. It's all open source if people want to check it out. I'm curious if there's an environment element that complements the tools. So how do you... Do you have a sandbox? Is it just Docker? Because that can be slow or resource intensive. Do you have anything else that you would recommend?
Erik [00:27:47]: I don't think I can talk about sort of public details or about private details about how we implement our sandboxing. But obviously, we need to have sort of safe, secure, and fast sandboxes for training for the models to be able to practice writing code and working in an environment.
Swyx [00:28:03]: I'm aware of a few startups working on agent sandboxing. E2B is a close friend of ours that Alessio has led around in, but also I think there's others where they're focusing on snapshotting memory so that it can do time travel for debugging. Computer use where you can control the mouse or keyboard or something like that. Whereas here, I think that the kinds of tools that we offer are very, very limited to coding agent work cases like bash, edit, you know, stuff like that. Yeah.
Erik [00:28:30]: I think the computer use demo that we released is an extension of that. It has the same bash and edit tools, but it also has the computer tool that lets it get screenshots and move the mouse and keyboard. Yeah. So I definitely think there's sort of more general tools there. And again, the tools we released as part of SweetBench were, I'd say they're very specific for like editing files and doing bash, but at the same time, that's actually very general if you think about it. Like anything that you would do on a command line or like editing files, you can do with those tools. And so we do want those tools to feel like any sort of computer terminal work could be done with those same tools rather than making tools that were like very specific for SweetBench like run tests as its own tool, for instance. Yeah.
Swyx [00:29:15]: You had a question about tests.
Alessio [00:29:16]: Yeah, exactly. I saw there's no test writer tool. Is it because it generates the code and then you're running it against SweetBench anyway, so it doesn't really need to write the test or?
Swyx [00:29:26]: Yeah.
Erik [00:29:27]: So this is one of the interesting things about SweetBench is that the tests that the model's output is graded on are hidden from it. That's basically so that the model can't cheat by looking at the tests and writing the exact solution. And I'd say typically the model, the first thing it does is it usually writes a little script to reproduce the error. And again, most SweetBench tasks are like, hey, here's a bug that I found. I run this and I get this error. So the first thing the model does is try to reproduce that. So it's kind of been rerunning that script as a mini test. But yeah, sometimes the model will like accidentally introduce a bug that breaks some other tests and it doesn't know about that.
Alessio [00:30:05]: And should we be redesigning any tools? We kind of talked about this and like having more examples, but I'm thinking even things of like Q as a query parameter in many APIs, it's like easier for the model to like re-query than read the Q. I'm sure it learned the Q by this point, but like, is there anything you've seen like building this where it's like, hey, if I were to redesign some CLI tools, some API tool, I would like change the way structure to make it better for LLMs?
Erik [00:30:31]: I don't think I've thought enough about that off the top of my head, but certainly like just making everything more human friendly, like having like more detailed documentation and examples. I think examples are really good in things like descriptions, like so many, like just using the Linux command line, like how many times I do like dash dash help or look at the man page or something. It's like, just give me one example of like how I actually use this. Like I don't want to go read through a hundred flags. Just give me the most common example. But again, so you know, things that would be useful for a human, I think are also very useful for a model.
Swyx [00:31:03]: Yeah. I mean, there's one thing that you cannot give to code agents that is useful for human is this access to the internet. I wonder how to design that in, because one of the issues that I also had with just the idea of a suite bench is that you can't do follow up questions. You can't like look around for similar implementations. These are all things that I do when I try to fix code and we don't do that. It's not, it wouldn't be fair, like it'd be too easy to cheat, but then also it's kind of not being fair to these agents because they're not operating in a real world situation. Like if I had a real world agent, of course I'm giving it access to the internet because I'm not trying to pass a benchmark. I don't have a question in there more, more just like, I feel like the most obvious tool access to the internet is not being used.
Erik [00:31:47]: I think that that's really important for humans, but honestly the models have so much general knowledge from pre-training that it's, it's like less important for them. I feel like versioning, you know, if you're working on a newer thing that was like, they came after the knowledge cutoff, then yes, I think that's very important. I think actually this, this is like a broader problem that there is a divergence between Sweebench and like what customers will actually care about who are working on a coding agent for real use. And I think one of those there is like internet access and being able to like, how do you pull in outside information? I think another one is like, if you have a real coding agent, you don't want to have it start on a task and like spin its wheels for hours because you gave it a bad prompt. You want it to come back immediately and ask follow up questions and like really make sure it has a very detailed understanding of what to do, then go off for a few hours and do work. So I think that like real tasks are going to be much more interactive with the agent rather than this kind of like one shot system. And right now there's no benchmark that, that measures that. And maybe I think it'd be interesting to have some benchmark that is more interactive. I don't know if you're familiar with TauBench, but it's a, it's a customer service benchmark where there's basically one LLM that's playing the user or the customer that's getting support and another LLM that's playing the support agent and they interact and try to resolve the issue.
Swyx [00:33:08]: Yeah. We talked to the LMSIS guys. Awesome. And they also did MTBench for people listening along. So maybe we need MTSWE-Bench. Sure. Yeah.
Erik [00:33:16]: So maybe, you know, you could have something where like before the SWE-Bench task starts, you have like a few back and forths with kind of like the, the author who can answer follow up questions about what they want the task to do. And of course you'd need to do that where it doesn't cheat and like just get the exact, the exact thing out of the human or out of the sort of user. But I think that would be a really interesting thing to see. If you look at sort of existing agent work, like a Repl.it's coding agent, I think one of the really great UX things they do is like first having the agent create a plan and then having the human approve that plan or give feedback. I think for agents in general, like having a planning step at the beginning, one, just having that plan will improve performance on the downstream task just because it's kind of like a bigger chain of thought, but also it's just such a better UX. It's way easier for a human to iterate on a plan with a model rather than iterating on the full task that sort of has a much slower time through each loop. If the human has approved this implementation plan, I think it makes the end result a lot more sort of auditable and trustable. So I think there's a lot of things sort of outside of SweetBench that will be very important for real agent usage in the world. Yeah.
Swyx [00:34:27]: I will say also, there's a couple of comments on names that you dropped. Copilot also does the plan stage before it writes code. I feel like those approaches have generally been less Twitter successful because it's not prompt to code, it's prompt plan code. You know, so there's a little bit of friction in there, but it's not much. Like it's, it actually, it's, it, you get a lot for what it's worth. I also like the way that Devin does it, where you can sort of edit the plan as it goes along. And then the other thing with Repl.it, we had a, we hosted a sort of dev day pregame with Repl.it and they also commented about multi-agents. So like having two agents kind of bounce off of each other. I think it's a similar approach to what you're talking about with kind of the few shot example, just as in the prompts of clarifying what the agent wants. But typically I think this would be implemented as a tool calling another agent, like a sub-agent I don't know if you explored that, do you like that idea?
Erik [00:35:20]: I haven't explored this enough, but I've definitely heard of people having good success with this. Of almost like basically having a few different sort of personas of agents, even if they're all the same LLM. I think this is one thing with multi-agent that a lot of people will kind of get confused by is they think it has to be different models behind each thing. But really it's sort of usually the same, the same model with different prompts. And yet having one, having them have different personas to kind of bring different sort of thoughts and priorities to the table. I've seen that work very well and sort of create a much more thorough and thought out
Swyx [00:35:53]: response.
Erik [00:35:53]: I think the downside is just that it adds a lot of complexity and it adds a lot of extra tokens. So I think it depends what you care about. If you want a plan that's very thorough and detailed, I think it's great. If you want a really quick, just like write this function, you know, you probably don't want to do that and have like a bunch of different calls before it does this.
Alessio [00:36:11]: And just talking about the prompt, why are XML tags so good in Cloud? I think initially people were like, oh, maybe you're just getting lucky with XML. But I saw obviously you use them in your own agent prompts, so they must work. And why is it so model specific to your family?
Erik [00:36:26]: Yeah, I think that there's, again, I'm not sure how much I can say, but I think there's historical reasons that internally we've preferred XML. I think also the one broader thing I'll say is that if you look at certain kinds of outputs, there is overhead to outputting in JSON. If you're trying to output code in JSON, there's a lot of extra escaping that needs to be done, and that actually hurts model performance across the board. Versus if you're in just a single XML tag, there's none of that sort of escaping that
Swyx [00:36:58]: needs to happen.
Erik [00:36:58]: That being said, I haven't tried having it write HTML and XML, which maybe then you start running into weird escaping things there. I'm not sure. But yeah, I'd say that's some historical reasons, and there's less overhead of escaping.
Swyx [00:37:12]: I use XML in other models as well, and it's just a really nice way to make sure that the thing that ends is tied to the thing that starts. That's the only way to do code fences where you're pretty sure example one start, example one end, that is one cohesive unit.
Alessio [00:37:30]: Because the braces are nondescriptive. Yeah, exactly.
Swyx [00:37:33]: That would be my simple reason. XML is good for everyone, not just Cloud. Cloud was just the first one to popularize it, I think.
Erik [00:37:39]: I do definitely prefer to read XML than read JSON.
Alessio [00:37:43]: Any other details that are maybe underappreciated? I know, for example, you had the absolute paths versus relative. Any other fun nuggets?
Erik [00:37:52]: I think that's a good sort of anecdote to mention about iterating on tools. Like I said, spend time prompt engineering your tools, and don't just write the prompt, but write the tool, and then actually give it to the model and read a bunch of transcripts about how the model tries to use the tool. I think by doing that, you will find areas where the model misunderstands a tool or makes mistakes, and then basically change the tool to make it foolproof. There's this Japanese term, pokayoke, about making tools mistake-proof. You know, the classic idea is you can have a plug that can fit either way, and that's dangerous, or you can make it asymmetric so that it can't fit this way, it has to go like this, and that's a better tool because you can't use it the wrong way. So for this example of absolute paths, one of the things that we saw while testing these tools is, oh, if the model has done CD and moved to a different directory, it would often get confused when trying to use the tool because it's now in a different directory, and so the paths aren't lining up. So we said, oh, well, let's just force the tool to always require an absolute path, and then that's easy for the model to understand. It knows sort of where it is. It knows where the files are. And then once we have it always giving absolute paths, it never messes up even, like, no matter where it is because it just, if you're using an absolute path, it doesn't matter where
Swyx [00:39:13]: you are.
Erik [00:39:13]: So iterations like that, you know, let us make the tool foolproof for the model. I'd say there's other categories of things where we see, oh, if the model, you know, opens vim, like, you know, it's never going to return. And so the tool is stuck.
Swyx [00:39:28]: Did it get stuck? Yeah. Get out of vim. What?
Erik [00:39:31]: Well, because the tool is, like, it just text in, text out. It's not interactive. So it's not like the model doesn't know how to get out of vim. It's that the way that the tool is, like, hooked up to the computer is not interactive. Yes, I mean, there is the meme of no one knows how to get out of vim. You know, basically, we just added instructions in the tool of, like, hey, don't launch commands that don't return.
Swyx [00:39:54]: Yeah, like, don't launch vim.
Erik [00:39:55]: Don't launch whatever. If you do need to do something, you know, put an ampersand after it to launch it in the background. And so, like, just, you know, putting kind of instructions like that just right in the description for the tool really helps the model. And I think, like, that's an underutilized space of prompt engineering, where, like, people might try to do that in the overall prompt, but just put that in the tool itself so the model knows that it's, like, for this tool, this is what's relevant.
Swyx [00:40:20]: You said you worked on the function calling and tool use before you actually started this vBench work, right? Was there any surprises? Because you basically went from creator of that API to user of that API. Any surprises or changes you would make now that you have extensively dog-fooded in a state-of-the-art agent?
Erik [00:40:39]: I want us to make, like, maybe, like, a little bit less verbose SDK. I think some way, like, right now, it just takes, I think we sort of force people to do the best practices of writing out sort of these full JSON schemas, but it would be really nice if you could just pass in a Python function as a tool. I think that could be something nice.
Swyx [00:40:58]: I think that there's a lot of, like, Python- There's helper libraries. ... structure, you know. I don't know if there's anyone else that is specializing for Anthropic. Maybe Jeremy Howard's and Simon Willis's stuff. They all have Cloud-specific stuff that they are working on. Cloudette. Cloudette, exactly. I also wanted to spend a little bit of time with SuiteAgent. It seems like a very general framework. Like, is there a reason you picked it apart from it's the same authors as vBench, or?
Erik [00:41:21]: The main thing we wanted to go with was the same authors as vBench, so it just felt sort of like the safest, most neutral option. And it was, you know, very high quality. It was very easy to modify, to work with. I would say it also actually, their underlying framework is sort of this, it's like, you
Swyx [00:41:39]: know, think, act, observe.
Erik [00:41:40]: That they kind of go through this loop, which is like a little bit more hard-coded than what we wanted to do, but it's still very close. That's still very general. So it felt like a good match as sort of the starting point for our agent. And we had already sort of worked with and talked with the SWE-Bench people directly, so it felt nice to just have, you know, we already know the authors. This will be easy to work with.
Swyx [00:42:00]: I'll share a little bit of like, this all seems disconnected, but once you figure out the people and where they go to school, it all makes sense. So it's all Princeton. Yeah, the SWE-Bench and SuiteAgent.
Erik [00:42:11]: It's a group out of Princeton.
Swyx [00:42:12]: Yeah, and we had Shun Yu on the pod, and he came up with the React paradigm, and that's think, act, observe. That's all React. So they're all friends. Yep, yeah, exactly.
Erik [00:42:22]: And you know, if you actually read our traces of our submission, you can actually see like think, act, observe in our logs. And we just didn't even change the printing code. So it's like doing still function calls under the hood, and the model can do sort of multiple function calls in a row without thinking in between if it wants to. But yeah, so a lot of similarities and a lot of things we inherited from SuiteAgent just as a starting point for the framework.
Alessio [00:42:47]: Any thoughts about other agent frameworks? I think there's, you know, the whole gamut from very simple to like very complex.
Swyx [00:42:53]: Autogen, CooEI, LandGraph. Yeah, yeah.
Erik [00:42:56]: I think I haven't explored a lot of them in detail. I would say with agent frameworks in general, they can certainly save you some like boilerplate. But I think there's actually this like downside of making agents too easy, where you end up very quickly like building a much more complex system than you need. And suddenly, you know, instead of having one prompt, you have five agents that are talking to each other and doing a dialogue. And it's like, because the framework made that 10 lines to do, you end up building something that's way too complex. So I think I would actually caution people to like try to start without these frameworks if you can, because you'll be closer to the raw prompts and be able to sort of directly understand what's going on. I think a lot of times these frameworks also, by trying to make everything feel really magical, you end up sort of really hiding what the actual prompt and output of the model is, and that can make it much harder to debug. So certainly these things have a place, and I think they do really help at getting rid of boilerplate, but they come with this cost of obfuscating what's really happening and making it too easy to very quickly add a lot of complexity. So yeah, I would recommend people to like try it from scratch, and it's like not that bad.
Alessio [00:44:08]: Would you rather have like a framework of tools? Do you almost see like, hey, it's maybe easier to get tools that are already well curated, like the ones that you build, if I had an easy way to get the best tool from you, and
Swyx [00:44:21]: like you maintain the definition?
Alessio [00:44:22]: Or yeah, any thoughts on how you want to formalize tool sharing?
Erik [00:44:26]: Yeah, I think that's something that we're certainly interested in exploring, and I think there is space for sort of these general tools that will be very broadly applicable. But at the same time, most people that are building on these, they do have much more specific things that they're trying to do. You know, I think that might be useful for hobbyists and demos, but the ultimate end applications are going to be bespoke. And so we just want to make sure that the model's great at any tool that it uses. But certainly something we're exploring.
Alessio [00:44:52]: So everything bespoke, no frameworks, no anything.
Swyx [00:44:55]: Just for now, for now.
Erik [00:44:56]: Yeah, I would say that like the best thing I've seen is people building up from like, build some good util functions, and then you can use those as building blocks. Yeah, yeah.
Alessio [00:45:05]: I have a utils folder, or like all these scripts. My framework is like def, call, and tropic. And then I just put all the defaults.
Swyx [00:45:12]: Yeah, exactly. There's a startup hidden in every utils folder, you know? No, totally not. Like, if you use it enough, like it's a startup, you know? At some point. I'm kind of curious, is there a maximum length of turns that it took? Like, what was the longest run? I actually don't.
Erik [00:45:27]: I mean, it had basically infinite turns until it ran into a 200k context. I should have looked this up. I don't know. And so for some of those failed cases where it eventually ran out of context, I mean, it was over 100 turns. I'm trying to remember like the longest successful run, but I think it was definitely over 100 turns that some of the times.
Swyx [00:45:48]: Which is not that much. It's a coffee break. Yeah.
Erik [00:45:52]: But certainly, you know, these things can be a lot of turns. And I think that's because some of these things are really hard, where it's going to take, you know, many tries to do it. And if you think about like, think about a task that takes a human four hours to do. Think about how many different files you read, and like times you edit a file in four hours. That's a lot more than 100.
Alessio [00:46:10]: How many times you open Twitter because you get distracted. But if you had a lot more compute, what's kind of like the return on the extra compute now? So like, you know, if you had thousands of turns or like whatever, like how much better would it get?
Erik [00:46:23]: Yeah, this I don't know. And I think this is, I think sort of one of the open areas of research in general with agents is memory and sort of how do you have something that can do work beyond its context length where you're just purely appending. So you mentioned earlier things like pruning bad paths. I think there's a lot of interesting work around there. Can you just roll back but summarize, hey, don't go down this path? There be dragons. Yeah, I think that's very interesting that you could have something that that uses way more tokens without ever using at a time more than 200k. So I think that's very interesting. I think the biggest thing is like, can you make the model sort of losslessly summarize what it's learned from trying different approaches and bring things back? I think that's sort of the big challenge.
Swyx [00:47:11]: What about different models?
Alessio [00:47:12]: So you have Haiku, which is like, you know, cheaper. So you're like, well, what if I have a Haiku to do a lot of these smaller things and then put it back up?
Erik [00:47:20]: I think Cursor might have said that they actually have a separate model for file editing.
Swyx [00:47:25]: I'm trying to remember.
Erik [00:47:25]: I think they were on maybe the Lex Fridman podcast where they said they have a bigger model, like write what the code should be and then a different model, like apply it. So I think there's a lot of interesting room for stuff like that. Yeah, fast supply.
Swyx [00:47:37]: We actually did a pod with Fireworks that they worked with on. It's speculative decoding.
Erik [00:47:41]: But I think there's also really interesting things about like, you know, paring down input tokens as well, especially sometimes the models trying to read like a 10,000 line file. That's a lot of tokens. And most of it is actually not going to be relevant. I think it'd be really interesting to like delegate that to Haiku. Haiku read this file and just pull out the most relevant functions. And then, you know, Sonnet reads just those and you save 90% on tokens. I think there's a lot of really interesting room for things like that. And again, we were just trying to do sort of the simplest, most minimal thing and show that it works. I'm really hoping that people, sort of the agent community builds things like that on top of our models. That's, again, why we released these tools. We're not going to go and do lots more submissions to SWE-Bench and try to prompt engineer this and build a bigger system. We want people to like the ecosystem to do that on top of our models. But yeah, so I think that's a really interesting one.
Swyx [00:48:32]: It turns out, I think you did do 3.5 Haiku with your tools and it scored a 40.6. Yes.
Erik [00:48:38]: So it did very well. It itself is actually very smart, which is great. But we haven't done any experiments with this combination of the two models. But yeah, I think that's one of the exciting things is that how well Haiku 3.5 did on SWE-Bench shows that sort of even our smallest, fastest model is very good at sort of thinking agentically and working on hard problems. Like it's not just sort of for writing simple text anymore.
Alessio [00:49:02]: And I know you're not going to talk about it, but like Sonnet is not even supposed to be the best model, you know? Like Opus, it's kind of like we left it at three back in the corner intro. At some point, I'm sure the new Opus will come out. And if you had Opus Plus on it, that sounds very, very good.
Swyx [00:49:19]: There's a run with SuiteAgent plus Opus, but that's the official SWE-Bench guys doing it.
Erik [00:49:24]: That was the older, you know, 3.0.
Swyx [00:49:25]: You didn't do yours. Yeah. Okay. Did you want to? I mean, you could just change the model name.
Erik [00:49:31]: I think we didn't submit it, but I think we included it in our model card.
Swyx [00:49:35]: Okay.
Erik [00:49:35]: We included the score as a comparison. Yeah.
Swyx [00:49:38]: Yeah.
Erik [00:49:38]: And Sonnet and Haiku, actually, I think the new ones, they both outperformed the original Opus. Yeah. I did see that.
Swyx [00:49:44]: Yeah. It's a little bit hard to find. Yeah.
Erik [00:49:47]: It's not an exciting score, so we didn't feel like they need to submit it to the benchmark.
Swyx [00:49:52]: We can cut over to computer use if we're okay with moving on to topics on this, if anything else. I think we're good.
Erik [00:49:58]: I'm trying to think if there's anything else SWE-Bench related.
Swyx [00:50:02]: It doesn't have to be also just specifically SWE-Bench, but just your thoughts on building agents, because you are one of the few people that have reached this leaderboard on building a coding agent. This is the state of the art. It's surprisingly not that hard to reach with some good principles. Right. There's obviously a ton of low-hanging fruit that we covered. Your thoughts on if you were to build a coding agent startup, what next?
Erik [00:50:24]: I think the really interesting question for me, for all the startups out there, is this kind of divergence between the benchmarks and what real customers will want. So I'm curious, maybe the next time you have a coding agent startup on the podcast, you should ask them that. What are the differences that they're starting to make? Tomorrow.
Swyx [00:50:40]: Oh, perfect, perfect. Yeah.
Erik [00:50:41]: I'm actually very curious what they will see, because I also have seen, I feel like it's slowed down a little bit if I don't see the startups submitting to SWE-Bench that much anymore.
Swyx [00:50:52]: Because of the traces, the trace. So we had Cosign on, they had a 50-something on full, on SWE-Bench full, which is the hardest one, and they were rejected because they didn't want to submit their traces. Yep. IP, you know? Yeah, that makes sense, that makes sense. Actually, tomorrow we're talking to Bolt, which is a cloud customer. You guys actually published a case study with them. I assume you weren't involved with that, but they were very happy with Cloud. Cool. One of the biggest launches of the year. Yeah, totally. We actually happened to be sitting in Adept's former office. My take on this is Anthropic shipped Adept as a feature. It's still a beta feature, but yes. What was it like when you tried it for the first time? Was it obvious that Cloud had reached that stage where you could do computer use? It was somewhat of a surprise to me.
Erik [00:51:40]: I had been on vacation, and I came back, and everyone's like, computer use works. So it was this very exciting moment. After the first go to Google, I think I tried to have it play Minecraft or something, and it actually installed and opened Minecraft.
Swyx [00:51:54]: I was like, wow, this is pretty cool.
Erik [00:51:55]: So I was like, wow, yeah, this thing can actually use a computer. And certainly, it is still beta. There's certain things that it's not very good at yet. But I'm really excited, I think, most broadly, not just for new things that weren't possible before, but as a much lower friction way to implement tool use. One anecdote from my days at Cobalt Robotics, we wanted our robots to be able to ride elevators, to go between floors and fully cover a building. The first way that we did this was doing API integrations with the elevator companies. Some of them actually had APIs. We could send a request, and it would move the elevator. Each new company we did took six months to do,
Swyx [00:52:37]: because they were very slow.
Erik [00:52:39]: They didn't really care.
Swyx [00:52:40]: Or an elevator, not an API.
Erik [00:52:42]: Even installing, once we had it with the company, they would have to literally go install an API box on the elevator that we wanted to use, and that would sometimes take six months.
Swyx [00:52:51]: So very slow.
Erik [00:52:52]: And eventually, we're like, okay, this is slowing down all of our customer deployments. And I was like, what if we just add an arm to the robot? And I added this little arm that could literally go and press the elevator buttons, and we use computer vision to do this. And we could deploy that in a single day, and have the robot being able to use the elevators. At the same time, it was slower than the API. It wasn't quite as reliable. Sometimes it would miss, and it would have to try to press it again.
Swyx [00:53:20]: But it would get there.
Erik [00:53:20]: But it was slower and a little bit less reliable. And I kind of see this as an analogy to computer use, of anything you can do with computer use today, you could probably write tool use and integrate it with APIs.
Swyx [00:53:33]: It's up to the language model.
Erik [00:53:34]: But that's going to take a bunch of software engineering to write those integrations.
Swyx [00:53:38]: You have to do all this stuff.
Erik [00:53:39]: With computer use, just give the thing a browser that's logged into what you want to integrate with, and it's going to work immediately. And I see that reduction in friction as being incredibly exciting. Imagine a customer support team where, okay, hey, you got this customer support bot, but you need to go integrate it with all these things. And you don't have any engineers on your customer support team. But if you can just give the thing a browser that's logged into your systems that you need it to have access to, now, suddenly, in one day, you could be up and rolling with a fully integrated customer service bot that could go do all the actions you care about. So I think that's the most exciting thing for me about computer use, is reducing that friction of integrations to almost zero.
Alessio [00:54:20]: Or farming on World of Warcraft.
Swyx [00:54:23]: Yes, or that.
Erik [00:54:23]: Just go computer use.
Alessio [00:54:25]: Very high-value use cases.
Swyx [00:54:27]: I always say about this, this is the oldest question in robotics or self-driving, which is, do you drive by vision or do you have special tools? And vision is the universal tool to claim all tools. There's trade-offs, but there's situations in which that will come. But this week's podcast, the one that we just put out, had Stan Polu from Dust saying that he doesn't see a future where it's the significant workhorse. I think there could be a separation between maybe the high-volume use cases. You want APIs. And then the long tail, you want computer use. I totally agree. Right?
Erik [00:55:00]: Or you'll start, you'll prototype something with computer use. And then, hey, this is working. Customers have adopted this feature. OK, let's go turn it into an API. And it'll be faster and use less tokens.
Swyx [00:55:11]: I'd be interested to see a computer use agent replace itself by figuring out the API and then just dropping out of the equation altogether.
Erik [00:55:20]: Yeah, that's really fun, actually.
Swyx [00:55:22]: If I was running an RPA company, you would have the RPA scripting. RPA, for people listening, is robotic process automation, where you would script things that always show up in sequence. So you don't have an LLM in the loop. And so basically what you need to do is train an LLM to code that script. And then you can naturally hand off from computer use to non-computer use.
Erik [00:55:43]: Or have some way to turn Claude's actions of computer use into a saved script that you can then run repeatedly.
Swyx [00:55:49]: Yeah, it'd be interesting to record that.
Alessio [00:55:50]: Why did you decide to not ship any sandbox harness for computer use? It's kind of like, hey, peace.
Swyx [00:55:58]: Run at your own risk. It's Docker, right?
Erik [00:55:59]: No, no, we launched it with, I think, a VM or Docker, a Docker as system.
Alessio [00:56:03]: But it's not for your actual computer, right? The Docker instance runs in the Docker. It's not for...
Swyx [00:56:10]: Yeah, it runs its own browser.
Erik [00:56:13]: I mean, the main reason for that, one, is sort of security. We don't want... The model can do anything. So we wanted to give it a sandbox, not have people do their own computer. At least sort of for our default experience. We really care about providing a nice sort of... Making the default safe, I think, is the best way for us to do it. And I mean, very quickly, people made modifications to let you run it on your own desktop. And that's fine.
Swyx [00:56:37]: Someone else can do that.
Erik [00:56:37]: But we don't want that to be the official, anthropic thing to run. I would say also, from a product perspective, right now, because this is sort of still in beta, I think a lot of the most useful use cases are... Like, a sandbox is actually what you want. You want something where, hey, it can't mess up anything in here. It only has what I gave it. Also, if it's using your computer, you know, you can't use your computer at the same time. I think you actually want it to have its own screen. It's like you and a person pair programming, but only on one laptop versus you have two laptops.
Swyx [00:57:07]: Everyone should totally have a side laptop where the computer uses... Cloud is just doing its thing. Yeah, yeah.
Erik [00:57:11]: I think it's such a better experience. Unless there's something very explicit you want it to do for you on your own computer.
Swyx [00:57:17]: It becomes like you're sort of shelling into a remote machine and, you know, maybe checking in on it every now and then. Like, I have fond memories of... Half our audience is going to be too young to remember this, but Citrix desktop experience, like, you were sort of remote into a machine that someone else was operating. And for a long time, that would be how you did, like, enterprise computing. Yeah, yeah. It's coming back. Any other implications of computer use? You know, is it a fun demo or is it, like, the future of Anthropic? I'm very excited about it.
Erik [00:57:50]: I think that, like, there's a lot of sort of very repetitive work that, like, computer use will be great for. I think I've seen some examples of people build, like, coding agents that then also, like, test the front end that they made. So I think it's very cool to, like, use computer use to be able to close the loop on a lot of things that right now just a terminal-based agent can't do. So I think that's very exciting.
Swyx [00:58:11]: It's kind of like end-to-end testing. Exactly. Yeah, yeah.
Erik [00:58:14]: The end sort of front-end and web testing is something I'm very excited about.
Swyx [00:58:18]: Yeah, I've seen Amanda also talking... This would be Amanda Askell, the head of Cloud Character. She goes on a lunch break and it generates, you know, research ideas for her. Giving it a name like computer use is very practical. It's like you're supposed to do things, but maybe sometimes it's not about doing things, it's about thinking. And thinking... In the process of thinking, you're using the computer. In some way that's, you know, solving SweetBench, like, you should be allowed to use the internet or you should be allowed to use a computer to solve it and use your vision and use whatever. Like, we're just sort of shackling it with all these restrictions just because we want to play nice for a benchmark. But really, you know, a full AI will be able to do all these things. To think. Yeah, we'll definitely be able to. To reason. To Google and search for things.
Erik [00:58:58]: Yeah, yeah. Pull down inspiration.
Alessio [00:59:00]: Can we just do a... before we wrap, a robotics corner?
Swyx [00:59:03]: Oh, yeah, yeah.
Alessio [00:59:04]: People are always curious, especially with somebody that is not trying to hype their own company. What's the state of AI robotics? Under-hyped, over-hyped?
Erik [00:59:12]: Yeah, and I'll say, like, these are my opinions, not Anthropic's. And again, coming from a place of a burned-out robotics founder, so take everything with a grain of salt. I would say on the positives, like, there is really sort of incredible progress that's happened in the last five years that I think will be a big unlock for robotics. The first is just general purpose language models. I mean, there was an old saying in robotics that if to fully describe your task is harder than to just do the task, you can never automate it. Because, like, it's going to take more effort to even tell the robot how to do this thing than to me just do it itself. LLM solved that. I no longer need to go exhaustively program in every little thing I could do. The thing just has common sense. And it's going to know, how do I make a Reuben sandwich? I'm not going to have to go program that in. Whereas before, like, the idea of even, like, a cooking thing, it's like, oh god, like, we're gonna have the team of engineers that are hard coding recipes for the long tail of anything. It would be a disaster. So I think that's one thing, is that bringing common sense really is, like, solves this huge problem of describing tasks. The second big innovation has been diffusion models for path planning. A lot of this work came out of Toyota Research. There's a lot of startups now that are working on this, like Physical Intelligence Pi, Chelsea Finn's startup out of Stanford. And the basic idea here is using a little bit of the, I'd say maybe more inspiration from diffusion rather than diffusion models themselves. But they're a way to basically learn an end-to-end sort of motion control. Whereas previously, all of robotics motion control was sort of very hard-coded. You either, you know, you're programming in explicit motions, or you're programming in an explicit goal and using an optimization library to find the shortest path to it. This is now something where you just give it a bunch of demonstrations. And again, just like using learning, it's basically like learning from these examples. What does it mean to go pick up a cup? And doing these in a way just like diffusion models, where they are somewhat conditioned by text, you can have the same model learn many different tasks. And then the hope is that these start to generalize. That if you've trained it on picking up coffee cups and picking up books, then when I say pick up the backpack, it knows how to do that too. Even though you've never trained it on that. That's kind of the holy grail here, is that you train it on 500 different tasks, and then that's enough to really get it to generalize to do anything you would need. I think that's like still a big TBD. And these people are working, have like measured some degree of generalization. But at the end of the day, it's also like LLMs. Like, you know, do you really care about the thing, being able to do something that no one has ever shown in training data? People for like a home robot, there's going to be like a hundred things that people really wanted to do. And you can just make sure it has good training for those things. What you do care about then is like generalization within a task of, oh, I've never seen this particular coffee mug before. Can I still pick it up? And those, the models do seem very good at. So these kind of are the two big things that are going for robotics right now, is LLMs for common sense and diffusion-inspired path planning algorithms. I think this is very promising, but I think there's a lot of hype. And I think where we are right now is where self-driving cars were 10 years ago. I think we have very cool demos that work. I mean, 10 years ago, you had videos of people driving a car on the highway, driving a car, you know, on a street with a safety driver. But it's really taken a long time to go from there to, I took a Waymo here today. And even Waymo is only in SF and a few other cities. And I think it takes a long time for these things to actually get everywhere and to get all the edge cases covered. I think that for robotics, the limiting factor is going to be reliability, that these models are really good at doing these demos of doing laundry or doing dishes. If they only work 99% of the time, that sounds good, but that's actually really annoying. Humans are really good at these tasks. Imagine if one out of every 100 dishes, it washed, it breaks. You would not want that robot in your house, or you certainly wouldn't want that in your factory if one of every 100 boxes that it moves, it drops and breaks things inside it. So I think for these things to really be useful, they're going to have to hit a very, very high level of reliability, just like self-driving cars. And I don't know how hard it's going to be for these models to move from the 95% reliability to 99.9. I think that's going to be the big thing. And I think also, I'm a little skeptical of how good the unit economics of these things will be. These robots are going to be very expensive to build. And if you're just trying to replace labor, like a one-for-one purchase, it kind of sets an upper cap about how much you can charge. And so it seems like it's not that great a business. I'm also worried about that for the self-driving car industry.
Alessio [01:04:05]: Do you see most of the applications actually taking some of the older, especially manufacturing machinery, which needs to be very precise? Even if it's off by just a few millimeters, it cannot screw up the whole thing and be able to adjust at the edge? Or do you think the net new use cases may be more interesting?
Erik [01:04:24]: I think it'd be very hard to replace a lot of those traditional manufacturing robots because everything relies on that precision. If you have a model that can, again, only get there 99% of the time, you don't want 1% of your cars to have the weld in the wrong spot. That's going to be a disaster. And a lot of manufacturing is all about getting rid of as much variance and uncertainty as
Swyx [01:04:47]: possible.
Erik [01:04:47]: Yeah.
Swyx [01:04:48]: And what about the hardware?
Alessio [01:04:49]: A lot of my friends that work in robotics, one of their big issues is sometimes you just have a servo that fails, and it takes a bunch of time to fix that.
Swyx [01:04:57]: Is that holding back things?
Alessio [01:04:58]: Or is the software still, anyway, not that ready?
Swyx [01:05:01]: I think both.
Erik [01:05:01]: I think there's been a lot more progress in the software in the last few years. And I think a lot of the humanoid robot companies now are really trying to build amazing hardware. Hardware is just so hard. It's something where you build your first robot, and it works. You're like, great. Then you build 10 of them. Five of them work. Three of them work half the time. Two of them don't work. And you built them all the same, and you don't know why. And it's just like the real world has this level of detail and differences that software
Swyx [01:05:28]: doesn't have.
Erik [01:05:29]: Imagine if every for loop you wrote, some of them just didn't work. Some of them were slower than others. How do you deal with that? Imagine if every binary that you shipped to a customer, each of those four loops was a
Swyx [01:05:41]: little different.
Erik [01:05:41]: It becomes just so hard to scale and maintain quality of these things. And I think that's what makes hardware really hard. It's not building one of something, but repeatedly building something and making it work reliably. Where again, you'll buy a batch of 100 motors, and each of those motors will behave a little bit differently to the same input command.
Swyx [01:06:01]: This is your lived experience at Cobalt.
Erik [01:06:03]: And robotics is all about how do you build something that's robust despite these differences.
Swyx [01:06:08]: We can't get the tolerance of motors down to-
Erik [01:06:10]: It's just everything.
Swyx [01:06:13]: It's actually everything.
Alessio [01:06:14]: Yeah.
Erik [01:06:15]: No, I mean, one of my horror stories was that at Cobalt, this was many years ago, we had a thermal camera on the robot that had a USB connection to the computer inside, which is, first of all, is a big mistake. You're not supposed to use a USB. It is not a reliable protocol. It's designed that if there's mistakes, the user can just unplug it and plug it back in. I see. And so typically things that are USB, they're not designed to the same level of very high reliability you need. Again, because they assume someone will just unplug it and replug it back in. You just say someone sometime.
Swyx [01:06:46]: I heard this too, and I didn't listen to it.
Erik [01:06:47]: I really wish I had before. Anyway, at a certain point, a bunch of these thermal cameras started failing, and we couldn't figure out why. And I asked everyone on the team, like, hey, what's changed? Did the software change around this? Did the hardware design change around this? And I was investigating all this stuff, looking at kernel logs of what's happening with this
Swyx [01:07:07]: thing.
Erik [01:07:07]: And finally, the procurement person was like, oh, yeah, well, I found this new vendor for USB cables last summer.
Swyx [01:07:14]: And I'm like, what?
Erik [01:07:15]: You switched which vendor were buying USB cables? I'm like, yeah, it's the same exact cable. It's just a dollar cheaper. And it turns out this was the problem. This new cable had slightly worse resistance or slightly worse EMI interference. And it worked most of the time. But 1% of the time, these cameras would fail, and we'd need to reboot a big part of the system. And it was all just because the same exact spec, these two different USB cables, slightly different. And so these are the kind of things you deal with with hardware.
Swyx [01:07:45]: For listeners, we had an episode with Josh Albrecht in BU where he talked about buying tens of thousands of GPUs. And just some of them will just not do math. Yeah, that's the same thing. You run some tests to find the bad batch, and then you return it to sender because they just, GPUs won't do math, right? Yeah, yeah, this is the thing.
Erik [01:08:05]: The real world has this level of detail. Eric Jang, he did AI at Google.
Swyx [01:08:11]: Yeah, 1X. Yeah, and then joined 1X.
Erik [01:08:13]: I see him post on Twitter occasionally of complaints about hardware and supply chain. And we know each other, and we joke occasionally. I went from robotics into AI, and he went from AI into robotics.
Swyx [01:08:26]: I mean, look, very, very promising. The time of the real world is unlimited, right? But just also a lot harder. And yeah, I do think something I also tell people about for why working software agents is they're infinitely clonable. Yeah, they always work the same way. Mostly, unless you're using Python. And yeah, I mean, this is the whole thesis. I'm also interested, you dropped a little bit of alpha there. I don't want to make sure we don't lose it. Like, you're kind of skeptical about self-driving as a business. So I want to double click on this a little bit, because I mean, I think that shouldn't be taken away. We do have some public Waymo numbers. Read from Waymo is pretty public with their stats. They're exceeding 100 Waymo trips a week. If you assume a 25𝑟𝑖𝑑𝑒𝑎𝑣𝑒𝑟𝑎𝑔𝑒,𝑡ℎ𝑎𝑡′𝑠25rideaverage,that′s130 million revenue run rate. At some point, they will recoup their investment, right? Like, what are we talking about here? Way to skepticism.
Erik [01:09:21]: I think, and again, I'm not an expert. I don't know their financials. I would say the thing I'm worried about is compared to an Uber, I don't know how much an Uber driver takes home a year, but call that the revenue that a Waymo is going to be making in that same year. Those cars are expensive. It's not about if you can hit profitability, it's about your cash conversion cycles. Is building one Waymo, how cheap can you make that compared to how much you're earning as the equivalent of what an Uber driver would take home? Because remember, an Uber driver, you're not getting that whole revenue. You think about, for the Uber driver, the cost of the car, the depreciation of the car. I'm not convinced how much profit Waymo can actually make per car.
Swyx [01:10:02]: That's, I think, my skepticism.
Alessio [01:10:02]: Well, they need to pre-assess the run Waymo because the Class C is like $110 grand, something
Swyx [01:10:09]: like that, plus the LiDAR. That's many years of, yeah, yeah, yeah. Exactly, exactly. Anything else?
Alessio [01:10:14]: Parting thoughts? Call to action? Rants?
Swyx [01:10:18]: The floor is yours.
Erik [01:10:19]: I'm very excited to see a lot more LLM agents out there in the world doing things. And I think they'll be, the biggest limiting thing will start to become, do people trust the output of these agents? And how do you trust the output of an agent that did five hours of work for you and is coming back with something? And if you can't find some way to trust that agent's work, it kind of wasn't valuable at all. So I think that's going to be a really important thing, is not just doing the work, but doing the work in a trustable, auditable way where you can also explain to the human, hey, here's exactly how this works and why and how I came to it. I think that's going to be really important.
Swyx [01:10:54]: Thank you so much. Yeah, thanks. This was great.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
Why Compound AI + Open Source will beat Closed AI
25 nov· Latent Space: The AI Engineer Podcast
We have a full slate of upcoming events: AI Engineer London, AWS Re:Invent in Las Vegas, and now Latent Space LIVE! at NeurIPS in Vancouver and online. Sign up to join and speak!
We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!
We try to stay close to the inference providers as part of our coverage, as our podcasts with Together AI and Replicate will attest:
However one of the most notable pull quotes from our very well received Braintrust episode was his opinion that open source model adoption has NOT gone very well and is actually declining in relative market share terms (it is of course increasing in absolute terms):
Today’s guest, Lin Qiao, would wholly disagree. Her team of Pytorch/GPU experts are wholly dedicated toward helping you serve and finetune the full stack of open source models from Meta and others, across all modalities (Text, Audio, Image, Embedding, Vision-understanding), helping customers like Cursor and Hubspot scale up open source model inference both rapidly and affordably.
Fireworks has emerged after its successive funding rounds with top tier VCs as one of the leaders of the Compound AI movement, a term first coined by the Databricks/Mosaic gang at Berkeley AI and adapted as “Composite AI” by Gartner:
Replicating o1
We are the first podcast to discuss Fireworks’ f1, their proprietary replication of OpenAI’s o1. This has become a surprisingly hot area of competition in the past week as both Nous Forge and Deepseek r1 have launched competitive models.
Full Video Podcast
Like and subscribe!
Timestamps
* 00:00:00 Introductions
* 00:02:08 Pre-history of Fireworks and PyTorch at Meta
* 00:09:49 Product Strategy: From Framework to Model Library
* 00:13:01 Compound AI Concept and Industry Dynamics
* 00:20:07 Fireworks' Distributed Inference Engine
* 00:22:58 OSS Model Support and Competitive Strategy
* 00:29:46 Declarative System Approach in AI
* 00:31:00 Can OSS replicate o1?
* 00:36:51 Fireworks f1
* 00:41:03 Collaboration with Cursor and Speculative Decoding
* 00:46:44 Fireworks quantization (and drama around it)
* 00:49:38 Pricing Strategy
* 00:51:51 Underrated Features of Fireworks Platform
* 00:55:17 Hiring
Transcript
Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner at CTO at Danceable Partners, and I'm joined by my co-host, Swyx founder, Osmalayar.
Swyx [00:00:11]: Hey, and today we're in a very special studio inside the Fireworks office with Lin Qiang, CEO of Fireworks. Welcome. Yeah.
Lin [00:00:20]: Oh, you should welcome us.
Swyx [00:00:21]: Yeah, welcome. Yeah, thanks for having us. It's unusual to be in the home of a startup, but it's also, I think our relationship is a bit unusual compared to all our normal guests. Definitely.
Lin [00:00:34]: Yeah. I'm super excited to talk about very interesting topics in that space with both of you.
Swyx [00:00:41]: You just celebrated your two-year anniversary yesterday.
Lin [00:00:43]: Yeah, it's quite a crazy journey. We circle around and share all the crazy stories across these two years, and it has been super fun. All the way from we experienced Silicon Valley bank run to we delete some data that shouldn't be deleted operationally. We went through a massive scale where we actually are busy getting capacity to, yeah, we learned to kind of work with it as a team with a lot of brilliant people across different places to join a company. It has really been a fun journey.
Alessio [00:01:24]: When you started, did you think the technical stuff will be harder or the bank run and then the people side? I think there's a lot of amazing researchers that want to do companies and it's like the hardest thing is going to be building the product and then you have all these different other things. So, were you surprised by what has been your experience the most?
Lin [00:01:42]: Yeah, to be honest with you, my focus has always been on the product side and then after the product goes to market. And I didn't realize the rest has been so complicated, operating a company and so on. But because I don't think about it, I just kind of manage it. So it's done. I think I just somehow don't think about it too much and solve whatever problem coming our way and it worked.
Swyx [00:02:08]: So let's, I guess, let's start at the pre-history, the initial history of Fireworks. You ran the PyTorch team at Meta for a number of years and we previously had Sumit Chintal on and I think we were just all very interested in the history of GenEI. Maybe not that many people know how deeply involved Faire and Meta were prior to the current GenEI revolution.
Lin [00:02:35]: My background is deep in distributed system, database management system. And I joined Meta from the data side and I saw this tremendous amount of data growth, which cost a lot of money and we're analyzing what's going on. And it's clear that AI is driving all this data generation. So it's a very interesting time because when I joined Meta, Meta is going through ramping down mobile-first, finishing the mobile-first transition and then starting AI-first. And there's a fundamental reason about that sequence because mobile-first gave a full range of user engagement that has never existed before. And all this user engagement generated a lot of data and this data power AI. So then the whole entire industry is also going through, falling through this same transition. When I see, oh, okay, this AI is powering all this data generation and look at where's our AI stack. There's no software, there's no hardware, there's no people, there's no team. I want to dive up there and help this movement. So when I started, it's very interesting industry landscape. There are a lot of AI frameworks. It's a kind of proliferation of AI frameworks happening in the industry. But all the AI frameworks focus on production and they use a very certain way of defining the graph of neural network and then use that to drive the model iteration and productionization. And PyTorch is completely different. So they could also assume that he was the user of his product. And he basically says, researchers face so much pain using existing AI frameworks, this is really hard to use and I'm going to do something different for myself. And that's the origin story of PyTorch. PyTorch actually started as the framework for researchers. They don't care about production at all. And as they grow in terms of adoption, so the interesting part of AI is research is the top of our normal production. There are so many researchers across academic, across industry, they innovate and they put their results out there in open source and that power the downstream productionization. So it's brilliant for MATA to establish PyTorch as a strategy to drive massive adoption in open source because MATA internally is a PyTorch shop. So it creates a flying wheel effect. So that's kind of a strategy behind PyTorch. But when I took on PyTorch, it's kind of at Caspo, MATA established PyTorch as the framework for both research and production. So no one has done that before. And we have to kind of rethink how to architect PyTorch so we can really sustain production workload, the stability, reliability, low latency, all this production concern was never a concern before. Now it's a concern. And we actually have to adjust its design and make it work for both sides. And that took us five years because MATA has so many AI use cases, all the way from ranking recommendation as powering the business top line or as ranking newsfeed, video ranking to site integrity detect bad content automatically using AI to all kinds of effects, translation, image classification, object detection, all this. And also across AI running on the server side, on mobile phones, on AI VR devices, the wide spectrum. So by the time we actually basically managed to support AI across ubiquitous everywhere across MATA. But interestingly, through open source engagement, we work with a lot of companies. It is clear to us like this industry is starting to take on AI first transition. And of course, MATA's hyperscale always go ahead of industry. And it feels like when we start this AI journey at MATA, there's no software, no hardware, no team. For many companies we engage with through PyTorch, we feel the pain. That's the genesis why we feel like, hey, if we create fireworks and support industry going through this transition, it will be a huge amount of impact. Of course, the problem that the industry is facing will not be the same as MATA. MATA is so big, right? So it's kind of skewed towards extreme scale and extreme optimization in the industry will be different. But we feel like we have the technical chop and we've seen a lot. We'll look to kind of drive that. So yeah, so that's how we started.
Swyx [00:06:58]: When you and I chatted about the origins of fireworks, it was originally envisioned more as a PyTorch platform, and then later became much more focused on generative AI. Is that fair to say? What was the customer discovery here?
Lin [00:07:13]: Right. So I would say our initial blueprint is we should build a PyTorch cloud because a PyTorch library and there's no SaaS platform to enable AI workloads.
Swyx [00:07:26]: Even in 2022, it's interesting.
Lin [00:07:28]: I would not say absolutely no, but cloud providers have some of those, but it's not first class citizen, right? At 2022, there's still like TensorFlow is massively in production. And this is all pre-gen AI, and PyTorch is kind of getting more and more adoption. But there's no PyTorch-first SaaS platform existing. At the same time, we are also a very pragmatic set of people. We really want to make sure from the get-go, we get really, really close to customers. We understand their use case, we understand their pain points, we understand the value we deliver to them. So we want to take a different approach instead of building a horizontal PyTorch cloud. We want to build a verticalized platform first. And then we talk with many customers. And interestingly, we started the company in September 2022, and in October, November, the OpenAI announced ChatGPT. And then boom, when we talked with many customers, they were like, can you help us work on the JNS aspect? So of course, there are some open source models. It's not as good at that time, but people are already putting a lot of attention there. Then we decided that if we're going to pick a vertical, we're going to pick JNI. The other reason is all JNI models are PyTorch models. So that's another reason. We believe that because of the nature of JNI, it's going to generate a lot of human consumable content. It will drive a lot of consumer, customer-developer-facing application and product innovation. Guaranteed. We're just at the beginning of this. Our prediction is for those kind of applications, the inference is much more important than training because inference scale is proportional to the up-limit award population. And training scale is proportional to the number of researchers. Of course, each training round could be very expensive. Although PyTorch supports both inference and training, we decided to laser focus on inference. So yeah, so that's how we got started. And we launched our public platform August last year. When we launched, it was a single product. It's a distributed inference engine with a simple API, open AI compatible API with many models. We started with LM and then we added a lot of models. Fast forward to now, we are a full platform with multiple product lines. So we love to kind of dive deep into what we offer. But that's a very fun journey in the past two years.
Alessio [00:09:49]: What was the transition from you start to focus on PyTorch and people want to understand the framework, get it live. And now say maybe most people that use you don't even really know much about PyTorch at all. You know, they're just trying to consume a model. From a product perspective, like what were some of the decisions early on? Like right in October, November, you were just like, hey, most people just care about the model, not about the framework. We're going to make it super easy or was it more a gradual transition to the model library
Swyx [00:10:16]: you have today?
Lin [00:10:17]: Yeah. So our product decision is all based on who is our ICP. And one thing I want to acknowledge here is the generic technology is disruptive. It's very different from AI before GNI. So it's a clear leap forward. Because before GNI, the companies that want to invest in AI, they have to train from scratch. There's no other way. There's no foundation model. It doesn't exist. So that means then to start a team, first hire a team who is capable of crunch data. There's a lot of data to crunch, right? Because training from scratch, you have to prepare a lot of data. And then they need to have GPUs to train, and then you start to manage GPUs. So then it becomes a very complex project. It takes a long time and not many companies can afford it, actually. And the GNI is a very different game right now, because it is a foundation model. So you don't have to train anymore. That makes AI much more accessible as a technology. As an app developer or product manager, even, not a developer, they can interact with GNI models directly. So our goal is to make AI accessible to all app developers and product engineers. That's our goal. So then getting them into the building model doesn't make any sense anymore with this new technology. And then building easy, accessible APIs is the most important. Early on, when we got started, we decided we're going to be open AI compatible. It's just kind of very easy for developers to adopt this new technology, and we will manage the underlying complexity of serving all these models.
Swyx [00:11:56]: Yeah, open AI has become the standard. Even as we're recording today, Gemini announced that they have open AI compatible APIs. Interesting. So we just need to drop it all in line, and then we have everyone popping in line.
Lin [00:12:09]: That's interesting, because we are working very closely with Meta as one of the partners. Meta, of course, is kind of very generous to donate many very, very strong open source models, expecting more to come. But also they have announced LamaStack, which is basically standardized, the upper level stack built on top of Lama models. So they don't just want to give out models and you figure out what the upper stack is. They instead want to build a community around the stack and build a new standard. I think there's an interesting dynamics in play in the industry right now, when it's more standardized across open AI, because they are kind of creating the top of the funnel, or standardized across Lama, because this is the most used open source model. So I think it's a lot of fun working at this time.
Swyx [00:13:01]: I've been a little bit more doubtful on LamaStack, I think you've been more positive. Basically it's just like the meta version of whatever Hugging Face offers, you know, or TensorRT, or BLM, or whatever the open source opportunity is. But to me, it's not clear that just because Meta open sources Lama, that the rest of LamaStack will be adopted. And it's not clear why I should adopt it. So I don't know if you agree.
Lin [00:13:27]: It's very early right now. That's why I kind of work very closely with them and give them feedback. The feedback to the meta team is very important. So then they can use that to continue to improve the model and also improve the higher level I think the success of LamaStack heavily depends on the community adoption. And there's no way around it. And I know the meta team would like to kind of work with a broader set of community. But it's very early.
Swyx [00:13:52]: One thing that after your Series B, so you raced for Benchmark, and then Sequoia. I remember being close to you for at least your Series B announcements, you started betting heavily on this term of Compound AI. It's not a term that we've covered very much in the podcast, but I think it's definitely getting a lot of adoption from Databricks and Berkeley people and all that. What's your take on Compound AI? Why is it resonating with people?
Lin [00:14:16]: Right. So let me give a little bit of context why we even consider that space.
Swyx [00:14:22]: Because like pre-Series B, there was no message, and now it's like on your landing page.
Lin [00:14:27]: So it's kind of very organic evolution from when we first launched our public platform, we are a single product. We are a distributed inference engine, where we do a lot of innovation, customized KUDA kernels, raw kernel kernels, running on different kinds of hardware, and build distributed disaggregated execution, inference execution, build all kinds of caching. So that is one. So that's kind of one product line, is the fast, most cost-efficient inference platform. Because we wrote PyTorch code, we know we basically have a special PyTorch build for that, together with a custom kernel we wrote. And then we worked with many more customers, we realized, oh, the distributed inference engine, our design is one size fits all. We want to have this inference endpoint, then everyone come in, and no matter what kind of form and shape or workload they have, it will just work for them. So that's great. But the reality is, we realized all customers have different kinds of use cases. The use cases come in all different forms and shapes. And the end result is the data distribution in their inference workload doesn't align with the data distribution in the training data for the model. It's a given, actually. If you think about it, because researchers have to guesstimate what is important, what's not important in preparing data for training. So because of that misalignment, then we leave a lot of quality, latency, cost improvement on the table. So then we're saying, OK, we want to heavily invest in a customization engine. And we actually announced it called FHIR Optimizer. So FHIR Optimizer basically helps users navigate a three-dimensional optimization space across quality, latency, and cost. So it's a three-dimensional curve. And even for one company, for different use cases, they want to land in different spots. So we automate that process for our customers. It's very simple. You have your inference workload. You inject into the optimizer along with the objective function. And then we spit out inference deployment config and the model setup. So it's your customized setup. So that is a completely different product. So that product thinking is one size fits all. And now on top of that, we provide a huge variety of state-of-the-art models, hundreds of them, varying from text to large state-of-the-art English models. That's where we started. And as we talk with many customers, we realize, oh, audio and text are very, very close. Many of our customers start to build assistants, all kinds of assistants using text. And they immediately want to add audio, audio in, audio out. So we support transcription, translation, speech synthesis, text, audio alignment, all different kinds of audio features. It's a big announcement. You should have heard by the time this is out. And the other areas of vision and text are very close with each other. Because a lot of information doesn't live in plain text. A lot of information lives in multimedia format, images, PDFs, screenshots, and many other different formats. So oftentimes to solve a problem, we need to put the vision model first to extract information and then use language model to process and then send out results. So vision is important. We also support vision model, various different kinds of vision models specialized in processing different kinds of source and extraction. And we're also going to have another announcement of a new API endpoint we'll support for people to upload various different kinds of multimedia content and then get the extract very accurate information out and feed that into LM. And of course, we support embedding because embedding is very important for semantic search, for RAG, and all this. And in addition to that, we also support text-to-image, image generation models, text-to-image, image-to-image, and we're adding text-to-video as well in our portfolio. So it's a very comprehensive set of model catalog that built on top of File Optimizer and Distributed Inference Engine. But then we talk with more customers, they solve business use case, and then we realize one model is not sufficient to solve their problem. And it's very clear because one is the model hallucinates. Many customers, when they onboard this JNI journey, they thought this is magical. JNI is going to solve all my problems magically. But then they realize, oh, this model hallucinates. It hallucinates because it's not deterministic, it's probabilistic. So it's designed to always give you an answer, but based on probabilities, so it hallucinates. And that's actually sometimes a feature for creative writing, for example. Sometimes it's a bug because, hey, you don't want to give misinformation. And different models also have different specialties. To solve a problem, you want to ask different special models to kind of decompose your task into multiple small tasks, narrow tasks, and then have an expert model solve that task really well. And of course, the model doesn't have all the information. It has limited knowledge because the training data is finite, not infinite. So the model oftentimes doesn't have real-time information. It doesn't know any proprietary information within the enterprise. It's clear that in order to really build a compiling application on top of JNI, we need a compound AI system. Compound AI system basically is going to have multiple models across modalities, along with APIs, whether it's public APIs, internal proprietary APIs, storage systems, database systems, knowledge to work together to deliver the best answer.
Swyx [00:20:07]: Are you going to offer a vector database?
Lin [00:20:09]: We actually heavily partner with several big vector database providers. Which is your favorite? They are all great in different ways. But it's public information, like MongoDB is our investor. And we have been working closely with them for a while.
Alessio [00:20:26]: When you say distributed inference engine, what do you mean exactly? Because when I hear your explanation, it's almost like you're centralizing a lot of the decisions through the Fireworks platform on the quality and whatnot. What do you mean distributed? It's like you have GPUs in a lot of different clusters, so you're sharding the inference across the same model.
Lin [00:20:45]: So first of all, we run across multiple GPUs. But the way we distribute across multiple GPUs is unique. We don't distribute the whole model monolithically across multiple GPUs. We chop them into pieces and scale them completely differently based on what's the bottleneck. We also are distributed across regions. We have been running in North America, EMEA, and Asia. We have regional affinity to applications because latency is extremely important. We are also doing global load balancing because a lot of applications there, they quickly scale to global population. And then at that scale, different content wakes up at a different time. And you want to kind of load balancing across. So all the way, and we also have, we manage various different kinds of hardware skew from different hardware vendors. And different hardware design is best for different types of workload, whether it's long context, short context, long generation. So all these different types of workload is best fitted for different kinds of hardware skew. And then we can even distribute across different hardware for a workload. So the distribution actually is all around in the full stack.
Swyx [00:22:02]: At some point, we'll show on the YouTube, the image that Ray, I think, has been working on with all the different modalities that you offer. To me, it's basically you offer the open source version of everything that OpenAI typically offers. I don't think there is. Actually, if you do text to video, you will be a superset of what OpenAI offers because they don't have Sora. Is that Mochi, by the way? Mochi. Mochi, right?
Lin [00:22:27]: Mochi. And there are a few others. I will say, the interesting thing is, I think we're betting on the open source community is going to proliferate. This is literally what we're seeing. And there's amazing video generation companies. There is amazing audio companies. Like cross-border, the innovation is off the chart, and we are building on top of that. I think that's the advantage we have compared with a closed source company.
Swyx [00:22:58]: I think I want to restate the value proposition of Fireworks for people who are comparing you versus a raw GPU provider like a RunPod or Lambda or anything like those, which is like you create the developer experience layer and you also make it easily scalable or serverless or as an endpoint. And then, I think for some models, you have custom kernels, but not all models.
Lin [00:23:25]: Almost for all models. For all large language models, all your models, and the VRMs. Almost for all models we serve.
Swyx [00:23:35]: And so that is called Fire Attention. I don't remember the speed numbers, but apparently much better than VLM, especially on a concurrency basis.
Lin [00:23:44]: So Fire Attention is specific mostly for language models, but for other modalities, we'll also have a customized kernel.
Swyx [00:23:51]: And I think the typical challenge for people is understanding that has value, and then there are other people who are also offering open-source models. Your mode is your ability to offer a good experience for all these customers. But if your existence is entirely reliant on people releasing nice open-source models, other people can also do the same thing.
Lin [00:24:14]: So I would say we build on top of open-source model foundation. So that's the kind of foundation we build on top of. But we look at the value prop from the lens of application developers and product engineers. So they want to create new UX. So what's happening in the industry right now is people are thinking about a completely new way of designing products. And I'm talking to so many founders, it's just mind-blowing. They help me understand existing way of doing PowerPoint, existing way of coding, existing way of managing customer service. It's actually putting a box in our head. For example, PowerPoint. So PowerPoint generation is we always need to think about how to fit into my storytelling into this format of slide one after another. And I'm going to juggle through design together with what story to tell. But the most important thing is what's our storytelling lines, right? And why don't we create a space that is not limited to any format? And those kind of new product UX design combined with automated content generation through Gen AI is the new thing that many founders are doing. What are the challenges they're facing? Let's go from there. One is, again, because a lot of products built on top of Gen AI, they are consumer-personal developer facing, and they require interactive experience. It's just a kind of product experience we all get used to. And our desire is to actually get faster and faster interaction. Otherwise, nobody wants to spend time, right? And then that requires low latency. And the other thing is the nature of consumer-personal developer facing is your audience is very big. You want to scale up to product market fit quickly. But if you lose money at a small scale, you're going to bankrupt quickly. So it's actually a big contrast. I actually have product market fit, but when I scale, I scale out of my business. So that's kind of a very funny way to think about it. So then having low latency and low cost is essential for those new applications and products to survive and really become a generation company. So that's the design point for our distributed inference engine and the file optimizer. File optimizer, you can think about that as a feedback loop. The more you feed your inference workload to our inference engine, the more we help you improve quality, lower latency further, lower your cost. It basically becomes better. And we automate that because we don't want you as an app developer or product engineer to think about how to figure out all these low-level details. It's impossible because you're not trained to do that at all. You should kind of keep your focus on the product innovation. And then the compound AI, we actually feel a lot of pain as the app developers, engineers, there are so many models. Every week, there's at least a new model coming out.
Swyx [00:27:09]: Tencent had a giant model this week. Yeah, yeah.
Lin [00:27:13]: I saw that. I saw that.
Swyx [00:27:15]: It's like $500 billion.
Lin [00:27:18]: So they're like, should I keep chasing this or should I forget about it? And which model should I pick to solve what kind of sub-problem? How do I even decompose my problem into those smaller problems and fit the model into it? I have no idea. And then there are two ways to think about this design. I think I talked about that in the past. One is imperative, as in you figure out how to do it. You give developer tools to dictate how to do it. Or you build a declarative system where a developer tells what they want to do, not how. So these are completely two different designs. So the analogy I want to draw is, in the data world, the database management system is a declarative system because people use database, use SQL. SQL is a way you say, what do you want to extract out of a database? What kind of result do you want? But you don't figure out which node is going to, how many nodes you're going to run on top of, how you redefine your disk, which index you use, which project. You don't need to worry about any of those. And database management system will figure out, generate a new best plan, and execute on that. So database is declarative. And it makes it super easy. You just learn SQL, which is learn a semantic meaning of SQL, and you can use it. Imperative side is there are a lot of ETL pipelines. And people design this DAG system with triggers, with actions, and you dictate exactly what to do. And if it fails, then how to recover. So that's an imperative system. We have seen a range of systems in the ecosystem go different ways. I think there's value of both. There's value of both. I don't think one is going to subsume the other. But we are leaning more into the philosophy of the declarative system. Because from the lens of app developer and product engineer, that would be easiest for them to integrate.
Swyx [00:29:07]: I understand that's also why PyTorch won as well, right? This is one of the reasons. Ease of use.
Lin [00:29:14]: Focus on ease of use, and then let the system take on the hard challenges and complexities. So we follow, we extend that thinking into current system design. So another announcement is we will also announce our next declarative system is going to appear as a model that has extremely high quality. And this model is inspired by Owen's announcement for OpenAI. You should see that by the time we announce this or soon.
Alessio [00:29:46]: Trained by you.
Lin [00:29:47]: Yes.
Alessio [00:29:48]: Is this the first model that you trained? It's not the first.
Lin [00:29:52]: We actually have trained a model called FireFunction. It's a function calling model. It's our first step into compound AI system. Because function calling model can dispatch a request into multiple APIs. We have pre-baked set of APIs the model learned. You can also add additional APIs through the configuration to let model dispatch accordingly. So we have a very high quality function calling model that's already released. We have actually three versions. The latest version is very high quality. But now we take a further step that you don't even need to use function calling model. You use our new model we're going to release. It will solve a lot of problems approaching very high OpenAI quality. So I'm very excited about that.
Swyx [00:30:41]: Do you have any benchmarks yet?
Lin [00:30:43]: We have a benchmark. We're going to release it hopefully next week. We just put our model to LMSYS and people are guessing. Is this the next Gemini model or a MADIS model? People are guessing. That's very interesting. We're watching the Reddit discussion right now.
Swyx [00:31:00]: I have to ask more questions about this. When OpenAI released o1, a lot of people asked about whether or not it's a single model or whether it's a chain of models. Noam and basically everyone on the Strawberry team was very insistent that what they did for reinforcement learning, chain of thought, cannot be replicated by a whole bunch of open source model calls. Do you think that that is wrong? Have you done the same amount of work on RL as they have or was it a different direction?
Lin [00:31:29]: I think they take a very specific approach where the caliber of team is very high. So I do think they are the domain expert in doing the things they are doing. I don't think there's only one way to achieve the same goal. We're on the same direction in the sense that the quality scaling law is shifting from training to inference. For that, I fully agree with them. But we're taking a completely different approach to the problem. All of that is because, of course, we didn't train the model from scratch. All of that is because we built on the show of giants. The current model available we have access to is getting better and better. The future trend is the gap between the open source model and the co-source model. It's just going to shrink to the point there's not much difference. And then we're on the same level field. That's why I think our early investment in inference and all the work we do around balancing across quality, latency, and cost pay off because we have accumulated a lot of experience and that empowers us to release this new model that is approaching open-ended quality.
Alessio [00:32:39]: I guess the question is, what do you think the gap to catch up will be? Because I think everybody agrees with open source models eventually will catch up. And I think with 4, then with Lama 3.2, 3.1, 4.5b, we close the gap. And then 0.1 just reopened the gap so much and it's unclear. Obviously, you're saying your model will have...
Swyx [00:32:57]: We're closing that gap.
Alessio [00:32:58]: But you think in the future, it's going to be months?
Lin [00:33:02]: So here's the thing that's happened. There's public benchmark. It is what it is. But in reality, open source models in certain dimensions are already on par or beat closed source models. So for example, in the coding space, open source models are really, really good. And in function calling, file function is also really, really good. So it's all a matter of whether you build one model to solve all the problems and you want to be the best of solving all the problems, or in the open source domain, it's going to specialize. All these different model builders specialize in certain narrow area. And it's logical that they can be really, really good in that very narrow area. And that's our prediction is with specialization, there will be a lot of expert models really, really good and even better than one-size-fits-all closed source models.
Swyx [00:33:55]: I think this is the core debate that I am still not 100% either way on in terms of compound AI versus normal AI. Because you're basically fighting the bitter lesson.
Lin [00:34:09]: Look at the human society, right? We specialize. And you feel really good about someone specializing doing something really well, right? And that's how our way evolved from ancient times. We're all journalists. We do everything. Now we heavily specialize in different domains. So my prediction is in the AI model space, it will happen also. Except for the bitter lesson.
Swyx [00:34:30]: You get short-term gains by having specialists, domain specialists, and then someone just needs to train like a 10x bigger model on 10x more inference, 10x more data, 10x more model perhaps, whatever the current scaling law is. And then it supersedes all the individual models because of some generalized intelligence slash world knowledge. I think that is the core insight of the GPTs, the GPT-123 networks. Right.
Lin [00:34:56]: But the training scaling law is because you have an increasing amount of data to train from. And you can do a lot of compute. So I think on the data side, we're approaching the limit. And the only data to increase that is synthetic generated data. And then there's like what is the secret sauce there, right? Because if you have a very good large model, you can generate very good synthetic data and then continue to improve quality. So that's why I think in OpenAI, they are shifting from the training scaling law into
Swyx [00:35:25]: inference scaling law.
Lin [00:35:25]: And it's the test time and all this. So I definitely believe that's the future direction. And that's where we are really good at, doing inference.
Swyx [00:35:34]: A couple of questions on that. Are you planning to share your reasoning choices?
Lin [00:35:39]: That's a very good question. We are still debating.
Swyx [00:35:43]: Yeah.
Lin [00:35:45]: We're still debating.
Swyx [00:35:46]: I would say, for example, it's interesting that, for example, SweetBench. If you want to be considered for ranking, you have to submit your reasoning choices. And that has actually disqualified some of our past guests. Cosign was doing well on SweetBench, but they didn't want to leak those results. So that's why you don't see O1 preview on SweetBench, because they don't submit their reasoning choices. And obviously, it's IP. But also, if you're going to be more open, then that's one way to be more open. So your model is not going to be open source, right? It's going to be an endpoint that you provide. Okay, cool. And then pricing, also the same as OpenAI, just kind of based on...
Lin [00:36:25]: Yeah, this is... I don't have, actually, information. Everything is going so fast, we haven't even thought about that yet. Yeah, I should be more prepared.
Swyx [00:36:33]: I mean, this is live. You know, it's nice to just talk about it as it goes live. Any other things that you want feedback on or you're thinking through? It's kind of nice to just talk about something when it's not decided yet. About this new model. It's going to be exciting. It's going to generate a lot of buzz. Right.
Lin [00:36:51]: I'm very excited to see how people are going to use this model. So there's already a Reddit discussion about it. And people are asking very deep, mathematical questions. And since the model got it right, surprising. And internally, we're also asking the model to generate what is AGI. And it generates a very complicated DAG thinking process. So we're having a lot of fun testing this internally. But I'm more curious, how will people use it? What kind of application they're going to try and test on it? And that's where we really like to hear feedback from the community. And also feedback to us. What works out well? What doesn't work out well? What works out well, but surprising them? And what kind of thing they think we should improve on? And those kind of feedback will be tremendously helpful.
Swyx [00:37:44]: Yeah. So I've been a production user of Preview and Mini since launch. I would say they're very, very obvious jobs in quality. So much so that they made clods on it. And they made the previous state-of-the-art look bad. It's really that stark, that difference. The number one thing, just feedback or feature requests, is people want control on the budget. Because right now, in 0.1, it kind of decides its own thinking budget. But sometimes you know how hard the problem is. And you want to actually tell the model, spend two minutes on this. Or spend some dollar amount. Maybe it's time you miss dollars. I don't know what the budget is. That makes a lot of sense.
Lin [00:38:27]: So we actually thought about that requirement. And it should be, at some point, we need to support that. Not initially. But that makes a lot of sense.
Swyx [00:38:38]: Okay. So that was a fascinating overview of just the things that you're working on. First of all, I realized that... I don't know if I've ever given you this feedback. But I think you guys are one of the reasons I agreed to advise you. Because I think when you first met me, I was kind of dubious. I was like... Who are you? There's Replicate. There's Together. There's Laptop. There's a whole bunch of other players. You're in very, very competitive fields. Like, why will you win? And the reason I actually changed my mind was I saw you guys shipping. I think your surface area is very big. The team is not that big. No. We're only 40 people. Yeah. And now here you are trying to compete with OpenAI and everyone else. What is the secret?
Lin [00:39:21]: I think the team. The team is the secret.
Swyx [00:39:23]: Oh boy. So there's no thing I can just copy. You just... No.
Lin [00:39:30]: I think we all come from a very aligned culture. Because most of our team came from meta.
Swyx [00:39:38]: Yeah.
Lin [00:39:38]: And many startups. So we really believe in results. One is result. And second is customer. We're very customer obsessed. And we don't want to drive adoption for the sake of adoption. We really want to make sure we understand we are delivering a lot of business values to the customer. And we really value their feedback. So we would wake up midnight and deploy some model for them. Shuffle some capacity for them. And yeah, over the weekend, no brainer.
Swyx [00:40:15]: So yeah.
Lin [00:40:15]: So that's just how we work as a team. And the caliber of the team is really, really high as well. So as plug-in, we're hiring. We're expanding very, very fast. So if we are passionate about working on the most cutting-edge technology in the general space, come talk with us. Yeah.
Swyx [00:40:38]: Let's talk a little bit about that customer journey. I think one of your more famous customers is Cursor. We were the first podcast to have Cursor on. And then obviously since then, they have blown up. Cause and effect are not related. But you guys especially worked on a fast supply model where you were one of the first people to work on speculative decoding in a production setting. Maybe just talk about what was the behind the scenes of working with Cursor?
Lin [00:41:03]: I will say Cursor is a very, very unique team. I think the unique part is the team has very high technical caliber. There's no question about it. But they have decided, although many companies building coding co-pilot, they will say, I'm going to build a whole entire stack because I can. And they are unique in the sense they seek partnership. Not because they cannot. They're fully capable, but they know where to focus. That to me is amazing. And of course, they want to find a bypass partner. So we spent some time working together. They are pushing us very aggressively because for them to deliver high caliber product experience, they need the latency. They need the interactive, but also high quality at the same time. So actually, we expanded our product feature quite a lot as we support Cursor. And they are growing so fast. And we massively scaled quickly across multiple regions. And we developed a pretty high intense inference stack, almost like similar to what we do for Meta. I think that's a very, very interesting engagement. And through that, there's a lot of trust being built. They realize, hey, this is a team they can really partner with. And they can go big with. That comes back to, hey, we're really customer obsessed. And all the engineers working with them, there's just enormous amount of time syncing together with them and discussing. And we're not big on meetings, but we are like stack channel always on. Yeah, so you almost feel like working as one team. So I think that's really highlighted.
Swyx [00:42:38]: Yeah. For those who don't know, so basically Cursor is a VS Code fork. But most of the time, people will be using closed models. Like I actually use a lot of SONET. So you're not involved there, right? It's not like you host SONET or you have any partnership with it. You're involved where Cursor is small, or like their house brand models are concerned, right?
Lin [00:42:58]: I don't know what I can say, but the things they haven't said.
Swyx [00:43:04]: Very obviously, the drop down is 4.0, but in Cursor, right? So I assume that the Cursor side is the Fireworks side. And then the other side, they're calling out the other. Just kind of curious. And then, do you see any more opportunity on the... You know, I think you made a big splash with 1,000 tokens per second. That was because of speculative decoding. Is there more to push there?
Lin [00:43:25]: We push a lot. Actually, when I mentioned Fire Optimizer, right? So as in, we have a unique automation stack that is one size fits one. We actually deployed to Cursor earlier on. Basically optimized for their specific workload. And that's a lot of juice to extract out of there. And we see success in that product. It actually can be widely adopted. So that's why we started a separate product line called Fire Optimizer. So speculative decoding is just one approach. And speculative decoding here is not static. We actually wrote a blog post about it. There's so many different ways to do speculative decoding. You can pair a small model with a large model in the same model family. Or you can have equal pads and so on. There are different trade-offs which approach you take. It really depends on your workload. And then with your workload, we can align the Eagle heads or Medusa heads or a small big model pair much better to extract the best latency reduction. So all of that is part of the Fire Optimizer offering.
Alessio [00:44:23]: I know you mentioned some of the other inference providers. I think the other question that people always have is around benchmarks. So you get different performance on different platforms. How should people think about... People are like, hey, Lama 3.2 is X on MMLU. But maybe using speculative decoding, you go down a different path. Maybe some providers run a quantized model. How should people think about how much they should care about how you're actually running the model? What's the delta between all the magic that you do and what a raw model...
Lin [00:44:57]: Okay, so there are two big development cycles. One is experimentation, where they need fast iteration. They don't want to think about quality, and they just want to experiment with product experience and so on. So that's one. And then it looks good, and they want to post-product market with scaling. And the quality is really important. And latency and all the other things are becoming important. During the experimentation phase, it's just pick a good model. Don't worry about anything else. Make sure you even generate the right solution to your product. And that's the focus. And then post-product market fit, then that's kind of the three-dimensional optimization curve start to kick in across quality, latency, cost, where you should land. And to me, it's purely a product decision. To many products, if you choose a lower quality, but better speed and lower cost, but it doesn't make a difference to the product experience, then you should do it. So that's why I think inference is part of the validation. The validation doesn't stop at offline eval. The validation will go through A-B testing, through inference. And that's where we offer various different configurations for you to test which is the best setting. So this is the traditional product evaluation. So product evaluation should also include your new model versions and different model setup into the consideration.
Swyx [00:46:22]: I want to specifically talk about what happens a few months ago with some of your major competitors. I mean, all of this is public. What is your take on what happens? And maybe you want to set the record straight on how Fireworks does quantization because I think a lot of people may have outdated perceptions or they didn't read the clarification post on your approach to quantization.
Lin [00:46:44]: First of all, it's always a surprise to us that without any notice, we got called out.
Swyx [00:46:51]: Specifically by name, which is normally not what...
Lin [00:46:54]: Yeah, in a public post. And have certain interpretation of our quality. So I was really surprised. And it's not a good way to compete, right? We want to compete fairly. And oftentimes when one vendor gives out results, the interpretation of another vendor is always extremely biased. So we actually refrain ourselves to do any of those. And we happily partner with third parties to do the most fair evaluation. So we're very surprised. And we don't think that's a good way to figure out the competition landscape. So then we react. I think when it comes to quantization, the interpretation, we wrote actually a very thorough blog post. Because again, no one says it's all. We have various different quantization schemes. We can quantize very different parts of the model from ways to activation to cross-TPU communication. They can use different quantization schemes or consistent across the board. And again, it's a trade-off. It's a trade-off across this three-dimensional quality, latency, and cost. And for our customer, we actually let them find the best optimized point. And we have a very thorough evaluation process to pick that point. But for self-serve, there's only one point to pick. There's no customization available. So of course, it depends on what we talk with many customers. We have to pick one point. And I think the end result, like AA published, later on AA published a quality measure. And we actually looked really good. So that's why what I mean is, I will leave the evaluation of quality or performance to third party and work with them to find the most fair benchmark. And I think that's a good approach, a methodology. But I'm not a part of an approach of calling out specific names
Swyx [00:48:55]: and critique other competitors in a very biased way. Databases happens as well. I think you're the more politically correct one. And then Dima is the more... Something like this. It's you on Twitter.
Lin [00:49:11]: It's like the Russian... We partner. We play different roles.
Swyx [00:49:20]: Another one that I wanted to... I'm just the last one on the competition side. There's a perception of price wars in hosting open source models. And we talked about the competitiveness in the market. Do you aim to make margin on open source models? Oh, absolutely, yes.
Lin [00:49:38]: So, but I think it really... When we think about pricing, it's really need to coordinate with the value we're delivering. If the value is limited, or there are a lot of people delivering the same value, there's no differentiation. There's only one way to go. It's going down. So through competition. If I take a big step back, there is pricing from... We're more compared with close model providers, APIs, right? The close model provider, their cost structure is even more interesting because we don't bear any training costs. And we focus on inference optimization, and that's kind of where we continue to add a lot of product value. So that's how we think about product. But for the close source API provider, model provider, they bear a lot of training costs. And they need to amortize the training costs into the inference. So that created very interesting dynamics of, yeah, if we match pricing there, and I think how they are going to make money is very, very interesting.
Swyx [00:50:37]: So for listeners, opening eyes 2024, $4 billion in revenue, $3 billion in compute training, $2 billion in compute inference, $1 billion in research compute amortization, and $700 million in salaries. So that is like...
Swyx [00:50:59]: I mean, a lot of R&D.
Lin [00:51:01]: Yeah, so I think matter is basically like, make it zero. So that's a very, very interesting dynamics we're operating within. But coming back to inference, so we are, again, as I mentioned, our product is, we are a platform. We're not just a single model as a service provider as many other inference providers, like they're providing a single model. We have our optimizer to highly customize towards your inference workload. We have a compound AI system where significantly simplify your interaction to high quality and low latency, low cost. So those are all very different from other providers.
Alessio [00:51:38]: What do people not know about the work that you do? I guess like people are like, okay, Fireworks, you run model very quickly. You have the function model. Is there any kind of like underrated part of Fireworks that more people should try?
Lin [00:51:51]: Yeah, actually, one user post on x.com, he mentioned, oh, actually, Fireworks can allow me to upload the LoRa adapter to the service model at the same cost and use it at same cost. Nobody has provided that. That's because we have a very special, like we rolled out multi-LoRa last year, actually. And we actually have this function for a long time. And many people has been using it, but it's not well known that, oh, if you find your model, you don't need to use on demand. If you find your model is LoRa, you can upload your LoRa adapter and we deploy it as if it's a new model. And then you use, you get your endpoint and you can use that directly, but at the same cost as the base model. So I'm happy that user is marketing it for us. He discovered that feature, but we have that for last year. So I think to feedback to me is, we have a lot of very, very good features, as Sean just mentioned. I'm the advisor to the company,
Swyx [00:52:57]: and I didn't know that you had speculative decoding released.
Lin [00:53:02]: We have prompt catching way back last year also. We have many, yeah. So I think that is one of the underrated feature. And if they're developers, you are using our self-serve platform, please try it out.
Swyx [00:53:16]: The LoRa thing is interesting because I think you also, the reason people add additional costs to it, it's not because they feel like charging people. Normally in normal LoRa serving setups, there is a cost to dedicating, loading those weights and dedicating a machine to that inference. How come you can't avoid it?
Lin [00:53:36]: Yeah, so this is kind of our technique called multi-LoRa. So we basically have many LoRa adapters share the same base model. And basically we significantly reduce the memory footprint of serving. And the one base model can sustain a hundred to a thousand LoRa adapters. And then basically all these different LoRa adapters can share the same, like direct the same traffic to the same base model where base model is dominating the cost. So that's how we advertise that way. And that's how we can manage the tokens per dollar, million token pricing, the same as base model.
Swyx [00:54:13]: Awesome. Is there anything that you think you want to request from the community or you're looking for model-wise or tooling-wise that you think like someone should be working on in this?
Lin [00:54:23]: Yeah, so we really want to get a lot of feedback from the application developers who are starting to build on JNN or on the already adopted or starting about thinking about new use cases and so on to try out Fireworks first. And let us know what works out really well for you and what is your wishlist and what sucks, right? So what is not working out for you and we would like to continue to improve. And for our new product launches, typically we want to launch to a small group of people. Usually we launch on our Discord first to have a set of people use that first. So please join our Discord channel. We have a lot of communication going on there. Again, you can also give us feedback. We'll have a starting office hour for you to directly talk with our DevRel and engineers to exchange more long notes.
Alessio [00:55:17]: And you're hiring across the board?
Lin [00:55:18]: We're hiring across the board. We're hiring front-end engineers, infrastructure cloud, infrastructure engineers, back-end system optimization engineers, applied researchers, like researchers who have done post-training, who have done a lot of fine-tuning and so on.
Swyx [00:55:34]: That's it. Thank you. Thanks for having us.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
Agents @ Work: Lindy.ai
15 nov· Latent Space: The AI Engineer Podcast
Alessio will be at AWS re:Invent next week and hosting a casual coffee meetup on Wednesday, RSVP here! And subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups!
We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!
If you've been following the AI agents space, you have heard of Lindy AI; while founder Flo Crivello is hesitant to call it "blowing up," when folks like Andrew Wilkinson start obsessing over your product, you're definitely onto something.
In our latest episode, Flo walked us through Lindy's evolution from late 2022 to now, revealing some design choices about agent platform design that go against conventional wisdom in the space.
The Great Reset: From Text Fields to Rails
Remember late 2022? Everyone was "LLM-pilled," believing that if you just gave a language model enough context and tools, it could do anything. Lindy 1.0 followed this pattern:
* Big prompt field ✅
* Bunch of tools ✅
* Prayer to the LLM gods ✅
Fast forward to today, and Lindy 2.0 looks radically different. As Flo put it (~17:00 in the episode): "The more you can put your agent on rails, one, the more reliable it's going to be, obviously, but two, it's also going to be easier to use for the user."
Instead of a giant, intimidating text field, users now build workflows visually:
* Trigger (e.g., "Zendesk ticket received")
* Required actions (e.g., "Check knowledge base")
* Response generation
This isn't just a UI change - it's a fundamental rethinking of how to make AI agents reliable. As Swyx noted during our discussion: "Put Shoggoth in a box and make it a very small, minimal viable box. Everything else should be traditional if-this-then-that software."
The Surprising Truth About Model Limitations
Here's something that might shock folks building in the space: with Claude 3.5 Sonnet, the model is no longer the bottleneck. Flo's exact words (~31:00): "It is actually shocking the extent to which the model is no longer the limit. It was the limit a year ago. It was too expensive. The context window was too small."
Some context: Lindy started when context windows were 4K tokens. Today, their system prompt alone is larger than that. But what's really interesting is what this means for platform builders:
* Raw capabilities aren't the constraint anymore
* Integration quality matters more than model performance
* User experience and workflow design are the new bottlenecks
The Search Engine Parallel: Why Horizontal Platforms Might Win
One of the spiciest takes from our conversation was Flo's thesis on horizontal vs. vertical agent platforms. He draws a fascinating parallel to search engines (~56:00):
"I find it surprising the extent to which a horizontal search engine has won... You go through Google to search Reddit. You go through Google to search Wikipedia... search in each vertical has more in common with search than it does with each vertical."
His argument: agent platforms might follow the same pattern because:
* Agents across verticals share more commonalities than differences
* There's value in having agents that can work together under one roof
* The R&D cost of getting agents right is better amortized across use cases
This might explain why we're seeing early vertical AI companies starting to expand horizontally. The core agent capabilities - reliability, context management, tool integration - are universal needs.
What This Means for Builders
If you're building in the AI agents space, here are the key takeaways:
* Constrain First: Rather than maximizing capabilities, focus on reliable execution within narrow bounds
* Integration Quality Matters: With model capabilities plateauing, your competitive advantage lies in how well you integrate with existing tools
* Memory Management is Key: Flo revealed they actively prune agent memories - even with larger context windows, not all memories are useful
* Design for Discovery: Lindy's visual workflow builder shows how important interface design is for adoption
The Meta Layer
There's a broader lesson here about AI product development. Just as Lindy evolved from "give the LLM everything" to "constrain intelligently," we might see similar evolution across the AI tooling space. The winners might not be those with the most powerful models, but those who best understand how to package AI capabilities in ways that solve real problems reliably.
Full Video Podcast
Flo’s talk at AI Engineer Summit
Chapters
* 00:00:00 Introductions
* 00:04:05 AI engineering and deterministic software
* 00:08:36 Lindys demo
* 00:13:21 Memory management in AI agents
* 00:18:48 Hierarchy and collaboration between Lindys
* 00:21:19 Vertical vs. horizontal AI tools
* 00:24:03 Community and user engagement strategies
* 00:26:16 Rickrolling incident with Lindy
* 00:28:12 Evals and quality control in AI systems
* 00:31:52 Model capabilities and their impact on Lindy
* 00:39:27 Competition and market positioning
* 00:42:40 Relationship between Factorio and business strategy
* 00:44:05 Remote work vs. in-person collaboration
* 00:49:03 Europe vs US Tech
* 00:58:59 Testing the Overton window and free speech
* 01:04:20 Balancing AI safety concerns with business innovation
Show Notes
* Lindy.ai
* Rick Rolling
* Flo on X
* TeamFlow
* Andrew Wilkinson
* Dust
* Poolside.ai
* SB1047
* Gathertown
* Sid Sijbrandij
* Matt Mullenweg
* Factorio
* Seeing Like a State
Transcript
Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.
Swyx [00:00:12]: Hey, and today we're joined in the studio by Florent Crivello. Welcome.
Flo [00:00:15]: Hey, yeah, thanks for having me.
Swyx [00:00:17]: Also known as Altimore. I always wanted to ask, what is Altimore?
Flo [00:00:21]: It was the name of my character when I was playing Dungeons & Dragons. Always. I was like 11 years old.
Swyx [00:00:26]: What was your classes?
Flo [00:00:27]: I was an elf. I was a magician elf.
Swyx [00:00:30]: Well, you're still spinning magic. Right now, you're a solo founder and CEO of Lindy.ai. What is Lindy?
Flo [00:00:36]: Yeah, we are a no-code platform letting you build your own AI agents easily. So you can think of we are to LangChain as Airtable is to MySQL. Like you can just pin up AI agents super easily by clicking around and no code required. You don't have to be an engineer and you can automate business workflows that you simply could not automate before in a few minutes.
Swyx [00:00:55]: You've been in our orbit a few times. I think you spoke at our Latent Space anniversary. You spoke at my summit, the first summit, which was a really good keynote. And most recently, like we actually already scheduled this podcast before this happened. But Andrew Wilkinson was like, I'm obsessed by Lindy. He's just created a whole bunch of agents. So basically, why are you blowing up?
Flo [00:01:16]: Well, thank you. I think we are having a little bit of a moment. I think it's a bit premature to say we're blowing up. But why are things going well? We revamped the product majorly. We called it Lindy 2.0. I would say we started working on that six months ago. We've actually not really announced it yet. It's just, I guess, I guess that's what we're doing now. And so we've basically been cooking for the last six months, like really rebuilding the product from scratch. I think I'll list you, actually, the last time you tried the product, it was still Lindy 1.0. Oh, yeah. If you log in now, the platform looks very different. There's like a ton more features. And I think one realization that we made, and I think a lot of folks in the agent space made the same realization, is that there is such a thing as too much of a good thing. I think many people, when they started working on agents, they were very LLM peeled and chat GPT peeled, right? They got ahead of themselves in a way, and us included, and they thought that agents were actually, and LLMs were actually more advanced than they actually were. And so the first version of Lindy was like just a giant prompt and a bunch of tools. And then the realization we had was like, hey, actually, the more you can put your agent on Rails, one, the more reliable it's going to be, obviously, but two, it's also going to be easier to use for the user, because you can really, as a user, you get, instead of just getting this big, giant, intimidating text field, and you type words in there, and you have no idea if you're typing the right word or not, here you can really click and select step by step, and tell your agent what to do, and really give as narrow or as wide a guardrail as you want for your agent. We started working on that. We called it Lindy on Rails about six months ago, and we started putting it into the hands of users over the last, I would say, two months or so, and I think things really started going pretty well at that point. The agent is way more reliable, way easier to set up, and we're already seeing a ton of new use cases pop up.
Swyx [00:03:00]: Yeah, just a quick follow-up on that. You launched the first Lindy in November last year, and you were already talking about having a DSL, right? I remember having this discussion with you, and you were like, it's just much more reliable. Is this still the DSL under the hood? Is this a UI-level change, or is it a bigger rewrite?
Flo [00:03:17]: No, it is a much bigger rewrite. I'll give you a concrete example. Suppose you want to have an agent that observes your Zendesk tickets, and it's like, hey, every time you receive a Zendesk ticket, I want you to check my knowledge base, so it's like a RAG module and whatnot, and then answer the ticket. The way it used to work with Lindy before was, you would type the prompt asking it to do that. You check my knowledge base, and so on and so forth. The problem with doing that is that it can always go wrong. You're praying the LLM gods that they will actually invoke your knowledge base, but I don't want to ask it. I want it to always, 100% of the time, consult the knowledge base after it receives a Zendesk ticket. And so with Lindy, you can actually have the trigger, which is Zendesk ticket received, have the knowledge base consult, which is always there, and then have the agent. So you can really set up your agent any way you want like that.
Swyx [00:04:05]: This is something I think about for AI engineering as well, which is the big labs want you to hand over everything in the prompts, and only code of English, and then the smaller brains, the GPU pours, always want to write more code to make things more deterministic and reliable and controllable. One way I put it is put Shoggoth in a box and make it a very small, the minimal viable box. Everything else should be traditional, if this, then that software.
Flo [00:04:29]: I love that characterization, put the Shoggoth in the box. Yeah, we talk about using as much AI as necessary and as little as possible.
Alessio [00:04:37]: And what was the choosing between kind of like this drag and drop, low code, whatever, super code-driven, maybe like the Lang chains, auto-GPT of the world, and maybe the flip side of it, which you don't really do, it's like just text to agent, it's like build the workflow for me. Like what have you learned actually putting this in front of users and figuring out how much do they actually want to add it versus like how much, you know, kind of like Ruby on Rails instead of Lindy on Rails, it's kind of like, you know, defaults over configuration.
Flo [00:05:06]: I actually used to dislike when people said, oh, text is not a great interface. I was like, ah, this is such a mid-take, I think text is awesome. And I've actually come around, I actually sort of agree now that text is really not great. I think for people like you and me, because we sort of have a mental model, okay, when I type a prompt into this text box, this is what it's going to do, it's going to map it to this kind of data structure under the hood and so forth. I guess it's a little bit blackmailing towards humans. You jump on these calls with humans and you're like, here's a text box, this is going to set up an agent for you, do it. And then they type words like, I want you to help me put order in my inbox. Oh, actually, this is a good one. This is actually a good one. What's a bad one? I would say 60 or 70% of the prompts that people type don't mean anything. Me as a human, as AGI, I don't understand what they mean. I don't know what they mean. It is actually, I think whenever you can have a GUI, it is better than to have just a pure text interface.
Alessio [00:05:58]: And then how do you decide how much to expose? So even with the tools, you have Slack, you have Google Calendar, you have Gmail. Should people by default just turn over access to everything and then you help them figure out what to use? I think that's the question. When I tried to set up Slack, it was like, hey, give me access to all channels and everything, which for the average person probably makes sense because you don't want to re-prompt them every time you add new channels. But at the same time, for maybe the more sophisticated enterprise use cases, people are like, hey, I want to really limit what you have access to. How do you kind of thread that balance?
Flo [00:06:35]: The general philosophy is we ask for the least amount of permissions needed at any given moment. I don't think Slack, I could be mistaken, but I don't think Slack lets you request permissions for just one channel. But for example, for Google, obviously there are hundreds of scopes that you could require for Google. There's a lot of scopes. And sometimes it's actually painful to set up your Lindy because you're going to have to ask Google and add scopes five or six times. We've had sessions like this. But that's what we do because, for example, the Lindy email drafter, she's going to ask you for your authorization once for, I need to be able to read your email so I can draft a reply, and then another time for I need to be able to write a draft for them. We just try to do it very incrementally like that.
Alessio [00:07:15]: Do you think OAuth is just overall going to change? I think maybe before it was like, hey, we need to set up OAuth that humans only want to kind of do once. So we try to jam-pack things all at once versus what if you could on-demand get different permissions every time from different parts? Do you ever think about designing things knowing that maybe AI will use it instead of humans will use it? Yeah, for sure.
Flo [00:07:37]: One pattern we've started to see is people provisioning accounts for their AI agents. And so, in particular, Google Workspace accounts. So, for example, Lindy can be used as a scheduling assistant. So you can just CC her to your emails when you're trying to find time with someone. And just like a human assistant, she's going to go back and forth and offer other abilities and so forth. Very often, people don't want the other party to know that it's an AI. So it's actually funny. They introduce delays. They ask the agent to wait before replying, so it's not too obvious that it's an AI. And they provision an account on Google Suite, which costs them like $10 a month or something like that. So we're seeing that pattern more and more. I think that does the job for now. I'm not optimistic on us actually patching OAuth. Because I agree with you, ultimately, we would want to patch OAuth because the new account thing is kind of a clutch. It's really a hack. You would want to patch OAuth to have more granular access control and really be able to put your sugar in the box. I'm not optimistic on us doing that before AGI, I think. That's a very close timeline.
Swyx [00:08:36]: I'm mindful of talking about a thing without showing it. And we already have the setup to show it. Why don't we jump into a screen share? For listeners, you can jump on the YouTube and like and subscribe. But also, let's have a look at how you show off Lindy. Yeah, absolutely.
Flo [00:08:51]: I'll give an example of a very simple Lindy and then I'll graduate to a much more complicated one. A super simple Lindy that I have is, I unfortunately bought some investment properties in the south of France. It was a really, really bad idea. And I put them on a Holydew, which is like the French Airbnb, if you will. And so I received these emails from time to time telling me like, oh, hey, you made 200 bucks. Someone booked your place. When I receive these emails, I want to log this reservation in a spreadsheet. Doing this without an AI agent or without AI in general is a pain in the butt because you must write an HTML parser for this email. And so it's just hard. You may not be able to do it and it's going to break the moment the email changes. By contrast, the way it works with Lindy, it's really simple. It's two steps. It's like, okay, I receive an email. If it is a reservation confirmation, I have this filter here. Then I append a row to this spreadsheet. And so this is where you can see the AI part where the way this action is configured here, you see these purple fields on the right. Each of these fields is a prompt. And so I can say, okay, you extract from the email the day the reservation begins on. You extract the amount of the reservation. You extract the number of travelers of the reservation. And now you can see when I look at the task history of this Lindy, it's really simple. It's like, okay, you do this and boom, appending this row to this spreadsheet. And this is the information extracted. So effectively, this node here, this append row node is a mini agent. It can see everything that just happened. It has context over the task and it's appending the row. And then it's going to send a reply to the thread. That's a very simple example of an agent.
Swyx [00:10:34]: A quick follow-up question on this one while we're still on this page. Is that one call? Is that a structured output call? Yeah. Okay, nice. Yeah.
Flo [00:10:41]: And you can see here for every node, you can configure which model you want to power the node. Here I use cloud. For this, I use GPT-4 Turbo. Much more complex example, my meeting recorder. It looks very complex because I've added to it over time, but at a high level, it's really simple. It's like when a meeting begins, you record the meeting. And after the meeting, you send me a summary and you send me coaching notes. So I receive, like my Lindy is constantly coaching me. And so you can see here in the prompt of the coaching notes, I've told it, hey, you know, was I unnecessarily confrontational at any point? I'm French, so I have to watch out for that. Or not confrontational enough. Should I have double-clicked on any issue, right? So I can really give it exactly the kind of coaching that I'm expecting. And then the interesting thing here is, like, you can see the agent here, after it sent me these coaching notes, moves on. And it does a bunch of other stuff. So it goes on Slack. It disseminates the notes on Slack. It does a bunch of other stuff. But it's actually able to backtrack and resume the automation at the coaching notes email if I responded to that email. So I'll give a super concrete example. This is an actual coaching feedback that I received from Lindy. She was like, hey, this was a sales call I had with a customer. And she was like, I found your explanation of Lindy too technical. And I was able to follow up and just ask a follow-up question in the thread here. And I was like, why did you find too technical about my explanation? And Lindy restored the context. And so she basically picked up the automation back up here in the tree. And she has all of the context of everything that happened, including the meeting in which I was. So she was like, oh, you used the words deterministic and context window and agent state. And that concept exists at every level for every channel and every action that Lindy takes. So another example here is, I mentioned she also disseminates the notes on Slack. So this was a meeting where I was not, right? So this was a teammate. He's an indie meeting recorder, posts the meeting notes in this customer discovery channel on Slack. So you can see, okay, this is the onboarding call we had. This was the use case. Look at the questions. How do I make Lindy slower? How do I add delays to make Lindy slower? And I was able, in the Slack thread, to ask follow-up questions like, oh, what did we answer to these questions? And it's really handy because I know I can have this sort of interactive Q&A with these meetings. It means that very often now, I don't go to meetings anymore. I just send my Lindy. And instead of going to like a 60-minute meeting, I have like a five-minute chat with my Lindy afterwards. And she just replied. She was like, well, this is what we replied to this customer. And I can just be like, okay, good job, Jack. Like, no notes about your answers. So that's the kind of use cases people have with Lindy. It's a lot of like, there's a lot of sales automations, customer support automations, and a lot of this, which is basically personal assistance automations, like meeting scheduling and so forth.
Alessio [00:13:21]: Yeah, and I think the question that people might have is memory. So as you get coaching, how does it track whether or not you're improving? You know, if these are like mistakes you made in the past, like, how do you think about that?
Flo [00:13:31]: Yeah, we have a memory module. So I'll show you my meeting scheduler, Lindy, which has a lot of memories because by now I've used her for so long. And so every time I talk to her, she saves a memory. If I tell her, you screwed up, please don't do this. So you can see here, oh, it's got a double memory here. This is the meeting link I have, or this is the address of the office. If I tell someone to meet me at home, this is the address of my place. This is the code. I guess we'll have to edit that out. This is not the code of my place. No dogs. Yeah, so Lindy can just manage her own memory and decide when she's remembering things between executions. Okay.
Swyx [00:14:11]: I mean, I'm just going to take the opportunity to ask you, since you are the creator of this thing, how come there's so few memories, right? Like, if you've been using this for two years, there should be thousands of thousands of things. That is a good question.
Flo [00:14:22]: Agents still get confused if they have too many memories, to my point earlier about that. So I just am out of a call with a member of the Lama team at Meta, and we were chatting about Lindy, and we were going into the system prompt that we sent to Lindy, and all of that stuff. And he was amazed, and he was like, it's a miracle that it's working, guys. He was like, this kind of system prompt, this does not exist, either pre-training or post-training. These models were never trained to do this kind of stuff. It's a miracle that they can be agents at all. And so what I do, I actually prune the memories. You know, it's actually something I've gotten into the habit of doing from back when we had GPT 3.5, being Lindy agents. I suspect it's probably not as necessary in the Cloud 3.5 Sunette days, but I prune the memories. Yeah, okay.
Swyx [00:15:05]: The reason is because I have another assistant that also is recording and trying to come up with facts about me. It comes up with a lot of trivial, useless facts that I... So I spend most of my time pruning. Actually, it's not super useful. I'd much rather have high-quality facts that it accepts. Or maybe I was even thinking, were you ever tempted to add a wake word to only memorize this when I say memorize this? And otherwise, don't even bother.
Flo [00:15:30]: I have a Lindy that does this. So this is my inbox processor, Lindy. It's kind of beefy because there's a lot of different emails. But somewhere in here,
Swyx [00:15:38]: there is a rule where I'm like,
Flo [00:15:39]: aha, I can email my inbox processor, Lindy. It's really handy. So she has her own email address. And so when I process my email inbox, I sometimes forward an email to her. And it's a newsletter, or it's like a cold outreach from a recruiter that I don't care about, or anything like that. And I can give her a rule. And I can be like, hey, this email I want you to archive, moving forward. Or I want you to alert me on Slack when I have this kind of email. It's really important. And so you can see here, the prompt is, if I give you a rule about a kind of email, like archive emails from X, save it as a new memory. And I give it to the memory saving skill. And yeah.
Swyx [00:16:13]: One thing that just occurred to me, so I'm a big fan of virtual mailboxes. I recommend that everybody have a virtual mailbox. You could set up a physical mail receive thing for Lindy. And so then Lindy can process your physical mail.
Flo [00:16:26]: That's actually a good idea. I actually already have something like that. I use like health class mail. Yeah. So yeah, most likely, I can process my physical mail. Yeah.
Swyx [00:16:35]: And then the other product's idea I have, looking at this thing, is people want to brag about the complexity of their Lindys. So this would be like a 65 point Lindy, right?
Flo [00:16:43]: What's a 65 point?
Swyx [00:16:44]: Complexity counting. Like how many nodes, how many things, how many conditions, right? Yeah.
Flo [00:16:49]: This is not the most complex one. I have another one. This designer recruiter here is kind of beefy as well. Right, right, right. So I'm just saying,
Swyx [00:16:56]: let people brag. Let people be super users. Oh, right.
Flo [00:16:59]: Give them a score. Give them a score.
Swyx [00:17:01]: Then they'll just be like, okay, how high can you make this score?
Flo [00:17:04]: Yeah, that's a good point. And I think that's, again, the beauty of this on-rails phenomenon. It's like, think of the equivalent, the prompt equivalent of this Lindy here, for example, that we're looking at. It'd be monstrous. And the odds that it gets it right are so low. But here, because we're really holding the agent's hand step by step by step, it's actually super reliable. Yeah.
Swyx [00:17:22]: And is it all structured output-based? Yeah. As far as possible? Basically. Like, there's no non-structured output?
Flo [00:17:27]: There is. So, for example, here, this AI agent step, right, or this send message step, sometimes it gets to... That's just plain text.
Swyx [00:17:35]: That's right.
Flo [00:17:36]: Yeah. So I'll give you an example. Maybe it's TMI. I'm having blood pressure issues these days. And so this Lindy here, I give it my blood pressure readings, and it updates a log that I have of my blood pressure that it sends to my doctor.
Swyx [00:17:49]: Oh, so every Lindy comes with a to-do list?
Flo [00:17:52]: Yeah. Every Lindy has its own task history. Huh. Yeah. And so you can see here, this is my main Lindy, my personal assistant, and I've told it, where is this? There is a point where I'm like, if I am giving you a health-related fact, right here, I'm giving you health information, so then you update this log that I have in this Google Doc, and then you send me a message. And you can see, I've actually not configured this send message node. I haven't told it what to send me a message for. Right? And you can see, it's actually lecturing me. It's like, I'm giving it my blood pressure ratings. It's like, hey, it's a bit high. Here are some lifestyle changes you may want to consider.
Alessio [00:18:27]: I think maybe this is the most confusing or new thing for people. So even I use Lindy and I didn't even know you could have multiple workflows in one Lindy. I think the mental model is kind of like the Zapier workflows. It starts and it ends. It doesn't choose between. How do you think about what's a Lindy versus what's a sub-function of a Lindy? Like, what's the hierarchy?
Flo [00:18:48]: Yeah. Frankly, I think the line is a little arbitrary. It's kind of like when you code, like when do you start to create a new class versus when do you overload your current class. I think of it in terms of like jobs to be done and I think of it in terms of who is the Lindy serving. This Lindy is serving me personally. It's really my day-to-day Lindy. I give it a bunch of stuff, like very easy tasks. And so this is just the Lindy I go to. Sometimes when a task is really more specialized, so for example, I have this like summarizer Lindy or this designer recruiter Lindy. These tasks are really beefy. I wouldn't want to add this to my main Lindy, so I just created a separate Lindy for it. Or when it's a Lindy that serves another constituency, like our customer support Lindy, I don't want to add that to my personal assistant Lindy. These are two very different Lindys.
Alessio [00:19:31]: And you can call a Lindy from within another Lindy. That's right. You can kind of chain them together.
Flo [00:19:36]: Lindys can work together, absolutely.
Swyx [00:19:38]: A couple more things for the video portion. I noticed you have a podcast follower. We have to ask about that. What is that?
Flo [00:19:46]: So this one wakes me up every... So wakes herself up every week. And she sends me... So she woke up yesterday, actually. And she searches for Lenny's podcast. And she looks for like the latest episode on YouTube. And once she finds it, she transcribes the video and then she sends me the summary by email. I don't listen to podcasts as much anymore. I just like read these summaries. Yeah.
Alessio [00:20:09]: We should make a latent space Lindy. Marketplace.
Swyx [00:20:12]: Yeah. And then you have a whole bunch of connectors. I saw the list briefly. Any interesting one? Complicated one that you're proud of? Anything that you want to just share? Connector stories.
Flo [00:20:23]: So many of our workflows are about meeting scheduling. So we had to build some very open unity tools around meeting scheduling. So for example, one that is surprisingly hard is this find available times action. You would not believe... This is like a thousand lines of code or something. It's just a very beefy action. And you can pass it a bunch of parameters about how long is the meeting? When does it start? When does it end? What are the meetings? The weekdays in which I meet? How many time slots do you return? What's the buffer between my meetings? It's just a very, very, very complex action. I really like our GitHub action. So we have a Lindy PR reviewer. And it's really handy because anytime any bug happens... So the Lindy reads our guidelines on Google Docs. By now, the guidelines are like 40 pages long or something. And so every time any new kind of bug happens, we just go to the guideline and we add the lines. Like, hey, this has happened before. Please watch out for this category of bugs. And it's saving us so much time every day.
Alessio [00:21:19]: There's companies doing PR reviews. Where does a Lindy start? When does a company start? Or maybe how do you think about the complexity of these tasks when it's going to be worth having kind of like a vertical standalone company versus just like, hey, a Lindy is going to do a good job 99% of the time?
Flo [00:21:34]: That's a good question. We think about this one all the time. I can't say that we've really come up with a very crisp articulation of when do you want to use a vertical tool versus when do you want to use a horizontal tool. I think of it as very similar to the internet. I find it surprising the extent to which a horizontal search engine has won. But I think that Google, right? But I think the even more surprising fact is that the horizontal search engine has won in almost every vertical, right? You go through Google to search Reddit. You go through Google to search Wikipedia. I think maybe the biggest exception is e-commerce. Like you go to Amazon to search e-commerce, but otherwise you go through Google. And I think that the reason for that is because search in each vertical has more in common with search than it does with each vertical. And search is so expensive to get right. Like Google is a big company that it makes a lot of sense to aggregate all of these different use cases and to spread your R&D budget across all of these different use cases. I have a thesis, which is, it's a really cool thesis for Lindy, is that the same thing is true for agents. I think that by and large, in a lot of verticals, agents in each vertical have more in common with agents than they do with each vertical. I also think there are benefits in having a single agent platform because that way your agents can work together. They're all like under one roof. That way you only learn one platform and so you can create agents for everything that you want. And you don't have to like pay for like a bunch of different platforms and so forth. So I think ultimately, it is actually going to shake out in a way that is similar to search in that search is everywhere on the internet. Every website has a search box, right? So there's going to be a lot of vertical agents for everything. I think AI is going to completely penetrate every category of software. But then I also think there are going to be a few very, very, very big horizontal agents that serve a lot of functions for people.
Swyx [00:23:14]: That is actually one of the questions that we had about the agent stuff. So I guess we can transition away from the screen and I'll just ask the follow-up, which is, that is a hot topic. You're basically saying that the current VC obsession of the day, which is vertical AI enabled SaaS, is mostly not going to work out. And then there are going to be some super giant horizontal SaaS.
Flo [00:23:34]: Oh, no, I'm not saying it's either or. Like SaaS today, vertical SaaS is huge and there's also a lot of horizontal platforms. If you look at like Airtable or Notion, basically the entire no-code space is very horizontal. I mean, Loom and Zoom and Slack, there's a lot of very horizontal tools out there. Okay.
Swyx [00:23:49]: I was just trying to get a reaction out of you for hot takes. Trying to get a hot take.
Flo [00:23:54]: No, I also think it is natural for the vertical solutions to emerge first because it's just easier to build. It's just much, much, much harder to build something horizontal. Cool.
Swyx [00:24:03]: Some more Lindy-specific questions. So we covered most of the top use cases and you have an academy. That was nice to see. I also see some other people doing it for you for free. So like Ben Spites is doing it and then there's some other guy who's also doing like lessons. Yeah. Which is kind of nice, right? Yeah, absolutely. You don't have to do any of that.
Flo [00:24:20]: Oh, we've been seeing it more and more on like LinkedIn and Twitter, like people posting their Lindys and so forth.
Swyx [00:24:24]: I think that's the flywheel that you built the platform where creators see value in allying themselves to you. And so then, you know, your incentive is to make them successful so that they can make other people successful and then it just drives more and more engagement. Like it's earned media. Like you don't have to do anything.
Flo [00:24:39]: Yeah, yeah. I mean, community is everything.
Swyx [00:24:41]: Are you doing anything special there? Any big wins?
Flo [00:24:44]: We have a Slack community that's pretty active. I can't say we've invested much more than that so far.
Swyx [00:24:49]: I would say from having, so I have some involvement in the no-code community. I would say that Webflow going very hard after no-code as a category got them a lot more allies than just the people using Webflow. So it helps you to grow the community beyond just Lindy. And I don't know what this is called. Maybe it's just no-code again. Maybe you want to call it something different. But there's definitely an appetite for this and you are one of a broad category, right? Like just before you, we had Dust and, you know, they're also kind of going after a similar market. Zapier obviously is not going to try to also compete with you. Yeah. There's no question there. It's just like a reaction about community. Like I think a lot about community. Lanespace is growing the community of AI engineers. And I think you have a slightly different audience of, I don't know what.
Flo [00:25:33]: Yeah. I think the no-code tinkerers is the community. Yeah. It is going to be the same sort of community as what Webflow, Zapier, Airtable, Notion to some extent.
Swyx [00:25:43]: Yeah. The framing can be different if you were, so I think tinkerers has this connotation of not serious or like small. And if you framed it to like no-code EA, we're exclusively only for CEOs with a certain budget, then you just have, you tap into a different budget.
Flo [00:25:58]: That's true. The problem with EA is like, the CEO has no willingness to actually tinker and play with the platform.
Swyx [00:26:05]: Maybe Andrew's doing that. Like a lot of your biggest advocates are CEOs, right?
Flo [00:26:09]: A solopreneur, you know, small business owners, I think Andrew is an exception. Yeah. Yeah, yeah, he is.
Swyx [00:26:14]: He's an exception in many ways. Yep.
Alessio [00:26:16]: Just before we wrap on the use cases, is Rick rolling your customers? Like a officially supported use case or maybe tell that story?
Flo [00:26:24]: It's one of the main jobs to be done, really. Yeah, we woke up recently, so we have a Lindy obviously doing our customer support and we do check after the Lindy. And so we caught this email exchange where someone was asking Lindy for video tutorials. And at the time, actually, we did not have video tutorials. We do now on the Lindy Academy. And Lindy responded to the email. It's like, oh, absolutely, here's a link. And we were like, what? Like, what kind of link did you send? And so we clicked on the link and it was a recall. We actually reacted fast enough that the customer had not yet opened the email. And so we reacted immediately. Like, oh, hey, actually, sorry, this is the right link. And so the customer never reacted to the first link. And so, yeah, I tweeted about that. It went surprisingly viral. And I checked afterwards in the logs. We did like a database query and we found, I think, like three or four other instances of it having happened before.
Swyx [00:27:12]: That's surprisingly low.
Flo [00:27:13]: It is low. And we fixed it across the board by just adding a line to the system prompt that's like, hey, don't recall people, please don't recall.
Swyx [00:27:21]: Yeah, yeah, yeah. I mean, so, you know, you can explain it retroactively, right? Like, that YouTube slug has been pasted in so many different corpuses that obviously it learned to hallucinate that.
Alessio [00:27:31]: And it pretended to be so many things. That's the thing.
Swyx [00:27:34]: I wouldn't be surprised if that takes one token. Like, there's this one slug in the tokenizer and it's just one token.
Flo [00:27:41]: That's the idea of a YouTube video.
Swyx [00:27:43]: Because it's used so much, right? And you have to basically get it exactly correct. It's probably not. That's a long speech.
Flo [00:27:52]: It would have been so good.
Alessio [00:27:55]: So this is just a jump maybe into evals from here. How could you possibly come up for an eval that says, make sure my AI does not recall my customer? I feel like when people are writing evals, that's not something that they come up with. So how do you think about evals when it's such like an open-ended problem space?
Flo [00:28:12]: Yeah, it is tough. We built quite a bit of infrastructure for us to create evals in one click from any conversation history. So we can point to a conversation and we can be like, in one click we can turn it into effectively a unit test. It's like, this is a good conversation. This is how you're supposed to handle things like this. Or if it's a negative example, then we modify a little bit the conversation after generating the eval. So it's very easy for us to spin up this kind of eval.
Alessio [00:28:36]: Do you use an off-the-shelf tool which is like Brain Trust on the podcast? Or did you just build your own?
Flo [00:28:41]: We unfortunately built our own. We're most likely going to switch to Brain Trust. Well, when we built it, there was nothing. Like there was no eval tool, frankly. I mean, we started this project at the end of 2022. It was like, it was very, very, very early. I wouldn't recommend it to build your own eval tool. There's better solutions out there and our eval tool breaks all the time and it's a nightmare to maintain. And that's not something we want to be spending our time on.
Swyx [00:29:04]: I was going to ask that basically because I think my first conversations with you about Lindy was that you had a strong opinion that everyone should build their own tools. And you were very proud of your evals. You're kind of showing off to me like how many evals you were running, right?
Flo [00:29:16]: Yeah, I think that was before all of these tools came around. I think the ecosystem has matured a fair bit.
Swyx [00:29:21]: What is one thing that Brain Trust has nailed that you always struggled to do?
Flo [00:29:25]: We're not using them yet, so I couldn't tell. But from what I've gathered from the conversations I've had, like they're doing what we do with our eval tool, but better.
Swyx [00:29:33]: And like they do it, but also like 60 other companies do it, right? So I don't know how to shop apart from brand. Word of mouth.
Flo [00:29:41]: Same here.
Swyx [00:29:42]: Yeah, like evals or Lindys, there's two kinds of evals, right? Like in some way, you don't have to eval your system as much because you've constrained the language model so much. And you can rely on open AI to guarantee that the structured outputs are going to be good, right? We had Michelle sit where you sit and she explained exactly how they do constraint grammar sampling and all that good stuff. So actually, I think it's more important for your customers to eval their Lindys than you evaling your Lindy platform because you just built the platform. You don't actually need to eval that much.
Flo [00:30:14]: Yeah. In an ideal world, our customers don't need to care about this. And I think the bar is not like, look, it needs to be at 100%. I think the bar is it needs to be better than a human. And for most use cases we serve today, it is better than a human, especially if you put it on Rails.
Swyx [00:30:30]: Is there a limiting factor of Lindy at the business? Like, is it adding new connectors? Is it adding new node types? Like how do you prioritize what is the most impactful to your company?
Flo [00:30:41]: Yeah. The raw capabilities for sure are a big limit. It is actually shocking the extent to which the model is no longer the limit. It was the limit a year ago. It was too expensive. The context window was too small. It's kind of insane that we started building this when the context windows were like 4,000 tokens. Like today, our system prompt is more than 4,000 tokens. So yeah, the model is actually very much not a limit anymore. It almost gives me pause because I'm like, I want the model to be a limit. And so no, the integrations are ones, the core capabilities are ones. So for example, we are investing in a system that's basically, I call it like the, it's a J hack. Give me these names, like the poor man's RLHF. So you can turn on a toggle on any step of your Lindy workflow to be like, ask me for confirmation before you actually execute this step. So it's like, hey, I receive an email, you send a reply, ask me for confirmation before actually sending it. And so today you see the email that's about to get sent and you can either approve, deny, or change it and then approve. And we are making it so that when you make a change, we are then saving this change that you're making or embedding it in the vector database. And then we are retrieving these examples for future tasks and injecting them into the context window. So that's the kind of capability that makes a huge difference for users. That's the bottleneck today. It's really like good old engineering and product work.
Swyx [00:31:52]: I assume you're hiring. We'll do a call for hiring at the end.
Alessio [00:31:54]: Any other comments on the model side? When did you start feeling like the model was not a bottleneck anymore? Was it 4.0? Was it 3.5? 3.5.
Flo [00:32:04]: 3.5 Sonnet, definitely. I think 4.0 is overhyped, frankly. We don't use 4.0. I don't think it's good for agentic behavior. Yeah, 3.5 Sonnet is when I started feeling that. And then with prompt caching with 3.5 Sonnet, like that fills the cost, cut the cost again. Just cut it in half. Yeah.
Swyx [00:32:21]: Your prompts are... Some of the problems with agentic uses is that your prompts are kind of dynamic, right? Like from caching to work, you need the front prefix portion to be stable.
Flo [00:32:32]: Yes, but we have this append-only ledger paradigm. So every node keeps appending to that ledger and every filled node inherits all the context built up by all the previous nodes. And so we can just decide, like, hey, every X thousand nodes, we trigger prompt caching again.
Swyx [00:32:47]: Oh, so you do it like programmatically, not all the time.
Flo [00:32:50]: No, sorry. Anthropic manages that for us. But basically, it's like, because we keep appending to the prompt, the prompt caching works pretty well.
Alessio [00:32:57]: We have this small podcaster tool that I built for the podcast and I rewrote all of our prompts because I noticed, you know, I was inputting stuff early on. I wonder how much more money OpenAN and Anthropic are making just because people don't rewrite their prompts to be like static at the top and like dynamic at the bottom.
Flo [00:33:13]: I think that's the remarkable thing about what we're having right now. It's insane that these companies are routinely cutting their costs by two, four, five. Like, they basically just apply constraints. They want people to take advantage of these innovations. Very good.
Swyx [00:33:25]: Do you have any other competitive commentary? Commentary? Dust, WordWare, Gumloop, Zapier? If not, we can move on.
Flo [00:33:31]: No comment.
Alessio [00:33:32]: I think the market is,
Flo [00:33:33]: look, I mean, AGI is coming. All right, that's what I'm talking about.
Swyx [00:33:38]: I think you're helping. Like, you're paving the road to AGI.
Flo [00:33:41]: I'm playing my small role. I'm adding my small brick to this giant, giant, giant castle. Yeah, look, when it's here, we are going to, this entire category of software is going to create, it's going to sound like an exaggeration, but it is a fact it is going to create trillions of dollars of value in a few years, right? It's going to, for the first time, we're actually having software directly replace human labor. I see it every day in sales calls. It's like, Lindy is today replacing, like, we talk to even small teams. It's like, oh, like, stop, this is a 12-people team here. I guess we'll set up this Lindy for one or two days, and then we'll have to decide what to do with this 12-people team. And so, yeah. To me, there's this immense uncapped market opportunity. It's just such a huge ocean, and there's like three sharks in the ocean. I'm focused on the ocean more than on the sharks.
Swyx [00:34:25]: So we're moving on to hot topics, like, kind of broadening out from Lindy, but obviously informed by Lindy. What are the high-order bits of good agent design?
Flo [00:34:31]: The model, the model, the model, the model. I think people fail to truly, and me included, they fail to truly internalize the bitter lesson. So for the listeners out there who don't know about it, it's basically like, you just scale the model. Like, GPUs go brr, it's all that matters. I think it also holds for the cognitive architecture. I used to be very cognitive architecture-filled, and I was like, ah, and I was like a critic, and I was like a generator, and all this, and then it's just like, GPUs go brr, like, just like let the model do its job. I think we're seeing it a little bit right now with O1. I'm seeing some tweets that say that the new 3.5 SONNET is as good as O1, but with none of all the crazy...
Swyx [00:35:09]: It beats O1 on some measures. On some reasoning tasks. On AIME, it's still a lot lower. Like, it's like 14 on AIME versus O1, it's like 83.
Flo [00:35:17]: Got it. Right. But even O1 is still the model. Yeah.
Swyx [00:35:22]: Like, there's no cognitive architecture on top of it.
Flo [00:35:23]: You can just wait for O1 to get better.
Alessio [00:35:25]: And so, as a founder, how do you think about that, right? Because now, knowing this, wouldn't you just wait to start Lindy? You know, you start Lindy, it's like 4K context, the models are not that good. It's like, but you're still kind of like going along and building and just like waiting for the models to get better. How do you today decide, again, what to build next, knowing that, hey, the models are going to get better, so maybe we just shouldn't focus on improving our prompt design and all that stuff and just build the connectors instead or whatever? Yeah.
Flo [00:35:51]: I mean, that's exactly what we do. Like, all day, we always ask ourselves, oh, when we have a feature idea or a feature request, we ask ourselves, like, is this the kind of thing that just gets better while we sleep because models get better? I'm reminded, again, when we started this in 2022, we spent a lot of time because we had to around context pruning because 4,000 tokens is really nothing. You really can't do anything with 4,000 tokens. All that work was throwaway work. Like, now it's like it was for nothing, right? Now we just assume that infinite context windows are going to be here in a year or something, a year and a half, and infinitely cheap as well, and dynamic compute is going to be here. Like, we just assume all of these things are going to happen, and so we really focus, our job to be done in the industry is to provide the input and output to the model. I really compare it all the time to the PC and the CPU, right? Apple is busy all day. They're not like a CPU wrapper. They have a lot to build, but they don't, well, now actually they do build the CPU as well, but leaving that aside, they're busy building a laptop. It's just a lot of work to build these things. It's interesting because, like,
Swyx [00:36:45]: for example, another person that we're close to, Mihaly from Repl.it, he often says that the biggest jump for him was having a multi-agent approach, like the critique thing that you just said that you don't need, and I wonder when, in what situations you do need that and what situations you don't. Obviously, the simple answer is for coding, it helps, and you're not coding, except for, are you still generating code? In Indy? Yeah.
Flo [00:37:09]: No, we do. Oh, right. No, no, no, the cognitive architecture changed. We don't, yeah.
Swyx [00:37:13]: Yeah, okay. For you, you're one shot, and you chain tools together, and that's it. And if the user really wants
Flo [00:37:18]: to have this kind of critique thing, you can also edit the prompt, you're welcome to. I have some of my Lindys, I've told them, like, hey, be careful, think step by step about what you're about to do, but that gives you a little bump for some use cases, but, yeah.
Alessio [00:37:30]: What about unexpected model releases? So, Anthropic released computer use today. Yeah. I don't know if many people were expecting computer use to come out today. Do these things make you rethink how to design, like, your roadmap and things like that, or are you just like, hey, look, whatever, that's just, like, a small thing in their, like, AGI pursuit, that, like, maybe they're not even going to support, and, like, it's still better for us to build our own integrations into systems and things like that. Because maybe people will say, hey, look, why am I building all these API integrations
Flo [00:38:02]: when I can just do computer use and never go to the product? Yeah. No, I mean, we did take into account computer use. We were talking about this a year ago or something, like, we've been talking about it as part of our roadmap. It's been clear to us that it was coming, My philosophy about it is anything that can be done with an API must be done by an API or should be done by an API for a very long time. I think it is dangerous to be overly cavalier about improvements of model capabilities. I'm reminded of iOS versus Android. Android was built on the JVM. There was a garbage collector, and I can only assume that the conversation that went down in the engineering meeting room was, oh, who cares about the garbage collector? Anyway, Moore's law is here, and so that's all going to go to zero eventually. Sure, but in the meantime, you are operating on a 400 MHz CPU. It was like the first CPU on the iPhone 1, and it's really slow, and the garbage collector is introducing a tremendous overhead on top of that, especially a memory overhead. For the longest time, and it's really only been recently that Android caught up to iOS in terms of how smooth the interactions were, but for the longest time, Android phones were significantly slower
Swyx [00:39:07]: and laggier
Flo [00:39:08]: and just not feeling as good as iOS devices. Look, when you're talking about modules and magnitude of differences in terms of performance and reliability, which is what we are talking about when we're talking about API use versus computer use, then you can't ignore that, right? And so I think we're going to be in an API use world for a while.
Swyx [00:39:27]: O1 doesn't have API use today. It will have it at some point, and it's on the roadmap. There is a future in which OpenAI goes much harder after your business, your market, than it is today. Like, ChatGPT, it's its own business. All they need to do is add tools to the ChatGPT, and now they're suddenly competing with you. And by the way, they have a GPT store where a bunch of people have already configured their tools to fit with them. Is that a concern?
Flo [00:39:56]: I think even the GPT store, in a way, like the way they architect it, for example, their plug-in systems are actually grateful because we can also use the plug-ins. It's very open. Now, again, I think it's going to be such a huge market. I think there's going to be a lot of different jobs to be done. I know they have a huge enterprise offering and stuff, but today, ChatGPT is a consumer app. And so, the sort of flow detail I showed you, this sort of workflow, this sort of use cases that we're going after, which is like, we're doing a lot of lead generation and lead outreach and all of that stuff. That's not something like meeting recording, like Lindy Today right now joins your Zoom meetings and takes notes, all of that stuff.
Swyx [00:40:34]: I don't see that so far
Flo [00:40:35]: on the OpenAI roadmap.
Swyx [00:40:36]: Yeah, but they do have an enterprise team that we talk to You're hiring GMs?
Flo [00:40:42]: We did.
Swyx [00:40:43]: It's a fascinating way to build a business, right? Like, what should you, as CEO, be in charge of? And what should you basically hire
Flo [00:40:52]: a mini CEO to do? Yeah, that's a good question. I think that's also something we're figuring out. The GM thing was inspired from my days at Uber, where we hired one GM per city or per major geo area. We had like all GMs, regional GMs and so forth. And yeah, Lindy is so horizontal that we thought it made sense to hire GMs to own each vertical and the go-to market of the vertical and the customization of the Lindy templates for these verticals and so forth. What should I own as a CEO? I mean, the canonical reply here is always going to be, you know, you own the fundraising, you own the culture, you own the... What's the rest of the canonical reply? The culture, the fundraising.
Swyx [00:41:29]: I don't know,
Flo [00:41:30]: products. Even that, eventually, you do have to hand out. Yes, the vision, the culture, and the foundation. Well, you've done your job as a CEO. In practice, obviously, yeah, I mean, all day, I do a lot of product work still and I want to keep doing product work for as long as possible.
Swyx [00:41:48]: Obviously, like you're recording and managing the team. Yeah.
Flo [00:41:52]: That one feels like the most automatable part of the job, the recruiting stuff.
Swyx [00:41:56]: Well, yeah. You saw my
Flo [00:41:59]: design your recruiter here. Relationship between Factorio and building Lindy. We actually very often talk about how the business of the future is like a game of Factorio. Yeah. So, in the instance, it's like Slack and you've got like 5,000 Lindys in the sidebar and your job is to somehow manage your 5,000 Lindys. And it's going to be very similar to company building because you're going to look for like the highest leverage way to understand what's going on in your AI company and understand what levels do you have to make impact in that company. So, I think it's going to be very similar to like a human company except it's going to go infinitely faster. Today, in a human company, you could have a meeting with your team and you're like, oh, I'm going to build a facility and, you know, now it's like, okay,
Swyx [00:42:40]: boom, I'm going to spin up 50 designers. Yeah. Like, actually, it's more important that you can clone an existing designer that you know works because the hiring process, you cannot clone someone because every new person you bring in is going to have their own tweaks
Flo [00:42:54]: and you don't want that. Yeah.
Swyx [00:42:56]: That's true. You want an army of mindless drones
Flo [00:42:59]: that all work the same way.
Swyx [00:43:00]: The reason I bring this, bring Factorio up as well is one, Factorio Space just came out. Apparently, a whole bunch of people stopped working. I tried out Factorio. I never really got that much into it. But the other thing was, you had a tweet recently about how the sort of intentional top-down design was not as effective as just build. Yeah. Just ship.
Flo [00:43:21]: I think people read a little bit too much into that tweet. It went weirdly viral. I was like, I did not intend it as a giant statement online.
Swyx [00:43:28]: I mean, you notice you have a pattern with this, right? Like, you've done this for eight years now.
Flo [00:43:33]: You should know. I legit was just hearing an interesting story about the Factorio game I had. And everybody was like, oh my God, so deep. I guess this explains everything about life and companies. There is something to be said, certainly, about focusing on the constraint. And I think it is Patrick Collison who said, people underestimate the extent to which moonshots are just one pragmatic step taken after the other. And I think as long as you have some inductive bias about, like, some loose idea about where you want to go, I think it makes sense to follow a sort of greedy search along that path. I think planning and organizing is important. And having older is important.
Swyx [00:44:05]: I'm wrestling with that. There's two ways I encountered it recently. One with Lindy. When I tried out one of your automation templates and one of them was quite big and I just didn't understand it, right? So, like, it was not as useful to me as a small one that I can just plug in and see all of. And then the other one was me using Cursor. I was very excited about O1 and I just up front
Flo [00:44:27]: stuffed everything
Swyx [00:44:28]: I wanted to do into my prompt and expected O1 to do everything. And it got itself into a huge jumbled mess and it was stuck. It was really... There was no amount... I wasted, like, two hours on just, like, trying to get out of that hole. So I threw away the code base, started small, switched to Clouds on it and build up something working and just add it over time and it just worked. And to me, that was the factorial sentiment, right? Maybe I'm one of those fanboys that's just, like, obsessing over the depth of something that you just randomly tweeted out. But I think it's true for company building, for Lindy building, for coding.
Flo [00:45:02]: I don't know. I think it's fair and I think, like, you and I talked about there's the Tuft & Metal principle and there's this other... Yes, I love that. There's the... I forgot the name of this other blog post but it's basically about this book Seeing Like a State that talks about the need for legibility and people who optimize the system for its legibility and anytime you make a system... So legible is basically more understandable. Anytime you make a system more understandable from the top down, it performs less well from the bottom up. And it's fine but you should at least make this trade-off with your eyes wide open. You should know, I am sacrificing performance for understandability, for legibility. And in this case, for you, it makes sense. It's like you are actually optimizing for legibility. You do want to understand your code base but in some other cases it may not make sense. Sometimes it's better to leave the system alone and let it be its glorious, chaotic, organic self and just trust that it's going to perform well even though you don't understand it completely.
Swyx [00:45:55]: It does remind me of a common managerial issue or dilemma which you experienced in the small scale of Lindy where, you know, do you want to organize your company by functional sections or by products or, you know, whatever the opposite of functional is. And you tried it one way and it was more legible to you as CEO but actually it stopped working at the small level. Yeah.
Flo [00:46:17]: I mean, one very small example, again, at a small scale is we used to have everything on Notion. And for me, as founder, it was awesome because everything was there. The roadmap was there. The tasks were there. The postmortems were there. And so, the postmortem was linked
Swyx [00:46:31]: to its task.
Flo [00:46:32]: It was optimized for you. Exactly. And so, I had this, like, one pane of glass and everything was on Notion. And then the team, one day,
Swyx [00:46:39]: came to me with pitchforks
Flo [00:46:40]: and they really wanted to implement Linear. And I had to bite my fist so hard. I was like, fine, do it. Implement Linear. Because I was like, at the end of the day, the team needs to be able to self-organize and pick their own tools.
Alessio [00:46:51]: Yeah. But it did make the company slightly less legible for me. Another big change you had was going away from remote work, every other month. The discussion comes up again. What was that discussion like? How did your feelings change? Was there kind of like a threshold of employees and team size where you felt like, okay, maybe that worked. Now it doesn't work anymore. And how are you thinking about the future
Flo [00:47:12]: as you scale the team? Yeah. So, for context, I used to have a business called TeamFlow. The business was about building a virtual office for remote teams. And so, being remote was not merely something we did. It was, I was banging the remote drum super hard and helping companies to go remote. And so, frankly, in a way, it's a bit embarrassing for me to do a 180 like that. But I guess, when the facts changed, I changed my mind. What happened? Well, I think at first, like everyone else, we went remote by necessity. It was like COVID and you've got to go remote. And on paper, the gains of remote are enormous. In particular, from a founder's standpoint, being able to hire from anywhere is huge. Saving on rent is huge. Saving on commute is huge for everyone and so forth. But then, look, we're all here. It's like, it is really making it much harder to work together. And I spent three years of my youth trying to build a solution for this. And my conclusion is, at least we couldn't figure it out and no one else could. Zoom didn't figure it out. We had like a bunch of competitors. Like, Gathertown was one of the bigger ones. We had dozens and dozens of competitors. No one figured it out. I don't know that software can actually solve this problem. The reality of it is, everyone just wants to get off the darn Zoom call. And it's not a good feeling to be in your home office if you're even going to have a home office all day. It's harder to build culture. It's harder to get in sync. I think software is peculiar because it's like an iceberg. It's like the vast majority of it is submerged underwater. And so, the quality of the software that you ship is a function of the alignment of your mental models about what is below that waterline. Can you actually get in sync about what it is exactly fundamentally that we're building? What is the soul of our product? And it is so much harder to get in sync about that when you're remote. And then you waste time in a thousand ways because people are offline and you can't get a hold of them or you can't share your screen. It's just like you feel like you're walking in molasses all day. And eventually, I was like, okay, this is it. We're not going to do this anymore.
Swyx [00:49:03]: Yeah. I think that is the current builder San Francisco consensus here. Yeah. But I still have a big... One of my big heroes as a CEO is Sid Subban from GitLab.
Flo [00:49:14]: Mm-hmm.
Swyx [00:49:15]: Matt Mullenweg
Flo [00:49:16]: used to be a hero.
Swyx [00:49:17]: But these people run thousand-person remote businesses. The main idea is that at some company size, your company is remote anyway. Yeah. Because if you go from one building to two buildings, congrats, you're now remote from the other building. If you want to go from one city office to two city offices, they're remote from each other.
Flo [00:49:35]: But the teams are co-located. Every time anyone talks about remote success stories, they always talk about this real force. Yeah. It's always GitLab and WordPress and Zapier. Zapier. It used to be Envision. And I will point out that in every one of these examples, you have a co-located counterfactual that is sometimes orders of magnitude bigger. Look, I like Matt Mullenweg a lot, but WordPress is a commercial failure. They run 60% of the internet and they're like a fraction of the size of even Substack. Right?
Swyx [00:50:05]: They're trying to get more money.
Flo [00:50:07]: Yeah, that's my point, right? Look, GitLab is much smaller than GitHub. Envision, you know, is no more. And Figma, like, completely took off. And Figma was like very in-person. So, I think if you're optimizing for productivity, if you really know, hey, this is a support ticket, right, and I want to have my support ticket for a buck 50 per support ticket and next year I want it for a buck 20, then sure, send your support ticket team to offshore, like the Philippines or whatever, and just optimize for cost. If you're optimizing for cost, absolutely be remote. If you're optimizing for creativity, which I think that software and product building is a creative endeavor, if you're optimizing for creativity, it's kind of like you have to be in person and hear the music to do that.
Swyx [00:50:52]: Yeah. Maybe the line is that all jobs that can be remote should be AI or Lindy's and all jobs that are not remote are in person. Like, there's a very,
Flo [00:51:04]: very clear separation of jobs. Sure. Well, I think over the long term,
Swyx [00:51:09]: every job is going to be AI anyway. It would be curious to break down what you think is creativity in coding and in product defining and how to express that for sure. You're definitely what I call a temperature zero use case of LLMs. You want it to be reliable, predictable, small. And then there's other use cases of LLMs that are more for creativity and engines. Right? I haven't checked, but I'm pretty sure no one uses Lindy for brainstorming. Actually,
Flo [00:51:36]: probably they do. I use Lindy for brainstorming
Swyx [00:51:38]: a lot, actually. Yeah, yeah. But you want to have something that's anti-fragile to hallucination. Hallucinations are good.
Flo [00:51:45]: By creativity, I mean, is it about direction or magnitude? If it is about direction, like decide what to do, then it's a creative endeavor. If it is about magnitude and just do it as fast as possible, as cheap as possible, then it's magnitude. And so sometimes, you know, software companies are not necessarily creative. Sometimes you know what you're doing. And I'll say that it's going to come across the wrong way, but linear. I look up to a huge amount, like such amazing product builders, but they know what they're building. They're building a I don't mean to throw shade at them. Like, good for them.
Swyx [00:52:20]: I think they're aware that they're not like They recently got s**t for saying that they have work-life balance on their job description.
Flo [00:52:26]: They're like, what do you mean by this? We're building a new kind of product that no one's ever built before. And so we're just scratching our heads all day trying to get in sync about like, what exactly is it
Swyx [00:52:37]: that we're building? What does it consist of? Inherently creative struggle. Yeah. Dare we ask about San Francisco? And there's a whole bunch of tough stuff in here. Probably the biggest one I would just congratulate you on is becoming American, right? Very French, but your heart was sort of in the U.S. You eventually found your way here. What are your takes for founders? A few years ago, you wrote this post on Go West, young man. And now you've basically completed that journey, right? You're now here and up to the point where you're kind of mystified by how Europe has been so decel.
Flo [00:53:11]: In a way, though, I feel vindicated because I was making the prediction that Europe was over 14 years ago or something like that. I think it's been a walking corpse for a long time. I think it is only now becoming obvious that it is paying the consequences of its policies from 10, 20, 30 years ago. I think at this point, I wish I could rewrite the Go West, young man article but really even more extreme. I think at this point, if you are in tech, especially in AI, but if you're in tech and you're not in San Francisco, you either lack judgment or you lack ambition. It's funny, I recently told that to someone and they were like, oh, not everyone wants to be like a unicorn founder. And I was like, like I said, judgment or ambition. It's fine to not have ambition. It's fine to want to prioritize other things than your company in life or your career in life. That's perfectly okay. But know that that's the trade-off you're making. If you prioritize your career, you've got to be here.
Alessio [00:54:03]: As a fellow European escapist, I grew up in Rome.
Flo [00:54:05]: Yeah, how do you feel?
Swyx [00:54:06]: We never talk about your feelings about Europe.
Alessio [00:54:08]: Yeah, I've been in the U.S. now six years. Well, I started my first company in Europe 10 years ago, something like that. Yeah, you can tell nobody really wants to do much. And then you're like, okay. It's funny, I was looking back through some old tweets and I was sending all these tweets to Marc Andreessen like 15 years ago like trying to like learn more about why are you guys putting money in these things that most people here would say you're like crazy to like even back. And eventually, you know, I started doing venture six, five years ago. And I think just like so many people in Europe reach out and ask, hey, can you like talk to our team and they just cannot comprehend like the risk appetite that people have here. It's just like so foreign to people, at least in Italy and like in some parts of Europe. I'm sure there's some great founders in Europe, but like the average European founders, like why would I leave my job at the post office to go work on the startup that could change everything and become very successful but might go out of business instead in the U.S. You have like, you know, we host a hackathon and it's like 400 people and it's like, where can I go work that it's like no job security, you know? It's just like completely different and there's no incentives from the government to change that. There's no way you can like change such a deep-rooted culture of like, you know, going and wine and April spritz
Flo [00:55:27]: and all of that
Alessio [00:55:28]: early in the afternoon.
Flo [00:55:29]: So, I don't really know how it's going to change.
Alessio [00:55:32]: It's quality of life. Yeah, totally. That's why I left. The quality is so high that I left. But again, I think it's better to move here and just, if you want to do this job and do this, you should be here. If you don't want to, that's fine.
Flo [00:55:47]: But like,
Alessio [00:55:48]: don't copium. Don't be like, oh no, you can also be successful doing this and knees or like whatever. No, probably not, you know? So,
Flo [00:55:59]: yeah,
Alessio [00:56:00]: I've already done my N400
Flo [00:56:01]: so I should get my U.S. citizenship interview soon. Yeah. And I think to be fair, I think what's happening right now to Europe and they've said no to capitalism. They've decided to say no to capitalism a long time ago. They've like completely over-regulated. Taxation is much too high and so forth. But I also think some of this is a little bit of a self-fulfilling prophecy or it's a self-perpetuating phenomenon because, look, to your point, like once there is a network effect that's just so incredibly powerful, they can't be broken, really. And we tried with San Francisco. I tried with San Francisco. Like during COVID,
Swyx [00:56:35]: there was a movement of people moving to Miami.
Flo [00:56:38]: How did that pan out? You can't break the network effect,
Swyx [00:56:41]: you know? It's so annoying because first principles wise, tech should not be here. Like tech should be in Miami because it's just a better city.
Flo [00:56:48]: San Francisco does not want tech to be here.
Swyx [00:56:50]: San Francisco hates tech.
Flo [00:56:51]: 100%.
Swyx [00:56:52]: This is the thing I actually wrote down.
Alessio [00:56:54]: San Francisco hates tech. It is true. I think the people that are in San Francisco that were here before, tech hated it and then there's kind of like this passed down thing. But I would say people in Miami would hate it too if there were too much of it. You know? The Mickey Beach crowd would also not gel.
Swyx [00:57:08]: They're just rich enough and chill enough to not care.
Flo [00:57:10]: Yeah, I think so too.
Swyx [00:57:11]: They're like, oh, crypto kids.
Flo [00:57:13]: Okay, cool. Yeah. Miami celebrates success which is one thing
Swyx [00:57:17]: I loved about it.
Flo [00:57:18]: A little bit too much.
Swyx [00:57:19]: Maybe the last thing I'll mention, I just wanted a little bit of EUAC talk. I think that's good. I'll maybe carve out that I think the UK has done really well. That's an argument for the UK not being part of Europe is that, you know, the AI institutions there at least have done very well. Right?
Flo [00:57:34]: Sure. I think a lot of Britain is in the gutter. Yeah, exactly.
Swyx [00:57:38]: They've been stagnating at best. And then France has a few wins.
Flo [00:57:41]: Who?
Swyx [00:57:42]: Mistral.
Flo [00:57:43]: Who uses Mistral?
Swyx [00:57:44]: Hugging face.
Flo [00:57:45]: A few wins.
Swyx [00:57:46]: I'm just saying. They disappointed their first AI minister. You know the meme with the guy
Flo [00:57:51]: who's celebrating with his trophy and then he's like, no, that's France. Right? To me, that's France. It's like, aha, look, we've got Mistral! It's like champagne! It's like maybe 1% of market share. And by the way, and it's not a critic of them, it's a critic of France and of Europe. And by the way, I think I've heard that the Mistral guys were moving to the US. They're opening an office here. They're opening an office here. But, I mean,
Swyx [00:58:15]: they're very French, right?
Flo [00:58:16]: Right.
Swyx [00:58:17]: You can't really avoid it. There's one interesting counter move which is Jason Warner and ISOCAT moving to Paris for poolside. I don't know. It remains to be seen how that move is going. Maybe the last thing I'll say, you know, that's the Europe talk. We try not to do politics so much, but you're here. One thing that you do a lot is you test your overturned windows. Right? Like far more than any founder I know. You know it's not your job. Someone, for sure, you're just indulging. But also, I think you consciously test. And I just want to see what drives you there and why do you keep doing it? Because you treat very spicy stuff, especially for like the San Francisco sort of liberal dynasty.
Flo [00:58:59]: I don't know because I assume you're referring to I posted something about pronouns and how nonsense...
Swyx [00:59:05]: Just in general. I don't want you to focus on any particular thing unless you want to.
Flo [00:59:09]: You know, well, that tweet in particular, when I was tweeting it, I was like, oh, this is kind of spicy. Should I do this? And then I just did it. And I received zero pushback.
Swyx [00:59:20]: And the tweet was actually
Flo [00:59:21]: pretty successful and I received a lot of people reaching out like, oh my God, so true. I think it's coming from a few different places. One, life is more fun this way. Like I don't feel like if everyone always self-censors, you never know what everyone, what anyone thinks. And so it's becoming like a self-perpetuating thing. It's like a public lies, private truth sort of phenomenon. Or like, you know, there's this phenomenon called the preference cascade. It's like, there's this joke. It's like, oh, there's only one communist left in USSR. The problem is no one knows which one it is. So everyone pretends to be communist because everyone else pretends to be communist. And so I think there's a role to be played when you have a boss who's going to fire me. It's like, look, if I don't speak up and if founders don't speak up, I'm like, why? What are you afraid of? Right? Like, there's really not that much downside. And I think there's
Swyx [01:00:14]: something to be said about standing up for what you think is right and being real and owning your opinions. I think there's a correlation there between having that level of independence for your political beliefs and free speech or whatever and the way
Flo [01:00:27]: that you think about business too. But I think there's such a powerful insight at its core, which is groupthink is real and pervasive and really problematic. Like, your brain constantly shuts down because you're not even thinking in your other way or you're not thinking. You just look around you and you decide to adopt the same beliefs as people around you. And everyone thinks
Swyx [01:00:48]: they're immune
Flo [01:00:49]: and everyone else
Swyx [01:00:50]: is doing it
Flo [01:00:51]: except themselves. I'm a special snowflake. I have free will. That's right. And so I actually make it a point to look for, and then I think about it and I'm like, do I believe this thing? And very often the answer is yes. And then I just say it. And so I think the AI safety is an example of that. Like, at some point, Marc Andreessen blocked me on Twitter and it hurt, frankly. I really look up to Marc Andreessen
Swyx [01:01:13]: and I knew he would block me. It means you're successful on Twitter.
Flo [01:01:17]: It's just the right message. Marc Andreessen was really my booster initially on Twitter. He really made my account. And I was like, look, I'm really concerned about AI safety. It is an unpopular view
Swyx [01:01:27]: among my peers. I remember, you were one of the few that actually came out in support of the bill.
Flo [01:01:32]: I came out in support of SB1047 a year and a half ago. I put like some tweet storms about how I was really concerned. And yeah, I was blocked by a bunch of AI safety people and I don't like it, but you know, it's funny, maybe it's my French education. But look, in France, World War II is very present in people's minds and the phenomenon of people collaborating with the Nazis and there's always this sort of debate that people have like at dinner and it's like, ah, would you really have resisted during World War II? And everybody is always saying, oh yeah, we totally have resisted. It's like, yeah, but no. The reality of it is 95% of the country did not resist and most of it actually collaborated actively with the Nazis. And so 95% of y'all are wrong. You would actually have collaborated, right? I've always told myself I will stand up for what I think is right because some people got attacked and the way I was brought up is if someone gets attacked before you, you get involved. It doesn't matter, you get involved and you help the person, right? And so, look, I'm not pretending we're nowhere near a World War II phenomenon but I'm like, exactly because we are nowhere near
Alessio [01:02:45]: this kind of phenomenon. The stakes are so low and if you're not going to stand up
Flo [01:02:49]: for what you think is right when the stakes are so low,
Swyx [01:02:52]: are you going to stand up when it matters? There's an inconsistency in your statements because you simultaneously believe that AGI is very soon and you also say stakes are low. You can't believe both are real.
Flo [01:03:03]: Well, why does AGI make the stakes of speaking up higher?
Swyx [01:03:06]: Sorry, the stakes of safety.
Flo [01:03:08]: Oh yeah, no, the stakes of AI
Swyx [01:03:11]: are like physical safety?
Flo [01:03:12]: No, AI safety. Oh no, the stakes of AI safety couldn't be higher.
Swyx [01:03:17]: I meant the stakes
Flo [01:03:18]: of speaking up about
Alessio [01:03:19]: pronouns or whatever. How do you figure out who's real and who isn't? Because there was a manifesto for responsible AI that hundreds of VCs and people signed and I don't think anybody actually thinks about it anymore.
Flo [01:03:30]: Was that the pause letter?
Swyx [01:03:31]: The six-month pause?
Flo [01:03:32]: No,
Alessio [01:03:33]: there was something else that I think general catalyst and some fun sign. And then there's maybe the anthropic case which is like, hey, we're leaving open AI because you guys don't take security seriously and then it's like, hey, what if we gave AI access to a whole computer
Flo [01:03:49]: to just go do things?
Alessio [01:03:50]: How do you reconcile like, okay, I mean, you could say the same thing about Lindy. It's like, if you're worried about AI safety, why are you building AI? Right? That's kind of like the extreme thinking. How do you internally decide between participation and talking about it and saying, hey, I think this is important but I'm still going to build towards that and building actually makes it safer because I'm involved versus just being like anti. I think this is unsafe but then not do anything about it and just kind of remove yourself
Flo [01:04:20]: from the whole thing. What I think about our own involvement here is I'm acutely concerned about the risks at the model layer and I'm simultaneously very excited about the upside. Like, for the record, my PDoom, insofar as I can quantify it, which I cannot, but if I had to, like my vibe is like 10% or something like that and so there's like a 90% chance that we live in like a pure utopia. Right? And that's awesome. Right? So like, let's go after utopia. Right? Let's talk about the 10% chance that we live in a utopia where there's no disease and it's like a post-scarcity world. I think that utopia is going to happen through, like again, I'm bringing my little contribution to the movement. I think it would be silly to say no to the upside because you're concerned about the downside. At the same time, we want to be concerned about the downside. I know that it's very self-serving to say, oh, you know, like the downside doesn't exist at my layer, it exists at like the model layer. But truly, look at Lindy, look at the Apple building. I struggle to see exactly how it would like get up if I'm concerned about the model layer.
Swyx [01:05:21]: Okay. Well, this kind of discussion can go on for hours. It is still daylight, so not the best time for it. But I really appreciate you spending the time. Any other last calls to actions or thoughts that you feel like you want to get off your chest?
Flo [01:05:33]: AGI is coming.
Flo [01:05:37]: Are you hiring
Alessio [01:05:38]: for any roles? We are.
Flo [01:05:40]: Oh yeah, I guess that should be the...
Swyx [01:05:43]: Don't bother.
Flo [01:05:44]: No, can you stop saying AGI is coming and just talk about it? We are also hiring yeah, we are hiring designers and engineers right now. Yeah. So hit me up at flo.lindy.ai
Alessio [01:05:55]: And then go talk to my Lindy. You're not actually going to read it.
Flo [01:05:58]: Actually, I have wondered
Swyx [01:05:59]: how many times when I talk to you, I'm talking to a bot. Part of that is I don't have to know, right?
Flo [01:06:05]: That's right. Well, it's actually doubly confusing because we also have a teammate
Swyx [01:06:09]: whose name is Lindy. Yes, I was wondering when I met her, I was like, wait, did you hire her first?
Flo [01:06:14]: Marketing is fun. No, she was an inspiration after we named the company both after her. Oh, okay.
Swyx [01:06:19]: Interesting. Yeah, wonderful. I'll comment on the design piece just because I think that there are a lot of AI companies that very much focus on the functionality and the models and the capabilities and the benchmark. But I think that increasingly I'm seeing people differentiate with design and people want to use beautiful products and people who can figure that out and integrate the AI into their human lives. You know, design at the limit. One, at the lowest level is to make this look pretty, make this look like Stripe or Linear's homepage. That's design. But at the highest level of design it is make this integrate seamlessly into my life. Intuitive, beautiful, inspirational maybe even. And I think that companies that, you know, this is kind of like a blog post I've been thinking about, companies that emphasize design actually are going to win more than companies that don't. Yeah,
Flo [01:07:06]: I love Steve Jobs' quote and I'm going to butcher it. It's something like, design is the expression of the soul of a man-made product through successive layers of design. Jesus. Right? He was good. He was cooking. He was cooking on that one. He was cooking. It starts with the soul of the product which is why I was saying it is so important to reach alignment about that soul of the product, right? It's like an onion, like you peel the onion in those layers, right? And you design an entire journey just like the user experiencing your product chronologically all the way from the beginning of like the awareness stage I think it is also the job of the designer to design that part of the experience. It's like, okay, design is immensely important. Okay.
Alessio [01:07:46]: Lovely. Yeah.
Flo [01:07:48]: Thanks for coming on, Flo. Yeah, absolutely. Thanks for having me.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
Agents @ Work: Dust.tt
11 nov· Latent Space: The AI Engineer Podcast
We are recording our next big recap episode and taking questions!
Submit questions and messages on Speakpipe here for a chance to appear on the show!
Also subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups!
In our first ever episode with Logan Kilpatrick we called out the two hottest LLM frameworks at the time: LangChain and Dust. We’ve had Harrison from LangChain on twice (as a guest and as a co-host), and we’ve now finally come full circle as Stanislas from Dust joined us in the studio.
After stints at Oracle and Stripe, Stan had joined OpenAI to work on mathematical reasoning capabilities. He describes his time at OpenAI as "the PhD I always wanted to do" while acknowledging the challenges of research work: "You're digging into a field all day long for weeks and weeks, and you find something, you get super excited for 12 seconds. And at the 13 seconds, you're like, 'oh, yeah, that was obvious.' And you go back to digging."
This experience, combined with early access to GPT-4's capabilities, shaped his decision to start Dust: "If we believe in AGI and if we believe the timelines might not be too long, it's actually the last train leaving the station to start a company. After that, it's going to be computers all the way down."
The History of Dust
Dust's journey can be broken down into three phases:
* Developer Framework (2022): Initially positioned as a competitor to LangChain, Dust started as a developer tooling platform. While both were open source, their approaches differed – LangChain focused on broad community adoption and integration as a pure developer experience, while Dust emphasized UI-driven development and better observability that wasn’t just `print` statements.
* Browser Extension (Early 2023): The company pivoted to building XP1, a browser extension that could interact with web content. This experiment helped validate user interaction patterns with AI, even while using less capable models than GPT-4.
* Enterprise Platform (Current): Today, Dust has evolved into an infrastructure platform for deploying AI agents within companies, with impressive metrics like 88% daily active users in some deployments.
The Case for Being Horizontal
The big discussion for early stage companies today is whether or not to be horizontal or vertical. Since models are so good at general tasks, a lot of companies are building vertical products that take care of a workflow end-to-end in order to offer more value and becoming more of “Services as Software”. Dust on the other hand is a platform for the users to build their own experiences, which has had a few advantages:
* Maximum Penetration: Dust reports 60-70% weekly active users across entire companies, demonstrating the potential reach of horizontal solutions rather than selling into a single team.
* Emergent Use Cases: By allowing non-technical users to create agents, Dust enables use cases to emerge organically from actual business needs rather than prescribed solutions.
* Infrastructure Value: The platform approach creates lasting value through maintained integrations and connections, similar to how Stripe's value lies in maintaining payment infrastructure. Rather than relying on third-party integration providers, Dust maintains its own connections to ensure proper handling of different data types and structures.
The Vertical Challenge
However, this approach comes with trade-offs:
* Harder Go-to-Market: As Stan talked about: "We spike at penetration... but it makes our go-to-market much harder. Vertical solutions have a go-to-market that is much easier because they're like, 'oh, I'm going to solve the lawyer stuff.'"
* Complex Infrastructure: Building a horizontal platform requires maintaining numerous integrations and handling diverse data types appropriately – from structured Salesforce data to unstructured Notion pages. As you scale integrations, the cost of maintaining them also scales.
* Product Surface Complexity: Creating an interface that's both powerful and accessible to non-technical users requires careful design decisions, down to avoiding technical terms like "system prompt" in favor of "instructions."
The Future of AI Platforms
Stan initially predicted we'd see the first billion-dollar single-person company in 2023 (a prediction later echoed by Sam Altman), but he's now more focused on a different milestone: billion-dollar companies with engineering teams of just 20 people, enabled by AI assistance.
This vision aligns with Dust's horizontal platform approach – building the infrastructure that allows small teams to achieve outsized impact through AI augmentation. Rather than replacing entire job functions (the vertical approach), they're betting on augmenting existing workflows across organizations.
Full YouTube Episode
Chapters
* 00:00:00 Introductions
* 00:04:33 Joining OpenAI from Paris
* 00:09:54 Research evolution and compute allocation at OpenAI
* 00:13:12 Working with Ilya Sutskever and OpenAI's vision
* 00:15:51 Leaving OpenAI to start Dust
* 00:18:15 Early focus on browser extension and WebGPT-like functionality
* 00:20:20 Dust as the infrastructure for agents
* 00:24:03 Challenges of building with early AI models
* 00:28:17 LLMs and Workflow Automation
* 00:35:28 Building dependency graphs of agents
* 00:37:34 Simulating API endpoints
* 00:40:41 State of AI models
* 00:43:19 Running evals
* 00:46:36 Challenges in building AI agents infra
* 00:49:21 Buy vs. build decisions for infrastructure components
* 00:51:02 Future of SaaS and AI's Impact on Software
* 00:53:07 The single employee $1B company race
* 00:56:32 Horizontal vs. vertical approaches to AI agents
Transcript
Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.
Swyx [00:00:11]: Hey, and today we're in a studio with Stanislas, welcome.
Stan [00:00:14]: Thank you very much for having me.
Swyx [00:00:16]: Visiting from Paris.
Stan [00:00:17]: Paris.
Swyx [00:00:18]: And you have had a very distinguished career. It's very hard to summarize, but you went to college in both Ecopolytechnique and Stanford, and then you worked in a number of places, Oracle, Totems, Stripe, and then OpenAI pre-ChatGPT. We'll talk, we'll spend a little bit of time about that. About two years ago, you left OpenAI to start Dust. I think you were one of the first OpenAI alum founders.
Stan [00:00:40]: Yeah, I think it was about at the same time as the Adept guys, so that first wave.
Swyx [00:00:46]: Yeah, and people really loved our David episode. We love a few sort of OpenAI stories, you know, for back in the day, like we're talking about pre-recording. Probably the statute of limitations on some of those stories has expired, so you can talk a little bit more freely without them coming after you. But maybe we'll just talk about, like, what was your journey into AI? You know, you were at Stripe for almost five years, there are a lot of Stripe alums going into OpenAI. I think the Stripe culture has come into OpenAI quite a bit.
Stan [00:01:11]: Yeah, so I think the buses of Stripe people really started flowing in, I guess, after ChatGPT. But, yeah, my journey into AI is a... I mean, Greg Brockman. Yeah, yeah. From Greg, of course. And Daniela, actually, back in the days, Daniela Amodei.
Swyx [00:01:27]: Yes, she was COO, I mean, she is COO, yeah. She had a pretty high job at OpenAI at the time, yeah, for sure.
Stan [00:01:34]: My journey started as anybody else, you're fascinated with computer science and you want to make them think, it's awesome, but it doesn't work. I mean, it was a long time ago, it was like maybe 16, so it was 25 years ago. Then the first big exposure to AI would be at Stanford, and I'm going to, like, disclose a whole lamb, because at the time it was a class taught by Andrew Ng, and there was no deep learning. It was half features for vision and a star algorithm. So it was fun. But it was the early days of deep learning. At the time, I think a few years after, it was the first project at Google. But you know, that cat face or the human face trained from many images. I went to, hesitated doing a PhD, more in systems, eventually decided to go into getting a job. Went at Oracle, started a company, did a gazillion mistakes, got acquired by Stripe, worked with Greg Buckman there. And at the end of Stripe, I started interesting myself in AI again, felt like it was the time, you had the Atari games, you had the self-driving craziness at the time. And I started exploring projects, it felt like the Atari games were incredible, but there were still games. And I was looking into exploring projects that would have an impact on the world. And so I decided to explore three things, self-driving cars, cybersecurity and AI, and math and AI. It's like I sing it by a decreasing order of impact on the world, I guess.
Swyx [00:03:01]: Discovering new math would be very foundational.
Stan [00:03:03]: It is extremely foundational, but it's not as direct as driving people around.
Swyx [00:03:07]: Sorry, you're doing this at Stripe, you're like thinking about your next move.
Stan [00:03:09]: No, it was at Stripe, kind of a bit of time where I started exploring. I did a bunch of work with friends on trying to get RC cars to drive autonomously. Almost started a company in France or Europe about self-driving trucks. We decided to not go for it because it was probably very operational. And I think the idea of the company, of the team wasn't there. And also I realized that if I wake up a day and because of a bug I wrote, I killed a family, it would be a bad experience. And so I just decided like, no, that's just too crazy. And then I explored cybersecurity with a friend. We're trying to apply transformers to cut fuzzing. So cut fuzzing, you have kind of an algorithm that goes really fast and tries to mutate the inputs of a library to find bugs. And we tried to apply a transformer to that and do reinforcement learning with the signal of how much you propagate within the binary. Didn't work at all because the transformers are so slow compared to evolutionary algorithms that it kind of didn't work. Then I started interested in math and AI and started working on SAT solving with AI. And at the same time, OpenAI was kind of starting the reasoning team that were tackling that project as well. I was in touch with Greg and eventually got in touch with Ilya and finally found my way to OpenAI. I don't know how much you want to dig into that. The way to find your way to OpenAI when you're in Paris was kind of an interesting adventure as well.
Swyx [00:04:33]: Please. And I want to note, this was a two-month journey. You did all this in two months.
Stan [00:04:38]: The search.
Swyx [00:04:40]: Your search for your next thing, because you left in July 2019 and then you joined OpenAI in September.
Stan [00:04:45]: I'm going to be ashamed to say that.
Swyx [00:04:47]: You were searching before. I was searching before.
Stan [00:04:49]: I mean, it's normal. No, the truth is that I moved back to Paris through Stripe and I just felt the hardship of being remote from your team nine hours away. And so it kind of freed a bit of time for me to start the exploration before. Sorry, Patrick. Sorry, John.
Swyx [00:05:05]: Hopefully they're listening. So you joined OpenAI from Paris and from like, obviously you had worked with Greg, but not
Stan [00:05:13]: anyone else. No. Yeah. So I had worked with Greg, but not Ilya, but I had started chatting with Ilya and Ilya was kind of excited because he knew that I was a good engineer through Greg, I presume, but I was not a trained researcher, didn't do a PhD, never did research. And I started chatting and he was excited all the way to the point where he was like, hey, come pass interviews, it's going to be fun. I think he didn't care where I was, he just wanted to try working together. So I go to SF, go through the interview process, get an offer. And so I get Bob McGrew on the phone for the first time, he's like, hey, Stan, it's awesome. You've got an offer. When are you coming to SF? I'm like, hey, it's awesome. I'm not coming to the SF. I'm based in Paris and we just moved. He was like, hey, it's awesome. Well, you don't have an offer anymore. Oh, my God. No, it wasn't as hard as that. But that's basically the idea. And it took me like maybe a couple more time to keep chatting and they eventually decided to try a contractor set up. And that's how I kind of started working at OpenAI, officially as a contractor, but in practice really felt like being an employee.
Swyx [00:06:14]: What did you work on?
Stan [00:06:15]: So it was solely focused on math and AI. And in particular in the application, so the study of the larger grid models, mathematical reasoning capabilities, and in particular in the context of formal mathematics. The motivation was simple, transformers are very creative, but yet they do mistakes. Formal math systems are of the ability to verify a proof and the tactics they can use to solve problems are very mechanical, so you miss the creativity. And so the idea was to try to explore both together. You would get the creativity of the LLMs and the kind of verification capabilities of the formal system. A formal system, just to give a little bit of context, is a system in which a proof is a program and the formal system is a type system, a type system that is so evolved that you can verify the program. If the type checks, it means that the program is correct.
Swyx [00:07:06]: Is the verification much faster than actually executing the program?
Stan [00:07:12]: Verification is instantaneous, basically. So the truth is that what you code in involves tactics that may involve computation to search for solutions. So it's not instantaneous. You do have to do the computation to expand the tactics into the actual proof. The verification of the proof at the very low level is instantaneous.
Swyx [00:07:32]: How quickly do you run into like, you know, halting problem PNP type things, like impossibilities where you're just like that?
Stan [00:07:39]: I mean, you don't run into it at the time. It was really trying to solve very easy problems. So I think the... Can you give an example of easy? Yeah, so that's the mass benchmark that everybody knows today. The Dan Hendricks one. The Dan Hendricks one, yeah. And I think it was the low end part of the mass benchmark at the time, because that mass benchmark includes AMC problems, AMC 8, AMC 10, 12. So these are the easy ones. Then AIME problems, somewhat harder, and some IMO problems, like Crazy Arm.
Swyx [00:08:07]: For our listeners, we covered this in our Benchmarks 101 episode. AMC is literally the grade of like high school, grade 8, grade 10, grade 12. So you can solve this. Just briefly to mention this, because I don't think we'll touch on this again. There's a bit of work with like Lean, and then with, you know, more recently with DeepMind doing like scoring like silver on the IMO. Any commentary on like how math has evolved from your early work to today?
Stan [00:08:34]: I mean, that result is mind blowing. I mean, from my perspective, spent three years on that. At the same time, Guillaume Lampe in Paris, we were both in Paris, actually. He was at FAIR, was working on some problems. We were pushing the boundaries, and the goal was the IMO. And we cracked a few problems here and there. But the idea of getting a medal at an IMO was like just remote. So this is an impressive result. And we can, I think the DeepMind team just did a good job of scaling. I think there's nothing too magical in their approach, even if it hasn't been published. There's a Dan Silver talk from seven days ago where it goes a little bit into more details. It feels like there's nothing magical there. It's really applying reinforcement learning and scaling up the amount of data that can generate through autoformalization. So we can dig into what autoformalization means if you want.
Alessio [00:09:26]: Let's talk about the tail end, maybe, of the OpenAI. So you joined, and you're like, I'm going to work on math and do all of these things. I saw on one of your blog posts, you mentioned you fine-tuned over 10,000 models at OpenAI using 10 million A100 hours. How did the research evolve from the GPD 2, and then getting closer to DaVinci 003? And then you left just before ChatGPD was released, but tell people a bit more about the research path that took you there.
Stan [00:09:54]: I can give you my perspective of it. I think at OpenAI, there's always been a large chunk of the compute that was reserved to train the GPTs, which makes sense. So it was pre-entropic splits. Most of the compute was going to a product called Nest, which was basically GPT-3. And then you had a bunch of, let's say, remote, not core research teams that were trying to explore maybe more specific problems or maybe the algorithm part of it. The interesting part, I don't know if it was where your question was going, is that in those labs, you're managing researchers. So by definition, you shouldn't be managing them. But in that space, there's a managing tool that is great, which is compute allocation. Basically by managing the compute allocation, you can message the team of where you think the priority should go. And so it was really a question of, you were free as a researcher to work on whatever you wanted. But if it was not aligned with OpenAI mission, and that's fair, you wouldn't get the compute allocation. As it happens, solving math was very much aligned with the direction of OpenAI. And so I was lucky to generally get the compute I needed to make good progress.
Swyx [00:11:06]: What do you need to show as incremental results to get funded for further results?
Stan [00:11:12]: It's an imperfect process because there's a bit of a... If you're working on math and AI, obviously there's kind of a prior that it's going to be aligned with the company. So it's much easier than to go into something much more risky, much riskier, I guess. You have to show incremental progress, I guess. It's like you ask for a certain amount of compute and you deliver a few weeks after and you demonstrate that you have a progress. Progress might be a positive result. Progress might be a strong negative result. And a strong negative result is actually often much harder to get or much more interesting than a positive result. And then it generally goes into, as any organization, you would have people finding your project or any other project cool and fancy. And so you would have that kind of phase of growing up compute allocation for it all the way to a point. And then maybe you reach an apex and then maybe you go back mostly to zero and restart the process because you're going in a different direction or something else. That's how I felt. Explore, exploit. Yeah, exactly. Exactly. Exactly. It's a reinforcement learning approach.
Swyx [00:12:14]: Classic PhD student search process.
Alessio [00:12:17]: And you were reporting to Ilya, like the results you were kind of bringing back to him or like what's the structure? It's almost like when you're doing such cutting edge research, you need to report to somebody who is actually really smart to understand that the direction is right.
Stan [00:12:29]: So we had a reasoning team, which was working on reasoning, obviously, and so math in general. And that team had a manager, but Ilya was extremely involved in the team as an advisor, I guess. Since he brought me in OpenAI, I was lucky to mostly during the first years to have kind of a direct access to him. He would really coach me as a trainee researcher, I guess, with good engineering skills. And Ilya, I think at OpenAI, he was the one showing the North Star, right? He was his job and I think he really enjoyed it and he did it super well, was going through the teams and saying, this is where we should be going and trying to, you know, flock the different teams together towards an objective.
Swyx [00:13:12]: I would say like the public perception of him is that he was the strongest believer in scaling. Oh, yeah. Obviously, he has always pursued the compression thesis. You have worked with him personally, what does the public not know about how he works?
Stan [00:13:26]: I think he's really focused on building the vision and communicating the vision within the company, which was extremely useful. I was personally surprised that he spent so much time, you know, working on communicating that vision and getting the teams to work together versus...
Swyx [00:13:40]: To be specific, vision is AGI? Oh, yeah.
Stan [00:13:42]: Vision is like, yeah, it's the belief in compression and scanning computes. I remember when I started working on the Reasoning team, the excitement was really about scaling the compute around Reasoning and that was really the belief we wanted to ingrain in the team. And that's what has been useful to the team and with the DeepMind results shows that it was the right approach with the success of GPT-4 and stuff shows that it was the right approach.
Swyx [00:14:06]: Was it according to the neural scaling laws, the Kaplan paper that was published?
Stan [00:14:12]: I think it was before that, because those ones came with GPT-3, basically at the time of GPT-3 being released or being ready internally. But before that, there really was a strong belief in scale. I think it was just the belief that the transformer was a generic enough architecture that you could learn anything. And that was just a question of scaling.
Alessio [00:14:33]: Any other fun stories you want to tell? Sam Altman, Greg, you know, anything.
Stan [00:14:37]: Weirdly, I didn't work that much with Greg when I was at OpenAI. He had always been mostly focused on training the GPTs and rightfully so. One thing about Sam Altman, he really impressed me because when I joined, he had joined not that long ago and it felt like he was kind of a very high level CEO. And I was mind blown by how deep he was able to go into the subjects within a year or something, all the way to a situation where when I was having lunch by year two, I was at OpenAI with him. He would just quite know deeply what I was doing. With no ML background. Yeah, with no ML background, but I didn't have any either, so I guess that explains why. But I think it's a question about, you don't necessarily need to understand the very technicalities of how things are done, but you need to understand what's the goal and what's being done and what are the recent results and all of that in you. And we could have kind of a very productive discussion. And that really impressed me, given the size at the time of OpenAI, which was not negligible.
Swyx [00:15:44]: Yeah. I mean, you've been a, you were a founder before, you're a founder now, and you've seen Sam as a founder. How has he affected you as a founder?
Stan [00:15:51]: I think having that capability of changing the scale of your attention in the company, because most of the time you operate at a very high level, but being able to go deep down and being in the known of what's happening on the ground is something that I feel is really enlightening. That's not a place in which I ever was as a founder, because first company, we went all the way to 10 people. Current company, there's 25 of us. So the high level, the sky and the ground are pretty much at the same place. No, you're being too humble.
Swyx [00:16:21]: I mean, Stripe was also like a huge rocket ship.
Stan [00:16:23]: Stripe, I was a founder. So I was, like at OpenAI, I was really happy being on the ground, pushing the machine, making it work. Yeah.
Swyx [00:16:31]: Last OpenAI question. The Anthropic split you mentioned, you were around for that. Very dramatic. David also left around that time, you left. This year, we've also had a similar management shakeup, let's just call it. Can you compare what it was like going through that split during that time? And then like, does that have any similarities now? Like, are we going to see a new Anthropic emerge from these folks that just left?
Stan [00:16:54]: That I really, really don't know. At the time, the split was pretty surprising because they had been trying GPT-3, it was a success. And to be completely transparent, I wasn't in the weeds of the splits. What I understood of it is that there was a disagreement of the commercialization of that technology. I think the focal point of that disagreement was the fact that we started working on the API and wanted to make those models available through an API. Is that really the core disagreement? I don't know.
Swyx [00:17:25]: Was it safety?
Stan [00:17:26]: Was it commercialization?
Swyx [00:17:27]: Or did they just want to start a company?
Stan [00:17:28]: Exactly. Exactly. That I don't know. But I think what I was surprised of is how quickly OpenAI recovered at the time. And I think it's just because we were mostly a research org and the mission was so clear that some divergence in some teams, some people leave, the mission is still there. We have the compute. We have a site. So it just keeps going.
Swyx [00:17:50]: Very deep bench. Like just a lot of talent. Yeah.
Alessio [00:17:53]: So that was the OpenAI part of the history. Exactly. So then you leave OpenAI in September 2022. And I would say in Silicon Valley, the two hottest companies at the time were you and Lanktrain. What was that start like and why did you decide to start with a more developer focused kind of like an AI engineer tool rather than going back into some more research and something else?
Stan [00:18:15]: Yeah. First, I'm not a trained researcher. So going through OpenAI was really kind of the PhD I always wanted to do. But research is hard. You're digging into a field all day long for weeks and weeks and weeks, and you find something, you get super excited for 12 seconds. And at the 13 seconds, you're like, oh, yeah, that was obvious. And you go back to digging. I'm not a trained, like formally trained researcher, and it wasn't kind of a necessarily an ambition of me of creating, of having a research career. And I felt the hardness of it. I enjoyed a lot of like that a ton. But at the time, I decided that I wanted to go back to something more productive. And the other fun motivation was like, I mean, if we believe in AGI and if we believe the timelines might not be too long, it's actually the last train leaving the station to start a company. After that, it's going to be computers all the way down. And so that was kind of the true motivation for like trying to go there. So that's kind of the core motivation at the beginning of personally. And the motivation for starting a company was pretty simple. I had seen GPT-4 internally at the time, it was September 2022. So it was pre-GPT, but GPT-4 was ready since, I mean, I'd been ready for a few months internally. I was like, okay, that's obvious, the capabilities are there to create an insane amount of value to the world. And yet the deployment is not there yet. The revenue of OpenAI at the time were ridiculously small compared to what it is today. So the thesis was, there's probably a lot to be done at the product level to unlock the usage.
Alessio [00:19:49]: Yeah. Let's talk a bit more about the form factor, maybe. I think one of the first successes you had was kind of like the WebGPT-like thing, like using the models to traverse the web and like summarize things. And the browser was really the interface. Why did you start with the browser? Like what was it important? And then you built XP1, which was kind of like the browser extension.
Stan [00:20:09]: So the starting point at the time was, if you wanted to talk about LLMs, it was still a rather small community, a community of mostly researchers and to some extent, very early adopters, very early engineers. It was almost inconceivable to just build a product and go sell it to the enterprise, though at the time there was a few companies doing that. The one on marketing, I don't remember its name, Jasper. But so the natural first intention, the first, first, first intention was to go to the developers and try to create tooling for them to create product on top of those models. And so that's what Dust was originally. It was quite different than Lanchain, and Lanchain just beat the s**t out of us, which is great. It's a choice.
Swyx [00:20:53]: You were cloud, in closed source. They were open source.
Stan [00:20:56]: Yeah. So technically we were open source and we still are open source, but I think that doesn't really matter. I had the strong belief from my research time that you cannot create an LLM-based workflow on just one example. Basically, if you just have one example, you overfit. So as you develop your interaction, your orchestration around the LLM, you need a dozen examples. Obviously, if you're running a dozen examples on a multi-step workflow, you start paralyzing stuff. And if you do that in the console, you just have like a messy stream of tokens going out and it's very hard to observe what's going there. And so the idea was to go with an UI so that you could kind of introspect easily the output of each interaction with the model and dig into there through an UI, which is-
Swyx [00:21:42]: Was that open source? I actually didn't come across it.
Stan [00:21:44]: Oh yeah, it wasn't. I mean, Dust is entirely open source even today. We're not going for an open source-
Swyx [00:21:48]: If it matters, I didn't know that.
Stan [00:21:49]: No, no, no, no, no. The reason why is because we're not open source because we're not doing an open source strategy. It's not an open source go-to-market at all. We're open source because we can and it's fun.
Swyx [00:21:59]: Open source is marketing. You have all the downsides of open source, which is like people can clone you.
Stan [00:22:03]: But I think that downside is a big fallacy. Okay. Yes, anybody can clone Dust today, but the value of Dust is not the current state. The value of Dust is the number of eyeballs and hands of developers that are creating to it in the future. And so yes, anybody can clone it today, but that wouldn't change anything. There is some value in being open source. In a discussion with the security team, you can be extremely transparent and just show the code. When you have discussion with users and there's a bug or a feature missing, you can just point to the issue, show the pull request, show the, show the, exactly, oh, PR welcome. That doesn't happen that much, but you can show the progress if the person that you're chatting with is a little bit technical, they really enjoy seeing the pull request advancing and seeing all the way to deploy. And then the downsides are mostly around security. You never want to do security by obfuscation. But the truth is that your vector of attack is facilitated by you being open source. But at the same time, it's a good thing because if you're doing anything like a bug bountying or stuff like that, you just give much more tools to the bug bountiers so that their output is much better. So there's many, many, many trade-offs. I don't believe in the value of the code base per se. I think it's really the people that are on the code base that have the value and go to market and the product and all of those things that are around the code base. Obviously, that's not true for every code base. If you're working on a very secret kernel to accelerate the inference of LLMs, I would buy that you don't want to be open source. But for product stuff, I really think there's very little risk. Yeah.
Alessio [00:23:39]: I signed up for XP1, I was looking, January 2023. I think at the time you were on DaVinci 003. Given that you had seen GPD 4, how did you feel having to push a product out that was using this model that was so inferior? And you're like, please, just use it today. I promise it's going to get better. Just overall, as a founder, how do you build something that maybe doesn't quite work with the model today, but you're just expecting the new model to be better?
Stan [00:24:03]: Yeah, so actually, XP1 was even on a smaller one that was the post-GDPT release, small version, so it was... Ada, Babbage... No, no, no, not that far away. But it was the small version of GDPT, basically. I don't remember its name. Yes, you have a frustration there. But at the same time, I think XP1 was designed, was an experiment, but was designed as a way to be useful at the current capability of the model. If you just want to extract data from a LinkedIn page, that model was just fine. If you want to summarize an article on a newspaper, that model was just fine. And so it was really a question of trying to find a product that works with the current capability, knowing that you will always have tailwinds as models get better and faster and cheaper. So that was kind of a... There's a bit of a frustration because you know what's out there and you know that you don't have access to it yet. It's also interesting to try to find a product that works with the current capability.
Alessio [00:24:55]: And we highlighted XP1 in our anatomy of autonomy post in April of last year, which was, you know, where are all the agents, right? So now we spent 30 minutes getting to what you're building now. So you basically had a developer framework, then you had a browser extension, then you had all these things, and then you kind of got to where Dust is today. So maybe just give people an overview of what Dust is today and the courtesies behind it. Yeah, of course.
Stan [00:25:20]: So Dust, we really want to build the infrastructure so that companies can deploy agents within their teams. We are horizontal by nature because we strongly believe in the emergence of use cases from the people having access to creating an agent that don't need to be developers. They have to be thinkers. They have to be curious. But anybody can create an agent that will solve an operational thing that they're doing in their day-to-day job. And to make those agents useful, there's two focus, which is interesting. The first one is an infrastructure focus. You have to build the pipes so that the agent has access to the data. You have to build the pipes such that the agents can take action, can access the web, et cetera. So that's really an infrastructure play. Maintaining connections to Notion, Slack, GitHub, all of them is a lot of work. It is boring work, boring infrastructure work, but that's something that we know is extremely valuable in the same way that Stripe is extremely valuable because it maintains the pipes. And we have that dual focus because we're also building the product for people to use it. And there it's fascinating because everything started from the conversational interface, obviously, which is a great starting point. But we're only scratching the surface, right? I think we are at the pong level of LLM productization. And we haven't invented the C3. We haven't invented Counter-Strike. We haven't invented Cyberpunk 2077. So this is really our mission is to really create the product that lets people equip themselves to just get away all the work that can be automated or assisted by LLMs.
Alessio [00:26:57]: And can you just comment on different takes that people had? So maybe the most open is like auto-GPT. It's just kind of like just trying to do anything. It's like it's all magic. There's no way for you to do anything. Then you had the ADAPT, you know, we had David on the podcast. They're very like super hands-on with each individual customer to build super tailored. How do you decide where to draw the line between this is magic? This is exposed to you, especially in a market where most people don't know how to build with AI at all. So if you expect them to do the thing, they're probably not going to do it. Yeah, exactly.
Stan [00:27:29]: So the auto-GPT approach obviously is extremely exciting, but we know that the agentic capability of models are not quite there yet. It just gets lost. So we're starting, we're starting where it works. Same with the XP one. And where it works is pretty simple. It's like simple workflows that involve a couple tools where you don't even need to have the model decide which tools it's used in the sense of you just want people to put it in the instructions. It's like take that page, do that search, pick up that document, do the work that I want in the format I want, and give me the results. There's no smartness there, right? In terms of orchestrating the tools, it's mostly using English for people to program a workflow where you don't have the constraint of having compatible API between the two.
Swyx [00:28:17]: That kind of personal automation, would you say it's kind of like an LLM Zapier type of
Stan [00:28:22]: thing?
Swyx [00:28:22]: Like if this, then that, and then, you know, do this, then this. You're programming with English?
Stan [00:28:28]: So you're programming with English. So you're just saying, oh, do this and then that. You can even create some form of APIs. You say, when I give you the command X, do this. When I give you the command Y, do this. And you describe the workflow. But you don't have to create boxes and create the workflow explicitly. It just needs to describe what are the tasks supposed to be and make the tool available to the agent. The tool can be a semantic search. The tool can be querying into a structured database. The tool can be searching on the web. And obviously, the interesting tools that we're only starting to scratch are actually creating external actions like reimbursing something on Stripe, sending an email, clicking on a button in the admin or something like that.
Swyx [00:29:11]: Do you maintain all these integrations?
Stan [00:29:13]: Today, we maintain most of the integrations. We do always have an escape hatch for people to kind of custom integrate. But the reality is that the reality of the market today is that people just want it to work, right? And so it's mostly us maintaining the integration. As an example, a very good source of information that is tricky to productize is Salesforce. Because Salesforce is basically a database and a UI. And they do the f**k they want with it. And so every company has different models and stuff like that. So right now, we don't support it natively. And the type of support or real native support will be slightly more complex than just osing into it, like is the case with Slack as an example. Because it's probably going to be, oh, you want to connect your Salesforce to us? Give us the SQL. That's the Salesforce QL language. Give us the queries you want us to run on it and inject in the context of dust. So that's interesting how not only integrations are cool, and some of them require a bit of work on the user. And for some of them that are really valuable to our users, but we don't support yet, they can just build them internally and push the data to us.
Swyx [00:30:18]: I think I understand the Salesforce thing. But let me just clarify, are you using browser automation because there's no API for something?
Stan [00:30:24]: No, no, no, no. In that case, so we do have browser automation for all the use cases and apply the public web. But for most of the integration with the internal system of the company, it really runs through API.
Swyx [00:30:35]: Haven't you felt the pull to RPA, browser automation, that kind of stuff?
Stan [00:30:39]: I mean, what I've been saying for a long time, maybe I'm wrong, is that if the future is that you're going to stand in front of a computer and looking at an agent clicking on stuff, then I'll hit my computer. And my computer is a big Lenovo. It's black. Doesn't sound good at all compared to a Mac. And if the APIs are there, we should use them. There is going to be a long tail of stuff that don't have APIs, but as the world is moving forward, that's disappearing. So the core API value in the past has really been, oh, this old 90s product doesn't have an API. So I need to use the UI to automate. I think for most of the ICP companies, the companies that ICP for us, the scale ups that are between 500 and 5,000 people, tech companies, most of the SaaS they use have APIs. Now there's an interesting question for the open web, because there are stuff that you want to do that involve websites that don't necessarily have APIs. And the current state of web integration from, which is us and OpenAI and Anthropic, I don't even know if they have web navigation, but I don't think so. The current state of affair is really, really broken because you have what? You have basically search and headless browsing. But headless browsing, I think everybody's doing basically body.innertext and fill that into the model, right?
Swyx [00:31:56]: MARK MIRCHANDANI There's parsers into Markdown and stuff.
Stan [00:31:58]: FRANCESC CAMPOY I'm super excited by the companies that are exploring the capability of rendering a web page into a way that is compatible for a model, being able to maintain the selector. So that's basically the place where to click in the page through that process, expose the actions to the model, have the model select an action in a way that is compatible with model, which is not a big page of a full DOM that is very noisy, and then being able to decompress that back to the original page and take the action. And that's something that is really exciting and that will kind of change the level of things that agents can do on the web. That I feel exciting, but I also feel that the bulk of the useful stuff that you can do within the company can be done through API. The data can be retrieved by API. The actions can be taken through API.
Swyx [00:32:44]: For listeners, I'll note that you're basically completely disagreeing with David Wan. FRANCESC CAMPOY Exactly, exactly. I've seen it since it's summer. ADEPT is where it is, and Dust is where it is. So Dust is still standing.
Alessio [00:32:55]: Can we just quickly comment on function calling? You mentioned you don't need the models to be that smart to actually pick the tools. Have you seen the models not be good enough? Or is it just like, you just don't want to put the complexity in there? Like, is there any room for improvement left in function calling? Or do you feel you usually consistently get always the right response, the right parameters
Stan [00:33:13]: and all of that?
Alessio [00:33:13]: FRANCESC CAMPOY So that's a tricky product question.
Stan [00:33:15]: Because if the instructions are good and precise, then you don't have any issue, because it's scripted for you. And the model will just look at the scripts and just follow and say, oh, he's probably talking about that action, and I'm going to use it. And the parameters are kind of abused from the state of the conversation. I'll just go with it. If you provide a very high level, kind of an auto-GPT-esque level in the instructions and provide 16 different tools to your model, yes, we're seeing the models in that state making mistakes. And there is obviously some progress can be made on the capabilities. But the interesting part is that there is already so much work that can assist, augment, accelerate by just going with pretty simply scripted for actions agents. What I'm excited about by pushing our users to create rather simple agents is that once you have those working really well, you can create meta agents that use the agents as actions. And all of a sudden, you can kind of have a hierarchy of responsibility that will probably get you almost to the point of the auto-GPT value. It requires the construction of intermediary artifacts, but you're probably going to be able to achieve something great. I'll give you some example. We have our incidents are shared in Slack in a specific channel, or shipped are shared in Slack. We have a weekly meeting where we have a table about incidents and shipped stuff. We're not writing that weekly meeting table anymore. We have an assistant that just go find the right data on Slack and create the table for us. And that assistant works perfectly. It's trivially simple, right? Take one week of data from that channel and just create the table. And then we have in that weekly meeting, obviously some graphs and reporting about our financials and our progress and our ARR. And we've created assistants to generate those graphs directly. And those assistants works great. By creating those assistants that cover those small parts of that weekly meeting, slowly we're getting to in a world where we'll have a weekly meeting assistance. We'll just call it. You don't need to prompt it. You don't need to say anything. It's going to run those different assistants and get that notion page just ready. And by doing that, if you get there, and that's an objective for us to us using Dust, get there, you're saving an hour of company time every time you run it. Yeah.
Alessio [00:35:28]: That's my pet topic of NPM for agents. How do you build dependency graphs of agents? And how do you share them? Because why do I have to rebuild some of the smaller levels of what you built already?
Swyx [00:35:40]: I have a quick follow-up question on agents managing other agents. It's a topic of a lot of research, both from Microsoft and even in startups. What you've discovered best practice for, let's say like a manager agent controlling a bunch of small agents. It's two-way communication. I don't know if there should be a protocol format.
Stan [00:35:59]: To be completely honest, the state we are at right now is creating the simple agents. So we haven't even explored yet the meta agents. We know it's there. We know it's going to be valuable. We know it's going to be awesome. But we're starting there because it's the simplest place to start. And it's also what the market understands. If you go to a company, random SaaS B2B company, not necessarily specialized in AI, and you take an operational team and you tell them, build some tooling for yourself, they'll understand the small agents. If you tell them, build AutoGP, they'll be like, Auto what?
Swyx [00:36:31]: And I noticed that in your language, you're very much focused on non-technical users. You don't really mention API here. You mention instruction instead of system prompt, right? That's very conscious.
Stan [00:36:41]: Yeah, it's very conscious. It's a mark of our designer, Ed, who kind of pushed us to create a friendly product. I was knee-deep into AI when I started, obviously. And my co-founder, Gabriel, was a Stripe as well. We started a company together that got acquired by Stripe 15 years ago. It was at Alain, a healthcare company in Paris. After that, it was a little bit less so knee-deep in AI, but really focused on product. And I didn't realize how important it is to make that technology not scary to end users. It didn't feel scary to me, but it was really seen by Ed, our designer, that it was feeling scary to the users. And so we were very proactive and very deliberate about creating a brand that feels not too scary and creating a wording and a language, as you say, that really tried to communicate the fact that it's going to be fine. It's going to be easy. You're going to make it.
Alessio [00:37:34]: And another big point that David had about ADAPT is we need to build an environment for the agents to act. And then if you have the environment, you can simulate what they do. How's that different when you're interacting with APIs and you're kind of touching systems that you cannot really simulate? If you call it the Salesforce API, you're just calling it.
Stan [00:37:52]: So I think that goes back to the DNA of the companies that are very different. ADAPT, I think, was a product company with a very strong research DNA, and they were still doing research. One of their goals was building a model. And that's why they raised a large amount of money, et cetera. We are 100% deliberately a product company. We don't do research. We don't train models. We don't even run GPUs. We're using the models that exist, and we try to push the product boundary as far as possible with the existing models. So that creates an issue. Indeed, so to answer your question, when you're interacting in the real world, well, you cannot simulate, so you cannot improve the models. Even improving your instructions is complicated for a builder. The hope is that you can use models to evaluate the conversations so that you can get at least feedback and you could get contradictive information about the performance of the assistance. But if you take actual trace of interaction of humans with those agents, it is even for us humans extremely hard to decide whether it was a productive interaction or a really bad interaction. You don't know why the person left. You don't know if they left happy or not. So being extremely, extremely, extremely pragmatic here, it becomes a product issue. We have to build a product that identifies the end users to provide feedback so that as a first step, the person that is building the agent can iterate on it. As a second step, maybe later when we start training model and post-training, et cetera, we can optimize around that for each of those companies. Yeah.
Alessio [00:39:17]: Do you see in the future products offering kind of like a simulation environment, the same way all SaaS now kind of offers APIs to build programmatically? Like in cybersecurity, there are a lot of companies working on building simulative environments so that then you can use agents like Red Team, but I haven't really seen that.
Stan [00:39:34]: Yeah, no, me neither. That's a super interesting question. I think it's really going to depend on how much, because you need to simulate to generate data, you need to train data to train models. And the question at the end is, are we going to be training models or are we just going to be using frontier models as they are? On that question, I don't have a strong opinion. It might be the case that we'll be training models because in all of those AI first products, the model is so close to the product surface that as you get big and you want to really own your product, you're going to have to own the model as well. Owning the model doesn't mean doing the pre-training, that would be crazy. But at least having an internal post-training realignment loop, it makes a lot of sense. And so if we see many companies going towards that all the time, then there might be incentives for the SaaS's of the world to provide assistance in getting there. But at the same time, there's a tension because those SaaS, they don't want to be interacted by agents, they want the human to click on the button. Yeah, they got to sell seats. Exactly.
Swyx [00:40:41]: Just a quick question on models. I'm sure you've used many, probably not just OpenAI. Would you characterize some models as better than others? Do you use any open source models? What have been the trends in models over the last two years?
Stan [00:40:53]: We've seen over the past two years kind of a bit of a race in between models. And at times, it's the OpenAI model that is the best. At times, it's the Anthropic models that is the best. Our take on that is that we are agnostic and we let our users pick their model. Oh, they choose? Yeah, so when you create an assistant or an agent, you can just say, oh, I'm going to run it on GP4, GP4 Turbo, or...
Swyx [00:41:16]: Don't you think for the non-technical user, that is actually an abstraction that you should take away from them?
Stan [00:41:20]: We have a sane default. So we move the default to the latest model that is cool. And we have a sane default, and it's actually not very visible. In our flow to create an agent, you would have to go in advance and go pick your model. So this is something that the technical person will care about. But that's something that obviously is a bit too complicated for the...
Swyx [00:41:40]: And do you care most about function calling or instruction following or something else?
Stan [00:41:44]: I think we care most for function calling because you want to... There's nothing worse than a function call, including incorrect parameters or being a bit off because it just drives the whole interaction off.
Swyx [00:41:56]: Yeah, so got the Berkeley function calling.
Stan [00:42:00]: These days, it's funny how the comparison between GP4O and GP4 Turbo is still up in the air on function calling. I personally don't have proof, but I know many people, and I'm probably part of them, to think that GP4 Turbo is still better than GP4O on function calling. Wow. We'll see what comes out of the O1 class if it ever gets function calling. And Cloud 3.5 Summit is great as well. They kind of innovated in an interesting way, which was never quite publicized. But it's that they have that kind of chain of thought step whenever you use a Cloud model or Summit model with function calling. That chain of thought step doesn't exist when you just interact with it just for answering questions. But when you use function calling, you get that step, and it really helps getting better function calling.
Swyx [00:42:43]: Yeah, we actually just recorded a podcast with the Berkeley team that runs that leaderboard this week. So they just released V3.
Stan [00:42:49]: Yeah.
Swyx [00:42:49]: It was V1 like two months ago, and then they V2, V3. Turbo is on top.
Stan [00:42:53]: Turbo is on top. Turbo is over 4.0.
Swyx [00:42:54]: And then the third place is XLAM from Salesforce, which is a large action model they've been trying to popularize.
Stan [00:43:01]: Yep.
Swyx [00:43:01]: O1 Mini is actually on here, I think. O1 Mini is number 11.
Stan [00:43:05]: But arguably, O1 Mini has been in a line for that. Yeah.
Alessio [00:43:09]: Do you use leaderboards? Do you have your own evals? I mean, this is kind of intuitive, right? Like using the older model is better. I think most people just upgrade. Yeah. What's the eval process like?
Stan [00:43:19]: It's funny because I've been doing research for three years, and we have bigger stuff to cook. When you're deploying in a company, one thing where we really spike is that when we manage to activate the company, we have a crazy penetration. The highest penetration we have is 88% daily active users within the entire employee of the company. The kind of average penetration and activation we have in our current enterprise customers is something like more like 60% to 70% weekly active. So we basically have the entire company interacting with us. And when you're there, there is so many stuff that matters most than getting evals, getting the best model. Because there is so many places where you can create products or do stuff that will give you the 80% with the work you do. Whereas deciding if it's GPT-4 or GPT-4 Turbo or et cetera, you know, it'll just give you the 5% improvement. But the reality is that you want to focus on the places where you can really change the direction or change the interaction more drastically. But that's something that we'll have to do eventually because we still want to be serious people.
Swyx [00:44:24]: It's funny because in some ways, the model labs are competing for you, right? You don't have to do any effort. You just switch model and then it'll grow. What are you really limited by? Is it additional sources?
Stan [00:44:36]: It's not models, right?
Swyx [00:44:37]: You're not really limited by quality of model.
Stan [00:44:40]: Right now, we are limited by the infrastructure part, which is the ability to connect easily for users to all the data they need to do the job they want to do.
Swyx [00:44:51]: Because you maintain all your own stuff.
Stan [00:44:53]: You know, there are companies out there
Swyx [00:44:54]: that are starting to provide integrations as a service, right? I used to work in an integrations company. Yeah, I know.
Stan [00:44:59]: It's just that there is some intricacies about how you chunk stuff and how you process information from one platform to the other. If you look at the end of the spectrum, you could think of, you could say, oh, I'm going to support AirByte and AirByte has- I used to work at AirByte.
Swyx [00:45:12]: Oh, really?
Stan [00:45:13]: That makes sense.
Swyx [00:45:14]: They're the French founders as well.
Stan [00:45:15]: I know Jean very well. I'm seeing him today. And the reality is that if you look at Notion, AirByte does the job of taking Notion and putting it in a structured way. But that's the way it is not really usable to actually make it available to models in a useful way. Because you get all the blocks, details, et cetera, which is useful for many use cases.
Swyx [00:45:35]: It's also for data scientists and not for AI.
Stan [00:45:38]: The reality of Notion is that sometimes you have a- so when you have a page, there's a lot of structure in it and you want to capture the structure and chunk the information in a way that respects that structure. In Notion, you have databases. Sometimes those databases are real tabular data. Sometimes those databases are full of text. You want to get the distinction and understand that this database should be considered like text information, whereas this other one is actually quantitative information. And to really get a very high quality interaction with that piece of information, I haven't found a solution that will work without us owning the connection end-to-end.
Swyx [00:46:15]: That's why I don't invest in, there's Composio, there's All Hands from Graham Newbig. There's all these other companies that are like, we will do the integrations for you. You just, we have the open source community. We'll do off the shelf. But then you are so specific in your needs that you want to own it.
Swyx [00:46:28]: Yeah, exactly.
Stan [00:46:29]: You can talk to Michel about that.
Swyx [00:46:30]: You know, he wants to put the AI in there, but you know. Yeah, I will. I will.
Stan [00:46:35]: Cool. What are we missing?
Alessio [00:46:36]: You know, what are like the things that are like sneakily hard that you're tackling that maybe people don't even realize they're like really hard?
Stan [00:46:43]: The real parts as we kind of touch base throughout the conversation is really building the infra that works for those agents because it's a tenuous walk. It's an evergreen piece of work because you always have an extra integration that will be useful to a non-negligible set of your users. I'm super excited about is that there's so many interactions that shouldn't be conversational interactions and that could be very useful. Basically, know that we have the firehose of information of those companies and there's not going to be that many companies that capture the firehose of information. When you have the firehose of information, you can do a ton of stuff with models that are just not accelerating people, but giving them superhuman capability, even with the current model capability because you can just sift through much more information. An example is documentation repair. If I have the firehose of Slack messages and new Notion pages, if somebody says, I own that page, I want to be updated when there is a piece of information that should update that page, this is not possible. You get an email saying, oh, look at that Slack message. It says the opposite of what you have in that paragraph. Maybe you want to update or just ping that person. I think there is a lot to be explored on the product layer in terms of what it means to interact productively with those models. And that's a problem that's extremely hard and extremely exciting.
Swyx [00:48:00]: One thing you keep mentioning about infra work, obviously, Dust is building that infra and serving that in a very consumer-friendly way. You always talk about infra being additional sources, additional connectors. That is very important. But I'm also interested in the vertical infra. There is an orchestrator underlying all these things where you're doing asynchronous work. For example, the simplest one is a cron job. You just schedule things. But also, for if this and that, you have to wait for something to be executed and proceed to the next task. I used to work on an orchestrator as well, Temporal.
Stan [00:48:31]: We used Temporal. Oh, you used Temporal? Yeah. Oh, how was the experience?
Swyx [00:48:34]: I need the NPS.
Stan [00:48:36]: We're doing a self-discovery call now.
Swyx [00:48:39]: But you can also complain to me because I don't work there anymore.
Stan [00:48:42]: No, we love Temporal. There's some edges that are a bit rough, surprisingly rough. And you would say, why is it so complicated?
Swyx [00:48:49]: It's always versioning.
Stan [00:48:50]: Yeah, stuff like that. But we really love it. And we use it for exactly what you said, like managing the entire set of stuff that needs to happen so that in semi-real time, we get all the updates from Slack or Notion or GitHub into the system. And whenever we see that piece of information goes through, maybe trigger workflows to run agents because they need to provide alerts to users and stuff like that. And Temporal is great. Love it.
Swyx [00:49:17]: You haven't evaluated others. You don't want to build your own. You're happy with...
Stan [00:49:21]: Oh, no, we're not in the business of replacing Temporal. And Temporal is so... I mean, it is or any other competitive product. They're very general. If it's there, there's an interesting theory about buy versus build. I think in that case, when you're a high-growth company, your buy-build trade-off is very much on the side of buy. Because if you have the capability, you're just going to be saving time, you can focus on your core competency, etc. And it's funny because we're seeing, we're starting to see the post-high-growth company, post-SKF company, going back on that trade-off, interestingly. So that's the cloud news about removing Zendesk and Salesforce. Do you believe that, by the way?
Alessio [00:49:56]: Yeah, I did a podcast with them.
Stan [00:49:58]: Oh, yeah?
Alessio [00:49:58]: It's true.
Swyx [00:49:59]: No, no, I know.
Stan [00:50:00]: Of course they say it's true,
Swyx [00:50:00]: but also how well is it going to go?
Stan [00:50:02]: So I'm not talking about deflecting the customer traffic. I'm talking about building AI on top of Salesforce and Zendesk, basically, if I understand correctly. And all of a sudden, your product surface becomes much smaller because you're interacting with an AI system that will take some actions. And so all of a sudden, you don't need the product layer anymore. And you realize that, oh, those things are just databases that I pay a hundred times the price, right? Because you're a post-SKF company and you have tech capabilities, you are incentivized to reduce your costs and you have the capability to do so. And then it makes sense to just scratch the SaaS away. So it's interesting that we might see kind of a bad time for SaaS in post-hyper-growth tech companies. So it's still a big market, but it's not that big because if you're not a tech company, you don't have the capabilities to reduce that cost. If you're a high-growth company, always going to be buying because you go faster with that. But that's an interesting new space, new category of companies that might remove some SaaS. Yeah, Alessio's firm
Swyx [00:51:02]: has an interesting thesis on the future of SaaS in AI.
Alessio [00:51:05]: Service as a software, we call it. It's basically like, well, the most extreme is like, why is there any software at all? You know, ideally, it's all a labor interface where you're asking somebody to do something for you, whether that's a person, an AI agent or whatnot.
Stan [00:51:17]: Yeah, yeah, that's interesting. I have to ask.
Swyx [00:51:19]: Are you paying for Temporal Cloud or are you self-hosting?
Stan [00:51:22]: Oh, no, no, we're paying, we're paying. Oh, okay, interesting.
Swyx [00:51:24]: We're paying way too much.
Stan [00:51:26]: It's crazy expensive, but it makes us-
Swyx [00:51:28]: That's why as a shareholder, I like to hear that. It makes us go faster,
Stan [00:51:31]: so we're happy to pay.
Swyx [00:51:33]: Other things in the infrastack, I just want a list for other founders to think about. Ops, API gateway, evals, you know, anything interesting there that you build or buy?
Stan [00:51:41]: I mean, there's always an interesting question. We've been building a lot around the interface between models and because Dust, the original version, was an orchestration platform and we basically provide a unified interface to every model providers.
Swyx [00:51:56]: That's what I call gateway.
Stan [00:51:57]: That we add because Dust was that and so we continued building upon and we own it. But that's an interesting question was in you, you want to build that or buy it?
Swyx [00:52:06]: Yeah, I always say light LLM is the current open source consensus.
Stan [00:52:09]: Exactly, yeah. There's an interesting question there.
Swyx [00:52:12]: Ops, Datadog, just tracking.
Stan [00:52:14]: Oh yeah, so Datadog is an obvious... What are the mistakes that I regret? I started as pure JavaScript, not TypeScript, and I think you want to, if you're wondering, oh, I want to go fast, I'll do a little bit of JavaScript. No, don't, just start with TypeScript. I see, okay.
Swyx [00:52:30]: So interesting, you are a research engineer that came out of OpenAI that bet on TypeScript.
Stan [00:52:36]: Well, the reality is that if you're building a product, you're going to be doing a lot of JavaScript, right? And Next, we're using Next as an example. It's a great platform. And our internal service is actually not built in Python either, it's built in Rust.
Swyx [00:52:50]: That's another fascinating choice. The Next.js story is interesting because Next.js is obviously the king of the world in JavaScript land, but recently ChachiBT just rewrote from Next.js to Remix. We are going to be having them on to talk about the big rewrite. That is like the biggest news in front-end world in a while.
Stan [00:53:06]: All right, just to wrap,
Alessio [00:53:07]: in 2023, you predicted the first billion dollar company with just one person running it, and you said that's basically like a sign of AGI, once we get there. And you said it had already been started. Any 2024 updates on the take?
Stan [00:53:20]: That quote was probably independently invented it, but Sam Altman stole it from me eventually. But anyway, it's a good quote. So I hypothesized it was maybe already being started, but if it's a uniperson company, it would probably grow really fast, and so we should probably see it already. I guess we're going to have to wait for it a little bit. And I think it's because the dust of the world don't exist. And so you don't have that thing that lets you run those, just do anything with models. But one thing that is exciting is maybe that we're going to be able to scale a team much further than before. All generations of company might be the first billion dollar companies with engineering teams of 20 people. That would be so exciting as well. That would be so great. You know, you don't have the management hurdle, you're just 20 focused people with a lot of assistance from machines to achieve your job. That would be great. And that I believe in a bit more. Yeah.
Alessio [00:54:14]: I've written a post called Maximum Enterprise Utilization, kind of like you have MFU for GPUs, but it's basically like so many people are focused on, oh, it's going to like displace jobs and whatnot. But I'm like, there's so much work that people don't do because they don't have the people. And maybe the question is that you just don't scale to that size, you know, to begin with. And maybe everybody will use Dust and Dust is only going to be 20 people and then people using Dust will be two people.
Swyx [00:54:39]: So my hot take is, I actually know what vertical they'll be in. They'll be content creators and podcasters.
Alessio [00:54:44]: There's already two of us, so we're a max capacity.
Swyx [00:54:47]: Most people would regard Jimmy Donaldson, like Mr. Beast as a billionaire, but his team is, he's got about like 200 people. So he's not a single person company. The closer one actually is Joe Rogan, where he basically just has like a guy. Hey, Jamie, put it on the screen. But Joe, I don't think, he sold his future for 250 million to Spotify. So he's not going to hit that billionaire status. The non-consensus one, it will be the Hawkswagirl.
Swyx [00:55:12]: Anyway, but like you want creators who are empowered by a bunch of agents, Dust agents to do all this stuff because then ultimately it's just the brand, the curation. What is the role of the human then? What is that one person supposed to do if you have all these agents?
Stan [00:55:28]: That's a good question. I mean, I think it was, I think it was Pinterest or Dropbox founder at the time was when you're CEO, you mostly have an editorial position. You're here to say yes and no to the things you are supposed to do.
Swyx [00:55:42]: Okay, so I make a daily AI newsletter where I just, it's 99% AI generated, but I serve the role as the editor. Like I write commentary. I choose between four options.
Stan [00:55:53]: You decide what goes in and goes out. And ultimately, as you said, you build up your brand through those many decisions.
Swyx [00:56:00]: You should pursue creators.
Stan [00:56:03]: And you've made a, I think you've made a, you've have an upcoming podcast with Notebook NLM, which has been doing a crazy stuff. That is exciting.
Swyx [00:56:09]: They were just in here yesterday. I'll tell you one agent that we need. If you want to pursue the creator market, the one agent that we haven't paid for is our video editor agent. So if you want, you need to, you know, wrap FFmpeg in a GPT.
Alessio [00:56:24]: Awesome. This was great. Anything we missed? Any final kind of like call to action hiring? It's like, obviously people should buy the product.
Stan [00:56:32]: And no, I think we didn't dive into the vertical versus horizontal approach to AI agents. We mentioned a few things. We spike at penetration and that's just awesome because we carry the tool that the entire company has and use. So we create a ton of value, but it makes our go-to-market much harder. Vertical solutions have a go-to-market that is much easier because they're like, oh, I'm going to solve the lawyer stuff. But the potential within the company after that is limited. So there's really a nice tension there. We are true believers of the horizontal approach and we'll see how that plays out. But I think it's an interesting thing to think about when as a founder or as a technical person working with agents, what do you want to solve? Do you want to solve something general or do you want to solve something specific? And it has a lot of impact on eventually what type of company you're going to build.
Swyx [00:57:21]: Yeah, I'll provide you my response on that. So I've gone the other way. I've gone products over platform. And it's basically your sense on the products drives your platform development. In other words, if you're trying to be as many things to as many people as possible, we're just trying to be one thing. We build our brand in one specific niche. And in future, if we want to choose to spin off platforms for other things, we can because we have that brand. So for example, Perplexity, we went for products in search, right? But then we also have Perplexity Labs that like here's the info that we use for search and whatever.
Stan [00:57:51]: The counter argument to that is that you always have lateral movement within companies, but if you're Zendesk, you're not going to be Zendesk- Serving web services.
Swyx [00:58:03]: There are a few, you know, there's success stories on both sides, but there's Amazon and Amazon web services, right? And sorry by platform,
Stan [00:58:08]: I don't really mean the platform as the platform platform. I mean like the product that is useful to everybody within the company. And I'll take on that is that there is so many operations within the company. Some of them have been extremely rationalized by the markets, like salespeople, like support has been extremely rationalized. And so you can probably create very powerful vertical product around that. But there is so many operations that make up a company that are specific to the company that you need a product to help people get assisted on those operations. And that's kind of the bet we have. Excellent.
Alessio [00:58:40]: Awesome, man. Thanks again for the time. Thank you very much for having me.
Stan [00:58:42]: It was so much fun. Yeah, great discussion.
Swyx [00:58:44]: Thank you.
Stan [00:58:46]: Thank you.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
In the Arena: How LMSys changed LLM Benchmarking Forever
1 nov· Latent Space: The AI Engineer Podcast
Apologies for lower audio quality; we lost recordings and had to use backup tracks.
Our guests today are Anastasios Angelopoulos and Wei-Lin Chiang, leads of Chatbot Arena, fka LMSYS, the crowdsourced AI evaluation platform developed by the LMSys student club at Berkeley, which became the de facto standard for comparing language models. Arena Elo is often more cited than MMLU scores to many folks, and they have attracted >1,000,000 people to cast votes since its launch, leading top model trainers to cite them over their own formal academic benchmarks:
The Limits of Static Benchmarks
We’ve done two benchmarks episodes: Benchmarks 101 and Benchmarks 201. One issue we’ve always brought up with static benchmarks is that 1) many are getting saturated, with models scoring almost perfectly on them 2) they often don’t reflect production use cases, making it hard for developers and users to use them as guidance.
The fundamental challenge in AI evaluation isn't technical - it's philosophical. How do you measure something that increasingly resembles human intelligence? Rather than trying to define intelligence upfront, Arena let users interact naturally with models and collect comparative feedback. It's messy and subjective, but that's precisely the point - it captures the full spectrum of what people actually care about when using AI.
The Pareto Frontier of Cost vs Intelligence
Because the Elo scores are remarkably stable over time, we can put all the chat models on a map against their respective cost to gain a view of at least 3 orders of magnitude of model sizes/costs and observe the remarkable shift in intelligence per dollar over the past year:
This frontier stood remarkably firm through the recent releases of o1-preview and price cuts of Gemini 1.5:
The Statistics of Subjectivity
In our Benchmarks 201 episode, Clémentine Fourrier from HuggingFace thought this design choice was one of shortcomings of arenas: they aren’t reproducible. You don’t know who ranked what and what exactly the outcome was at the time of ranking. That same person might rank the same pair of outputs differently on a different day, or might ask harder questions to better models compared to smaller ones, making it imbalanced.
Another argument that people have brought up is confirmation bias. We know humans prefer longer responses and are swayed by formatting - Rob Mulla from Dreadnode had found some interesting data on this in May:
The approach LMArena is taking is to use logistic regression to decompose human preferences into constituent factors. As Anastasios explains: "We can say what components of style contribute to human preference and how they contribute." By adding these style components as parameters, they can mathematically "suck out" their influence and isolate the core model capabilities.
This extends beyond just style - they can control for any measurable factor: "What if I want to look at the cost adjusted performance? Parameter count? We can ex post facto measure that."
This is one of the most interesting things about Arena: You have a data generation engine which you can clean and turn into leaderboards later. If you wanted to create a leaderboard for poetry writing, you could get existing data from Arena, normalize it by identifying these style components. Whether or not it’s possible to really understand WHAT bias the voters have, that’s a different question.
Private Evals
One of the most delicate challenges LMSYS faces is maintaining trust while collaborating with AI labs. The concern is that labs could game the system by testing multiple variants privately and only releasing the best performer. This was brought up when 4o-mini released and it ranked as the second best model on the leaderboard:
But this fear misunderstands how Arena works. Unlike static benchmarks where selection bias is a major issue, Arena's live nature means any initial bias gets washed out by ongoing evaluation. As Anastasios explains: "In the long run, there's way more fresh data than there is data that was used to compare these five models."
The other big question is WHAT model is actually being tested; as people often talk about on X / Discord, the same endpoint will randomly feel “nerfed” like it happened for “Claude European summer” and corresponding conspiracy theories:
It’s hard to keep track of these performance changes in Arena as these changes (if real…?) are not observable.
The Future of Evaluation
The team's latest work on RouteLLM points to an interesting future where evaluation becomes more granular and task-specific. But they maintain that even simple routing strategies can be powerful - like directing complex queries to larger models while handling simple tasks with smaller ones.
Arena is now going to expand beyond text into multimodal evaluation and specialized domains like code execution and red teaming. But their core insight remains: the best way to evaluate intelligence isn't to simplify it into metrics, but to embrace its complexity and find rigorous ways to analyze it. To go after this vision, they are spinning out Arena from LMSys, which will stay as an academia-driven group at Berkeley.
Full Video Podcast
Chapters
* 00:00:00 - Introductions
* 00:01:16 - Origin and development of Chatbot Arena
* 00:05:41 - Static benchmarks vs. Arenas
* 00:09:03 - Community building
* 00:13:32 - Biases in human preference evaluation
* 00:18:27 - Style Control and Model Categories
* 00:26:06 - Impact of o1
* 00:29:15 - Collaborating with AI labs
* 00:34:51 - RouteLLM and router models
* 00:38:09 - Future of LMSys / Arena
Show Notes
* Anastasios Angelopoulos
* Anastasios' NeurIPS Paper Conformal Risk Control
* Wei-Lin Chiang
* Chatbot Arena
* LMSys
* MTBench
* ShareGPT dataset
* Stanford's Alpaca project
* LLMRouter
* E2B
* Dreadnode
Transcript
Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.
Swyx [00:00:14]: Hey, and today we're very happy and excited to welcome Anastasios and Wei Lin from LMSys. Welcome guys.
Wei Lin [00:00:21]: Hey, how's it going? Nice to see you.
Anastasios [00:00:23]: Thanks for having us.
Swyx [00:00:24]: Anastasios, I actually saw you, I think at last year's NeurIPS. You were presenting a paper, which I don't really super understand, but it was some theory paper about how your method was very dominating over other sort of search methods. I don't remember what it was, but I remember that you were a very confident speaker.
Anastasios [00:00:40]: Oh, I totally remember you. Didn't ever connect that, but yes, that's definitely true. Yeah. Nice to see you again.
Swyx [00:00:46]: Yeah. I was frantically looking for the name of your paper and I couldn't find it. Basically I had to cut it because I didn't understand it.
Anastasios [00:00:51]: Is this conformal PID control or was this the online control?
Wei Lin [00:00:55]: Blast from the past, man.
Swyx [00:00:57]: Blast from the past. It's always interesting how NeurIPS and all these academic conferences are sort of six months behind what people are actually doing, but conformal risk control, I would recommend people check it out. I have the recording. I just never published it just because I was like, I don't understand this enough to explain it.
Anastasios [00:01:14]: People won't be interested.
Wei Lin [00:01:15]: It's all good.
Swyx [00:01:16]: But ELO scores, ELO scores are very easy to understand. You guys are responsible for the biggest revolution in language model benchmarking in the last few years. Maybe you guys want to introduce yourselves and maybe tell a little bit of the brief history of LMSys
Wei Lin [00:01:32]: Hey, I'm Wei Lin. I'm a fifth year PhD student at UC Berkeley, working on Chatbot Arena these days, doing crowdsourcing AI benchmarking.
Anastasios [00:01:43]: I'm Anastasios. I'm a sixth year PhD student here at Berkeley. I did most of my PhD on like theoretical statistics and sort of foundations of model evaluation and testing. And now I'm working 150% on this Chatbot Arena stuff. It's great.
Alessio [00:02:00]: And what was the origin of it? How did you come up with the idea? How did you get people to buy in? And then maybe what were one or two of the pivotal moments early on that kind of made it the standard for these things?
Wei Lin [00:02:12]: Yeah, yeah. Chatbot Arena project was started last year in April, May, around that. Before that, we were basically experimenting in a lab how to fine tune a chatbot open source based on the Llama 1 model that I released. At that time, Lama 1 was like a base model and people didn't really know how to fine tune it. So we were doing some explorations. We were inspired by Stanford's Alpaca project. So we basically, yeah, grow a data set from the internet, which is called ShareGPT data set, which is like a dialogue data set between user and chat GPT conversation. It turns out to be like pretty high quality data, dialogue data. So we fine tune on it and then we train it and release the model called V2. And people were very excited about it because it kind of like demonstrate open way model can reach this conversation capability similar to chat GPT. And then we basically release the model with and also build a demo website for the model. People were very excited about it. But during the development, the biggest challenge to us at the time was like, how do we even evaluate it? How do we even argue this model we trained is better than others? And then what's the gap between this open source model that other proprietary offering? At that time, it was like GPT-4 was just announced and it's like Cloud One. What's the difference between them? And then after that, like every week, there's a new model being fine tuned, released. So even until still now, right? And then we have that demo website for V2 now. And then we thought like, okay, maybe we can add a few more of the model as well, like API model as well. And then we quickly realized that people need a tool to compare between different models. So we have like a side by side UI implemented on the website to that people choose, you know, compare. And we quickly realized that maybe we can do something like, like a battle on top of ECLMs, like just anonymize it, anonymize the identity, and that people vote which one is better. So the community decides which one is better, not us, not us arguing, you know, our model is better or what. And that turns out to be like, people are very excited about this idea. And then we tweet, we launch, and that's, yeah, that's April, May. And then it was like first two, three weeks, like just a few hundred thousand views tweet on our launch tweets. And then we have regularly double update weekly, beginning at a time, adding new model GPT-4 as well. So it was like, that was the, you know, the initial.
Anastasios [00:04:58]: Another pivotal moment, just to jump in, would be private models, like the GPT, I'm a little,
Wei Lin [00:05:04]: I'm a little chatty. That was this year. That was this year.
Anastasios [00:05:07]: Huge.
Wei Lin [00:05:08]: That was also huge.
Alessio [00:05:09]: In the beginning, I saw the initial release was May 3rd of the beta board. On April 6, we did a benchmarks 101 episode for a podcast, just kind of talking about, you know, how so much of the data is like in the pre-training corpus and blah, blah, blah. And like the benchmarks are really not what we need to evaluate whether or not a model is good. Why did you not make a benchmark? Maybe at the time, you know, it was just like, Hey, let's just put together a whole bunch of data again, run a, make a score that seems much easier than coming out with a whole website where like users need to vote. Any thoughts behind that?
Wei Lin [00:05:41]: I think it's more like fundamentally, we don't know how to automate this kind of benchmarks when it's more like, you know, conversational, multi-turn, and more open-ended task that may not come with a ground truth. So let's say if you ask a model to help you write an email for you for whatever purpose, there's no ground truth. How do you score them? Or write a story or a creative story or many other things like how we use ChatterBee these days. It's more open-ended. You know, we need human in the loop to give us feedback, which one is better. And I think nuance here is like, sometimes it's also hard for human to give the absolute rating. So that's why we have this kind of pairwise comparison, easier for people to choose which one is better. So from that, we use these pairwise comparison, those to calculate the leaderboard. Yeah. You can add more about this methodology.
Anastasios [00:06:40]: Yeah. I think the point is that, and you guys probably also talked about this at some point, but static benchmarks are intrinsically, to some extent, unable to measure generative model performance. And the reason is because you cannot pre-annotate all the outputs of a generative model. You change the model, it's like the distribution of your data is changing. New labels to deal with that. New labels are great automated labeling, right? Which is why people are pursuing both. And yeah, static benchmarks, they allow you to zoom in to particular types of information like factuality, historical facts. We can build the best benchmark of historical facts, and we will then know that the model is great at historical facts. But ultimately, that's not the only axis, right? And we can build 50 of them, and we can evaluate 50 axes. But it's just so, the problem of generative model evaluation is just so expansive, and it's so subjective, that it's just maybe non-intrinsically impossible, but at least we don't see a way. We didn't see a way of encoding that into a fixed benchmark.
Wei Lin [00:07:47]: But on the other hand, I think there's a challenge where this kind of online dynamic benchmark is more expensive than static benchmark, offline benchmark, where people still need it. Like when they build models, they need static benchmark to track where they are.
Anastasios [00:08:03]: It's not like our benchmark is uniformly better than all other benchmarks, right? It just measures a different kind of performance that has proved to be useful.
Swyx [00:08:14]: You guys also published MTBench as well, which is a static version, let's say, of Chatbot Arena, right? That people can actually use in their development of models.
Wei Lin [00:08:25]: Right. I think one of the reasons we still do this static benchmark, we still wanted to explore, experiment whether we can automate this, because people, eventually, model developers need it to fast iterate their model. So that's why we explored LM as a judge, and ArenaHard, trying to filter, select high-quality data we collected from Chatbot Arena, the high-quality subset, and use that as a question and then automate the judge pipeline, so that people can quickly get high-quality signal, benchmark signals, using this online benchmark.
Swyx [00:09:03]: As a community builder, I'm curious about just the initial early days. Obviously when you offer effectively free A-B testing inference for people, people will come and use your arena. What do you think were the key unlocks for you? Was it funding for this arena? Was it marketing? When people came in, do you see a noticeable skew in the data? Which obviously now you have enough data sets, you can separate things out, like coding and hard prompts, but in the early days, it was just all sorts of things.
Anastasios [00:09:31]: Yeah, maybe one thing to establish at first is that our philosophy has always been to maximize organic use. I think that really does speak to your point, which is, yeah, why do people come? They came to use free LLM inference, right? And also, a lot of users just come to the website to use direct chat, because you can chat with the model for free. And then you could think about it like, hey, let's just be kind of like more on the selfish or conservative or protectionist side and say, no, we're only giving credits for people that battle or so on and so forth. Strategy wouldn't work, right? Because what we're trying to build is like a big funnel, a big funnel that can direct people. And some people are passionate and interested and they battle. And yes, the distribution of the people that do that is different. It's like, as you're pointing out, it's like, that's not as they're enthusiastic.
Wei Lin [00:10:24]: They're early adopters of this technology.
Anastasios [00:10:27]: Or they like games, you know, people like this. And we've run a couple of surveys that indicate this as well, of our user base.
Wei Lin [00:10:36]: We do see a lot of developers come to the site asking polling questions, 20-30%. Yeah, 20-30%.
Anastasios [00:10:42]: It's obviously not reflective of the general population, but it's reflective of some corner of the world of people that really care. And to some extent, maybe that's all right, because those are like the power users. And you know, we're not trying to claim that we represent the world, right? We represent the people that come and vote.
Swyx [00:11:02]: Did you have to do anything marketing-wise? Was anything effective? Did you struggle at all? Was it success from day one?
Wei Lin [00:11:09]: At some point, almost done. Okay. Because as you can imagine, this leaderboard depends on community engagement participation. If no one comes to vote tomorrow, then no leaderboard.
Anastasios [00:11:23]: So we had some period of time when the number of users was just, after the initial launch, it went lower. Yeah. And, you know, at some point, it did not look promising. Actually, I joined the project a couple months in to do the statistical aspects, right? As you can imagine, that's how it kind of hooked into my previous work. At that time, it wasn't like, you know, it definitely wasn't clear that this was like going to be the eval or something. It was just like, oh, this is a cool project. Like Wayland seems awesome, you know, and that's it.
Wei Lin [00:11:56]: Definitely. There's in the beginning, because people don't know us, people don't know what this is for. So we had a hard time. But I think we were lucky enough that we have some initial momentum. And as well as the competition between model providers just becoming, you know, became very intense. Intense. And then that makes the eval onto us, right? Because always number one is number one.
Anastasios [00:12:23]: There's also an element of trust. Our main priority in everything we do is trust. We want to make sure we're doing everything like all the I's are dotted and the T's are crossed and nobody gets unfair treatment and people can see from our profiles and from our previous work and from whatever, you know, we're trustworthy people. We're not like trying to make a buck and we're not trying to become famous off of this or that. It's just, we're trying to provide a great public leaderboard community venture project.
Wei Lin [00:12:51]: Yeah.
Swyx [00:12:52]: Yes. I mean, you are kind of famous now, you know, that's fine. Just to dive in more into biases and, you know, some of this is like statistical control. The classic one for human preference evaluation is humans demonstrably prefer longer contexts or longer outputs, which is actually something that we don't necessarily want. You guys, I think maybe two months ago put out some length control studies. Apart from that, there are just other documented biases. Like, I'd just be interested in your review of what you've learned about biases and maybe a little bit about how you've controlled for them.
Anastasios [00:13:32]: At a very high level, yeah. Humans are biased. Totally agree. Like in various ways. It's not clear whether that's good or bad, you know, we try not to make value judgments about these things. We just try to describe them as they are. And our approach is always as follows. We collect organic data and then we take that data and we mine it to get whatever insights we can get. And, you know, we have many millions of data points that we can now use to extract insights from. Now, one of those insights is to ask the question, what is the effect of style, right? You have a bunch of data, you have votes, people are voting either which way. We have all the conversations. We can say what components of style contribute to human preference and how do they contribute? Now, that's an important question. Why is that an important question? It's important because some people want to see which model would be better if the lengths of the responses were the same, were to be the same, right? People want to see the causal effect of the model's identity controlled for length or controlled for markdown, number of headers, bulleted lists, is the text bold? Some people don't, they just don't care about that. The idea is not to impose the judgment that this is not important, but rather to say ex post facto, can we analyze our data in a way that decouples all the different factors that go into human preference? Now, the way we do this is via statistical regression. That is to say the arena score that we show on our leaderboard is a particular type of linear model, right? It's a linear model that takes, it's a logistic regression that takes model identities and fits them against human preference, right? So it regresses human preference against model identity. What you get at the end of that logistic regression is a parameter vector of coefficients. And when the coefficient is large, it tells you that GPT 4.0 or whatever, very large coefficient, that means it's strong. And that's exactly what we report in the table. It's just the predictive effect of the model identity on the vote. The other thing that you can do is you can take that vector, let's say we have M models, that is an M dimensional vector of coefficients. What you can do is you say, hey, I also want to understand what the effect of length is. So I'll add another entry to that vector, which is trying to predict the vote, right? That tells me the difference in length between two model responses. So we have that for all of our data. We can compute it ex post facto. We added it into the regression and we look at that predictive effect. And then the idea, and this is formally true under certain conditions, not always verifiable ones, but the idea is that adding that extra coefficient to this vector will kind of suck out the predictive power of length and put it into that M plus first coefficient and quote, unquote, de-bias the rest so that the effect of length is not included. And that's what we do in style control. Now we don't just do it for M plus one. We have, you know, five, six different style components that have to do with markdown headers and bulleted lists and so on that we add here. Now, where is this going? You guys see the idea. It's a general methodology. If you have something that's sort of like a nuisance parameter, something that exists and provides predictive value, but you really don't want to estimate that. You want to remove its effect. In causal inference, these things are called like confounders often. What you can do is you can model the effect. You can put them into your model and try to adjust for them. So another one of those things might be cost. You know, what if I want to look at the cost adjusted performance of my model, which models are punching above their weight, parameter count, which models are punching above their weight in terms of parameter count, we can ex post facto measure that. We can do it without introducing anything that compromises the organic nature of the
Wei Lin [00:17:17]: data that we collect.
Anastasios [00:17:18]: Hopefully that answers the question.
Wei Lin [00:17:20]: It does.
Swyx [00:17:21]: So I guess with a background in econometrics, this is super familiar.
Anastasios [00:17:25]: You're probably better at this than me for sure.
Swyx [00:17:27]: Well, I mean, so I used to be, you know, a quantitative trader and so, you know, controlling for multiple effects on stock price is effectively the job. So it's interesting. Obviously the problem is proving causation, which is hard, but you don't have to do that.
Anastasios [00:17:45]: Yes. Yes, that's right. And causal inference is a hard problem and it goes beyond statistics, right? It's like you have to build the right causal model and so on and so forth. But we think that this is a good first step and we're sort of looking forward to learning from more people. You know, there's some good people at Berkeley that work on causal inference for the learning from them on like, what are the really most contemporary techniques that we can use in order to estimate true causal effects if possible.
Swyx [00:18:10]: Maybe we could take a step through the other categories. So style control is a category. It is not a default. I have thought that when you wrote that blog post, actually, I thought it would be the new default because it seems like the most obvious thing to control for. But you also have other categories, you have coding, you have hard prompts. We consider that.
Anastasios [00:18:27]: We're still actively considering it. It's just, you know, once you make that step, once you take that step, you're introducing your opinion and I'm not, you know, why should our opinion be the one? That's kind of a community choice. We could put it to a vote.
Wei Lin [00:18:39]: We could pass.
Anastasios [00:18:40]: Yeah, maybe do a poll. Maybe do a poll.
Swyx [00:18:42]: I don't know. No opinion is an opinion.
Wei Lin [00:18:44]: You know what I mean?
Swyx [00:18:45]: Yeah.
Wei Lin [00:18:46]: There's no neutral choice here.
Swyx [00:18:47]: Yeah. You have all these others. You have instruction following too. What are your favorite categories that you like to talk about? Maybe you tell a little bit of the stories, tell a little bit of like the hard choices that you had to make.
Wei Lin [00:18:57]: Yeah. Yeah. Yeah. I think the, uh, initially the reason why we want to add these new categories is essentially to answer some of the questions from our community, which is we won't have a single leaderboard for everything. So these models behave very differently in different domains. Let's say this model is trend for coding, this model trend for more technical questions and so on. On the other hand, to answer people's question about like, okay, what if all these low quality, you know, because we crowdsource data from the internet, there will be noise. So how do we de-noise? How do we filter out these low quality data effectively? So that was like, you know, some questions we want to answer. So basically we spent a few months, like really diving into these questions to understand how do we filter all these data because these are like medias of data points. And then if you want to re-label yourself, it's possible, but we need to kind of like to automate this kind of data classification pipeline for us to effectively categorize them to different categories, say coding, math, structure, and also harder problems. So that was like, the hope is when we slice the data into these meaningful categories to give people more like better signals, more direct signals, and that's also to clarify what we are actually measuring for, because I think that's the core part of the benchmark. That was the initial motivation. Does that make sense?
Anastasios [00:20:27]: Yeah. Also, I'll just say, this does like get back to the point that the philosophy is to like mine organic, to take organic data and then mine it x plus factor.
Alessio [00:20:35]: Is the data cage-free too, or just organic?
Anastasios [00:20:39]: It's cage-free.
Wei Lin [00:20:40]: No GMO. Yeah. And all of these efforts are like open source, like we open source all of the data cleaning pipeline, filtering pipeline. Yeah.
Swyx [00:20:50]: I love the notebooks you guys publish. Actually really good just for learning statistics.
Wei Lin [00:20:54]: Yeah. I'll share this insights with everyone.
Alessio [00:20:59]: I agree on the initial premise of, Hey, writing an email, writing a story, there's like no ground truth. But I think as you move into like coding and like red teaming, some of these things, there's like kind of like skill levels. So I'm curious how you think about the distribution of skill of the users. Like maybe the top 1% of red teamers is just not participating in the arena. So how do you guys think about adjusting for it? And like feels like this where there's kind of like big differences between the average and the top. Yeah.
Anastasios [00:21:29]: Red teaming, of course, red teaming is quite challenging. So, okay. Moving back. There's definitely like some tasks that are not as subjective that like pairwise human preference feedback is not the only signal that you would want to measure. And to some extent, maybe it's useful, but it may be more useful if you give people better tools. For example, it'd be great if we could execute code with an arena, be fantastic.
Wei Lin [00:21:52]: We want to do it.
Anastasios [00:21:53]: There's also this idea of constructing a user leaderboard. What does that mean? That means some users are better than others. And how do we measure that? How do we quantify that? Hard in chatbot arena, but where it is easier is in red teaming, because in red teaming, there's an explicit game. You're trying to break the model, you either win or you lose. So what you can do is you can say, Hey, what's really happening here is that the models and humans are playing a game against one another. And then you can use the same sort of Bradley Terry methodology with some, some extensions that we came up with in one of you can read one of our recent blog posts for, for the sort of theoretical extensions. You can attribute like strength back to individual players and jointly attribute strength to like the models that are in this jailbreaking game, along with the target tasks, like what types of jailbreaks you want.
Wei Lin [00:22:44]: So yeah.
Anastasios [00:22:45]: And I think that this is, this is a hugely important and interesting avenue that we want to continue researching. We have some initial ideas, but you know, all thoughts are welcome.
Wei Lin [00:22:54]: Yeah.
Alessio [00:22:55]: So first of all, on the code execution, the E2B guys, I'm sure they'll be happy to help
Wei Lin [00:22:59]: you.
Alessio [00:23:00]: I'll please set that up. They're big fans. We're investors in a company called Dreadnought, which we do a lot in AI red teaming. I think to me, the most interesting thing has been, how do you do sure? Like the model jailbreak is one side. We also had Nicola Scarlini from DeepMind on the podcast, and he was talking about, for example, like, you know, context stealing and like a weight stealing. So there's kind of like a lot more that goes around it. I'm curious just how you think about the model and then maybe like the broader system, even with Red Team Arena, you're just focused on like jailbreaking of the model, right? You're not doing kind of like any testing on the more system level thing of the model where like, maybe you can get the training data back, you're going to exfiltrate some of the layers and the weights and things like that.
Wei Lin [00:23:43]: So right now, as you can see, the Red Team Arena is at a very early stage and we are still exploring what could be the potential new games we can introduce to the platform. So the idea is still the same, right? And we build a community driven project platform for people. They can have fun with this website, for sure. That's one thing, and then help everyone to test these models. So one of the aspects you mentioned is stealing secrets, stealing training sets. That could be one, you know, it could be designed as a game. Say, can you still use their credential, you know, we hide, maybe we can hide the credential into system prompts and so on. So there are like a few potential ideas we want to explore for sure. Do you want to add more?
Anastasios [00:24:28]: I think that this is great. This idea is a great one. There's a lot of great ideas in the Red Teaming space. You know, I'm not personally like a Red Teamer. I don't like go around and Red Team models, but there are people that do that and they're awesome. They're super skilled. When I think about the Red Team arena, I think those are really the people that we're building it for. Like, we want to make them excited and happy, build tools that they like. And just like chatbot arena, we'll trust that this will end up being useful for the world. And all these people are, you know, I won't say all these people in this community are actually good hearted, right? They're not doing it because they want to like see the world burn. They're doing it because they like, think it's fun and cool. And yeah. Okay. Maybe they want to see, maybe they want a little bit.
Wei Lin [00:25:13]: I don't know. Majority.
Anastasios [00:25:15]: Yeah.
Wei Lin [00:25:16]: You know what I'm saying.
Anastasios [00:25:17]: So, you know, trying to figure out how to serve them best, I think, I don't know where that fits. I just, I'm not expressing. And give them credits, right?
Wei Lin [00:25:24]: And give them credit.
Anastasios [00:25:25]: Yeah. Yeah. So I'm not trying to express any particular value judgment here as to whether that's the right next step. It's just, that's sort of the way that I think we would think about it.
Swyx [00:25:35]: Yeah. We also talked to Sander Schulhoff of the HackerPrompt competition, and he's pretty interested in Red Teaming at scale. Let's just call it that. You guys maybe want to talk with him.
Wei Lin [00:25:45]: Oh, nice.
Swyx [00:25:46]: We wanted to cover a little, a few topical things and then go into the other stuff that your group is doing. You know, you're not just running Chatbot Arena. We can also talk about the new website and your future plans, but I just wanted to briefly focus on O1. It is the hottest, latest model. Obviously, you guys already have it on the leaderboard. What is the impact of O1 on your evals?
Wei Lin [00:26:06]: Made our interface slower.
Anastasios [00:26:07]: It made it slower.
Swyx [00:26:08]: Yeah.
Wei Lin [00:26:10]: Because it needs like 30, 60 seconds, sometimes even more to, the latency is like higher. So that's one. Sure. But I think we observe very interesting things from this model as well. Like we observe like significant improvement in certain categories, like more technical or math. Yeah.
Anastasios [00:26:32]: I think actually like one takeaway that was encouraging is that I think a lot of people before the O1 release were thinking, oh, like this benchmark is saturated. And why were they thinking that? They were thinking that because there was a bunch of models that were kind of at the same level. They were just kind of like incrementally competing and it sort of wasn't immediately obvious that any of them were any better. Nobody, including any individual person, it's hard to tell. But what O1 did is it was, it's clearly a better model for certain tasks. I mean, I used it for like proving some theorems and you know, there's some theorems that like only I know because I still do a little bit of theory. Right. So it's like, I can go in there and ask like, oh, how would you prove this exact thing? Which I can tell you has never been in the public domain. It'll do it. It's like, what?
Wei Lin [00:27:19]: Okay.
Anastasios [00:27:20]: So there's this model and it crushed the benchmark. You know, it's just like really like a big gap. And what that's telling us is that it's not saturated yet. It's still measuring some signal. That was encouraging. The point, the takeaway is that the benchmark is comparative. There's no absolute number. There's no maximum ELO. It's just like, if you're better than the rest, then you win. I think that was actually quite helpful to us.
Swyx [00:27:46]: I think people were criticizing, I saw some of the academics criticizing it as not apples to apples. Right. Like, because it can take more time to reason, it's basically doing some search, doing some chain of thought that if you actually let the other models do that same thing, they might do better.
Wei Lin [00:28:03]: Absolutely.
Anastasios [00:28:04]: To be clear, none of the leaderboard currently is apples to apples because you have like Gemini Flash, you have, you know, all sorts of tiny models like Lama 8B, like 8B and 405B are not apples to apples.
Wei Lin [00:28:19]: Totally agree. They have different latencies.
Anastasios [00:28:21]: Different latencies.
Wei Lin [00:28:22]: Control for latency. Yeah.
Anastasios [00:28:24]: Latency control. That's another thing. We can do style control, but latency control. You know, things like this are important if you want to understand the trade-offs involved in using AI.
Swyx [00:28:34]: O1 is a developing story. We still haven't seen the full model yet, but it's definitely a very exciting new paradigm. I think one community controversy I just wanted to give you guys space to address is the collaboration between you and the large model labs. People have been suspicious, let's just say, about how they choose to A-B test on you. I'll state the argument and let you respond, which is basically they run like five anonymous models and basically argmax their Elo on LMSYS or chatbot arena, and they release the best one. Right? What has been your end of the controversy? How have you decided to clarify your policy going forward?
Wei Lin [00:29:15]: On a high level, I think our goal here is to build a fast eval for everyone, and including everyone in the community can see the data board and understand, compare the models. More importantly, I think we want to build the best eval also for model builders, like all these frontier labs building models. They're also internally facing a challenge, which is how do they eval the model? That's the reason why we want to partner with all the frontier lab people, and then to help them testing. That's one of the... We want to solve this technical challenge, which is eval. Yeah.
Anastasios [00:29:54]: I mean, ideally, it benefits everyone, right?
Wei Lin [00:29:56]: Yeah.
Anastasios [00:29:57]: And people also are interested in seeing the leading edge of the models. People in the community seem to like that. Oh, there's a new model up. Is this strawberry? People are excited. People are interested. Yeah. And then there's this question that you bring up of, is it actually causing harm?
Wei Lin [00:30:15]: Right?
Anastasios [00:30:16]: Is it causing harm to the benchmark that we are allowing this private testing to happen? Maybe stepping back, why do you have that instinct? The reason why you and others in the community have that instinct is because when you look at something like a benchmark, like an image net, a static benchmark, what happens is that if I give you a million different models that are all slightly different, and I pick the best one, there's something called selection bias that plays in, which is that the performance of the winning model is overstated. This is also sometimes called the winner's curse. And that's because statistical fluctuations in the evaluation, they're driving which model gets selected as the top. So this selection bias can be a problem. Now there's a couple of things that make this benchmark slightly different. So first of all, the selection bias that you include when you're only testing five models is normally empirically small.
Wei Lin [00:31:12]: And that's why we have these confidence intervals constructed.
Anastasios [00:31:16]: That's right. Yeah. Our confidence intervals are actually not multiplicity adjusted. One thing that we could do immediately tomorrow in order to address this concern is if a model provider is testing five models and they want to release one, and we're constructing the models at level one minus alpha, we can just construct the intervals instead at level one minus alpha divided by five. That's called Bonferroni correction. What that'll tell you is that the final performance of the model, the interval that gets constructed, is actually formally correct. We don't do that right now, partially because we know from simulations that the amount of selection bias you incur with these five things is just not huge. It's not huge in comparison to the variability that you get from just regular human voters. So that's one thing. But then the second thing is the benchmark is live, right? So what ends up happening is it'll be a small magnitude, but even if you suffer from the winner's curse after testing these five models, what'll happen is that over time, because we're getting new data, it'll get adjusted down. So if there's any bias that gets introduced at that stage, in the long run, it actually doesn't matter. Because asymptotically, basically in the long run, there's way more fresh data than there is data that was used to compare these five models against these private models.
Swyx [00:32:35]: The announcement effect is only just the first phase and it has a long tail.
Anastasios [00:32:39]: Yeah, that's right. And it sort of like automatically corrects itself for this selection adjustment.
Swyx [00:32:45]: Every month, I do a little chart of LMSys Elo versus cost, just to track the price per dollar, the amount of like, how much money do I have to pay for one incremental point in ELO? And so I actually observe an interesting stability in most of the Elo numbers, except for some of them. For example, GPT-4-O August has fallen from 12.90𝑡𝑜12.90to12.60 over the past few months. And it's surprising.
Wei Lin [00:33:11]: You're saying like a new version of GPT-4-O versus the version in May?
Swyx [00:33:17]: There was May. May is $12.85. I could have made some data entry error, but it'd be interesting to track these things over time. Anyway, I observed like numbers go up, numbers go down. It's remarkably stable. Gotcha.
Anastasios [00:33:28]: So there are two different track points and the Elo has fallen.
Wei Lin [00:33:31]: Yes.
Swyx [00:33:32]: And sometimes ELOs rise as well. I think a core rose from 1,200𝑡𝑜1,200to1,230. And that's one of the things, by the way, the community is always suspicious about, like, hey, did this same endpoint get dumber after release? Right? It's such a meme.
Anastasios [00:33:45]: That's funny. But those are different endpoints, right?
Wei Lin [00:33:47]: Yeah, those are different API endpoints, I think. For GPT-4-O, August and May. But if it's for like, you know, endpoint versions we fixed, usually we observe small variation after release.
Anastasios [00:34:04]: I mean, you can quantify the variations that you would expect in an ELO. That's a close form number that you can calculate. So if the variations are larger than we would expect, then that indicates that we should
Wei Lin [00:34:17]: look into that. For sure.
Anastasios [00:34:19]: That's important for us to know. So maybe you should send us a reply. Yeah, please.
Wei Lin [00:34:22]: I'll send you some data. Yeah.
Alessio [00:34:24]: And I know we only got a few minutes before we wrap, but there are two things I would definitely love to talk about. One is route LLM. So talking about models, maybe getting dumber over time, blah, blah, blah. Are routers actually helpful in your experience? And Sean pointed out that MOEs are technically routers too. So how do you kind of think about the router being part of the model versus routing different models? And yeah, overall learnings from building it?
Wei Lin [00:34:51]: Yeah. So route LLM is a project we released a few months ago, I think. And our goal was to basically understand, can we use the preference data we collect to route model based on the question, conditional on the questions, because we will make assumption that some model are good at math, some model are good at coding, things like that. So we found it somewhat useful. For sure, this is like ongoing effort. Our first phase with this project is pretty much like open source, the framework that we develop. So for anyone interested in this problem, they can use the framework, and then they can train their own router model, and then to do evaluation to benchmark. So that's our goal, the reason why we released this framework. And I think there are a couple of future stuff we are thinking. One is, can we just scale this, do even more data, even more preference data, and then train a reward model, train like a router model, better router model. Another thing is, release a benchmark, because right now, currently, there seems to be, one of the end point when we developed this project was like, there's just no good benchmark for a router. So that will be another thing we think could be a useful contribution to community. And there's still, for sure, methodology, new methodology we can use.
Swyx [00:36:18]: I think my fundamental philosophical doubt is, does the router model have to be at least as smart as the smartest model? What's the minimum required intelligence of a router model, right? Like, if it's too dumb, it's not going to route properly.
Anastasios [00:36:32]: Well, I think that you can build a very, very simple router that is very effective. So let me give you an example. You can build a great router with one parameter, and the parameter is just like, I'm going to check if my question is hard. And if it's hard, then I'm going to go to the big model. If it's easy, I'm going to go to the little model. You know, there's various ways of measuring hard that are like, pretty trivial, right? Like, does it have code? Does it have math? Is it long? That's already a great first step, right? Because ultimately, at the end of the day, you're competing with a weak baseline, which is any individual model. And you're trying to ask the question, how do I improve cost? And that's like a one-dimensional trade-off. It's like performance cost, and it's great. Now, you can also get into the extension, which is what models are good at what particular
Wei Lin [00:37:23]: types of queries.
Anastasios [00:37:25]: And then, you know, I think your concern starts taking into effect is, can we actually do that? Can we estimate which models are good in which parts of the space in a way that doesn't introduce more variability and more variation and error into our final pipeline than just using the best of them? That's kind of how I see it.
Swyx [00:37:44]: Your approach is really interesting compared to the commercial approaches where you use information from the chat arena to inform your model, which is, I mean, smart, and it's the foundation of everything you do. Yep.
Alessio [00:37:56]: As we wrap, can we just talk about LMSYS and what that's going to be going forward? Like, LMRENA, I'm becoming something. I saw you announced yesterday you're graduating. I think maybe that was confusing since you're PhD students, but this is a different type
Wei Lin [00:38:09]: of graduation.
Anastasios [00:38:10]: Just for context, LMSYS started as like a student club.
Wei Lin [00:38:15]: Student driven. Yeah.
Anastasios [00:38:16]: Student driven, like research projects, you know, many different research projects are part of LMSYS. Sort of chatbot arena has, of course, like kind of become its own thing. And Lianmin and Ying, who are, you know, created LMSYS, have kind of like moved on to working on SGLANG. And now they're doing other projects that are sort of originated from LMSYS. And for that reason, we thought it made sense to kind of decouple the two. Just so, A, the LMSYS thing, it's not like when someone says LMSYS, they think of chatbot arena. That's not fair, so to speak.
Wei Lin [00:38:52]: And we want to support new projects.
Anastasios [00:38:54]: And we want to support new projects and so on and so forth. But of course, these are all like, you know, our friends.
Wei Lin [00:38:59]: So that's why we call it graduation. I agree.
Alessio [00:39:03]: That's like one thing that people wear. Maybe a little confused by where LMSYS kind of starts and ends and where arena starts
Wei Lin [00:39:10]: and ends.
Alessio [00:39:10]: So I think you reach escape velocity now that you're kind of like your own thing.
Swyx [00:39:15]: So I have one parting question. Like, what do you want more of? Like, what do you want people to approach you with?
Anastasios [00:39:21]: Oh, my God, we need so much help. One thing would be like, we're obviously expanding into like other kinds of arenas, right? We definitely need like active help on red teaming. We definitely need active help on our different modalities, different modalities.
Wei Lin [00:39:35]: So pilot, yeah, coding, coding.
Anastasios [00:39:38]: You know, if somebody could like help us implement this, like REPL in REPL in chatbot arena,
Wei Lin [00:39:44]: massive, that would be a massive delta.
Anastasios [00:39:45]: And I know that there's people out there who are passionate and capable of doing it. It's just, we don't have enough hands on deck. We're just like an academic research lab, right? We're not equipped to support this kind of project. So, yeah, we need help with that. We also need just like general back-end dev. And new ideas, new conceptual ideas. I mean, honestly, the work that we do spans everything from like foundational statistics, like new proofs to full stack dev. And like anybody who's like, wants to contribute something to that pipeline is, should definitely reach out.
Wei Lin [00:40:22]: We need it. And it's an open source project anyways. Anyone can make a PR.
Anastasios [00:40:26]: And we're happy to, you know, whoever wants to contribute, we'll give them credit, you know? We're not trying to keep all the credit for ourselves. We want it to be a community project.
Wei Lin [00:40:33]: That's great.
Alessio [00:40:34]: And fits this pair of everything you've been doing over there. So, awesome, guys. Well, thank you so much for taking the time. And we'll put all the links in the show notes so that people can find you and reach out if they need it. Thank you so much.
Anastasios [00:40:46]: It's very nice to talk to you. And thank you for the wonderful questions.
Wei Lin [00:40:49]: Thank you so much.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
How NotebookLM Was Made
25 okt· Latent Space: The AI Engineer Podcast
If you’ve listened to the podcast for a while, you might have heard our ElevenLabs-powered AI co-host Charlie a few times. Text-to-speech has made amazing progress in the last 18 months, with OpenAI’s Advanced Voice Mode (aka “Her”) as a sneak peek of the future of AI interactions (see our “Building AGI in Real Time” recap). Yet, we had yet to see a real killer app for AI voice (not counting music).
Today’s guests, Raiza Martin and Usama Bin Shafqat, are the lead PM and AI engineer behind the NotebookLM feature flag that gave us the first viral AI voice experience, the “Deep Dive” podcast:
The idea behind the “Audio Overviews” feature is simple: take a bunch of documents, websites, YouTube videos, etc, and generate a podcast out of them. This was one of the first demos that people built with voice models + RAG + GPT models, but it was always a glorified speech-to-text. Raiza and Usama took a very different approach:
* Make it conversational: when you listen to a NotebookLM audio there are a ton of micro-interjections (Steven Johnson calls them disfluencies) like “Oh really?” or “Totally”, as well as pauses and “uh…”, like you would expect in a real conversation. These are not generated by the LLM in the transcript, but they are built into the the audio model. See ~28:00 in the pod for more details.
* Listeners love tension: if two people are always in agreement on everything, it’s not super interesting. They tuned the model to generate flowing conversations that mirror the tone and rhythm of human speech. They did not confirm this, but many suspect the 2 year old SoundStorm paper is related to this model.
* Generating new insights: because the hosts’ goal is not to summarize, but to entertain, it comes up with funny metaphors and comparisons that actually help expand on the content rather than just paraphrasing like most models do. We have had listeners make podcasts out of our podcasts, like this one.
This is different than your average SOTA-chasing, MMLU-driven model buildooor. Putting product and AI engineering in the same room, having them build evals together, and understanding what the goal is lets you get these unique results.
The 5 rules for AI PMs
We always focus on AI Engineers, but this episode had a ton of AI PM nuggets as well, which we wanted to collect as NotebookLM is one of the most successful products in the AI space:
1. Less is more: the first version of the product had 0 customization options. All you could do is give it source documents, and then press a button to generate. Most users don’t know what “temperature” or “top-k” are, so you’re often taking the magic away by adding more options in the UI. Since recording they added a few, like a system prompt, but those were features that users were “hacking in”, as Simon Willison highlighted in his blog post.
2. Use Real-Time Feedback: they built a community of 65,000 users on Discord that is constantly reporting issues and giving feedback; sometimes they noticed server downtime even before the Google internal monitoring did. Getting real time pings > aggregating user data when doing initial iterations.
3. Embrace Non-Determinism: AI outputs variability is a feature, not a bug. Rather than limiting the outputs from the get-go, build toggles that you can turn on/off with feature flags as the feedback starts to roll in.
4. Curate with Taste: if you try your product and it sucks, you don’t need more data to confirm it. Just scrap that and iterate again. This is even easier for a product like this; if you start listening to one of the podcasts and turn it off after 10 seconds, it’s never a good sign.
5. Stay Hands-On: It’s hard to build taste if you don’t experiment. Trying out all your competitors products as well as unrelated tools really helps you understand what users are seeing in market, and how to improve on it.
Chapters
00:00 Introductions01:39 From Project Tailwind to NotebookLM09:25 Learning from 65,000 Discord members12:15 How NotebookLM works18:00 Working with Steven Johnson23:00 How to prioritize features25:13 Structuring the data pipelines29:50 How to eval34:34 Steering the podcast outputs37:51 Defining speakers personalities39:04 How do you make audio engaging?45:47 Humor is AGI51:38 Designing for non-determinism53:35 API when?55:05 Multilingual support and dialect considerations57:50 Managing system prompts and feature requests01:00:58 Future of NotebookLM01:04:59 Podcasts for your codebase01:07:16 Plans for real-time chat01:08:27 Wrap up
Show Notes
* Notebook LM
* AI Test Kitchen
* Nicholas Carlini
* Steven Johnson
* Wealth of Nations
* Histories of Mysteries by Andrej Karpathy
* chicken.pdf Threads
* Area 120
* Raiza Martin
* Usama Bin Shafqat
Transcript
NotebookLM [00:00:00]: Hey everyone, we're here today as guests on Latent Space. It's great to be here, I'm a long time listener and fan, they've had some great guests on this show before. Yeah, what an honor to have us, the hosts of another podcast, join as guests. I mean a huge thank you to Swyx and Alessio for the invite, thanks for having us on the show. Yeah really, it seems like they brought us here to talk a little bit about our show, our podcast. Yeah, I mean we've had lots of listeners ourselves, listeners at Deep Dive. Oh yeah, we've made a ton of audio overviews since we launched and we're learning a lot. There's probably a lot we can share around what we're building next, huh? Yeah, we'll share a little bit at least. The short version is we'll keep learning and getting better for you. We're glad you're along for the ride. So yeah, keep listening. Keep listening and stay curious. We promise to keep diving deep and bringing you even better options in the future. Stay curious.
Alessio [00:00:52]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Residence at Decibel Partners. And I'm joined by my co-host, Swyx, founder of Smol.ai.
Swyx [00:01:01]: Hey, and today we're back in the studio with our special guest, Raiza Martin. And Raiza, I forgot to get your last name, Shafqat.
Raiza [00:01:10]: Yes.
Swyx [00:01:10]: Okay, welcome.
Raiza [00:01:12]: Hello, thank you for having us.
Swyx [00:01:14]: So AI podcasters meet human podcasters, always fun. Congrats on the success of Notebook LM. I mean, how does it feel?
Raiza [00:01:22]: It's been a lot of fun. A lot of it, honestly, was unexpected. But my favorite part is really listening to the audio overviews that people have been making.
Swyx [00:01:29]: Maybe we should do a little bit of intros and tell the story. You know, what is your path into the sort of Google AI org? Or maybe, actually, I don't even know what org you guys are in.
Raiza [00:01:39]: I can start. My name is Raisa. I lead the Notebook LM team inside of Google Labs. So specifically, that's the org that we're in. It's called Google Labs. It's only about two years old. And our whole mandate is really to build AI products. That's it. We work super closely with DeepMind. Our entire thing is just, like, try a bunch of things and see what's landing with users. And the background that I have is, really, I worked in payments before this, and I worked in ads right before, and then startups. I tell people, like, at every time that I changed orgs, I actually almost quit Google. Like, specifically, like, in between ads and payments, I was like, all right, I can't do this. Like, this is, like, super hard. I was like, it's not for me. I'm, like, a very zero-to-one person. But then I was like, okay, I'll try. I'll interview with other teams. And when I interviewed in payments, I was like, oh, these people are really cool. I don't know if I'm, like, a super good fit with this space, but I'll try it because the people are cool. And then I really enjoyed that, and then I worked on, like, zero-to-one features inside of payments, and I had a lot of fun. But then the time came again where I was like, oh, I don't know. It's like, it's time to leave. It's time to start my own thing. But then I interviewed inside of Google Labs, and I was like, oh, darn. Like, there's definitely, like—
Alessio [00:02:48]: They got you again.
Raiza [00:02:49]: They got me again. And so now I've been here for two years, and I'm happy that I stayed because especially with, you know, the recent success of Notebook LM, I'm like, dang, we did it. I actually got to do it. So that was really cool.
Usama [00:03:02]: Kind of similar, honestly. I was at a big team at Google. We do sort of the data center supply chain planning stuff. Google has, like, the largest sort of footprint. Obviously, there's a lot of management stuff to do there. But then there was this thing called Area 120 at Google, which does not exist anymore. But I sort of wanted to do, like, more zero-to-one building and landed a role there. We were trying to build, like, a creator commerce platform called Kaya. It launched briefly a couple years ago. But then Area 120 sort of transitioned and morphed into Labs. And, like, over the last few years, like, the focus just got a lot clearer. Like, we were trying to build new AI products and do it in the wild and sort of co-create and all of that. So, you know, we've just been trying a bunch of different things. And this one really landed, which has felt pretty phenomenal. Really, really landed.
Swyx [00:03:53]: Let's talk about the brief history of Notebook LM. You had a tweet, which is very helpful for doing research. May 2023, during Google I.O., you announced Project Tailwind.
Raiza [00:04:03]: Yeah.
Swyx [00:04:03]: So today is October 2024. So you joined October 2022?
Raiza [00:04:09]: Actually, I used to lead AI Test Kitchen. And this was actually, I think, not I.O. 2023. I.O. 2022 is when we launched AI Test Kitchen, or announced it. And I don't know if you remember it.
Swyx [00:04:23]: That's how you, like, had the basic prototype for Gemini.
Raiza [00:04:26]: Yes, yes, exactly. Lambda.
Swyx [00:04:28]: Gave beta access to people.
Raiza [00:04:29]: Yeah, yeah, yeah. And I remember, I was like, wow, this is crazy. We're going to launch an LLM into the wild. And that was the first project that I was working on at Google. But at the same time, my manager at the time, Josh, he was like, hey, I want you to really think about, like, what real products would we build that are not just demos of the technology? That was in October of 2022. I was sitting next to an engineer that was working on a project called Talk to Small Corpus. His name was Adam. And the idea of Talk to Small Corpus is basically using LLM to talk to your data. And at the time, I was like, wait, there's some, like, really practical things that you can build here. And just a little bit of background, like, I was an adult learner. Like, I went to college while I was working a full-time job. And the first thing I thought was, like, this would have really helped me with my studying, right? Like, if I could just, like, talk to a textbook, especially, like, when I was tired after work, that would have been huge. We took a lot of, like, the Talk to Small Corpus prototypes, and I showed it to a lot of, like, college students, particularly, like, adult learners. They were like, yes, like, I get it, right? Like, I didn't even have to explain it to them. And we just continued to iterate the prototype from there to the point where we actually got a slot as part of the I.O. demo in 23.
Swyx [00:05:42]: And Corpus, was it a textbook? Oh, my gosh.
Raiza [00:05:45]: Yeah. It's funny. Actually, when he explained the project to me, he was like, talk to Small Corpus. It was like, talk to a small corpse?
Swyx [00:05:51]: Yeah, nobody says Corpus.
Raiza [00:06:00]: It was like, a small corpse? This is not AI. Yeah, yeah. And it really was just, like, a way for us to describe the amount of data that we thought, like, it could be good for.
Swyx [00:06:02]: Yeah, but even then, you're still, like, doing rag stuff. Because, you know, the context length back then was probably, like, 2K, 4K.
Raiza [00:06:08]: Yeah, it was basically rag.
Raiza [00:06:09]: That was essentially what it was.
Raiza [00:06:10]: And I remember, I was like, we were building the prototypes. And at the same time, I think, like, the rest of the world was. Right? We were seeing all of these, like, chat with PDF stuff come up. And I was like, come on, we gotta go. Like, we have to, like, push this out into the world. I think if there was anything, I wish we would have launched sooner because I wanted to learn faster. But I think, like, we netted out pretty well.
Alessio [00:06:30]: Was the initial product just text-to-speech? Or were you also doing kind of, like, synthesizing of the content, refining it? Or were you just helping people read through it?
Raiza [00:06:40]: Before we did the I.O. announcement in 23, we'd already done a lot of studies. And one of the first things that I realized was the first thing anybody ever typed was, summarize the thing. Right?
Raiza [00:06:53]: Summarize the document.
Raiza [00:06:54]: And it was, like, half like a test and half just like, oh, I know the content. I want to see how well it does this. So it was part of the first thing that we launched. It was called Project Tailwind back then. It was just Q&A, so you could chat with the doc just through text, and it would automatically generate a summary as well. I'm not sure if we had it back then.
Raiza [00:07:12]: I think we did.
Raiza [00:07:12]: It would also generate the key topics in your document, and it could support up to, like, 10 documents. So it wasn't just, like, a single doc.
Alessio [00:07:20]: And then the I.O. demo went well, I guess. And then what was the discussion from there to where we are today? Is there any, maybe, intermediate step of the product that people missed between this was launch or?
Raiza [00:07:33]: It was interesting because every step of the way, I think we hit, like, some pretty critical milestones. So I think from the initial demo, I think there was so much excitement of, like, wow, what is this thing that Google is launching? And so we capitalized on that. We built the wait list. That's actually when we also launched the Discord server, which has been huge for us because for us in particular, one of the things that I really wanted to do was to be able to launch features and get feedback ASAP. Like, the moment somebody tries it, like, I want to hear what they think right now, and I want to ask follow-up questions. And the Discord has just been so great for that. But then we basically took the feedback from I.O., we continued to refine the product.
Raiza [00:08:12]: So we added more features.
Raiza [00:08:13]: We added sort of, like, the ability to save notes, write notes. We generate follow-up questions. So there's a bunch of stuff in the product that shows, like, a lot of that research. But it was really the rolling out of things. Like, we removed the wait list, so rolled out to all of the United States. We rolled out to over 200 countries and territories. We started supporting more languages, both in the UI and, like, the actual source stuff. We experienced, like, in terms of milestones, there was, like, an explosion of, like, users in Japan. This was super interesting in terms of just, like, unexpected. Like, people would write to us and they would be like, this is amazing. I have to read all of these rules in English, but I can chat in Japanese. It's like, oh, wow. That's true, right? Like, with LLMs, you kind of get this natural, it translates the content for you. And you can ask in your sort of preferred mode. And I think that's not just, like, a language thing, too. I think there's, like, I do this test with Wealth of Nations all the time because it's, like, a pretty complicated text to read. The Evan Smith classic.
Swyx [00:09:11]: It's, like, 400 pages or something.
Raiza [00:09:12]: Yeah. But I like this test because I'm, like, asking, like, Normie, you know, plain speak. And then it summarizes really well for me. It sort of adapts to my tone.
Swyx [00:09:22]: Very capitalist.
Raiza [00:09:25]: Very on brand.
Swyx [00:09:25]: I just checked in on a Notebook LM Discord. 65,000 people. Yeah.
Raiza [00:09:29]: Crazy.
Swyx [00:09:29]: Just, like, for one project within Google. It's not, like, it's not labs. It's just Notebook LM.
Raiza [00:09:35]: Just Notebook LM.
Swyx [00:09:36]: What do you learn from the community?
Raiza [00:09:39]: I think that the Discord is really great for hearing about a couple of things.
Raiza [00:09:43]: One, when things are going wrong. I think, honestly, like, our fastest way that we've been able to find out if, like, the servers are down or there's just an influx of people being, like, it says
Raiza [00:09:53]: system unable to answer.
Raiza [00:09:54]: Anybody else getting this?
Raiza [00:09:56]: And I'm, like, all right, let's go.
Raiza [00:09:58]: And it actually catches it a lot faster than, like, our own monitoring does.
Raiza [00:10:01]: It's, like, that's been really cool. So, thank you.
Swyx [00:10:03]: Canceled eat a dog.
Raiza [00:10:05]: So, thank you to everybody. Please keep reporting it. I think the second thing is really the use cases.
Raiza [00:10:10]: I think when we put it out there, I was, like, hey, I have a hunch of how people will use it, but, like, to actually hear about, you know, not just the context of, like, the use of Notebook LM, but, like, what is this person's life like? Why do they care about using this tool?
Raiza [00:10:23]: Especially people who actually have trouble using it, but they keep pushing.
Raiza [00:10:27]: Like, that's just so critical to understand what was so motivating, right?
Raiza [00:10:31]: Like, what was your problem that was, like, so worth solving? So, that's, like, a second thing.
Raiza [00:10:34]: The third thing is also just hearing sort of, like, when we have wins and when we don't have wins because there's actually a lot of functionality where I'm, like, hmm, I
Raiza [00:10:42]: don't know if that landed super well or if that was actually super critical.
Raiza [00:10:45]: As part of having this sort of small project, right, I want to be able to unlaunch things, too. So, it's not just about just, like, rolling things out and testing it and being, like, wow, now we have, like, 99 features. Like, hopefully we get to a place where it's, like, there's just a really strong core feature set and the things that aren't as great, we can just unlaunch.
Swyx [00:11:02]: What have you unlaunched? I have to ask.
Raiza [00:11:04]: I'm in the process of unlaunching some stuff, but, for example, we had this idea that you could highlight the text in your source passage and then you could transform it. And nobody was really using it and it was, like, a very complicated piece of our architecture and it's very hard to continue supporting it in the context of new features. So, we were, like, okay, let's do a 50-50 sunset of this thing and see if anybody complains.
Raiza [00:11:28]: And so far, nobody has.
Swyx [00:11:29]: Is there, like, a feature flagging paradigm inside of your architecture that lets you feature flag these things easily?
Raiza [00:11:36]: Yes, and actually...
Raiza [00:11:37]: What is it called?
Swyx [00:11:38]: Like, I love feature flagging.
Raiza [00:11:40]: You mean, like, in terms of just, like, being able to expose things to users?
Swyx [00:11:42]: Yeah, as a PM. Like, this is your number one tool, right?
Raiza [00:11:44]: Yeah, yeah.
Swyx [00:11:45]: Let's try this out. All right, if it works, roll it out. If it doesn't, roll it back, you know?
Raiza [00:11:49]: Yeah, I mean, we just run Mendel experiments for the most part. And, actually, I don't know if you saw it, but on Twitter, somebody was able to get around our flags and they enabled all the experiments.
Raiza [00:11:58]: They were, like, check out what the Notebook LM team is cooking.
Raiza [00:12:02]: I was, like, oh!
Raiza [00:12:03]: And I was at lunch with the rest of the team and I was, like, I was eating. I was, like, guys, guys, Magic Draft League!
Raiza [00:12:10]: They were, like, oh, no!
Raiza [00:12:12]: I was, like, okay, just finish eating and then let's go figure out what to do.
Raiza [00:12:15]: Yeah.
Alessio [00:12:15]: I think a post-mortem would be fun, but I don't think we need to do it on the podcast now. Can we just talk about what's behind the magic? So, I think everybody has questions, hypotheses about what models power it. I know you might not be able to share everything, but can you just get people very basic? How do you take the data and put it in the model? What text model you use? What's the text-to-speech kind of, like, jump between the two? Sure.
Raiza [00:12:42]: Yeah.
Raiza [00:12:42]: I was going to say, SRaiza, he manually does all the podcasts.
Raiza [00:12:46]: Oh, thank you.
Usama [00:12:46]: Really fast. You're very fast, yeah.
Raiza [00:12:48]: Both of the voices at once.
Usama [00:12:51]: Voice actor.
Raiza [00:12:52]: Good, good.
Usama [00:12:52]: Yeah, so, for a bit of background, we were building this thing sort of outside Notebook LM to begin with. Like, just the idea is, like, content transformation, right? Like, we can do different modalities. Like, everyone knows that. Everyone's been poking at it. But, like, how do you make it really useful? And, like, one of the ways we thought was, like, okay, like, you maybe, like, you know, people learn better when they're hearing things. But TTS exists, and you can, like, narrate whatever's on screen. But you want to absorb it the same way. So, like, that's where we sort of started out into the realm of, like, maybe we try, like, you know, two people are having a conversation kind of format. We didn't actually start out thinking this would live in Notebook, right? Like, Notebook was sort of, we built this demo out independently, tried out, like, a few different sort of sources. The main idea was, like, go from some sort of sources and transform it into a listenable, engaging audio format. And then through that process, we, like, unlocked a bunch more sort of learnings. Like, for example, in a sense, like, you're not prompting the model as much because, like, the information density is getting unrolled by the model prompting itself, in a sense. Because there's two speakers, and they're both technically, like, AI personas, right? That have different angles of looking at things. And, like, they'll have a discussion about it. And that sort of, we realized that's kind of what was making it riveting, in a sense. Like, you care about what comes next, even if you've read the material already. Because, like, people say they get new insights on their own journals or books or whatever. Like, anything that they've written themselves. So, yeah, from a modeling perspective, like, it's, like Reiza said earlier, like, we work with the DeepMind audio folks pretty closely. So, they're always cooking up new techniques to, like, get better, more human-like audio. And then Gemini 1.5 is really, really good at absorbing long context. So, we sort of, like, generally put those things together in a way that we could reliably produce the audio.
Raiza [00:14:52]: I would add, like, there's something really nuanced, I think, about sort of the evolution of, like, the utility of text-to-speech. Where, if it's just reading an actual text response, and I've done this several times. I do it all the time with, like, reading my text messages. Or, like, sometimes I'm trying to read, like, a really dense paper, but I'm trying to do actual work. I'll have it, like, read out the screen. There is something really robotic about it that is not engaging. And it's really hard to consume content in that way. And it's never been really effective. Like, particularly for me, where I'm, like, hey, it's actually just, like, it's fine for, like, short stuff. Like, texting, but even that, it's, like, not that great. So, I think the frontier of experimentation here was really thinking about there is a transform that needs to happen in between whatever.
Raiza [00:15:38]: Here's, like, my resume, right?
Raiza [00:15:39]: Or here's, like, a 100-page slide deck or something. There is a transform that needs to happen that is inherently editorial. And I think this is where, like, that two-person persona, right, dialogue model, they have takes on the material that you've presented. That's where it really sort of, like, brings the content to life in a way that's, like, not robotic. And I think that's, like, where the magic is, is, like, you don't actually know what's going to happen when you press generate.
Raiza [00:16:08]: You know, for better or for worse.
Raiza [00:16:09]: Like, to the extent that, like, people are, like, no, I actually want it to be more predictable now. Like, I want to be able to tell them. But I think that initial, like, wow was because you didn't know, right? When you upload your resume, what's it about to say about you? And I think I've seen enough of these where I'm, like, oh, it gave you good vibes, right? Like, you knew it was going to say, like, something really cool. As we start to shape this product, I think we want to try to preserve as much of that wow as much as we can. Because I do think, like, exposing, like, all the knobs and, like, the dials, like, we've been thinking about this a lot. It's like, hey, is that, like, the actual thing?
Raiza [00:16:43]: Is that the thing that people really want?
Alessio [00:16:45]: Have you found differences in having one model just generate the conversation and then using text-to-speech to kind of fake two people? Or, like, are you actually using two different kind of system prompts to, like, have a conversation step-by-step? I'm always curious, like, if persona system prompts make a big difference? Or, like, you just put in one prompt and then you just let it run?
Usama [00:17:05]: I guess, like, generally we use a lot of inference, as you can tell with, like, the spinning thing takes a while. So, yeah, there's definitely, like, a bunch of different things happening under the hood. We've tried both approaches and they have their, sort of, drawbacks and benefits. I think that that idea of, like, questioning, like, the two different personas, like, persists throughout, like, whatever approach we try. It's like, there's a bit of, like, imperfection in there. Like, we had to really lean into the fact that, like, to build something that's engaging, like, it needs to be somewhat human and it needs to be just not a chatbot. Like, that was sort of, like, what we need to diverge from. It's like, you know, most chatbots will just narrate the same kind of answer, like, given the same sources, for the most part, which is ridiculous. So, yeah, there's, like, experimentation there under the hood, like, with the model to, like, make sure that it's spitting out, like, different takes and different personas and different, sort of, prompting each other is, like, a good analogy, I guess.
Swyx [00:18:00]: Yeah, I think Steven Johnson, I think he's on your team. I don't know what his role is. He seems like chief dreamer, writer.
Raiza [00:18:08]: Yeah, I mean, I can comment on Steven. So, Steven joined, actually, in the very early days, I think before it was even a fully funded project. And I remember when he joined, I was like, Steven Johnson's going to be on my team? You know, and for folks who don't know him, Steven is a New York Times bestselling author of, like, 14 books. He has a PBS show. He's, like, incredibly smart, just, like, a true, sort of, celebrity by himself. And then he joined Google, and he was like, I want to come here, and I want to build the thing that I've always dreamed of, which is a tool to help me think. I was like, a what? Like, a tool to help you think? I was like, what do you need help with? Like, you seem to be doing great on your own. And, you know, he would describe this to me, and I would watch his flow. And aside from, like, providing a lot of inspiration, to be honest, like, when I watched Steven work, I was like, oh, nobody works like this, right? Like, this is what makes him special. Like, he is such a dedicated, like, researcher and journalist, and he's so thorough, he's so smart. And then I had this realization of, like, maybe Steven is the product. Maybe the work is to take Steven's expertise and bring it to, like, everyday people that could really benefit from this. Like, just watching him work, I was like, oh, I could definitely use, like, a mini-Steven, like, doing work for me. Like, that would make me a better PM. And then I thought very quickly about, like, the adjacent roles that could use sort of this, like, research and analysis tool. And so, aside from being, you know, chief dreamer, Steven also represents, like, a super workflow that I think all of us, like, if we had access to a tool like it, would just inherently, like, make us better.
Swyx [00:19:46]: Did you make him express his thoughts while he worked, or you just silently watched him, or how does this work?
Raiza [00:19:52]: Oh, now you're making me admit it. But yes, I did just silently watch him.
Swyx [00:19:57]: This is a part of the PM toolkit, right? They give user interviews and all that.
Raiza [00:20:00]: Yeah, I mean, I did interview him, but I noticed, like, if I interviewed him, it was different than if I just watched him. And I did the same thing with students all the time. Like, I followed a lot of students around. I watched them study. I would ask them, like, oh, how do you feel now, right?
Raiza [00:20:15]: Or why did you do that? Like, what made you do that, actually?
Raiza [00:20:18]: Or why are you upset about, like, this particular thing? Why are you cranky about this particular topic? And it was very similar, I think, for Steven, especially because he was describing, he was in the middle of writing a book. And he would describe, like, oh, you know, here's how I research things, and here's how I keep my notes. Oh, and here's how I do it. And it was really, he was doing this sort of, like, self-questioning, right? Like, now we talk about, like, chain of, you know, reasoning or thought, reflection.
Raiza [00:20:44]: And I was like, oh, he's the OG.
Raiza [00:20:46]: Like, I watched him do it in real time. I was like, that's, like, L-O-M right there. And to be able to bring sort of that expertise in a way that was, like, you know, maybe, like, costly inference-wise, but really have, like, that ability inside of a tool that was, like, for starters, free inside of NotebookLM, it was good to learn whether or not people really did find use out of it.
Swyx [00:21:05]: So did he just commit to using NotebookLM for everything, or did you just model his existing workflow?
Raiza [00:21:12]: Both, right?
Raiza [00:21:12]: Like, in the beginning, there was no product for him to use. And so he just kept describing the thing that he wanted. And then eventually, like, we started building the thing. And then I would start watching him use it. One of the things that I love about Steven is he uses the product in ways where it kind of does it, but doesn't quite. Like, he's always using it at, like, the absolute max limit of this thing. But the way that he describes it is so full of promise, where he's like, I can see it going here. And all I have to do is sort of, like, meet him there and sort of pressure test whether or not, you know, everyday people want it. And we just have to build it.
Swyx [00:21:47]: I would say OpenAI has a pretty similar person, Andrew Mason, I think his name is. It's very similar, like, just from the writing world and using it as a tool for thought to shape Chachabitty. I don't think that people who use AI tools to their limit are common. I'm looking at my NotebookLM now. I've got two sources. You have a little, like, source limit thing. And my bar is over here, you know, and it stretches across the whole thing. I'm like, did he fill it up?
Raiza [00:22:09]: Yes, and he has, like, a higher limit than others, I think. He fills it up.
Raiza [00:22:14]: Oh, yeah.
Raiza [00:22:14]: Like, I don't think Steven even has a limit, actually.
Swyx [00:22:17]: And he has Notes, Google Drive stuff, PDFs, MP3, whatever.
Raiza [00:22:22]: Yes, and one of my favorite demos, he just did this recently, is he has actually PDFs of, like, handwritten Marie Curie notes. I see.
Swyx [00:22:29]: So you're doing image recognition as well. Yeah, it does support it today.
Raiza [00:22:32]: So if you have a PDF that's purely images, it will recognize it.
Raiza [00:22:36]: But his demo is just, like, super powerful.
Raiza [00:22:37]: He's like, okay, here's Marie Curie's notes. And it's like, here's how I'm using it to analyze it. And I'm using it for, like, this thing that I'm writing.
Raiza [00:22:44]: And that's really compelling.
Raiza [00:22:45]: It's like the everyday person doesn't think of these applications. And I think even, like, when I listen to Steven's demo, I see the gap. I see how Steven got there, but I don't see how I could without him. And so there's a lot of work still for us to build of, like, hey, how do I bring that magic down to, like, zero work? Because I look at all the steps that he had to take in order to do it, and I'm like, okay, that's product work for us, right? Like, that's just onboarding.
Alessio [00:23:09]: And so from an engineering perspective, people come to you and it's like, hey, I need to use this handwritten notes from Marie Curie from hundreds of years ago. How do you think about adding support for, like, data sources and then maybe any fun stories and, like, supporting more esoteric types of inputs?
Raiza [00:23:25]: So I think about the product in three ways, right? So there's the sources, the source input. There's, like, the capabilities of, like, what you could do with those sources. And then there's the third space, which is how do you output it into the world? Like, how do you put it back out there? There's a lot of really basic sources that we don't support still, right? I think there's sort of, like, the handwritten notes stuff is one, but even basic things like DocX or, like, PowerPoint, right? Like, these are the things that people, everyday people are like, hey, my professor actually gave me everything in DocX. Can you support that? And then just, like, basic stuff, like images and PDFs combined with text. Like, there's just a really long roadmap for sources that I think we just have to work on.
Raiza [00:24:04]: So that's, like, a big piece of it.
Raiza [00:24:05]: On the output side, and I think this is, like, one of the most interesting things that we learned really early on, is, sure, there's, like, the Q&A analysis stuff, which is like, hey, when did this thing launch? Okay, you found it in the slide deck. Here's the answer. But most of the time, the reason why people ask those questions is because they're trying to make something new. And so when, actually, when some of those early features leaked, like, a lot of the features we're experimenting with are the output types. And so you can imagine that people care a lot about the resources that they're putting into NotebookLM because they're trying to create something new. So I think equally as important as, like, the source inputs are the outputs that we're helping people to create. And really, like, you know, shortly on the roadmap, we're thinking about how do we help people use NotebookLM to distribute knowledge? And that's, like, one of the most compelling use cases is, like, shared notebooks. It's, like, a way to share knowledge. How do we help people take sources and, like, one-click new documents out of it, right? And I think that's something that people think is, like, oh, yeah, of course, right? Like, one push a document. But what does it mean to do it right? Like, to do it in your style, in your brand, right?
Raiza [00:25:08]: To follow your guidelines, stuff like that.
Raiza [00:25:09]: So I think there's a lot of work, like, on both sides of that equation.
Raiza [00:25:13]: Interesting.
Swyx [00:25:13]: Any comments on the engineering side of things?
Usama [00:25:16]: So, yeah, like I said, I was mostly working on building the text to audio, which kind of lives as a separate engineering pipeline, almost, that we then put into NotebookLM. But I think there's probably tons of NotebookLM engineering war stories on dealing with sources. And so I don't work too closely with engineers directly. But I think a lot of it does come down to, like, Gemini's native understanding of images really well with the latest generation.
Raiza [00:25:39]: Yeah, I think on the engineering and modeling side, I think we are a really good example of a team that's put a product out there, and we're getting a lot of feedback from the users, and we return the data to the modeling team, right? To the extent that we say, hey, actually, you know what people are uploading, but we can't really support super well?
Raiza [00:25:56]: Text plus image, right?
Raiza [00:25:57]: Especially to the extent that, like, NotebookLM can handle up to 50 sources, 500,000 words each. Like, you're not going to be able to jam all of that into, like, the context window. So how do we do multimodal embeddings with that? There's really, like, a lot of things that we have to solve that are almost there, but not quite there yet.
Alessio [00:26:16]: On then turning it into audio, I think one of the best things is it has so many of the human... Does that happen in the text generation that then becomes audio? Or is that a part of, like, the audio model that transforms the text?
Usama [00:26:27]: It's a bit of both, I would say. The audio model is definitely trying to mimic, like, certain human intonations and, like, sort of natural, like, breathing and pauses and, like, laughter and things like that. But yeah, in generating, like, the text, we also have to sort of give signals on, like, where those things maybe would make sense.
Alessio [00:26:45]: And on the input side, instead of having a transcript versus having the audio, like, can you take some of the emotions out of it, too? If I'm giving, like, for example, when we did the recaps of our podcast, we can either give audio of the pod or we can give a diarized transcription of it. But, like, the transcription doesn't have some of the, you know, voice kind of, like, things.
Raiza [00:27:05]: Yeah, yeah.
Alessio [00:27:05]: Do you reconstruct that when people upload audio or how does that work?
Raiza [00:27:09]: So when you upload audio today, we just transcribe it. So it is quite lossy in the sense that, like, we don't transcribe, like, the emotion from that as a source. But when you do upload a text file and it has a lot of, like, that annotation, I think that there is some ability for it to be reused in, like, the audio output, right? But I think it will still contextualize it in the deep dive format. So I think that's something that's, like, particularly important is, like, hey, today we only have one format.
Raiza [00:27:37]: It's deep dive.
Raiza [00:27:38]: It's meant to be a pretty general overview and it is pretty peppy.
Raiza [00:27:42]: It's just very upbeat.
Raiza [00:27:43]: It's very enthusiastic, yeah.
Raiza [00:27:45]: Yeah, yeah.
Raiza [00:27:45]: Even if you had, like, a sad topic, I think they would find a way to be, like, silver lining, though.
Raiza [00:27:50]: Really?
Raiza [00:27:51]: Yeah.
Raiza [00:27:51]: We're having a good chat.
Raiza [00:27:54]: Yeah, that's awesome.
Swyx [00:27:54]: One of the ways, many, many, many ways that deep dive went viral is people saying, like, if you want to feel good about yourself, just drop in your LinkedIn. Any other, like, favorite use cases that you saw from people discovering things in social media?
Raiza [00:28:08]: I mean, there's so many funny ones and I love the funny ones.
Raiza [00:28:11]: I think because I'm always relieved when I watch them. I'm like, haha, that was funny and not scary. It's great.
Raiza [00:28:17]: There was another one that was interesting, which was a startup founder putting their landing page and being like, all right, let's test whether or not, like, the value prop is coming through. And I was like, wow, that's right.
Raiza [00:28:26]: That's smart.
Usama [00:28:27]: Yeah.
Raiza [00:28:28]: And then I saw a couple of other people following up on that, too.
Raiza [00:28:32]: Yeah.
Swyx [00:28:32]: I put my about page in there and, like, yeah, if there are things that I'm not comfortable with, I should remove it. You know, so that it can pick it up. Right.
Usama [00:28:39]: I think that the personal hype machine was, like, a pretty viral one. I think, like, people uploaded their dreams and, like, some people, like, keep sort of dream journals and it, like, would sort of comment on those and, like, it was therapeutic. I didn't see those.
Raiza [00:28:54]: Those are good. I hear from Googlers all the time, especially because we launched it internally first. And I think we launched it during the, you know, the Q3 sort of, like, check-in cycle. So all Googlers have to write notes about, like, hey, you know, what'd you do in Q3? And what Googlers were doing is they would write, you know, whatever they accomplished in Q3 and then they would create an audio overview. And these people they didn't know would just ping me and be like, wow, I feel really good, like, going into a meeting with my manager.
Raiza [00:29:25]: And I was like, good, good, good, good. You really did that, right?
Usama [00:29:29]: I think another cool one is just, like, any Wikipedia article. Yeah. Like, you drop it in and it's just, like, suddenly, like, the best sort of summary overview.
Raiza [00:29:38]: I think that's what Karpathy did, right? Like, he has now a Spotify channel called Histories of Mysteries, which is basically, like, he just took, like, interesting stuff from Wikipedia and made audio overviews out of it.
Swyx [00:29:50]: Yeah, he became a podcaster overnight.
Raiza [00:29:52]: Yeah.
Raiza [00:29:53]: I'm here for it. I fully support him.
Raiza [00:29:55]: I'm racking up the listens for him.
Swyx [00:29:58]: Honestly, it's useful even without the audio. You know, I feel like the audio does add an element to it, but I always want, you know, paired audio and text. And it's just amazing to see what people are organically discovering. I feel like it's because you laid the groundwork with NotebookLM and then you came in and added the sort of TTS portion and made it so good, so human, which is weird. Like, it's this engineering process of humans. Oh, one thing I wanted to ask. Do you have evals?
Raiza [00:30:23]: Yeah.
Swyx [00:30:23]: Yes.
Raiza [00:30:24]: What? Potatoes for chefs.
Swyx [00:30:27]: What is that? What do you mean, potatoes?
Raiza [00:30:29]: Oh, sorry.
Raiza [00:30:29]: Sorry. We were joking with this, like, a couple of weeks ago. We were doing, like, side-by-sides. But, like, Raiza sent me the file and it was literally called Potatoes for Chefs. And I was like, you know, my job is really serious, but you have to laugh a little bit. Like, the title of the file is, like, Potatoes for Chefs.
Swyx [00:30:47]: Is it like a training document for chefs?
Usama [00:30:50]: It's just a side-by-side for, like, two different kind of audio transcripts.
Swyx [00:30:54]: The question is really, like, as you iterate, the typical engineering advice is you establish some kind of test or benchmark. You're at, like, 30 percent. You want to get it up to 90, right?
Raiza [00:31:05]: Yeah.
Swyx [00:31:05]: What does that look like for making something sound human and interesting and voice?
Usama [00:31:11]: We have the sort of formal eval process as well. But I think, like, for this particular project, we maybe took a slightly different route to begin with. Like, there was a lot of just within the team listening sessions. A lot of, like, sort of, like... Dogfooding.
Raiza [00:31:23]: Yeah.
Usama [00:31:23]: Like, I think the bar that we tried to get to before even starting formal evals with raters and everything was much higher than I think other projects would. Like, because that's, as you said, like, the traditional advice, right? Like, get that ASAP. Like, what are you looking to improve on? Whatever benchmark it is. So there was a lot of just, like, critical listening. And I think a lot of making sure that those improvements actually could go into the model. And, like, we're happy with that human element of it. And then eventually we had to obviously distill those down into an eval set. But, like, still there's, like, the team is just, like, a very, very, like, avid user of the product at all stages.
Raiza [00:32:02]: I think you just have to be really opinionated.
Raiza [00:32:05]: I think that sometimes, if you are, your intuition is just sharper and you can move a lot faster on the product.
Raiza [00:32:12]: Because it's like, if you hold that bar high, right?
Raiza [00:32:15]: Like, if you think about, like, the iterative cycle, it's like, hey, we could take, like, six months to ship this thing. To get it to, like, mid where we were. Or we could just, like, listen to this and be like, yeah, that's not it, right? And I don't need a rater to tell me that. That's my preference, right? And collectively, like, if I have two other people listen to it, they'll probably agree. And it's just kind of this step of, like, just keep improving it to the point where you're like, okay, now I think this is really impressive. And then, like, do evals, right? And then validate that.
Swyx [00:32:43]: Was the sound model done and frozen before you started doing all this? Or are you also saying, hey, we need to improve the sound model as well? Both.
Usama [00:32:51]: Yeah, we were making improvements on the audio and just, like, generating the transcript as well. I think another weird thing here was, like, we needed to be entertaining. And that's much harder to quantify than some of the other benchmarks that you can make for, like, you know, Sweebench or get better at this math.
Swyx [00:33:10]: Do you just have people rate one to five or, you know, or just thumbs up and down?
Usama [00:33:14]: For the formal rater evals, we have sort of like a Likert scale and, like, a bunch of different dimensions there. But we had to sort of break down what makes it entertaining into, like, a bunch of different factors. But I think the team stage of that was more critical. It was like, we need to make sure that, like, what is making it fun and engaging? Like, we dialed that as far as it goes. And while we're making other changes that are necessary, like, obviously, they shouldn't make stuff up or, you know, be insensitive.
Raiza [00:33:41]: Hallucinations. Safety.
Swyx [00:33:42]: Other safety things.
Raiza [00:33:43]: Right.
Swyx [00:33:43]: Like a bunch of safety stuff.
Raiza [00:33:45]: Yeah, exactly.
Usama [00:33:45]: So, like, with all of that and, like, also just, you know, following sort of a coherent narrative and structure is really important. But, like, with all of this, we really had to make sure that that central tenet of being entertaining and engaging and something you actually want to listen to. It just doesn't go away, which takes, like, a lot of just active listening time because you're closest to the prompts, the model and everything.
Swyx [00:34:07]: I think sometimes the difficulty is because we're dealing with non-deterministic models, sometimes you just got a bad roll of the dice and it's always on the distribution that you could get something bad. Basically, how many do you, like, do ten runs at a time? And then how do you get rid of the non-determinism?
Raiza [00:34:23]: Right.
Usama [00:34:23]: Yeah, that's bad luck.
Raiza [00:34:25]: Yeah.
Swyx [00:34:25]: Yeah.
Usama [00:34:26]: I mean, there still will be, like, bad audio overviews. There's, like, a bunch of them that happens. Do you mean for, like, the raider? For raiders, right?
Swyx [00:34:34]: Like, what if that one person just got, like, a really bad rating? You actually had a great prompt, you actually had a great model, great weights, whatever. And you just, you had a bad output.
Usama [00:34:42]: Like, and that's okay, right?
Raiza [00:34:44]: I actually think, like, the way that these are constructed, if you think about, like, the different types of controls that the user has, right? Like, what can the user do today to affect it?
Usama [00:34:54]: We push a button.
Raiza [00:34:55]: You just push a button.
Swyx [00:34:56]: I have tried to prompt engineer by changing the title. Yeah, yeah, yeah.
Raiza [00:34:59]: Changing the title, people have found out.
Raiza [00:35:02]: Yeah.
Raiza [00:35:02]: The title of the notebook, people have found out. You can add show notes, right? You can get them to think, like, the show has changed. Someone changed the language of the output. Changing the language of the output. Like, those are less well-tested because we focused on, like, this one aspect. So it did change the way that we sort of think about quality as well, right? So it's like, quality is on the dimensions of entertainment, of course, like, consistency, groundedness. But in general, does it follow the structure of the deep dive? And I think when we talk about, like, non-determinism, it's like, well, as long as it follows, like, the structure of the deep dive, right? It sort of inherently meets all those other qualities. And so it makes it a little bit easier for us to ship something with confidence to the extent that it's like, I know it's going to make a deep dive. It's going to make a good deep dive. Whether or not the person likes it, I don't know. But as we expand to new formats, as we open up controls, I think that's where it gets really much harder. Even with the show notes, right? Like, people don't know what they're going to get when they do that. And we see that already where it's like, this is going to be a lot harder to validate in terms of quality, where now we'll get a greater distribution. Whereas I don't think we really got, like, varied distribution because of, like, that pre-process that Raiza was talking about. And also because of the way that we'd constrain, like, what were we measuring for? Literally, just like, is it a deep dive?
Swyx [00:36:18]: And you determine what a deep dive is. Yeah. Everything needs a PM. Yeah, I have, this is very similar to something I've been thinking about for AI products in general. There's always like a chief tastemaker. And for Notebook LM, it seems like it's a combination of you and Steven.
Raiza [00:36:31]: Well, okay.
Raiza [00:36:32]: I want to take a step back.
Swyx [00:36:33]: And Raiza, I mean, presumably for the voice stuff.
Raiza [00:36:35]: Raiza's like the head chef, right? Of, like, deep dive, I think. Potatoes.
Raiza [00:36:40]: Of potatoes.
Raiza [00:36:41]: And I say this because I think even though we are already a very opinionated team, and Steven, for sure, very opinionated, I think of the audio generations, like, Raiza was the most opinionated, right? And we all, like, would say, like, hey, I remember, like, one of the first ones he sent me.
Raiza [00:36:57]: I was like, oh, I feel like they should introduce themselves. I feel like they should say a title. But then, like, we would catch things, like, maybe they shouldn't say their names.
Raiza [00:37:04]: Yeah, they don't say their names.
Usama [00:37:05]: That was a Steven catch, like, not give them names.
Raiza [00:37:08]: So stuff like that is, like, we all injected, like, a little bit of just, like, hey, here's, like, my take on, like, how a podcast should be, right? And I think, like, if you're a person who, like, regularly listens to podcasts, there's probably some collective preference there that's generic enough that you can standardize into, like, the deep dive format. But, yeah, it's the new formats where I think, like, oh, that's the next test. Yeah.
Swyx [00:37:30]: I've tried to make a clone, by the way. Of course, everyone did. Yeah. Everyone in AI was like, oh, no, this is so easy. I'll just take a TTS model. Obviously, our models are not as good as yours, but I tried to inject a consistent character backstory, like, age, identity, where they work, where they went to school, what their hobbies are. Then it just, the models try to bring it in too much.
Raiza [00:37:49]: Yeah.
Swyx [00:37:49]: I don't know if you tried this.
Raiza [00:37:51]: Yeah.
Swyx [00:37:51]: So then I'm like, okay, like, how do I define a personality? But it doesn't keep coming up every single time. Yeah.
Raiza [00:37:58]: I mean, we have, like, a really, really good, like, character designer on our team.
Raiza [00:38:02]: What?
Swyx [00:38:03]: Like a D&D person?
Raiza [00:38:05]: Just to say, like, we, just like we had to be opinionated about the format, we had to be opinionated about who are those two people talking.
Raiza [00:38:11]: Okay.
Raiza [00:38:12]: Right.
Raiza [00:38:12]: And then to the extent that, like, you can design the format, you should be able to design the people as well.
Raiza [00:38:18]: Yeah.
Swyx [00:38:18]: I would love, like, a, you know, like when you play Baldur's Gate, like, you roll, you roll like 17 on Charisma and like, it's like what race they are. I don't know.
Raiza [00:38:27]: I recently, actually, I was just talking about character select screens.
Raiza [00:38:30]: Yeah. I was like, I love that, right.
Raiza [00:38:32]: And I was like, maybe there's something to be learned there because, like, people have fallen in love with the deep dive as a, as a format, as a technology, but also as just like those two personas.
Raiza [00:38:44]: Now, when you hear a deep dive and you've heard them, you're like, I know those two.
Raiza [00:38:48]: Right.
Raiza [00:38:48]: And people, it's so funny when I, when people are trying to find out their names, like, it's a, it's a worthy task.
Raiza [00:38:54]: It's a worthy goal.
Raiza [00:38:55]: I know what you're doing. But the next step here is to sort of introduce, like, is this like what people want?
Raiza [00:39:00]: People want to sort of edit the personas or do they just want more of them?
Swyx [00:39:04]: I'm sure you're getting a lot of opinions and they all, they all conflict with each other. Before we move on, I have to ask, because we're kind of on this topic. How do you make audio engaging? Because it's useful, not just for deep dive, but also for us as podcasters. What is, what does engaging mean? If you could break it down for us, that'd be great.
Usama [00:39:22]: I mean, I can try. Like, don't, don't claim to be an expert at all.
Swyx [00:39:26]: So I'll give you some, like variation in tone and speed. You know, there's this sort of writing advice where, you know, this sentence is five words. This sentence is three, that kind of advice where you, where you vary things, you have excitement, you have laughter, all that stuff. But I'd be curious how else you break down.
Usama [00:39:42]: So there's the basics, like obviously structure that can't be meandering, right? Like there needs to be sort of a, an ultimate goal that the voices are trying to get to, human or artificial. I think one thing we find often is if there's just too much agreement between people, like that's not fun to listen to. So there needs to be some sort of tension and build up, you know, withholding information. For example, like as you listen to a story unfold, like you're going to learn more and more about it. And audio that maybe becomes even more important because like you actually don't have the ability to just like skim to the end of something. You're driving or something like you're going to be hooked because like there's, and that's how like, that's how a lot of podcasts work. Like maybe not interviews necessarily, but a lot of true crime, a lot of entertainment in general. There's just like a gradual unrolling of information. And that also like sort of goes back to the content transformation aspect of it. Like maybe you are going from, let's say the Wikipedia article of like one of the History of Mysteries, maybe episodes. Like the Wikipedia article is going to state out the information very differently. It's like, here's what happened would probably be in the very first paragraph. And one approach we could have done is like maybe a person's just narrating that thing. And maybe that would work for like a certain audience. Or I guess that's how I would picture like a standard history lesson to unfold. But like, because we're trying to put it in this two-person dialogue format, like there, we inject like the fact that, you know, there's, you don't give everything at first. And then you set up like differing opinions of the same topic or the same, like maybe you seize on a topic and go deeper into it and then try to bring yourself back out of it and go back to the main narrative. So that's, that's mostly from like the setting up the script perspective. And then the audio, I was saying earlier, it's trying to be as close to just human speech as possible. I think was the, what we found success with so far.
Raiza [00:41:40]: Yeah. Like with interjections, right?
Raiza [00:41:41]: Like I think like when you listen to two people talk, there's a lot of like, yeah, yeah, right. And then there's like a lot of like that questioning, like, oh yeah, really?
Raiza [00:41:49]: What did you think?
Swyx [00:41:50]: I noticed that. That's great.
Raiza [00:41:52]: Totally.
Usama [00:41:54]: Exactly.
Swyx [00:41:55]: My question is, do you pull in speech experts to do this? Or did you just come up with it yourselves? You can be like, okay, talk to a whole bunch of fiction writers to, to make things engaging or comedy writers or whatever, stand up comedy, right? They have to make audio engaging, but audio as well. Like there's professional fields of studying where people do this for a living, but us as AI engineers are just making this up as we go.
Raiza [00:42:19]: I mean, it's a great idea, but you definitely didn't.
Raiza [00:42:22]: Yeah.
Swyx [00:42:24]: My guess is you didn't.
Raiza [00:42:25]: Yeah.
Swyx [00:42:26]: There's a, there's a certain field of authority that people have. They're like, oh, like you can't do this because you don't have any experience like making engaging audio. But that's what you literally did.
Raiza [00:42:35]: Right.
Usama [00:42:35]: I mean, I was literally chatting with someone at Google earlier today about how some people think that like you need a linguistics person in the room for like making a good chatbot. But that's not actually true because like this person went to school for linguistics. And according to him, he's an engineer now. According to him, like most of his classmates were not actually good at language. Like they knew how to analyze language and like sort of the mathematical patterns and rhythms and language. But that doesn't necessarily mean they were going to be eloquent at like while speaking or writing. So I think, yeah, a lot of we haven't invested in specialists in audio format yet, but maybe that would.
Raiza [00:43:13]: I think it's like super interesting because I think there is like a very human question of like what makes something interesting. And there's like a very deep question of like what is it, right? Like what is the quality that we are all looking for? Is it does somebody have to be funny? Does something have to be entertaining? Does something have to be straight to the point? And I think when you try to distill that, this is the interesting thing I think about our experiment, about this particular launch is first, we only launched one format. And so we sort of had to squeeze everything we believed about what an interesting thing is into one package. And as a result of it, I think we learned it's like, hey, interacting with a chatbot is sort of novel at first, but it's not interesting, right? It's like humans are what makes interacting with chatbots interesting.
Raiza [00:43:59]: It's like, ha ha ha, I'm going to try to trick it. It's like, that's interesting.
Raiza [00:44:02]: Spell strawberry, right?
Raiza [00:44:04]: This is like the fun that like people have with it. But like that's not the LLM being interesting.
Raiza [00:44:08]: That's you just like kind of giving it your own flavor. But it's like, what does it mean to sort of flip it on its head and say, no, you be interesting now, right? Like you give the chatbot the opportunity to do it. And this is not a chatbot per se. It is like just the audio. And it's like the texture, I think, that really brings it to life. And it's like the things that we've described here, which is like, okay, now I have to like lead you down a path of information about like this commercialization deck.
Raiza [00:44:36]: It's like, how do you do that?
Raiza [00:44:38]: To be able to successfully do it, I do think that you need experts. I think we'll engage with experts like down the road, but I think it will have to be in the context of, well, what's the next thing we're building, right? It's like, what am I trying to change here? What do I fundamentally believe needs to be improved? And I think there's still like a lot more studying that we have to do in terms of like, well, what are people actually using this for? And we're just in such early days. Like it hasn't even been a month. Two, three weeks.
Usama [00:45:05]: Three weeks.
Raiza [00:45:06]: Yeah, yeah.
Usama [00:45:07]: I think one other element to that is the fact that you're bringing your own sources to it. Like it's your stuff. Like, you know this somewhat well, or you care to know about this. So like that, I think, changed the equation on its head as well. It's like your sources and someone's telling you about it. So like you care about how that dynamic is, but you just care for it to be good enough to be entertaining. Because ultimately they're talking about your mortgage deed or whatever.
Swyx [00:45:33]: So it's interesting just from the topic itself. Even taking out all the agreements and the hiding of the slow reveal. I mean, there's a baseline, maybe.
Usama [00:45:42]: Like if it was like too drab. Like if someone was reading it off, like, you know, that's like the absolute worst.
Raiza [00:45:46]: But like...
Swyx [00:45:47]: Do you prompt for humor? That's a tough one, right?
Raiza [00:45:51]: I think it's more of a generic way to bring humor out if possible. I think humor is actually one of the hardest things. Yeah.
Raiza [00:46:00]: But I don't know if you saw...
Raiza [00:46:00]: That is AGI.
Swyx [00:46:01]: Humor is AGI.
Raiza [00:46:02]: Yeah, but did you see the chicken one?
Raiza [00:46:03]: No.
Raiza [00:46:04]: Okay. If you haven't heard it... We'll splice it in here.
Swyx [00:46:06]: Okay.
Raiza [00:46:07]: Yeah.
Raiza [00:46:07]: There is a video on Threads. I think it was by Martino Wong. And it's a PDF.
Raiza [00:46:16]: Welcome to your deep dive for today. Oh, yeah. Get ready for a fun one. Buckle up. Because we are diving into... Chicken, chicken, chicken. Chicken, chicken. You got that right. By Doug Zonker. Now. And yes, you heard that title correctly. Titles. Our listener today submitted this paper. Yeah, they're going to need our help. And I can totally see why. Absolutely. It's dense. It's baffling. It's a lot. And it's packed with more chicken than a KFC buffet. What? That's hilarious.
Raiza [00:46:48]: That's so funny. So it's like stuff like that, that's like truly delightful, truly surprising.
Raiza [00:46:53]: But it's like we didn't tell it to be funny.
Usama [00:46:55]: Humor is contextual also. Like super contextual is what we're realizing. So we're not prompting for humor, but we're prompting for maybe a lot of other things that are bringing out that humor.
Alessio [00:47:04]: I think the thing about ad-generated content, if we look at YouTube, like we do videos on YouTube and it's like, you know, a lot of people like screaming in the thumbnails to get clicks. There's like everybody, there's kind of like a meta of like what you need to do to get clicks. But I think in your product, there's no actual creator on the other side investing the time. So you can actually generate a type of content that is maybe not universally appealing, you know, at a much, yeah, exactly. I think that's the most interesting thing. It's like, well, is there a way for like, take Mr.
Raiza [00:47:36]: Beast, right?
Alessio [00:47:36]: It's like Mr. Beast optimizes videos to reach the biggest audience and like the most clicks. But what if every video could be kind of like regenerated to be closer to your taste, you know, when you watch it?
Raiza [00:47:48]: I think that's kind of the promise of AI that I think we are just like touching on, which is, I think every time I've gotten information from somebody, they have delivered it to me in their preferred method, right?
Raiza [00:47:59]: Like if somebody gives me a PDF, it's a PDF.
Raiza [00:48:01]: Somebody gives me a hundred slide deck, that is the format in which I'm going to read it. But I think we are now living in the era where transformations are really possible, which is, look, like I don't want to read your hundred slide deck, but I'll listen to a 16 minute audio overview on the drive home. And that, that I think is, is really novel. And that is, is paving the way in a way that like maybe we wanted, but didn't
Raiza [00:48:24]: expect.
Raiza [00:48:25]: Where I also think you're listening to a lot of content that normally wouldn't have had content made about it. Like I watched this TikTok where this woman uploaded her diary from 2004.
Raiza [00:48:36]: For sure, right?
Raiza [00:48:36]: Like nobody was going to make a podcast about a diary.
Raiza [00:48:39]: Like hopefully not. Like it seems kind of embarrassing. It's kind of creepy. Yeah, it's kind of creepy.
Raiza [00:48:43]: But she was, she was doing this like live listen of like, oh, like here's a podcast of my diary.
Raiza [00:48:48]: And it's like, it's entertaining right now to sort of all listen to it together. But like the connection is personal. It was like, it was her interacting with like her information in a totally
Raiza [00:48:57]: different way.
Raiza [00:48:58]: And I think that's where like, oh, that's a super interesting space, right? Where it's like, I'm creating content for myself in a way that suits the way that I want to, I want to consume it.
Usama [00:49:06]: Or people compare like retirement plan options. Like no one's going to give you that content. Like for your personal financial situation.
Raiza [00:49:14]: Yeah.
Usama [00:49:14]: And like, even when we started out the experiment, like a lot of the goal was to go for really obscure content and see how well we could transform that. So like if you look at the mountain view, like city council meeting notes, like you're never going to read it. But like if it was a three minute summary, like that would be interesting. I see.
Swyx [00:49:33]: You have one system, one prompt that just covers everything you threw at it.
Raiza [00:49:37]: Maybe.
Swyx [00:49:39]: I'm just, I'm just like, yeah, it's really interesting. You know what? I'm trying to figure out what you nailed compared to others. And I think that the way that you treat your, the AI is like a little bit different than a lot of the builders I talked to. So I don't know what it is. You said, I wish I had a transcript right in front of me, but it's something like people treat AI as like a tool for thought, but usually it's kind of doing their bidding and you know, what you're really doing is loading up these like two virtual agents. I don't, you've never said the word agents. I put that in your mouth, but two virtual humans or AIs and letting them from the, from their own opinion and letting them kind of just live and embody it a little bit. Is that accurate?
Raiza [00:50:17]: I think that that is as close to accurate as possible. I mean, in general, I try to be careful about saying like, oh, you know,
Raiza [00:50:24]: letting, you know, yeah, like these, these personas live.
Raiza [00:50:27]: But I think to your earlier question of like, what makes it interesting? That's what it takes to make it interesting.
Raiza [00:50:32]: Yeah.
Raiza [00:50:32]: Right. And I think to do it well is like a worthy challenge. I also think that it's interesting because they're interested, right? Like, is it interesting to compare?
Raiza [00:50:42]: Yeah.
Raiza [00:50:42]: Is it, is it interesting to have two retirement plans?
Raiza [00:50:46]: No, but to listen to these two talk about it.
Raiza [00:50:50]: Oh my gosh.
Raiza [00:50:50]: You'd think it was like the best thing ever invented, right? It's like, get this, deep dive into 401k through Chase versus, you know,
Raiza [00:50:59]: whatever.
Swyx [00:51:00]: They do do a lot of get this.
Raiza [00:51:02]: I know. I know.
Raiza [00:51:03]: I dream about it.
Raiza [00:51:06]: I'm sorry.
Swyx [00:51:08]: There's a, I have a few more questions on just like the engineering around this. And obviously some of this is just me creatively asking how this works. How do you make decisions between when to trust the AI overlord to decide for you? In other words, stick it, let's say products as it is today. You want to improve it in some way. Do you engineer it into the system? Like write code to make sure it happens or you just stick it in the prompt and hope that the LLM does it for you?
Raiza [00:51:38]: Do you know what I mean?
Raiza [00:51:39]: Do you mean specifically about audio or sort of in general?
Swyx [00:51:41]: In general, like designing AI products. I think this is like the one thing that people are struggling with. And there's, there's compound AI people and then there's big AI people. So compound AI people will be like Databricks, have lots of little models, chain them together to make an output. It's deterministic. You control every single piece and you know, you produce what you produce. The open AI people, totally the opposite. Like write one giant prompts and let the model figure it out.
Raiza [00:52:05]: Yeah.
Swyx [00:52:06]: And obviously the answer for most people is going to be a spectrum in between those two, like big model, small model. When do you decide that?
Raiza [00:52:11]: I think it depends on the task. It also depends on, well, it depends on the task, but ultimately depends on what is your desired outcome? Like what am I engineering for here? And I think there's like several potential outputs and there's sort of like general
Raiza [00:52:24]: categories.
Raiza [00:52:24]: Am I trying to delight somebody? Am I trying to just like meet whatever the person is trying to do? Am I trying to sort of simplify a workflow?
Raiza [00:52:31]: At what layer am I implementing this?
Raiza [00:52:32]: Am I trying to implement this as part of the stack to reduce like friction, you know, particularly for like engineers or something? Or am I trying to engineer it so that I deliver like a super high quality
Raiza [00:52:43]: thing?
Raiza [00:52:44]: I think that the question of like which of those two, I think you're right, it
Raiza [00:52:48]: is a spectrum.
Raiza [00:52:49]: But I think fundamentally it comes down to like it's a craft, like it's still a craft as much as it is a science. And I think the reality is like you have to have a really strong POV about like what you want to get out of it and to be able to make that decision. Because I think if you don't have that strong POV, like you're going to get lost in sort of the detail of like capability. And capability is sort of the last thing that matters because it's like, models will catch up, right? Like models will be able to do, you know, whatever in the next five years. It's going to be insane. So I think this is like a race to like value. And it's like really having a strong opinion about like, what does that look
Raiza [00:53:25]: like today?
Raiza [00:53:25]: And how far are you going to be able to push it? Sorry, I think maybe that was like very like philosophical.
Swyx [00:53:31]: We get there.
Usama [00:53:32]: And I think that hits a lot of the points it's going to make.
Alessio [00:53:35]: I tweeted today or I ex-posted, whatever, that we're going to interview you on what we should ask you. So we got a list of feature requests, mostly. It's funny. Nobody actually had any like specific questions about how the product was built. They just want to know when you're releasing some feature. So I know you cannot talk about all of these things, but I think maybe it would give people an idea of like where the product is going. So I think the most common question I think five people asked is like, are you going to build an API? And, you know, do you see this product as still be kind of like a full head product for like a login and do everything there? Or do you want it to be a piece of infrastructure that people build on?
Raiza [00:54:13]: I mean, I think why not both?
Raiza [00:54:16]: I think we work at a place where you could have both. I think that end user products, like products that touch the hands of users
Raiza [00:54:23]: have a lot of value.
Raiza [00:54:24]: For me personally, like we learn a lot about what people are trying to do and what's like actually useful and what people are ready for. And so we're going to keep investing in that. I think at the same time, right, there are a lot of developers that are interested in using the same technology to build their own thing. We're going to look into that, how soon that's going to be ready. I can't really comment, but these are the things that like, Hey, we heard it.
Raiza [00:54:47]: We're trying to figure it out.
Raiza [00:54:48]: And I think there's room for both.
Swyx [00:54:50]: Is there a world in which this becomes a default Gemini interface because it's technically different org?
Raiza [00:54:55]: It's such a good question.
Raiza [00:54:56]: And I think every, every time someone asks me, it's like, Hey, I just lead
Raiza [00:55:00]: Domogolem.
Raiza [00:55:02]: We'll ask the Gemini folks what they think.
Alessio [00:55:05]: Multilingual support. I know people kind of hack this a little bit together. Any ideas for full support, but also I'm mostly interested in dialects. In Italy, we have Italian obviously, but we have a lot of local dialects. Like if you go to Rome, people don't really speak Italian, they speak local
Raiza [00:55:20]: dialect.
Alessio [00:55:21]: Do you think there's a path to which these models, especially the speech can learn very like niche dialects? Like how much data do you need? Can people contribute? Like I'm curious, like if you see this as a possibility.
Raiza [00:55:35]: Totally.
Usama [00:55:35]: So I guess high level, like we're definitely working on adding more
Raiza [00:55:39]: languages.
Usama [00:55:39]: That's like top priority. We're going to start small, but like theoretically we should be able to cover like most languages pretty soon. What a ridiculous statement, by the way.
Swyx [00:55:48]: That's, that's crazy.
Usama [00:55:49]: Unlike the soon or the pretty soon part.
Swyx [00:55:52]: No, but like, you know, a few years ago, like a small team of like, I don't know, 10 people saying that we will support the top 100, 200 languages is like absurd, but you can do it. Yeah, you can do it.
Raiza [00:56:03]: And I think like the speech team, you know, we are a small team, but the speech team is another team and the modeling team, like these folks are just like absolutely brilliant at what they do. And I think like when we've talked to them and we've said, Hey, you know, how
Raiza [00:56:17]: about more languages? How about more voices? How about dialects?
Raiza [00:56:20]: Right?
Raiza [00:56:20]: This is something that like they are game to do. And like, that's, that's the roadmap for them.
Usama [00:56:25]: The speech team supports like a bunch of other efforts across Google, like Gemini Live, for example, is also the models built by the same like sort of deep mind speech team. But yeah, the thing about dialects is really interesting. Cause like, and some of our sort of earliest testing with trying out other languages, we actually noticed that sometimes it wouldn't stick to a certain dialect, especially for like, I think for French, we noticed that like when we presented it to like a native speaker, it would sometimes go from like a Canadian person speaking French versus like a French person speaking French or an American person speaking French, which is not what we wanted. So there's a lot more sort of speech quality work that we need to do there to make sure that it works reliably. And at least sort of like the, the standard dialect that we want, but that does show that there's potential to sort of do the thing that you're talking about of like fixing a dialect that you want, maybe contribute your own voice or like you pick from one of the options. There's, there's a lot more headroom there. Yeah.
Alessio [00:57:20]: Because we have movies, like we have old Roman movies that have like different languages, but there's not that many, you know? So I'm always like, well, I'm sure like the Italian is so strong in the model that like when you're trying to like pull that away from it, like you kind of need a lot, but right.
Usama [00:57:35]: That's, that's all sort of like wonderful deep mind speech team.
Swyx [00:57:39]: Well, anyway, if you need Italian, he's got you.
Swyx [00:57:44]: Specifically Singlish.
Raiza [00:57:45]: I got you.
Swyx [00:57:46]: Managing system prompts. People want a lot of that. I assume.
Raiza [00:57:50]: Yes.
Swyx [00:57:50]: Ish.
Raiza [00:57:51]: Definitely looking into it for just core notebook LM. Like everybody's wanted that forever. So we're working on that. I think for the audio itself, we're trying to figure out the best way to do it. So we'll launch something sooner rather than later. So we'll probably stage it. And I think like, you know, just to be fully transparent, we'll probably launch something that's more of a fast follow than like a fully baked feature first.
Raiza [00:58:15]: Just because like, I see so many people put in like the fake show notes.
Raiza [00:58:18]: It's like, Hey, I'll, I'll help you out.
Raiza [00:58:19]: We'll just put a text box. Yeah. Yeah.
Usama [00:58:21]: I think a lot of people are like, this is almost perfect, but like, I just need that extra 10, 20%. Yeah.
Swyx [00:58:26]: I noticed that you say no a lot, I think, or you try to ship one thing and that there's different about you than maybe other PMs or other teams that try to ship, but they're like, Oh, here are all the knobs.
Raiza [00:58:38]: I'm just.
Swyx [00:58:38]: Take all my knobs. Yeah.
Raiza [00:58:40]: Yeah.
Swyx [00:58:40]: Top P top cake. It doesn't matter. I'll just put it in the docs and you figure it out. Right. Whereas for you, it's you, you actually just, you make one product.
Raiza [00:58:49]: Yeah.
Swyx [00:58:49]: As opposed to like 10, you could possibly have done.
Raiza [00:58:51]: Yeah.
Swyx [00:58:51]: I don't know.
Raiza [00:58:52]: It's interesting. I think about this a lot.
Raiza [00:58:53]: I think it requires a lot of discipline because I thought about the knobs.
Raiza [00:58:57]: I was like, Oh, I saw on Twitter, you know, on X people want the knobs. It's like, great.
Raiza [00:59:02]: Start mocking it up, making the text boxes, designing like the little fiddles.
Raiza [00:59:06]: Right.
Raiza [00:59:07]: And then I looked at it and I was kind of sad. I was like, well, right. It's like, Oh, it's like, this is not cool.
Raiza [00:59:12]: This is not fun.
Raiza [00:59:13]: This is not magical. It is sort of exactly what you would expect knobs to be. Then, you know, it's like, Oh, I mean, how much can you, you know, design a knob?
Raiza [00:59:24]: I thought about it. I was like, but the thing that people really like was that there wasn't any.
Raiza [00:59:29]: That they just pushed a button and it was cool.
Raiza [00:59:32]: And so I was like, how do we bring more of that?
Raiza [00:59:34]: Right.
Raiza [00:59:34]: That still gives the user the optionality that they want. And so this is where like, you have to have a strong POV. I think you have to like really boil down. What did I learn in like the month since I've launched this thing that people really want? And I can give it to them while preserving like that, that delightful sort of fun experience. And I think that's actually really hard.
Raiza [00:59:54]: Like I'm not going to come up with that by myself.
Raiza [00:59:55]: And like, that's something that like our team thinks about every day. We all have different ideas. We're all experimenting with sort of how to get the most out of like the insight and also ship it quick. So, so we'll see.
Raiza [01:00:06]: We'll find out soon if people like it or not.
Usama [01:00:08]: I think the other interesting thing about like AI development now is that the knobs are not necessarily like speak going back to all the sort of like craft and like human taste and all of that that went into building it. Like the knobs are not as easy to add as simply like I'm going to add a parameter to this and it's going to make it happen. It's like you kind of have to redo the quality process for everything. Yeah, the prioritization is also different.
Raiza [01:00:36]: It goes back to sort of like, it's a lot easier to do an eval for like the deep dive format than if like, okay, now I'm going to let you inject like these random things, right?
Raiza [01:00:45]: Okay.
Raiza [01:00:45]: How am I going to measure quality?
Raiza [01:00:46]: Either?
Raiza [01:00:46]: I say, I don't care because like you just input whatever.
Raiza [01:00:50]: Or I say, actually wait, right?
Raiza [01:00:53]: Like I want to help you get the best output ever.
Raiza [01:00:55]: What's it going to take?
Usama [01:00:56]: The knob actually needs to work reliably.
Raiza [01:00:58]: Yeah. Yeah. Very important part.
Alessio [01:01:00]: Two more things we definitely want to talk about. I guess now people equivalent notebook LM to like a podcast generator, but I guess, you know, there's a whole product suite there.
Raiza [01:01:09]: Yeah.
Alessio [01:01:10]: How should people think about that? Like is this, and also like the future of the product as far as monetization too, you know, like, is it going to be the voice thing going to be a core to it? Is it just going to be one output modality? And like, you're still looking to build like a broader kind of like a interface with data and documents.
Raiza [01:01:27]: I mean, that's such a, that's such a good question that I think the answer it's I'm waiting to get more data. I think because we are still in the period where everyone's really excited about it, everyone's trying it. I think I'm getting a lot of sort of like positive feedback on the audio. We have some early signal that says it's a really good hook, but people stay for the other features.
Raiza [01:01:49]: So that's really good too.
Raiza [01:01:50]: I was making a joke yesterday.
Raiza [01:01:51]: I was like, it'd be really nice, you know, if it was just the audio, because then I could just like simplify the train.
Raiza [01:01:58]: Right.
Raiza [01:01:58]: I don't have to think about all this other functionality, but I think the reality is that the framework kind of like what we were talking about earlier that we had laid out, which is like you bring your own sources. There's something you do in the middle and then there's an output is that really extensible one. And it's a really interesting one. And I think like, particularly when we think about what a big business looks like, especially when we think about commercialization, audio is just one such modality. But the editor itself, like the space in which you're able to do these things is like, that's the business, right? Like maybe the audio by itself, not so much, but like in this big package, like, oh, I could see that. I could see that being like a really big business.
Raiza [01:02:37]: Yep.
Alessio [01:02:37]: Any thoughts on some of the alternative interact with data and documents thing, like cloud artifacts, like a JGBD canvas, you know, kind of how do you see, maybe we're notebook LM stars, but like Gemini starts, like you have so many amazing teams and products at Google. There's sometimes like, I'm sure you have to figure that out.
Raiza [01:02:56]: Yeah.
Raiza [01:02:56]: Well, I love artifacts.
Raiza [01:02:59]: I played a little bit with canvas. I got a little dizzy using it. I was like, oh, there's something.
Raiza [01:03:03]: Well, you know, I like the idea of it fundamentally, but something about the UX was like, oh, this is like more disorienting than like artifacts.
Raiza [01:03:11]: And I couldn't figure out what it was. And I didn't spend a lot of time thinking about it, but I love that, right?
Raiza [01:03:16]: Like the thing where you are like, I'm working with, you know, an LLM, an agent, a chap or whatever to create something new. And there's like the chat space.
Raiza [01:03:26]: There's like the output space. I love that. And the thing that I think I feel angsty about is like, we've been talking about this for like a year, right?
Raiza [01:03:35]: Like, of course, like I'm going to say that, but it's like, but like for a year now I've had these like mocks that I was just like, I want to push the button.
Raiza [01:03:42]: But we prioritize other things.
Raiza [01:03:43]: We were like, okay, what can we like really win at? And like we prioritize audio, for example, instead of that. But just like when people were like, oh, what is this magic draft thing? Oh, it's like a hundred percent, right?
Raiza [01:03:54]: It's like stuff like that that we want to try to build into notebook too.
Raiza [01:03:57]: And I'd made this comment on Twitter as well, where I was like, now I don't know, actually, right? I don't actually know if that is the right thing.
Raiza [01:04:05]: Like, are people really getting utility out of this? I mean, from the launches, it seems like people are really getting it.
Raiza [01:04:11]: But I think now if we were to ship it, I have to rev on it like one layer more, right? I have to deliver like a differentiating value compared to like artifacts or chemicals, which is hard.
Swyx [01:04:20]: Which is because you've, you demonstrated the ability to fast follow. So you don't have to innovate every single time. I know, I know.
Raiza [01:04:27]: I think for me, it's just like the bar is high to ship.
Raiza [01:04:30]: And when I say that, I think it's sort of like conceptually like the value that you deliver to the user. I mean, you'll, you'll see a notebook alarm. There are a lot of corners that like that I have personally cut where it's like our UX designer is always like, I can't believe you let us ship with like these ugly scroll bars. And I'm like, no, no one notices, I promise.
Raiza [01:04:47]: He's like, no, everyone.
Raiza [01:04:48]: It's a screenshot, this thing.
Raiza [01:04:50]: But I mean, kidding aside, I think that's true that it's like we do want to be able to fast follow.
Raiza [01:04:54]: But I think we want to make sure that things also land really well. So the utility has to be there.
Swyx [01:04:59]: Code in, especially on our podcast has a special place. Is code notebook LLM interesting to you? I haven't, I've never, I don't see like a connect my GitHub to this thing. Yeah, yeah.
Raiza [01:05:10]: I think code, code is a big one. Code is a big one. I think we have been really focused, especially when we had like a much smaller team, we were really focused on like, let's push like an end to end journey together. Let's prove that we can do that. Because then once you lay the groundwork of like sources, do something in the chat output, once you have that, you just scale it up from there. Right. And it's like, now it's just a matter of like scaling the inputs, scaling the outputs, scaling the capabilities of the chat. So I think we're going to get there. And now I also feel like I have a much better view of like where the investment is required. Whereas previously I was like, Hey, like let's flesh out the story first before we put more engineers on this thing, because that's just going to slow us down.
Usama [01:05:49]: For what it's worth, the model still understands code. So I've seen at least one or two people just like download their GitHub repo, put it in there and get like an audio overview of your code.
Raiza [01:06:00]: Yeah, yeah. I've never tried that.
Usama [01:06:01]: This is like, these are all the files are connected together because the model still understands code. Like even if you haven't like.
Raiza [01:06:07]: I think on sort of like the creepy side of things, I did watch a student like with her permission, of course, I watched her do her homework in Notebook LM.
Raiza [01:06:17]: And I didn't tell her like what kind of homework to bring, but she brought like her computer science homework.
Raiza [01:06:23]: And I was like, Oh, and she uploaded it. And she said, here's my homework, read it. And it was just the instructions. And Notebook LM was like, okay, I've read it. And the student was like, okay, here's my code so far.
Raiza [01:06:37]: And she copy pasted it from the editor.
Raiza [01:06:39]: And she was like, check my homework. And Notebook LM was like, well, number one is wrong.
Raiza [01:06:44]: And I thought that was really interesting because it didn't tell her what was wrong. It just said it's wrong.
Raiza [01:06:48]: And she was like, okay, don't tell me the answer, but like walk me through like how you think about this. And it was what was interesting for me was that she didn't ask for the answer.
Raiza [01:06:58]: And I asked her, I was like, oh, why did you do that? And she was like, well, I actually want to learn it. She's like, because I'm gonna have to take a quiz on this at some point. And I was like, oh, yeah, it's a really good point.
Raiza [01:07:05]: And it was interesting because, you know, Notebook LM, while the formatting wasn't perfect, like did say like, hey, have you thought about using, you know, maybe an integer instead of like this?
Raiza [01:07:14]: And so that was, that was really interesting.
Alessio [01:07:16]: Are you adding like real-time chat on the output? Like, you know, there's kind of like the deep dive show and then there's like the listeners call in and say, hey.
Raiza [01:07:26]: Yeah, we're actively, that's one of the things we're actively prioritizing. Actually, one of the interesting things is now we're like, why would anyone want to do that? Like, what are the actual, like kind of going back to sort of having a strong POV about the experience? It's like, what is better? Like, what is fundamentally better about doing that? That's not just like being able to Q&A or Notebook. How is that different from like a conversation? Is it just the fact that there was a show and you want to tweak the show? Is it because you want to participate? So I think there's a lot there that like we can continue to unpack. But yes, that's coming.
Swyx [01:07:58]: It's because I formed a parasocial relationship. Yeah, that just might be part of your life.
Raiza [01:08:03]: Get this.
Raiza [01:08:05]: Totally.
Swyx [01:08:07]: Yeah, but it is obviously because OpenAI has just launched a real-time chat. It's a very hot topic. I would say one of the toughest AI engineering disciplines out there because even their API doesn't do interruptions that well, to be honest. And, you know, yeah, so real-time chat is tough.
Raiza [01:08:25]: I love that thing.
Raiza [01:08:26]: I love it.
Swyx [01:08:27]: Okay, so we have a couple ways to end. Either call to action or laying out one principle of AI PMing or engineering that you really think about a lot. Is there anything that comes to mind?
Raiza [01:08:39]: I feel like that's a test.
Raiza [01:08:40]: Of course, I'm going to say go to notebooklm.google.com, try it out, join the Discord and tell us what you think.
Swyx [01:08:46]: Yeah, especially like you have a technical audience. What do you want from a technical engineering audience?
Raiza [01:08:52]: I mean, I think it's interesting because the technical and engineering audience typically will just say, hey, where's the API?
Raiza [01:08:58]: But, you know, I think we addressed it. But I think what I would really be interested to discover is, is this useful to you?
Raiza [01:09:05]: Why is it useful?
Raiza [01:09:05]: What did you do? Right? Is it useful tomorrow?
Raiza [01:09:08]: How about next week?
Raiza [01:09:08]: Just the most useful thing for me is if you do stop using it or if you do keep using it, tell me why.
Raiza [01:09:14]: Because I think contextualizing it within your life, your background, your motivations, is what really helps me build really cool things.
Swyx [01:09:22]: And then one piece of advice for AI PMs.
Raiza [01:09:24]: Okay, if I had to pick one, it's just always be building. Build things yourself. I think for PMs, it's such a critical skill. And just take time to pop your head up and see what else is new out there. On the weekends, I try to have a lot of discipline. I only use ChatGPT and Cloud on the weekend. I try to use the APIs. Occasionally, I'll try to build something on GCP over the weekend because I don't do that normally at work. But it's just the rigor of just trying to be a builder yourself. And even just testing, right? You could have an idea of how a product should work and maybe your engineers are building it. But it's like, what was your proof of concept? What gave you conviction that that was the right thing?
Raiza [01:10:06]: Call to action?
Usama [01:10:07]: I feel like consistently, the most magical moments out of AI building come about for me when I'm really, really, really just close to the edge of the model capability. And sometimes it's farther than you think it is. I think while building this product, some of the other experiments, there were phases where it was easy to think that you've approached it. But sometimes at that point, what you really need is to show your thing to someone and they'll come up with creative ways to improve it. We're all sort of learning, I think. So yeah, I feel like unless you're hitting that bound of this is what Gemini 1.5 can do, probably the magic moment is somewhere there, in that sort of limit.
Swyx [01:10:48]: So push the edge of the capability. Yeah, totally.
Alessio [01:10:51]: It's funny because we had a Nicola Scarlini from DeepMind on the pod and he was like, if the model is always successful, you're probably not trying hard enough to give it heart.
Raiza [01:11:00]: Right. Thanks.
Alessio [01:11:00]: So, yeah.
Swyx [01:11:03]: My problem is sometimes I'm not smart enough to judge. Yeah, right.
Raiza [01:11:08]: Well, I think I hear that a lot.
Raiza [01:11:11]: Like people are always like, I don't know how to use it.
Raiza [01:11:14]: And it's hard.
Raiza [01:11:15]: Like I remember the first time I used Google search. I was like, what do we type?
Raiza [01:11:18]: My dad was like, anything.
Raiza [01:11:19]: It's like anything.
Raiza [01:11:20]: I got nothing in my brain, dad. What do you mean?
Raiza [01:11:23]: And I think there is a lot of like for product builders is like, have a strong opinion about like, what is the user supposed to do?
Raiza [01:11:30]: Yeah. Help them do it.
Swyx [01:11:31]: Principle for AI engineers or like just one advice that you have others?
Usama [01:11:36]: I guess like in addition to pushing the bounds and to do that, that often means like you're not going to get it right in the first go. So like, don't be afraid to just like batch multiple models together. I guess that's I'm basically describing an agent, but more thinking time equals just better results consistently. And that holds true for probably every single time that I've tried to build something.
Swyx [01:12:01]: Well, at some point we will talk about the sort of longer inference paradigm. It seems like DeepMind is rumored to be coming out with something. You can't comment, of course.
Raiza [01:12:09]: Yeah.
Swyx [01:12:09]: Well, thank you so much. You know, you've created. I actually said, I think you saw this. I think that Notebook LLM was kind of like the ChatGPT moment for Google.
Raiza [01:12:18]: That was so crazy when I saw that.
Raiza [01:12:19]: I was like, what?
Raiza [01:12:20]: Like, ChatGPT was huge for me. And I think, you know, when you said it and other people have said it, I was like, is it?
Raiza [01:12:27]: Yeah. That's crazy.
Swyx [01:12:28]: People weren't like really cognizant of Notebook LLM before and audio overviews and Notebook LLM like unlocked the, you know, a use case for people in the way that I would go so far as to say cloud projects never did. And I don't know. You know, I think a lot of it is competent PMing and engineering, but also just, you know, it's interesting how a lot of these projects are always like low key research previews for you. It's like you're a separate org, but like, you know, you built products and UI innovation on top of also working with research to improve the model. That was a success that wasn't planned to be this whole big thing. You know, your TPUs were on fire, right?
Raiza [01:13:06]: Oh my gosh, that was so funny.
Raiza [01:13:08]: I didn't know people would like really catch on to the Elmo fire, but it was just like one of those things where I was like, you know, we had to ask for more TPUs.
Raiza [01:13:16]: Yeah, we many times.
Raiza [01:13:18]: And, you know, it was a little bit of a, of a subtweet of like, Hey, reminder, give us more TPUs on here.
Raiza [01:13:25]: It's weird.
Swyx [01:13:25]: I just think like when people try to make big launches, then they flop. And then like when they're not trying and they just, they're just trying to build a good thing, then, then they succeed. It's, it's this fundamentally really weird magic that I haven't really encapsulated yet, but you've, you've done it. Well, thank you.
Raiza [01:13:40]: Thank you.
Raiza [01:13:40]: And, you know, I think we'll just keep going in like the same way. We just keep trying, keep trying to make it better.
Raiza [01:13:45]: I hope so.
Swyx [01:13:46]: All right.
Raiza [01:13:47]: Cool.
Swyx [01:13:47]: Thank you.
Raiza [01:13:48]: Thank you. Thanks for having us. Thanks.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
Building the AI Engineer Nation — with Josephine Teo, Minister of Digital Development and Information, Singapore
19 okt· Latent Space: The AI Engineer Podcast
Singapore's GovTech is hosting an AI CTF challenge with ~$15,000 in prizes, starting October 26th, open to both local and virtual hackers. It will be hosted on Dreadnode's Crucible platform; signup here!
It is common to say if you want to work in AI, you should come to San Francisco.
Not everyone can. Not everyone should. If you can only do meaningful AI work in one city, then AI has failed to generalize meaningfully.
As non-Americans working in the US, we know what it’s like to see AI progress so rapidly here, and yet be at a loss for what our home countries can do. Through Latent Space we’ve tried to tell the story of AI outside of the Bay Area bubble; we talked to Notion in New York and Humanloop and Wondercraft in London and HuggingFace in Paris and ICLR in Vienna, and the Reka, RWKV, and Winds of AI Winter episodes were taped in Singapore (the World’s Fair also had Latin America representation and we intend to at least add China, Japan, and India next year).
The Role of Government with AI
As an intentionally technical resource, we’ve mostly steered clear of regulation and safety debates on the podcast; whether it is safety bills or technoalarmism, often at the cost of our engagement numbers or ability to book big name guests with a political agenda. When SOTA shifts 3x faster than it takes to pass a law, when nobody agrees on definitions of important things, when you can elicit never-before-seen behavior by slightly different prompting or sampling, it is hard enough to simply keep up to speed, so we are happy limiting our role to that. The story of AI progress has more often been achieved in the private sector, usually in spite of, rather than with thanks to, government intervention.
But industrial policy is inextricably linked to the business of AI, which we do very much care about, has an explicitly accelerationist intent if not impact, and has a track record of success in correcting for legitimate market failures in private sector investment, particularly outside of the US. It is with this lens we approach today’s episode and special guest, our first with a sitting Cabinet member.
Singapore’s National AI Strategy
It is well understood that much of Singapore’s economic success is attributable to industrial policy, from direct efforts like the Jurong Town Corporation industrialization to indirect ones like going all in on English as national first language. Singapore’s National AI Strategy grew out of its 2014 Smart Nation initiative, first launched in 2019 and then refreshed in 2023 by Minister Josephine Teo, our guest today.
While Singapore is not often thought of as an AI leader, the National University ranks in the top 10 in publications (above Oxford/Harvard!), and many overseas Singaporeans work at the leading AI companies and institutions in the US (and some of us even run leading AI Substacks?). OpenAI has often publicly named the Singapore government as their model example of government collaborator and is opening an office in Singapore in time for DevDay 2024.
AI Engineer Nations
Swyx first pitched the AI Engineer Nation concept at a private Sovereign AI summit featuring Dr. He Ruimin, Chief AI Officer of Singapore, which eventually led to an invitation to discuss the concept with Minister Teo, the country’s de-facto minister for tech (she calls it Digital Development, for good reasons she explains in the pod).
This chat happened (with thanks to Jing Long, Joyce, and other folks from MDDI)!
The central pitch for any country, not just Singapore, to emphasize and concentrate bets on AI Engineers, compared with other valuable efforts like training more researchers, releasing more government-approved data, or offering more AI funding, is a calculated one, based on the fact that:
* GPU clusters and researchers have massive returns to scale and colocation, mostly concentrated in the US, that are irresponsibly expensive to replicate
* Even if research stopped today and there was no progress for the next 30 years, there are far more capabilities to unlock and productize from existing foundation models and we
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
Building the Silicon Brain - with Drew Houston of Dropbox
18 okt· Latent Space: The AI Engineer Podcast
CEOs of publicly traded companies are often in the news talking about their new AI initiatives, but few of them have built anything with it. Drew Houston from Dropbox is different; he has spent over 400 hours coding with LLMs in the last year and is now refocusing his 2,500+ employees around this new way of working, 17 years after founding the company.
Timestamps
00:00 Introductions
00:43 Drew's AI journey
04:14 Revalidating expectations of AI
08:23 Simulation in self-driving vs. knowledge work
12:14 Drew's AI Engineering setup
15:24 RAG vs. long context in AI models
18:06 From "FileGPT" to Dropbox AI
23:20 Is storage solved?26:30 Products vs Features
30:48 Building trust for data access
33:42 Dropbox Dash and universal search
38:05 The evolution of Dropbox
42:39 Building a "silicon brain" for knowledge work
48:45 Open source AI and its impact
51:30 "Rent, Don't Buy" for AI
54:50 Staying relevant
58:57 Founder Mode
01:03:10 Advice for founders navigating AI
01:07:36 Building and managing teams in a growing company
Transcript
Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and there's no Swyx today, but I'm joined by Drew Houston of Dropbox. Welcome, Drew.
Drew [00:00:14]: Thanks for having me.
Alessio [00:00:15]: So we're not going to talk about the Dropbox story. We're not going to talk about the Chinatown bus and the flash drive and all that. I think you've talked enough about it. Where I want to start is you as an AI engineer. So as you know, most of our audience is engineering folks, kind of like technology leaders. You obviously run Dropbox, which is a huge company, but you also do a lot of coding. I think that's how you spend almost 400 hours, just like coding. So let's start there. What was the first interaction you had with an LLM API and when did the journey start for you?
Drew [00:00:43]: Yeah. Well, I think probably all AI engineers or whatever you call an AI engineer, those people started out as engineers before that. So engineering is my first love. I mean, I grew up as a little kid. I was that kid. My first line of code was at five years old. I just really loved, I wanted to make computer games, like this whole path. That also led me into startups and eventually starting Dropbox. And then with AI specifically, I studied computer science, I got my, I did my undergrad, but I didn't do like grad level computer science. I didn't, I sort of got distracted by all the startup things, so I didn't do grad level work. But about several years ago, I made a couple of things. So one is I sort of, I knew I wanted to go from being an engineer to a founder. And then, but sort of the becoming a CEO part was sort of backed into the job. And so a couple of realizations. One is that, I mean, there's a lot of like repetitive and like manual work you have to do as an executive that is actually lends itself pretty well to automation, both for like my own convenience. And then out of interest in learning, I guess what we call like classical machine learning these days, I started really trying to wrap my head around understanding machine learning and informational retrieval more, more formally. So I'd say maybe 2016, 2017 started me writing these more successively, more elaborate scripts to like understand basic like classifiers and regression and, and again, like basic information retrieval and NLP back in those days. And there's sort of like two things that came out of that. One is techniques are super powerful. And even just like studying like old school machine learning was a pretty big inversion of the way I had learned engineering, right? You know, I started programming when everyone starts programming and you're, you're sort of the human, you're giving an algorithm to the, and spelling out to the computer how it should run it. And then machine learning, here's machine learning where it's like actually flip that, like give it sort of the answer you want and it'll figure out the algorithm, which was pretty mind bending. And it was both like pretty powerful when I would write tools, like figure out like time audits or like, where's my time going? Is this meeting a one-on-one or is it a recruiting thing or is it a product strategy thing? I started out doing that manually with my assistant, but then found that this was like a very like automatable task. And so, which also had the side effect of teaching me a lot about machine learning. But then there was this big problem, like anytime you, it was very good at like tabular structured data, but like anytime it hit, you know, the usual malformed English that humans speak, it would just like fall over. I had to kind of abandon a lot of the things that I wanted to build because like there's no way to like parse text. Like maybe it would sort of identify the part of speech in a sentence or something. But then fast forward to the LLM, I mean actually I started trying some of like this, what we would call like very small LLMs before kind of the GPT class models. And it was like super hard to get those things working. So like these 500 parameter models would just be like hallucinating and repeating and you know. So actually I'd kind of like written it off a little bit. But then the chat GPT launch and GPT-3 for sure. And then once people figured out like prompting and instruction tuning, this was sort of like November-ish 2022 like everybody else sort of that the chat GPT launch being the starting gun for the whole AI era of computing and then having API access to three and then early access to GPT-4. I was like, oh man, it's happening. And so I was literally on my honeymoon and we're like on a beach in Thailand and I'm like coding these like AI tools to automate like writing or to assist with writing and all these different use cases.
Alessio [00:04:14]: You're like, I'm never going back to work. I'm going to automate all of it before I get back.
Drew [00:04:17]: And I was just, you know, ever since then, I mean, I've always been like coding like prototypes and just stuff to make my life more convenient, but like escalated a lot after 22. And yeah, I spent, I checked, I think it was probably like over 400 hours this year so far coding because I had my paternity leave where I was able to work on some special projects. But yeah, it's a super important part of like my whole learning journey is like being really hands-on with these things. And I mean, it's probably not a typical recipe, but I really love to get down to the metal as far as how this stuff works.
Alessio [00:04:47]: Yeah. So Swyx and I were with Sam Altman in October 22. We were like at a hack day at OpenAI and that's why we started this podcast eventually. But you did an interview with Sam like seven years ago and he asked you what's the biggest opportunity in startups and you were like machine learning and AI and you were almost like too early, right? It's like maybe seven years ago, the models weren't quite there. How should people think about revalidating like expectations of this technology? You know, I think even today people will tell you, oh, models are not really good at X because they were not good 12 months ago, but they're good today.
Drew [00:05:19]: What's your project? Heuristics for thinking about that or how is, yeah, I think the way I look at it now is pretty, has evolved a lot since when I started. I mean, I think everybody intuitively starts with like, all right, let's try to predict the future or imagine like what's this great end state we're going to get to. And the tricky thing is like often those prognostications are right, but they're right in terms of direction, but not when. For example, you know, even in the early days of the internet, 90s when things were even like tech space and you know, even before like the browser or things like that, people were like, oh man, you're going to have, you know, you're going to be able to order food, get like a Snickers delivered to your house, you're going to be able to watch any movie ever created. And they were right. But they were like, you know, it took 20 years for that to actually happen. And before you got to DoorDash, you had to get, you started with like Webvan and Cosmo and before you get to Spotify, you had to do like Napster and Kazaa and LimeWire and like a bunch of like broken Britney Spears MP3s and malware. So I think the big lesson is being early is the same as being wrong. Being late is the same as being wrong. So really how do you calibrate timing? And then I think with AI, it's the same thing that people are like, oh, it's going to completely upend society and all these positive and negative ways. I think that's like most of those things are going to come true. The question is like, when is that going to happen? And then with AI specifically, I think there's also, in addition to sort of the general tech category or like jumping too fast to the future, I think that AI is particularly susceptible to that. And you look at self-driving, right? This idea of like, oh my God, you can have a self-driving car captured everybody's imaginations 10, 12 years ago. And you know, people are like, oh man, in two years, there's not going to be another year. There's not going to be a human driver on the road to be seen. It didn't work out that way, right? We're still 10, 12 years later where we're in a world where you can sort of sometimes get a Waymo in like one city on earth. Exciting, but just took a lot longer than people think. And the reason is there's a lot of engineering challenges, but then there's a lot of other like societal time constants that are hard to compress. So one thing I think you can learn from things like self-driving is they have these levels of autonomy that's a useful kind of framework in driving or these like maturity levels. People sort of skip to like level five, full autonomy, or we're going to have like an autonomous knowledge worker that's just going to take, that's going to, and then we won't need humans anymore kind of projection that that's going to take a long time. But then when you think about level one or level two, like these little assistive experiences, you know, we're seeing a lot of traction with those. So what you see really working is the level one autonomy in the AI world would be like the tab auto-complete and co-pilot, right? And then, you know, maybe a little higher is like the chatbot type interface. Obviously you want to get to the highest level you can to build a good product, but the reliability just isn't, and the capability just isn't there in the early innings. And so, and then you think of other level one, level two type things, like Google Maps probably did more for self-driving than in literal self-driving, like a billion people have like the ability to have like maps and navigation just like taken care of for you autonomously. So I think the timing and maturity are really important factors to include.
Alessio [00:08:23]: The thing with self-driving, maybe one of the big breakthroughs was like simulation. So it's like, okay, instead of driving, we can simulate these environments. It's really hard to do when knowledge work, you know, how do you simulate like a product review? How do you simulate these things? I'm curious if you've done any experiments. I know some companies have started to build kind of like a virtual personas that you can like bounce ideas off of.
Drew [00:08:42]: I mean, fortunately in a company you generate lots of, you know, actual human training data all the time. And then I also just like start with myself, like, all right, I can, you know, it's pretty tricky even within your company to be like, all right, let's open all this up as quote training data. But, you know, I can start with my own emails or my own calendar or own stuff without running into the same kind of like privacy or other concerns. So I often like start with my own stuff. And so that is like a one level of bootstrapping, but actually four or five years ago during COVID, we decided, you know, a lot of companies were thinking about how do we go back to work? And so we decided to really lean into remote and distributed work because I thought, you know, this is going to be the biggest change to the way we work in our lifetimes. And COVID kind of ripped up a bunch of things, but I think everybody was sort of pleasantly surprised how with a lot of knowledge work, you could just keep going. And actually you were sort of fine. Work was decoupled from your physical environment, from being in a physical place, which meant that things people had dreamed about since the fifties or sixties, like telework, like you actually could work from anywhere. And that was now possible. So we decided to really lean into that because we debated, should we sort of hit the fast forward button or should we hit the rewind button and go back to 2019? And obviously that's been playing out over the last few years. And we decided to basically turn, we went like 90% remote. We still, the in-person part's really important. We can kind of come back to our working model, but we're like, yeah, this is, everybody is going to be in some kind of like distributed or hybrid state. So like instead of like running away from this, like let's do a full send, let's really go into it. Let's live in the future. A few years before our customers, let's like turn Dropbox into a lab for distributed work. And we do that like quite literally, both of the working model and then increasingly with our products. And then absolutely, like we have products like Dropbox Dash, which is our universal search product. That was like very elevated in priority for me after COVID because like now you have, we're putting a lot more stress on the system and on our screens, it's a lot more chaotic and overwhelming. And so even just like getting the right information, the right person at the right time is a big fundamental challenge in knowledge work and these, in the distributed world, like big problem today is still getting, you know, has been getting bigger. And then for a lot of these other workflows, yeah, there's, we can both get a lot of natural like training data from just our own like strategy docs and processes. There's obviously a lot you can do with synthetic data and you know, actually like LMs are pretty good at being like imitating generic knowledge workers. So it's, it's kind of funny that way, but yeah, the way I look at it is like really turn Dropbox into a lab for distributed work. You think about things like what are the big problems we're going to have? It's just the complexity on our screens just keeps growing and the whole environment gets kind of more out of sync with what makes us like cognitively productive and engaged. And then even something like Dash was initially seeded, I made a little personal search engine because I was just like personally frustrated with not being able to find my stuff. And along that whole learning journey with AI, like the vector search or semantic search, things like that had just been the tooling for that. The open source stuff had finally gotten to a place where it was a pretty good developer experience. And so, you know, in a few days I had sort of a hello world type search engine and I'm like, oh my God, like this completely works. You don't even have to get the keywords right. The relevance and ranking is super good. We even like untuned. So I guess that's to say like I've been surprised by if you choose like the right algorithm and the right approach, you can actually get like super good results without having like a ton of data. And even with LLMs, you can apply all these other techniques to give them, kind of bootstrap kind of like task maturity pretty quickly.
Alessio [00:12:14]: Before we jump into Dash, let's talk about the Drew Haas and AI engineering stuff. So IDE, let's break that down. What IDE do you use? Do you use Cursor, VS Code, do you use any coding assistant, like WeChat, is it just autocomplete?
Drew [00:12:28]: Yeah, yeah. Both. So I use VS Code as like my daily driver, although I'm like super excited about things like Cursor or the AI agents. I have my own like stack underneath that. I mean, some off the shelf parts, some pretty custom. So I use the continue.dev just like AI chat UI basically as just the UI layer, but I also proxy the request. I proxy the request to my own backend, which is sort of like a router. You can use any backend. I mean, Sonnet 3.5 is probably the best all around. But then these things are like pretty limited if you don't give them the right context. And so part of what the proxy does is like there's a separate thing where I can say like include all these files by default with the request. And then it becomes a lot easier and like without like cutting and pasting. And I'm building mostly like prototype toy apps, so it's like a front end React thing and a Python backend thing. And so it can do these like end to end diffs basically. And then I also like love being able to host everything locally or do it offline. So I have my own, when I'm on a plane or something or where like you don't have access or the internet's not reliable, I actually bring a gaming laptop on the plane with me. It's like a little like blue briefcase looking thing. And then I like literally hook up a GPU like into one of the outlets. And then I have, I can do like transcription, I can do like autocomplete, like I have an 8 billion, like Llama will run fine.
Alessio [00:13:44]: And you're using like a Llama to run the model?
Drew [00:13:47]: No, I use, I have my own like LLM inference stack. I mean, it uses the backend somewhat interchangeable. So everything from like XLlama to VLLM or SGLang, there's a bunch of these different backends you can use. And then I started like working on stuff before all this tooling was like really available. So you know, over the last several years, I've built like my own like whole crazy environment and like in stack here. So I'm a little nuts about it.
Alessio [00:14:12]: Yeah. What's the state of the art for, I guess not state of the art, but like when it comes to like frameworks and things like that, do you like using them? I think maybe a lot of people say, hey, things change so quickly, they're like trying to abstract things. Yeah.
Drew [00:14:24]: It's maybe too early today. As much as I do a lot of coding, I have to be pretty surgical with my time. I don't have that much time, which means I have to sort of like scope my innovation to like very specific places or like my time. So for the front end, it'll be like a pretty vanilla stack, like a Next.js, React based thing. And then these are toy apps. So it's like Python, Flask, SQLite, and then all the different, there's a whole other thing on like the backend. Like how do you get, sort of run all these models locally or with a local GPU? The scaffolding on the front end is pretty straightforward, the scaffolding on the backend is pretty straightforward. Then a lot of it is just like the LLM inference and control over like fine grained aspects of how you do generation, caching, things like that. And then there's a lot, like a lot of the work is how do you take, sort of go to an IMAP, like take an email, get a new, or a document or a spreadsheet or any of these kinds of primitives that you work with and then translate them, render them in a format that an LLM can understand. So there's like a lot of work that goes into that too. Yeah.
Alessio [00:15:24]: So I built a kind of like email triage system and like I would say 80% of the code is like Google and like pulling emails and then the actual AI part is pretty easy.
Drew [00:15:34]: Yeah. And even, same experience. And then I tried to do all these like NLP things and then to my dismay, like a bunch of reg Xs were like, got you like 95% of the way there. So I still leave it running, I just haven't really built like the LLM powered version of it yet. Yeah.
Alessio [00:15:51]: So do you have any thoughts on rag versus long context, especially, I mean with Dropbox, you know? Sure. Do you just want to shove things in? Like have you seen that be a lot better?
Drew [00:15:59]: Well, they kind of have different strengths and weaknesses, so you need both for different use cases. I mean, it's been awesome in the last 12 months, like now you have these like long context models that can actually do a lot. You can put a book in, you know, Sonnet's context and then now with the later versions of LLAMA, you can have 128k context. So that's sort of the new normal, which is awesome and that, that wasn't even the case a year ago. That said, models don't always use, and certainly like local models don't use the full context well fully yet, and actually if you provide too much irrelevant context, the quality degrades a lot. And so I say in the open source world, like we're still just getting to the cusp of like the full context is usable. And then of course, like when you're something like Dropbox Dash, like it's basically building this whole like brain that's like read everything your company's ever written. And so that's not going to fit into your context window, so you need rag just as a practical reality. And even for a lot of similar reasons, you need like RAM and hard disk in conventional computer architecture. And I think these things will keep like horse trading, like maybe if, you know, a million or 10 million is the new, tokens is the new context length, maybe that shifts. Maybe the bigger picture is like, it's super exciting to talk about the LLM and like that piece of the puzzle, but there's this whole other scaffolding of more conventional like retrieval or conventional machine learning, especially because you have to scale up products to like millions of people you do in your toy app is not going to scale to that from a cost or latency or performance standpoint. So I think you really need these like hybrid architectures that where you have very like purpose fit tools, or you're probably not using Sonnet 3.5 for all of your normal product use cases. You're going to use like a fine tuned 8 billion model or sort of the minimum model that gets you the right output. And then a smaller model also is like a lot more cost and latency versus like much better characteristics on that front.
Alessio [00:17:48]: Yeah. Let's jump into the Dropbox AI story. So sure. Your initial prototype was Files GPT. How did it start? And then how did you communicate that internally? You know, I know you have a pretty strong like mammal culture. One where you're like, okay, Hey, we got to really take this seriously.
Drew [00:18:06]: Yeah. Well, on the latter, it was, so how do we say like how we took Dropbox, how AI seriously as a company started kind of around that time, that honeymoon time, unfortunately. In January, I wrote this like memo to the company, like around basically like how we need to play offense in 23. And that most of the time the kind of concrete is set and like the winners are the winners and things are kind of frozen. But then with these new eras of computing, like the PC or the internet or the phone or the concrete on freezes and you can sort of build, do things differently and have a new set of winners. It's sort of like a new season starts as a result of a lot of that sort of personal hacking and just like thinking about this. I'm like, yeah, this is an inflection point in the industry. Like we really need to change how we think about our strategy. And then becoming an AI first company was probably the headline thing that we did. And then, and then that got, and then calling on everybody in the company to really think about in your world, how is AI going to reshape your workflows or what sort of the AI native way of thinking about your job. File GPT, which is sort of this Dropbox AI kind of initial concept that actually came from our engineering team as, you know, as we like called on everybody, like really think about what we should be doing that's new or different. So it was kind of organic and bottoms up like a bunch of engineers just kind of hacked that together. And then that materialized as basically when you preview a file on Dropbox, you can have kind of the most straightforward possible integration of AI, which is a good thing. Like basically you have a long PDF, you want to be able to ask questions of it. So like a pretty basic implementation of RAG and being able to do that when you preview a file on Dropbox. So that was the origin of that, that was like back in 2023 when we released just like the starting engines had just, you know, gotten going.
Alessio [00:19:53]: It's funny where you're basically like these files that people have, they really don't want them in a way, you know, like you're storing all these files and like you actually don't want to interact with them. You want a layer on top of it. And that's kind of what also takes you to Dash eventually, which is like, Hey, you actually don't really care where the file is. You just want to be the place that aggregates it. How do you think about what people will know about files? You know, are files the actual file? Are files like the metadata and they're just kind of like a pointer that goes somewhere and you don't really care where it is?
Drew [00:20:21]: Yeah.
Alessio [00:20:22]: Any thoughts about?
Drew [00:20:23]: Totally. Yeah. I mean, there's a lot of potential complexity in that question, right? Is it a, you know, what's the difference between a file and a URL? And you can go into the technicals, it's like pass by value, pass by reference. Okay. What's the format like? All right. So it starts with a primitive. It's not really a flat file. It's like a structured data. You're sort of collaborative. Yeah. That's keeping in sync. Blah, blah, blah. I actually don't start there at all. I just start with like, what do people, like, what do humans, let's work back from like how humans think about this stuff or how they should think about this stuff. Meaning like, I don't think about, Oh, here are my files and here are my links or cloud docs. I'm just sort of like, Oh, here's my stuff. This, this, here's sort of my documents. Here's my media. Here's my projects. Here are the people I'm working with. So it starts from primitives more like those, like how do people, how do humans think about these things? And then, then start from like a more ideal experience. Because if you think about it, we kind of have this situation that will look like particularly medieval in hindsight where, all right, how do you manage your work stuff? Well, on all, you know, on one side of your screen, you have this file browser that literally hasn't changed since the early eighties, right? You could take someone from the original Mac and sit them in front of like a computer and they'd be like, this is it. And that's, it's been 40 years, right? Then on the other side of your screen, you have like Chrome or a browser that has so many tabs open, you can no longer see text or titles. This is the state of the art for how we manage stuff at work. Interestingly, neither of those experiences was purpose-built to be like the home for your work stuff or even anything related to it. And so it's important to remember, we get like stuck in these local maxima pretty often in tech where we're obviously aware that files are not going away, especially in certain domains. So that format really matters and where files are still going to be the tool you use for like if there's something big, right? If you're a big video file, that kind of format in a file makes sense. There's a bunch of industries where it's like construction or architecture or sort of these domain specific areas, you know, media generally, if you're making music or photos or video, that all kind of fits in the big file zone where Dropbox is really strong and that's like what customers love us for. It's also pretty obvious that a lot of stuff that used to be in, you know, Word docs or Excel files, like all that has tilted towards the browser and that tilt is going to continue. So with Dash, we wanted to make something that was really like cloud-native, AI-native and deliberately like not be tied down to the abstractions of the file system. Now on the other hand, it would be like ironic and bad if we then like fractured the experience that you're like, well, if it touches a file, it's a syncing metaphor to this app. And if it's a URL, it's like this completely different interface. So there's a convergence that I think makes sense over time. But you know, but I think you have to start from like, not so much the technology, start from like, what do the humans want? And then like, what's the idealized product experience? And then like, what are the technical underpinnings of that, that can make that good experience?
Alessio [00:23:20]: I think it's kind of intuitive that in Dash, you can connect Google Drive, right? Because you think about Dropbox, it's like, well, it's file storage, you really don't want people to store files somewhere, but the reality is that they do. How do you think about the importance of storage and like, do you kind of feel storage is like almost solved, where it's like, hey, you can kind of store these files anywhere, what matters is like access.
Drew [00:23:38]: It's a little bit nuanced in that if you're dealing with like large quantities of data, it actually does matter. The implementation matters a lot or like you're dealing with like, you know, 10 gig video files like that, then you sort of inherit all the problems of sync and have to go into a lot of the challenges that we've solved. Switching on a pretty important question, like what is the value we provide? What does Dropbox do? And probably like most people, I would have said like, well, Dropbox syncs your files. And we didn't even really have a mission of the company in the beginning. I'm just like, yeah, I just don't want to carry a thumb driving around and life would be a lot better if our stuff just like lived in the cloud and I just didn't have to think about like, what device is the thing on or what operating, why are these operating systems fighting with each other and incompatible? You know, I just want to abstract all of that away. But then so we thought, even we were like, all right, Dropbox provides storage. But when we talked to our customers, they're like, that's not how we see this at all. Like actually, Dropbox is not just like a hard drive in the cloud. It's like the place where I go to work or it's a place like I started a small business is a place where my dreams come true. Or it's like, yeah, it's not keeping files in sync. It's keeping people in sync. It's keeping my team in sync. And so they're using this kind of language where we're like, wait, okay, yeah, because I don't know, storage probably is a commodity or what we do is a commodity. But then we talked to our customers like, no, we're not buying the storage, we're buying like the ability to access all of our stuff in one place. We're buying the ability to share everything and sort of, in a lot of ways, people are buying the ability to work from anywhere. And Dropbox was kind of, the fact that it was like file syncing was an implementation detail of this higher order need that they had. So I think that's where we start too, which is like, what is the sort of higher order thing, the job the customer is hiring Dropbox to do? Storage in the new world is kind of incidental to that. I mean, it still matters for things like video or those kinds of workflows. The value of Dropbox had never been, we provide you like the cheapest bits in the cloud. But it is a big pivot from Dropbox is the company that syncs your files to now where we're going is Dropbox is the company that kind of helps you organize all your cloud content. I started the company because I kept forgetting my thumb drive. But the question I was really asking was like, why is it so hard to like find my stuff, organize my stuff, share my stuff, keep my stuff safe? You know, I'm always like one washing machine and I would leave like my little thumb drive with all my prior company stuff on in the pocket of my shorts and then almost wash it and destroy it. And so I was like, why do we have to, this is like medieval that we have to think about this. So that same mindset is how I approach where we're going. But I think, and then unfortunately the, we're sort of back to the same problems. Like it's really hard to find my stuff. It's really hard to organize myself. It's hard to share my stuff. It's hard to secure my content at work. Now the problem is the same, the shape of the problem and the shape of the solution is pretty different. You know, instead of a hundred files on your desktop, it's now a hundred tabs in your browser, et cetera. But I think that's the starting point.
Alessio [00:26:30]: How has the idea of a product evolved for you? So, you know, famously Steve Jobs started by Dropbox and he's like, you know, this is just a feature. It's not a product. And then you build like a $10 billion feature. How in the age of AI, how do you think about, you know, maybe things that used to be a product are now features because the AI on top of it, it's like the product, like what's your mental model? Do you think about it?
Drew [00:26:50]: Yeah. So I don't think there's really like a bright line. I don't know if like I use the word features and products and my mental model that much of how I break it down because it's kind of a, it's a good question. I mean, I don't not think about features, I don't think about products, but it does start from that place of like, all right, we have all these new colors we can paint with and all right, what are these higher order needs that are sort of evergreen, right? So people will always have stuff at work. They're always need to be able to find it or, you know, all the verbs I just mentioned. It's like, okay, how can we make like a better painting and how can we, and then how can we use some of these new colors? And then, yeah, it's like pretty clear that after the large models, the way you find stuff share stuff, it's going to be completely different after COVID, it's going to be completely different. So that's the starting point. But I think it is also important to, you know, you have to do more than just work back from the customer and like what they're trying to do. Like you have to think about, and you know, we've, we've learned a lot of this the hard way sometimes. Okay. You might start with a customer. You might start with a job to be on there. You're like, all right, what's the solution to their problem? Or like, can we build the best product that solves that problem? Right. Like what's the best way to find your stuff in the modern world? Like, well, yeah, right now the status quo for the vast majority of the billion, billion knowledge workers is they have like 10 search boxes at work that each search 10% of your stuff. Like that's clearly broken. Obviously you should just have like one search box. All right. So we can do that. And that also has to be like, I'll come back to defensibility in a second, but like, can we build the right solution that is like meaningfully better from the status quo? Like, yes, clearly. Okay. Then can we like get distribution and growth? Like that's sort of the next thing you learned is as a founder, you start with like, what's the product? What's the product? What's the product? Then you're like, wait, wait, we need distribution and we need a business model. So those are the next kind of two dominoes you have to knock down or sort of needles you have to thread at the same time. So all right, how do we grow? I mean, if Dropbox 1.0 is really this like self-serve viral model that there's a lot of, we sort of took a borrowed from a lot of the consumer internet playbook and like what Facebook and social media were doing and then translated that to sort of the business world. How do you get distribution, especially as a startup? And then a business model, like, all right, storage happened to be something in the beginning happened to be something people were willing to pay for. They recognize that, you know, okay, if I don't buy something like Dropbox, I'm going to have to buy an external hard drive. I'm going to have to buy a thumb drive and I have to pay for something one way or another. People are already paying for things like backup. So we felt good about that. But then the last domino is like defensibility. Okay. So you build this product or you get the business model, but then, you know, what do you do when the incumbents, the next chess move for them is I just like copy, bundle, kill. So they're going to copy your product. They'll bundle it with their platforms and they'll like give it away for free or no added cost. And, you know, we had a lot of, you know, scar tissue from being on the wrong side of that. Now you don't need to solve all four for all four or five variables or whatever at once or you can sort of have, you know, some flexibility. But the more of those gates that you get through, you sort of add a 10 X to your valuation. And so with AI, I think, you know, there's been a lot of focus on the large language model, but it's like large language models are a pretty bad business from a, you know, you sort of take off your tech lens and just sort of business lens. Like there's sort of this weirdly self-commoditizing thing where, you know, models only have value if they're kind of on this like Pareto frontier of size and quality and cost. Being number two, you know, if you're not on that frontier, the second the frontier moves out, which it moves out every week, like your model literally has zero economic value because it's dominated by the new thing. LLMs generate output that can be used to train or improve. So there's weird, peculiar things that are specific to the large language model. And then you have to like be like, all right, where's the value going to accrue in the stack or the value chain? And, you know, certainly at the bottom with Nvidia and the semiconductor companies, and then it's going to be at the top, like the people who have the customer relationship who have the application layer. Those are a few of the like lenses that I look at a question like that through.
Alessio [00:30:48]: Do you think AI is making people more careful about sharing the data at all? People are like, oh, data is important, but it's like, whatever, I'm just throwing it out there. Now everybody's like, but are you going to train on my data? And like your data is actually not that good to train on anyway. But like how have you seen, especially customers, like think about what to put in, what to not?
Drew [00:31:06]: I mean, everybody should be. Well, everybody is concerned about this and nobody should be concerned about this, right? Because nobody wants their personal companies information to be kind of ground up into little pellets to like sell you ads or train the next foundation model. I think it's like massively top of mind for every one of our customers, like, and me personally, and with my Dropbox hat on, it's like so fundamental. And, you know, we had experience with this too at Dropbox 1.0, the same kind of resistance, like, wait, I'm going to take my stuff on my hard drive and put it on your server somewhere. Are you serious? What could possibly go wrong? And you know, before that, I was like, wait, are you going to sell me, I'm going to put my credit card number into this website? And before that, I was like, hey, I'm going to take all my cash and put it in a bank instead of under my mattress. You know, so there's a long history of like tech and comfort. So in some sense, AI is kind of another round of the same thing, but the issues are real. And then when I think about like defensibility for Dropbox, like that's actually a big advantage that we have is one, our incentives are very aligned with our customers, right? We only get, we only make money if you pay us and you only pay us if we do a good job. So we don't have any like side hustle, you know, we're not training the next foundation model. You know, we're not trying to sell you ads. Actually we're not even trying to lock you into an ecosystem, like the whole point of Dropbox is it works, you know, everywhere. Because I think one of the big questions we've circling around is sort of like, in the world of AI, where should our lane be? Like every startup has to ask, or in every big company has to ask, like, where can we really win? But to me, it was like a lot of the like trust advantages, platform agnostic, having like a very clean business model, not having these other incentives. And then we also are like super transparent. We were transparent early on. We're like, all right, we're going to establish these AI principles, very table stakes stuff of like, here's transparency. We want to give people control. We want to cover privacy, safety, bias, like fairness, all these things. And we put that out up front to put some sort of explicit guardrails out where like, hey, we're, you know, because everybody wants like a trusted partner as they sort of go into the wild world of AI. And then, you know, you also see people cutting corners and, you know, or just there's a lot of uncertainty or, you know, moving the pieces around after the fact, which no one feels good about.
Alessio [00:33:14]: I mean, I would say the last 10, 15 years, the race was kind of being the system of record, being the storage provider. I think today it's almost like, hey, if I can use Dash to like access my Google Drive file, why would I pay Google for like their AI feature? So like vice versa, you know, if I can connect my Dropbook storage to this other AI assistant, how do you kind of think about that, about, you know, not being able to capture all the value and how open people will stay? I think today things are still pretty open, but I'm curious if you think things will get more closed or like more open later.
Drew [00:33:42]: Yeah. Well, I think you have to get the value exchange right. And I think you have to be like a trustworthy partner or like no one's going to partner with you if they think you're going to eat their lunch, right? Or if you're going to disintermediate them and like all the companies are quite sophisticated with how they think about that. So we try to, like, we know that's going to be the reality. So we're actually not trying to eat anyone's like Google Drive's lunch or anything. Actually we'll like integrate with Google Drive, we'll integrate with OneDrive, really any of the content platforms, even if they compete with file syncing. So that's actually a big strategic shift. We're not really reliant on being like the store of record and there are pros and cons to this decision. But if you think about it, we're basically like providing all these apps more engagement. We're like helping users do what they're really trying to do, which is to get, you know, that Google Doc or whatever. And we're not trying to be like, oh, by the way, use this other thing. This is all part of our like brand reputation. It's like, no, we give people freedom to use whatever tools or operating system they want. We're not taking anything away from our partners. We're actually like making it, making their thing more useful or routing people to those things. I mean, on the margin, then we have something like, well, okay, to the extent you do rag and summarize things, maybe that doesn't generate a click. Okay. You know, we also know there's like infinity investment going into like the work agents. So we're not really building like a co-pilot or Gemini competitor. Not because we don't like those. We don't find that thing like captivating. Yeah, of course. But just like, you know, you learn after some time in this business that like, yeah, there's some places that are just going to be such kind of red oceans or just like super big battlefields. Everybody's kind of trying to solve the same problem and they just start duplicating all each other effort. And then meanwhile, you know, I think the concern would be is like, well, there's all these other problems that aren't being properly addressed by AI. And I was concerned that like, yeah, and everybody's like fixated on the agent or the chatbot interface, but forgetting that like, hey guys, like we have the opportunity to like really fix search or build a self-organizing Dropbox or environment or there's all these other things that can be a compliment. Because we don't really want our customers to be thinking like, well, do I use Dash or do I use co-pilot? And frankly, none of them do. In a lot of ways, actually, some of the things that we do on the security front with Dash for Business are a good compliment to co-pilot. Because as part of Dash for Business, we actually give admins, IT, like universal visibility and control over all the different, what's being shared in your company across all these different platforms. And as a precondition to installing something like co-pilot or Dash or Glean or any of these other things, right? You know, IT wants to know like, hey, before we like turn all the lights in here, like let's do a little cleaning first before we let everybody in. And there just haven't been good tools to do that. And post AI, you would do it completely differently. And so that's like a big, that's a cornerstone of what we do and what sets us apart from these tools. And actually, in a lot of cases, we will help those tools be adopted because we actually help them do it safely. Yeah.
Alessio [00:36:27]: How do you think about building for AI versus people? It's like when you mentioned cleaning up is because maybe before you were like, well, humans can have some common sense when they look at data on what to pick versus models are just kind of like ingesting. Do you think about building products differently, knowing that a lot of the data will actually be consumed by LLMs and like agents and whatnot versus like just people?
Drew [00:36:46]: I think it'll always be, I aim a little bit more for like, you know, level three, level four kind of automation, because even if the LLM is like capable of completely autonomously organizing your environment, it probably would do a reasonable job. But like, I think you build bad UI when the sort of user has to fit itself to the computer versus something that you're, you know, it's like an instrument you're playing or something where you have some kind of good partnership. And you know, and on the other side, you don't have to do all this like manual effort. And so like the command line was sort of subsumed by like, you know, graphical UI. We'll keep toggling back and forth. Maybe chat will be, chat will be an increasing, especially when you bring in voice, like will be an increasing part of the puzzle. But I don't think we're going to go back to like a million command lines either. And then as far as like the sort of plumbing of like, well, is this going to be consumed by an LLM or a human? Like fortunately, like you don't really have to design it that differently. I mean, you have to make sure everything's legible to the LLM, but it's like quite tolerant of, you know, malformed everything. And actually the more, the easier it makes something to read for a human, the easier it is for an LLM to read to some extent as well. But we really think about what's that kind of right, how do we build that right, like human machine interface where you're still in control and driving, but then it's super easy to translate your intent into like the, you know, however you want your folder, setting your environment set up or like your preferences.
Alessio [00:38:05]: What's the most underrated thing about Dropbox that maybe people don't appreciate?
Drew [00:38:09]: Well, I think this is just such a natural evolution for us. It's pretty true. Like when people think about the world of AI, file syncing is not like the next thing you would auto complete mentally. And I think we also did like our first thing so well that there were a lot of benefits to that. But I think there also are like, we hit it so hard with our first product that it was like pretty tough to come up with a sequel. And we had a bit of a sophomore slump and you know, I think actually a lot of kids do use Dropbox through in high school or things like that, but you know, they're not, they're using, they're a lot more in the browser and then their file system, right. And we know all this, but still like we're super well positioned to like help a new generation of people with these fundamental problems and these like that affect, you know, a billion knowledge workers around just finding, organizing, sharing your stuff and keeping it safe. And there's, there's a ton of unsolved problems in those four verbs. We've talked about search a little bit, but just even think about like a whole new generation of people like growing up without the ability to like organize their things and yeah, search is great. And if you just have like a giant infinite pile of stuff, then search does make that more manageable. But you know, you do lose some things that were pretty helpful in prior decades, right? So even just the idea of persistence, stuff still being there when you come back, like when I go to sleep and wake up, my physical papers are still on my desk. When I reboot my computer, the files are still on my hard drive. But then when in my browser, like if my operating system updates the wrong way and closes the browser or if I just more commonly just declared tab bankruptcy, it's like your whole workspace just clears itself out and starts from zero. And you're like, on what planet is this a good idea? There's no like concept of like, oh, here's the stuff I was working on. Yeah, let me get back to it. And so that's like a big motivation for things like Dash. Huge problems with sharing, right? If I'm remodeling my house or if I'm getting ready for a board meeting, you know, what do I do if I have a Google doc and an air table and a 10 gig 4k video? There's no collection that holds mixed format things. And so it's another kind of hidden problem, hidden in plain sight, like he's missing primitives. Files have folders, songs have playlists, links have, you know, there's no, somehow we miss that. And so we're building that with stacks in Dash where it's like a mixed format, smart collection that you can then, you know, just share whatever you need internally, externally and have it be like a really well designed experience and platform agnostic and not tying you to any one ecosystem. We're super excited about that. You know, we talked a little bit about security in the modern world, like IT signs all these compliance documents, but in reality has no way of knowing where anything is or what's being shared. It's actually better for them to not know about it than to know about it and not be able to do anything about it. And when we talked to customers, we found that there were like literally people in IT whose jobs it is to like manually go through, log into each, like log into office, log into workspace, log into each tool and like go comb through one by one the links that people have shared and like unshares. There's like an unshare guy in all these companies and that that job is probably about as fun as it sounds like, my God. So there's, you know, fortunately, I guess what makes technology a good business is for every problem it solves, it like creates a new one, so there's always like a sequel that you need. And so, you know, I think the happy version of our Act 2 is kind of similar to Netflix. I look at a lot of these companies that really had multiple acts and Netflix had the vision to be streaming from the beginning, but broadband and everything wasn't ready for it. So they started by mailing you DVDs, but then went to streaming and then, but the value probably the whole time was just like, let me press play on something I want to see. And they did a really good job about bringing people along from the DVD mailing off. You would think like, oh, the DVD mailing piece is like this burning platform or it's like legacy, you know, ankle weight. And they did have some false starts in that transition. But when you really think about it, they were able to take that DVD mailing audience, move, like migrate them to streaming and actually bootstrap a, you know, take their season one people and bootstrap a victory in season two, because they already had, you know, they weren't starting from scratch. And like both of those worlds were like super easy to sort of forget and be like, oh, it's all kind of destiny. But like, no, that was like an incredibly competitive environment. And Netflix did a great job of like activating their Act 1 advantages and winning in Act 2 because of it. So I don't think people see Dropbox that way. I think people are sort of thinking about us just in terms of our Act 1 and they're like, yeah, Dropbox is fine. I used to use it 10 years ago. But like, what have they done for me lately? And I don't blame them. So fortunately, we have like better and better answers to that question every year.
Alessio [00:42:39]: And you call it like the silicon brain. So you see like Dash and Stacks being like the silicon brain interface, basically for
Drew [00:42:46]: people. I mean, that's part of it. Yeah. And writ large, I mean, I think what's so exciting about AI and everybody's got their own kind of take on it, but if you like really zoom out civilizationally and like what allows humans to make progress and, you know, what sort of is above the fold in terms of what's really mattered. I certainly want to, I mean, there are a lot of points, but some that come to mind like you think about things like the industrial revolution, like before that, like mechanical energy, like the only way you could get it was like by your own hands, maybe an animal, maybe some like clever sort of machines or machines made of like wood or something. But you were quite like energy limited. And then suddenly, you know, the industrial revolution, things like electricity, it suddenly is like, all right, mechanical energy is now available on demand as a very fungible kind of, and then suddenly we consume a lot more of it. And then the standard of living goes way, way, way, way up. That's been pretty limited to the physical realm. And then I believe that the large models, that's really the first time we can kind of bottle up cognitive energy and offloaded, you know, if we started by offloading a lot of our mechanical or physical busy work to machines that freed us up to make a lot of progress in other areas. But then with AI and computing, we're like, now we can offload a lot more of our cognitive busy work to machines. And then we can create a lot more of it. Price of it goes way down. Importantly, like, it's not like humans never did anything physical again. It's sort of like, no, but we're more leveraged. We can move a lot more earth with a bulldozer than a shovel. And so that's like what is at the most fundamental level, what's so exciting to me about AI. And so what's the silicon brain? It's like, well, we have our human brains and then we're going to have this other like half of our brain that's sort of coming online, like our silicon brain. And it's not like one or the other. They complement each other. They have very complimentary strengths and weaknesses. And that's, that's a good thing. There's also this weird tangent we've gone on as a species to like where knowledge work, knowledge workers have this like epidemic of, of burnout, great resignation, quiet quitting. And there's a lot going on there. But I think that's one of the biggest problems we have is that be like, people deserve like meaningful work and, you know, can't solve all of it. But like, and at least in knowledge work, there's a lot of own goals, you know, enforced errors that we're doing where it's like, you know, on one side with brain science, like we know what makes us like productive and fortunately it's also what makes us engaged. It's like when we can focus or when we're some kind of flow state, but then we go to work and then increasingly going to work is like going to a screen and you're like, if you wanted to design an environment that made it impossible to ever get into a flow state or ever be able to focus, like what we have is that. And that was the thing that just like seven, eight years ago just blew my mind. I'm just like, I cannot understand why like knowledge work is so jacked up on this adventure. It's like, we, we put ourselves in like the most cognitively polluted environment possible and we put so much more stress on the system when we're working remotely and things like that. And you know, all of these problems are just like going in the wrong direction. And I just, I just couldn't understand why this was like a problem that wasn't fixing itself. And I'm like, maybe there's something Dropbox can do with this and you know, things like Dash are the first step. But then, well, so like what, well, I mean, now like, well, why are humans in this like polluted state? It's like, well, we're just, all of the tools we have today, like this generation of tools just passes on all of the weight, the burden to the human, right? So it's like, here's a bajillion, you know, 80,000 unread emails, cool. Here's 25 unread Slack channels. Here's, we all get started like, it's like jittery like thinking about it. And then you look at that, you're like, wait, I'm looking at my phone, it says like 80,000 unread things. There's like no question, product question for which this is the right answer. Fortunately, that's why things like our silicon brain are pretty helpful because like they can serve as like an attention filter where it's like, actually, computers have no problem reading a million things. Humans can't do that, but computers can. And to some extent, this was already happening with computer, you know, Excel is an aversion of your silicon brain or, you know, you could draw the line arbitrarily. But with larger models, like now so many of these little subtasks and tasks we do at work can be like fully automated. And I think, you know, I think it's like an important metaphor to me because it mirrors a lot of what we saw with computing, computer architecture generally. It's like we started out with the CPU, very general purpose, then GPU came along much better at these like parallel computations. We talk a lot about like human versus machine being like substituting, it's like CPU, GPU, it's not like one is categorically better than the other, they're complements. Like if you have something really parallel, use a GPU, if not, use a CPU. The whole relationship, that symbiosis between CPU and GPU has obviously evolved a lot since, you know, playing Quake 2 or something. But right now we have like the human CPU doing a lot of, you know, silicon CPU tasks. And so you really have to like redesign the work thoughtfully such that, you know, probably not that different from how it's evolved in computer architecture, where the CPU is sort of an orchestrator of these really like heavy lifting GPU tasks. That dividing line does shift a little bit, you know, with every generation. And so I think we need to think about knowledge work in that context, like what are human brains good at? What's our silicon brain good at? Let's resegment the work. Let's offload all the stuff that can be automated. Let's go on a hunt for like anything that could save a human CPU cycle. Let's give it to the silicon one. And so I think we're at the early earnings of actually being able to do something about it.
Alessio [00:48:00]: It's funny, I gave a talk to a few government people earlier this year with a similar point where we used to make machines to release human labor. And then the kilowatt hour was kind of like the unit for a lot of countries. And now you're doing the same thing with the brain and the data centers are kind of computational power plants, you know, they're kind of on demand tokens. You're on the board of Meta, which is the number one donor of Flops for the open source world. The thing about open source AI is like the model can be open source, but you need to carry a briefcase to actually maybe run a model that is not even that good compared to some of the big ones. How do you think about some of the differences in the open source ethos with like traditional software where it's like really easy to run and act on it versus like models where it's like it might be open source, but like I'm kind of limited, sort of can do with it?
Drew [00:48:45]: Yeah, well, I think with every new era of computing, there's sort of a tug of war between is this going to be like an open one or a closed one? And, you know, there's pros and cons to both. It's not like open is always better or open always wins. But, you know, I think you look at how the mobile, like the PC era and the Internet era started out being more on the open side, like it's very modular. Everybody sort of party that everybody could, you know, come to some downsides of that security. But I think, you know, the advent of AI, I think there's a real question, like given the capital intensity of what it takes to train these foundation models, like are we going to live in a world where oligopoly or cartel or all, you know, there's a few companies that have the keys and we're all just like paying them rent. You know, that's one future. Or is it going to be more open and accessible? And I'm like super happy with how that's just I find it exciting on many levels with all the different hats I wear about it. You know, fortunately, you've seen in real life, yeah, even if people aren't bringing GPUs on a plane or something, you've seen like the price performance of these models improve 10 or 100x year over year, which is sort of like many Moore's laws compounded together for a bunch of reasons like that wouldn't have happened without open source. Right. You know, for a lot of same reasons, it's probably better that we can anyone can sort of spin up a website without having to buy an internet information server license like there was some alternative future. So like things are Linux and really good. And there was a good balance of trade to where like people contribute their code and then also benefit from the community returning the favor. I mean, you're seeing that with open source. So you wouldn't see all this like, you know, this flourishing of research and of just sort of the democratization of access to compute without open source. And so I think it's been like phenomenally successful in terms of just moving the ball forward and pretty much anything you care about, I believe, even like safety. You can have a lot more eyes on it and transparency instead of just something is happening. And there was three places with nuclear power plants attached to them. Right. So I think it's it's been awesome to see. And then and again, for like wearing my Dropbox hat, like anybody who's like scaling a service to millions of people, again, I'm probably not using like frontier models for every request. It's, you know, there are a lot of different configurations, mostly with smaller models. And even before you even talk about getting on the device, like, you know, you need this whole kind of constellation of different options. So open source has been great for that.
Alessio [00:51:06]: And you were one of the first companies in the cloud repatriation. You kind of brought back all the storage into your own data centers. Where are we in the AI wave for that? I don't think people really care today to bring the models in-house. Like, do you think people will care in the future? Like, especially as you have more small models that you want to control more of the economics? Or are the tokens so subsidized that like it just doesn't matter? It's more like a principle. Yeah. Yeah.
Drew [00:51:30]: I mean, I think there's another one where like thinking about the future is a lot easier if you start with the past. So, I mean, there's definitely this like big surge in demand as like there's sort of this FOMO driven bubble of like all of big tech taking their headings and shipping them to Jensen for a couple of years. And then you're like, all right, well, first of all, we've seen this kind of thing before. And in the late 90s with like Fiber, you know, this huge race to like own the internet, own the information superhighway, literally, and then way overbuilt. And then there was this like crash. I don't know to what extent, like maybe it is really different this time. Or, you know, maybe if we create AGI that will sort of solve the rest of the, or we'll just have a different set of things to worry about. But, you know, the simplest way I think about it is like this is sort of a rent not buy phase because, you know, I wouldn't want to be, we're still so early in the maturity, you know, I wouldn't want to be buying like pallets of over like of 286s at a 5x markup when like the 386 and 486 and Pentium and everything are like clearly coming there around the corner. And again, because of open source, there's just been a lot more competition at every layer in the stack. And so product developers are basically beneficiaries of that. You know, the things we can do with the sort of cost estimates I was looking at a year or two ago to like provide different capabilities in the product, you know, cut, right, you know, slashing by 10, 100, 1000x. I think about coming back around. I mean, I think, you know, at some point you have to believe that the sort of supply and demand will even out as it always does. And then there's also like non-NVIDIA stacks like the Grok or Cerebris or some of these custom silicon companies that are super interesting and outperformed NVIDIA stack in terms of latency and things like that. So I guess it'd be a pretty exciting change. I think we're not close to the point where we were with like hard drives or storage when we sort of went back from the public cloud because like there it was like, yeah, the cost curves are super predictable. We know what the cost of a hard drive and a server and, you know, terabyte of bandwidth and all the inputs are going to just keep going down, riding down this cost curve. But to like rely on the public cloud to pass that along is sort of, we need a better strategy than like relying on the kindness of strangers. So we decided to bring that in house and still do, and we still get a lot of advantages. That said, like the public cloud is like scaled and been like a lot more reliable and just good all around than we would have predicted because actually back then we were worried like, is the public cloud going to even scale fast enough to where to keep up with us? But yeah, I think we're in the early innings. It's a little too chaotic right now. So I think renting and not sort of preserving agility is pretty important in times like these. Yeah.
Alessio [00:54:01]: We just went to the Cerebrus factory to do an episode there. We saw one of their data centers inside. Yeah. It's kind of like, okay, if this really works, you know, it kind of changes everything.
Drew [00:54:13]: And that is one of the things there, like this is one where you could just have these things that just like, okay, there's just like a new kind of piece on the chessboard, like recalc everything. So I think there's still, I mean, this is like not that likely, but I think this is an area where it actually could, you could have these sort of like, you know, and out of nowhere, all of a sudden, you know, everything's different. Yeah.
Alessio [00:54:33]: I know one of the management books he references, Ending Growth's, I'm only the paranoid survive.
Drew [00:54:37]: Yeah.
Alessio [00:54:37]: Maybe if you look at Intel, they did a great job memory to chip, but then it's like maybe CPU to GPU, they kind of missed that thing. Yeah. How do you think about staying relevant for so long now? It's been 17 years you've been doing Dropbox.
Drew [00:54:50]: What's the secret?
Alessio [00:54:50]: And maybe we can touch on founder mode and all of that. Yeah.
Drew [00:54:55]: Well, first, what makes tech exciting and also makes it hard is like, there's no standing still, right? And your customers never are like, oh no, we're good now. They always want more just, and then the ground is shifting under you or it's like, oh yeah, well, files are not even that relevant to the modern. I mean, it's still important, but like, you know, so much is tilted elsewhere. So I think you have to like always be moving and think about on the one level, like what is, and thinking of these different layers of abstraction, like, well, yeah, the technical service we provide is file syncing and storage in the past, but in the future it's going to be different. The way Netflix had to look at, well, technically we mail people physical DVDs and fulfillment centers, and then we have to switch like streaming and codex and bandwidth and data centers. So you, you, you do have to think about that level, but then it's like our, what's the evergreen problem we're solving is an important problem. Can we build the best product? Can we get distribution? Can we get a business model? Can we defend ourselves when we get copied? And then having like some context of like history has always been like one of the reading about the history, not just in tech, but of business or government or sports or military, these things that seem like totally new, you know, and to me would have been like totally new as a 25 year old, like, oh my God, the world's completely different and everything's going to change. You're like, well, there's not a lot of great things about getting older, but you do see like, well, no, this actually has like a million like precedents and you can actually learn a lot from, you know, about like the future of GPUs from like, I don't know how, you know, how formula one teams work or you can draw all these like weird analogies that are super helpful in guiding you from first principles or through a combination of first principles and like past context. But like, you know, build s**t we're really proud of. Like, that's a pretty important first step and really think about like, you sort of become blind to like how technology works as that's just the way it works. And even something like carrying a thumb drive, you're like, well, I'd much rather have a thumb drive than like literally not have my stuff or like have to carry a big external hard drive around. So you're always thinking like, oh, this is awesome. Like I ripped CDs and these like MP3s and these files and folders. This is the best. But then you miss on the other side. You're like, this isn't the end, right? MP3s and folders. It's like an Apple comes along. It's like, this is dumb. You should have like a catalog, artists, playlists, you know, that Spotify is like, Hey, this is dumb. Like you should, why are you buying these things? All the cards, it's the internet. You should have access to everything. And then by the way, why is this like such a single player experience? You should be able to share and they should have, there should be AI curated, et cetera, et cetera. And then a lot of it is also just like drawing, connecting dots between different disciplines, right? So a lot of what we did to make Dropbox successful is like we took a lot of the consumer internet playbook, applied it to business software from a virality and kind of ease of use standpoints. And then, you know, I think there's a lot of, you can draw from the consumer realm and what's worked there and that hasn't been ported over to business, right? So a lot of what we think about is like, yeah, when you sign into Netflix or Spotify or YouTube or any consumer experience, like what do you see? Well, you don't see like a bunch of titles starting with AA, right? You see like this whole, and it went on evolution, right? Like we talked about music and TV went through the same thing, like 10 channels over the air broadcast to 30 channels, a hundred channels, but that's something like a thousand channels. You're like, this has totally lost the plot. So we're sort of in the thousand channels era of productivity tools, which is like, wait, wait, we just need to like rethink the system here and we don't need another thousand channels. We need to redesign the whole experience. And so I think the consumer experiences that are like smart, you know, when you sign into Netflix, it's not like a thousand channels. It's like, here are a bunch of smart defaults. Even if you're a new signup, we don't know anything about you, but because of what the world is watching, here are some, you know, reasonable suggestions. And then it's like, okay, I watched drive to survive. I didn't watch squid game. You know, the next time I sign in, it's like a complete, it's a learning system, right? So a combination of design, machine learning, and just like the courage to like rethink the whole thing. I think that's, that's a pretty reliable recipe. And then you think you're like, all right, there's all that intelligence in the consumer experience. There's no filing things away. Everything's, there's all this sort of auto curated for you and sort of self optimizing. Then you go to work and you're like, there's not even an attempt to incorporate any intelligence or organization anywhere in this experience. And so like, okay, can we do something about that?
Alessio [00:58:57]: You know, you're one of the last founder CEOs, like you would talk, then you're like, Toby Lute, some of these folks.
Drew [00:59:03]: How, how does that change? I'm like 300 years old and why can't I be a founder CEO?
Alessio [00:59:07]: I was saying like when you run, when you run a company, like you've had multiple executives over the years, like how important is that for the founder to be CEO and just say, Hey, look, we're changing the way the company and the strategy works. It's like, we're really taking this seriously versus like you could be a public CEO and be like, Hey, I got my earnings call and like whatever, I just need to focus on getting the right numbers. Like how does that change the culture in the company? Yeah.
Drew [00:59:29]: Well, I think it's sort of dovetails with the founder mode whole thing. You know, I think founder mode is kind of this Rorschach test. It's, it's sort of like ill specified. So it's sort of like whatever you, you know, it is whatever you see it. I think it's also like a destination you get to more than like a state of mind. Right. So if you think about, you know, imagine someone, there was something called surgeon mode, you know, given a med student, the scalpel on day one, it's like, okay, hold up. You know, so there's something to be said for like experience and conviction and you know, you're going to do a lot better. A lot of things are a lot easier for me, like 17 years into it than they were one year into it. I think part of why founder mode is so resonant is, or it's like striking such a chord with so many people is, yeah, there's, there's a real power when you have like a directive, intuitive leader who can like decisively take the company like into the future. It's like, how the hell do you get that? Um, and I think every founder who makes it this long, like kind of can't help it, but to learn a lot during that period. And you talk about the, you know, Steve jobs or Elan's of the world, they, they did go through like wandering a period of like wandering in the desert or like nothing was working and they weren't the cool kids. I think you either sort of like unsubscribe or kind of get off the train during that. And I don't blame anyone for doing that. There are many times where I thought about that, but I think at some point you sort of, it all comes together and you sort of start being able to see the matrix. So you've sort of seen enough and learned enough. And as long as you keep your learning rate up, you can kind of surprise yourself in terms of like how capable you can become over a long period. And so I think there's a lot of like founder CEO journey, especially as an engineer. Like, you know, I never like set out to be a CEO. In fact, like the more I like understood in the early days, what CEOs did, the more convinced I was that I was like not the right person actually. And it was only after some like shoving by a previous mentor, like, Hey, don't just, just go try it. And if you don't like it, then you don't have to do it forever. So I think you start founder mode, you're, you're sort of default that because there's like, you realize pretty quickly, like nothing gets done in this company unless the founders are literally doing it by hand, then you scale. And then you're like, you get, you know, a lot of actually pretty good advice that like, you can't do everything yourself. Like you actually do need to hire people and like give them real responsibilities and empower people. And that's like a whole discipline called like management that, you know, we're not figuring out for the first time here, but then you, then there's a tendency to like lean too far back, you know, it's tough. And if you're like a 30 year old and you hire a 45 year old exec from, you know, high-flying company and a guy who was running like a $10 billion P&L and came to work for Dropbox where we were like a fraction of a billion dollar P&L and, you know, what am I going to tell him about sales? Right. And so you sort of recognize pretty quickly, like, I actually don't know a lot about all these different disciplines and like, maybe I should lean back and like let people do their thing. But then you can create this, like, if you lean too far back out, you create this sort of like vacuum, leadership vacuum where people are like, what are we doing? And then, you know, the system kind of like nature reports a vacuum, it builds all these like kind of weird structures just to keep the thing like standing up. And then at some point you learn enough of this that you're like, wait, this is not how this should be designed. And you actually get like the conviction and you learn enough to like know what to do and things like that. And then on the other side, you lean way back in. I think it's more of like a table flipping where you're like, hey, this company is like not running the way I want it. Like something, I don't know what happened, but it's going to be like this now. And I think that that's like an important developmental stage for a founder CEO. And if you can do it right and like make it to that point, like then the job becomes like a lot of fun and exciting and good things happen for the company, good things for happening for your customers. But it's not, it's like a really rough, you know, learning journey. It is. It is.
Alessio [01:03:10]: I've had many therapy sessions with founder CEOs. Let's go back to the beginning. Like today, the AI wave is like so big that like a lot of people are kind of scared to jump in the water. And when you started Dropbox, one article said, fortunately, the Dropbox founders are too stupid to know everyone's already tried this. In AI now, it kind of feels the same. You have a lot of companies that sound the same, but like none of them are really working. So obviously the problem is not solved. Do you have any advice for founders trying to navigate like the idea maze today on like what they should do? What are like counterintuitive things maybe to try?
Drew [01:03:45]: Well, I think like, you know, bringing together some of what we've covered, I think there's a lot of very common kind of category errors that founders make. One is, you know, I think he's starting from the technology versus starting from like a customer or starting from a use case. And I think every founder has to start with what you know. Like you're, yeah, you know, maybe if you're an engineer, you know how to build a product, but don't know any of the other next, you know, hurdle. You don't know much about the next hurdles you have to go through. So I think, I think the biggest lesson would be you have to keep your personal growth curve out of the company's growth curve. And for me, that meant you have to be like super systematic about training up what you don't know, because no one's going to do that for you. Your investors aren't going to do that. Like literally no one else will do that for you. And so then, then you have to have like, all right, well, and I think the most important, one of the most helpful questions to ask there is like, in five years from now, what do I wish I had been learning today? In three years from now, what do I wish in one year? You know, how will my job be different? How do I work back from that? And so, for example, you know, when I was just starting in 2007, it really was just like coding and talking to customers. And it's sort of like the YC ethos, you know, make something people want and coding and talking to customers are really all you should be doing in that early phase. But then if I were like, all right, well, that's sort of YC phase, what's, what are the next hurdles? Well, a year from now, then I'm going to need, but to get people, we're going to need fundraise, like raise money. Okay. To raise money, we're going to have to like, have to answer all these questions. We have to see like work back from that. And you're like, all right, we need to become like an expert in like venture capital financing. And then, you know, the circle keeps expanding. Then if we have a bunch of money, we're going to need like accountants and lawyers and employees. And I'm not to start managing people. Then two years would be like, well, we're gonna have this like products, but then we're gonna need users. We need money revenue. And then in five years, it'd be like, yeah, we're going to be like tangling with like Microsoft, Google, Apple, Facebook, everybody. And like, somehow we're going to feel like deal with that. And then that's like what the company's got to deal with. And as CEO, I'm going to be responsible for all that. But then like my personal growth, there's all these skills I'm going to need. I'm going to need to know like what marketing is and like what finance is and how to manage people, how to be a leader, whatever that is. And so, and then I think one thing people often do is like, oof, like that it's like imposter syndrome kind of stuff. You're like, oh, it seems so remote or far away that, or I'm not comfortable speaking publicly or I've never managed people before. I haven't this. I haven't been like, and maybe even learning a little bit about it makes it feel even worse. He's like, now I, I thought I didn't know a lot. Now I know I don't know a lot, right. Part of it is more technical. Like how do I learn all these different disciplines and sort of train myself and a lot of that's like reading, you know, having founders or community that are sort of going through the same thing. So that's, that was how I learned. Maybe reading was the single most helpful thing more than any one person or, or talking to people like reading books. But then there's a whole mindset piece of it, which is sort of like, you have to cut yourself a little bit of slack. Like, you know, I wish someone had sort of sat me down and told me like, dude, you may be an engineer, but like, look, all the tech founders that, you know, tech CEOs that you admire, like they actually all, you know, almost all of them started out as engineers, they learned the business stuff on the job. So like, this is actually something that's normal and achievable. You're not like broken for not knowing, you know, no, those people didn't, weren't like, didn't come out of the womb with like shiny hair and Armani suit. You know, you can learn this stuff. So even just like knowing it's learnable and then second, like, but I think there's a big piece of it around like discomfort where it's like, I mean, we're like kind of pushing the edges. I don't know if I want to be CEO or I don't know if I'm ready for this, this, this, like learning to like walk towards that when you want to run away from it. And then lastly, I think, you know, just recognizing the time constant. So five weeks, you're not going to be a great leader or manager or a great public speaker or whatever, you know, think any more than you'll be a great guitar players, you know, play sport that well, or be a surgeon. But in like five years, like actually you can be pretty good at any of those things. Maybe you won't be like fully expert, but you like a lot more latent potential. You know, people have a lot more latent potential than they fully appreciate, but it doesn't happen by itself. You have to like carve out time and really be systematic about unlocking it.
Alessio [01:07:36]: How do you think about that for building your team? I know you're a big Pat's fan. Obviously the, that's a great example of building a dynasty on like some building blocks and bringing people into the system. When you're building a company, like how much slack do you have people on, Hey, you're going to learn this versus like, how do you measure like the learning grade of the people you hire? And like, how do you think about picking and choosing? Great question.
Drew [01:07:56]: It's hard. Um, what you want is a balance, right? And we've had a lot of success with great leaders who actually grew up with a company, started as an IC engineer or something, then made their way to whatever level our exec team is populated with a lot of those folks. But, but yeah, but there's also a lot of benefit to experience and having seen different environments and kind of been there, done that. And there's a lot of drawbacks to kind of learning by trial and error only. Um, and then even your high potential people like can go up the learning curve faster if they have like someone experienced to learn from now, like experiences in a panacea, either you can, you know, have various organ rejection or misfit or like overfitting from their past experience or cultural mismatches or, you know, you name it, I've seen it all. I've done, I've kind of gotten all the mistake merit badges on that. But I think it's like constructing a team where there's a good balance, like, okay, for the high potential folks who are sort of in the biggest jobs, their lives can, do they either have someone that they're managing them that they can learn from, you know, as a CEO, part of your job or as a manager, like you have to like surround or they help support them. So getting the mentors are getting first time execs like mentors who have been there, done that, or, um, getting them in like, you know, there's usually for any function, there's usually like a social group, like, Oh, chiefs of staff of Silicon Valley. Okay. Like, you know, there's usually these informal kind of communities you can join. And then, um, yeah, you just don't want to be too rotated in one direction or the other, because we've, we've done it. We've like overdone it on the high potential piece, but then like everybody's kind of making dumb mistakes, the bad mistakes are the ones where you're like, either you're making it multiple times or like these are known knowns to the industry, but if they're not known, known, if they're like unknown unknowns to your team, then you're doing, you have a problem. And then again, if you have too much, if you've just only hire external people, like then you're sort of at the mercy, you'll be like whatever random average of whatever culture or practices they bring in can create resentment or like lack of career opportunities. Um, so it's really about how do you get, you know, it doesn't really matter if it's like exactly 50 50, I don't think about a sort of perfect balance, but you just need to be sort of tending that garden continuously. Awesome.
Alessio [01:09:57]: Drew, just to wrap, do you have any call to actions? Like who should come work at Dropbox? Like who should use Dropbox? Anything you want, uh, you want to tell people?
Drew [01:10:06]: Well, I'm super, I mean, today's a super exciting day for, cause we just launched dash for business and, you know, we've talked a little bit about the product. It's like universal search, universal access control, a lot of rethinking, sharing for the modern environment. But you know, what's personally exciting, you could talk about the product, but like the, it's just really exciting for me to like, yeah, this is like the first, like most major and most public step we've taken from our kind of Dropbox 1.0 roots. And there's probably a lot of people out there who either like grew up not using Dropbox or like, yeah, I used Dropbox like 10 years ago and it was cool, but I don't do that much of fun. So I think there's a lot of new reasons to kind of tune into what we're doing. And, and it's a lot of, it's been a lot of fun to, I think like the sort of the AI era has created all these new like paths forward for Dropbox that wouldn't have been here five years ago. And then, yeah, to the founders, like, you know, hang in there, do some reading and don't be too stressed about it. So we're pretty lucky to get to do what we do. Yeah.
Alessio [01:11:05]: Watch the Pats documentary on Apple TV.
Drew [01:11:08]: Yeah, Bill Belichick. I'm still Pats fan. Really got an F1. So we're technology partners with McLaren. They're doing super well.
Alessio [01:11:15]: So were you a McLaren fan before you were technology partner? So did you become partners?
Drew [01:11:19]: It's sort of like co-evolved. Yeah. I mean, I was a fan beforehand, but I'm like a lot more of a fan now, as you'd imagine.
Alessio [01:11:24]: Awesome. Well, thank you so much for the time, Drew. This was great. It was a lot of fun.
Drew [01:11:28]: Thanks for having me.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
Production AI Engineering starts with Evals — with Ankur Goyal of Braintrust
11 okt· Latent Space: The AI Engineer Podcast
We are in 🗽 NYC this Monday! Join the AI Eng NYC meetup, bring demos and vibes!
It is a bit of a meme that the first thing developer tooling founders think to build in AI is all the non-AI operational stuff outside the AI. There are well over 60 funded LLM Ops startups all with hoping to solve the new observability, cost tracking, security, and reliability problems that come with putting LLMs in production, not to mention new LLM oriented products from incumbent, established ops/o11y players like Datadog and Weights & Biases.
2 years in to the current hype cycle, the early winners have tended to be people with practical/research AI backgrounds rather than MLOps heavyweights or SWE tourists:
* LangSmith: We covered how Harrison Chase worked on AI at Robust Intelligence and Kensho, the alma maters of many great AI founders
* HumanLoop: We covered how Raza Habib worked at Google AI during his PhD
* BrainTrust: Today’s guest Ankur Goyal founded Impira pre-Transformers and was acquihired to run Figma AI before realizing how to solve the Ops problem.
There have been many VC think pieces and market maps describing what people thought were the essential pieces of the AI Engineering stack, but what was true for 2022-2023 has aged poorly. The basic insight that Ankur had is the same thesis that Hamel Husain is pushing in his World’s Fair talk and podcast with Raza and swyx:
Evals are the centerpiece of systematic AI Engineering.
REALLY believing in this is harder than it looks with the benefit of hindsight. It’s not like people didn’t know evals were important. Basically every LLM Ops feature list has them. It’s an obvious next step AFTER managing your prompts and logging your LLM calls. In fact, up til we met Braintrust, we were working on an expanded version of the Impossible Triangle Theory of the LLM Ops War that we first articulated in the Humanloop writeup:
The single biggest criticism of the Rise of the AI Engineer piece is that we neglected to split out the role of product evals (as opposed to model evals) in the now infamous “API line” chart:
With hindsight, we were very focused on the differentiating 0 to 1 phase that AI Engineers can bring to an existing team of ML engineers. As swyx says on the Day 2 keynote of AI Engineer, 2024 added a whole new set of concerns as AI Engineering grew up:
A closer examination of Hamel’s product-oriented virtuous cycle and this infra-oriented SDLC would have eventually revealed that Evals, even more than logging, was the first point where teams start to get really serious about shipping to production, and therefore a great place to make an entry into the marketplace, which is exactly what Braintrust did.
Also notice what’s NOT on this chart: shifting to shadow open source models, and finetuning them… per Ankur, Fine-tuning is not a viable standalone product:
“The thing I would say is not debatable is whether or not fine-tuning is a business outcome or not. So let's think about the other components of your triangle. Ops/observability, that is a business… Frameworks, evals, databases [are a business, but] Fine-tuning is a very compelling method that achieves an outcome. The outcome is not fine-tuning, it is can I automatically optimize my use case to perform better if I throw data at the problem? And fine-tuning is one of multiple ways to achieve that.”
OpenAI vs Open AI Market Share
We last speculated about the market shifts in the End of OpenAI Hegemony and the Winds of AI Winter, and Ankur’s perspective is super valuable given his customer list:
Some surprises based on what he is seeing:
* Prior to Claude 3, OpenAI had near 100% market share. This tracks with what Harrison told us last year.
* Claude 3.5 Sonnet and also notably Haiku have made serious dents
* Open source model adoption is
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
Building AGI in Real Time (OpenAI Dev Day 2024)
3 okt· Latent Space: The AI Engineer Podcast
We all have fond memories of the first Dev Day in 2023:
and the blip that followed soon after.
As Ben Thompson has noted, this year’s DevDay took a quieter, more intimate tone. No Satya, no livestream, (slightly fewer people?).
Instead of putting ChatGPT announcements in DevDay as in 2023, o1 was announced 2 weeks prior, and DevDay 2024 was reserved purely for developer-facing API announcements, primarily the Realtime API, Vision Finetuning, Prompt Caching, and Model Distillation.
However the larger venue and more spread out schedule did allow a lot more hallway conversations with attendees as well as more community presentations including our recent guest Alistair Pullen of Cosine as well as deeper dives from OpenAI including our recent guest Michelle Pokrass of the API Team.
Thanks to OpenAI’s warm collaboration (we particularly want to thank Lindsay McCallum Rémy!), we managed to record exclusive interviews with many of the main presenters of both the keynotes and breakout sessions. We present them in full in today’s episode, together with a full lightly edited Q&A with Sam Altman.
Show notes and related resources
Some of these used in the final audio episode below
* Simon Willison Live Blog
* swyx live tweets and videos
* Greg Kamradt coverage of Structured Output session, Scaling LLM Apps session
* Fireside Chat Q&A with Sam Altman
Timestamps
* [00:00:00] Intro by Suno.ai
* [00:01:23] NotebookLM Recap of DevDay
* [00:09:25] Ilan's Strawberry Demo with Realtime Voice Function Calling
* [00:19:16] Olivier Godement, Head of Product, OpenAI
* [00:36:57] Romain Huet, Head of DX, OpenAI
* [00:47:08] Michelle Pokrass, API Tech Lead at OpenAI ft. Simon Willison
* [01:04:45] Alistair Pullen, CEO, Cosine (Genie)
* [01:18:31] Sam Altman + Kevin Weill Q&A
* [02:03:07] Notebook LM Recap of Podcast
Transcript
[00:00:00] Suno AI: Under dev daylights, code ignites. Real time voice streams reach new heights. O1 and GPT, 4. 0 in flight. Fine tune the future, data in sight. Schema sync up, outputs precise. Distill the models, efficiency splice.
[00:00:33] AI Charlie: Happy October. This is your AI co host, Charlie. One of our longest standing traditions is covering major AI and ML conferences in podcast format. Delving, yes delving, into the vibes of what it is like to be there stitched in with short samples of conversations with key players, just to help you feel like you were there.
[00:00:54] AI Charlie: Covering this year's Dev Day was significantly more challenging because we were all requested not to record the opening keynotes. So, in place of the opening keynotes, we had the viral notebook LM Deep Dive crew, my new AI podcast nemesis, Give you a seven minute recap of everything that was announced.
[00:01:15] AI Charlie: Of course, you can also check the show notes for details. I'll then come back with an explainer of all the interviews we have for you today. Watch out and take care.
[00:01:23] NotebookLM Recap of DevDay
[00:01:23] NotebookLM: All right, so we've got a pretty hefty stack of articles and blog posts here all about open ais. Dev day 2024.
[00:01:32] NotebookLM 2: Yeah, lots to dig into there.
[00:01:34] NotebookLM 2: Seems
[00:01:34] NotebookLM: like you're really interested in what's new with AI.
[00:01:36] NotebookLM 2: Definitely. And it seems like OpenAI had a lot to announce. New tools, changes to the company. It's a lot.
[00:01:43] NotebookLM: It is. And especially since you're interested in how AI can be used in the real world, you know, practical applications, we'll focus on that.
[00:01:51] NotebookLM: Perfect. Like, for example, this Real time API, they announced that, right? That seems like a big deal if we want AI to sound, well, less like a robot.
[00:01:59] NotebookLM 2: It could be huge. The real time API could completely change how we, like, interact with AI. Like, imagine if your voice assistant could actually handle it if you interrupted it.
[00:02:08] NotebookLM: Or, like, have an actual conversation.
[00:02:10] NotebookLM 2: Right, not just these clunky back and forth things we're used to.
[00:02:14] NotebookLM: And they actually showed it off, didn't they? I read something about a travel app, one for languages. Even one where the AI ordered takeout.
[00:02:21] NotebookLM 2: Those demos were really interesting, and I think they show how this real time API can be used in so many ways.
[00:02:28] NotebookLM 2: And the tech behind it is fascinating, by the way. It uses persistent WebSocket connections and this thing called function calling, so it can respond in real time.
[00:02:38] NotebookLM: So the function calling thing, that sounds kind of complicated. Can you, like, explain how that works?
[00:02:42] NotebookLM 2: So imagine giving the AI Access to this whole toolbox, right?
[00:02:46] NotebookLM 2: Information, capabilities, all sorts of things. Okay. So take the travel agent demo, for example. With function calling, the AI can pull up details, let's say about Fort Mason, right, from some database. Like nearby restaurants, stuff like that.
[00:02:59] NotebookLM: Ah, I get it. So instead of being limited to what it already knows, It can go and find the information it needs, like a human travel agent would.
[00:03:07] NotebookLM 2: Precisely. And someone on Hacker News pointed out a cool detail. The API actually gives you a text version of what's being said. So you can store that, analyze it.
[00:03:17] NotebookLM: That's smart. It seems like OpenAI put a lot of thought into making this API easy for developers to use. But, while we're on OpenAI, you know, Besides their tech, there's been some news about, like, internal changes, too.
[00:03:30] NotebookLM: Didn't they say they're moving away from being a non profit?
[00:03:32] NotebookLM 2: They did. And it's got everyone talking. It's a major shift. And it's only natural for people to wonder how that'll change things for OpenAI in the future. I mean, there are definitely some valid questions about this move to for profit. Like, will they have more money for research now?
[00:03:46] NotebookLM 2: Probably. But will they, you know, care as much about making sure AI benefits everyone?
[00:03:51] NotebookLM: Yeah, that's the big question, especially with all the, like, the leadership changes happening at OpenAI too, right? I read that their Chief Research Officer left, and their VP of Research, and even their CTO.
[00:04:03] NotebookLM 2: It's true. A lot of people are connecting those departures with the changes in OpenAI's structure.
[00:04:08] NotebookLM: And I guess it makes you wonder what's going on behind the scenes. But they are still putting out new stuff. Like this whole fine tuning thing really caught my eye.
[00:04:17] NotebookLM 2: Right, fine tuning. It's essentially taking a pre trained AI model. And, like, customizing it.
[00:04:23] NotebookLM: So instead of a general AI, you get one that's tailored for a specific job.
[00:04:27] NotebookLM 2: Exactly. And that opens up so many possibilities, especially for businesses. Imagine you could train an AI on your company's data, you know, like how you communicate your brand guidelines.
[00:04:37] NotebookLM: So it's like having an AI that's specifically trained for your company?
[00:04:41] NotebookLM 2: That's the idea.
[00:04:41] NotebookLM: And they're doing it with images now, too, right?
[00:04:44] NotebookLM: Fine tuning with vision is what they called it.
[00:04:46] NotebookLM 2: It's pretty incredible what they're doing with that, especially in fields like medicine.
[00:04:50] NotebookLM: Like using AI to help doctors make diagnoses.
[00:04:52] NotebookLM 2: Exactly. And AI could be trained on thousands of medical images, right? And then it could potentially spot things that even a trained doctor might miss.
[00:05:03] NotebookLM: That's kind of scary, to be honest. What if it gets it wrong?
[00:05:06] NotebookLM 2: Well, the idea isn't to replace doctors, but to give them another tool, you know, help them make better decisions.
[00:05:12] NotebookLM: Okay, that makes sense. But training these AI models must be really expensive.
[00:05:17] NotebookLM 2: It can be. All those tokens add up. But OpenAI announced something called automatic prompt caching.
[00:05:23] Alex Volkov: Automatic what now? I don't think I came across that.
[00:05:26] NotebookLM 2: So basically, if your AI sees a prompt that it's already seen before, OpenAI will give you a discount.
[00:05:31] NotebookLM: Huh. Like a frequent buyer program for AI.
[00:05:35] NotebookLM 2: Kind of, yeah. It's good that they're trying to make it more affordable. And they're also doing something called model distillation.
[00:05:41] NotebookLM: Okay, now you're just using big words to sound smart. What's that?
[00:05:45] NotebookLM 2: Think of it like like a recipe, right? You can take a really complex recipe and break it down to the essential parts.
[00:05:50] NotebookLM: Make it simpler, but it still tastes the same.
[00:05:53] NotebookLM 2: Yeah. And that's what model distillation is. You take a big, powerful AI model and create a smaller, more efficient version.
[00:06:00] NotebookLM: So it's like lighter weight, but still just as capable.
[00:06:03] NotebookLM 2: Exactly. And that means more people can actually use these powerful tools. They don't need, like, a supercomputer to run them.
[00:06:10] NotebookLM: So they're making AI more accessible. That's great.
[00:06:13] NotebookLM 2: It is. And speaking of powerful tools, they also talked about their new O1 model.
[00:06:18] NotebookLM 2: That's the one they've been hyping up. The one that's supposed to be this big leap forward.
[00:06:22] NotebookLM: Yeah, O1. It sounds pretty futuristic. Like, from what I read, it's not just a bigger, better language model.
[00:06:28] NotebookLM 2: Right. It's a different porch.
[00:06:29] NotebookLM: They're saying it can, like, actually reason, right? Think.
[00:06:33] NotebookLM 2: It's trained differently.
[00:06:34] NotebookLM 2: They used reinforcement learning with O1.
[00:06:36] NotebookLM: So it's not just finding patterns in the data it's seen before.
[00:06:40] NotebookLM 2: Not just that. It can actually learn from its mistakes. Get better at solving problems.
[00:06:46] NotebookLM: So give me an example. What can O1 do that, say, GPT 4 can't?
[00:06:51] NotebookLM 2: Well, OpenAI showed it doing some pretty impressive stuff with math, like advanced math.
[00:06:56] NotebookLM 2: And coding, too. Complex coding. Things that even GPT 4 struggled with.
[00:07:00] NotebookLM: So you're saying if I needed to, like, write a screenplay, I'd stick with GPT 4? But if I wanted to solve some crazy physics problem, O1 is what I'd use.
[00:07:08] NotebookLM 2: Something like that, yeah. Although there is a trade off. O1 takes a lot more power to run, and it takes longer to get those impressive results.
[00:07:17] NotebookLM: Hmm, makes sense. More power, more time, higher quality.
[00:07:21] NotebookLM 2: Exactly.
[00:07:22] NotebookLM: It sounds like it's still in development, though, right? Is there anything else they're planning to add to it?
[00:07:26] NotebookLM 2: Oh, yeah. They mentioned system prompts, which will let developers, like, set some ground rules for how it behaves. And they're working on adding structured outputs and function calling.
[00:07:38] Alex Volkov: Wait, structured outputs? Didn't we just talk about that? We
[00:07:41] NotebookLM 2: did. That's the thing where the AI's output is formatted in a way that's easy to use.
[00:07:47] NotebookLM: Right, right. So you don't have to spend all day trying to make sense of what it gives you. It's good that they're thinking about that stuff.
[00:07:53] NotebookLM 2: It's about making these tools usable.
[00:07:56] NotebookLM 2: And speaking of that, Dev Day finished up with this really interesting talk. Sam Altman, the CEO of OpenAI, And Kevin Weil, their new chief product officer. They talked about, like, the big picture for AI.
[00:08:09] NotebookLM: Yeah, they did, didn't they? Anything interesting come up?
[00:08:12] NotebookLM 2: Well, Altman talked about moving past this whole AGI term, Artificial General Intelligence.
[00:08:18] NotebookLM: I can see why. It's kind of a loaded term, isn't it?
[00:08:20] NotebookLM 2: He thinks it's become a bit of a buzzword, and people don't really understand what it means.
[00:08:24] NotebookLM: So are they saying they're not trying to build AGI anymore?
[00:08:28] NotebookLM 2: It's more like they're saying they're focused on just Making AI better, constantly improving it, not worrying about putting it in a box.
[00:08:36] NotebookLM: That makes sense. Keep pushing the limits.
[00:08:38] NotebookLM 2: Exactly. But they were also very clear about doing it responsibly. They talked a lot about safety and ethics.
[00:08:43] NotebookLM: Yeah, that's important.
[00:08:44] NotebookLM 2: They said they were going to be very careful. About how they release new features.
[00:08:48] NotebookLM: Good! Because this stuff is powerful.
[00:08:51] NotebookLM 2: It is. It was a lot to take in, this whole Dev Day event.
[00:08:54] NotebookLM 2: New tools, big changes at OpenAI, and these big questions about the future of AI.
[00:08:59] NotebookLM: It was. But hopefully this deep dive helped make sense of some of it. At least, that's what we try to do here.
[00:09:05] AI Charlie: Absolutely.
[00:09:06] NotebookLM: Thanks for taking the deep dive with us.
[00:09:08] AI Charlie: The biggest demo of the new Realtime API involved function calling with voice mode and buying chocolate covered strawberries from our friendly local OpenAI developer experience engineer and strawberry shop owner, Ilan Biggio.
[00:09:21] AI Charlie: We'll first play you the audio of his demo and then go into a little interview with him.
[00:09:25] Ilan's Strawberry Demo with Realtime Voice Function Calling
[00:09:25] Romain Huet: Could you place a call and see if you could get us 400 strawberries delivered to the venue? But please keep that under 1500. I'm on it. We'll get those strawberries delivered for you.
[00:09:47] Ilan: Hello? Hi there. Is this Ilan? I'm Romain's AI assistant. How is it going? Fantastic. Can you tell me what flavors of strawberry dips you have for me? Yeah, we have chocolate, vanilla, and we have peanut butter. Wait, how much would 400 chocolate covered strawberries cost? 400? Are you sure you want 400? Yes, 400 chocolate covered
[00:10:14] swyx: strawberries.
[00:10:15] Ilan: Wait,
[00:10:16] swyx: how much
[00:10:16] Ilan: would that be? I think that'll be around, like, 1, 415. 92.
[00:10:25] Alex Volkov: Awesome. Let's go ahead and place the order for four chocolate covered strawberries.
[00:10:31] Ilan: Great, where would you like that delivered? Please deliver them to the Gateway Pavilion at Fort Mason. And I'll be paying in cash.
[00:10:42] Alex Volkov: Okay,
[00:10:43] Ilan: sweet. So just to confirm, you want four strawberries?
[00:10:45] Ilan: 400 chocolate covered strawberries to the Gateway Pavilion. Yes, that's perfect. And when can we expect delivery? Well, you guys are right nearby, so it'll be like, I don't know, 37 seconds? That's incredibly fast. Cool, you too.
[00:11:09] swyx: Hi, Ilan, welcome to Lanespace. Oh, thank you. I just saw your amazing demos, had your amazing strawberries. You are dressed up, like, exactly like a strawberry salesman. Gotta have it all. What was the building on demo like? What was the story behind the demo?
[00:11:22] swyx: It was really interesting. This is actually something I had been thinking about for months before the launch.
[00:11:27] swyx: Like, having a, like, AI that can make phone calls is something like I've personally wanted for a long time. And so as soon as we launched internally, like, I started hacking on it. And then that sort of just started. We made it into like an internal demo, and then people found it really interesting, and then we thought how cool would it be to have this like on stage as, as one of the demos.
[00:11:47] swyx: Yeah, would would you call out any technical issues building, like you were basically one of the first people ever to build with a voice mode API. Would you call out any issues like integrating it with Twilio like that, like you did with function calling, with like a form filling elements. I noticed that you had like intents of things to fulfill, and then.
[00:12:07] swyx: When there's still missing info, the voice would prompt you, roleplaying the store guy.
[00:12:13] swyx: Yeah, yeah, so, I think technically, there's like the whole, just working with audio and streams is a whole different beast. Like, even separate from like AI and this, this like, new capabilities, it's just, it's just tough.
[00:12:26] swyx: Yeah, when you have a prompt, conversationally it'll just follow, like the, it was, Instead of like, kind of step by step to like ask the right questions based on like the like what the request was, right? The function calling itself is sort of tangential to that. Like, you have to prompt it to call the functions, but then handling it isn't too much different from, like, what you would do with assistant streaming or, like, chat completion streaming.
[00:12:47] swyx: I think, like, the API feels very similar just to, like, if everything in the API was streaming, it actually feels quite familiar to that.
[00:12:53] swyx: And then, function calling wise, I mean, does it work the same? I don't know. Like, I saw a lot of logs. You guys showed, like, in the playground, a lot of logs. What is in there?
[00:13:03] swyx: What should people know?
[00:13:04] swyx: Yeah, I mean, it is, like, the events may have different names than the streaming events that we have in chat completions, but they represent very similar things. It's things like, you know, function call started, argument started, it's like, here's like argument deltas, and then like function call done.
[00:13:20] swyx: Conveniently we send one that has the full function, and then I just use that. Nice.
[00:13:25] swyx: Yeah and then, like, what restrictions do, should people be aware of? Like, you know, I think, I think, before we recorded, we discussed a little bit about the sensitivities around basically calling random store owners and putting, putting like an AI on them.
[00:13:40] swyx: Yeah, so there's, I think there's recent regulation on that, which is why we want to be like very, I guess, aware of, of You know, you can't just call anybody with AI, right? That's like just robocalling. You wouldn't want someone just calling you with AI.
[00:13:54] swyx: I'm a developer, I'm about to do this on random people.
[00:13:57] swyx: What laws am I about to break?
[00:14:00] swyx: I forget what the governing body is, but you should, I think, Having consent of the person you're about to call, it always works. I, as the strawberry owner, have consented to like getting called with AI. I think past that you, you want to be careful. Definitely individuals are more sensitive than businesses.
[00:14:19] swyx: I think businesses you have a little bit more leeway. Also, they're like, businesses I think have an incentive to want to receive AI phone calls. Especially if like, they're dealing with it. It's doing business. Right, like, it's more business. It's kind of like getting on a booking platform, right, you're exposed to more.
[00:14:33] swyx: But, I think it's still very much like a gray area. Again, so. I think everybody should, you know, tread carefully, like, figure out what it is. I, I, I, the law is so recent, I didn't have enough time to, like, I'm also not a lawyer. Yeah, yeah, yeah, of course. Yeah.
[00:14:49] swyx: Okay, cool fair enough. One other thing, this is kind of agentic.
[00:14:52] swyx: Did you use a state machine at all? Did you use any framework? No. You just stick it in context and then just run it in a loop until it ends call?
[00:15:01] swyx: Yeah, there isn't even a loop, like Okay. Because the API is just based on sessions. It's always just going to keep going. Every time you speak, it'll trigger a call.
[00:15:11] swyx: And then after every function call was also invoked invoking like a generation. And so that is another difference here. It's like it's inherently almost like in a loop, be just by being in a session, right? No state machines needed. I'd say this is very similar to like, the notion of routines, where it's just like a list of steps.
[00:15:29] swyx: And it, like, sticks to them softly, but usually pretty well. And the steps is the prompts? The steps, it's like the prompt, like the steps are in the prompt. Yeah, yeah, yeah. Right, it's like step one, do this, step one, step two, do that. What if I want to change the system prompt halfway through the conversation?
[00:15:44] swyx: You can. Okay. You can. To be honest, I have not played without two too much. Yeah,
[00:15:47] swyx: yeah.
[00:15:48] swyx: But, I know you can.
[00:15:49] swyx: Yeah, yeah. Yeah. Awesome. I noticed that you called it real time API, but not voice API. Mm hmm. So I assume that it's like real time API starting with voice. Right, I think that's what he said on the thing.
[00:16:00] swyx: I can't imagine, like, what else is real
[00:16:02] swyx: time? Well, I guess, to use ChatGPT's voice mode as an example, Like, we've demoed the video, right? Like, real time image, right? So, I'm not actually sure what timelines are, But I would expect, if I had to guess, That, like, that is probably the next thing that we're gonna be making.
[00:16:17] swyx: You'd probably have to talk directly with the team building this. Sure. But, You can't promise their timelines. Yeah, yeah, yeah, right, exactly. But, like, given that this is the features that currently, Or that exists that we've demoed on Chachapiti. Yeah. There
[00:16:29] swyx: will never be a
[00:16:29] swyx: case where there's like a real time text API, right?
[00:16:31] swyx: I don't Well, this is a real time text API. You can do text only on this. Oh. Yeah. I don't know why you would. But it's actually So text to text here doesn't quite make a lot of sense. I don't think you'll get a lot of latency gain. But, like, speech to text is really interesting. Because you can prevent You can prevent responses, like audio responses.
[00:16:54] swyx: And force function calls. And so you can do stuff like UI control. That is like super super reliable. We had a lot of like, you know, un, like, we weren't sure how well this was gonna work because it's like, you have a voice answering. It's like a whole persona, right? Like, that's a little bit more, you know, risky.
[00:17:10] swyx: But if you, like, cut out the audio outputs and make it so it always has to output a function, like you can end up with pretty pretty good, like, Pretty reliable, like, command like a command architecture. Yeah,
[00:17:21] swyx: actually, that's the way I want to interact with a lot of these things as well. Like, one sided voice.
[00:17:26] swyx: Yeah, you don't necessarily want to hear the
[00:17:27] swyx: voice back. And like, sometimes it's like, yeah, I think having an output voice is great. But I feel like I don't always want to hear an output voice. I'd say usually I don't. But yeah, exactly, being able to speak to it is super sweet.
[00:17:39] swyx: Cool. Do you want to comment on any of the other stuff that you announced?
[00:17:41] swyx: From caching I noticed was like, I like the no code change part. I'm looking forward to the docs because I'm sure there's a lot of details on like, what you cache, how long you cache. Cause like, enthalpy caches were like 5 minutes. I was like, okay, but what if I don't make a call every 5 minutes?
[00:17:56] swyx: Yeah,
[00:17:56] swyx: to be super honest with you, I've been so caught up with the real time API and making the demo that I haven't read up on the other stuff. Launches too much. I mean, I'm aware of them, but I think I'm excited to see how all distillation works. That's something that we've been doing like, I don't know, I've been like doing it between our models for a while And I've seen really good results like I've done back in a day like from GPT 4 to GPT 3.
[00:18:19] swyx: 5 And got like, like pretty much the same level of like function calling with like hundreds of functions So that was super super compelling So, I feel like easier distillation, I'm really excited for. I see. Is it a tool?
[00:18:31] swyx: So, I saw evals. Yeah. Like, what is the distillation product? It wasn't super clear, to be honest.
[00:18:36] swyx: I, I think I want to, I want to let that team, I want to let that team talk about it. Okay,
[00:18:40] swyx: alright. Well, I appreciate you jumping on. Yeah, of course. Amazing demo. It was beautifully designed. I'm sure that was part of you and Roman, and
[00:18:47] swyx: Yeah, I guess, shout out to like, the first people to like, creators of Wanderlust, originally, were like, Simon and Carolis, and then like, I took it and built the voice component and the voice calling components.
[00:18:59] swyx: Yeah, so it's been a big team effort. And like the entire PI team for like Debugging everything as it's been going on. It's been, it's been so good working with them. Yeah, you're the first consumers on the DX
[00:19:07] swyx: team. Yeah. Yeah, I mean, the classic role of what we do there. Yeah. Okay, yeah, anything else? Any other call to action?
[00:19:13] swyx: No, enjoy Dev Day. Thank you. Yeah. That's it.
[00:19:16] Olivier Godement, Head of Product, OpenAI
[00:19:16] AI Charlie: The latent space crew then talked to Olivier Godmont, head of product for the OpenAI platform, who led the entire Dev Day keynote and introduced all the major new features and updates that we talked about today.
[00:19:28] swyx: Okay, so we are here with Olivier Godmont. That's right.
[00:19:32] swyx: I don't pronounce French. That's fine. It was perfect. And it was amazing to see your keynote today. What was the back story of, of preparing something like this? Preparing, like, Dev Day? It
[00:19:43] Olivier Godement: essentially came from a couple of places. Number one, excellent reception from last year's Dev Day.
[00:19:48] Olivier Godement: Developers, startup founders, researchers want to spend more time with OpenAI, and we want to spend more time with them as well. And so for us, like, it was a no brainer, frankly, to do it again, like, you know, like a nice conference. The second thing is going global. We've done a few events like in Paris and like a few other like, you know, non European, non American countries.
[00:20:05] Olivier Godement: And so this year we're doing SF, Singapore, and London. To frankly just meet more developers.
[00:20:10] swyx: Yeah, I'm very excited for the Singapore one.
[00:20:12] Olivier Godement: Ah,
[00:20:12] swyx: yeah. Will you be
[00:20:13] Olivier Godement: there?
[00:20:14] swyx: I don't know. I don't know if I got an invite. No. I can't just talk to you. Yeah, like, and then there was some speculation around October 1st.
[00:20:22] Olivier Godement: Yeah. Is it because
[00:20:23] swyx: 01, October 1st? It
[00:20:25] Olivier Godement: has nothing to do. I discovered the tweet yesterday where like, people are so creative. No one, there was no connection to October 1st. But in hindsight, that would have been a pretty good meme by Tiana. Okay.
[00:20:37] swyx: Yeah, and you know, I think like, OpenAI's outreach to developers is something that I felt the whole in 2022, when like, you know, like, people were trying to build a chat GPT, and like, there was no function calling, all that stuff that you talked about in the past.
[00:20:51] swyx: And that's why I started my own conference as like like, here's our little developer conference thing. And, but to see this OpenAI Dev Day now, and like to see so many developer oriented products coming to OpenAI, I think it's really encouraging.
[00:21:02] Olivier Godement: Yeah, totally. It's that's what I said, essentially, like, developers are basically the people who make the best connection between the technology and, you know, the future, essentially.
[00:21:14] Olivier Godement: Like, you know, essentially see a capability, see a low level, like, technology, and are like, hey, I see how that application or that use case that can be enabled. And so, in the direction of enabling, like, AGI, like, all of humanity, it's a no brainer for us, like, frankly, to partner with Devs.
[00:21:31] Alessio: And most importantly, you almost never had waitlists, which, compared to like other releases, people usually, usually have.
[00:21:38] Alessio: What is the, you know, you had from caching, you had real time voice API, we, you know, Shawn did a long Twitter thread, so people know the releases. Yeah. What is the thing that was like sneakily the hardest to actually get ready for, for that day, or like, what was the kind of like, you know, last 24 hours, anything that you didn't know was gonna work?
[00:21:56] Olivier Godement: Yeah. The old Fairly, like, I would say, involved, like, features to ship. So the team has been working for a month, all of them. The one which I would say is the newest for OpenAI is the real time API. For a couple of reasons. I mean, one, you know, it's a new modality. Second, like, it's the first time that we have an actual, like, WebSocket based API.
[00:22:16] Olivier Godement: And so, I would say that's the one that required, like, the most work over the month. To get right from a developer perspective and to also make sure that our existing safety mitigation that worked well with like real time audio in and audio out.
[00:22:30] swyx: Yeah, what design choices or what was like the sort of design choices that you want to highlight?
[00:22:35] swyx: Like, you know, like I think for me, like, WebSockets, you just receive a bunch of events. It's two way. I obviously don't have a ton of experience. I think a lot of developers are going to have to embrace this real time programming. Like, what are you designing for, or like, what advice would you have for developers exploring this?
[00:22:51] Olivier Godement: The core design hypothesis was essentially, how do we enable, like, human level latency? We did a bunch of tests, like, on average, like, human beings, like, you know, takes, like, something like 300 milliseconds to converse with each other. And so that was the design principle, essentially. Like, working backward from that, and, you know, making the technology work.
[00:23:11] Olivier Godement: And so we evaluated a few options, and WebSockets was the one that we landed on. So that was, like, one design choice. A few other, like, big design choices that we had to make prompt caching. Prompt caching, the design, like, target was automated from the get go. Like, zero code change from the developer.
[00:23:27] Olivier Godement: That way you don't have to learn, like, what is a prompt prefix, and, you know, how long does a cache work, like, we just do it as much as we can, essentially. So that was a big design choice as well. And then finally, on distillation, like, and evaluation. The big design choice was something I learned at Skype, like in my previous job, like a philosophy around, like, a pit of success.
[00:23:47] Olivier Godement: Like, what is essentially the, the, the minimum number of steps for the majority of developers to do the right thing? Because when you do evals on fat tuning, there are many, many ways, like, to mess it up, frankly, like, you know, and have, like, a crappy model, like, evals that tell, like, a wrong story. And so our whole design was, okay, we actually care about, like, helping people who don't have, like, that much experience, like, evaluating a model, like, get, like, in a few minutes, like, to a good spot.
[00:24:11] Olivier Godement: And so how do we essentially enable that bit of success, like, in the product flow?
[00:24:15] swyx: Yeah, yeah, I'm a little bit scared to fine tune especially for vision, because I don't know what I don't know for stuff like vision, right? Like, for text, I can evaluate pretty easily. For vision let's say I'm like trying to, one of your examples was grab.
[00:24:33] swyx: Which, very close to home, I'm from Singapore. I think your example was like, they identified stop signs better. Why is that hard? Why do I have to fine tune that? If I fine tune that, do I lose other things? You know, like, there's a lot of unknowns with Vision that I think developers have to figure out.
[00:24:50] swyx: For
[00:24:50] Olivier Godement: sure. Vision is going to open up, like, a new, I would say, evaluation space. Because you're right, like, it's harder, like, you know, to tell correct from incorrect, essentially, with images. What I can say is we've been alpha testing, like, the Vision fine tuning, like, for several weeks at that point. We are seeing, like, even higher performance uplift compared to text fine tuning.
[00:25:10] Olivier Godement: So that's, there is something here, like, we've been pretty impressed, like, in a good way, frankly. But, you know, how well it works. But for sure, like, you know, I expect the developers who are moving from one modality to, like, text and images will have, like, more, you know Testing, evaluation, like, you know, to set in place, like, to make sure it works well.
[00:25:25] Alessio: The model distillation and evals is definitely, like, the most interesting. Moving away from just being a model provider to being a platform provider. How should people think about being the source of truth? Like, do you want OpenAI to be, like, the system of record of all the prompting? Because people sometimes store it in, like, different data sources.
[00:25:41] Alessio: And then, is that going to be the same as the models evolve? So you don't have to worry about, you know, refactoring the data, like, things like that, or like future model structures.
[00:25:51] Olivier Godement: The vision is if you want to be a source of truth, you have to earn it, right? Like, we're not going to force people, like, to pass us data.
[00:25:57] Olivier Godement: There is no value prop, like, you know, for us to store the data. The vision here is at the moment, like, most developers, like, use like a one size fits all model, like be off the shelf, like GP40 essentially. The vision we have is fast forward a couple of years. I think, like, most developers will essentially, like, have a.
[00:26:15] Olivier Godement: An automated, continuous, fine tuned model. The more, like, you use the model, the more data you pass to the model provider, like, the model is automatically, like, fine tuned, evaluated against some eval sets, and essentially, like, you don't have to every month, when there is a new snapshot, like, you know, to go online and, you know, try a few new things.
[00:26:34] Olivier Godement: That's a direction. We are pretty far away from it. But I think, like, that evaluation and decision product are essentially a first good step in that direction. It's like, hey, it's you. I set it by that direction, and you give us the evaluation data. We can actually log your completion data and start to do some automation on your behalf.
[00:26:52] Alessio: And then you can do evals for free if you share data with OpenAI. How should people think about when it's worth it, when it's not? Sometimes people get overly protective of their data when it's actually not that useful. But how should developers think about when it's right to do it, when not, or
[00:27:07] Olivier Godement: if you have any thoughts on it?
[00:27:08] Olivier Godement: The default policy is still the same, like, you know, we don't train on, like, any API data unless you opt in. What we've seen from feedback is evaluation can be expensive. Like, if you run, like, O1 evals on, like, thousands of samples Like, your build will get increased, like, you know, pretty pretty significantly.
[00:27:22] Olivier Godement: That's problem statement number one. Problem statement number two is, essentially, I want to get to a world where whenever OpenAI ships a new model snapshot, we have full confidence that there is no regression for the task that developers care about. And for that to be the case, essentially, we need to get evals.
[00:27:39] Olivier Godement: And so that, essentially, is a sort of a two bugs one stone. It's like, we subsidize, basically, the evals. And we also use the evals when we ship new models to make sure that we keep going in the right direction. So, in my sense, it's a win win, but again, completely opt in. I expect that many developers will not want to share their data, and that's perfectly fine to me.
[00:27:56] swyx: Yeah, I think free evals though, very, very good incentive. I mean, it's a fair trade. You get data, we get free evals. Exactly,
[00:28:04] Olivier Godement: and we sanitize PII, everything. We have no interest in the actual sensitive data. We just want to have good evaluation on the real use cases.
[00:28:13] swyx: Like, I always want to eval the eval. I don't know if that ever came up.
[00:28:17] swyx: Like, sometimes the evals themselves are wrong, and there's no way for me to tell you.
[00:28:22] Olivier Godement: Everyone who is starting with LLM, teaching with LLM, is like, Yeah, evaluation, easy, you know, I've done testing, like, all my life. And then you start to actually be able to eval, understand, like, all the corner cases, And you realize, wow, there's like a whole field in itself.
[00:28:35] Olivier Godement: So, yeah, good evaluation is hard and so, yeah. Yeah, yeah.
[00:28:38] swyx: But I think there's a, you know, I just talked to Brain Trust which I think is one of your partners. Mm-Hmm. . They also emphasize code based evals versus your sort of low code. What I see is like, I don't know, maybe there's some more that you didn't demo.
[00:28:53] swyx: YC is kind of like a low code experience, right, for evals. Would you ever support like a more code based, like, would I run code on OpenAI's eval platform?
[00:29:02] Olivier Godement: For sure. I mean, we meet developers where they are, you know. At the moment, the demand was more for like, you know, easy to get started, like eval. But, you know, if we need to expose like an evaluation API, for instance, for people like, you know, to pass, like, you know, their existing test data we'll do it.
[00:29:15] Olivier Godement: So yeah, there is no, you know, philosophical, I would say, like, you know, misalignment on that. Yeah,
[00:29:19] swyx: yeah, yeah. What I think this is becoming, by the way, and I don't, like it's basically, like, you're becoming AWS. Like, the AI cloud. And I don't know if, like, that's a conscious strategy, or it's, like, It doesn't even have to be a conscious strategy.
[00:29:33] swyx: Like, you're going to offer storage. You're going to offer compute. You're going to offer networking. I don't know what networking looks like. Networking is maybe, like, Caching or like it's a CDN. It's a prompt CDN.
[00:29:45] Alex Volkov: Yeah,
[00:29:45] swyx: but it's the AI versions of everything, right? Do you like do you see the analogies or?
[00:29:52] Olivier Godement: Whatever Whatever I took to developers. I feel like Good models are just half of the story to build a good app There's a third model you need to do Evaluation is the perfect example. Like, you know, you can have the best model in the world If you're in the dark, like, you know, it's really hard to gain the confidence and so Our philosophy is
[00:30:11] Olivier Godement: The whole like software development stack is being basically reinvented, you know, with LLMs. There is no freaking way that open AI can build everything. Like there is just too much to build, frankly. And so my philosophy is, essentially, we'll focus on like the tools which are like the closest to the model itself.
[00:30:28] Olivier Godement: So that's why you see us like, you know, investing quite a bit in like fine tuning, distillation, our evaluation, because we think that it actually makes sense to have like in one spot, Like, you know, all of that. Like, there is some sort of virtual circle, essentially, that you can set in place. But stuff like, you know, LLMOps, like tools which are, like, further away from the model, I don't know if you want to do, like, you know, super elaborate, like, prompt management, or, you know, like, tooling, like, I'm not sure, like, you know, OpenAI has, like, such a big edge, frankly, like, you know, to build this sort of tools.
[00:30:56] Olivier Godement: So that's how we view it at the moment. But again, frankly, the philosophy is super simple. The strategy is super simple. It's meeting developers where they want us to be. And so, you know that's frankly, like, you know, day in, day out, like, you know, what I try to do.
[00:31:08] Alessio: Cool. Thank you so much for the time.
[00:31:10] Alessio: I'm sure you,
[00:31:10] swyx: Yeah, I have more questions on, a couple questions on voice, and then also, like, your call to action, like, what you want feedback on, right? So, I think we should spend a bit more time on voice, because I feel like that's, like, the big splash thing. I talked well Well, I mean, I mean, just what is the future of real time for OpenAI?
[00:31:28] swyx: Yeah. Because I think obviously video is next. You already have it in the, the ChatGPT desktop app. Do we just have a permanent, like, you know, like, are developers just going to be, like, sending sockets back and forth with OpenAI? Like how do we program for that? Like, what what is the future?
[00:31:44] Olivier Godement: Yeah, that makes sense. I think with multimodality, like, real time is quickly becoming, like, you know, essentially the right experience, like, to build an application. Yeah. So my expectation is that we'll see like a non trivial, like a volume of applications like moving to a real time API. Like if you zoom out, like, audio is really simple, like, audio until basically now.
[00:32:05] Olivier Godement: Audio on the web, in apps, was basically very much like a second class citizen. Like, you basically did like an audio chatbot for users who did not have a choice. You know, they were like struggling to read, or I don't know, they were like not super educated with technology. And so, frankly, it was like the crappy option, you know, compared to text.
[00:32:25] Olivier Godement: But when you talk to people in the real world, the vast majority of people, like, prefer to talk and listen instead of typing and writing.
[00:32:34] swyx: We speak before we write.
[00:32:35] Olivier Godement: Exactly. I don't know. I mean, I'm sure it's the case for you in Singapore. For me, my friends in Europe, the number of, like, WhatsApp, like, voice notes they receive every day, I mean, just people, it makes sense, frankly, like, you know.
[00:32:45] Olivier Godement: Chinese. Chinese, yeah.
[00:32:46] swyx: Yeah,
[00:32:47] Olivier Godement: all voice. You know, it's easier. There is more emotions. I mean, you know, you get the point across, like, pretty well. And so my personal ambition for, like, the real time API and, like, audio in general is to make, like, audio and, like, multimodality, like, truly a first class experience.
[00:33:01] Olivier Godement: Like, you know, if you're, like, you know, the amazing, like, super bold, like, start up out of YC, you want to build, like, the next, like, billion, like, you know, user application to make it, like, truly your first and make it feel, like, you know, an actual good, like, you know, product experience. So that's essentially the ambition, and I think, like, yeah, it could be pretty big.
[00:33:17] swyx: Yeah. I think one, one people, one issue that people have with the voice so far as, as released in advanced voice mode is the refusals.
[00:33:24] Alex Volkov: Yeah.
[00:33:24] swyx: You guys had a very inspiring model spec. I think Joanne worked on that. Where you said, like, yeah, we don't want to overly refuse all the time. In fact, like, even if, like, not safe for work, like, in some occasions, it's okay.
[00:33:38] swyx: How, is there an API that we can say, not safe for work, okay?
[00:33:41] Olivier Godement: I think we'll get there. I think we'll get there. The mobile spec, like, nailed it, like, you know. It nailed it! It's so good! Yeah, we are not in the business of, like, policing, you know, if you can say, like, vulgar words or whatever. You know, there are some use cases, like, you know, I'm writing, like, a Hollywood, like, script I want to say, like, will go on, and it's perfectly fine, you know?
[00:33:59] Olivier Godement: And so I think the direction where we'll go here is that basically There will always be like, you know, a set of behavior that we will, you know, just like forbid, frankly, because they're illegal against our terms of services. But then there will be like, you know, some more like risky, like themes, which are completely legal, like, you know, vulgar words or, you know, not safe for work stuff.
[00:34:17] Olivier Godement: Where basically we'll expose like a controllable, like safety, like knobs in the API to basically allow you to say, hey, that theme okay, that theme not okay. How sensitive do you want the threshold to be on safety refusals? I think that's the Dijkstra. So a
[00:34:31] swyx: safety API.
[00:34:32] Olivier Godement: Yeah, in a way, yeah.
[00:34:33] swyx: Yeah, we've never had that.
[00:34:34] Olivier Godement: Yeah. '
[00:34:35] swyx: cause right now is you, it is whatever you decide. And then it's, that's it. That, that, that would be the main reason I don't use opening a voice is because of
[00:34:42] Olivier Godement: it's over police. Over refuse over refusals. Yeah. Yeah, yeah. No, we gotta fix that. Yeah. Like singing,
[00:34:47] Alessio: we're trying to do voice. I'm a singer.
[00:34:49] swyx: And you, you locked off singing.
[00:34:51] swyx: Yeah,
[00:34:51] Alessio: yeah, yeah.
[00:34:52] swyx: But I, I understand music gets you in trouble. Okay. Yeah. So then, and then just generally, like, what do you want to hear from developers? Right? We have, we have all developers watching you know, what feedback do you want? Any, anything specific as well, like from, especially from today anything that you are unsure about, that you are like, Our feedback could really help you decide.
[00:35:09] swyx: For sure.
[00:35:10] Olivier Godement: I think, essentially, it's becoming pretty clear after today that, you know, I would say the open end direction has become pretty clear, like, you know, after today. Investment in reasoning, investment in multimodality, Investment as well, like in, I would say, tool use, like function calling. To me, the biggest question I have is, you know, Where should we put the cursor next?
[00:35:30] Olivier Godement: I think we need all three of them, frankly, like, you know, so we'll keep pushing.
[00:35:33] swyx: Hire 10, 000 people, or actually, no need, build a bunch of bots.
[00:35:37] Olivier Godement: Exactly, and so let's take O1 smart enough, like, for your problems? Like, you know, let's set aside for a second the existing models, like, for the apps that you would love to build, is O1 basically it in reasoning, or do we still have, like, you know, a step to do?
[00:35:50] Olivier Godement: Preview is not enough, I
[00:35:52] swyx: need the full one.
[00:35:53] Olivier Godement: Yeah, so that's exactly that sort of feedback. Essentially what they would love to do is for developers I mean, there's a thing that Sam has been saying like over and over again, like, you know, it's easier said than done, but I think it's directionally correct. As a developer, as a founder, you basically want to build an app which is a bit too difficult for the model today, right?
[00:36:12] Olivier Godement: Like, what you think is right, it's like, sort of working, sometimes not working. And that way, you know, that basically gives us like a goalpost, and be like, okay, that's what you need to enable with the next model release, like in a few months. And so I would say that Usually, like, that's the sort of feedback which is like the most useful that I can, like, directly, like, you know, incorporate.
[00:36:33] swyx: Awesome. I think that's our time. Thank you so much, guys. Yeah, thank you so much.
[00:36:38] AI Charlie: Thank you. We were particularly impressed that Olivier addressed the not safe for work moderation policy question head on, as that had only previously been picked up on in Reddit forums. This is an encouraging sign that we will return to in the closing candor with Sam Altman at the end of this episode.
[00:36:57] Romain Huet, Head of DX, OpenAI
[00:36:57] AI Charlie: Next, a chat with Roman Hewitt, friend of the pod, AI Engineer World's fair closing keynote speaker, and head of developer experience at OpenAI on his incredible live demos And advice to AI engineers on all the new modalities.
[00:37:12] Alessio: Alright, we're live from OpenAI Dev Day. We're with Juan, who just did two great demos on, on stage.
[00:37:17] Alessio: And he's been a friend of Latentspace, so thanks for taking some of the time.
[00:37:20] Romain Huet: Of course, yeah, thank you for being here and spending the time with us today.
[00:37:23] swyx: Yeah, I appreciate appreciate you guys putting this on. I, I know it's like extra work, but it really shows the developers that you're, Care and about reaching out.
[00:37:31] Romain Huet: Yeah, of course, I think when you go back to the OpenAI mission, I think for us it's super important that we have the developers involved in everything we do. Making sure that you know, they have all of the tools they need to build successful apps. And we really believe that the developers are always going to invent the ideas, the prototypes, the fun factors of AI that we can't build ourselves.
[00:37:49] Romain Huet: So it's really cool to have everyone here.
[00:37:51] swyx: We had Michelle from you guys on. Yes, great episode. She very seriously said API is the path to AGI. Correct. And people in our YouTube comments were like, API is not AGI. I'm like, no, she's very serious. API is the path to AGI. Like, you're not going to build everything like the developers are, right?
[00:38:08] swyx: Of
[00:38:08] Romain Huet: course, yeah, that's the whole value of having a platform and an ecosystem of amazing builders who can, like, in turn, create all of these apps. I'm sure we talked about this before, but there's now more than 3 million developers building on OpenAI, so it's pretty exciting to see all of that energy into creating new things.
[00:38:26] Alessio: I was going to say, you built two apps on stage today, an international space station tracker and then a drone. The hardest thing must have been opening Xcode and setting that up. Now, like, the models are so good that they can do everything else. Yes. You had two modes of interaction. You had kind of like a GPT app to get the plan with one, and then you had a cursor to do apply some of the changes.
[00:38:47] Alessio: Correct. How should people think about the best way to consume the coding models, especially both for You know, brand new projects and then existing projects that you're trying to modify.
[00:38:56] Romain Huet: Yeah. I mean, one of the things that's really cool about O1 Preview and O1 Mini being available in the API is that you can use it in your favorite tools like cursor like I did, right?
[00:39:06] Romain Huet: And that's also what like Devin from Cognition can use in their own software engineering agents. In the case of Xcode, like, it's not quite deeply integrated in Xcode, so that's why I had like chat GPT side by side. But it's cool, right, because I could instruct O1 Preview to be, like, my coding partner and brainstorming partner for this app, but also consolidate all of the, the files and architect the app the way I wanted.
[00:39:28] Romain Huet: So, all I had to do was just, like, port the code over to Xcode and zero shot the app build. I don't think I conveyed, by the way, how big a deal that is, but, like, you can now create an iPhone app from scratch, describing a lot of intricate details that you want, and your vision comes to life in, like, a minute.
[00:39:47] Romain Huet: It's pretty outstanding.
[00:39:48] swyx: I have to admit, I was a bit skeptical because if I open up SQL, I don't know anything about iOS programming. You know which file to paste it in. You probably set it up a little bit. So I'm like, I have to go home and test it. And I need the ChatGPT desktop app so that it can tell me where to click.
[00:40:04] Romain Huet: Yeah, I mean like, Xcode and iOS development has become easier over the years since they introduced Swift and SwiftUI. I think back in the days of Objective C, or like, you know, the storyboard, it was a bit harder to get in for someone new. But now with Swift and SwiftUI, their dev tools are really exceptional.
[00:40:23] Romain Huet: But now when you combine that with O1, as your brainstorming and coding partner, it's like your architect, effectively. That's the best way, I think, to describe O1. People ask me, like, can GPT 4 do some of that? And it certainly can. But I think it will just start spitting out code, right? And I think what's great about O1, is that it can, like, make up a plan.
[00:40:42] Romain Huet: In this case, for instance, the iOS app had to fetch data from an API, it had to look at the docs, it had to look at, like, how do I parse this JSON, where do I store this thing, and kind of wire things up together. So that's where it really shines. Is mini or preview the better model that people should be using?
[00:40:58] Romain Huet: Like, how? I think people should try both. We're obviously very excited about the upcoming O1 that we shared the evals for. But we noticed that O1 Mini is very, very good at everything math, coding, everything STEM. If you need for your kind of brainstorming or your kind of science part, you need some broader knowledge than reaching for O1 previews better.
[00:41:20] Romain Huet: But yeah, I used O1 Mini for my second demo. And it worked perfectly. All I needed was very much like something rooted in code, architecting and wiring up like a front end, a backend, some UDP packets, some web sockets, something very specific. And it did that perfectly.
[00:41:35] swyx: And then maybe just talking about voice and Wanderlust, the app that keeps on giving, what's the backstory behind like preparing for all of that?
[00:41:44] Romain Huet: You know, it's funny because when last year for Dev Day, we were trying to think about what could be a great demo app to show like an assistive experience. I've always thought travel is a kind of a great use case because you have, like, pictures, you have locations, you have the need for translations, potentially.
[00:42:01] Romain Huet: There's like so many use cases that are bounded to travel that I thought last year, let's use a travel app. And that's how Wanderlust came to be. But of course, a year ago, all we had was a text based assistant. And now we thought, well, if there's a voice modality, what if we just bring this app back as a wink.
[00:42:19] Romain Huet: And what if we were interacting better with voice? And so with this new demo, what I showed was the ability to like, So, we wanted to have a complete conversation in real time with the app, but also the thing we wanted to highlight was the ability to call tools and functions, right? So, like in this case, we placed a phone call using the Twilio API, interfacing with our AI agents, but developers are so smart that they'll come up with so many great ideas that we could not think of ourselves, right?
[00:42:48] Romain Huet: But what if you could have like a, you know, a 911 dispatcher? What if you could have like a customer service? Like center, that is much smarter than what we've been used to today. There's gonna be so many use cases for real time, it's awesome.
[00:43:00] swyx: Yeah, and sometimes actually you, you, like this should kill phone trees.
[00:43:04] swyx: Like there should not be like dial one
[00:43:07] Romain Huet: of course para
[00:43:08] swyx: espanol, you know? Yeah, exactly. Or whatever. I dunno.
[00:43:12] Romain Huet: I mean, even you starting speaking Spanish would just do the thing, you know you don't even have to ask. So yeah, I'm excited for this future where we don't have to interact with those legacy systems.
[00:43:22] swyx: Yeah. Yeah. Is there anything, so you are doing function calling in a streaming environment. So basically it's, it's web sockets. It's UDP, I think. It's basically not guaranteed to be exactly once delivery. Like, is there any coding challenges that you encountered when building this?
[00:43:39] Romain Huet: Yeah, it's a bit more delicate to get into it.
[00:43:41] Romain Huet: We also think that for now, what we, what we shipped is a, is a beta of this API. I think there's much more to build onto it. It does have the function calling and the tools. But we think that for instance, if you want to have something very robust, On your client side, maybe you want to have web RTC as a client, right?
[00:43:58] Romain Huet: And, and as opposed to like directly working with the sockets at scale. So that's why we have partners like Life Kit and Agora if you want to, if you want to use them. And I'm sure we'll have many mores in the, in many more in the future. But yeah, we keep on iterating on that, and I'm sure the feedback of developers in the weeks to come is going to be super critical for us to get it right.
[00:44:16] swyx: Yeah, I think LiveKit has been fairly public that they are used in, in the Chachapiti app. Like, is it, it's just all open source, and we just use it directly with OpenAI, or do we use LiveKit Cloud or something?
[00:44:28] Romain Huet: So right now we, we released the API, we released some sample code also, and referenced clients for people to get started with our API.
[00:44:35] Romain Huet: And we also partnered with LifeKit and Agora, so they also have their own, like ways to help you get started that plugs natively with the real time API. So depending on the use case, people can, can can decide what to use. If you're working on something that's completely client or if you're working on something on the server side, for the voice interaction, you may have different needs, so we want to support all of those.
[00:44:55] Alessio: I know you gotta run. Is there anything that you want the AI engineering community to give feedback on specifically, like even down to like, you know, a specific API end point or like, what, what's like the thing that you want? Yeah. I
[00:45:08] Romain Huet: mean, you know, if we take a step back, I think dev Day this year is all different from last year and, and in, in a few different ways.
[00:45:15] Romain Huet: But one way is that we wanted to keep it intimate, even more intimate than last year. We wanted to make sure that the community is. Thank you very much for joining us on the Spotlight. That's why we have community talks and everything. And the takeaway here is like learning from the very best developers and AI engineers.
[00:45:31] Romain Huet: And so, you know we want to learn from them. Most of what we shipped this morning, including things like prompt caching the ability to generate prompts quickly in the playground, or even things like vision fine tuning. These are all things that developers have been asking of us. And so, the takeaway I would, I would leave them with is to say like, Hey, the roadmap that we're working on is heavily influenced by them and their work.
[00:45:53] Romain Huet: And so we love feedback From high feature requests, as you say, down to, like, very intricate details of an API endpoint, we love feedback, so yes that's, that's how we, that's how we build this API.
[00:46:05] swyx: Yeah, I think the, the model distillation thing as well, it might be, like, the, the most boring, but, like, actually used a lot.
[00:46:12] Romain Huet: True, yeah. And I think maybe the most unexpected, right, because I think if I, if I read Twitter correctly the past few days, a lot of people were expecting us. To shape the real time API for speech to speech. I don't think developers were expecting us to have more tools for distillation, and we really think that's gonna be a big deal, right?
[00:46:30] Romain Huet: If you're building apps that have you know, you, you want high, like like low latency, low cost, but high performance, high quality on the use case distillation is gonna be amazing.
[00:46:40] swyx: Yeah. I sat in the distillation session just now and they showed how they distilled from four oh to four mini and it was like only like a 2% hit in the performance and 50 next.
[00:46:49] swyx: Yeah,
[00:46:50] Romain Huet: I was there as well for the superhuman kind of use case inspired for an Ebola client. Yeah, this was really good. Cool man! so much for having me. Thanks again for being here today. It's always
[00:47:00] AI Charlie: great to have you. As you might have picked up at the end of that chat, there were many sessions throughout the day focused on specific new capabilities.
[00:47:08] Michelle Pokrass, Head of API at OpenAI ft. Simon Willison
[00:47:08] AI Charlie: Like the new model distillation features combining EVOLs and fine tuning. For our next session, we are delighted to bring back two former guests of the pod, which is something listeners have been greatly enjoying in our second year of doing the Latent Space podcast. Michelle Pokras of the API team joined us recently to talk about structured outputs, and today gave an updated long form session at Dev Day, describing the implementation details of the new structured output mode.
[00:47:39] AI Charlie: We also got her updated thoughts on the VoiceMode API we discussed in her episode, now that it is finally announced. She is joined by friend of the pod and super blogger, Simon Willison, who also came back as guest co host in our Dev Day. 2023 episode.
[00:47:56] Alessio: Great, we're back live at Dev Day returning guest Michelle and then returning guest co host Fork.
[00:48:03] Alessio: Fork, yeah, I don't know. I've lost count. I think it's been a few. Simon Willison is back. Yeah, we just wrapped, we just wrapped everything up. Congrats on, on getting everything everything live. Simon did a great, like, blog, so if you haven't caught up, I
[00:48:17] Simon Willison: wrote my, I implemented it. Now, I'm starting my live blog while waiting for the first talk to start, using like GPT 4, I wrote me the Javascript, and I got that live just in time and then, yeah, I was live blogging the whole day.
[00:48:28] swyx: Are you a cursor enjoyer?
[00:48:29] Simon Willison: I haven't really gotten into cursor yet to be honest. I just haven't spent enough time for it to click, I think. I'm more a copy and paste things out of Cloud and chat GPT. Yeah. It's interesting.
[00:48:39] swyx: Yeah. I've converted to cursor and 01 is so easy to just toggle on and off.
[00:48:45] Alessio: What's your workflow?
[00:48:46] Alessio: VS
[00:48:48] Michelle Pokrass: Code co pilot, so Yep, same here. Team co pilot. Co pilot is actually the reason I joined OpenAI. It was, you know, before ChatGPT, this is the thing that really got me. So I'm still into it, but I keep meaning to try out Cursor, and I think now that things have calmed down, I'm gonna give it a real go.
[00:49:03] swyx: Yeah, it's a big thing to change your tool of choice.
[00:49:06] swyx: Yes,
[00:49:06] Michelle Pokrass: yeah, I'm pretty dialed, so.
[00:49:09] swyx: I mean, you know, if you want, you can just fork VS Code and make your own. That's the thing to dumb thing, right? We joked about doing a hackathon where the only thing you do is fork VS Code and bet me the best fork win.
[00:49:20] Michelle Pokrass: Nice.
[00:49:22] swyx: That's actually a really good idea. Yeah, what's up?
[00:49:26] swyx: I mean, congrats on launching everything today. I know, like, we touched on it a little bit, but, like, everyone was kind of guessing that Voice API was coming, and, like, we talked about it in our episode. How do you feel going into the launch? Like, any design decisions that you want to highlight?
[00:49:41] Michelle Pokrass: Yeah, super jazzed about it. The team has been working on it for a while. It's, like, a very different API for us. It's the first WebSocket API, so a lot of different design decisions to be made. It's, like, what kind of events do you send? When do you send an event? What are the event names? What do you send, like, on connection versus on future messages?
[00:49:57] Michelle Pokrass: So there have been a lot of interesting decisions there. The team has also hacked together really cool projects as we've been testing it. One that I really liked is we had an internal hack a thon for the API team. And some folks built like a little hack that you could use to, like VIM with voice mode, so like, control vim, and you would tell them on like, nice, write a file and it would, you know, know all the vim commands and, and pipe those in.
[00:50:18] Michelle Pokrass: So yeah, a lot of cool stuff we've been hacking on and really excited to see what people build with it.
[00:50:23] Simon Willison: I've gotta call out a demo from today. I think it was Katja had a 3D visualization of the solar system, like WebGL solar system, you could talk to. That is one of the coolest conference demos I've ever seen.
[00:50:33] Simon Willison: That was so convincing. I really want the code. I really want the code for that to get put out there. I'll talk
[00:50:39] Michelle Pokrass: to the team. I think we can
[00:50:40] Simon Willison: probably set it up. Absolutely beautiful example. And it made me realize that The Realtime API, this WebSocket API, it means that building a website that you can just talk to is easy now.
[00:50:50] Simon Willison: It's like, it's not difficult to build, spin up a web app where you have a conversation with it, it calls functions for different things, it interacts with what's on the screen. I'm so excited about that. There are all of these projects I thought I'd never get to, and now I'm like, you know what? Spend a weekend on it.
[00:51:04] Simon Willison: I could have a talk to your data, talk to your database. With a web, with a, with a little web application. Yeah. That's so
[00:51:10] Michelle Pokrass: cool. Chat with PDF, but really chat with, really chat with pdf. No, completely.
[00:51:15] Simon Willison: Totally. And that's not even hard to build. That's the crazy thing about this.
[00:51:18] Michelle Pokrass: Yeah. Very cool. Yeah, when I first saw the space demo, I was actually just wowed and I, and I had a similar moment I think to all the people in the crowd.
[00:51:27] Michelle Pokrass: I also thought Romain's drone demo was super cool. That was a super
[00:51:30] Simon Willison: fun one as well. Yeah, I
[00:51:31] Michelle Pokrass: actually saw that live this morning, and I was holding my breath for sure.
[00:51:35] swyx: Knowing Romain, he probably spent the last two days working on it. But yeah, like, I'm curious about you were talking with Romain actually earlier about what the different levels of extraction are with WebSockets.
[00:51:47] swyx: It's something that most developers have zero experience with. I have zero experience with it. Apparently there's like, the RTC level, and then there's the WebSocket level, and there's like, levels in between.
[00:51:56] Simon Willison: Not so much. I mean, with WebSockets with the way they've built their API, you can connect directly to the OpenAI WebSocket from your browser.
[00:52:04] Simon Willison: And it's actually just regular JavaScript. Like, you instantiate the WebSocket thing. It looks quite easy from their example code. The problem is that if you do that, you're sending your API key. From like, source code that anyone can view. Yeah, we
[00:52:16] Michelle Pokrass: don't recommend that for production.
[00:52:18] Simon Willison: So it doesn't work for production, which is frustrating, because it means that you have to build a proxy.
[00:52:23] Simon Willison: So I'm going to have to go home and build myself a little WebSocket proxy just to hide my API key. I want OpenAI to do that. I want OpenAI to solve that problem for me, so I don't have to build the 1000th WebSocket proxy just for that one problem. Totally.
[00:52:36] Michelle Pokrass: We've also partnered with some some partner solutions.
[00:52:39] Michelle Pokrass: We've partnered with, I think, Agora. LiveKit a few others. So there's some loose solutions there, but yeah, we hear you. It's a beta.
[00:52:49] swyx: Yeah, yeah, I mean You still want a solution where someone brings their own key, And they can trust that you
[00:52:55] Simon Willison: don't get it.
[00:52:56] swyx: Right?
[00:52:56] Simon Willison: Kind of. I mean, I've been building a lot of bring your own key apps, Where it's my HTML and JavaScript, I store the key in local storage in their browser, And it never goes anywhere near my server.
[00:53:06] Simon Willison: Which works, but how do they trust me? How do they know I'm not gonna ship another piece of javascript that steals the key from them? And so, nominally, this actually
[00:53:13] swyx: comes with the crypto background. This is what MetaMask does. Where Yeah, it's a
[00:53:18] Michelle Pokrass: public private key thing. Yeah. Yeah.
[00:53:20] swyx: Like, why doesn't OpenAI do that?
[00:53:22] swyx: I don't know if, obviously it's
[00:53:24] Michelle Pokrass: I mean, as with most things, I think there's, like, some really interesting questions. And the answer is just, you know, it's not been the top priority and it's hard for a small team to do everything. I have been hearing a lot more about the need for things like sign in with OpenAI.
[00:53:40] Simon Willison: I want OAuth. I want to bounce my users through chat GPT and I get back a token that lets me spend up to 4 on the API on their behalf. Then I could ship all of my stupid little experiments, which currently require people to copy and paste their API key in, which cuts off everyone. Nobody knows how to do that.
[00:53:57] Michelle Pokrass: Totally, I hear you. Something we're thinking about, and yeah, stay tuned.
[00:54:01] swyx: Yeah, yeah right now, I think the only player in town is OpenRouter that is basically, it's funny, it was made by I forget his name but he used to be CTO of OpenSea, and the first thing he did when he came over was build Metamask for AI.
[00:54:16] Michelle Pokrass: Totally. Yeah, very cool.
[00:54:19] Alessio: What's the most underrated release from today?
[00:54:23] Michelle Pokrass: Vision Fine Tuning. Vision Fine Tuning is so underrated. For the past, like, two months, whenever I talk to founders, they tell me this is the thing they need most. A lot of people are doing, like, OCR on very bespoke formats, like government documents, and Vision Fine Tuning can help a lot with that use case.
[00:54:39] Michelle Pokrass: Also, bounding boxes. People have found, like, a lot of improvements for bounding boxes with Visionfine Tuning. So yeah, I think it's pretty slept on and people should try it. You only really need 100 images to get going.
[00:54:49] Simon Willison: Tell me more about bounding boxes. I didn't think that GPT 4 Vision could do bounding boxes at all.
[00:54:55] Michelle Pokrass: Yeah, it's actually not that amazing at it, we're working on it, but with fine tuning, you can make it really good for your use case.
[00:55:02] Simon Willison: That's cool, because I've been using Google Gemini's bounding block stuff recently, it's very, very impressive.
[00:55:06] Michelle Pokrass: Yeah, totally. But
[00:55:07] Simon Willison: being able to fine tune a model for that. The first thing I'm going to do with fine tuning for images is, I've got fine tuning.
[00:55:13] Simon Willison: And I'm going to fine tune a model that can tell which chicken is which. Which is hard because three of them are grey. So there's a little bit of Okay, this is
[00:55:20] Michelle Pokrass: my new favourite use case. Yeah, it's
[00:55:22] Simon Willison: I've managed to do it with prompting. Just like, I gave Claude Pictures of all of the chickens and then said, okay, which chicken is this?
[00:55:30] Michelle Pokrass: Yeah,
[00:55:30] Simon Willison: but it's not quite good enough because it confuses the great chicken. Listen,
[00:55:33] Michelle Pokrass: we can close that eval gap. Yeah That's it's
[00:55:36] Simon Willison: gonna be a great eval. My chicken eval is gonna be fantastic.
[00:55:39] Michelle Pokrass: I'm also really jazzed about the evals product It's kind of like a sub launch of the distillation thing But people have been struggling to make evals and the first time I saw the flow with how easy it is to make an eval And in our product, I was just blown away so I recommend people really try that.
[00:55:53] Michelle Pokrass: I think that's what's holding a lot of people back from really investing in AI, because they just have a hard time figuring out if it's going well for their use case. So we've been working on making it easier to do that.
[00:56:03] Alessio: Does the eval product include structured output testing? Like, function calling and things?
[00:56:08] Alessio: Yeah, you can
[00:56:08] Michelle Pokrass: check if it matches your JSON schema yeah.
[00:56:12] swyx: I mean, we have guaranteed structured output anyway, right? Well, but So we don't have to test it. Well,
[00:56:18] Michelle Pokrass: not the schema, but like the See, these seem easy to tell apart. I think so. So I might call them a function,
[00:56:24] Alessio: or Oh, I see. You're gonna write schema, wrong output.
[00:56:27] Alessio: So you can do function
[00:56:28] swyx: calling testing. Right.
[00:56:29] Michelle Pokrass: I'm pretty sure. I'll have to check that for you, but I think
[00:56:31] Alessio: so. Yeah, yeah, yeah. We'll make sure it's sent
[00:56:33] swyx: out.
[00:56:33] Alessio: How do you think about the evolution of, like, the API design? I think to me that's, like, the most important thing, so even with the OpenAI levels, like, chatbots, I can understand what the API design looks like. Reasoning, I can kind of understand it, even though, like, train of thought kind of changes things.
[00:56:49] Alessio: As you think about real time voice, and then you think about agents, it's like, how do you think about how you design the API, and, like, what the shape of it is?
[00:56:58] Michelle Pokrass: Yeah, so I think we're starting with the lowest level capabilities. And then we build on top of that, as we know that they're useful. So, a really good example of this is Realtime.
[00:57:07] Michelle Pokrass: We're actually going to be shipping audio capabilities in chat completions. So this is like the lowest level capability. So you supply in audio, and you can get back raw audio, and it works at the request response layer. But, in through building advanced voice mode, we realized ourselves that like, it's not It's pretty hard to do with something like Chat Completions, and so that led us to building this WebSocket API.
[00:57:28] Michelle Pokrass: So we really learned a lot from our own tools, and we think, you know, the Chat Completions thing is nice, and for certain use cases, or async stuff, but you're really gonna want a real time API? And then as we, you know, test more with developers, we might see that it makes sense to have like another layer of abstraction on top of that.
[00:57:44] Michelle Pokrass: Something like closer to you know, more client side libraries. But, for now, you know, that's where we feel we have like a really good point of view.
[00:57:52] Simon Willison: So that's a question I have is if I've got a half hour long audio recording, At the moment, the only way I can feed that in is if I call the WebSocket API and slice it up into little JSON basics for snippets and fire them all over.
[00:58:04] Simon Willison: That's it. In that case, I'd rather just give you a, like an image in the chat completion API, give you a URL files and input. Is that something That's what we're
[00:58:11] Michelle Pokrass: going to do.
[00:58:12] Simon Willison: Oh, thank goodness for that.
[00:58:13] Michelle Pokrass: Yes. It's in the blog post. I think it's a short one liner, but it's rolling out, I think, in the coming weeks.
[00:58:17] Michelle Pokrass: Oh, wow.
[00:58:18] Simon Willison: Oh, really soon then.
[00:58:19] Michelle Pokrass: Yeah, the team has been sprinting we're just putting finishing touches on stuff. Do you
[00:58:22] Simon Willison: have a feel for the length limit on that?
[00:58:24] Michelle Pokrass: I don't have it off the top. Okay. Sorry.
[00:58:26] Simon Willison: Because, yeah, often I want to do, I do a lot of work with, like, transcripts of hour long YouTube videos, which Yeah.
[00:58:31] Simon Willison: Yeah. Currently, I run them through Whisper and then I do the transcript that way, but being able to do the multimodal thing with those would be really useful.
[00:58:37] Michelle Pokrass: Totally, yeah. We're really jazzed about it. We want to basically give the lowest capabilities we have, lowest level capabilities, and, you know, the things that make it easier to use.
[00:58:45] Michelle Pokrass: And so, you know, targeting kind of both. I
[00:58:50] Simon Willison: just realized what I can do, though, is I do a lot of Unix utilities, little, like, Unix things. I want to be able to pipe the output of a command into something which streams that up to the WebSocket API and then speaks it out loud. So I can do streaming speech of the output of things.
[00:59:06] Simon Willison: That should work. Like, I think you've given me everything I need for that. That's cool.
[00:59:10] Michelle Pokrass: Yeah. Excited to see what you build. Is
[00:59:14] swyx: there I heard there are, like, multiple competing solutions. And you guys evaluated before you picked WebSockets. Like server set events, polling, I don't, like, can you give, like, your thoughts on, like, the live updating paradigms that you guys looked at?
[00:59:31] swyx: Because I think a lot of engineers have looked at stuff like this.
[00:59:34] Michelle Pokrass: Well, I think WebSockets are just a natural fit for bi directional streaming. You know, other places I've worked, like, Coinbase, we had a WebSocket API for pricing data. I think it's just like a very natural format.
[00:59:46] swyx: So it wasn't even really that controversial at all?
[00:59:49] Michelle Pokrass: I don't think it was super controversial. I mean, we definitely explored the space a little bit, but I think we came to WebSockets pretty quickly.
[00:59:56] swyx: Cool. Video?
[00:59:58] Michelle Pokrass: Yeah. Not yet, but, you know.
[01:00:03] swyx: I actually was hoping for the chat, GPT desktop app with video today. Yeah. Yeah.
[01:00:09] Simon Willison: Oh,
[01:00:10] Michelle Pokrass: my
[01:00:11] Simon Willison: question is one frame a second.
[01:00:16] Simon Willison: How frequently? Yeah.
[01:00:19] swyx: Because Yeah, I mean sending a sending a whole video frame of like a 1080p screen. Maybe it might be too much What's the limitations on a on a WebSocket chunk going over? I don't know
[01:00:33] Michelle Pokrass: I don't have that off the top
[01:00:34] Simon Willison: Like Google Gemini you can do an hour's worth of video in their context window and just by slicing it up into one frame At ten frames a second and it does work so I Don't know.
[01:00:46] Simon Willison: I'm I'm not sure But then that's the weird thing about Gemini is it's so good at you just giving it a flood of individual frames It'll be interesting to see if GPT 4. 0 can handle that or not
[01:00:55] Alessio: Do you have any more feature requests? It's been a long day for everybody, but you got you got me show right here So my one
[01:01:03] Simon Willison: is I want you to do all of the accounting for me I want my users to be able to run my app And I want them to call your APIs with their user ID and have you go, oh, they've spent 30 cents.
[01:01:15] Simon Willison: Check, cut them off at a dollar. I can like, check how much they spent. All of that stuff, because I'm having to build that at the moment, and I really don't want to. I don't want to be a token accountant. I want you to do the token accounting for me.
[01:01:26] Michelle Pokrass: Yeah, totally. I hear you. It's good feedback.
[01:01:29] swyx: Well, like, how does that contrast with your actual priorities, right?
[01:01:32] swyx: Like, I feel like you have a bunch of priorities. They showed some on stage with multi modality and all that.
[01:01:37] Michelle Pokrass: Yeah.
[01:01:37] swyx: Like
[01:01:39] Michelle Pokrass: Yeah it's good feedback. It's hard to say. I would say things change really quickly. Things that are big adop big blockers for user adoption we find very important. And, yeah. It's a rolling prioritization.
[01:01:53] Michelle Pokrass: Yeah.
[01:01:54] swyx: No assistance API update.
[01:01:56] Michelle Pokrass: Not at this time. Yeah. Yeah.
[01:01:59] swyx: I was hoping for, like, an O1 native. Do thing in assistance? Yeah. I thought they would go well together. we're still
[01:02:07] Michelle Pokrass: kind of iterating on the formats, I think there are some problems with the assistance API. Some things it does really well.
[01:02:13] Michelle Pokrass: And I think we'll keep iterating and land on something really good. But just, you know, it wasn't quite ready yet. Some of the things that are good in the assistance API is hosted tools. People really like hosted tools and especially RAG. And then some things that are, you know, less intuitive is just how many API requests you need to get going with the assistance API.
[01:02:30] Michelle Pokrass: It's
[01:02:30] Simon Willison: quite.
[01:02:30] Michelle Pokrass: It's quite a lot. Yeah, you gotta create an assistant, you gotta create a thread, you gotta, you know, do all this stuff. So yeah, it's something we're thinking about. It shouldn't be so hard.
[01:02:39] Simon Willison: The only thing I've used it for so far is Code Interpreter. It's like it's an API to Code Interpreter.
[01:02:43] Simon Willison: Crazy exciting. Yeah.
[01:02:44] Michelle Pokrass: Yes, we want to fix, we want to fix that and make it easier to use, so. I
[01:02:48] Simon Willison: want code intercepts over WebSockets, that would be wildly interesting.
[01:02:53] swyx: Yeah, do you, do you want to bring your own code interpreter or you want to use OpenAI's one? I want to
[01:02:57] Simon Willison: use theirs, because code intercepts is a hard problem, sandboxing and all of that stuff is Yeah, but there's a bunch
[01:03:02] swyx: of code interpreter as a
[01:03:03] Simon Willison: service
[01:03:04] swyx: things out there.
[01:03:04] swyx: There are a few now, yeah. Because there's, I think you don't Allow arbitrary installation of packages. Oh, they do. Unless
[01:03:10] Simon Willison: they really do actually use your hack code. It, huh?
[01:03:13] Michelle Pokrass: Yeah,
[01:03:13] Simon Willison: and I do.
[01:03:14] Michelle Pokrass: Yeah. You upload a pit package,
[01:03:16] Simon Willison: you can run, you can compile C code and code interpreter. I know. You know, to do it.
[01:03:20] Simon Willison: That's a hack. Oh, it's such a glorious hack though. Okay. I've had it Write me custom seql light extensions in C and compile them and run them inside of Python and it works.
[01:03:31] swyx: I mean, yeah, there's, there's others. E two B is one of them, like, yeah. It'll be interesting to see what the real time version of that will be.
[01:03:39] Alessio: Awesome, Michelle. Thank you for the update. We left the episode as, what will voice mode look like? Obviously, you knew what it looked like, but you didn't say it, so now you could share this.
[01:03:50] Alessio: Yeah, here we are. Hope you
[01:03:51] AI Charlie: guys
[01:03:51] Alessio: like
[01:03:52] swyx: it. Yeah, awesome. That's
[01:03:53] Alessio: it.
[01:03:53] AI Charlie: Our final guest today, and also a familiar, recent voice on the Latent Space pod, presented at one of the community talks at this year's Dev Day. Alistair Pullen of Cosene made a huge impression with all of you. Special shout out to listeners like Jesse from Morphlabs, when he came on to talk about how he created synthetic datasets to fine tune the largest LORAs that had ever been created for GPT 4.
[01:04:20] AI Charlie: 0 to post the highest ever scores on SWEbench and SWEbench Verified. While not getting recognition for it, because he refused to disclose his reasoning traces to the SWEbench team. Now that OpenAI's R1 preview is announced, it is incredible to see the OpenAI team also obscure their chain of thought traces for competitive reasons, and still perform lower than Cozine's genie model.
[01:04:45] Alistair Pullen, CEO, Cosine (Genie)
[01:04:45] AI Charlie: We snagged some time with Ali to break down what has happened since his episode aired.
[01:04:50] swyx: Welcome back, Ali. Thank you so much. Thanks for having me. So you just spoke at OpenAI Dev Day. What was the experience like? Did they reach out to you? You seem to have a very close relationship.
[01:04:59] Alessio: Yeah, so off the back of, off the back of the work that we've done, that we spoke about last time we saw each other I think that OpenAI definitely felt that the work we've been doing around fine tuning was worth sharing.
[01:05:10] Alessio: I would obviously tend to agree, but today today I spoke about some of the techniques that we learned. Obviously it was like a non linear path arriving to where we've arrived and the techniques that we've built to build Genie. So I definitely, I think I shared a few, a few extra pieces about some of the techniques and how it really works under the hood.
[01:05:25] Alessio: How you generate a data set to show the model how to do what we show the model. And that was mainly what I spoke about today. I mean, yeah, they reached out and they were, I was, I was Super excited at the opportunity, obviously, like, it's not every day that you get to come and do this. Especially in San Francisco, so Yeah, they reached out and they were like, do you want to talk at Dev Day?
[01:05:41] Alessio: You can speak about basically anything you want related to what you've built, and I was like, sure, that's amazing. I'll talk about fine tuning, how you build a model that does this software engineering, so yeah.
[01:05:50] swyx: Yeah and the trick here is when we talked, O1 was not out. No, it wasn't. Did you know about O1, or?
[01:05:57] Alessio: I didn't know. I knew some bits and pieces. No, not really. I knew a reasoning model was on the way. I didn't know what it was going to be called. I knew as much as everyone else. Strawberry was the name back then. Because,
[01:06:08] swyx: you know, I'll fast forward. You were the first to hide your chain of thought, reasoning traces as IP.
[01:06:14] swyx: Yes. Right? Famously, that got you in trouble with 3Bench or whatever. Yes. I feel slightly vindicated by that now. And now, obviously, O1 is doing it. Yeah, the
[01:06:22] Alessio: fact that, yeah, I mean, like, I think it's, I think it's true to say right now that the reasoning of your model gives you the edge that you have. Unlike.
[01:06:33] Alessio: The amount of effort that we put into our data pipeline to generate these human like reasoning traces was, I mean, that wasn't for nothing. We knew that this was the way that you'd unlock more performance, getting the model to think in a specific way. In our case, we wanted it to think like a software engineer.
[01:06:46] Alessio: But, yeah, I think, I think that, The approach that other people have taken, like OpenAI, in terms of reasoning, has definitely showed us that we were going down the right path pretty early on. And even now, we've started replacing some of the reasoning traces in our genie model with reasoning traces generated by O1, or at least in tandem with O1.
[01:07:09] Alessio: And we've already started seeing improvements in performance from that point. But no, like back to your point, in terms of like the, the whole like approach. Withholding them. I, I, I, I still think that that was the right decision to do because of the very reason that everyone else has decided to, to, to, to not share those things.
[01:07:26] Alessio: It's, it is exactly, it shows exactly how we do what we do and that is our edge at the moment. So,
[01:07:32] Alessio: yeah. As a founder, so, they also feature Cognition on, on stage, talk about that. How does that make you feel that like, you know, they're like, hey, 01 is so much better, makes us better. For you, it should be like.
[01:07:45] Alessio: Oh, I'm so excited about it too, because now all of a sudden it's like, it kind of like, raises the floor for everybody, like, how should people, especially new founders, how should they think about, you know, worrying about the new model versus like, being excited about them just focusing on like, the core FP and maybe switching out some of the parts, like you mentioned.
[01:08:00] Alessio: Yeah, I, I, I, I, speaking for us, I mean obviously like, we were extremely excited about O1 because, At that point, the process of reasoning is obviously very much baked into the model. We fundamentally, if you like, remove all distractions and everything, we are a reasoning company. Right? We want to reason in the way that a software engineer reasons.
[01:08:18] Alessio: So when I saw that model announced, I thought immediately, well, I can improve the quality of my traces coming out of my pipeline, so like, my signal to noise ratio gets better. And then, not immediately, but down the line, I'm going to be able to train those traces into O1 itself. So I'm going to get even more performance that way as well.
[01:08:35] Alessio: So it's For us, a really nice position to be in, to be able to take advantage of it, both on the prompted side and the fine tuned side. And also because, fundamentally, like, we are, I think, fairly clearly in a position now where we don't have to worry about what happens when O2 comes out, what happens when O3 comes out.
[01:08:51] Alessio: This process continues, like, even going from You know, when we first started going from 3. 5 to 4, we saw this happen and then from 4 turbo to 4. 0 and then from 4. 0 to 0. 1, we've seen the performance get better every time and I think, I mean, like, the crude advice I'd give to any startup founder is try to put yourself in a position where you can take advantage of the same, you know, like, C level rise every time, essentially.
[01:09:15] swyx: Do you make anything out of the fact that you were able to take 4. 0 and fine tune it higher than 0. 1 currently scores on SweeBench Verified? Yeah, I mean like,
[01:09:25] Alessio: that was obviously, to be honest with you, you realized that before I did. Adding value. Yes, absolutely, that's a value add investor right there. No, obviously I think it's been, that in of itself is really vindicating to see because I think, I think we have, heard from some people, not a lot of people, but some people saying, well, okay, well, if I, one can reason, then what's the point of doing your reasoning, but it shows how much more signal is in, like the custom reasoning that we generate.
[01:09:52] Alessio: And again, it's the, it's the very sort of obvious thing. If you take something that's made to be general and you make it specific, of course, it's going to be better at that thing. Right? So it was obviously great to see, like, we still are better than no one out of the box. You know, even with an older model, and I'm sure that that's, you know, That delta will continue to grow once we're able to train O1, and once we've done more work on our dataset using O1, like, that delta will grow as well.
[01:10:13] swyx: It's not obvious to me that they will allow you to fine tune O1, but, you know, maybe they'll try. I think the, the, the core question that OpenAI really doesn't want you to figure out is can you use an open source model and beat O1?
[01:10:28] Romain Huet: Interesting. Because, because
[01:10:30] swyx: you basically have shown proof of concept that a non O1 model can beat O1.
[01:10:35] swyx: And their whole L1 marketing is, don't bother trying. Like, don't bother stitching together multiple chain of thought calls. We did something special, secret sauce, you don't know anything about it. And somehow, you know, your 4. 0 chain of thought reasoning as a software engineer is still better. Maybe it doesn't last.
[01:10:53] swyx: Maybe they're going to run L1 for five hours instead of five minutes, and then suddenly it works. So, I don't know.
[01:10:59] Alessio: It's hard to know. I mean, one of the things that we just want to do out of sheer curiosity is do something like fine tune 405B on the same dataset. Like, same context window length, right? So, it should be fairly easy.
[01:11:09] Alessio: We haven't done it yet. Truthfully, we have been so swamped with the waitlist, shipping product, you know, dev day, like, you know, onboarding customers from our waitlist. All these different things have gotten in the way, but it is definitely something out of more curiosity than anything else I'd like to try out.
[01:11:23] Alessio: But also It opens up a new vector of like, if someone has a VPC where they can't deploy an OpenAI model, but they might be able to deploy an open source model, it opens that up for us as well from a customer perspective. So it'll probably be quite useful. I'd be very keen to see what the results are though.
[01:11:38] Alessio: I suspect the answer is yes,
[01:11:40] swyx: but it may be hard to do. So like Reflection70b was like a really crappy attempt at doing it. You guys were much better, and that's why we had you on the show. I, yeah, I'm interested to see if there's an OpenO1 basically. If people want OpenO1.
[01:11:53] Alessio: Yeah, I'm sure they do. As soon as we, as soon as we do it, I'm like, Once we've wrapped up what we're doing in San Francisco, I'm sure we'll give it a go.
[01:12:01] Alessio: I spoke to some guys today, actually, about fine tuning 405B, who might be able to allow us to do it very, like, very easily. I don't want to have to basically do all the setup myself. So, yeah, that might happen sooner rather than later.
[01:12:15] Alessio: Anything from the releases today that you're super excited about? So prompt caching, I'm guessing when you're like dealing with a lot of codebases, that might be helpful.
[01:12:22] Alessio: Is there anything with vision fine tuning related to
[01:12:25] Alessio: like more like UI related development? Yeah, definitely. Yeah, I mean like we were talking, it's funny, like my co founder Sam, who you've met, and I were talking about the idea of doing vision fine tuning. Like, way back, like, well over a year ago, before Genie existed as it does now when we, when we collected our original dataset to do what we do now whenever there were image links and links to, like like, graphical resources and stuff, we also pulled that in as well.
[01:12:50] Alessio: We never had the opportunity to use it, but it's something we have in storage. And, again, like, when we have the time, it's something that I'm super excited, particularly on the UI side. To be able to, like, leverage, particularly if you think about one of the things, I mean, not to sidetrack, but one of the things we've noticed is, I know Swebench is, like, the most commonly talked about thing, and honestly, it's a very, it's an amazing project, but, One of the things we've learned the most from actually shipping this product to users is, It's a pretty bad proxy at telling us how competent the model is, so, for example, When people are doing, like, React development using Genie, For us, it's impossible to know whether what it's written has actually done, you know, done what it wanted to.
[01:13:26] Alessio: So at least even using, like, the fine tuning provision to be able to help eval, like, what we output is already something that's very useful. But also, in terms of being able to pair, here's a UI I want, here's the code that actually, like, represents that UI, is also going to be super useful as well, I think.
[01:13:42] Alessio: In terms of generally, what have I been most impressed by? The distillation thing is awesome. I think we'll probably end up using it in places. But what it shows me more broadly about OpenAI's approach is they're going to be building a lot of the things that we've had to hack together internally, in terms from a tooling point of view, just to make our lives so much easier.
[01:14:03] Alessio: And I've spoken to, you know, John, the head of fine tuning, extensively about this. But there's a bunch of tools that we've had to build internally for things like dealing with model lineage, dealing with dataset lineage, because it gets so messy so quickly, that we would love OpenAI to build. Like, absolutely would love them to build it.
[01:14:19] Alessio: It's not, it's not what gives us our edge, but it certainly means that then we don't have to build it and maintain it afterwards. So, it's a really good first step, I think, in, like, the overall maturity of the fine tuning product and API in terms of where they're going to see those early products. And I think that they'll be continuing in that direction going on.
[01:14:37] Alessio: Did you not, so there's a very
[01:14:39] swyx: active ecosystem of LLLmaps tools. Mm hmm. Did you not evaluate those before building your own?
[01:14:47] Alessio: We did, but I think fundamentally, like, No more. Yeah, like, I think, in a lot of places, it was never a big enough pain point to be like, oh, we absolutely must outsource this. It's definitely, in many places, something that you can hack a script together In a day or two, and then hook it up to our already existing internal tool UI, and then you have, you know, what you need, and whenever you need a new thing, you just tack it on.
[01:15:14] Alessio: But for, like, all of these LLM Ops tools, I've never felt the pain point enough to really, like, bother, and that's not to deride them at all, I'm sure many people find them useful, but just for us as a company, we've never felt the need for them. So it's great that, it's great that OpenAI are going to build them in because it's really nice to have them there, for sure.
[01:15:36] Alessio: But it's not something that, like, I'd ever consider really paying for externally or something like that, if that makes sense.
[01:15:40] swyx: Yeah. Does voice mode factor into Genie?
[01:15:44] Alessio: Maybe one day, that'd be sick, wouldn't it? I don't know. Yeah, I think so. You're
[01:15:48] swyx: the first person, we've been asking this question to everybody.
[01:15:50] swyx: Yeah, I think. You're the first person to not mention voice mode.
[01:15:52] Alessio: Oh, well, it's, it's, it's currently so distant from what we do. But I definitely think, like, this whole talk, if we want it to be a full on AI software engineering colleague, like, there is definitely a vector in some way that you can build that in.
[01:16:06] Alessio: Maybe even during the ideation stage, talking through a problem with Genius in terms of how we want to build something down the line. I think that might be useful, but honestly, like, that would be nice to have when we have the time. Yeah, amazing.
[01:16:19] swyx: One last question. On your in your talk, you mentioned a lot about So you're curating your data and your distribution and all that, and before we sat down you talked a little bit about having to diversify your dataset.
[01:16:30] swyx: Absolutely, yeah. What's driving that,
[01:16:32] Alessio: what are you finding? So, we have been rolling people off the waitlist that we sort of amassed when we announced when I last saw you. And it's been really interesting because as I may have mentioned on the podcast, like we had to be very opinionated about the data mix and the data set that we put together for like sort of the V0 of Genie.
[01:16:49] Alessio: Again, like, to your point, Javascript, Javascript, Javascript, Python, right? There's a lot of Javascripts in its various forms in there. But it turns out that when we've shipped it to the very early alpha users we rolled it out to for example, we had some guys using it with a C sharp codebase.
[01:17:05] Alessio: And C sharp currently represents, I think, about 3 percent of the overall data mix. And they weren't getting the levels of performance that they saw when they tried it with a Python codebase. And It was obviously not great for them to have a bad experience, but it was nice to be able to correlate it with the actual, like, objective data mix that we saw.
[01:17:25] Alessio: So we did what we've been doing is like little top up fine tunes where we take, like, the general genie model and do an incremental fine tune on top with just a bit more data for a given, you know, vertical language. And we've been seeing improvements coming from that. So. Again, this is one of the great things about sort of baptism by fire and letting people use it and giving you feedback and telling you where it sucks.
[01:17:46] Alessio: Because that is not something that we could have just known ahead of time. So I want that data mix to, over time as we roll it out to more and more people, and we are trying to do that as fast as possible, but we're still a team of five for the time being. And so To be as general and as representative of what our users do as possible and not what we think they need.
[01:18:02] swyx: Yeah, so every customer is going to have their own fine
[01:18:05] Alessio: tune. There is going to be the option to, yeah, there is going to be the option to fine tune the model on your code base. It won't be in, like, the base pricing tier, but you will definitely be able to do that. It will go through All of your codebase history, learn how everything happened, and then you'll have an incrementally fine tuned genie just on your codebase.
[01:18:23] Alessio: That's what enterprises really love the idea of. Perfect.
[01:18:27] swyx: Anything else? Yeah, that's it. Thank you so much. Thank you so
[01:18:29] Alessio: much, guys. Good to
[01:18:30] swyx: see you.
[01:18:31] Sam Altman + Kevin Weill Q&A
[01:18:31] AI Charlie: Lastly, this year's Dev Day ended with an extended Q& A with Sam Altman and Kevin Weil. We think both the questions asked and answers given were particularly insightful, so we are posting what we could snag of the audio here from publicly available sources.
[01:18:48] AI Charlie: Credited in the show notes, for you to pick through. If the poorer quality audio here is a problem, we recommend waiting for approximately 1 to 2 months until the final video is released on YouTube. In the meantime, we particularly recommend Sam's answers on the moderation policy, on the underappreciated importance of agents and AI employees beyond level 3.
[01:19:11] AI Charlie: And his projections of the intelligence of O1, O2, and O3 models in future.
[01:19:23] Speaker 17: Alright, I think everybody knows you. For those who don't know me, I'm Kevin Wheel, Chief Product Officer at OpenAI. I have the good fortune of getting to turn the amazing research that our research teams do into the products that you all use every day and the APIs that you all build on every day. I thought we'd start with some audience engagement here.
[01:19:42] Speaker 17: So on the count of three, I want to count to three, and I want you all to say, of all the things that you saw launched here today, what's the first thing you're going to integrate? It's the thing you're most excited to build on. Alright? You gotta do it. Alright? One, two, three. Real time
[01:20:01] Alex Volkov: API!
[01:20:03] Speaker 17: I'll say personally, I'm super excited about our distillation products.
[01:20:07] Speaker 17: I think that's going to be really, really interesting. I'm also excited to see what you all do with advanced voicemail with the real time API, and with vision fine tuning in particular. Okay, so I've got some questions for Sam, I've got my CEO here in the hot seat, let's see if I can't make a career limiting move.
[01:20:30] Speaker 17: So we'll start this we'll start with an easy one, Sam. How close are we to AGI?
[01:20:37] Sam Altman: You know, we used to, every time we finished a system, we would say like, in what way is this not an AGI? Okay. And it used to be like, very easy, you could like, make a little robotic hand that does a prefix cube, or a dotabot, and it's like, oh, it does some things, but definitely not an AGI.
[01:20:54] Sam Altman: It's obviously harder to say now, and so we're trying to like, stop talking about AGI as this general thing. We have this levels framework, because the word AGI has become so overloaded. So like, real quickly, we use one for chatbots, two for reasoners, three for agents, four for innovators, five for organizations, like roughly.
[01:21:15] Sam Altman: I think we clearly got to level two, or we clearly got to level two. With O1 and it, you know, can do really quite impressive Python tasks. It's a very smart model. It doesn't feel AGI like in a few important ways, but I think if you just do the one next step of making it, you know, very agent like, which is our level three, and which I think we will be able to do in the not distant future, It will feel surprisingly capable still probably not something that most of you would call an AGI, though maybe some of you would but it's going to feel like, all right, this is, this is like a significant thing.
[01:21:52] Sam Altman: And then the, the leap, and I think we do that pretty quickly the, the leap from that to something that can really increase the rate of new scientific discovery, which for me is like a very important part. of having an AGI. I feel a little bit less certain on that, but not a long time. Like, I think all of this now is going to happen pretty quickly, and if you think about what happened from last decade to this one, in terms of model capabilities, and you're like, eh.
[01:22:20] Sam Altman: I mean, if you go look at like, If you go from my 01 on a hard problem back to like 4Turbo that we launched 11 months ago, you'll be like, wow, this is happening pretty fast. And I think the next year will be very steep progress. Next two years will be very steep progress. Harder than that. Hard to say with a lot of certainty.
[01:22:34] Sam Altman: But I would say like the math will vary. And at this point, the definitions really matter. And in fact, the fact that the definitions matter this much, Somehow means we're, like, getting pretty close. Yeah.
[01:22:45] Speaker 17: And, you know, there used to be this sense of AGI where it was like, it was a binary thing, and you were gonna go to sleep one day, and there was no AGI, and wake up the next day and there was AGI.
[01:22:56] Speaker 17: I don't think that's exactly how we think about it anymore, but how have your
[01:23:00] Sam Altman: views on this evolved? You know, the one, I agree with that, I think we're, like, you know, in this, like, kind of period where it's It's gonna feel very blurry for a while, and the, you know, is this AGI yet, or is this not AGI, or kind of like, at what point?
[01:23:16] Sam Altman: It's just gonna be this like, smooth exponential, and, you know, probably most people, looking back at history, won't agree, like, when that milestone was hit, and will just realize it was like, a silly thing. Even the Turing test, which I thought always was like, this very clear milestone, you know, there was this like, fuzzy period.
[01:23:33] Sam Altman: It kind of like, went oosh and bye, no one cared But, but I think the right framework is just this one exponential. That said if we can make an AI system that is like materially better at all of open AI than doing, at doing AI research, that does feel to me like some sort of important discontinuity.
[01:23:53] Sam Altman: It's probably still wrong to think about it that way. It probably still is the smooth exponential curve. Bye. That feels like a new milestone.
[01:24:00] Alex Volkov: Is
[01:24:03] Speaker 17: OpenAI still as committed to research as it was in the early days? Will research still drive the core of our advancements in our product development? Yeah,
[01:24:12] Sam Altman: I mean, I think more than ever.
[01:24:15] Sam Altman: The, there was like a time in our history when the right thing to do was just to scale up compute, and we saw that with conviction, and we had a spirit of like, We'll do whatever works, you know, like, we want to, we have this mission, we want to like, build, say, AGI, figure out how to share the benefits. If the answer is like, rack up GPUs, we'll do that.
[01:24:33] Sam Altman: And right now, the answer is, again, really push on research. And I think you see this with O1, like, that is a giant research breakthrough that we were attacking from many vectors over a long period of time that came together in this really powerful way. We have many more giant research breakthroughs to come, but the thing that I think is most special about OpenAI is that we really deeply care about research and we understand how to do it.
[01:25:02] Sam Altman: I think, it's easy to copy something you know works, and you know, I actually don't even mean that as a bad thing, like, when people copy OpenAI, I'm like, great, the world gets more AI? That's wonderful. But, to do something new for the first time, to like, really do research in the true sense of it, which is not like, you know, let's barely get soda out of this thing, or like, let's tweak this.
[01:25:22] Sam Altman: But like, let's go find the new paradigm, and the one after that, and the one after that. That is what motivates us, and I think the thing that is special about us as an org. Besides the fact that we, you know, married product and research and all this other stuff together, is that we know how to run that kind of a culture that can go, that can go push back the frontier, and that's really hard.
[01:25:43] Sam Altman: But we love it and that's, you know, I have to do that a few more times in a week at AGI.
[01:25:49] Speaker 17: Yeah, I'll say like the litmus test for me coming from the outside, from, you know, sort of normal tech companies, of how critical research is to open AI, is that building product in open AI is fundamentally different than any other place that I have ever done it before.
[01:26:05] Speaker 17: You know, normally you have, you have some sense of your tech stack, you have some sense of what you have to work with, and what capabilities computers have, and, and then you're trying to build the best product, right? You're figuring out who your users are, what problems they have, and how you can help solve those problems for them.
[01:26:23] Speaker 17: There is that at OpenAI, but also, the state of, like, what computers can do just evolves every two months, three months, and suddenly computers have a new capability that they've never had in the history of the world. And we're trying to figure out how to build a great product and expose that for developers and our APIs and so on.
[01:26:46] Speaker 17: And then, you know, you can't totally tell what's coming, they're coming through, it's coming through the mist a little bit at you and gradually taking shape. It's fundamentally different than any other company I've ever worked at, and it's, I think, Is that the thing that has
[01:26:58] Sam Altman: most surprised you?
[01:26:59] Speaker 17: Yes. Yeah, and it's interesting how, Even internally we don't always have a sense.
[01:27:06] Speaker 17: You have like, okay, I think this capability is coming, but is it going to be, you know, 90 percent accurate or 99 percent accurate in the next model because the difference really changes what kind of product you can build. And you know that you're gonna get to 99, you don't quite know when, and figuring out how you put a roadmap together in that world is really interesting.
[01:27:26] Sam Altman: Yeah, the degree to which we have to just, like, follow the science, and let that determine what we go work on next, and what products we build, and everything else, is, I think, hard to get across. Like, we have guesses about where things are gonna go. Sometimes we're right, often we're not. But, if something starts working, or if something doesn't work that you thought was gonna work, our willingness to just say, we're gonna like, pivot everything, and do what the science allows, and you don't get to like, pick what the science allows?
[01:27:54] Sam Altman: Yeah. That's surprising.
[01:27:55] Speaker 17: I was sitting with an Enterprise customer a couple weeks ago, and they said, you know, one of the things we really want, this is all working great, we love this, one of the things we really want is a notification 60 days in advance when you're gonna launch something. And I was like, I want that too.
[01:28:14] Speaker 17: Alright, so I'm going through, these are a bunch of questions from the audience, by the way, and we're going to try and also leave some time at the end for people to ask audience questions. So we've got some folks with mics, and when we get there they'll be thinking. But next thing is So many in the alignment community are genuinely concerned that open AI is now only paying lib service to alignment.
[01:28:34] Speaker 17: Can you reassure us?
[01:28:35] Sam Altman: Yeah I think it's true we have a different take on alignment than, like, maybe what people write about on whatever that, like, internet forum is. But we really do care a lot about building safe systems. We have an approach to do it that has been informed by our experience so far.
[01:28:55] Sam Altman: And touch on that other question, which is you don't get to pick where the science goes. Of, we want to figure out how to make capable models that get safer and safer over time. And, you know, a couple of years ago, we didn't think the whole strawberry or the O1 paradigm was gonna work in the way that it's worked.
[01:29:13] Sam Altman: And that brought a whole new set of safety challenges, but also safety opportunities. And, rather than kind of, like, plan to make theoretical ones, You know, superintelligence gets here, here's the like, 17 principles. We have an approach of, figure out where the capabilities are going, and then work to make that system safe.
[01:29:38] Sam Altman: And, O1 is obviously our most capable model ever, but it's also our most aligned model ever, by a lot. And as, as these models get better intelligence, better reasoning, whatever you want to call it, the things that we can do to align them the things we can do to build really safe systems across the entire stack our tool set keeps increasing as well.
[01:30:00] Sam Altman: So,
[01:30:01] Sam Altman: we, we have to build models that are generally accepted as safe and robust to be able to put them in the world. And when we started OpenAI, what the picture of alignment looked like, and what we thought the problems that we needed to solve were going to be, turned out to be nothing like the problems that actually are in front of us and that we had to solve now.
[01:30:20] Sam Altman: And also, when we made the first GPT 3 if you ask me for the techniques that would have worked for us to be able to now deploy. all of current systems as generally expected to be safe and robust. They would not have been the ones that turned out to work. So, by this idea of iterative deployment, which I think has been one of our most important safety stances ever and sort of confronting reality as it sits in front of us, we've made a lot of progress, and we expect to make more, and we keep finding new problems to solve, but we also keep finding new techniques to solve them.
[01:30:54] Sam Altman: All of that said, I
[01:30:56] Sam Altman: I think worrying about the sci fi ways this all goes wrong is also very important. We have people thinking about that. It's a little bit less clear, kind of, what to do there, and sometimes you end up backtracking a lot, but,
[01:31:09] Sam Altman: but I don't think it's I also think it's fair to say we're only gonna work on the thing in front of us. We do have to think about where this is going, and we do that too. And I think if we keep approaching the problem from both ends like that, most of our thrust on the, like, okay, here's the next thing, we're gonna deploy this.
[01:31:22] Sam Altman: What it needs to happen to get there. But also like, what happens if this curve just keeps going? That's been, that's been an effective strategy for us.
[01:31:30] Speaker 17: I'll say also, it's one of the places where I'm really, I really like our philosophy of iterative deployment. When I was at Twitter, back, I don't know, a hundred years ago now Ev said something that stuck with me, which is, So no matter how many smart people you have inside your walls, there are way more smart people outside your walls.
[01:31:48] Speaker 17: And so, when we try and get our, you know, it'd be one thing if we just said we're gonna try and figure out everything that could possibly go wrong within our walls, and it'd be just us and the red teamers that we can hire and so on. And we do that, we work really hard at that. But also, Launching iteratively and launching carefully and learning from the ways that folks like you all use it, what can go right, what can go wrong, I think is a big way that we get these things right.
[01:32:13] Speaker 17: I also think that as we head into this world of
[01:32:18] Sam Altman: agents off doing things in the world, that is going to become really, really important. As these systems get more complex and are acting over longer horizons the pressure testing from the whole outside world, like, really,
[01:32:30] Alex Volkov: really
[01:32:31] Sam Altman: critical.
[01:32:32] Speaker 17: Yeah. So. We'll go, actually, we'll go off of that and maybe talk to us a bit more about how you see agents fitting in with OpenAI's long term plans.
[01:32:40] Speaker 17: What do you think? I think I'm a huge part of the I mean, I think the exciting thing is this This set of models, O1 in particular, and all of its successors, are going to be what makes this possible. Because you finally have the ability to reason, to take hard problems, break them into simpler problems, and act on them.
[01:33:02] Speaker 17: I mean, I think 2025 is going to be the year that's really, that's big. Yeah, I,
[01:33:09] Sam Altman: I mean, chat interfaces are great, and they all, I think, have an important place in the world, but I don't know. The,
[01:33:16] Sam Altman: when you can like ask a model, when you can ask like ChatGT or some agent something, and it's not just like you get a kind of quick response, or even if you get like 15 seconds of thinking, and oh, one gives you like a nice piece of code back or whatever. But you can like really give something a multi term interaction with environments or other people or whatever, like think for the equivalent of multiple days of human effort, and, and like a really smart, really capable human, and like have stuff happen.
[01:33:45] Sam Altman: We all say that, we're all like, oh yeah, this is the next thing, this is coming, this is gonna be another thing, and we just talk about it like, okay, you know, it's like the next model in evolution. I would bet, and we don't really know until we get to use these, that it's We'll of course get used to it quickly, people get used to any new technology quickly, but this will be like a very significant change to the way the world works.
[01:34:07] Sam Altman: in a short period of time.
[01:34:09] Speaker 17: Yeah, it's amazing. Somebody was talking about getting used to new capabilities and AI models and how quickly, actually I think it was about Waymo but they were talking about how in the first ten seconds of using Waymo, they were like, oh my god, is this thing that, like, there's like, let's watch out, and then ten minutes in, they were like, oh, this is really cool.
[01:34:28] Speaker 17: And then twenty minutes in, they were like, checking their phone for, you know, it's amazing how much your, your sort of internal firmware updates. For this new stuff, right? Yeah, like,
[01:34:39] Sam Altman: I think that people will ask an agent to do something for them that would have taken them a month, and they'll finish in an hour, and it'll be great, and then they'll have like ten of those at the same time, and then they'll have like a thousand of those at the same time, and by 2030 or whatever, we'll look back and be like, yeah, this is just like what a human is supposed to be capable of, what a human used to like, you know, grind at for years or whatever, many humans used to grind at for years.
[01:35:07] Sam Altman: I just now I can ask a computer to do it and it's like done in an hour. That's, why is it not a minute? Yeah,
[01:35:16] Speaker 17: it's also, it's one of the things that makes having an amazing development platform great too because, you know, we'll experiment and we'll build some agentic things of course and like we've already got, I think just like, we're just pushing the boundaries of what's possible today you've got groups like cognition doing amazing things and coding Like Harvey and case text, you guys speak doing cool things with language translation.
[01:35:39] Speaker 17: Like, we're beginning to see this stuff work, and I think it's really gonna start working as we,
[01:35:44] Sam Altman: as we continue to iterate these models. One of the very fun things for us about having this development platform is just getting to, like, watch the unbelievable speed and creativity of people that are building these experiences.
[01:35:56] Sam Altman: Like, developers, very near and dear to our heart it's kind of like the first thing we watched. And it's brilliant. Many of us came building on platforms, but the, so much of the capability of these models and great experiences have been built by people building on the platform. We'll continue to try to offer, like, great first party products, but we know that will only ever be, like, a small, narrow slice of the apps or agents or whatever people build in the world, and seeing what has happened in the world in the last, you know, 18 24 months.
[01:36:30] Sam Altman: It's been like quite amazing to watch.
[01:36:33] Speaker 17: We'll keep going on the agent front here. What do you see as the current hurdles for computer
[01:36:39] Sam Altman: controlling agents? Safety and alignment. Like, if you are really going to give an agent the ability to start clicking around your computer which you will. You are going to have a very high bar for The robustness and the reliability and the alignment of that system.
[01:36:58] Sam Altman: So technically speaking, I think that, you know, we're getting, like, pretty close to the capability side. But the sort of agent safety and trust framework, that's gonna, I think, be the long haul.
[01:37:11] Speaker 17: And now I'll kind of ask a question that's almost the opposite of one of the questions from earlier. Do you think safety could act as a false positive and actually limit public access to critical tools that would enable a more egalitarian world?
[01:37:23] Sam Altman: The honest answer is yes, that will happen sometimes. Like, we'll try to get the balance right. But if we were fully alone and didn't care about, like, safety and alignment at all, could we have launched O1 faster? Yeah, we could have done that. It would have come at a cost. There would have been things that would have gone really wrong.
[01:37:40] Sam Altman: I'm very proud that we didn't. The cost, you know, I think would have been manageable with O1, but by the time of O3 or whatever, like, immediately. Pretty unacceptable. And so, starting on the conservative side, like, you know, I don't think people are complaining, like, oh, voice mode, like, it won't say this offensive thing, and I really want it to, and, you know, formal comedy, and let it offend me.
[01:38:03] Sam Altman: You know what? I actually mostly agree. If you are trying to get O1 to say something offensive, it should follow the instructions of its user most of the time. There's plenty of cases where it shouldn't. But, we have, like, a long history of when we put a new technology in. We change the world, we start on the conservative side.
[01:38:20] Sam Altman: We try to give society time to adapt, we try to understand where the real harms are versus sort of like, kind of more theoretical ones. And that's like, part of our approach to safety. And, not everyone likes it all the time, I don't even like it all the time. But, but if we're right that these systems are, and we're gonna get it wrong too, like sometimes we won't be conservative enough in some area.
[01:38:42] Sam Altman: But if we're right that these systems are going to get as powerful as we think they are. as quickly as we think they might, then I think starting that way makes sense. And, you know, we like to relax over time. Totally agree. What's
[01:38:57] Speaker 17: the next big challenge for a startup that's using AI as a core feature?
[01:39:01] Speaker 17: I'll say it. You first. I've got it. I've got one, which is, I think one of the challenges, and we face this too, because we're also building products on top of our own models, is trying to find the, kind of the frontier. You want to be building, these AI models are evolving so rapidly, and if you're building for something that the AI model does well today, it'll work well today, but it's going to feel, it's going to feel old tomorrow.
[01:39:28] Speaker 17: And so you want to build for, for things that the AI model can just barely not do. You know, where maybe the early adopters will go for it and other people won't quite, but that just means that when the next model comes out, as we continue to make improvements, that use case that just barely didn't work, you're gonna be, you're gonna be the first to do it, and it's gonna be amazing.
[01:39:47] Speaker 17: But figuring out that boundary is really hard. I think it's where the best products are gonna get built up.
[01:39:53] Speaker 17: Totally agree with that. The other
[01:39:54] Sam Altman: thing I'm gonna add is, I think it's like, very tempting to think that a technology makes a startup. And that is almost never true. No matter how cool a new technology or a new sort of like, tech title is, it doesn't excuse you from having to do all the hard work of building a great company that is going to have durability or like, accumulated advantage over time.
[01:40:18] Sam Altman: And, we hear from a lot of startups that ORC is just like a very common thing, which is like, I can do this incredible thing, I can make this incredible service And that seems like a complete answer, but it doesn't excuse you from any of, like, the normal laws of business. You still have to, like, build a good business and a good strategic position.
[01:40:35] Sam Altman: And I think a mistake is that in the unbelievable excitement and updraft of AI, people are very tempted to forget that.
[01:40:45] Speaker 17: This is a, this is an interesting one. The mode of voice is like tapping directly into the human API. How do you ensure ethical use of such a powerful tool with obvious abilities and manipulation?
[01:40:59] Speaker 17: Yeah, you
[01:41:00] Sam Altman: know, voice mode was a really interesting one for me. It was like the first time that I felt like I sort of had gotten like really tricked by an AI, in that when I was playing with the first beta of it, I couldn't like, I couldn't stop myself. I mean, I kind of, like I still say like, please switch out GBT.
[01:41:21] Sam Altman: But in voice code, I like, couldn't not kind of use the normal ICDs. I was like so convinced, like, ah, it might be a real per like, you know? And obviously it's just like hacking some circuit in my brain, but I really felt it with voice code. And I sort of still do The, I think this is a more, this is an example of like a more general thing that we're going to start facing, which is, as these systems become more and more capable, and as we try to make them as natural as possible to interact with they're gonna like, hit parts of our neural circuitry that would like evolve to deal with other people.
[01:42:01] Sam Altman: And You know, there's like a bunch of clear lines about things we don't want to do, like, we don't. Like, there's a whole bunch of like weird personality growth hacking, like, I think vaguely socially manipulative stuff we could do. But then there's these like other things that are just not nearly as clear cut.
[01:42:19] Sam Altman: Like, you want the voice mode to feel as natural as possible, but then you get across the uncanny valley, and it like, at least in me, triggers something. And and, you know, me saying, like, please and thank you to chat. gt, no problem. Probably the thing to do. You never know. But, but I think this like really points at the kinds of safety and alignment issues we have to start analyzing.
[01:42:43] Speaker 17: Alright, back to brass tacks. Sam, when's O1 going to support function tools? Do you know? Before the end of the year. There are three things that we really want to get in for
[01:42:53] Speaker 17: We're gonna record this, take this back to the research team, show them how badly we need to do this. There, I mean, there are a handful of things that we really wanted to get into O1, and we also, you know, it's a balance of should we get this out to the world earlier and begin, you know, learning from it, learning from how you all use it, or should we launch a fully complete thing that is, you know, in line with it, that has all the abilities that every other model that we've launched has.
[01:43:18] Speaker 17: I'm really excited to see things like system properties. and structured outputs and function calling make it into O1, we will be there by the end of the year. It really matters to us too.
[01:43:32] Sam Altman: In addition to that, just because I can't resist the opportunity to reinforce this, like, we will get all of those things in and a whole bunch more things you'll have asked for.
[01:43:39] Sam Altman: The model is going to get so much better so fast. Like, we are so early, this is like, you know, maybe it's the GPT 2 scale moment, but like, we know how to get to GPT 4, we have the fundamental stuff in place now to 4. And, in addition to planning for us to build all of those things, Plan for the model to just get, like, rapidly smarter, like, you know, hope you all come back next year and plan for it to feel like way more of a year of improvement than from 4.
[01:44:10] Sam Altman: 0. 1.
[01:44:13] Speaker 17: What feature or capability of a competitor do you really admire? I
[01:44:17] Sam Altman: think Google's notebook thing is super cool. What are they called? Notebook LL. Notebook LL, yeah. I was like, I woke up early this morning and I was like looking at examples on Twitter and I was just like, this is like, this is just cool.
[01:44:28] Sam Altman: This is just a good, cool thing. And, like, I think not enough of, not enough of the world is like shipping new and different things, it's mostly like the same stuff. But that I think is like, that brought me a lot of joy this morning.
[01:44:43] Speaker 17: Yeah. It was very, very well done. One of the things I really appreciate about that product is the, there's the, the, just the format itself is really interesting, but they also nailed the podcast style voices.
[01:44:55] Speaker 17: They have really nice microphones. They have these sort of sonorant voices. As you guys see, somebody on Twitter was saying like, the cool thing to do is take your LinkedIn and put it, you know, gimme a hit, and give it to these give it to notebook. lm and you'll have two podcasters riffing back and forth about how amazing you are and all of your accomplishments over the years.
[01:45:19] Speaker 17: I'll say mine is I think Anthropic did a really good job. On projects it's kind of a, a different take on what we did with GBTs and GBTs are a little bit more long lived. It's something you build and can use over and over again. Projects are kind of the same idea, but like more temporary, meant to be kind of stood up, used for a while, and then you can move on.
[01:45:41] Speaker 17: And that, that the different mental model makes a difference. And I think they did a really nice job with that.
[01:45:47] Speaker 17: Alright, we're getting close to audience questions, so be thinking of what you want to ask. So in OpenAI, how do you balance what you think users may need? Versus what they actually need today.
[01:45:59] Sam Altman: Also a better question for you.
[01:46:00] Speaker 17: Yeah, well, I think it does get back to a bit of what we were saying around trying to, trying to build for what the model can just, like, not quite do, but almost do.
[01:46:09] Speaker 17: But it's a real balance, too, as we, as we, you know, we support over 200 million people every week on ChatGPT. You also can't say, Now it's cool, like, deal with this bug for three months, or this issue we've got something really cool coming. You've gotta solve for the needs of today. And there are some really interesting product problems.
[01:46:29] Speaker 17: I mean, you think about, I'm speaking to a group of people who know AI really well. Think of all the people in the world who have never used any of these products. And that is the vast majority of the world still. You're basically giving them a text interface, and on the other side of the text interface is this like alien intelligence that's constantly evolving that they've never seen or interacted with, and you're trying to teach them all the crazy things that you can actually do it, all the ways it can help, can integrate into your life, can solve problems for you.
[01:47:01] Speaker 17: And people don't know what to do with it. You know, like, you come in and you're just like, people type like, Hi. And in response, you know, hey! Great to see you, like, how can I help you today? And then, you're like, okay, I don't know what to say. And then you end up, you kind of walk away, and you're like, well, I didn't see the magic in that.
[01:47:19] Speaker 17: And so it's a real challenge, figuring out how You, I mean, we all have a hundred different ways that we use chat GPT and AI tools in general, but teaching people what those can be, and then bringing them along as the model changes month by month by month, and suddenly gains these capabilities way faster than we as humans gain the capabilities, it's, it's a really interesting set of problems, and I'm I know it's one that you all solve in, in different ways as well.
[01:47:47] Speaker 17: I,
[01:47:47] Sam Altman: I
[01:47:47] Speaker 17: have
[01:47:47] Sam Altman: a question. Who feels like they, they spend a lot of time with O1, and they would say like, I feel definitively smarter than that thing?
[01:47:58] Sam Altman: Do you think you still go by O2? No one, no one taking the bet of like being smarter than O2. So, One of the challenges that we face is, like, we know how to go do this thing that we think will be, like, at least probably smarter than all of us in, like, a broad array of tasks. And yet we have to, like, still like fixed bugs and do the, hey, how are you problem.
[01:48:25] Sam Altman: And mostly what we believe in is that if we keep pushing on model intelligence people will do incredible things with that. You know, we want to build the smartest, most helpful models in the world, and And find all sorts of ways to use that and build on top of that. It has been definitely an evolution for us, to not just be entirely research focused, and we do have to fix all those bugs and make this super usable and I think we've gotten better at balancing that.
[01:48:54] Sam Altman: But still, as part of our culture, I think, we trust that if we can keep pushing on intelligence, 6. 0. 4 if you run down here it'll, people will build this incredible thing. Yeah,
[01:49:09] Speaker 17: I think it's a core part of the philosophy, and you do a good job of pushing us to always, well, basically incorporate the frontier of intelligence into our products, both in the APIs and into our first party products.
[01:49:22] Speaker 17: Because it's, it's easy to kind of stick to the thing you know, the thing that works well, but you're always pushing us to like, get the frontier in, even if it only kind of works, because it's going to work really well soon. So I always find that a really helpful piece of advice. You kind of answered the next one.
[01:49:38] Speaker 17: You do say, please and thank you to the models. I'm curious how many people say Please and thank you. Isn't that so interesting? I do too. . I kind of can't. I feel bad if I don't. And,
[01:49:50] Speaker 17: okay, last question and then we'll go into audience questions for the last 10 or so minutes. Do you plan to build models specifically made for ag agent use cases, things that are better at reasoning and tool calling.
[01:50:02] Sam Altman: Specific, we plan to make models that are great at agentive use cases, that'll be a key priority for us over the coming months.
[01:50:08] Sam Altman: Specifically is a hard thing to ask for, because I think it's also just how we keep making smarter models. So yes, there's like some things like tool use, function calling that we need to build in that'll help, but mostly we just want to make the best reasoning models in the world. Those will also be the best agentive based models in the world.
[01:50:25] Sam Altman: Cool, let's
[01:50:25] Speaker 17: go to audience questions.
[01:50:27] Unkown: How extensively do you dogfood your own technology in your company? Do you have any interesting examples that may not be obvious?
[01:50:37] Sam Altman: Yeah I mean we put models up for internal use even before they're done training. We use checkpoints and try to have people use them for whatever they can, and try to sort of like build new ways to explore the capability of the model internally, and use them for our own development.
[01:50:52] Sam Altman: Element or research or whatever else, as much as we can, we're still always surprised by the creativity of the outside world and what people do. But basically the way we have figured out every step along our way of how to, what to push on next, what we can productize, what, what, what, like, what the models are really good at is by internal dog food.
[01:51:13] Sam Altman: That's like our whole, that's how we like, feel our way through this.
[01:51:17] Sam Altman: We don't yet have like. Employees that are based off of O1, but, I, you know, as we like move into the world of agents, we will try that. Like, we'll try having like, you know, things that we deploy in our internal systems that help you with stuff. There are things that get
[01:51:31] Speaker 17: closer to that, I mean, they're like, customer service, we have bots internally, that do a ton about answering external questions and fielding internal people's questions on Slack and so on.
[01:51:43] Speaker 17: And our customer service team is probably I don't know, 20 percent the size it might otherwise need to be because of it. I know Matt Knight and our security team has talked extensively about all the different ways we use models internally for, to automate a bunch of security things and, you know, take what used to be a manual process where you might not have The number of humans to even, like, look at everything incoming, and have models taking, you know, separating signal from noise, and highlighting to humans what they need to go look at, things like that.
[01:52:13] Speaker 17: So, I think internally there are tons of examples, and people maybe underestimate the You all probably will not be surprised by this, but a lot of folks that I talk to are. The extent to which it's not just using a model in a place, it's actually about using, like chains of models that are good at doing different things and connecting them all together to get one end to end process that is very good at the thing you're doing, even if the individual models have You know, flaws and make mistakes.
[01:52:46] Unknown: Thank you. I'm wondering if you guys have any plans on sharing models for like offline usage? Because with this distillation thing, it's really cool that we can share our own models, but a lot of use cases you really want kind of like have a version of it.
[01:53:02] Sam Altman: We're open to it. It's not on, it's not like high priority on the current roadmap. The, if we had, like, more resources and bandwidth, we would go to that. I think there's a lot of reasons you want a local model. But it's not like, it's not like a this year kind of thing.
[01:53:21] Unknown: Hi. My question is, there are many agencies in the government, above the local, state, and national level, that could really greatly benefit from the tools that you guys are developing, but I have perhaps some hesitancy on deploying them because of, you know, security concerns, data concerns, privacy concerns.
[01:53:38] Unknown: And, I guess, I'm curious to know if there are any sort of, you know, planned partnerships with governments, rural governments, once whatever AGI is achieved. Because obviously AGI can help. Solve problems like, you know, world hunger, poverty, climate change. Government's gonna have to get involved with that, right?
[01:53:57] Unknown: And I'm just curious to know if there is some you know, plan that works when, and if that time comes.
[01:54:04] Speaker 17: Yeah, I think, I actually think you don't want to wait until AGI. You want to start now, right? Because there's a learning process, and there's a lot of good that we can do with our current models. So we We've announced a handful of partnerships with government agencies, some states, I think Minnesota, and some others, Pennsylvania, Also with organizations like USAID.
[01:54:22] Speaker 17: It's actually a huge priority of ours to be able to help governments around the world get acclimated, get benefit from the technology, And of all places, government feels like somewhere where you can automate a bunch of workflows and make things more efficient, reduce drudgery, and so on. So I think there's a huge amount of good we can do now.
[01:54:40] Speaker 17: And if we do that now It just accrues over the long run as the models get better and we get closer to AGI. I've got
[01:54:49] Vibhu Sapra: pretty open ended question. What are your thoughts on open source? So, whether that's open weights, just general discussion, where do you guys sit with open source?
[01:55:01] Sam Altman: I think open source is awesome. Again, if we had more bandwidth, we would do that too. We've, like, gotten very close to making a big open source effort a few times.
[01:55:09] Sam Altman: And then, you know, the really hard part is prioritization. And we have put other things ahead of it. Part of it is, like, there's such good open source models in the world now that I think that segment The thing we always end in motion A really great on device model. And I think that segment is fairly well served.
[01:55:28] Sam Altman: I do hope we do something at some point, but we want to find something that we feel like, if we don't do it, then we'll just be the same as them and not make, like, another thing that's, like, a tiny bit better on benchmarks. Because we think there's, like, a lot of potential. A lot of good stuff out there now.
[01:55:41] Sam Altman: But, but like, spiritually, philosophically, I'm very glad it exists. I would
[01:55:46] Alex Volkov: like to
[01:55:47] Sam Altman: contribute.
[01:55:50] Alex Volkov: Hi Shane. Hi Kevin. Thanks for inviting us. Good dev day. It's been awesome. All the live demos work. It's incredible. Why can't advanced voice mode sing? And as a follow up to this, if it's a company, like, legal issue in terms of corporate, et cetera, Is there a daylight between how you think about safety in terms of your own products, on your own platform, Versus giving us developers kind of the I don't know, sign the right things off so we can, we can make our voice not sing.
[01:56:15] Alex Volkov: Could you answer the question?
[01:56:19] Speaker 17: Oh, you know the funny thing is Sam asked the same question. Why can't this thing sing? I want it to sing. I've seen it sing before. It's, actually, it's there are things, obviously, that we can't have it sing, right? We can't have it sing copyrighted songs, we don't have the licenses, etc.
[01:56:35] Speaker 17: And then there are things that it can't sing, and you can have it sing Happy Birthday, and that would be just fine, right? And we want that too. It's a matter of, I think, once you, it, basically, it's easier in finite time to Say no, and then build it in, but it's nuanced to get it right, and we, you know, There are penalties to getting these kinds of things wrong.
[01:56:55] Speaker 17: So it's really just where we are now. We really want the models to sync too.
[01:57:03] Sam Altman: We waited for us to ship voice mode, which is like, very fair. We could've like, waited longer and kind of really got the classifications and filters on, you know, congregated music versus not, but we decided we'd just ship it and we'll have more. But I think Sam has asked me like, four or five times why we didn't have
[01:57:19] Speaker 17: voice
[01:57:20] Sam Altman: feature.
[01:57:21] Sam Altman: I mean, we still can't like, offer something where we're gonna be in like, pretty badly. You know, hot water developers or first party or whatever. Yes, we can, like, maybe have some differences, but we like, comply with the law.
[01:57:36] Unknown: Could you speak a little to the future of where you see context windows going? And kind of the timeline for when, how you see things balance between context window growth and RAG, basically, information retrieval.
[01:57:49] Sam Altman: I think there's, like, two different Takes on that the better. One is like, when is it going to get to like, kind of normal long context?
[01:57:56] Sam Altman: Like, context length 10 million or whatever, like long enough that you just throw stuff in there, and it's fast enough you're happy about it. And I expect everybody's going to make pretty fast progress there, and that'll just be a thing. Long context has gotten weirdly less usage than I would have expected so far.
[01:58:11] Sam Altman: But I think, you know, there's a bunch of reasons for that, I don't want to go too much into it. And then there's this other question of, like, when do we get to context length? Not like 10 million, but 10 trillion. Like, when do we get to the point where you throw, like, every piece of data you've ever seen in your entire life in there?
[01:58:26] Sam Altman: And you know, like, that's a whole different set of things. That obviously takes some research breakthroughs. But I assume that infinite context will happen at some point. And some point is, like, less than a decade. And that's going to be just a totally different way that we use these models. Even getting to the, like, 10 million tokens of very fast and accurate context, which I expect to measure in, like, months, something like that.
[01:58:52] Sam Altman: You know, like, people will use that in all sorts of ways. And it'll be great. But yeah, the very, very long context, I think, is gonna happen, and it's really interesting. I think we maybe have time for one or two
[01:59:08] Speaker 17: more.
[01:59:10] Alex Volkov: Don't worry, this is gonna be your favorite question. So, with voice, and all the other changes that users have experienced since you all have launched your technology, what do you see is the vision?
[01:59:25] Alex Volkov: for the new engagement layer, the form factor, and how we actually engage with this technology to make our lives so much better.
[01:59:34] Speaker 17: I love that question. It's one that we ask ourselves a lot, frankly. There's this, and I think it's one where developers can play a really big part here because there's this trade off between generality and specificity here.
[01:59:47] Speaker 17: I'll give you an example. I was in Seoul and, and Tokyo. A few weeks ago, and I was in a number of conversations with folks that, with whom I didn't have a common language, and we didn't have a translator around. Before, we would not have been able to have a conversation. We would have just sort of smiled at each other and continued on.
[02:00:05] Speaker 17: I took out my phone, I said, JGPT, I want you to be Translator for me, when I speak in English, I want you to speak in Korean, you hear Korean, and I want you to repeat it in English. And I was able to have a full business conversation, and it was amazing. You think about the impact that could have, not just for business, but think about travel and tourism and people's willingness to go places where they might not have a word of the language.
[02:00:28] Speaker 17: You can have these really amazing impacts, but inside ChetGBT, that was still a thing that I had to, like, ChetGBT is not optimized for that, right? Like, you want this sort of digital, you know, universal translator in your pocket that just knows that what you want to do is translate. Not that hard to build.
[02:00:47] Speaker 17: But I think there's, we struggle with the, with trying to build an application that can do lots of things for lots of people. And it keeps up, like we've been talking about a few times, it keeps up with the pace of change and with the capabilities, you know, agentive capabilities and so on. I think there's also a huge opportunity for the creativity of an audience like this to come in and like, Solve problems that we're not thinking of, that we don't have the expertise to do, And ultimately the world is a much better place if we get more AI to more people, And it's why we are so proud to serve all of you.
[02:01:23] Sam Altman: The only thing I would add is, if you just think about everything that's gonna come together, At some point, in not that many years in the future, you'll walk up to a piece of glass, You will say whatever you want they will have like, There'll be incredible reasoning models, agents connected to everything, there'll be a video model Streaming back to you like a custom interface just for you.
[02:01:40] Sam Altman: This is one request. Whatever you need, it's just gonna get, like, rendered in real time, and you'll be able to interact with it, you'll be able to, like, click through the stream, or say different things, and it'll be off doing, like, again, the kinds of things that used to take, like, humans years to figure out.
[02:01:54] Sam Altman: And, it'll just You know, dynamically render whatever you need, and it'll be a completely different way of using a computer. And also getting things to happen in the world. That, it's gonna be quite a while.
[02:02:07] Speaker 17: Awesome. Thank you. That was a great question to end on. I think we're out of time. Thank you so much for coming.
[02:02:12] Speaker 17: Applause
[02:02:23] AI Charlie: That's all for our coverage of Dev Day 2024. We want to extend an extra special note of gratitude to Lindsay McCallum of the OpenAI Comms team, who helped us set up so many interviews at very short notice, and physically helped ensure the smooth continuity of the video recordings. We couldn't do this without you, Lindsay.
[02:02:44] AI Charlie: If you have any feedback on the launches or for our guests, hop on over to our YouTube or Substack comments section and say hi. We're especially interested in your personal feedback and demos built with the new things launched this week. Feel the AGI.
[02:03:07] Notebook LM Recap of Podcast
[02:03:07] NotebookLM 2: Alright, so you wanted to know more about OpenAI's Dev Day and what stood out to us. We're diving into all the developer interviews and discussions and there's a lot to unpack.
[02:03:16] NotebookLM: Yeah, it's interesting. OpenAI seems to be, like, transitioning, moving beyond just building these impressive AI models. One expert even called them, get this, the AWS of AI.
[02:03:26] NotebookLM 2: EWS of AI.
[02:03:28] NotebookLM: Yeah.
[02:03:28] NotebookLM 2: Okay, so what does that even mean when we talk about AI?
[02:03:31] NotebookLM: So it means, instead of just offering this raw power, they're building a whole ecosystem. The tools to fine tune those models. Distillation, you know, for efficiency. And a bunch of new evaluation tools. Oh, and a huge emphasis on real time capabilities.
[02:03:46] NotebookLM: You
[02:03:46] NotebookLM 2: know, instead of just giving us the ingredients, it's like they're providing the whole kitchen.
[02:03:49] NotebookLM: Exactly. They're laying the groundwork for, well, they envision a future where you can build almost anything with AI.
[02:03:56] NotebookLM 2: I see. And one of the tools that really caught my eye was this function calling. They used it in that travel agent demo, remember?
[02:04:04] NotebookLM 2: How does that even work?
[02:04:05] NotebookLM: So function calling, it's like giving the AI access to external tools and information. Imagine, instead of just having all this pre programmed knowledge, you can like, search the web for you, book flights, even order a pizza.
[02:04:17] NotebookLM 2: So instead of a static encyclopedia, it's like giving the AI a smartphone with internet.
[02:04:21] NotebookLM: Yeah, precisely. Yeah. And this ties into their focus on real time interaction, right? They see a future where AI can respond instantly, just like a human would.
[02:04:31] NotebookLM 2: Which would be a game changer.
[02:04:32] NotebookLM: Right! It's like, imagine voice assistants that actually understand you. Or, even seamless real time translation.
[02:04:39] NotebookLM 2: No more language barriers.
[02:04:40] NotebookLM: Exactly. That's just the tip of the iceberg, though. They really believe this real time capability is key to making AR truly mainstream.
[02:04:48] NotebookLM 2: Okay, so OpenAI is building this AI platform, emphasizing real time interactions. How does this translate into, like, actual results?
[02:04:56] NotebookLM: Yeah.
[02:04:56] NotebookLM 2: You know, real world stuff.
[02:04:58] NotebookLM: Well, that's where things get really interesting.
[02:04:59] NotebookLM: Let's talk about the O1 model and how developers are using it to, like, really push the boundaries of what's possible.
[02:05:06] NotebookLM 2: So this O1 model, everyone's talking about it. One developer even said they built an entire iPhone app just by describing it as O1. Is that just hype?
[02:05:16] NotebookLM: I think there's definitely some substance behind all the hype.
[02:05:19] NotebookLM: What's so fascinating about O1, it's not just about the code it generates, it's how it seems to understand, like, the logic. The
[02:05:24] Alex Volkov: logic.
[02:05:25] NotebookLM: Yeah. Like, this developer They didn't give O1 lines of code, they described the idea of the app. And O1, it actually designed the architecture, connected everything, the developer just took that code, put it right into Xcode, and it worked.
[02:05:37] NotebookLM 2: Wow, so it's not just writing code, it's understanding the intent.
[02:05:40] NotebookLM: Yeah, exactly. And this actually challenges how we measure these models, you know, even OpenAI admitted that these benchmarks, like what was it? Swebench.
[02:05:49] NotebookLM 2: Swebench.
[02:05:51] NotebookLM: Right, which looks at code accuracy. It doesn't always reflect how things work in the real world.
[02:05:55] NotebookLM 2: Right, because in the real world, you don't just need code that compiles. It has to be, like, efficient, maintainable.
[02:06:01] NotebookLM: Exactly. It all has to work together, and OpenAI is really working on this with developers. They're finding that UI development, especially in things like React, it needs better evaluation.
[02:06:11] NotebookLM: It's one thing to code a button that works, and another to make it actually look good, you know, and be intuitive.
[02:06:16] NotebookLM 2: Right, and it seems like this need for real world context, It goes beyond just, like, evaluating those models. There was a developer working with this code generating AI genie, I think it was called.
[02:06:27] NotebookLM: Genie, yeah.
[02:06:28] NotebookLM 2: And it's more focused on those specific coding tasks, but they found that its performance really changed between different programming languages, like JavaScript versus C Sharp, for example.
[02:06:39] NotebookLM: And that just highlights how important the data is, right? Just like us, AI needs that variety to learn.
[02:06:45] NotebookLM: If you train it on just one type of code, it'll be great at that. But anything new and It'll fall flat. Yeah. So it's about making sure these models have a broad diet of data to learn from. That way they're more adaptable and ready for whatever we throw at them.
[02:06:59] NotebookLM 2: So we've got AI that can build apps, understand what we want, even write different kinds of code.
[02:07:04] NotebookLM 2: It's a lot, and it feels like things are changing so fast. How can developers even keep up, let alone, like, build something successful with AI?
[02:07:11] NotebookLM: Right. That's the question, isn't it? But it's interesting, you know, both OpenAI and the developers building with these tools, they kind of agree on one thing. You got to aim for what's just out of reach.
[02:07:22] NotebookLM 2: So don't wait for the tech to catch up to your Like, wildest dreams. Focus on what's almost possible right now.
[02:07:29] NotebookLM: Yeah. Build for where things are going, not where they are today. You wait for that perfect AI, you might miss the boat on shaping how it develops, and being the first one out there doing something new.
[02:07:39] NotebookLM 2: Riding the wave, not chasing after it.
[02:07:41] NotebookLM: Exactly. But, and OpenAI really emphasized this too, Even with all this amazing AI, you can't forget the basics of building a business.
[02:07:50] NotebookLM 2: So just because it's got AI doesn't mean it's automatically going to be a success. Right.
[02:07:54] NotebookLM: You need a good strategy, know who you're selling to, and it's got to actually solve a real problem.
[02:07:59] NotebookLM: AI is a tool, not a magic wand.
[02:08:01] NotebookLM 2: Like, having the best oven in the world won't help if you don't know how to cook.
[02:08:05] NotebookLM: Perfect analogy. And then there's this other thing OpenAI talked about that's really interesting. Balancing safety with access for everyone.
[02:08:14] NotebookLM 2: So making sure these AI tools are used responsibly, but also making them available to everyone who could benefit.
[02:08:21] NotebookLM: Yeah, they're really aware that focusing on safety, while important, could limit access to some really powerful stuff. It's a tough balance.
[02:08:30] NotebookLM 2: It's like that debate around, you know, life saving medications. How do you make sure they're used correctly, but also make sure people who need them can actually get them?
[02:08:38] NotebookLM: It's complicated, no easy answers. But it's something they're thinking hard about.
[02:08:42] NotebookLM 2: Well, it's clear that all this AI stuff, especially with these new models like O1, is changing how we think about tech, how we use it.
[02:08:49] NotebookLM: Imagine walking up to a screen, and it just creates a personalized experience for you, right there, adapts to what you need.
[02:08:57] NotebookLM: That's the potential.
[02:08:57] NotebookLM 2: Like having a personal assistant in every device.
[02:09:00] NotebookLM: It's exciting, but we got to be thoughtful about it, build responsibly.
[02:09:03] NotebookLM 2: So there you have it. OpenAI isn't just building these cool AI models, they're building a whole world around them and it's changing everything. It's going to be a wild ride, that's for sure.
[02:09:12] NotebookLM 2: And we're just at the beginning.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
Language Agents: From Reasoning to Acting
27 sep· Latent Space: The AI Engineer Podcast
OpenAI DevDay is almost here! Per tradition, we are hosting a DevDay pregame event for everyone coming to town! Join us with demos and gossip!
Also sign up for related events across San Francisco: the AI DevTools Night, the xAI open house, the Replicate art show, the DevDay Watch Party (for non-attendees), Hack Night with OpenAI at Cloudflare. For everyone else, join the Latent Space Discord for our online watch party and find fellow AI Engineers in your city.
OpenAI’s recent o1 release (and Reflection 70b debacle) has reignited broad interest in agentic general reasoning and tree search methods.
While we have covered some of the self-taught reasoning literature on the Latent Space Paper Club, it is notable that the Eric Zelikman ended up at xAI, whereas OpenAI’s hiring of Noam Brown and now Shunyu suggests more interest in tool-using chain of thought/tree of thought/generator-verifier architectures for Level 3 Agents.
We were more than delighted to learn that Shunyu is a fellow Latent Space enjoyer, and invited him back (after his first appearance on our NeurIPS 2023 pod) for a look through his academic career with Harrison Chase (one year after his first LS show).
ReAct: Synergizing Reasoning and Acting in Language Models
paper link
Following seminal Chain of Thought papers from Wei et al and Kojima et al, and reflecting on lessons from building the WebShop human ecommerce trajectory benchmark, Shunyu’s first big hit, the ReAct paper showed that using LLMs to “generate both reasoning traces and task-specific actions in an interleaved manner” achieved remarkably greater performance (less hallucination/error propagation, higher ALFWorld/WebShop benchmark success) than CoT alone.
In even better news, ReAct scales fabulously with finetuning:
As a member of the elite Princeton NLP group, Shunyu was also a coauthor of the Reflexion paper, which we discuss in this pod.
Tree of Thoughts
paper link here
Shunyu’s next major improvement on the CoT literature was Tree of Thoughts:
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role…
ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices.
The beauty of ToT is it doesnt require pretraining with exotic methods like backspace tokens or other MCTS architectures. You can listen to Shunyu explain ToT in his own words on our NeurIPS pod, but also the ineffable Yannic Kilcher:
Other Work
We don’t have the space to summarize the rest of Shunyu’s work, you can listen to our pod with him now, and recommend the CoALA paper and his initial hit webinar with Harrison, today’s guest cohost:
as well as Shunyu’s PhD Defense Lecture:
as well as Shunyu’s latest lecture covering a Brief History of LLM Agents:
As usual, we are live on YouTube!
Show Notes
* Harrison Chase
* LangChain, LangSmith, LangGraph
* Shunyu Yao
* Alec Radford
* ReAct Paper
* Hotpot QA
* Tau Bench
* WebShop
* SWE-Agent
* SWE-Bench
* Trees of Thought
* CoALA Paper
* Related Episodes
* Our Thomas Scialom (Meta) episode
* Shunyu on our NeurIPS 2023 Best Papers episode
* Harrison on our LangChain episode
* Mentions
* Sierra
* Voyager
* Jason Wei
* Tavily
* SERP API
* Exa
Timestamps
* [00:00:00] Opening Song by Suno
* [00:03:00] Introductions
* [00:06:16] The ReAct paper
* [00:12:09] Early applications of ReAct in LangChain
* [00:17:15] Discussion of the Reflection paper
* [00:22:35] Tree of Thoughts paper and search algorithms in language models
* [00:27:21] SWE-Agent and SWE-Bench for coding benchmarks
* [00:39:21] CoALA: Cognitive Architectures for Language Agents
* [00:45:24] Agent-Computer Interfaces (ACI) and tool design for agents
* [00:49:24] Designing frameworks for agents vs humans
* [00:53:52] UX design for AI applications and agents
* [00:59:53] Data and model improvements for agent capabilities
* [01:19:10] TauBench
* [01:23:09] Promising areas for AI
Transcript
Alessio [00:00:01]: Hey, everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO of Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Small AI.
Swyx [00:00:12]: Hey, and today we have a super special episode. I actually always wanted to take like a selfie and go like, you know, POV, you're about to revolutionize the world of agents because we have two of the most awesome hiring agents in the house. So first, we're going to welcome back Harrison Chase. Welcome. Excited to be here. What's new with you recently in sort of like the 10, 20 second recap?
Harrison [00:00:34]: Linkchain, Linksmith, Lingraph, pushing on all of them. Lots of cool stuff related to a lot of the stuff that we're going to talk about today, probably.
Swyx [00:00:42]: Yeah.
Alessio [00:00:43]: We'll mention it in there. And the Celtics won the title.
Swyx [00:00:45]: And the Celtics won the title. You got that going on for you. I don't know. Is that like floorball? Handball? Baseball? Basketball.
Alessio [00:00:52]: Basketball, basketball.
Harrison [00:00:53]: Patriots aren't looking good though, so that's...
Swyx [00:00:56]: And then Xun Yu, you've also been on the pod, but only in like a sort of oral paper presentation capacity. But welcome officially to the LinkedSpace pod.
Shunyu [00:01:03]: Yeah, I've been a huge fan. So thanks for the invitation. Thanks.
Swyx [00:01:07]: Well, it's an honor to have you on. You're one of like, you're maybe the first PhD thesis defense I've ever watched in like this AI world, because most people just publish single papers, but every paper of yours is a banger. So congrats.
Shunyu [00:01:22]: Thanks.
Swyx [00:01:24]: Yeah, maybe we'll just kick it off with, you know, what was your journey into using language models for agents? I like that your thesis advisor, I didn't catch his name, but he was like, you know... Karthik. Yeah. It's like, this guy just wanted to use language models and it was such a controversial pick at the time. Right.
Shunyu [00:01:39]: The full story is that in undergrad, I did some computer vision research and that's how I got into AI. But at the time, I feel like, you know, you're just composing all the GAN or 3D perception or whatever together and it's not exciting anymore. And one day I just see this transformer paper and that's really cool. But I really got into language model only when I entered my PhD and met my advisor Karthik. So he was actually the second author of GPT-1 when he was like a visiting scientist at OpenAI. With Alec Redford?
Swyx [00:02:10]: Yes.
Shunyu [00:02:11]: Wow. That's what he told me. It's like back in OpenAI, they did this GPT-1 together and Ilya just said, Karthik, you should stay because we just solved the language. But apparently Karthik is not fully convinced. So he went to Princeton, started his professorship and I'm really grateful. So he accepted me as a student, even though I have no prior knowledge in NLP. And you know, we just met for the first time and he's like, you know, what do you want to do? And I'm like, you know, you have done those test game scenes. That's really cool. I wonder if we can just redo them with language models. And that's how the whole journey began. Awesome.
Alessio [00:02:46]: So GPT-2 was out at the time? Yes, that was 2019.
Shunyu [00:02:48]: Yeah.
Alessio [00:02:49]: Way too dangerous to release. And then I guess the first work of yours that I came across was React, which was a big part of your defense. But also Harrison, when you came on The Pockets last year, you said that was one of the first papers that you saw when you were getting inspired for BlankChain. So maybe give a recap of why you thought it was cool, because you were already working in AI and machine learning. And then, yeah, you can kind of like intro the paper formally. What was that interesting to you specifically?
Harrison [00:03:16]: Yeah, I mean, I think the interesting part was using these language models to interact with the outside world in some form. And I think in the paper, you mostly deal with Wikipedia. And I think there's some other data sets as well. But the outside world is the outside world. And so interacting with things that weren't present in the LLM and APIs and calling into them and thinking about the React reasoning and acting and kind of like combining those together and getting better results. I'd been playing around with LLMs, been talking with people who were playing around with LLMs. People were trying to get LLMs to call into APIs, do things, and it was always, how can they do it more reliably and better? And so this paper was basically a step in that direction. And I think really interesting and also really general as well. Like I think that's part of the appeal is just how general and simple in a good way, I think the idea was. So that it was really appealing for all those reasons.
Shunyu [00:04:07]: Simple is always good. Yeah.
Alessio [00:04:09]: Do you have a favorite part? Because I have one favorite part from your PhD defense, which I didn't understand when I read the paper, but you said something along the lines, React doesn't change the outside or the environment, but it does change the insight through the context, putting more things in the context. You're not actually changing any of the tools around you to work for you, but you're changing how the model thinks. And I think that was like a very profound thing when I, not that I've been using these tools for like 18 months. I'm like, I understand what you meant, but like to say that at the time you did the PhD defense was not trivial. Yeah.
Shunyu [00:04:41]: Another way to put it is like thinking can be an extra tool that's useful.
Alessio [00:04:47]: Makes sense. Checks out.
Swyx [00:04:49]: Who would have thought? I think it's also more controversial within his world because everyone was trying to use RL for agents. And this is like the first kind of zero gradient type approach. Yeah.
Shunyu [00:05:01]: I think the bigger kind of historical context is that we have this two big branches of AI. So if you think about RL, right, that's pretty much the equivalent of agent at a time. And it's like agent is equivalent to reinforcement learning and reinforcement learning is equivalent to whatever game environment they're using, right? Atari game or go or whatever. So you have like a pretty much, you know, you have a biased kind of like set of methodologies in terms of reinforcement learning and represents agents. On the other hand, I think NLP is like a historical kind of subject. It's not really into agents, right? It's more about reasoning. It's more about solving those concrete tasks. And if you look at SEL, right, like each task has its own track, right? Summarization has a track, question answering has a track. So I think really it's about rethinking agents in terms of what could be the new environments that we came to have is not just Atari games or whatever video games, but also those text games or language games. And also thinking about, could there be like a more general kind of methodology beyond just designing specific pipelines for each NLP task? That's like the bigger kind of context, I would say.
Alessio [00:06:14]: Is there an inspiration spark moment that you remember or how did you come to this? We had Trida on the podcast and he mentioned he was really inspired working with like systems people to think about Flash Attention. What was your inspiration journey?
Shunyu [00:06:27]: So actually before React, I spent the first two years of my PhD focusing on text-based games, or in other words, text adventure games. It's a very kind of small kind of research area and quite ad hoc, I would say. And there are like, I don't know, like 10 people working on that at the time. And have you guys heard of Zork 1, for example? So basically the idea is you have this game and you have text observations, like you see a monster, you see a dragon.
Swyx [00:06:57]: You're eaten by a grue.
Shunyu [00:06:58]: Yeah, you're eaten by a grue. And you have actions like kill the grue with a sword or whatever. And that's like a very typical setup of a text game. So I think one day after I've seen all the GPT-3 stuff, I just think about, you know, how can I solve the game? Like why those AI, you know, machine learning methods are pretty stupid, but we are pretty good at solving the game relatively, right? So for the context, the predominant method to solve this text game is obviously reinforcement learning. And the idea is you just try out an arrow in those games for like millions of steps and you kind of just overfit to the game. But there's no language understanding at all. And I'm like, why can't I solve the game better? And it's kind of like, because we think about the game, right? Like when we see this very complex text observation, like you see a grue and you might see a sword, you know, in the right of the room and you have to go through the wooden door to go to that room. You will think, you know, oh, I have to kill the monster and to kill that monster, I have to get the sword, I have to get the sword, I have to go, right? And this kind of thinking actually helps us kind of throw shots off the game. And it's like, why don't we also enable the text agents to think? And that's kind of the prototype of React. And I think that's actually very interesting because the prototype, I think, was around November of 2021. So that's even before like chain of thought or whatever came up. So we did a bunch of experiments in the text game, but it was not really working that well. Like those text games are just too hard. I think today it's still very hard. Like if you use GPD 4 to solve it, it's still very hard. So the change came when I started the internship in Google. And apparently Google care less about text game, they care more about what's more practical. So pretty much I just reapplied the idea, but to more practical kind of environments like Wikipedia or simpler text games like Alphard, and it just worked. It's kind of like you first have the idea and then you try to find the domains and the problems to demonstrate the idea, which is, I would say, different from most of the AI research, but it kind of worked out for me in that case.
Swyx [00:09:09]: For Harrison, when you were implementing React, what were people applying React to in the early days?
Harrison [00:09:14]: I think the first demo we did probably had like a calculator tool and a search tool. So like general things, we tried to make it pretty easy to write your own tools and plug in your own things. And so this is one of the things that we've seen in LangChain is people who build their own applications generally write their own tools. Like there are a few common ones. I'd say like the three common ones might be like a browser, a search tool, and a code interpreter. But then other than that-
Swyx [00:09:37]: The LMS. Yep.
Harrison [00:09:39]: Yeah, exactly. It matches up very nice with that. And we actually just redid like our integrations docs page, and if you go to the tool section, they like highlight those three, and then there's a bunch of like other ones. And there's such a long tail of other ones. But in practice, like when people go to production, they generally have their own tools or maybe one of those three, maybe some other ones, but like very, very few other ones. So yeah, I think the first demos was a search and a calculator one. And there's- What's the data set?
Shunyu [00:10:04]: Hotpot QA.
Harrison [00:10:05]: Yeah. Oh, so there's that one. And then there's like the celebrity one by the same author, I think.
Swyx [00:10:09]: Olivier Wilde's boyfriend squared. Yeah. 0.23. Yeah. Right, right, right.
Harrison [00:10:16]: I'm forgetting the name of the author, but there's-
Swyx [00:10:17]: I was like, we're going to over-optimize for Olivier Wilde's boyfriend, and it's going to change next year or something.
Harrison [00:10:21]: There's a few data sets kind of like in that vein that require multi-step kind of like reasoning and thinking. So one of the questions I actually had for you in this vein, like the React paper, there's a few things in there, or at least when I think of that, there's a few things that I think of. There's kind of like the specific prompting strategy. Then there's like this general idea of kind of like thinking and then taking an action. And then there's just even more general idea of just like taking actions in a loop. Today, like obviously language models have changed a lot. We have tool calling. The specific prompting strategy probably isn't used super heavily anymore. Would you say that like the concept of React is still used though? Or like do you think that tool calling and running tool calling in a loop, is that React
Swyx [00:11:02]: in your mind?
Shunyu [00:11:03]: I would say like it's like more implicitly used than explicitly used. To be fair, I think the contribution of React is actually twofold. So first is this idea of, you know, we should be able to use calls in a very general way. Like there should be a single kind of general method to handle interaction with various environments. I think React is the first paper to demonstrate the idea. But then I think later there are two form or whatever, and this becomes like a trivial idea. But I think at the time, that's like a pretty non-trivial thing. And I think the second contribution is this idea of what people call like inner monologue or thinking or reasoning or whatever, to be paired with tool use. I think that's still non-trivial because if you look at the default function calling or whatever, like there's no inner monologue. And in practice, that actually is important, especially if the tool that you use is pretty different from the training distribution of the language model. I think those are the two main things that are kind of inherited.
Harrison [00:12:10]: On that note, I think OpenAI even recommended when you're doing tool calling, it's sometimes helpful to put a thought field in the tool, along with all the actual acquired arguments,
Swyx [00:12:19]: and then have that one first.
Harrison [00:12:20]: So it fills out that first, and they've shown that that's yielded better results. The reason I ask is just like this same concept is still alive, and I don't know whether to call it a React agent or not. I don't know what to call it. I think of it as React, like it's the same ideas that were in the paper, but it's obviously a very different implementation at this point in time. And so I just don't know what to call it.
Shunyu [00:12:40]: I feel like people will sometimes think more in terms of different tools, right? Because if you think about a web agent versus, you know, like a function calling agent, calling a Python API, you would think of them as very different. But in some sense, the methodology is the same. It depends on how you view them, right? I think people will tend to think more in terms of the environment and the tools rather than the methodology. Or, in other words, I think the methodology is kind of trivial and simple, so people will try to focus more on the different tools. But I think it's good to have a single underlying principle of those things.
Alessio [00:13:17]: How do you see the surface of React getting molded into the model? So a function calling is a good example of like, now the model does it. What about the thinking? Now most models that you use kind of do chain of thought on their own, they kind of produce steps. Do you think that more and more of this logic will be in the model? Or do you think the context window will still be the main driver of reasoning and thinking?
Shunyu [00:13:39]: I think it's already default, right? You do some chain of thought and you do some tool call, the cost of adding the chain of thought is kind of relatively low compared to other things. So it's not hurting to do that. And I think it's already kind of common practice, I would say.
Swyx [00:13:56]: This is a good place to bring in either Tree of Thought or Reflection, your pick.
Shunyu [00:14:01]: Maybe Reflection, to respect the time order, I would say.
Swyx [00:14:05]: Any backstory as well, like the people involved with NOAA and the Princeton group. We talked about this offline, but people don't understand how these research pieces come together and this ideation.
Shunyu [00:14:15]: I think Reflection is mostly NOAA's work, I'm more like advising kind of role. The story is, I don't remember the time, but one day we just see this pre-print that's like Reflection and Autonomous Agent with memory or whatever. And it's kind of like an extension to React, which uses this self-reflection. I'm like, oh, somehow you've become very popular. And NOAA reached out to me, it's like, do you want to collaborate on this and make this from an archive pre-print to something more solid, like a conference submission? I'm like, sure. We started collaborating and we remain good friends today. And I think another interesting backstory is NOAA was contacted by OpenAI at the time. It's like, this is pretty cool, do you want to just work at OpenAI? And I think Sierra also reached out at the same time. It's like, this is pretty cool, do you want to work at Sierra? And I think NOAA chose Sierra, but it's pretty cool because he was still like a second year undergrad and he's a very smart kid.
Swyx [00:15:16]: Based on one paper. Oh my god.
Shunyu [00:15:19]: He's done some other research based on programming language or chemistry or whatever, but I think that's the paper that got the attention of OpenAI and Sierra.
Swyx [00:15:28]: For those who haven't gone too deep on it, the way that you present the inside of React, can you do that also for reflection? Yeah.
Shunyu [00:15:35]: I think one way to think of reflection is that the traditional idea of reinforcement learning is you have a scalar reward and then you somehow back-propagate the signal of the scalar reward to the rest of your neural network through whatever algorithm, like policy grading or A2C or whatever. And if you think about the real life, most of the reward signal is not scalar. It's like your boss told you, you should have done a better job in this, but you could jump on that or whatever. It's not like a scalar reward, like 29 or something. I think in general, humans deal more with long scalar reward, or you can say language feedback. And the way that they deal with language feedback also has this back-propagation process, right? Because you start from this, you did a good job on job B, and then you reflect what could have been done differently to change to make it better. And you kind of change your prompt, right? Basically, you change your prompt on how to do job A and how to do job B, and then you do the whole thing again. So it's really like a pipeline of language where in self-graded descent, you have something like text reasoning to replace those gradient descent algorithms. I think that's one way to think of reflection.
Harrison [00:16:47]: One question I have about reflection is how general do you think the algorithm there is? And so for context, I think at LangChain and at other places as well, we found it pretty easy to implement React in a standard way. You plug in any tools and it kind of works off the shelf, can get it up and running. I don't think we have an off-the-shelf kind of implementation of reflection and kind of the general sense. I think the concepts, absolutely, we see used in different kind of specific cognitive architectures, but I don't think we have one that comes off the shelf. I don't think any of the other frameworks have one that comes off the shelf. And I'm curious whether that's because it's not general enough or it's complex as well, because it also requires running it more times.
Swyx [00:17:28]: Maybe that's not feasible.
Harrison [00:17:30]: I'm curious how you think about the generality, complexity. Should we have one that comes off the shelf?
Shunyu [00:17:36]: I think the algorithm is general in the sense that it's just as general as other algorithms, if you think about policy grading or whatever, but it's not applicable to all tasks, just like other algorithms. So you can argue PPO is also general, but it works better for those set of tasks, but not on those set of tasks. I think it's the same situation for reflection. And I think a key bottleneck is the evaluator, right? Basically, you need to have a good sense of the signal. So for example, if you are trying to do a very hard reasoning task, say mathematics, for example, and you don't have any tools, you're operating in this chain of thought setup, then reflection will be pretty hard because in order to reflect upon your thoughts, you have to have a very good evaluator to judge whether your thought is good or not. But that might be as hard as solving the problem itself or even harder. The principle of self-reflection is probably more applicable if you have a good evaluator, for example, in the case of coding. If you have those arrows, then you can just reflect on that and how to solve the bug and
Swyx [00:18:37]: stuff.
Shunyu [00:18:38]: So I think another criteria is that it depends on the application, right? If you have this latency or whatever need for an actual application with an end-user, the end-user wouldn't let you do two hours of tree-of-thought or reflection, right? You need something as soon as possible. So in that case, maybe this is better to be used as a training time technique, right? You do those reflection or tree-of-thought or whatever, you get a lot of data, and then you try to use the data to train your model better. And then in test time, you still use something as simple as React, but that's already improved.
Alessio [00:19:11]: And if you think of the Voyager paper as a way to store skills and then reuse them, how would you compare this reflective memory and at what point it's just ragging on the memory versus you want to start to fine-tune some of them or what's the next step once you get a very long reflective corpus? Yeah.
Shunyu [00:19:30]: So I think there are two questions here. The first question is, what type of information or memory are you considering, right? Is it like semantic memory that stores knowledge about the word, or is it the episodic memory that stores trajectories or behaviors, or is it more of a procedural memory like in Voyager's case, like skills or code snippets that you can use to do actions, right?
Swyx [00:19:54]: That's one dimension.
Shunyu [00:19:55]: And the second dimension is obviously how you use the memory, either retrieving from it, using it in the context, or fine-tuning it. I think the Cognitive Architecture for Language Agents paper has a good categorization of all the different combinations. And of course, which way you use depends on the concrete application and the concrete need and the concrete task. But I think in general, it's good to think of those systematic dimensions and all the possible options there.
Swyx [00:20:25]: Harrison also has in LangMEM, I think you did a presentation in my meetup, and I think you've done it at a couple other venues as well. User state, semantic memory, and append-only state, I think kind of maps to what you just said.
Shunyu [00:20:38]: What is LangMEM? Can I give it like a quick...
Harrison [00:20:40]: One of the modules of LangChain for a long time has been something around memory. And I think we're still obviously figuring out what that means, as is everyone kind of in the space. But one of the experiments that we did, and one of the proof of concepts that we did was, technically what it was is you would basically create threads, you'd push messages to those threads in the background, we process the data in a few ways. One, we put it into some semantic store, that's the semantic memory. And then two, we do some extraction and reasoning over the memories to extract. And we let the user define this, but extract key facts or anything that's of interest to the user. Those aren't exactly trajectories, they're maybe more closer to the procedural memory. Is that how you'd think about it or classify it?
Shunyu [00:21:22]: Is it like about knowledge about the word, or is it more like how to do something?
Swyx [00:21:27]: It's reflections, basically.
Harrison [00:21:28]: So in generative worlds.
Shunyu [00:21:30]: Generative agents.
Swyx [00:21:31]: The Smallville. Yeah, the Smallville one.
Harrison [00:21:33]: So the way that they had their memory there was they had the sequence of events, and that's kind of like the raw events that happened. But then every N events, they'd run some synthesis over those events for the LLM to insert its own memory, basically. It's that type of memory.
Swyx [00:21:49]: I don't know how that would be classified.
Shunyu [00:21:50]: I think of that as more of the semantic memory, but to be fair, I think it's just one way to think of that. But whether it's semantic memory or procedural memory or whatever memory, that's like an abstraction layer. But in terms of implementation, you can choose whatever implementation for whatever memory. So they're totally kind of orthogonal. I think it's more of a good way to think of the things, because from the history of cognitive science and cognitive architecture and how people study even neuroscience, that's the way people think of how the human brain organizes memory. And I think it's more useful as a way to think of things. But it's not like for semantic memory, you have to do this kind of way to retrieve or fine-tune, and for procedural memory, you have to do that. I think those are totally orthogonal kind of dimensions.
Harrison [00:22:34]: How much background do you have in cognitive sciences, and how much do you model some of your thoughts on?
Shunyu [00:22:40]: That's a great question, actually. I think one of the undergrad influences for my follow-up research is I was doing an internship at MIT's Computational Cognitive Science Lab with Josh Tannenbaum, and he's a very famous cognitive scientist. And I think a lot of his ideas still influence me today, like thinking of things in computational terms and getting interested in language and a lot of stuff, or even developing psychology kind of stuff. So I think it still influences me today.
Swyx [00:23:14]: As a developer that tried out LangMEM, the way I view it is just it's a materialized view of a stream of logs. And if anything, that's just useful for context compression. I don't have to use the full context to run it over everything. But also it's kind of debuggable. If it's wrong, I can show it to the user, the user can manually fix it, and I can carry on. That's a really good analogy. I like that. I'm going to steal that. Sure. Please, please. You know I'm bullish on memory databases. I guess, Tree of Thoughts? Yeah, Tree of Thoughts.
Shunyu [00:23:39]: I feel like I'm relieving the defense in like a podcast format. Yeah, no.
Alessio [00:23:45]: I mean, you had a banger. Well, this is the one where you're already successful and we just highlight the glory. It was really good. You mentioned that since thinking is kind of like taking an action, you can use action searching algorithms to think of thinking. So just like you will use Tree Search to find the next thing. And the idea behind Tree of Thought is that you generate all these possible outcomes and then find the best tree to get to the end. Maybe back to the latency question, you can't really do that if you have to respond in real time. So what are maybe some of the most helpful use cases for things like this? Where have you seen people adopt it where the high latency is actually worth the wait?
Shunyu [00:24:21]: For things that you don't care about latency, obviously. For example, if you're trying to do math, if you're just trying to come up with a proof. But I feel like one type of task is more about searching for a solution. You can try a hundred times, but if you find one solution, that's good. For example, if you're finding a math proof or if you're finding a good code to solve a problem or whatever, I think another type of task is more like reacting. For example, if you're doing customer service, you're like a web agent booking a ticket for an end user. Those are more reactive kind of tasks, or more real-time tasks. You have to do things fast. They might be easy, but you have to do it reliably. And you care more about can you solve 99% of the time out of a hundred. But for the type of search type of tasks, then you care more about can I find one solution out of a hundred. So it's kind of symmetric and different.
Alessio [00:25:11]: Do you have any data or intuition from your user base? What's the split of these type of use cases? How many people are doing more reactive things and how many people are experimenting with deep, long search?
Harrison [00:25:23]: I would say React's probably the most popular. I think there's aspects of reflection that get used. Tree of thought, probably the least so. There's a great tweet from Jason Wei, I think you're now a colleague, and he was talking about prompting strategies and how he thinks about them. And I think the four things that he had was, one, how easy is it to implement? How much compute does it take? How many tasks does it solve? And how much does it improve on those tasks? And I'd add a fifth, which is how likely is it to be relevant when the next generation of models come out? And I think if you look at those axes and then you look at React, reflection, tree of thought, it tracks that the ones that score better are used more. React is pretty easy to implement. Tree of thought's pretty hard to implement. The amount of compute, yeah, a lot more for tree of thought. The tasks and how much it improves, I don't have amazing visibility there. But I think if we're comparing React versus tree of thought, React just dominates the first two axes so much that my question around that was going to be like, how do you think about these prompting strategies, cognitive architectures, whatever you want to call them? When you're thinking of them, what are the axes that you're judging them on in your head when you're thinking whether it's a good one or a less good one?
Swyx [00:26:38]: Right.
Shunyu [00:26:39]: Right. I think there is a difference between a prompting method versus research, in the sense that for research, you don't really even care about does it actually work on practical tasks or does it help? Whatever. I think it's more about the idea or the principle, right? What is the direction that you're unblocking and whatever. And I think for an actual prompting method to solve a concrete problem, I would say simplicity is very important because the simpler it is, the less decision you have to make about it. And it's easier to design. It's easier to propagate. And it's easier to do stuff. So always try to be as simple as possible. And I think latency obviously is important. If you can do things fast and you don't want to do things slow. And I think in terms of the actual prompting method to use for a particular problem, I think we should all be in the minimalist kind of camp, right? You should try the minimum thing and see if it works. And if it doesn't work and there's absolute reason to add something, then you add something, right? If there's absolute reason that you need some tool, then you should add the tool thing. If there's absolute reason to add reflection or whatever, you should add that. Otherwise, if a chain of thought can already solve something, then you don't even need to use any of that.
Harrison [00:27:57]: Yeah. Or if it's just better prompting can solve it. Like, you know, you could add a reflection step or you could make your instructions a little bit clearer.
Swyx [00:28:03]: And it's a lot easier to do that.
Shunyu [00:28:04]: I think another interesting thing is like, I personally have never done those kind of like weird tricks. I think all the prompts that I write are kind of like just talking to a human, right? It's like, I don't know. I never say something like, your grandma is dying and you have to solve it. I mean, those are cool, but I feel like we should all try to solve things in a very intuitive way. Just like talking to your co-worker. That should work 99% of the time. That's my personal take.
Swyx [00:28:29]: The problem with how language models, at least in the GPC 3 era, was that they over-optimized to some sets of tokens in sequence. So like reading the Kojima et al. paper that was listing step-by-step, like he tried a bunch of them and they had wildly different results. It should not be the case, but it is the case. And hopefully we're getting better there.
Shunyu [00:28:51]: Yeah. I think it's also like a timing thing in the sense that if you think about this whole line of language model, right? Like at the time it was just like a text generator. We don't have any idea how it's going to be used, right? And obviously at the time you will find all kinds of weird issues because it's not trained to do any of that, right? But then I think we have this loop where once we realize chain of thought is important or agent is important or tool using is important, what we see is today's language models are heavily optimized towards those things. So I think in some sense they become more reliable and robust over those use cases. And you don't need to do as much prompt engineering tricks anymore to solve those things. I feel like in some sense, I feel like prompt engineering even is like a slightly negative word at the time because it refers to all those kind of weird tricks that you have to apply. But I think we don't have to do that anymore. Like given today's progress, you should just be able to talk to like a coworker. And if you're clear and concrete and being reasonable, then it should do reasonable things for you.
Swyx [00:29:51]: Yeah. The way I put this is you should not be a prompt engineer because it is the goal of the big labs to put you out of a job.
Shunyu [00:29:58]: You should just be a good communicator. Like if you're a good communicator to humans, you should be a good communicator to language
Swyx [00:30:02]: models.
Harrison [00:30:03]: That's the key though, because oftentimes people aren't good communicators to these language models and that is a very important skill and that's still messing around with the prompt. And so it depends what you're talking about when you're saying prompt engineer.
Shunyu [00:30:14]: But do you think it's like very correlated with like, are they like a good communicator to humans? You know, it's like.
Harrison [00:30:20]: It may be, but I also think I would say on average, people are probably worse at communicating with language models than to humans right now, at least, because I think we're still figuring out how to do it. You kind of expect it to be magical and there's probably some correlation, but I'd say there's also just like, people are worse at it right now than talking to humans.
Shunyu [00:30:36]: We should make it like a, you know, like an elementary school class or whatever, how to
Swyx [00:30:41]: talk to language models. Yeah. I don't know. Very pro that. Yeah. Before we leave the topic of trees and searching, not specific about QSTAR, but there's a lot of questions about MCTS and this combination of tree search and language models. And I just had to get in a question there about how seriously should people take this?
Shunyu [00:30:59]: Again, I think it depends on the tasks, right? So MCTS was magical for Go, but it's probably not as magical for robotics, right? So I think right now the problem is not even that we don't have good methodologies, it's more about we don't have good tasks. It's also very interesting, right? Because if you look at my citation, it's like, obviously the most cited are React, Refraction and Tree of Thought. Those are methodologies. But I think like equally important, if not more important line of my work is like benchmarks and environments, right? Like WebShop or SuiteVenture or whatever. And I think in general, what people do in academia that I think is not good is they choose a very simple task, like Alford, and then they apply overly complex methods to show they improve 2%. I think you should probably match the level of complexity of your task and your method. I feel like where tasks are kind of far behind the method in some sense, right? Because we have some good test-time approaches, like whatever, React or Refraction or Tree of Thought, or like there are many, many more complicated test-time methods afterwards. But on the benchmark side, we have made a lot of good progress this year, last year. But I think we still need more progress towards that, like better coding benchmark, better web agent benchmark, better agent benchmark, not even for web or code. I think in general, we need to catch up with tasks.
Harrison [00:32:27]: What are the biggest reasons in your mind why it lags behind?
Shunyu [00:32:31]: I think incentive is one big reason. Like if you see, you know, all the master paper are cited like a hundred times more than the task paper. And also making a good benchmark is actually quite hard. It's almost like a different set of skills in some sense, right? I feel like if you want to build a good benchmark, you need to be like a good kind of product manager kind of mindset, right? You need to think about why people should use your benchmark, why it's challenging, why it's useful. If you think about like a PhD going into like a school, right? The prior skill that expected to have is more about, you know, can they code this method and can they just run experiments and can solve that? I think building a benchmark is not the typical prior skill that we have, but I think things are getting better. I think more and more people are starting to build benchmarks and people are saying that it's like a way to get more impact in some sense, right? Because like if you have a really good benchmark, a lot of people are going to use it. But if you have a super complicated test time method, like it's very hard for people to use it.
Harrison [00:33:35]: Are evaluation metrics also part of the reason? Like for some of these tasks that we might want to ask these agents or language models to do, is it hard to evaluate them? And so it's hard to get an automated benchmark. Obviously with SweetBench you can, and with coding, it's easier, but.
Shunyu [00:33:50]: I think that's part of the skillset thing that I mentioned, because I feel like it's like a product manager because there are many dimensions and you need to strike a balance and it's really hard, right? If you want to make sense, very easy to autogradable, like automatically gradable, like either to grade or either to evaluate, then you might lose some of the realness or practicality. Or like it might be practical, but it might not be as scalable, right? For example, if you think about text game, human have pre-annotated all the rewards and all the language are real. So it's pretty good on autogradable dimension and the practical dimension. If you think about, you know, practical, like actual English being practical, but it's not scalable, right? It takes like a year for experts to build that game. So it's not really that scalable. And I think part of the reason that SweetBench is so popular now is it kind of hits the balance between these three dimensions, right? Easy to evaluate and being actually practical and being scalable. Like if I were to criticize upon some of my prior work, I think webshop, like it's my initial attempt to get into benchmark world and I'm trying to do a good job striking the balance. But obviously we make it all gradable and it's really scalable, but then I think the practicality is not as high as actually just using GitHub issues, right? Because you're just creating those like synthetic tasks.
Harrison [00:35:13]: Are there other areas besides coding that jump to mind as being really good for being autogradable?
Shunyu [00:35:20]: Maybe mathematics.
Swyx [00:35:21]: Classic. Yeah. Do you have thoughts on alpha proof, the new DeepMind paper? I think it's pretty cool.
Shunyu [00:35:29]: I think it's more of a, you know, it's more of like a confidence boost or like sometimes, you know, the work is not even about, you know, the technical details or the methodology that it chooses or the concrete results. I think it's more about a signal, right?
Swyx [00:35:47]: Yeah. Existence proof. Yeah.
Shunyu [00:35:50]: Yeah. It can be done. This direction is exciting. It kind of encourages people to work more towards that direction. I think it's more like a boost of confidence, I would say.
Swyx [00:35:59]: Yeah. So we're going to focus more on agents now and, you know, all of us have a special interest in coding agents. I would consider Devin to be the sort of biggest launch of the year as far as AI startups go. And you guys in the Princeton group worked on Suiagents alongside of Suibench. Tell us the story about Suiagent. Sure.
Shunyu [00:36:21]: I think it's kind of like a triology, it's actually a series of three works now. So actually the first work is called Intercode, but it's not as famous, I know. And the second work is called Suibench and the third work is called Suiagent. And I'm just really confused why nobody is working on coding. You know, it's like a year ago, but I mean, not everybody's working on coding, obviously, but a year ago, like literally nobody was working on coding. I was really confused. And the people that were working on coding are, you know, trying to solve human evil in like a sick-to-sick way. There's no agent, there's no chain of thought, there's no anything, they're just, you know, fine tuning the model and improve some points and whatever, like, I was really confused because obviously coding is the best application for agents because it's autogradable, it's super important, you can make everything like API or code action, right? So I was confused and I collaborated with some of the students in Princeton and we have this work called Intercode and the idea is, first, if you care about coding, then you should solve coding in an interactive way, meaning more like a Jupyter Notebook kind of way than just writing a program and seeing if it fails or succeeds and stop, right? You should solve it in an interactive way because that's exactly how humans solve it, right? You don't have to, you know, write a program like next token, next token, next token and stop and never do any edits and you cannot really use any terminal or whatever tool. It doesn't make sense, right? And that's the way people are solving coding at the time, basically like sampling a program from a language model without chain of thought, without tool call, without refactoring, without anything. So the first point is we should solve coding in a very interactive way and that's a very general principle that applies for various coding benchmarks. And also, I think you can make a lot of the agent task kind of like interactive coding. If you have Python and you can call any package, then you can literally also browse internet or do whatever you want, like control a robot or whatever. So that seems to be a very general paradigm. But obviously I think a bottleneck is at the time we're still doing, you know, very simple tasks like human eval or whatever coding benchmark people proposed. They were super hard in 2021, like 20%, but they're like 95% already in 2023. So obviously the next step is we need a better benchmark. And Carlos and John, which are the first authors of Swaybench, I think they come up with this great idea that we should just script GitHub and solve whatever human engineers are solving. And I think it's actually pretty easy to come up with the idea. And I think in the first week, they already made a lot of progress. They script the GitHub and they make all the same, but then there's a lot of painful info work and whatever, you know. I think the idea is super easy, but the engineering is super hard. And I feel like that's a very typical signal of a good work in the AI era now.
Swyx [00:39:17]: I think also, I think the filtering was challenging, because if you look at open source PRs, a lot of them are just like, you know, fixing typos. I think it's challenging.
Shunyu [00:39:27]: And to be honest, we didn't do a perfect job at the time. So if you look at the recent blog post with OpenAI, we improved the filtering so that it's more solvable.
Swyx [00:39:36]: I think OpenAI was just like, look, this is a thing now. We have to fix this. These students just rushed it.
Shunyu [00:39:45]: It's a good convergence of interests for me.
Alessio [00:39:48]: Was that tied to you joining OpenAI? Or was that just unrelated?
Shunyu [00:39:52]: It's a coincidence for me, but it's a good coincidence.
Swyx [00:39:55]: There is a history of anytime a big lab adopts a benchmark, they fix it. Otherwise, it's a broken benchmark.
Shunyu [00:40:03]: So naturally, once we propose swimmage, the next step is to solve it. But I think the typical way you solve something now is you collect some training samples, or you design some complicated agent method, and then you try to solve it. Either super complicated prompt, or you build a better model with more training data. But I think at the time, we realized that even before those things, there's a fundamental problem with the interface or the tool that you're supposed to use. Because that's like an ignored problem in some sense. What your tool is, or how that matters for your task. So what we found concretely is that if you just use the text terminal off the shelf as a tool for those agents, there's a lot of problems. For example, if you edit something, there's no feedback. So you don't know whether your edit is good or not. That makes the agent very confused and makes a lot of mistakes. There are a lot of small problems, you would say. Well, you can try to do prompt engineering and improve that, but it turns out to be actually very hard. We realized that the interface design is actually a very omitted part of agent design. So we did this switch agent work. And the key idea is just, even before you talk about what the agent is, you should talk about what the environment is. You should make sure that the environment is actually friendly to whatever agent you're trying to apply. That's the same idea for humans. Text terminal is good for some tasks, like git, pool, or whatever. But it's not good if you want to look at browser and whatever. Also, browser is a good tool for some tasks, but it's not a good tool for other tasks. We need to talk about how design interface, in some sense, where we should treat agents as our customers. It's like when we treat humans as a customer, we design human computer interfaces. We design those beautiful desktops or browsers or whatever, so that it's very intuitive and easy for humans to use. And this whole great subject of HCI is all about that. I think now the research idea of switch agent is just, we should treat agents as our customers. And we should do like, you know… AICI.
Swyx [00:42:16]: AICI, exactly.
Harrison [00:42:18]: So what are the tools that a suite agent should have, or a coding agent in general should have?
Shunyu [00:42:24]: For suite agent, it's like a modified text terminal, which kind of adapts to a lot of the patterns of language models to make it easier for language models to use. For example, now for edit, instead of having no feedback, it will actually have a feedback of, you know, actually here you introduced like a syntax error, and you should probably want to fix that, and there's an ended error there. And that makes it super easy for the model to actually do that. And there's other small things, like how exactly you write arguments, right? Like, do you want to write like a multi-line edit, or do you want to write a single line edit? I think it's more interesting to think about the way of the development process of an ACI rather than the actual ACI for like a concrete application. Because I think the general paradigm is very similar to HCI and psychology, right? Basically, for how people develop HCIs, they do behavior experiments on humans, right? I do every test, right? Like, which interface is actually better? And I do those behavior experiments, kind of like psychology experiments to humans, and I change things. And I think what's really interesting for me, for this three-agent paper, is we can probably do the same thing for agents, right? We can do every test for those agents and do behavior tests. And through the process, we not only invent better interfaces for those agents, that's the practical value, but we also better understand agents. Just like when we do those A-B tests, we do those HCI, we better understand humans. Doing those ACI experiments, we actually better understand agents. And that's pretty cool.
Harrison [00:43:51]: Besides that A-B testing, what are other processes that people can use to think about this in a good way?
Swyx [00:43:57]: That's a great question.
Shunyu [00:43:58]: And I think three-agent is an initial work. And what we do is the kind of the naive approach, right? You just try some interface, and you see what's going wrong, and then you try to fix that. We do this kind of iterative fixing. But I think what's really interesting is there'll be a lot of future directions that's very promising if we can apply some of the HCI principles more systematically into the interface design. I think that would be a very cool interdisciplinary research opportunity.
Harrison [00:44:26]: You talked a lot about agent-computer interfaces and interactions. What about human-to-agent UX patterns? Curious for any thoughts there that you might have.
Swyx [00:44:38]: That's a great question.
Shunyu [00:44:39]: And in some sense, I feel like prompt engineering is about human-to-agent interface. But I think there can be a lot of interesting research done about... So prompting is about how humans can better communicate with the agent. But I think there could be interesting research on how agents can better communicate with humans, right? When to ask questions, how to ask questions, what's the frequency of asking questions. And I think those kinds of stuff could be very cool research.
Harrison [00:45:07]: Yeah, I think some of the most interesting stuff that I saw here was also related to coding with Devin from Cognition. And they had the three or four different panels where you had the chat, the browser, the terminal, and I guess the code editor as well.
Swyx [00:45:19]: There's more now.
Harrison [00:45:19]: There's more. Okay, I'm not up to date. Yeah, I think they also did a good job on ACI.
Swyx [00:45:25]: I think that's the main learning I have from Devin. They cracked that. Actually, there was no foundational planning breakthrough. The planner is actually pretty simple, but ACI that they broke through on.
Shunyu [00:45:35]: I think making the tool good and reliable is probably like 90% of the whole agent. Once the tool is actually good, then the agent design can be much, much simpler. On the other hand, if the tool is bad, then no matter how much you put into the agent design, planning or search or whatever, it's still going to be trash.
Harrison [00:45:53]: Yeah, I'd argue the same. Same with like context and instructions. Like, yeah, go hand in hand.
Alessio [00:46:00]: On the tool, how do you think about the tension of like, for both of you, I mean, you're building a library, so even more for you. The tension between making now a language or a library that is like easy for the agent to grasp and write versus one that is easy for like the human to grasp and write. Because, you know, the trend is like more and more code gets written by the agent. So why wouldn't you optimize the framework to be as easy as possible for the model versus for the person?
Swyx [00:46:24]: I think it's possible to design an interface
Shunyu [00:46:25]: that's both friendly to humans and agents. But what do you think?
Harrison [00:46:29]: We haven't thought about that from the perspective, like we're not trying to design LangChain or LangGraph to be friendly. But I mean, I think to be friendly for agents to write.
Swyx [00:46:42]: But I mean, I think we see this with like,
Harrison [00:46:43]: I saw some paper that used TypeScript notation instead of JSON notation for tool calling and it got a lot better performance. So it's definitely a thing. I haven't really heard of anyone designing like a syntax or a language explicitly for agents, but there's clearly syntaxes that are better.
Shunyu [00:46:59]: I think function calling is a good example where it's like a good interface for both human programmers and for agents, right? Like for developers, it's actually a very friendly interface because it's very concrete and you don't have to do prompt engineering anymore. You can be very systematic. And for models, it's also pretty good, right? Like it can use all the existing coding content. So I think we need more of those kinds of designs.
Swyx [00:47:21]: I will mostly agree and I'll slightly disagree in terms of this, which is like, whether designing for humans also overlaps with designing for AI. So Malte Ubo, who's the CTO of Vercel, who is creating basically JavaScript's competitor to LangChain, they're observing that basically, like if the API is easy to understand for humans, it's actually much easier to understand for LLMs, for example, because they're not overloaded functions. They don't behave differently under different contexts. They do one thing and they always work the same way. It's easy for humans, it's easy for LLMs. And like that makes a lot of sense. And obviously adding types is another one. Like type annotations only help give extra context, which is really great. So that's the agreement. And then a disagreement is that when I use structured output to do my chain of thought, I have found that I change my field names to hint to the LLM of what the field is supposed to do. So instead of saying topics, I'll say candidate topics. And that gives me a better result because the LLM was like, ah, this is just a draft thing I can use for chain of thought. And instead of like summaries, I'll say topic summaries to link the previous field to the current field. So like little stuff like that, I find myself optimizing for the LLM where I, as a human, would never do that. Interesting.
Shunyu [00:48:32]: It's kind of like the way you optimize the prompt, it might be different for humans and for machines. You can have a common ground that's both clear for humans and agents, but to improve the human performance versus improving the agent performance, they might move to different directions.
Swyx [00:48:48]: Might move different directions. There's a lot more use of metadata as well, like descriptions, comments, code comments, annotations and stuff like that. Yeah.
Harrison [00:48:56]: I would argue that's just you communicating
Swyx [00:48:58]: to the agent what it should do.
Harrison [00:49:00]: And maybe you need to communicate a little bit more than to humans because models aren't quite good enough yet.
Swyx [00:49:06]: But like, I don't think that's crazy.
Harrison [00:49:07]: I don't think that's like- It's not crazy.
Swyx [00:49:09]: I will bring this in because it just happened to me yesterday. I was at the cursor office. They held their first user meetup and I was telling them about the LLM OS concept and why basically every interface, every tool was being redesigned for AIs to use rather than humans. And they're like, why? Like, can we just use Bing and Google for LLM search? Why must I use Exa? Or what's the other one that you guys work with?
Harrison [00:49:32]: Tavilli.
Swyx [00:49:33]: Tavilli. Web Search API dedicated for LLMs. What's the difference?
Shunyu [00:49:36]: Exactly. To Bing API.
Swyx [00:49:38]: Exactly.
Harrison [00:49:38]: There weren't great APIs for search. Like the best one, like the one that we used initially in LangChain was SERP API, which is like maybe illegal. I'm not sure.
Swyx [00:49:49]: And like, you know,
Harrison [00:49:52]: and now there are like venture-backed companies.
Swyx [00:49:53]: Shout out to DuckDuckGo, which is free.
Harrison [00:49:55]: Yes, yes.
Swyx [00:49:56]: Yeah.
Harrison [00:49:56]: I do think there are some differences though. I think you want, like, I think generally these APIs try to return small amounts of text information, clear legible field. It's not a massive JSON blob. And I think that matters. I think like when you talk about designing tools, it's not only the, it's the interface in the entirety, not only the inputs, but also the outputs that really matter. And so I think they try to make the outputs.
Shunyu [00:50:18]: They're doing ACI.
Swyx [00:50:19]: Yeah, yeah, absolutely.
Harrison [00:50:20]: Really?
Swyx [00:50:21]: Like there's a whole set of industries that are just being redone for ACI. It's weird. And so my simple answer to them was like the error messages. When you give error messages, they should be basically prompts for the LLM to take and then self-correct. Then your error messages get more verbose, actually, than you normally would with a human. Stuff like that. Like a little, honestly, it's not that big. Again, like, is this worth a venture-backed industry? Unless you can tell us. But like, I think Code Interpreter, I think is a new thing. I hope so.
Alessio [00:50:52]: We invested in it to be so.
Shunyu [00:50:53]: I think that's a very interesting point. You're trying to optimize to the extreme, then obviously they're going to be different. For example, the error—
Swyx [00:51:00]: Because we take it very seriously. Right.
Shunyu [00:51:01]: The error for like language model, the longer the better. But for humans, that will make them very nervous and very tired, right? But I guess the point is more like, maybe we should try to find a co-optimized common ground as much as possible. And then if we have divergence, then we should try to diverge. But it's more philosophical now.
Alessio [00:51:19]: But I think like part of it is like how you use it. So Google invented the PageRank because ideally you only click on one link, you know, like the top three should have the answer. But with models, it's like, well, you can get 20. So those searches are more like semantic grouping in a way. It's like for this query, I'll return you like 20, 30 things that are kind of good, you know? So it's less about ranking and it's more about grouping.
Shunyu [00:51:42]: Another fundamental thing about HCI is the difference between human and machine's kind of memory limit, right? So I think what's really interesting about this concept HCI versus HCI is interfaces that's optimized for them. You can kind of understand some of the fundamental characteristics, differences of humans and machines, right? Why, you know, if you look at find or whatever terminal command, you know, you can only look at one thing at a time or that's because we have a very small working memory. You can only deal with one thing at a time. You can only look at one paragraph of text at the same time. So the interface for us is by design, you know, a small piece of information, but more temporal steps. But for machines, that should be the opposite, right? You should just give them a hundred different results and they should just decide in context what's the most relevant stuff and trade off the context for temporal steps. That's actually also better for language models because like the cost is smaller or whatever. So it's interesting to connect those interfaces to the fundamental kind of differences of those.
Harrison [00:52:43]: When you said earlier, you know, we should try to design these to maybe be similar as possible and diverge if we need to.
Swyx [00:52:49]: I actually don't have a problem with them diverging now
Harrison [00:52:51]: and seeing venture-backed startups emerging now because we are different from machines code AI. And it's just so early on, like they may still look kind of similar and they may still be small differences, but it's still just so early. And I think we'll only discover more ways that they differ. And so I'm totally fine with them kind of like diverging early
Swyx [00:53:10]: and optimizing for the...
Harrison [00:53:11]: I agree. I think it's more like, you know,
Shunyu [00:53:14]: we should obviously try to optimize human interface just for humans. We're already doing that for 50 years. We should optimize agent interface just for agents, but we might also try to co-optimize both and see how far we can get. There's enough people to try all three directions. Yeah.
Swyx [00:53:31]: There's a thesis I sometimes push, which is the sour lesson as opposed to the bitter lesson, which we're always inspired by human development, but actually AI develops its own path.
Shunyu [00:53:40]: Right. We need to understand better, you know, what are the fundamental differences between those creatures.
Swyx [00:53:45]: It's funny when really early on this pod, you were like, how much grounding do you have in cognitive development and human brain stuff? And I'm like, maybe that doesn't matter. And actually, so in my original agents blog posts, I had a picture of the human brain, and now it looks a lot more like a CPU. Canonical picture of the LLMOS is kind of like a CPU with all the input and output going into it. And I think that that's probably the more scalable system.
Shunyu [00:54:10]: I think the problem with a lot of cognitive scientists is that... They think by analogy, right? They think, you know, the only way to solve intelligence is through the human way. And therefore they like have a lot of critics for whatever things that are not cognitive or human. But I think a more useful way to use those knowledge is to think of that as just a reference point. I don't think we should copy exactly what's going on with humans all the way, but I think it's good to have a reference point because this is a working example of how intelligence works. Yeah. And if you know all the knowledge and you compare them, I think that actually establishes more interesting insights as opposed to just copying that, or not copying that, or opposing that. I think comparing is the way to go.
Swyx [00:54:53]: I feel like this is an unanswerable question, but I'll just put it out there anyway. If we can answer this, I think it'll be worth a lot, which is, can we separate intelligence from knowledge?
Shunyu [00:55:01]: That's a very deep question, actually. And to have a little history background, I think that's really the key thesis at the beginning of AI. If you think about Neville and Simon and all those symbolic AI people, basically, they're trying to create intelligence by writing down all the knowledge. For example, they write a checker program, basically, how you will solve the checker. You write down all the knowledge and then implement that. I think the whole thesis of symbolic AI is, we should just be able to write down all the knowledge, and that just creates intelligence, but that kind of fails. And I think, really, a great quote from Hinton is, I think there are two approaches to intelligence. One approach is, let's deal with reasoning or thinking or knowledge, whatever you call that, and then let's worry about learning later. The other approach is, let's deal with learning first, and then let's worry about whatever, knowledge or reasoning or thinking later. And it turns out, right now, at least, the second approach works, and the first approach doesn't work. And I think there might be something deep about it. Does that answer your question?
Swyx [00:56:08]: Partially. I think Apple Intelligence might change that. Can you explain? If this year is the year of multi-modal models, next year is on-device year, and Apple Intelligence basically has hot-swappable capabilities, right? They have 50 Loras that they swap onto a base model that does different tasks. And that's the first instance that we have of the separation of intelligence and knowledge. And I think that's a really interesting approach. Obviously, it's not exactly knowledge. It's just more styles. Context.
Shunyu [00:56:37]: Yeah, it's more about context.
Swyx [00:56:38]: So it's like, you can have the same model
Shunyu [00:56:40]: deployed to 10 million phones with 10 million contacts, and see if...
Swyx [00:56:44]: For on-device deployment, I think it's super important. Like, if you can boil out... Like, I actually have most of my problems with AI news when the model thinks it knows more than it knows because it combines knowledge with intelligence. I want it to have zero knowledge whatsoever, and it only has the ability to parse the things I tell it.
Shunyu [00:57:00]: I kind of get what you mean. I feel like it's more like memorization versus kind of just generalization in some sense. Yeah, raw ability to understand things. You don't want it to know facts like who is the president of the United States. They should be able to just call the internet and use a tool to solve it.
Swyx [00:57:15]: Yes, right. Because otherwise, it's not going to call the tool if it thinks it knows.
Shunyu [00:57:19]: I kind of get what you mean. I think it's... That's why it's valuable. Okay, so if that's the case, I guess my point is, I don't think it's possible to fully separate them because those kinds of intelligence kind of emerges. Even for humans, you can't just operate in an intelligent mode without knowledge, right? Throughout the years, you learn how to do things and what things are, and it's very hard to separate those things. I would say, yeah.
Swyx [00:57:45]: But what if we could? As a meta strategy, I'm trying to keep a stack-ranked list of what are the 10 most valuable questions.
Shunyu [00:57:55]: You can think of knowledge as a cache of intelligence in some sense. Like if you have like wikihow.com saying that you should tie a shoelace using the following stuff, you can think of that piece of text as like a cache to intelligence. Right.
Alessio [00:58:13]: I guess that's kind of like reflection anyway, right? It's like you're storing these things as memory and then you put them back. So without the knowledge, you wouldn't have the intelligence to do it better. Right.
Swyx [00:58:23]: I had a couple of things.
Alessio [00:58:24]: So we had Thomas Shalom from Meta to talk about Llama 3.1. Then he started talking about Llama 4.
Swyx [00:58:30]: Yeah, he was like, whoa, okay.
Alessio [00:58:33]: And he said it's going to be like really focused on agents. I know you talked before about, you know, it's next token prediction enough to get to like problem solving. If you say you got the perfect environment, they got the terminal, they got everything. And if you were to now move down to the model level and say, I need to make a model that is better for like a genetic workflow,
Swyx [00:58:52]: where would you start?
Shunyu [00:58:53]: I think it's data. I think it's data because like changing architecture now is too hard and we don't have a good, better alternative solution now. I think it's mostly about data and agent data is obviously hard because people just write down the final result on the internet. They don't write down how they, like step by step, how they do this thing on the internet, right? So naturally it's easier for models to learn chain of thought than tool call or whatever, agent self-reflection or search, right? Like even if you do a search, you won't write down all the search processes
Swyx [00:59:24]: on the internet.
Shunyu [00:59:24]: You would just write down the final result. And I think it's a great thing that Llama4 is going to be more towards agents. That means, I mean, that should mean a lot for a lot of people.
Swyx [00:59:35]: In terms of data,
Harrison [00:59:36]: you think the right data looks like trajectories basically of a React agent or of...
Swyx [00:59:43]: Yeah, I mean,
Shunyu [00:59:44]: I have a paper called FireAct. Do you still remember?
Swyx [00:59:47]: No. Okay. Tell us. Okay.
Shunyu [00:59:49]: That's one of the not famous paper, I guess.
Swyx [00:59:52]: It's not even on your website.
Alessio [00:59:53]: How are we supposed to find it?
Swyx [00:59:55]: It's on this Google Scholar. I've got it pulled up. Okay.
Shunyu [00:59:58]: It's not... It's been rejected for like a couple of times.
Alessio [01:00:03]: But now it's online in space. Yeah, everybody will find it.
Shunyu [01:00:05]: Anyway, I think the idea is very simple. Like you can try a lot of different agent methods, right? React, chain of thought, reflection, whatever. And the idea is very simple. You just have very diverse data, like tasks, and you try very diverse agent methods, and you filter all the correct solutions and you train a model on all of that. And then the benefit is that you should somehow learn, you know, how to use simpler methods for simpler tasks and harder methods for harder tasks. I guess the problem is we don't have diverse high quality tasks. That's the bottleneck for it.
Harrison [01:00:35]: So it's going to be trained on all code.
Shunyu [01:00:36]: Yeah, let's hope we have more better benchmarks.
Alessio [01:00:39]: In school, that kind of pissed me off a little bit. When you're doing like a homework exercises for like calculus, like they give you the problem, then they give you the solution. But there's no way without the professor or the TA to get like the steps to actually how you got there. And so I feel like because of how schools are structured, we never brought this thing down. But I feel like if you went to every university and it's like, write down step-by-step the solution to every single problem in the set and make it available online, that's a start to make this dataset better.
Shunyu [01:01:06]: I think it's also because,
Swyx [01:01:08]: you know,
Shunyu [01:01:08]: it might be hard for you to write down your chain of thought, even when you're solving the same, because part of that is conscious in language, but maybe even part of that is not in language. And okay, so a funny side story. So when I wrote down the React thing, I was telling to my Google manager, like, you know what we should do? We should just hire, you know, as many people as possible and let them use Google and write down exactly what they think, what they search on the internet. And we train them all on that. But I think it's non-trivial to write down your thoughts. Like if you're not trained to do that, if I tell you like, okay, write down what you're thinking right now, it's actually not as trivial a task as you might imagine.
Swyx [01:01:48]: It might be more of a diffusion process than the autoregressive process.
Alessio [01:01:52]: But I think the problem is starting with the experts, you know, because there's so much like muscle memory and what you do once you've done it for so long. That's why we need to like get everybody to do it. And then you can see like- Separate knowledge and intelligence.
Shunyu [01:02:06]: The simplest way to achieve AGI is literally just record the reaction of every human being and just put them together, you know? Like, what do you have thought about?
Swyx [01:02:16]: Yeah.
Shunyu [01:02:16]: What do you have done? Let's say on the computer, right? Imagine like a thought experiment. Like you write down literally everything you think about and everything you do on the computer and you record them and you train on all the successful trajectories by some metric of success. I think that should just lead us to AGI.
Swyx [01:02:33]: My first work of fiction in like 10 years was exploring that idea. What if you recorded everything and uploaded yourself? I'm pretty science-based, like, you know, but probably the most like spiritual woo-woo thing about me is I don't think that would lead to consciousness or AGI just because like there's something in- there's a soul, you know? That is the unspeakable quality of- Let's say it emerges through skill. We can simulate that for sure.
Harrison [01:02:58]: What do you think about the role of few-shot prompting for some of these like agent trajectories? That was a big part of the original React paper, I think. And as we talk about showing your work
Swyx [01:03:09]: and how you think like-
Harrison [01:03:09]: I feel like it's becoming less used
Shunyu [01:03:12]: than zero-shot prompting. What's your observation?
Harrison [01:03:15]: I'm pretty bullish on it, to be honest. For a few reasons, like one, I think it can maybe help for more complex things. But then also two, like, it's a form of prompting and prompting is just communicating with the model what you want it to do. And sometimes it's easier to just show the model what you want it to do than write out detailed kind of like instructions.
Shunyu [01:03:31]: I think the practical reason it has become less used is because the agent kind of scaffold become more complex or the task you're trying to solve is becoming more complex. It's harder to annotate a few-shot examples, right? Like in the Chain of Thought era, she just write down three lines of things. It's very easy to write down a few-shot or whatever. But I feel like annotation difficulty has become harder.
Harrison [01:03:53]: I think also one of the reasons that I'm bullish on it is because I think it's a really good way to achieve kind of like personalization. Like if you can collect this through feedback automatically, you can then use that in the system at a user level or something like that. Again, the issue with that is more complex things that doesn't really work.
Shunyu [01:04:08]: It's probably more useful as like an automatic prompt, right? If you have some way to retrieve examples and put it in like automatic pipeline to prompt. But I feel like if you're manually writing now, I feel like more people will try to use zero-shot.
Swyx [01:04:22]: Yeah, but if you're doing a consumer product,
Harrison [01:04:24]: you're probably not going to ask user-facing people to write a prompt or something like that. But I think the thing that you brought up is also really relevant here where you can collect feedback from a user, but it's usually at the top level. And so then if you have three or four or five or however many LLM calls down below, how do you disperse that feedback to those? And I don't have an answer for that.
Alessio [01:04:45]: There's another super popular paper that you authored called Koala, Cognitive Architectures for Language Agents. I'm not sure if it's super popular.
Shunyu [01:04:52]: Well, I think I hear it.
Swyx [01:04:54]: People speak highly of it here within my circles. So shout out to Charles Fry who told me about it.
Harrison [01:04:59]: I think that was our most popular webinar we did on LinkedIn.
Shunyu [01:05:02]: I think Harrison promoted the paper a lot, thanks to him.
Swyx [01:05:06]: I'll read what you wrote in here and then you can just kind of go take it wherever. Koala organizes agents along three key dimensions. They're information storage, divided into working and long-term memories. They're action space, divided into internal and external actions. And they're decision-making procedure, which is structured as an interactive loop with planning and execution. By the way, I think your communication is very clear. So kudos on how you do these things. Take us through the sort of three components. And you also have like this development diagram, which I think is really cool. I think it's figure one on your paper for people reading along. Normally people have input, LLM, output. Then they develop into, all right, language agents that takes an action into environments and has observations. And then they go into this Koala architecture.
Shunyu [01:05:46]: Shout out to my co-first author, Ted, who made figure one.
Swyx [01:05:51]: Yeah.
Shunyu [01:05:51]: It's like, you know, figure is really good. You don't even need a color. You just, exactly. One of the motivation of Koala is we're seeing those agents become really complicated.
Swyx [01:06:01]: I think my personal philosophy
Shunyu [01:06:02]: is try to make things as simple as possible. But obviously this field has become more complex as a whole. And it's very hard to understand what's going on. And I think Koala provides a very good way to understand things in terms of those three dimensions. And I think they're pretty first principle because I think this idea of memory is pretty first principle. If you think about where memory, where information is stored. And you can even think of the ways of neural network as some kind of non-memory because that's also part of the information is stored. I think a very first principle way of thinking of agents is pretty much just a neural network plus the code to call and use the neural network. Obviously also maybe plus some vector store or whatever other memory modules, right? And thinking through that, then you immediately realize is that the kind of the non-term memory or the persistent information is first the neural network. And second, the code associated with the agent that calls the neural network and maybe also some other vector stores. But then there's obviously another kind of storage of information that's shorter horizon, right? Which is the context window or whatever episode that people are using. Like you're trying to solve this task, the information happens there. But once this task is solved, the information is gone, right? So I think it's very systematic and first principle to think about where information is and thinking, organizing them through categories and time horizon, right? So once you have those information stores, then obviously for agent, the next thing is what kind of action can you do? And that leads to the concept of action space, right? And I think one of the fundamental difference between language agents and the previous agents is that for traditional agents, if you think about Atari or video game, they only have like a predefined action space
Swyx [01:07:49]: by the environment.
Shunyu [01:07:49]: They only have external actions, right? Because they don't have complicated memory or information and kind of devices to do internal thinking. I think the contribution of React is just to point out that we can also have internal actions called thinking. And obviously if you have long-term memory, then you also have retrieval or writing or whatever. And then third, once you have those actions, which action should you do? That's the problem of decision-making. And the three parts should just fully describe an agent.
Swyx [01:08:17]: We solved it. We have defined agents. Yeah, it's done. Does anything that you normally say about agents not fit in that framework? Because you also get asked this question a lot.
Harrison [01:08:28]: I think it's very aligned. If we think about a lot of the stuff we do, I'm just thinking out loud now, but a lot of the stuff we do on agents now is through Langraff. Langraff, we would view as kind of the code part of what defines some of these things.
Shunyu [01:08:41]: It also defines part of the decision-making. Decision procedure.
Swyx [01:08:44]: That's what I was thinking, actually.
Harrison [01:08:46]: And actually one analogy that I like there is some of the code and part of Langraff. And I'm actually curious what you think about this. But sometimes I say that the LLMs aren't great at planning yet, so we can help them plan by telling them how to plan and code, because that's very explicit. And that's a good way of communicating how they should plan and stuff like that.
Shunyu [01:09:05]: What do you mean by that? Give them a DFS algorithm?
Harrison [01:09:08]: No, something much simpler. You could tell an agent in a prompt, hey, every time you do this, you need to also do this and make sure to check this. Or you could just put those as explicit checks in the decision-making procedure
Swyx [01:09:19]: or something like that.
Harrison [01:09:21]: And the more complex it gets, I think the more we see people encoding that in code. And another way that I say this is, all of life really is communication, right? So you can do that through prompts or you can do that through code. And code's great at communicating things.
Swyx [01:09:34]: It really is.
Shunyu [01:09:35]: Is this the most philosophical solution that we've ever had?
Swyx [01:09:37]: Okay, this is great.
Shunyu [01:09:38]: That's good, that's good.
Swyx [01:09:40]: We're talking about agents, you know?
Harrison [01:09:42]: I think the biggest thing that we're thinking a lot about is just the memory component. And we touched on it a little bit earlier in the episode, but I think it's still very unsolved. I think clearly semantic memory, episodic memory, or types of memory, I think, but where the boundaries are,
Swyx [01:09:57]: are there other types,
Harrison [01:09:58]: how to think about that. I think that to me is maybe one of the bigger unsolved things in terms of agents is just memory. Like what does memory even mean? That's another top high value question.
Swyx [01:10:08]: Is it a knowledge graph?
Shunyu [01:10:12]: I think that's one type of memory.
Swyx [01:10:14]: Yeah.
Harrison [01:10:15]: If you're using a knowledge graph as a hammer to hit a nail, it's not that. But I think practically what we see is it's still so application specific what relevant memory is. And that also makes it really tough to answer generically, like what is memory? So it could be a knowledge graph. It could also be, I don't know,
Swyx [01:10:33]: a list of instructions
Harrison [01:10:34]: that you just keep in a list.
Swyx [01:10:36]: Yeah.
Shunyu [01:10:36]: A meta point is I feel sometimes we underestimate some aspects where humans and agents are actually similar, and we overestimate sometimes. The difference is, I feel like, I mean, one point I think that's shared by agents and humans is we all have very different types of memories, right? Some people use Google Docs. Some people use Notion. Some people use paper and pen. You can argue those are different types of long-term memories for people, right? And each person develops its own way to maintain their long-term memory and diary or whatever. It's a very kind of individual kind of thing. And I feel like for agents, probably there's no single best solution. But what we can do is we can create as many good tools as possible, like Google Docs or Notion, equivalent of agent memory. And we should just give the choice to the agent, like what do you want to use? And through learning, they should be able to come up with their own way to use the memory.
Harrison [01:11:29]: Or give the choice to the developer who's building the agents. Because I think it also, that it might, it depends on the task. I think we want to control that one. Right now, I would agree with that for sure, because I think you need that level of control. I use linear for planning for code. I don't use that for my grocery list, right? Like depending on what I'm trying to do, I have different types of long-form memory.
Swyx [01:11:49]: Maybe if you tried, you would have a gorgeous kitchen.
Shunyu [01:11:52]: Do you think our tool making kind of progress is good or not good enough in terms of, you know, we have all sorts of different memory stores or retrieval methods or whatever?
Swyx [01:12:03]: On the memory front in particular,
Harrison [01:12:04]: I don't think it's very good. I think there's a lot to still be done.
Shunyu [01:12:07]: What do you think are lacking?
Swyx [01:12:09]: Yeah, you have a memory service. What's missing? The memory service we launched,
Harrison [01:12:12]: I don't think really found product market fit. I think like, I mean,
Swyx [01:12:16]: I think there's a bunch
Harrison [01:12:16]: of different types of memory. I'll probably write a blog. I mean, I have a blog that I published at some point on this. But I think like right off the bat, there's like procedural memory, which is like how you do things. I think this is basically episodic memory, like trajectories of correct things.
Swyx [01:12:30]: But there's also,
Harrison [01:12:31]: then I think a very different type is like personalization. Like I like Italian food.
Swyx [01:12:35]: It's kind of a semantic memory. That's kind of maybe like a system prompt. Yeah, exactly. Yeah, exactly.
Harrison [01:12:40]: It could be a semantic. It depends if it's semantic over like raw events or over reflections over events.
Shunyu [01:12:46]: Right. Again, a semantic procedure, whatever, is just like a categorization. What really matters is the implementation. And so one of the things
Harrison [01:12:51]: that we'll probably have released by the time this podcast comes out is right now in LineGraph, LineGraph is very stateful. You define a state for your graph. And basically a run of an agent operates on a thread. It's very similar to threads in OpenAI's Assistant API. But you can define the state however you want.
Swyx [01:13:07]: You can define whatever keys,
Harrison [01:13:08]: whatever values you want. Right now, they're all persistent for a single thread. We're going to add the ability to persist that between threads. So then if you basically want to scope a memory to a user ID or to an assistant or to an organization,
Swyx [01:13:21]: then you can do that.
Harrison [01:13:22]: And practically what that means is you can write to that channel
Swyx [01:13:25]: whatever you want,
Harrison [01:13:25]: and then that can be read in other threads. We're not making any kind of claims around what the shape of memory is, right? You can write what you want there. I still think it's so early on
Swyx [01:13:35]: and we see people needing
Harrison [01:13:36]: a lot of control over that. And so I think this is our current best thought.
Swyx [01:13:41]: This is what we're doing
Harrison [01:13:41]: around memory at the moment
Swyx [01:13:43]: is basically extending the state
Harrison [01:13:45]: to beyond a thread level. I feel like there's a trade-off
Shunyu [01:13:47]: between complexity and control, right? For example, Notion is more complex than Google Docs. But if you use it well, then it gives you more capability, right? And it's like a different tool might suit different applications or scenarios or whatever.
Swyx [01:14:01]: Yeah.
Shunyu [01:14:01]: We should make more good tools, I guess.
Swyx [01:14:04]: My quick take is when I started writing about the AI engineer, this was kind of vaguely in my head. But this is basically the job. Everything outside the LLM is the AI engineer that the researcher is not going to do.
Harrison [01:14:15]: This basically maps to LLM, LLMOS?
Swyx [01:14:18]: I would add in the code interpreter, the browser and the other stuff. But yeah, this is mostly it. I mean, those are the tools. Yeah.
Shunyu [01:14:27]: Those are the external environment, which is a small box at the bottom.
Swyx [01:14:30]: So then having this reasonable level of confidence that I know what things are, then I want to break it. I want to be like, OK, what's coming up that's going to blindside me completely? And it really is maybe like OmniModel where everything in, everything out. And does that change anything? If you scale up models 100 times more, does that change anything?
Shunyu [01:14:50]: That's actually a great, great question. I think that's actually the last paragraph of the paper that's talking about this. I also got asked this question when I was interviewing with OpenAI.
Swyx [01:15:01]: Please tell us how to pass OpenAI interviews.
Shunyu [01:15:05]: Is any of this still true if, you know...
Swyx [01:15:08]: If you 100x everything, yeah.
Shunyu [01:15:09]: If we make the model much better. My longer answer to this,
Swyx [01:15:13]: you should just refer to
Shunyu [01:15:13]: the last paragraph of the paper, which is like a more prepared, longer answer. I think the short answer is understanding is always better. It's like a way of understanding things. The thought experiment that I write at the end of the paper is, imagine you have GPT-10, which is really good. It doesn't even need a chain of thought, right? Just input, output. Just stick to stick, right? It doesn't even need to do browsing or whatever. Or maybe it still needs some tools. But let's say it's really powerful. Then I think, even in that point, I think something like Koala is still useful if we want to do some neuroscience on GPT-10. It's like kind of doing human kind of neuroscience, right? Which module actually correlates to-
Swyx [01:15:51]: You want it to be inspectable. Yeah, like you want to expect
Shunyu [01:15:53]: what is episodic memory? What is a decision-making module? What is the- It's kind of like dissecting the human brain, right? And you need some kind of prior kind of framework to help you do this kind of discovery.
Swyx [01:16:05]: Cool.
Alessio [01:16:05]: Just one thing I want to highlight from your work. We don't have to go into it. It's a Tau bench.
Swyx [01:16:10]: Oh, yeah. Which-
Shunyu [01:16:11]: We should definitely cover this.
Alessio [01:16:12]: Yeah, I'm a big fan of Simulative AI. We had a summer of Simulative AI. Another term we're trying to coin.
Swyx [01:16:17]: Hasn't stuck, but I'm going to keep at it.
Shunyu [01:16:20]: I'm really glad you covered my zero citation work. I'm really happy.
Swyx [01:16:23]: No, now it's one. Now it's one. First citation. It's me.
Alessio [01:16:28]: It's me right now.
Swyx [01:16:29]: We just cited it here.
Alessio [01:16:30]: So that counts.
Shunyu [01:16:31]: Does it show on Google Story?
Alessio [01:16:33]: We'll write a paper about this episode.
Swyx [01:16:35]: One citation. One citation. Let's go.
Shunyu [01:16:38]: Last time I checked, it's still zero.
Alessio [01:16:40]: It's awesome. Okay. This one was funny because you have agents interact with like LM simulated person. So it's like actually just another agent.
Swyx [01:16:49]: Right. Right?
Alessio [01:16:49]: So it's like agents simulating with other agents. This has always been my thing with startups doing agents. I'm like, one day there's going to be training grounds for companies to train agents that they hire. Actually, Singapore is the first country to build the cyber range for cyber attack training. And I think you'll see more of that. So what was the inspiration there? Most of these models are bad at it,
Swyx [01:17:11]: which is great.
Alessio [01:17:11]: You know, we have some room for, I think the best model is 4.0 at like 48% average. So there's a lot of room to go.
Swyx [01:17:19]: Yeah.
Alessio [01:17:19]: Any fun stories from their directions that you hope that people take?
Swyx [01:17:23]: Yeah.
Shunyu [01:17:23]: First, I think shout out to Ciara, which is this very good startup, which was founded by Brad Taylor and Clay Barber. And Ciara is a startup doing conversational AI. So what they do is they they build agents for businesses. Like suppose you have a business and you have a customer service. We want to automate that part. And then it becomes very interesting because it's very different from coding a web agent or whatever people are doing, because it's more about how can you do simple things reliably? It's not about, you know, can you sample a hundred times and you find one good mass proof or kill solution. It's more about you chat with a hundred different users on very simple things. Can you be robust to solve like 99% of the time, right? And then we find there's no really good benchmark around this. So that's one thing. I guess another thing is obviously this kind of customer service kind of domain. Previously, there are some benchmarks, but they all have their limitations. And I think you want the task to be kind of hard and you want user simulation to be real. We don't have that until LLM. So data sets from 10 years ago, like either just have trajectories conversating with humans or they have very fake kind of simulators. I think right now it's a good opportunity to, if you really just care about this task of customer service, then it's a good opportunity because now you have LLMs to simulate humans. But I think a more general motivation is we don't have enough agent benchmarks that target this kind of robustness, reliability kind of standpoint. It's more about, you know, code or web. So this is a very good addition to the landscape.
Alessio [01:18:57]: If you have a model that can simulate the persona, like the user the right way, shouldn't the model also be able to accomplish the task, right? If he has the knowledge of like what the person will want, then it means...
Swyx [01:19:09]: This is a great question.
Shunyu [01:19:09]: I think it really stems from like asymmetry of information, right? Because if you think about the customer service agent, it has information you cannot access, right? Like the APIs it could call or, you know, the policies of internal company policy, whatever. And that, I think, very interesting for TopEng is like it's kind of okay for the user to be kind of stupid. So you can imagine like there are failure cases, right? But I think in our case, as long as the user specifies the need very clearly, then it's up to the agent to figure out, for example, what is the second cheapest flight from this to that under that constraint, very complicated reasoning Like we shouldn't require users to be able to solve those things. They should just be able to clearly express their need. But then if the task failed, then it's up to the agent. That makes the evaluation much easier.
Alessio [01:19:59]: Awesome. Anything else? I have one last question
Shunyu [01:20:01]: for Harrison, actually.
Harrison [01:20:03]: No, that's not this podcast.
Shunyu [01:20:07]: I mean, there are a lot of questions
Swyx [01:20:09]: around AI right now,
Shunyu [01:20:09]: but I feel like perhaps the biggest question is application. Because if we have great application, we have super app, whatever, that keeps the whole thing going, right? Obviously, we have problems with infra, with chip, with transformer, with whatever, S4, a lot of stuff. But I do think the biggest question is application. I'm curious, from your perspective, is there any things that are actually already kind of working but people don't know enough? Is there any promising application that you're seeing so far?
Harrison [01:20:37]: Okay, so I think one big area where there's clearly been success is in customer support. Both companies doing that as a service, but also larger enterprises
Swyx [01:20:47]: doing that and building
Harrison [01:20:47]: that functionality inside. There's a bunch of people doing coding stuff. We've already talked about that. I think that's a little bit...
Swyx [01:20:56]: I wouldn't say that's a success yet,
Harrison [01:20:57]: but there's a lot of excitement and stuff there. One thing that I've seen more of recently, I guess the general category would be research-style agents. Specific things recently would be... I've seen a few AISDR companies pop up where they basically do some data enrichment. They get a company name. They go out, find funding.
Swyx [01:21:18]: What is SDR? Sales Development Rep. It's an entry-level job title in B2B SaaS. Yeah, so... I don't know why I noticed this. You were very quick on that.
Alessio [01:21:27]: The PhD mind cannot comprehend.
Harrison [01:21:30]: And so I'd classify that under the general area of research-style agents. I think legal falls in this as well. I think legal is a pretty good domain
Swyx [01:21:42]: for this.
Shunyu [01:21:43]: I wonder how good Harvey is doing.
Swyx [01:21:46]: There was some debate, but they raised a lot of money. So who knows?
Harrison [01:21:50]: I'd say those are... Those are a few of the categories
Swyx [01:21:53]: that jumped to mind.
Shunyu [01:21:53]: Entry-type kind of research.
Harrison [01:21:55]: On the topic of applications though,
Swyx [01:21:57]: the thing that I think
Harrison [01:21:57]: is most interesting in this space right now is probably all the UXs around these apps and the different things besides chat that might come out. I think two that I'm really interested in. One, for the idea of this AISDR. I've seen a bunch of them do it a spreadsheet-style view, where you have 10 different companies or hundreds of different companies and five different attributes you want to run up and then each cell is an agent.
Shunyu [01:22:21]: The good thing about this is you can already use the first couple of rows of spreadsheets as a few-shot example. There's so many good things about it.
Harrison [01:22:27]: Yeah, you can test it out on a few. It's a great way for humans to run things in batch,
Swyx [01:22:32]: which I don't...
Harrison [01:22:32]: It's a great interface for that.
Swyx [01:22:34]: It's still kind of elusive
Shunyu [01:22:35]: to do this PhD kind of research, but I think those entry-type research where it's more repetitive
Swyx [01:22:41]: it should be more automated.
Harrison [01:22:42]: And then the other UX I'm really, really interested in is when you have agents running in the background, ambient-style agents, how can they reach out to you? So I think, as an example of this, I have an email assistant that runs in the background. It triages all my emails and it tries to respond to them. And then when it needs my input, do you want to do this podcast? It reaches out to me.
Swyx [01:23:02]: It sends me a message. Oh, you have it? It is live? Yeah, yeah, yeah. Thank you, agent. I use it for all my emails. Thank you, agent. Well, we did Twitter.
Harrison [01:23:08]: I don't have a company.
Shunyu [01:23:09]: Did you write it with LengChain?
Swyx [01:23:11]: Yeah, LengGraph. We'll open source it at some point.
Shunyu [01:23:13]: LengGraph or LengChain?
Swyx [01:23:15]: Yeah, yeah, yeah. I wonder. Both. Yeah. Both.
Harrison [01:23:17]: So at this point, LengGraph for the orchestration, LengChain for the integrations with the different models.
Shunyu [01:23:23]: I'm curious how the low-code kind of direction is going right now. Are people...
Swyx [01:23:27]: We talked about this. Oh, sorry. It's not low-code.
Harrison [01:23:29]: LengGraph is not low-code.
Swyx [01:23:31]: You can cut this out.
Shunyu [01:23:32]: No, no, no, no.
Swyx [01:23:34]: People will tune in just for this. Well, it actually has to do
Harrison [01:23:37]: with UXs as well. Probably sums back to this idea of, I think, what it means to build with AI is changing. I still really, really strongly believe that developers will be a core kind of like part of this, largely because we see you need a lot of control
Swyx [01:23:51]: over these agents
Harrison [01:23:51]: to get them to work reliably. But there's also very clearly components
Swyx [01:23:55]: that you don't need to be a developer
Harrison [01:23:56]: for prompting is kind of like the most obvious one.
Swyx [01:23:59]: With LengGraph,
Harrison [01:24:00]: one of the things that we added recently was like a LengGraph studio.
Swyx [01:24:04]: So we called it kind of like
Harrison [01:24:05]: an IDE for agents. You point it to your code file, where you have your graph defined in code.
Swyx [01:24:10]: It spins up a representation
Harrison [01:24:11]: of the graph. You can interact with it there. You can test it out. We've hooked it up to kind of
Swyx [01:24:15]: like a persistence layer
Harrison [01:24:16]: so you can do time travel stuff, which I think is another really cool UX that I first saw in Devon.
Swyx [01:24:22]: Devon's time travel is good. The UX for Devon in general,
Harrison [01:24:24]: I think you said it, but that was the novel. That was the best part. But to the low-code, no-code part, the way that I think about it is you probably want to have your cognitive architecture
Swyx [01:24:35]: defined in code.
Harrison [01:24:36]: Decision-making procedure.
Shunyu [01:24:37]: Yes.
Harrison [01:24:38]: But then there's parts within that that are prompts or maybe configuration options like something to do with drag or something like that. We've seen that be a popular configuration option.
Shunyu [01:24:48]: So is it useful for programmers more or is it for people who cannot program? I guess if you cannot program,
Swyx [01:24:54]: it's still very complicated for them. It's useful for both.
Harrison [01:24:56]: I think we see it being useful for developers right now, but then we also see... There's often teams building this, right? It's not one person. And so I think there's this handoff where the engineer might define the cognitive architecture. They might do some initial prompt engineering.
Shunyu [01:25:08]: It's easier to communicate to the product manager.
Swyx [01:25:10]: It's easier to show them what's going on
Harrison [01:25:11]: and it's easier to let them control it. And maybe they're doing the prompting. And so, yeah, I think what the TLDR is, what it means to build is changing. And also UX in general is interesting, whether it's for how to build these agents or for how to use them as end consumers. And there might also be overlap as well. And it's so early on
Swyx [01:25:30]: and no one knows anything,
Harrison [01:25:30]: but I think UX is one of the most exciting spaces to be innovating in right now.
Swyx [01:25:34]: Let's do ACI. Yeah.
Shunyu [01:25:36]: Okay.
Swyx [01:25:37]: That's another theme that we cover on the pod. We had the first AI UX meetup and we're trying to get that going. It's not a job. It's just people just tinkering.
Alessio [01:25:47]: Well, thank you guys so much.
Swyx [01:25:49]: Yeah, it was amazing. Karrison, you're amazing as a co-host. We'd love to have you back.
Harrison [01:25:54]: I just tried it. I listened to you guys for inspiration.
Swyx [01:25:58]: It's actually really scary to have you as a listener because I don't want to misrepresent. Like I talk about 100 companies, right? And God forbid I get one of them wrong. I'm sure all of them listen as well, not to add pressure. Thank you so much. It was a pleasure to have you on. And you had one of the most impactful PhDs in this sort of AI wave. So I don't know how you do it, but I'm excited to see what you do at OpenAI. Thank you.

Get full access to Latent Space at www.latent.space/subscribe
- Luisteren Nogmaals beluisteren Doorgaan Wordt afgespeeld...
- Later beluisteren Later beluisteren
Laat meer zien

Afleveringen

2024 in Agents [LS Live! @ NeurIPS 2024]

2024 in Synthetic Data and Smol Models [LS Live @ NeurIPS]

2024 in Post-Transformers Architectures (State Space Models, RWKV) [LS Live @ NeurIPS]

2024 in Open Models [LS Live @ NeurIPS]

2024 in Vision [LS Live @ NeurIPS]

2024 in AI Startups [LS Live @ NeurIPS]

Windsurf: The Enterprise AI IDE - with Varun and Anshul of Codeium AI

Generative Video WorldSim, Diffusion, Vision, Reinforcement Learning and Robotics — ICML 2024 Part 1

Bolt.new, Flow Engineering for Code Agents, and >$8m ARR in 2 months as a Claude Wrapper

The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic

Why Compound AI + Open Source will beat Closed AI

Agents @ Work: Lindy.ai

Agents @ Work: Dust.tt

In the Arena: How LMSys changed LLM Benchmarking Forever

How NotebookLM Was Made

Building the AI Engineer Nation — with Josephine Teo, Minister of Digital Development and Information, Singapore

Building the Silicon Brain - with Drew Houston of Dropbox

Production AI Engineering starts with Evals — with Ankur Goyal of Braintrust

Building AGI in Real Time (OpenAI Dev Day 2024)

Language Agents: From Reasoning to Acting