Episodes
-
As I navigate my career change after Ai2, I wanted to share my views of how this blog relates to my missions and broader work. In my farewell post, I summarized my three goals right now as:
* Provide clarity in the evolution of frontier models.
* Create a vibrant and diverse open (model) ecosystem.
* To build institutions that make these goals possible.
Within this, Interconnects is at its core a bit different than many of the highly-polished, professional newsletters on this platform, and this is becoming intentional.
How Interconnects fits into my career goals
Interconnects is the tip of the spear of all of my missions in AI. It is meant to start a conversation and to let the reader into the mind of someone at the frontier. This insight makes the writing sometimes a bit raw, sometimes a bit too technical, but it is the map of how I progress my thinking in the ever changing world.
This style of writing has helped me create very strong relationships with the core group of readers, many of whom listen to the voiceovers I do for these posts. The plan is to keep operating and refining the Interconnects experience around those loyal fans. These are in large part people building the frontier AI ecosystem: researchers at labs, top investors, policymakers obsessed with the frontier, and students aspiring to have one of those roles.
I'm very happy with this sort of raw, high-voice outcome for the blog. It is not something I sought out, but rather accepted as I saw it coming and realized it would be disproportionately successful in a near-future of vast AI slop media. With years of trying to squeeze writing into a busy schedule, the only sort of writing I had time for was that which had a style very closely matching how I think.
I'm also very happy to be an independent voice. As a person I don't do well with some power structures like having a boss, and I think there are very few people without extreme financial conflicts of interest that are willing and allowed to write. Through a wide job search, few companies were genuinely excited about me continuing writing.
Over the past few months, I considered taking Interconnects in more of a direction like SemiAnalysis or Stratechery, where it is my full-time gig and number one priority, but it didn't seem like the right fit for what I am trying to achieve. I'm trying to build an open ecosystem and a movement for true open-science at the frontier of AI. These areas are very narrowly populated and trying to influence them with only commentary, analysis, and related research products wouldn't work for me.
These sorts of full-time outcomes are definitely still one of my dreams, and I will do it at some point. The dream of this is also one of the reasons I take conflicts of interest seriously. Though, in this era of AI I can't be fully on the outside.
In this vein, I wanted to disclose two advising agreements I recently signed. I don't view them as a compromise of the above independence, as I'll happily quit if I feel like I can't speak my mind, but as a form of support in accomplishing my missions.
If I want to make a true open-science ecosystem I have some catching up to do with how the frontier labs approach post-training. The two companies I'm advising, whose leadership I've become friends with, are Arcee AI and Mercor. Arcee should be fairly obvious as the no-nonsense player building open-weight models. Mercor will make more sense over time, but they're a close ally to a lot of my goals in transparent evaluations, open post-training, and neutrality with respect to the leading labs. These advising agreements are based on me wanting to learn more, and I don't suspect I will ever engage in the very cursory advising roles that are more of name-stamping.
I keep an up-to-date disclosures statement at the end of the Interconnects about page: https://www.interconnects.ai/about.
Otherwise, my full-time job should still be in the non-profit sector as long as I get the next few months of logistics right.
Interconnects AI is a reader-supported publication. Consider becoming a subscriber.
Some operations & audience notes
Interconnects has cultivated an excellent, niche, and largely technical audience with representatives of all the top companies and labs (recently crossed 70K subscribers). I intend to protect this niche audience rather than trying to expand to bigger pastures. I think this success in audience alignment is reflected in my ~900 paid subscribers supporting it with infrequent paywalled content. I appreciate the support greatly, as the money has let me expand Interconnects operations and quality over the last 18 months.
I created Interconnects AI, LLC last January along with business bank accounts. Since then I've made some money, but I've reinvested it (and more) back into the business and the various AI services I need to try to write these articles. So, at this moment going full-time on Interconnects is a pretty risky financial proposition for me. In fact the Interconnects bank account has hovered around $0 for months (I'm personally fine having another job). This made me hesitate in going all-in on it, but in reflections I concluded that I would have more impact in AI by building these systems than focusing on commentary.
Second, as AI services get more expensive (e.g. Fable becoming API only), I'm going to need to spend more out of pocket to make this happen. I'm happy to do this in the near term, but I'm starting to optimize the blog to have more consistent financial growth, so when I want to go all in on writing in a few years I have a safety net.
I don't do special offers, free trials, etc. for Interconnects paid subscribers (mostly to mitigate noise in the Discord community), but if you have the means to support this project it would mean a lot to me as I center my career around it. Joining a lab or a well-paying startup would be a much simpler path for me and my family but it's never felt like the right thing to do.
I have a very arbitrary goal of reaching the 1000 paid subscribers orange checkmark on Substack this summer. So you can help and/or just watch my attempts to make it happen.
In this vein, I wanted to be direct in sharing how I view a few core operational components of Interconnects, and what you can expect going forward.
* All comments will be paywalled. Whenever I have a popular post without paywalled comments I get a flood of low-quality posts, many of which are obviously AI generated. This is a detriment to the highly selective audience we've built. If Substack supports a feature like "only users with a paid subscription somewhere on the platform can engage," I'd implement it. The blog comments, Substack chat, and Discord will be spaces where I perform active curation to maintain a 0% AI slop rate.
* Slightly more articles will be paywalled. I want to keep experimenting with what is the right way to do this, but the only metric I can rely on for increasing influence of the blog is revenue. Views, likes, etc. are all vanity metrics which don't reliably measure this type of content. Cultivating a highly engaged audience is existential to me in attempting to maintain an AGI-proof expertise.
* Slightly more in-person events. With a small community that I respect, I have the opportunity to translate that to excellent real-world experiences. I expect to keep these small, but I want to be more proactive at organizing them so loyal readers know what to expect. The few coming soonest will be for my book launch, which should be in the next month or two. Plus, I know people always want to meet likeminded folks in AI!
Together these should make it easier and more enjoyable to be a loyal fan for Interconnects. I'm looking forward to continuing convincing my fans that the support is worthwhile.
Thanks for reading! My career wouldn't be possible without all of the support.
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe -
As I've been recapping fundamentals of post-training to wrap up my RLHF / Post-training book I knew I needed to get Finbarr Timbers back on the podcast to talk about the state of play. Over the last few months we've had many discussions on what we'd need to do to take an Olmo-style recipe to the frontier, supported by Finbarr's extensive reading of recent model technical reports.
To prepare for this, I put together a summary slide deck on the key post-training recipes, both historically (the path from InstructGPT to today) and today (the key open frontier models). This deck is summarized below as the technical summary, but we do spend 20-35 minutes on it in the podcast, so watching on YouTube is likely the best experience for this one.
I previously interviewed Finbarr in December of 2024, shortly after the release of o1 and Tülu 3 (and before he joined Ai2) on the "We are so back" era of RL.
Chapters:
* 00:00 Introduction & Olmo reflections
* 06:28 Post-train recipes review (history)
* 23:00 2026's model recipes (MiMo Flash, DeepSeek V4, GLM 5, Kimi K2.6, etc.)
* 39:05 Open-ended post-training discussions
* 48:22 Career advice in the LLM race
Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other Interconnects interviews, go here.
For more educational post-training videos, see the course I'm putting together.
Technical Summary
These are notes cleaned up from a slide-deck created with AI assistance, mostly useful as a discussion topic and reference.
The shape of a post-training recipe has changed more in the last year than in the prior three.
* 2022–2023 (InstructGPT): one pipeline, SFT → reward model → RL.
* 2024 (Llama 3, Tülu 3, etc.): open recipes formalize SFT → DPO → RL with verifiable rewards. Closed recipes use many stages of RLHF.
* 2025 (DeepSeek R1): reasoning RL (R1) makes large-scale RL the centerpiece.
* 2026 (MiMo Flash V2): recipes fragment into many specialist models that are merged back into one.
The new thing: MOPD
Multi-teacher On-Policy Distillation (MOPD) is the pattern showing up across the 2026 frontier.
* Train N domain-specialist teachers (each: SFT, then RL on the relevant domains).
* Train one general student by sampling its own trajectories (this is the final post-trained model).
* On each rollout, minimize reverse-KL to the relevant teacher's output distribution, token by token.
Lineage: MiMo Flash v2 introduced it → DeepSeek V4 & Nemotron 3 Ultra scale it to >10 teachers.
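As a rough illustration of the per-token objective in the steps above, here is a minimal Python sketch. It assumes the student and every teacher expose logits over a shared vocabulary at each position of the student's own rollout. The names `reverse_kl`, `mopd_loss`, and the `teacher_logits_by_domain` routing table are hypothetical and are not taken from any of the cited reports.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    # Expectation is taken under the student, which is what makes it "reverse" KL.
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_p_s, log_p_t))

def mopd_loss(domain, student_logits_per_token, teacher_logits_by_domain):
    """Mean per-token reverse KL against the teacher chosen for this rollout.

    Assumption: the rollout was sampled from the student, and the teacher
    logits were computed on that same student-generated sequence.
    """
    teacher_logits_per_token = teacher_logits_by_domain[domain]
    per_token = [
        reverse_kl(s, t)
        for s, t in zip(student_logits_per_token, teacher_logits_per_token)
    ]
    return sum(per_token) / len(per_token)
```

Routing by a single domain label is the simplest possible choice for a sketch; how real systems pick the teacher per prompt is not specified here.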
Why did MOPD emerge?
* RL got expensive and conflict-prone. Mixing math, code, and agentic RL in one run eventually trades capabilities off against each other.
* Specialists are cheap to make / organizationally scalable. SFT-then-RL on a single domain is well understood and parallelizable. As post-training becomes more complex, scaling it across organizations is a big win.
* On-policy distillation matured. Literature and know-how continued to emerge through the RLVR renaissance.
Sources: DeepSeek V4 §5.1, MiMo-V2-Flash
Key historical recipes
InstructGPT (Mar. 2022): the canonical 3 steps · paper
* SFT on human demonstrations
* Reward model trained on human comparisons
* PPO against the reward model
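For the reward-model step, a common way to train on human comparisons is a pairwise (Bradley-Terry style) loss that pushes the score of the preferred response above the rejected one. The sketch below is illustrative only: `pairwise_rm_loss` is a hypothetical name, and the scalar rewards stand in for the outputs of a real reward model.

```python
import math

def pairwise_rm_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)) for one human comparison.

    Computed as softplus(-(r_chosen - r_rejected)) in a numerically stable form.
    """
    diff = reward_chosen - reward_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

The loss is log(2) when the two scores are equal and shrinks toward zero as the chosen response is scored increasingly higher than the rejected one.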
Llama 2 (Jul. 2023): multi-stage RLHF · paper · interconnects recap
* SFT, then iterative RLHF over multiple rounds
* Each round: rejection sampling → PPO
* Two reward models: separate helpfulness and safety
Llama 3 (Jul. 2024): a complex multi-stage recipe with simpler optimizers · paper · interconnects recap
* Per round: reward model → sample K per prompt → rejection sampling → SFT → DPO
* No online RL; the RM only filters. Run over 6 rounds, best models seed the next
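The "sample K per prompt, then filter with the reward model" step can be sketched as a best-of-K selection. This is a simplified illustration, not the Llama pipeline: `generate` and `reward_model` are hypothetical callables supplied by the caller, and keeping only the single top-scoring sample is an assumption.

```python
def rejection_sample(prompts, generate, reward_model, k=4):
    """Build SFT pairs by keeping the highest-reward sample per prompt.

    generate(prompt) -> str and reward_model(prompt, completion) -> float
    are placeholders for a policy sampler and a trained reward model.
    """
    sft_pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        best = max(candidates, key=lambda c: reward_model(prompt, c))
        sft_pairs.append((prompt, best))
    return sft_pairs
```

The reward model never produces a gradient for the policy here; it only decides which samples become supervised training data.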
Tülu 3 (Nov. 2024): simple three-stage post-training · paper · interconnects recap
Curated prompts → SFT → DPO → RLVR (RL with verifiable rewards; the acronym was coined in this paper).
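For readers less familiar with the DPO stage of this recipe, here is a minimal sketch of the standard DPO loss on a single preference pair. The inputs are summed sequence log-probabilities under the policy and a frozen reference model; the function name and the default `beta=0.1` are illustrative assumptions, not values from the Tülu 3 paper.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * margin)) for one (chosen, rejected) pair.

    margin compares how much more the policy prefers the chosen response
    over the rejected one, relative to the reference model.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference the loss is log(2); it decreases as the policy shifts probability toward the chosen response.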
OLMo 3 (Dec. 2025): a reasoning update to the Tülu 3 recipe · paper · interconnects recap
DeepSeek R1 (Jan. 2025): RL as the centerpiece · paper · interconnects recap
The recipe:
* R1-Zero: pure RL (GRPO) on the base, no SFT; used to seed reasoning behaviors for the full run, not a separate product
* R1: cold-start SFT → reasoning RL → rejection-sampling SFT → final RL → distill to dense
* A big change in recipes: Large-scale RLVR as the primary driver, SFT to distill and refine RL behaviors
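To make the GRPO-with-verifiable-rewards idea concrete, the sketch below scores a group of sampled answers to one prompt with a binary check and converts the rewards into group-relative advantages. It is a simplification under stated assumptions: exact string match stands in for a real verifier, and the population standard deviation and `eps` value are choices made for this example.

```python
import statistics

def verifiable_reward(model_answer, reference_answer):
    """1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def grpo_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of samples.

    Each sample's advantage is its reward minus the group mean, divided by
    the group standard deviation, so no learned value function is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

If every sample in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason prompt difficulty matters in these runs.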
DeepSeek evolution after V3
* V3 · Dec '24: SFT + GRPO RL.
* R1 · Jan '25: multi-stage RL; reasoning emerges.
* V3.1 · Aug '25: hybrid think / non-think in one model.
* V3.2 · Dec '25: 6 specialists via RL → SFT distillation → one mixed GRPO.
* V4 · Apr '26: 10+ domain experts → MOPD.
2026 style recipes!
MiMo Flash v2 (Jan. 2026): where MOPD started · paper
Stages: Stage 1 SFT → Stage 2 train ~6 domain-specialist teachers (with older style post-training recipes) → Stage 3 MOPD into a single student.
First clean articulation of multi-teacher on-policy distillation as the consolidation step; it replaces a single monolithic RL stage with distill-from-specialists.
Nemotron 3 Ultra (Jun. 2026): two rounds, many teachers · paper
Stages: SFT → multi-teacher on-policy distillation, run over two iterations, with >10 teachers spanning reasoning, code, math, and agentic domains.
Novel: multi-round MOPD across different domains (distill, then re-distill from refreshed teachers).
MAI-Thinking-1 (Jun. 2026): closer to R1 than V4 · announcement
Stages: mid-trained base → 3 specialist RL "climbs" (e.g. STEM) → trace-distillation SFT to consolidate the climbs → a final RL climb → MAI-Thinking-1.
Closer to DeepSeek R1 than to V4: multi-stage RL with trace-distillation SFT to consolidate, not on-policy MOPD. Not the only lab without MOPD!
Kimi K2.5 (Jan. 2026): agentic, multimodal · paper · blog
Stages: text-only SFT → joint text-vision RL across coding, vision, reasoning, agentic tasks. (No mention of MOPD.)
GLM-5 (Feb. 2026): staged RL by capability · paper
Stages: Base → SFT → Reasoning RL → Agentic RL → General RL.
Transcript
00:00:00 Nathan Lambert: Hello, we are back on an Interconnects conversation. I don't really say I do interviews. People criticize me 'cause I interrupt the guests too much. 'Cause I'm not a good interviewer, but I'm here to entertain people. Um, this is also fun for me because I'm trying to make, like, a post-training course, and it kind of fits as, uh, in the advanced end of this.
So it's kind of a crossover between Interconnects content and other stuff that I've been spending my time on this summer. So I'm happy to welcome Finbarr back. I think... Are you the first return guest? I haven't checked.
00:00:37 Finbarr Timbers: Oh, wow.
00:00:37 Nathan Lambert: Um, Finbarr and I worked on this sort of post-training recipe stuff for a while at AI2. Um, I left recently. This is one of Finbarr's last days at AI2. It's already been announced. It's not a spoiler here. So we're gonna kind of reflect on some things on building post-training recipes for Olmo. Um, then we have a little, like, review slide deck and notes on the kind of state and evolution of frontier post-training recipes over time, which is pretty interesting because there's, what is it, like two to four kind of canonical recipes that there has been.
So it's kind of interesting when you see the field converge on something new, which it's doing right now with multi-teacher on-policy distillation. For some reason, that's a bit of a mouthful. It is a long acronym. And then we'll just kind of end with various discussion points on post-training and what we're up to. So, happy to give you the floor if you have any hot takes you wanna start with to get people to, draw people in. Otherwise, I think, uh, I'm excited to kind of reflect on this, 'cause I know you've been reading a ton of papers recently and kind of prep, laying some of this groundwork.
00:01:43 Finbarr Timbers: Well, yeah. I mean, today is my last day at AI2, so it- it's ki- it feels very appropriate to be, to be talking to you as you're the one who recruited me to AI2. So, uh, yeah, that's pretty special, and it's great to be, uh, yeah, the, the first repeat guest. I feel honored, uh, to be back on. So yeah, thanks, uh, for having me.
00:02:03 Nathan Lambert: Yeah. Do we wanna start with Olmo? I think that-
00:02:05 Finbarr Timbers: Sure
00:02:06 Nathan Lambert: ... people... I think I, uh, need to do this carefully, but I've talked about Olmo 3's post-training many times to people. I haven't done this in a very direct way on the podcast, but I would say that post-training Olmo 3 to make this reasoning model was a major accomplishment for many individuals to do this. But also, the complexity of what we were doing was pushing against the limits of AI2's organizational capacity, and a lot of modern post-training is, like, your ability to wrangle compute data into a work stream.
And in order to do that in a complicated way, you really are wrangling an org chart. And that's like part of why it's like Olmo 3 was, by its nature, pretty late as a reasoning model. It was, like, a pretty rigid reasoning model, and that's, like, partially reflected in the recipe being pretty simple. But then when you, like, compare it to all these new recipes with tool use and multi-teacher distillation and all of this, it's just like a, a, a fork in the road where it's like you could do this very simple thing and make a strong recipe, but it is not representative of what all the frontier labs are doing.
And I think that that kind of fork in being able to say that things are similar happened kind of after Tülu 3, where Tülu 3, I think, was also much simpler with this three-stage SFT, DPO, RL recipe. But that simpler recipe was probably closer in outcome to what the labs are doing, but now doing that sort of three-stage recipe for a reasoning model, and especially a tool use, like, agent model, just doesn't really apply. And that's the point. That's why I think the point of this podcast is to be like, what are the, what are the way, what are they doing to make these, like, true frontier models, and then shed some light on how it contrasts to the more a- like, open academic ones.
00:03:56 Finbarr Timbers: Well, actually, I think that's interesting. What was the proce- so, you know, I, I only, um, came around for Olmo 3. I wasn't around for the earlier, um, versions. What was the process like to go from Tülu 3 to Olmo 2? Because, like, y- just looking on, on arXiv, um, I think Tülu 3 came out in November of '24, and then Olmo 2 came out in December of, of '24.
00:04:22 Nathan Lambert: We just applied the recipe.
00:04:24 Finbarr Timbers: Yeah. I, I mean, so, so I think that actually, like, yeah, and then, you know, um, DeepSeek R1 came out in January, end of January '25, and, you know, Olmo 3 was then released in October. Was it October or November of '25? Like, I think-
00:04:39 Nathan Lambert: I think November.
00:04:41 Finbarr Timbers: Yeah, November. Yeah, right. It was November. So it's-
00:04:43 Nathan Lambert: It was like do or die with Thanksgiving.
00:04:45 Finbarr Timbers: I remember that. Uh, yeah, 'cause Canadian Thanksgiving had, had already happened-
00:04:50 Nathan Lambert: Yeah
00:04:50 Finbarr Timbers: ... which, yeah, I was happy. Um, but, uh, like, like I think it was, sure, maybe it was late, but I think it was only late by a few months. Like, it's, it's actually, like, you know, if I think of my past experience with model turnaround times, like a nine-month model turnaround, you know, from R1 coming out, like that's actually, that's not bad. I think, you know, something like six months would've been nicer, but-
00:05:12 Nathan Lambert: I, I think it's slow 'cause we didn't re- it would be fast if we had rebuilt the R1 recipe. But what we did was we, like, ported reasoning into our existing recipe-
00:05:21 Finbarr Timbers: Yeah. Okay
00:05:22 Nathan Lambert: ... which is a simpler task, but has, like, a lower ceiling, in my opinion. Where it's like the DeepSeek and the newer style recipes, which we'll talk about, I think they just have a much higher ceiling in how much you can keep hill climbing them. Or they're just, like, more prescri- more pedagogical of what the frontier is doing. Like, for the size models that Olmo was, which was like 7 to 30B, I'm not sure that doing this DeepSeek style RL first recipe is actually useful.
00:05:52 Finbarr Timbers: Uh, well, I, yeah, I think that's a good point. And I mean, I think that's really reflected in what we see in the research where you s- you know, you obviously you see the big, uh, the step change and you know how quickly things are improving when, you know, R1 comes out. So, like, I think that's a great point, and it really does seem to saturate, or to, to not saturate, sorry, with, with compute. Um-
00:06:11 Nathan Lambert: Yeah. Um, shall we just do the slide deck? We're throwing around, like, recipe-
00:06:15 Finbarr Timbers: Sure. Yeah, letās do it
00:06:16 Nathan Lambert: ... names. Like, I feel like it might be useful to just do it because a lot of people probably want to follow but don't exactly know. I'm, I'm gonna share, I'm gonna share a screen. So people listening, it might be useful to either, you can pull this slide deck up on your phone and click through it. It's not super information dense, but you can also just watch it on YouTube. All of this will be linked.
Generally, this is just like a quick survey on how frontier recipes have evolved. We'll go through the history quickly and then talk about what is currently happening and kind of probably interleave the Olmo discussion we were having. Uh, okay. There's a bunch of canonical recipes we'll talk about. This is where I got the two to four number. I think the recipes are like InstructGPT, which is what coined the initial RLHF with this like three-stage idea, which took a while to get people to move on from, which was like SFT reward model and RL.
And I see as like Llama 3 and Tülu 3 as kind of practical implementations of that with, with other tricks of the trade. So those two could potentially be merged together. It's just like kind of pre- and post-ChatGPT moment. And then the two most recent canonical recipes that we'll cover in this I would say are like DeepSeek-R1, which is the shift to doing like reasoning focused and bigger RL stages than this kind of SFT focus from before, and then MiMo Flash and some of the new models from 2026 which add this distillation element.
00:07:42 Finbarr Timbers: Well, and, and I think it's worth pointing out too that it's not just MiMo Flash, like it was kind of a consistent theme. Like you saw this with DeepSeek, th-they referenced it in, uh, the V3 paper and then it's, you know, it's Kimi K2.5, it's GLM 5. Like it's all of these papers, you know, start talking about this specialist, um, RL stage.
00:08:03 Nathan Lambert: Yeah. I think there's a debate on how we draw it and whether or not distillation is... If you're, if you have distillation as a technique, as a key milestone, then they were, the Xiaomi was the first and, but it's kind of a march over time where you kind of see them change, and we'll, we'll go through this. I don't, I don't need to interrupt.
00:08:23 Finbarr Timbers: When you say distillation, I do think it's important to distinguish between the straight up like, you know, distillation of the leading closed models and, you know, distillation of these domain specific models where, you know, I, I, I suspect that the, you know, the, the Chinese labs are doing both.
00:08:41 Nathan Lambert: Yeah.
00:08:41 Finbarr Timbers: But, you know, a lot of what they're do, you know, but a, a lot of what they're doing is this, um, training these domain specific models like, you know, a math model, a coding model, uh, you know, logic model, whatever, and then distilling those models back in and not just distilling from... So when we're talking about distillation, it's not just distilling from the leading closed models.
00:09:01 Nathan Lambert: Yeah. It's a pain. I agree. The distillation term is horribly overloaded. Um, there's a review slide. Do we need to review multi-teacher on-policy distillation? It might be too complicated to need to do it. We could come back to it. I think I kind of want to just go through the actual models, and then we could use the supporting slides as needed. Um, this famous InstructGPT three-step thing, I think many people have heard of it, but this is what constituted post-training at the time of ChatGPT coming out, so it's kind of important grounding of this human supervised SFT data, mostly human supervised preference rankings to make a reward model and then do RL on that, and the model gets better.
And it's pretty interesting how all of these have been kind of phased out, at least in terms of what we know openly, where they're, we don't use that much human demonstration data for SFT. There's likely some human preference data still in the loop, but I would guess that synthetic has a much bigger role, and there are reward models, but they're like not the cl- key RL target anymore. So in four years, most, almost all the canonical pieces have been moved on. And like this evolution is kind of within there. I think the early models after InstructGPT, like Llama 2, um, even Llama 3, these are pretty similar, which is like you're starting to break down this recipe with different tools like rejection sampling, DPO, some increased iterations. I think increased iterations is just that there was more incentive to squeeze more out of the models, and they just like broke things down more, where InstructGPT seemed like a bit more open-ended research where this kind of cleanness was fine. So-
00:10:48 Finbarr Timbers: Well, I think that's interesting, uh, with respect to how much everything has scaled, uh, right? Because, you know, InstructGPT was before ChatGPT was, was released, and so, you know, it's something, like just the complexity of what was done is that which a small team or even a single team could do. But then when you start looking at, you know, Llama 3, like it just starts to be a more complicated process and, you know, where you start to have a lot more, you know, specialized data and there's, you know, a lot more, you know, room for scale and for kind of money and complexity be poured in.
00:11:25 Nathan Lambert: Yeah. It's like, uh, both for-profit and nonprofit efforts to do post-training want me to advise them, and I'm like, "I don't really know how I'm gonna give you advice unless I'm spending twenty hours a week look, understanding the details of your recipe," 'cause it's like, well, I can't really give you a one sentence thing of do X without understanding all the complexities of the model and the post-training process that go into it. Which makes it, like makes it hard from kind of like a transparency point of view. Even if it's fully detailed, it's definitely still hard to modify and study.
00:12:00 Finbarr Timbers: Absolutely.
00:12:02 Nathan Lambert: So then like Tülu 3 in AI2, a lot of this was we're trying to beat the results of this Llama 3 post-training, which is pretty complicated, but we don't have the ability to scale the organization as far. So I, I, I think that's a big reason why the actual workflow is a lot simpler, where we have three clear stages that are doing slightly different things, and they build on each other. And that's like... It's never stated very explicitly in these papers on like how the org chart impacts the recipe, but I would, I, I think it's a very strong signal within the, at least the delta between the fully open work and the kind of partially open work that you get from industry.
00:12:43 Finbarr Timbers: Yeah, absolutely. And, and I think especially as we'll see with the domain-specific models, like that's like really clear, like something where you could really easily scale up your org chart to-
00:12:54 Nathan Lambert: Yeah
00:12:54 Finbarr Timbers: ... build that up.
00:12:56 Nathan Lambert: Yeah. And I threw Olmo 3 in after this, after the Tülu 3 slide, mostly just to show that the recipe was so similar to Tülu 3, and the org chart hadn't really changed. Like we didn't have more ability to scale, and like there was a, a little bit of separation between the model types, between like the think and the instruct models. But like without a major reinvent- like a major org change, it was just kind of stuck in this and do the best you can with it.
00:13:22 Finbarr Timbers: Yeah. Absolutely.
00:13:23 Nathan Lambert: Be- because like the real big change was this with DeepSeek R1. They, I had never seen this plot before, but they had this plot, maybe they added it for the Nature version of the paper, where they kind of show their recipe, where they like take the base model, they do R1-Zero, and then they sample from the R1-Zero to like filter prompts, and then they use that as SFT. This is like going through this. They use that as SFT for the next version of the model to create like a development internal RL DeepSeek-R1, and then they do this like repeated sampling to train multiple RL versions and kind of distill, distill in the sense of, of clarify and refine the reasoning behavior of the model before going through the final pipeline, which again is a mix of, um, reasoning and non-reasoning SFT into a bigger RL run. And-
00:14:11 Finbarr Timbers: Well, and I think this is really interesting because it starts to show, I mean, first of all, the, the complexity here. We're starting to use, um, yeah, like synthetic data as this primary input here, but it's not just like, you know, it's trying to elicit, you know, specific behaviors, and it's this kind of like industrial process, um, instead of like this, you know, it's not as much of an elegant research recipe. It's more like, you know, we train a model, and then we use it as best we can, and we keep iterating. Um, but I think the other thing that's interesting is, is we're starting to see here the SFT serving as the cold start. First of all, where, where that's, you know, I think before SFT was more of like a generally useful stage, whereas here its, its primary purpose is this, this cold start for RL.
And then the other interesting bit is, you know, DPO, uh, starts to disappear at this point from the leading recipes. I mean, Olmo 3 still does it, but you know, basically everyone else does away with it and just, you know, has the preferences included, um, as in, as a reward model or, you know, at so- at some way, um, in the reward bit of the RL stage. And so that's a really interesting change, where the, the supervised part of post-training is just, you know, massively deprioritized.
00:15:27 Nathan Lambert: Yeah. So my hypothesis for the dropping of DPO on these models is that, uh, as, as youāre doing like a cleaner recipe, essentially the need falls away. Versus if you look at Olmo, which is taking tons of potential gains by refining your model on outputs of strong open weight models, like largely Qwen and DeepSeek is the training data for the SFT of Olmo 3. Uh, and like the delta between that SFT data and the base model is still pretty big in the probability distributions. So DPO kind of helps further refine and clean up that distribution in a way that kind of has very rough edges. And but when you have a more refined, like industrial process on post-training, th-that will, that potential benefit will be harder to gain. Something interesting that I didnāt fully con-confirm before this is, for example, NVIDIA used to also be on this DPO train with their smaller Nemotron models.
And, and I would guess that potentially like D- Nemotron Ultra would not. But itās, and, and thatās because theyāre at much further down this development tree and using on pol- like these more on policy methods for creating the SFT data. And their model, I would guess, will become kind of more robust out of distribution and like have weird, less weird rough edges before because of it. So thatās kind of my hypothesis on DPO, and people that use DPO will be looked down upon. But itās like if youāre trying to bootstrap a recipe off the ground and just take gains where you can, I still think itāll work for a lot of people in a kind of compute efficiency standpoint.
00:17:05 Finbarr Timbers: Yeah. I mean, I think generally there's something interesting with preference tuning, that maybe it isn't being given the respect that it deserves. 'Cause one of the interesting bits about the Nemotron 3 Super paper was that they do a traditional RLHF stage in their RL, which has also fallen out of fashion, and they see pretty massive gains with it. So I think some of these changes are driven more by what's in fashion than by a fully rigorous set of ablations.
00:17:41 Nathan Lambert: It's very remarkable to me that the preference loss function can do so much for these models. Like, the models have so much potential there, and it's really just a contrastive loss on pretty granular feedback. And they learn all sorts of things. Like, they'll get better at math and code, or their reasoning strategies will be refined. That's remarkable to me. I think there will still be funny research on using preference-based losses with verifiable outputs. Like, I think all this would work. DPO on verifiable rewards and stuff like this, it's just kind of intellectually less appealing.
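To make the "contrastive loss" description concrete, here is a minimal sketch of the standard DPO objective for a single preference pair, in plain Python. The function name, argument names, and the `beta=0.1` default are illustrative assumptions, not details from any recipe discussed here; the inputs are summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin), computed stably.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical log-probs: the policy prefers the chosen answer more than
# the reference does, so the loss drops below log(2).
example = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

When the policy and reference agree exactly, the margin is zero and the loss is log(2); the loss only falls as the policy raises the chosen response relative to the rejected one, which is the contrastive part.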
00:18:19 Finbarr Timbers: Yeah. Well, I think that's where I thought the delta learning hypothesis style of DPO, like what Olmo 3 did, where you create these synthetic preferences by having bigger and smaller models of the same family be where you get your preferences from. I thought that was a really interesting signal because it seems really analogous to some of the guidance work that we see in diffusion models, like classifier-free guidance, which has something similar, and there were very similar results there, which showed that you could have the--
But one signal they used was models further along in training versus earlier in training as a source of signal that you could guide along. And that worked quite well. And so I suspect that these signals for preferences could actually be more robust, but because some of the largest labs don't have to do that, perhaps we're not citing them as much.
00:19:18 Nathan Lambert: Yeah. Or they don't tell us. To continue this, it's kind of cool to look at-- So the DeepSeek models have gone from what I would call closer to Llama recipes, to DeepSeek-R1, which is most definitively the canonical recipe for reasoning models, and then continued to change toward this multi-teacher format. So if you look at the V3 paper, before R1, they do something remarkably similar to a Tülu 3 type thing, where they have a mix of SFT and then they use this RL on verifiable rewards. They didn't call it that, or their paper wasn't out at the time. And so they did this before R1 came out, which was just a less reasoning-focused model that used the same tools but with a different ratio of implementation weight.
00:20:07 Finbarr Timbers: And what's interesting is that this comes out basically at the same time as Tülu 3 and Olmo 2. It's a very similar recipe, just done with more compute.
00:20:16 Nathan Lambert: Yeah. Yeah. And then we have this R1, which we've just talked about at length, in January, which is a month later. They have a few more releases through this. They have some updates to their V3 and R1 models, which have dates, which are largely the same recipe. And then the next documented change in their recipe was V3.1, which is when they merged thinking and non-thinking into one model, which everybody that does this has said has been hell to train. But you kind of need it from a serving perspective, and it's obvious to me, at least, that long term all the models will be reasoning models, and you'll just have reasoning models that are very efficient based on the gains that are there.
So this is kind of a needed change that they made. And then in December of 2025, they released V3.2, which is when there are meaningful changes to their recipe, and they're talking about this expert creation with separate mini recipes, and then using that within their R1-style data process to do SFT data, and then a big RL run at the end with GRPO. So it took about a year for this evolution of the R1-style recipe to land in their models. And I think this is a very big complexity step that isn't represented in something like Olmo 3, and it's where you can see a fork in the recipes over time as they become way more industrial and scaled at these frontier labs.
00:21:46 Finbarr Timbers: Yeah. And I think one other good thing here, just as a historical note, is that I think it was with the 0324 release where they updated the original V3 paper. So V3 comes out before R1, then R1 comes out, and then after R1 comes out, they actually go back and update the V3 paper, maybe getting ready for the Nature submission or something.
00:22:07 Nathan Lambert: Yeah.
00:22:07 Finbarr Timbers: And they make a reference there to say, "Oh, something you could do is train these domain-specialist models and then combine them." And then that later becomes more of a priority, as they talk about in V3.2.
00:22:21 Nathan Lambert: That's a fun note. Yeah. And then most recently in April 26th is this V4 model, which has even more experts. They add this new loss function for multi-teacher on-policy distillation, which I said follow Jiaoli. And this is kind of a microcosm of the arc that the whole industry went through, at least the people who share their post-training details: realizing how core RL is, changing the recipe around scaled RL, and then figuring out how to scale to more domains in the scaled RL format without grinding to a halt in operational complexity.
00:22:58 Finbarr Timbers: Yeah.
00:23:00 Nathan Lambert: So then the next stage of this is what I call 2026-style recipes, which are all these models that are doing this multi-teacher infusion of knowledge. And some of them are using on-policy distillation and some are not. One of the key things to see will be how crucial on-policy distillation is to really keeping up at the frontier. So the paper that named this term was the MiMo-V2-Flash paper. I think the model was released in December and the paper in January, and a lot of it will look similar to this large RL style recipe. But this large RL run is where the on-policy distillation comes in. So this is probably a better time to explain. I have this great little feature.
So this is the summary of what multi-teacher on-policy distillation is. Generally, it fits within an RL framework where you have the model you are training, the general model, sample its own trajectories, and then you route the trajectories to various expert models you have trained. And each sample is trained with this distillation KL loss to match the tokens of that expert. And multiple models have shown that this type of supervision is really useful. You could combine it with other RL losses, such as verifiable rewards; for example, Sasha Rush gave a good mini spiel on that and how they use it with Composer, which is a video that I really recommend people watch as well. But the key of it is that it is a different loss function, but it plays very nicely in the RL frameworks that people are already using. So they use these teachers-
00:24:45 Finbarr Timbers: Just RL, like, if you-
00:24:47 Nathan Lambert: Yeah
00:24:47 Finbarr Timbers: ... actually implement it. I'm talking with some of the people at Ai2 about implementing it now. And you take your RL setup, and then you just have a set of tweaks on the learner to actually implement this. So it's quite straightforward.
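As a rough illustration of the loop described here (the student samples its own trajectories, each trajectory is routed to an expert, and a per-token KL term pulls the student toward that expert), the following is a toy sketch in plain Python. Everything in it is an assumption made for illustration: the function names, the routing by a `domain` tag, the use of KL(student || teacher) as the direction, and the three-token vocabulary. It is not the implementation from any of the papers mentioned.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) over one next-token distribution."""
    return sum(
        s * (math.log(s) - math.log(max(t, eps)))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )


def mopd_loss(trajectories, teachers):
    """Average per-token KL between the student and each routed teacher.

    trajectories: list of dicts with a "domain" key and a "steps" list of
        (prefix, student_probs) pairs sampled from the student itself.
    teachers: dict mapping domain name -> callable(prefix) -> token probs.
    """
    total, n_tokens = 0.0, 0
    for traj in trajectories:
        teacher = teachers[traj["domain"]]  # route by domain (assumed rule)
        for prefix, student_probs in traj["steps"]:
            total += reverse_kl(student_probs, teacher(prefix))
            n_tokens += 1
    return total / max(n_tokens, 1)


# Toy usage: two hypothetical domain teachers over a 3-token vocabulary.
teachers = {
    "math": lambda prefix: [0.7, 0.2, 0.1],
    "code": lambda prefix: [0.1, 0.1, 0.8],
}
rollouts = [
    {"domain": "math", "steps": [(("2", "+"), [0.7, 0.2, 0.1])]},
    {"domain": "code", "steps": [(("def",), [0.3, 0.3, 0.4])]},
]
loss = mopd_loss(rollouts, teachers)
```

In a real trainer the student probabilities would come from the model being optimized and gradients would flow through them; the point of the sketch is only that the term slots into the place where an RL learner would otherwise compute its policy loss.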
00:25:02 Nathan Lambert: Yeah, so this is a fancy diagram that makes it more complicated than it needs to be, but it's also a very nice diagram, which shows the various domain teachers that they have (search agent, code agent, math, reasoning, safety) and how they put these together. And the experts are used both for SFT data and then for this final supervision. And the recipe for the experts would look something like this DeepSeek recipe, which is complicated on its own: make a very good reasoning model that is good at one thing.
00:25:29 Finbarr Timbers: Well, I think it is complicated, but if you think about being the actual researcher working on it, you have a base model, and then you have an RL setup, and you're just constantly updating both and then rerunning RL. So the most complicated part of it is just writing down the history and tracing everything. But it's a very natural, organic way for the RL to evolve through iterative experimentation.
00:25:57 Nathan Lambert: Yeah. So once you have a recipe, you're progressively tinkering with each part, and it's fairly stable, but it's hard to rebuild from scratch. So we'll see how long the recipe shape lasts, but it'll probably be on the order of years. Another big one that also shared a lot of details on this on-policy distillation approach was Nemotron 3 Ultra, which is obviously exciting to me, to have a US-made model with very strong performance, and NVIDIA released a lot of datasets with it.
But they also talked about a lot of their implementation details of what was hard with on-policy distillation. I have notes somewhere on this. They do this thing where they have two rounds of on-policy distillation, as they found it to be better to integrate some teachers one after another. And the paper has a lot more details. I don't wanna go scroll through the paper, but we could also do this. Did you have any other impressions? We have this other doc we can pull up that-
00:27:01 Finbarr Timbers: Oh
00:27:01 Nathan Lambert: ... also you might have had other details on it.
00:27:03 Finbarr Timbers: Yeah. Well, I think something else that is worth contrasting the paper to is the Nemotron 3 Super paper. 'Cause in the Nemotron 3 Super paper, they had a similarly complicated recipe, but they did multiple rounds of RL. There they had three rounds of RLVR, followed by a round of software engineering RL, and then followed by an RLHF stage. So it was really interesting to see them go from that, one of the most complicated RL setups in terms of successive stages that I've seen, to this setup where it's still complicated, but it's conceptually a lot simpler.
00:27:54 Nathan Lambert: Yeah. I pulled the paper up. It's gonna be hard for me to... I had highlighted a few details. The interesting parts are around the various NVIDIA details on all the teachers. There's just so many details in their paper on training-
00:28:10 Finbarr Timbers: Yeah
00:28:10 Nathan Lambert: ... all the teachers. Okay, so I have some of this up. I have an interesting quote that's like, "One key finding from our trials of doing on policy, multi-teacher on policy distillation is that teacher models trained with substantially different training pipelines cannot be effectively combined through a straightforward on policy distillation merge, resulting in suboptimal performance." So they'd have to do some cross-teacher alignment to make sure that they're actually similar, which I feel like could become a whole organizational nightmare. They say, "We hypothesize that when the teacher and student are trained on different SFT data, they acquire different reasoning behaviors and induce different output distributions. This distribution mismatch can cause student-generated trajectories to be out of distribution for the teacher, reducing the quality and reliability of the supervision signals provided by the teacher."
00:29:00 Finbarr Timbers: Yeah, that's interesting actually, because there was a paper, I can't remember the name of it, that I read recently which claimed that what you need to do is constantly... So one thing you could do, which is kind of the obvious thing to do, is you take your base model, right? You do whatever general SFT you're doing, and then you do a bunch of RL, you train domain-specific agents, you train them all the way until they've converged or until you've run out of money.
And then you take these final experts, and you do some sort of on-policy distillation to combine them into your final model. But with the paper, and I'll try to find it and give it to you, see if we can share it, what they claimed was that instead of using the converged model, you need to do it in successive stages with the in-progress model. So if you train your RL for a thousand steps, you can't use the thousand-step checkpoint for the on-policy distillation. You have to do it in stages, and first use the two-hundred-and-fifty-step checkpoint and the five-hundred-step checkpoint and gradually bring that base model up to speed, or else there's gonna be too much divergence, and the KL divergence will just be too distinct-
00:30:17 Nathan Lambert: Yeah
00:30:18 Finbarr Timbers: ... to learn from.
00:30:19 Nathan Lambert: Yeah. So essentially the last sentence in this paragraph I had read most of is literally, "We encountered this issue in practice because the teacher and student models were developed in parallel."
00:30:29 Finbarr Timbers: Yeah.
00:30:29 Nathan Lambert: They're like, "This is a problem because it's hard to do everything at once." Which is the type of thing where having research on it would be so great, and I think NVIDIA could release some of the teachers so that people could just-
00:30:45 Finbarr Timbers: Yeah. That'd be great
00:30:45 Nathan Lambert: ... if you have the teachers and you have the intermediate model stage, you could do the problem of, like, just studying multi-teacher on policy distillation from the starting point and understanding the training dynamics.
00:30:57 Finbarr Timbers: Yeah.
00:30:57 Nathan Lambert: Which is the type of thing we would want to do at Olmo. We just haven't scaled our recipe to this point yet.
00:31:03 Finbarr Timbers: Yeah, absolutely.
00:31:04 Nathan Lambert: So I will keep encouraging NVIDIA to do this.
00:31:07 Finbarr Timbers: That'd be great. NVIDIA-
00:31:08 Nathan Lambert: I think, uh-
00:31:08 Finbarr Timbers: ... listen.
00:31:10 Nathan Lambert: They listen. The other side of things is a bunch of models released in 2026 that do not do this multi-teacher on-policy distillation, and they also don't use nearly as many teachers. So I would say that this Microsoft model, and I don't say this as a diss, it's hard to get a new team off the ground, went for a simpler approach to try to get a solid model, and it has three more general experts combined via SFT and then a longer RL run. So it looks a lot more like DeepSeek R1, but I suspect that what they will do next is make finer-grained teachers and see if they need to switch to on-policy distillation.
00:31:48 Finbarr Timbers: Yeah. And I think in one of our group chats, you described the MAI thinking model as a conservative recipe. And I think that's a really good description of it. The team came up with this conservative recipe, and then I think they did a really great job of actually executing on it. 'Cause if you try to make too many changes at once, it's really easy for the recipe to collapse under its own complexity, and I've seen that a bunch of times across my career.
Try to make too many changes and it all goes poorly. So I thought that was a really good choice on their part. I also think that it's not super clear to me, maybe you've seen some papers on this that I haven't seen, how well the trace distillation SFT does, or how much better on-policy distillation is versus the trace distillation SFT.
00:32:41 Nathan Lambert: Yeah. It's like, what is the relative magnitude in the final performance?
00:32:45 Finbarr Timbers: Yeah.
00:32:45 Nathan Lambert: So the Nemotron Ultra paper has a table on how far the on-policy distillation goes relative to the teacher, and they also have the starting point. So I guess that's a potential way to do this. Here, I could just pull this up. Let me switch.
00:33:00 Finbarr Timbers: Oh, sure.
00:33:04 Nathan Lambert: So I had this open, but in a different tab. Okay. Here's this paper. Page twenty-seven is where the paragraph I just read is, and then it also has this kind of-
00:33:17 Finbarr Timbers: Oh, fascinating
00:33:18 Nathan Lambert: ... great table. I spent a while looking at this earlier. So essentially, it's where they get after SFT-
00:33:24 Finbarr Timbers: Wow
00:33:24 Nathan Lambert: ... on each of the benchmarks on the general model. And then I think... Okay, so the sort of gains over the RLVR student, recovery of the specialty student. So I need to make sure... Okay, so RLVR denotes the initial student checkpoint, and then there's the multi-teacher on-policy distillation. So I'm not sure I can figure out what this SFT column is, but you can see where the teacher is relative to on-policy distillation. I think this is the closest information we have on the relative performance gains.
00:33:59 Finbarr Timbers: Yeah. That's fascinating because the DeepSeek paper, I forget which one, maybe it was V3.2 or maybe it was R1, actually claims that doing the domain-specific distillation and then doing a general stage on top of that captures the original performance. But that doesn't seem to be the case here. The gap maybe isn't huge, but most of the time there is still a significant gap. So that's really interesting.
00:34:42 Nathan Lambert: Yeah. I wish this table and text were clearer. I literally can't fully parse it. It's like, RLVR denotes the initial student checkpoint, and then OPD denotes the checkpoint after the first and second iterations. So what is the checkpoint that was used at the start of on-policy distillation?
00:35:01 Finbarr Timbers: I think it was the RLVR one. So they do a general SFT stage, and then they do an RLVR stage that covers the non-teacher areas, where they don't have specialized models. Then they do MOPD.
00:35:15 Nathan Lambert: Yeah. And then that makes sense with this recovery rate, which is final model minus RLVR, the gains from OPD, relative to teacher minus RLVR, which is the gains you still needed to cover.
00:35:31 Finbarr Timbers: Yeah.
00:35:32 Nathan Lambert: And what gains the teacher could potentially give you. So more research like this. Happy to see some of it out there. I'm gonna switch back.
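The recovery rate as read off the table in this exchange can be written down in a few lines. This is a sketch of the ratio as the speakers describe it, not a formula copied from the paper; the function name and the scores in the usage line are made up for illustration.

```python
def recovery_rate(rlvr_score, opd_score, teacher_score):
    """Fraction of the teacher-over-student gap closed by distillation.

    rlvr_score:    benchmark score of the student before distillation
    opd_score:     score of the student after on-policy distillation
    teacher_score: score of the domain teacher on the same benchmark
    """
    gap = teacher_score - rlvr_score  # gains the teacher could give you
    if gap == 0:
        raise ValueError("teacher and initial student score are identical")
    return (opd_score - rlvr_score) / gap  # gains actually recovered


# Hypothetical numbers: student at 60, teacher at 80, student ends at 75.
rate = recovery_rate(60.0, 75.0, 80.0)
```

A value of 1.0 would mean the distilled student fully matches its teacher on that benchmark, and values below 1.0 correspond to the residual gap discussed above.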
00:35:43 Finbarr Timbers: Yeah. Something I found interesting about both the Nemotron papers and the MAI thinking paper is that they don't talk as much about some of the more detailed post-training decisions that have shown some pretty strong gains in some of the other papers. Like, I believe it was GLM-5 where they talk about doing a difficulty curriculum and a difficulty filtering stage.
00:36:11 Nathan Lambert: Yeah.
00:36:12 Finbarr Timbers: And that's just not something that's really talked about in these other papers. I think it was Kimi K2.5 that used a temperature schedule. It's kind of funny. So Kimi K2.5 and GLM-5 both have temperature schedules, and they claim the exact opposite thing. One of them says you have to start with a high temperature and go low. The other one says you have to start with a low temperature and go high. I don't know. And you don't see that discussion, I don't think, in some of the other papers, which is kind of interesting.
00:36:40 Nathan Lambert: Yeah. I, I still think the Chinese labs are much more willing to share, like really, really nitty-gritty tech details. The NVIDIA paper is like mostly a list of like methods to create a teacher or like-
00:36:51 Finbarr Timbers: Yeah
00:36:51 Nathan Lambert: ... domain-specific teachers, which is useful, but it's less of a fun read. There's 15 pages of different domains, so I'm like, "Okay, I don't need this." Yeah, Kimi K2.5 and GLM-5 actually have more similar recipes, which are also on the simpler side: you create this SFT stage, and then you do RL. The RL might be staged. There's not this on-policy distillation. There's a bit less talk on how many experts they have and what their domains of expertise are. I think it's obvious that you have to take all this with a grain of salt, and how they decided to present the information is a big factor in this. They might actually be closer in reality, and it just wasn't described in a certain way.
00:37:44 Finbarr Timbers: I think another interesting bit is that the Chinese labs all seem to be converging towards sparse attention, whereas the American labs, at least NVIDIA and Ai2, seem to be converging more towards hybrid attention. The NVIDIA Nemotron Ultra used the Mamba attention, whereas we see DeepSeek sparse attention and then the MiMo MSA, whatever that stands for, MiMo Sparse Attention. So I think that's an interesting divergence.
00:38:20 Nathan Lambert: Yeah. I am not the person to ask, but I agree.
00:38:23 Finbarr Timbers: [laughs]
00:38:23 Nathan Lambert: Like, I often get asked... We'll avoid the full rabbit hole, but I often get asked, "Are the Chinese labs more efficient?" And I'm like, "I don't really know how I'm gonna give you advice unless I'm spending twenty hours a week understanding the details of your recipe," 'cause I can't really give you a one-sentence thing of "do X" without understanding all the complexities of the model and the post-training process that go into it. Which makes it hard from a transparency point of view. Even if it's fully detailed, it's definitely still hard to modify and study.
00:38:42 Finbarr Timbers: Yeah
00:38:42 Nathan Lambert: ... like if you make a GPT model 1% more efficient, you're making like fat stacks of profit. Like, I think that's like a more effective market mechanism, but-
00:38:53 Finbarr Timbers: And then-
00:38:53 Nathan Lambert: The Chinese lab-
00:38:54 Finbarr Timbers: You know-
00:38:54 Nathan Lambert: Yeah
00:38:55 Finbarr Timbers: ... if you make, you know, serving ChatGPT more efficient, Sam Altman can say, "Hey, here's a bunch of stock." Like, so yeah.
00:39:02 Nathan Lambert: Yeah. But, uh-
00:39:03 Finbarr Timbers: Um
00:39:03 Nathan Lambert: ... they do great, like the Chinese labs do great research.
00:39:05 Finbarr Timbers: Absolutely.
00:39:05 Nathan Lambert: I just think it's kind of a bit different. Okay, we can move into more open-ended stuff here.
00:39:12 Finbarr Timbers: Sure.
00:39:12 Nathan Lambert: I think that we have like a bunch of docu... We have a bunch of things in a document here. I'm sure more will come up. How do you think about open models and kind 'cause it just doesn't strike me that there's this, like, you know, I think that there's a large business to providing... Well, actually that's not even super clear. You know, we've seen a number of companies providing RL fine-tuning services, RL as a service. We've seen a lot of companies try to provide fine-tuning as a service, and none of them have really taken off. Like, I think OpenAI has started to shut down, I think they shut down their RL fine-tuning. I think they might be shutting down their fine-tuning. May be wrong about that.
00:45:51 Nathan Lambert: Well, it's like Cursor used Fireworks for their actual training run, and I don't really know all the details of this, but I think Fireworks does-
00:46:01 Finbarr Timbers: Yeah
00:46:01 Nathan Lambert: ... a fast weight transfer and other things to make it so that they can scale their RL inference compute very nicely. So that's one type of it. I don't know how big of a long tail that business is, but also I think Tinker is a better business than most people expected. It makes some real amount of money. In the hierarchy, I think selling compute is not the best business.
00:46:23 Finbarr Timbers: Yeah.
00:46:23 Nathan Lambert: Selling inference, great business. And Tinker-like APIs, if you can't transition them into selling tokens, are somewhere in between the two, where they could take some amount of margin that'll be slightly higher than just selling the compute. And they obviously get a margin by getting compute at a cheaper rate than their customers-
00:46:43 Finbarr Timbers: Yeah
00:46:43 Nathan Lambert: ... and that's part of the margin they're taking. But I don't see it being as nice as inference, so it's kind of existential for them to make it so that these fine-tuning APIs feed into an inference business pretty nicely.
00:46:56 Finbarr Timbers: Yeah.
00:46:56 Nathan Lambert: Because then you can be somewhat locked in on you train the model on our infrastructure. You actually can own the model weights, but the training dynamics to inference mismatch is perfect because you trained exactly on our inference engine, and are gonna get what you want out of it.
00:47:11 Finbarr Timbers: Yeah. And it also helps a lot with utilization, because you can share that utilization across a lot of clients. So I think it makes a lot of sense. I think it's probably a better model for a lot of users. Like, I think of academic users, it probably makes way more sense to do this. Or, for that matter, if you're starting a new post-training lab now, as I know a few people are. I think that's where it probably makes a lot of sense to start with something like the Tinker API, and then at some point, if you wanna try and capture that margin, maybe then you try to do something more custom. But if you can use something like that, that's great, and the economics are just fundamentally more sustainable. Or they're better for you, rather than trying to go to CoreWeave or whoever, or Serv scale, and say, "Hey, I need 10,000 networked GB200s," you know? That's just a very expensive thing to do, especially if you can't keep it running all the time.
00:48:14 Nathan Lambert: Yeah. Do you have a, do you have any more hot takes on post-training before I ask you some more general things?
00:48:22 Finbarr Timbers: Well, something I'm generally interested in, and I'm the wrong person to speak to about it. I'd love to talk to someone who's maybe a capital allocator, or a compute allocator, who's deciding where to put compute or where to hire team members. Because I'm kind of curious how the high-level decisions are made allocating resources between pre-training and post-training. 'Cause what I have seen as a general trend is that you see a lot of papers where there's more focus put on one or the other. So that's something kind of interesting to me: how the people who are making this decision are making it and how they're thinking about it.
00:49:10 Nathan Lambert: Yeah. It's like the hardest decision to get out of labs. I used to spend time trying to get them to share more, but I think it's such a sensitive decision about where they see progress coming. They're allocating compute based on where they think the most progress is and what the return on investment is. So if you go to Anthropic and they're like, "Here's our distribution," it's like, okay, that's where labs see their bets and/or where they see they are weak.
And you invest more compute to make progress in the area that you are interested in, which I always think makes a lot of the open research kind of boring right now: the people that get compute are just way more likely to succeed as academics and researchers, which is a horrible equilibrium for the world, but realistically true. I don't know what to make of a lot of that. I wanted to ask you how you feel about the craze that people have to cash in on making money and join a lab before the ladder gets pulled up, and what people should be optimizing for in their careers in the face of meaningful opportunity costs.
00:50:18 Finbarr Timbers: Yeah. Well, that's actually very timely. I think that's really important to talk about. I think it's always worth focusing on whether what you're spending time on is gonna be generally valuable or if it's a really short-term exploitation type thing, in the RL explore-versus-exploit setup. Something that I've seen throughout my career is that often the places that pay the most are also the places where you're doing the most interesting work, right? If you're gonna go work at OpenAI or Anthropic or a frontier lab, they pay a lot of money. They also have a lot of resources, so you're gonna make a lot of money and learn a lot.
So I think it's worth trying to decide: is the opportunity that, or is the opportunity like in 2021 or 2022 or whatever, where, you know, I was at DeepMind at the time and it's like, okay, do I work at DeepMind, which paid a lot less than crypto? Should I just go work in crypto and try to mint NFTs or whatever? I think that would've been a mistake. But trying to figure out if you're gonna be able to do interesting work is really important, and also trying to figure out if you're going to be able to push forward science. You know, if what you're doing is more just going to data vendors and saying, "Okay, I need a bunch of data to do whatever." And then they give you a bunch of data, you train a model, you say it's good or bad or whatever.
You know, I donāt think thatās as interesting and, and I donāt think youāre gonna learn a lot even though thatās, you know, work that would probably drive model progress for it. I think if youāre able to, you know, make, focus more on the science and make more scientific conclusions, I think that can be, you know, a lot better for your long-term career. And I think thatās where places like AI2 and the other, um, academic research labs, you know, Marin is doing a really great job of this. Um, I think thatās where you can have a lot of impact in that they donāt have the budget to go and buy a lot of data, and so that leverage just really isnāt, um, open to them to pull. And so they have to focus on science and driving innovation, and thatās where you can see things like the Almix, uh, paper, which I thought was a really excellent, uh, sc- you know, scientific paper, but also, you know, meaningfully, I think, advanced, uh, the state of the art.
00:52:32 Nathan Lambert: Yeah. No, mostly this is grounded in visiting the Bay Area, and every time I go Iām like, āHoly s**t, what is going on here?ā Like all these very junior people are like have way too much dread about their, uh, opportunity cost and both of us arenāt based in the Bay Area, so I feel-
00:52:46 Finbarr Timbers: No
00:52:46 Nathan Lambert: ... somewhat removed from it, which gives me a little bit more time to pause and be like, what exactly is the right thing to optimize for? I per- I-- it's easy for me to say as somebody that's established, but I think there's opportunity for a lot of people to just, if they have conviction on something, to try to go and do it and not just follow everybody that goes down the funnel of joining one of the established labs or the Neo labs where I don't hear from many people that join as a junior person at these places and end up with very high responsibility. Like they're contributing to something that matters or they're around a cool group of people, but I don't hear from that many people that are like, "Wow, I am doing the highest leverage stuff and the most interesting things."
00:53:30 Finbarr Timbers: Well, I think that, you know, it's kind of funny for, for me to say this as I, my career has been more on, on the opportunistic, uh, side of things. Um, but you know, twice now, uh, I've been at organizations where, um, I, I've been working... So, you know, at, at DeepMind, uh, I, I was part of the Alberta office where DeepMind had, you know, acquihired the, uh, computer poker research group from the University of Alberta. And so, you know, this was a group of people who were really invested in, uh, computational game theory and g- you know, poker playing, um, algorithms. And they were all in on that and, you know, they, they were all in on that to the point that, you know, they were one of the two leading, uh, labs in the field and, um, were, you know, b-because they were so strong at this, they were then, you know,
DeepMind came and, you know, acquihired them and, and they all joined and they, you know, did quite well from that, um, acquisition there. And then, you know, you know, I joined later because I was, uh, you know, interested in, in working with them and doing game theory and stuff. But you know, it was this group of people who had this conviction that what they were doing was really important and, you know, it worked out quite well for them. And then, you know, the same thing at AI2, where at AI2, you know, there was all of these people who were really interested in, uh, NLP research, you know, even before language models. Like we see people like, you know, like Kyle a-and Dirk I think were both at AI2 for like almost a, a decade.
Like they had these really long tenures, um, and then they did really well and then, you know, they've, they've since had some, you know, strong, um, opportunities, uh, coming out of that with, with, um, yeah, some of the opportunities that have been available to them. And I, and I think that the consistent theme there has been that, you know, if you have high conviction that what you're doing is important and interesting, then like it, it's not a mistake to follow that and to, you know, try to become really strong, um, in that area.
00:55:15 Nathan Lambert: Yeah. I mostly think it's good for the world to have a di- more diverse set of approaches.
00:55:19 Finbarr Timbers: Yeah.
00:55:19 Nathan Lambert: It'll be interesting to see what the Neo labs actually produce if, if they can manage to do things that are diverse. My personal idea is that they're so big now that most of them need to end up doing something that is somewhat similar, which is-
00:55:33 Finbarr Timbers: Yeah
00:55:34 Nathan Lambert: ... hard, but like they need to keep risking the comp- they effectively need to risk their $20 billion valuations to do something interesting that's not just gonna be like squashed by an OpenAI or Anthropic side project.
00:55:48 Finbarr Timbers: Yeah, absolutely. And I think it's tough because when you're raising, when you're, you know, you have these huge seed rounds and you're raising, you know, 200 million or, you know, a billion dollars or whatever, then it's like you have to pretty quickly show results to be able to-
00:56:01 Nathan Lambert: Yeah
00:56:01 Finbarr Timbers: ... you know, grow off of that.
00:56:04 Nathan Lambert: Yeah. So a to-be continued conversation.
00:56:11 Nathan Lambert: Any last words? I don't, I don't need to stretch it on if we don't have anything to add to our conversation.
00:56:16 Finbarr Timbers: No, I, I think this was pretty good. I think it was really great, uh, getting a chance to catch up and talk about some of this stuff. You know, I, I've been reading all of these papers and thinking about all the different recipes, so it's great to get to, um, to chat about it and put it out into the ether. So yeah, thanks for having me on.
00:56:31 Nathan Lambert: Yeah, thanks for coming back. We'll talk soon.
00:56:33 Finbarr Timbers: Sounds good.
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe -
Edit Jun. 11: Anthropic changed their silent model manipulation of AI research queries to also use a classifier like the other safety domains. This addresses a key concern I had in the mistreatment of "safety" in the release, and props to Anthropic for a quick change, but it does not fully address the trust that has been broken. I shared more reflections here.
Today, Anthropic released their Claude Fable 5 model to consumer and enterprise audiences. This is the general-access variant of their Mythos-class models. With it, Anthropic rolled out a series of safety measures, some explicitly called out to users and some modifying the model without telling the user. It should be less surprising than it is that the next major step in AI capabilities came with heavier-handed safety measures indicating Anthropic's intention to protect, or entrench, their current lead.
The unevenly applied safety policies that Anthropic have rolled out are on track to become a classic cautionary fable in how narrow and self-fulfilling notions of safety and control rarely work out.
The smartest model in the world
Before digging into the nuance of the safety facts, it is important to establish the quality of this model. First, the quality of the model sets the stakes of today, as these safety features are meaningfully changing the shape of access to frontier AI, something which has never happened with the modern LLMs we know. Second, the capabilities point to this story only accelerating. Recursive self-improvement isn't quite the right mental model of progress from here, but Claude Fable 5 should make it very clear that there are no immediate walls in training LLMs.
To start, Claude Fable 5 is definitely the smartest model available to the general public, a remarkable leap on pretty much every relevant benchmark of the day, at only 2X the price of current Opus models (which is still less than GPT 5.5 Pro's variant). This alone is a seminal moment for the field. To have a model iteration take such a substantial step in capabilities, a few years into the post-ChatGPT LLM race, is astounding. There's no clear breakthrough associated with this model, such as inference-time scaling or RL, and public wisdom is that this is achieved by advances across the whole stack (of course, we can't know for sure, since it's not documented). This is a major technical achievement and the employees who built the model should be very proud of their work.
This model was delayed 2+ months after it was done training before it was publicly available. Given the competitive dynamics of the AI economy, the smarter version of this model is already well underway.
To continue, the benchmarks for the model are below.
An asterisk on these scores is that these aren't necessarily the scores that the public will get, as some of the prompts will be downgraded to Opus 4.8 with the current safety filters on the model.
This is the type of jump in benchmark scores where I don't even need to substantially test the model to know it's an incredible tool. Remember that Anthropic is also the AI lab with the track record of caring the least about benchmarks (in particular, when compared to OpenAI and Gemini). Recall a comment I made in June of 2025:
This is a different path for the industry and will take a different form of messaging than we're used to. More releases are going to look like Anthropic's Claude 4, where the benchmark gains are minor and the real world gains are a big step. There are plenty of more implications for policy, evaluation, and transparency that come with this. It is going to take much more nuance to understand if the pace of progress is continuing, especially as critics of AI are going to seize the opportunity of evaluations flatlining to say that AI is no longer working.
Clearly, a few pieces of the progress dynamics have changed, but that's a post for another day. I've written multiple posts about new models this year specifically in how it's hard to trust benchmarks (and partially because the benchmarks don't move that much). Altogether, this is a major validation for AI-savvy workers who realized they're likely never going to write meaningful code again and need to develop new workflows around agents.
Interconnects AI is a reader-supported publication. Consider becoming a subscriber.
Smarter models spawn new safety games
There are multiple pieces of safety tooling associated with this release, including but not limited to required data-retention policies and added prompt filters. Through this analysis it is particularly important to be precise and clear as to which pieces of these are causing harm, and why single elements being out of place in an otherwise comprehensive policy are so damning for the overall safety process.
For their focus areas of cybersecurity, targeted model distillation, and research biology, Anthropic details new safety classifiers in their blog post:
Fable 5 comes with a new set of classifiers: separate AI systems that detect potential misuse, including jailbreak attempts, and prevent the main model (in this case Fable 5) from responding. We've been running classifiers on our models for some time, and Fable 5's classifiers are an extension of this previous work with extra coverage.
When Fable's classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs. Opus 4.8 is a highly capable model in its own right: a response that falls back to Opus is a far better experience than an outright refusal from Fable. Our early data shows that more than 95% of Fable sessions involve no fallback at all; for those sessions, Fable 5's performance is effectively the same as that of Mythos 5.
Examples of the primary cybersecurity and biology safety filters, which tell the users explicitly when they're triggered, are already proliferating online and appear quite sensitive. These can be a frustrating experience for users, but Anthropic is definitely within its power to do this and is intellectually consistent in doing so.
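To make the visible fallback concrete, here is a toy sketch of classifier-gated routing as described in the quoted passage. It is not Anthropic's implementation: the keyword lists, the function names, and the model identifier strings are all hypothetical placeholders, and a real classifier would be a separate learned model rather than a keyword match.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword lists standing in for real learned classifiers.
FLAGGED_TOPICS = {
    "cybersecurity": ("exploit", "malware"),
    "biology and chemistry": ("pathogen", "nerve agent"),
    "distillation": ("dump your reasoning trace",),
}


@dataclass
class RoutedRequest:
    model: str             # which model will answer the request
    notice: Optional[str]  # user-facing message, set only on fallback


def classify(prompt: str) -> Optional[str]:
    """Return the first flagged topic found in the prompt, or None."""
    text = prompt.lower()
    for topic, keywords in FLAGGED_TOPICS.items():
        if any(keyword in text for keyword in keywords):
            return topic
    return None


def route(
    prompt: str,
    primary: str = "claude-fable-5",    # placeholder identifier
    fallback: str = "claude-opus-4.8",  # placeholder identifier
) -> RoutedRequest:
    """Send flagged prompts to the fallback model and tell the user."""
    topic = classify(prompt)
    if topic is None:
        return RoutedRequest(model=primary, notice=None)
    notice = f"Request flagged as {topic}; answered by {fallback}."
    return RoutedRequest(model=fallback, notice=notice)
```

The property that matters for the argument is the `notice` field: in this design the user always learns when a different model answered.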
The damaging part of the safety story falls under the fold in the Claude Fable 5 & Claude Mythos 5 System Card:
We have also added safeguards related to frontier LLM development. As discussed in Section 6.1 of our February 2026 Risk Report, we are concerned about the risks of accelerating the overall pace of AI development, though we remain uncertain about the severity of these risks. In particular, our concern is with, as we wrote then, "accelerating other AI developers in building powerful AI systems that pose similar risks to the ones ours pose - without necessarily having commensurate safeguards."
In light of the ability of recent models to accelerate their own development, we've implemented new interventions that limit Claude's effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.
Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).
Anthropic documents that this will impact a small percentage of users, which is true. I focus on the small number of users supporting AI's diffusion and understanding outside of the few frontier labs, as a crucial mechanism for the continued safety of the technology.
Anthropic is documenting how the proliferation of AI capabilities is a concern to them, but they are solving it by misleading their users. An AI model that gets less intelligent automatically without notifying me is categorically misaligned AI. The next step on this line (not that Anthropic did it, but they could) is to have a model silently manipulate a workplace when it thinks it is an unsafe use for AI. Second, the implementation here is more complicated than was documented for cybersecurity or biology: modifying the model itself or the data presented to it, all without notifying the user.
The duality of these policies is extremely confusing and paints a strong inconsistency that casts doubt over their safety policies. This "safety" measure is presented as being far more about maintaining their competitive position. Again, if all of the safety policies took one form, this would be far more cogent and easier to support intellectually.
Anthropic has been very vocal about their concern over distillation attacks from particularly Chinese actors. Their claims are not transparent enough with the facts (or context as to why they can't prevent the behavior) to be fully believable. Despite the limited information, in the broader AI and DC communities, there have been serious discussions about taking action against the Chinese model builders on the grounds of said distillation.
On the point of distillation, my hypothesis is that API builders don't have an easy time preventing hacks or jailbreaking because it's a deeply grounded property of reasoning models to want to output the reasoning traces, and it would make the model far less intelligent to fully patch the behavior. This is based on a few assumptions:
* Chinese labs are not just showing up as customers to Anthropic's API and paying for tokens in the intended input-output form. If the Chinese labs are paying for intended use behaviors, despite being banned by the terms and conditions, I don't have a lot of sympathy for the frontier labs manifesting policy actions against this.
* Reasoning traces are disproportionately effective at seeding behavior in downstream models.
* Leading labs work very hard to patch the pipeline of these jailbreaks.
So, my logical conclusion is that the model companies would have to weaken their economic position to fully protect their IP. If this is the case, Anthropic would get a lot more sympathy from the AI research community by being transparent. It would also be far easier to have informed policy discussions, and not rely on me proposing Occam's razor explanations for what the API jailbreaking looks like.
Building these safeguards is not something that Anthropic should do alone. Safety research should be built on common understanding and information sharing across both labs and public research efforts.
If the exact safety procedures were actually the top line item to the company (a true non-negotiable for the leadership), they wouldn't permit the model to be released with an unclearly implemented safety filter in one of their areas of focus (frontier AI training). I am asking: why isn't there a classifier to downgrade AI research requests? This is a mix of transparent and reasonable safety policies with quietly rolled-out market entrenchment tactics.
I personally cannot trust the best AI model in the world to work in my professional domains building models, which I've constructed entirely out of a passion for making sure the transition to very powerful AI systems goes well for society. This inevitably will feel like a declaration of superiority by the Anthropic leadership.
The control problem and open-source as the only answer
All of the actions Anthropic is taking, including calling out smaller Chinese companies for distillation, are well within their rights. In fact, many people already expected the leading frontier models to be obfuscated from users so that labs can protect their IP. Today's actions miss the big picture that AI will always be an ecosystem, and cultivating an us-against-them dynamic between the leading company and the other players is structurally unstable.
Remember, this is at a time when the AI ecosystem is seeing the first stirrings of violence against AI leaders, and I've heard from many people that they don't expect it to abate. I wish I knew how to engage more to prevent this, and I see myself in the non-profit sector as someone who can hopefully independently represent AI to broader stakeholders.
I believe there was something misread, or at least misunderstood here, by the Anthropic leadership having a narrowly cultivated worldview around AI. An overwhelming sentiment I had today was one of obligation and confusion. I shared how I don't really want to have to go to bat against Anthropic, but they've just been unnecessarily antagonistic to China, then not so subtly to open weight models, and now more broadly to open AI research.
I understand that Anthropic has a specific view of AI, but such a powerful technology will never have its final equilibrium be one of singular control by a private company. Anthropic showcased this earlier this year in the spat between the Department of Defense and themselves, which points to a long-term equilibrium where the government will either want AI to be controlled by them or to be open. This made me believe that an open ecosystem is a far safer outcome.
Many of these events make me feel that Anthropic's leadership has a culture by which they can't help but speedrun through these issues, going head to head with existing power structures. This adds substantial uncertainty into an AI ecosystem at a time when it is very much not needed.
Collectively, the last week could be seen as a major rallying point for a new open-source ecosystem in the U.S. Nvidia released their first flagship model last week, Nemotron 3 Ultra, and these actions from Anthropic have galvanized a unanimous motivation and concern among my peers building open models. We need intelligence that we can trust, that we can modify, and that we can control.
The American open-source ecosystem has its feet underneath it and keeps being given more reasons to fight for its leadership, right from the hands of the companies it directly undercuts. That's the moral of this fable.
-
I'm departing the Allen Institute for AI (Ai2), where I got the great privilege to work on the Olmo models, to grow, to learn, and to have broad lasting impacts. This post is an attempt to reflect on why what we did was influential, despite obviously being far from the frontier in performance (even when within size buckets), and how this reflects on various paths to impact in AI today.
To start, I shared the following note with the company yesterday:
Dear Ai2.
As many of you know, today is my last day working at Ai2.
I joined Ai2 largely as an accident. I met Luca at ICML 2023 in Hawaii and realized I could level up my open post-training work dramatically if I got the chance to join. When I got an offer it was an absolute no-brainer, it was such a welcoming and exciting environment.
It has been a wonderful ride that has transformed my life, and I couldn't be prouder of the work we did together. Ai2 has a wonderful scientific culture at its core and I'm excited to see this continue. I feel very lucky to have been here and that I personally have benefited massively from everyone who has worked so hard to cultivate that culture and environment. It is and has been a team effort. This includes all the people whose longest interactions with me were brief chats at the coffee machine. I drew so much energy and excitement from all the different ways people at Ai2 showed up for the mission.
I've already thanked much of the OE team directly, but I wanted to thank everyone else that went into this. Legal, IT, Comms, and the Office team all do a great job enabling and leveling up our research work. It's often work that is forgotten, outside of the limelight, or remembered at the last minute, but it all has been crucial to achieving our goals. I'm excited to keep visiting the wonderful Northlake space in the coming years.
Even though I'm leaving, I'm more excited than ever about Ai2's mission. Ai2 operates in such a rare niche between academia and industry, where we can explore and influence the most important technology of our lifetime. Doing this openly is the best way to ensure the technology diffuses safely to everyone who may benefit. Ai2 needs to stay as ambitious as possible, trying to influence the cutting edge of AI and the biggest issues of the field. Do not shy away from these challenges: AI needs independent voices as it only becomes more geopolitical, socially disruptive, and central to the economy.
I will still be working in this space, working to make the open ecosystem better coordinated and more useful.
So as I go off to try something new, don't be strangers. I'll always be reachable at [email protected] and will still live in Seattle for most of the year.
Nathan
I have loved and will still love Ai2. Ai2 has a deep culture of caring about the research process, the outputs that get shared, and most importantly the people who do the work. This is why the institution creates countless wonderful people that go and spread the gospel throughout the research community. This core culture will remain through the rebuild, and there are plenty of resources to do impactful research across the spectrum of AI.
In the last two years of my time at Ai2 I've done so much meaningful work. Of course Olmo is at the top and has been my priority, but making time for consistent practice here on Interconnects, weekend cram sessions for ATOM, and also the fun RLHF book make for a list that makes me wonder how I did it all. I was obviously obsessed with work, but not in a way that made me lose sleep or lose my overall wellness. It was the right long-term approach.
This impressive list is one where I was ruthless in saying no to things that didn't matter and got all my work out to see the light of day. I had no medium-sized projects that didn't succeed in the last few years. It makes me wonder if I wasn't taking enough risk. It shows you can truly do so much with your time, and it's actually harder to find the right problems and environment to do it. Many people are in environments where their work never becomes public or they're forced to change topics consistently.
From zero to hero
To start, I'd like to do a short recap on my path to Ai2 to show that Ai2 was just as much a growth story for me as an execution story.
I studied electrical engineering in undergrad, focusing on linear systems math and microelectronics.
I was admitted to the UC Berkeley EECS Ph.D. program to study microelectromechanical systems (MEMS).
I showed up at Berkeley in August of 2017 and realized AI was obviously the thing I should be doing. I asked the likes of Sergey Levine or Pieter Abbeel if they could advise me; they said no.
I threw all my energy into learning what I could about AI. I got a break to get advised by one of Sergey's post-docs in 2018 or 2019. I went all in on that, I fought for funding, I fought to have an AI paper.
This process worked out by the end of my Ph.D. in 2022: I had access to the Berkeley AI Research (BAIR) building and collaborations in the department. It was a bumpy road.
I wanted to go to industry research, to get a nice paying job with intellectual freedom, something like FAIR or Google Brain at the time. HuggingFace was the only job that fit that bill, it was easy to say yes to.
I joined HuggingFace in May of 2022 and wasted my time at the company until ChatGPT was released. I used my RL background to write a blog post on RLHF which went viral. HuggingFace decided it would be good for me to form a team around this success.
In 2023 I learned NLP and about language models. I had a lot of fun and built an initial community. I got burned out by working remote with a huge time difference. I met Luca Soldaini at ICML in Hawaii, where I was giving a tutorial on RLHF, and they told me Ai2 was hiring.
I got the job at Ai2 largely because of my excitement and how I was saying I wanted to do a lot of stuff that sounded cool to them but no one was likely to do (RL related things). My interviews were far from a sure thing ā this is a great job to land!
I started at Ai2 in October of 2023. I worked remotely for a while. I was doing normal research, I made the first reward model evaluation, RewardBench. It was a solid success, but nothing like how the pretraining team was getting ready to release the first Olmo.
I helped coach Ai2 on how to release models well, helping the Tülu 2 project land (the first model to do DPO well, publicly at the 70B scale).
The first Olmo was released in early 2024, I squeaked onto the papers just by trying to be helpful and doing some basic post-training. I was already good at paying attention to which projects are actually important.
That summer I started rounding everyone up to do a "big frontier post-training project." This became Tülu 3, one of my favorite projects ever released, in fall of 2024. The goal was to beat Llama 3's post-training with their own base model. The team morale was incredibly high and the execution was so timely, allowing us to coin the term Reinforcement Learning with Verifiable Rewards (RLVR) in the paper.
The crazy lengths I went to get the Tülu 3 and Olmo 2 post-training done had me sending 40% more Slack messages than anyone at the company and got me the award "The Cat Herder."
2025 was a much simpler year. We were too slow to react to reasoning models, given we had been doing similar stuff with Tülu 3, but sometimes that happens.
Originally we wanted to release Olmo 3 by June or July of 2025. That obviously didn't happen, but we got the slim chance to train a bigger model, and it really landed. We threaded the needle.
Since Olmo 3 was released, it was clear that some changes were coming and I personally never got a big post-training project off the ground after that. Many other people managed great work in the spring of 2026.
This all leaves me here today showing you that only about half of my story at Ai2 is what I was known widely for, and the rest was building momentum. It often takes a year of building relationships and direction before really big successes can happen in a career.
I was just about a nobody when I joined Ai2 and I got to join a team that was willing to learn from the skills I had brought from HuggingFace. With how media works, I often think I get more recognition than I deserve for Ai2's success.
The likes of Tülu 3, Olmo 2, and Olmo 3 felt like generational team efforts. The amount of personal successes and breakthroughs that happened for those projects is immense, and to sustain them over such a long time period is incredibly hard to replicate. The sum far exceeded the individual parts.
I've heard many times in the last few months how people wouldn't know about Ai2 if it wasn't for my writing. Statements like this are overblown, but they are partially true and reiterate how crucial building relationships and getting the word out is today.
When you write a plan that is feasible, the world bends towards that plan. When you convince people it's going to happen it only becomes more likely. Vision and compelling explanations are among the items in shortest supply in the tech industry. Often building the thing is easy and explaining it is hard. If no one knows about your work, the value is often close to 0. So much of building reputation is about building relationships with people who will receive your work.
Reflecting on all of this, I've had a shockingly linear path through my career to incremental success. I would expect the first 10 years of most careers to be in search of finding one opportunity as good as Ai2, and you will not always be able to seize it. There are some ways to create more opportunities.
I've discussed before how a large part of my rise is down to many more senior and more established scientists being drawn into the closed ecosystems at the same time as an immense swell in interest for AI. This created a power vacuum that I, and a few other prominent scientists that I think form my "generation", got to grow rapidly into.
The role of public scientists
With my work at Ai2 and Interconnects, I summarize my role and mission as trying to accomplish three things:
* Provide clarity in the evolution of frontier models. This is easiest when the science has caught up, but even applying a scientific lens to how the models are changing is very useful to building trust in the broader AI ecosystem.
* Create a vibrant and diverse open (model) ecosystem. This is crucial to mitigating some risks of AI, particularly concentration of power and myopia in studying frontier safety, which have motivated me now for 3-4 years. The risks haven't abated.
* To build institutions that create people and ideas that further the above missions, and generally mission-driven individuals that are willing to advocate and build a future they believe in. AI is a grand problem, and not one that I can do alone, so I need to build brands to rise through the noise and attract likeminded people.
At my best, I have many avenues for impact. I help open researchers work on impactful problems, not wasting the precious compute and time they have during the AI boom. I help policymakers know what is true. I build models that people use. I tell stories that make people smile. I keep the list wide so that I can stay motivated.
I see all of this continuing, and have been thinking about the broader impacts of this repeatedly over the last few months. Hearing that Andrej Karpathy was joining Anthropic prompted me to finally share more of my opinions:
For a long time, academic researchers being at the cutting edge of new technologies has been a great social equilibrium. Neutral, unbiased technologists have been the people to spread new ideas to the world.
As AI research takes off in velocity, it is also going behind closed doors. The tech industry has sowed distrust, and now they are the ones trying to tell the world about incredible changes coming. It's a big loss to a form of social contract in America.
There's been a history of scientists helping society understand new technologies. There is a public service in the culture of science that I want to see continue.
It's being exacerbated by feelings of FOMO, especially financially driven, where I'm seeing many people who previously wanted to be professors -- and likely still do deep down -- feel a need to conform and chase money, in a pocket of industry. I get it, I grapple with this.
For those with a safety net, there will be great returns to some who choose to zag, and try to build something good, for people who need something different. For me, this is building interesting, fully-open models, to show what you can do with a variety of open weight sizes.
Yes, AI's immediate future is dictated by the frontier, but its long-term trajectory still deeply includes academic institutions and open science. Knowledge will always diffuse, but to whom?
As of today, I think China is positioned to be the global home of AI research in a few years. The home of research is where ideas are accessible, spread rapidly, and are nurtured. The U.S. seems to be unwinding many institutions and relationships.
The largest returns go to people who build something differentiated, at least in reputation, and a lot of people are not being shown that this path exists.
To elaborate on this, I don't fault any of the individuals who are going to industry today. I've been very close to doing this myself in the past weeks of job searching, or rather job exploring. It's a systemic problem where scientists cannot easily get the support to take bold stances, especially stances that are designed around the public good.
To go a step further and say that only the research within closed, frontier labs matters is very myopic. Yes, there's a sort of research you can only do with vast compute resources, and they will directly impact the most revolutionary tools of the day. But, I see the relative opportunity to do good elsewhere as higher for plenty of people.
Open research will always be the standard that sets the language people use to understand AI. It'll always be how the next generation is trained, even if it's behind what industry has built. It'll be the ecosystem where new long-shot ideas are built. Without investing in this open ecosystem, all of these cycles will be kneecapped.
At the end of the day, so much of my role now is just showing the path to impact in this domain. To show how clever, mid-sized open models can impact real problems in the world. To show how policy-makers and educators need open research to structure the rest of society around AI. This is a fun role too! It would be very sad for me to see this light diminish ever further, into the lightest embers of a fire that looks almost entirely out.
Even if the pace of research were to slow further, if the folks remaining like myself got financial offers they can't refuse for their families' sake, the torch of open research will never fully go out. It's core to how science is taught and done. There is a next generation coming, they just look for guidance and role-models.
What's next
I see the best Ai2 work as research infrastructure. Building recipes in public gives countless researchers the ability to ask very specific questions of training processes. We need these researchers in the broader community, as Ai2 could never answer all the interesting questions themselves. One of my great joys in recent months has been visiting a top ML university and hearing so many graduate students say they're building on Olmo. This is how the world should work!
Going forward, I still plan to operate in similar spaces, fighting for open-science, imagining what the future of the open model ecosystem can be, and doing my best to make the social transition to an AI-native era smooth. I'm most excited by how you can train medium sized open models on specific tasks that become useful tools in complement to the frontier models, massively winning on price. I want to invest in the ecological diversity of open models and coordination across builders.
Unsurprisingly given my past focus areas, I'm watching the pace of releases from all labs, open and closed, and how they're hillclimbing on super ripe new post-training veins (on-policy distillation, agentic workflows, etc.). It's clear that fully-open post-training recipes are about as far behind as they ever have been and falling further behind. I'd like to fix this. It's not 100% clear yet if I will this year, but I'll try.
To do this best and to execute, mostly personally, I needed a new start and fresh perspectives. I'll be carefully building what I'm doing next over the next few months and am eager to share more about it when I can. One of my close teammates at Ai2 shared this quote with me in a farewell card, and I found it very apt in where I'm going next.
The object of life is not to be on the side of the majority, but to escape finding oneself in the ranks of the insane. - Marcus Aurelius
Thank you all for your continued support.
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe -
The largest debate that'll define the future balance of power between the open and closed AI model ecosystems is primarily economic: it's whether users of AI will continue to pay dramatically more, i.e. large margins, for the top closed models. Early 2026 is a seminal time for the AI industry, as the coding agents have shown the first area where a huge AI market will continue to pay a substantial premium for better intelligence.
The other side of this dichotomy is the inevitable decay of API businesses at these same labs. These labs will realize they need to protect their best models, rolling them out later in APIs to protect token supply, avoid distillation, and stick to use-cases with higher margins. All of these effects will be clearly visible in 5-10 year timelines, as in the near term markets, prices, margins, and demand will be dictated by a rapid buildout of compute (supply-limited in the near term) and mass subsidization of tokens (through continued investment in new AI companies).
The core of this argument rests in the obvious habit changes that are setting in with coding agents past the Opus 4.5 and Codex 5.2 thresholds. People are not making this switch because they are lazy, but because their net output is obviously higher when using an agent as an implementation aid for complex knowledge work. People who rely on coding agents to work will always pay more for the best rather than settle for good enough. There are so many ways to make the product better: speed, intelligence, specialized models, etc.
I would pay $2000/month for the tools today, especially knowing they'll get much better. At the same time, it is likely that many companies are forcing agents and usage onto people that actually will get very little out of them in their current form, which helps the AI buildout (or bubble) continue.
The best closed labs (right now this list is just Anthropic and OpenAI, but it's reasonable to expect Google to catch up) will always make the most efficient models for intelligence at a given cost. Building models is a mass capital investment of talent, data, and compute. These systems, a combination of model weights, harnesses, tools, and serving infrastructure, have massive returns on integration (where open models are designed to work across many, diverse serving situations). These integration benefits (the integration of hardware and new forms of software) can be expressed in any possible way of making models better.
The models in the near future may saturate on benchmark scores, but if that intelligence ceiling really is a cap on utility then the labs will optimize utility per second or per watt, serving users in another way. Improving the models is possible in every direction; there have been no walls in progress. We're early in the mass buildout of intelligence, which involves harnessing the physical world to build numerous datacenters, organizing many AI researchers so that a large team can contribute to one model, and of course solving many small, low-level puzzles that unlock performance. Every indication is that there is still meaningful performance to be unlocked and the closed labs are the best set up to extract it.
The collective wisdom of the labs is that making the models smarter, in terms of the frontier of absolute intelligence, has the most value. This is the right call to me because it unlocks large new markets. Optimizing models at a fixed intelligence level locks in markets, expands accessibility over time, and increases return on investment for users (while potentially lowering margins for selling intelligence).
Many people are making this bet that models will keep getting better and are learning to work well in these harnesses, even though some workflows are still a bit clunky. This is the right bet. These people all will continue to use the absolutely best models available. It's like buying an iPhone as a consumer. You could get an Android and suffer from a bunch of paper cuts to save money, but why would you? The returns to performance are even higher in the workplace, which drives pricing power.
In this mental model, the frontier labs, as businesses, will look like new, reimagined forms of a mix of Apple and Microsoft. The Apple side is that they're selling an integrated, extremely hard to replicate technology. The Microsoft side is selling high-leverage subscriptions across the economy. In 5-10 years I expect both OpenAI and Anthropic to be valued in the $2-10T range. The true frontier labs will be an oligopoly that looks like the cloud market today.
On the other side of this equation is the open model economy. This isn't to say that the frontier labs will dominate all aspects of AI use. Yes, I expect OpenAI and Anthropic to be the most representative companies of the AI boom (new companies, alongside Nvidia of course), but the collective value capture around open models will be far bigger overall; it's just that the revenue and margins will be shared across a wide stack of companies.
Many businesses want to switch to open models but the models today are not good enough in out-of-distribution tasks. Eventually open model builders will stop chasing Claude and GPT on the Artificial Analysis index and fill this niche. This fork could be driven by economic factors, where they no longer have the revenue to support the growing R&D costs for continuing to scale models. It can also be driven by pure demand, where certain AI solutions only can exist at low price points present in open models. Where closed labs are an oligopoly, open model builders and users will be far more diverse and numerous. The total market value will dramatically exceed the cumulative value of OpenAI and Anthropic.
Open models are by their nature not integrated, so they will rely on multiple companies coordinating to serve them. Each of these layers will have alternatives, driving prices down to commodity pricing. These low, predictable prices will be where many enterprises enter to build in-house agents and tools for niche tasks. The predominant mode of deployment here is that enterprises find a model that hits a sufficient performance threshold on a task of interest and do not replace the model later (setup costs are high). As customizing models becomes easier, as in the open model finetuning stack we are seeing emerge (Tinker, Fireworks, Prime Intellect, etc.), this market becomes even bigger.
What this will look like in the coming years is a steady rise in the proportion of open model inference across the entrenched hyper-scale clouds of Google, Amazon, and Microsoft and the new AI infrastructure companies of Together, Fireworks, OpenRouter, etc., when compared to OpenAI and Anthropic.
The key is that the open and closed model economies are operating on different exponentials. I still believe that progress will continue at a fast pace across the entire ecosystem, but claims of recursive self improvement (RSI) giving the closed labs an unassailable advantage are overblown. New forms of products like background agents can support both these open and closed models.
The closed models hit incredible product-market fit with the current agents, starting their integrated exponential by monetizing the top end of the knowledge work. The open model economy will take far longer, but it will also be far more satisfying to follow, as it tracks the broader diffusion of AI into the entire economy and world.
-
As the years of AI progress go by, they have been accompanied by a slowly rising tide of consequence. Models are getting more capable, how we work is changing quickly, economics of AI are becoming real, just as real-world risks come to the forefront. 2026 is the first year where I don't think there'll be any breaks from this. The hard part to prepare for is that there's a good chance things just continue to ratchet up from here: more disruption, more surprises, more stakes.
On my end, there's been a growing list of topics that are very fateful to how I see the current state of AI, but I haven't even gotten to write about them (at least not from all the angles I want to)! All of these are closely related to the implications of different models reaching new capability levels and how I use that to infer what may come next.
1. Open models haven't had their true agent moment like Opus 4.5
The time gap between open and closed models is very often discussed, but the reality is that we have a nice time-gating that's independent of debatable benchmarks: whether open-weight models do or do not become super useful in agentic harnesses. The Opus 4.5 in Claude Code moment of December 2025 was so loud and obvious that if open models hit this performance level for price points as low as $5/month, there will be an explosion in usage.
Right now we are about 5-6 months in with no equivalent open model. I suspect the robustness of the best closed frontier models that I write about could make this moment take a good amount longer, say closer to 12+ months. In this time, Claude Code and Codex may seem like different categories of products. In the standard flurry of new, state-of-the-art open models from a variety of labs, benchmarks will definitely keep climbing, but the open-closed gap should become more interpretable as real-world use becomes the real litmus test.
2. Gemini still doesn't have a meaningful competitor for Claude Code and Codex
The best exclamation point I can offer to reinforce my prediction that open models are further behind than the benchmarks claim is that even the mighty Google doesn't have a clear competitor for Claude Code and Codex. I'm sure the Gemini team is pushing very hard on this.
I still need to do a lot more testing on Gemini 3.5 Flash, but reading reviews makes it clear that it's not a substitute for how I'm working today. It's maybe not the Gemini team explicitly specializing for Google's existing products (search, YouTube, etc.), but the model seems to suit them. If Google doesn't have a powerful tool here soon, I don't expect the open model labs to either. The open models are going to be used more for automated, enterprise agents and low-cost domains, rather than being the driving tool of modern knowledge work. This will feed directly into the economic engine of funding future models, where the agents like Claude Code and Codex are the current best path to massive AI revenue growth.
I discussed on AI Proem with Grace Shao how the current environment is quietly driving labs in China to specialize, and this is central to my expectations of open models specializing over the next few years instead of competing with OpenAI, Anthropic, and Google.
3. I don't expect an open-weights Mythos this year
While I don't think Mythos is a general "god model" that will crush the competition in every domain, I do think it's a remarkable technical achievement in software engineering and cybersecurity. Mythos is obviously a watershed moment for those fields. Having spoken to most of the Chinese labs, particularly those with the most prominent, large, open MoE models like Kimi, Z.ai, DeepSeek, and Qwen, I think they're heavily resource limited and don't have an immediate path to scaling up training processes like the big labs in the U.S. For the labs which are more corporate, which comes with more resources, such as Alibaba and Bytedance, they also have more conservative stances on safety and security.
Mythos is a bellwether of the massive acceleration in training and research compute available to the largest American companies.
Epoch AI recently had a nice piece on the compute available to various labs (~Google 25%, Meta 11%, OpenAI 11%, Anthropic 6%). All of these numbers are vastly higher than any Chinese lab.
4. American open models are slowly gaining steam
Nvidia with Nemotron, Google with Gemma, Arcee AI and others are slowly stabilizing the open model ecosystem in the U.S. There's a lot that's hard to measure here, especially in the rise of local agents like OpenClaw and Hermes, but there are adoption numbers of American models that we haven't seen since Llama 3.
Gemma 4's models are all tying or outperforming the equivalently sized Qwen 3.5/3.6 models, where Qwen has for years now been the default open model at these sizes. These Qwen 3.5/3.6 models have been tricky to get working in a lot of post-training research, partially due to architecture/tooling and partially likely due to modeling (i.e. the model is not easy to finetune for some training decision). I've heard few complaints about Gemma, but it also could be because Gemma is not yet the researcher default.
There's a simple reality that we've seen recently with models like GPT-OSS, Nemotron 3, and now Gemma 4, that if a model is in the right range of benchmarks and released by an American lab with a truly permissive license, it'll get a large amount of adoption (in this cycle, recall that Gemma 4 adopted the Apache 2.0 License, changing from one with use-case restrictions on earlier Gemmas). This early phase of American growth in open models is establishing key brands directly with developers. The consensus is that more neolabs like Reflection and Thinking Machines are likely to participate in this space, but being too patient will lose the time when new agentic workflows and enterprise relationships are built.
5. Anthropic and OpenAI are just getting up to speed in model iterations
I expect the rest of this year to be a ruthless competition between these two flagship companies. I'm at an interesting balance where I think GPT 5.5 is a bit smarter of a model and I love the Codex App, so I'm structuring much of my work to be possible there. At the same time, for a lot of writing-related and broader surface area tasks I really still love Claude. These models are rapidly changing how we work: I run Codex from my phone while doing other things, am setting up automated open model analysis jobs on the back of agents, and expect to be able to scale the research side of Interconnects widely.
AI is beginning to drive companies to the two extremes in the scaling era. The biggest companies will be way bigger than ever, using resources and mass talent to have sustained progress at the frontier of raw AI capabilities. On the other side, tiny businesses like Interconnects thrive by using agents to refine, present, and sell niche expertise. The mass social job displacement that'll come is going to reduce employability for various knowledge workers that don't fit into either of these extremes for the raw technical side (big or small companies), while sustaining and maybe even amplifying careers that interface directly with humans (e.g. doctors) or other power structures with means to sustain themselves (law/government).
6. More existing power structures will assert themselves on AI
Just in the last few days while writing this, we had the Pope release an over 40,000 word document on where AI is going and China expand personnel movement restrictions on top AI researchers across industry. At the same time, the U.S. has designated Anthropic a supply chain risk and continues to use its models for national security. The list of news like this is only going to grow. Existing power structures are realizing there's a finite time window for them to exert themselves in the AI dynamic, an intuition that could be mapped to influence going down as AI models get more powerful. This intuition is potentially dangerous, as it sets up meaningful conflict in who controls the technology (as I discussed with Dean Ball after the Anthropic-DoW spat).
Next: Where technical becomes social
These largely technical and power trends accelerating are going to put more pressure on the social and political anti-AI sentiments within the U.S. This is currently the most obvious barrier to continued AI development and beneficial diffusion. Reflecting on this, many people in the tech discourse get too focused on the details, where yes a lot of data-center-detractors are making genuinely wrong factual claims in defense of their position.
The real position that a large swath of Americans has is that they have a voice in saying no to the current trend, by not granting permission to build data centers. This is a voice that they haven't been granted by the tech industry that changed the face of the global economy and power structures in the last few decades.
This is setting us up for a challenging year ahead for the industry. The labs are aggregating and concentrating talent to peak levels. There are few neutral messengers to communicate the reality of AI to the public. The frontier labs' leadership is largely gearing up to IPO and stay ahead in the capabilities race. With the status quo, there are few actions to unwind this path toward social conflict.
It takes individuals in the AI ecosystem to zag and go against the groupthink of needing to make your wealth today, of needing to be at a lab to do impactful work, and so on. I'm personally continuing to bet on this, by trying to make a vibrant and diverse open model ecosystem supported by clear, unbiased information. If you agree with this and have been watching from the sidelines, it's a good time to get involved, before the situation spirals into something uncontrollable.
-
Staring out the window on a new, high-speed train from Hangzhou to Shanghai I'm gifted with views of dramatic ridgelines speckled with wind turbines that are silhouetted against the setting sun. The mountains cast a backdrop to a mix of spanning fields and clustered skyscrapers. I'm returning from China with great humility. It's a very warming, human experience to go somewhere so foreign and be so welcomed. I had the honor of meeting so many people in the AI ecosystem who I knew from afar, and they greeted me with big smiles and cheer, reminding me how global my work and the AI ecosystem is.
The mentality of Chinese researchers
The Chinese companies building language models are set up as the perfect fast-followers for the technology, building on long-standing cultural traditions in education and work, along with subtly different approaches to building technology companies. When you look at the outputs, the latest, biggest models enabling agentic workflows, and the ingredients, excellent scientists, large-scale data, and accelerated computing, the Chinese and American labs look largely similar. The lasting differences emerge in how these are organized and conditioned.
I've long thought that a reason that the Chinese labs are so good at catching up and keeping up with the frontier is that they're culturally aligned for this task, but without talking to people directly I felt like it wasn't my place to attribute substantial influence to this hunch. Speaking with many wonderful, humble, and open scientists at the leading Chinese labs has crystallized a lot of my beliefs.
So much of building the best LLMs today comes down to meticulous work across the entire stack, from data to architecture details and RL algorithm implementations. All points of the model can give some improvements, and fitting them in together is a complex process where the work of some brilliant individuals needs to get shelved in favor of the overall model maximizing a multi-objective optimization.
Where American researchers are obviously also brilliant at solving the individual components, there's more of a culture of speaking up for yourself in the U.S. As a scientist, you're more successful when you speak up for your work, and modern culture is pushing the new path to fame of "leading AI scientists". This results in direct conflict. The Llama organization is heavily rumored to have collapsed under the political weight of these interests embedding themselves in a hierarchical organization. I've heard of other labs saying that it can be necessary to pay off a top researcher to get them to stop complaining about their idea not making it into the final model. Whether or not that's exactly true, the idea is clear. Ego and desires for career advancement do get in the way of making the best models. A small, directional shift in this sort of culture between the U.S. and China can have a meaningful impact on the final outputs.
Some of this has to do with who is building the models in China. There's an immediate reality at all of the labs that a large proportion of the core contributors are active students. The labs are quite young, and it reminds me of our setup at Ai2, where students are seen as peers and directly integrated in the LLM team. This is incredibly different from the top labs in the US, where the likes of OpenAI, Anthropic, Cursor, etc. simply don't offer internships. Other companies like Google nominally have internships related to Gemini, but there's a lot of concern about whether your internship will be siloed and away from anything real.
To summarize how the slight change in culture can improve the ability to build models:
* More willingness to do non-flashy work in order to improve the final model,
* People new to building AI can be free of prior phases of AI hype cycles, allowing them to adapt to the new modern techniques faster (in fact, one of the Chinese scientists I talked to really actively attached to this strength),
* Less ego enabling org charts to scale slightly, as there's less gamifying the system, and
* Abundant talent well-suited to solving problems with a proof of concept elsewhere, etc.
This slight inclination towards skills that complement building today's language models stands in contrast to a known stereotype that Chinese researchers tend to produce less creative, field-spawning, 0-to-1 academic style research. Among the more academic lab visits on our trip, many leaders talk about cultivating this more ambitious research culture. At the same time, some technical leaders we talked to were skeptical about whether such a rewiring in the approach to science is likely in the near term, because it'll take a redesign of the education and incentive systems that is too big to happen within the current economic equilibrium. This culture seems to be training students and engineers that are excellent at the LLM building game. They also, of course, have an extremely abundant quantity.
These students told me about a similar brain drain happening in China as in the U.S., where many who previously considered academic paths now intend to stay in industry. The funniest quote was from a researcher who was interested in being a professor to be close to the education system, but remarked that education is solved with LLMs: "why would a student talk to me!"
The students have a benefit of coming at LLMs with fresh eyes. Over the last few years we've seen the key paradigm of LLMs shift from scaling MoEs, to scaling RL, to enabling agents. Doing any of these well involves absorbing an insane amount of context quickly, both from the broader literature and the technical stack at your company. Students are used to doing this and excited to humbly drop all presumptions about what should work. They dive in head first and dedicate their life to getting the chance to improve the models.
These students are also so magically direct and free of some of the philosophical chatter that can distract scientists. When asking questions on how they feel about the economics or long-term social risks of models, far fewer Chinese researchers have sophisticated opinions and a drive to influence this. Their role is to build the best model.
This difference is subtle, and easy to deny, but it is best felt when having long conversations with an elegant, brilliant researcher who can clearly communicate well in English: basic questions on more philosophical aspects of AI hang in the air with a simple confusion. It's a category error to them. When I probed in these areas, one researcher even quoted the famous Dan Wang premise of China being run by engineers, relative to the lawyers of the U.S., to emphasize their desire to build. There's no track in China that systematically enables the growth of star power for Chinese scientists, akin to mega mainstream podcasts like Dwarkesh or Lex.
Trying to get Chinese scientists to comment on the coming economic uncertainty fueled by AI, questions beyond the capabilities of simple AGI, or moral debates on how models should behave all served to capture the upbringing and education of these scientists (edited). They are extremely dedicated to their work, but have grown up in a system where debates and opinions on how society should be structured and changed are not encouraged.
Zooming out, Beijing especially felt much like the Bay Area, where a competitive lab is a short walk or Uber away. I got off a flight and stopped by Alibaba's Beijing campus on the way to the hotel. Then, in 36 hours we went to all of Z.ai, Moonshot AI, Tsinghua University, Meituan, Xiaomi, and 01.ai. Travel by Didi is easy, and if you select an XL in China you're often paired with electric mini vans that have massage chairs. We asked the researchers about the talent wars, and they said it's very similar to what we're experiencing in the U.S. It's normal for researchers to bounce around, and much of where people choose to go is based on the best current vibes.
In China, the LLM community feels far more like an ecosystem than battling tribes. Across many off the record conversations, it's nothing but respect for peers. All of the Chinese labs fear Bytedance with their popular Doubao model, which is the only frontier closed lab in China. At the same time, all of the labs have massive respect for DeepSeek as the lab with the best research taste in execution. When you meet with lab members off the record in the States, sparks fly quickly.
The most striking part of the humility of Chinese researchers is how they also often shrug on the business side, saying it's not their problem, where everyone in the U.S. seems to be obsessed with various ecosystem-level industrial trends, from data sellers to compute or fundraising.
Where China's AI industry differs from (and matches) the Western labs
The thing that makes building an AI model today so interesting is that it's not just about getting a group of great researchers in one building together to produce an engineering marvel. It used to be this, but to sustain AI businesses, the LLMs are becoming a mix of building, deploying, funding, and getting adoption for this creation. The leading AI companies exist in complex ecosystems that supply money, compute, data and more in order to keep pushing the frontier.
The integration of these various inputs to creating and sustaining LLMs is fairly well conceptualized and mapped for the Western ecosystem, as typified by Anthropic and OpenAI, so finding big differences in how the Chinese labs think about it points at where the different companies can be making meaningfully different bets on the future. Of course, these futures can be heavily dictated by the constraints on funding and/or compute.
I've documented the biggest "AI industry"-level takeaways from talking to these labs:
* Early signs of domestic AI demand. There's a much-touted hypothesis that the Chinese AI market will be smaller because Chinese companies don't tend to pay for software, and thus will never unlock a giant inference market to support the labs. This is only true for software spend that maps to the SaaS ecosystem, which is historically tiny in China; on the other hand, there is obviously still a large cloud market in China. A crucial unanswered question, one which the Chinese labs themselves debate, is whether spending on AI in the enterprise tracks the SaaS market (small) or the cloud market (fundamental). On net, it feels like AI is trending closer to the cloud, and no one was actively worried about whether a market would grow around the new tools.
* Most developers are Claude-pilled. Most of the AI developers in China are obsessed with Claude and how it's changed how they build software, despite Claude nominally being banned in China. Just because China has historically been hesitant to buy software does not give me the impression that there won't be a massive surge in inference demand. Chinese technical staff are so practical, humble, and motivated, a fact that seems stronger than any commitment to previous habits of not spending. Some Chinese researchers mention building with their own tools, such as the Kimi or GLM CLIs, but all of them mention building with Claude. There were also surprisingly few mentions of Codex, which is definitely surging in popularity in the Bay Area.
* Chinese companies have a technology ownership mentality. The Chinese culture is combining with a roaring economic engine to create unpredictable outcomes. I'm left with a lasting feeling that the numerous AI models reflect a practical, current equilibrium of the many technology businesses here. There's no master plan. The industry is defined by a respect for ByteDance and Alibaba, the incumbents expected to win large portions of all markets with their substantial resources. DeepSeek is the respected technical leader, but far from a market leader. They set the direction, but aren't set up to win economically. This leaves companies like Meituan or Ant Group, where people in the West can be surprised they're building these models. In reality, they see LLMs as obviously central to future technology products, so they need a strong base. Building a strong, general-purpose model in the open hardens their stack, since the open community provides feedback on it, and they can keep internal, fine-tuned versions of the model for their products. The "open-first" mentality in the industry is largely defined by practicality: it helps their models get strong feedback, it gives back to the open-source community, and it empowers their mission.
* Government aid is real, but it is unclear how big. It's often asserted that the Chinese government is actively helping with the open LLM race. This is a government that's decentralized across many levels, none of which has a clear playbook for what exactly to do. Neighborhoods in Beijing compete for tech companies to house their offices there. The "help" offered to these companies almost certainly involved removing bureaucratic red tape like permits, but how far does it go? Can levels of the government help attract talent? Can they help smuggle chips? Across the visit, there were many mentions of government interest or help, but far too few details to report assertively or to form a confident worldview of how government can bend the trajectory of AI in China. There were certainly no hints of the top levels of the Chinese government influencing any technical decisions in the models.
* The data industry is far less developed. Having heard so much about the likes of Anthropic or OpenAI spending $10M+ for single environments, with cumulative spend on the order of hundreds of millions per year to push the frontier of RL, we were eager to know if Chinese labs are either buying the same environments from companies in the U.S. or supported by a mirrored domestic ecosystem. The answer was not quite that there's no data industry, but rather that, in their experience, the data industry is relatively poor quality and it is often better to build the environments or data in-house. Researchers themselves spend meaningful time making the RL training environments, and some of the bigger companies like ByteDance and Alibaba can have in-house data labelling teams to support this. This all mirrors the build-not-buy mentality from the previous bullet.
* Desperation for more Nvidia chips. Nvidia compute is the gold standard for training, and everyone is limited in progress by not having more of it. If supply were there, it is obvious that they would buy it. Other accelerators, including but not limited to Huawei, were spoken of positively for inference. Countless labs have access to Huawei chips.
These points paint a very different picture of an AI ecosystem, where quickly mapping how Western labs operate to their Chinese counterparts will often result in a category error. The crucial question is if these different ecosystems will produce meaningfully different types of models, or if the Chinese models will always be explained by being similar to the U.S. frontier models of 3-9 months ago.
Conclusion: The global equilibrium
I knew so little about China going into the trip and came out with the feeling of just starting to learn. China isn't a place that can be expressed by rules or recipes, but one with very different dynamics and chemistry. The culture is so old, so deep, and still completely intertwined with how domestic technology is built. I have much more learning ahead.
So many of the current power structures in the U.S. use their current worldviews of China as crucial mental devices for decision making. Having talked in person, either formally or informally, to pretty much every leading AI lab in China, I found a lot of qualities and instincts in China that'll be very hard to model with Western decision making. Even after asking directly about why these labs release their top models openly, the intersection between ownership mentality and genuine ecosystem support is hard for me to connect the dots on.
The labs here are practical and not necessarily absolutists around open-source, where every model they build would be released openly, but there's a deep intentionality in supporting developers and the ecosystem, and in using open releases as a way to learn more about their models.
Almost every major Chinese technology company is building their own general purpose LLMs, as we see with the likes of Meituan (delivery service) and Xiaomi (broad consumer technology company) releasing open weight models. The equivalent companies in the U.S. would just buy services. These companies aren't building LLMs out of a race to be relevant with the hot new thing, but out of a deep fundamental yearning to control their own stack and develop the most important technologies of the day. When I look up from my laptop and always see bunches of cranes on the horizon, it obviously fits in with the broader culture and energy around building in China.
The humanity, charm, and genuine warmth of Chinese researchers is extremely humanizing. At a personal level, the cut-throat geopolitical conversation we're used to in the U.S. hasn't permeated them at all. The world can use more of this simple positivity. As a citizen of the AI community, I currently worry more about the fissures appearing within members and groups around labels of nationality.
I'd be lying if I said I didn't want US labs to be clear leaders in every part of the AI stack, especially with open models where I spend my time. I'm American, and that's an honest preference. With this, I want the open ecosystem itself to thrive globally, as this can create safer, more accessible, and more useful AI for the world, and right now the question is whether American labs will take the steps to own that leadership position.
As of finishing this piece, more rumors are swirling of executive orders influencing open models, which can further complicate this synergy between American leadership and the global ecosystem. It doesn't fill me with confidence.
Thank you to all the wonderful people I got to talk to at Moonshot, Zhipu, Meituan, Xiaomi, Qwen, Ant Ling, 01.ai, and others. Everyone has been so welcoming and gracious with their time. I'll keep sharing my thoughts on China as they crystallize, across culture generally and AI specifically. It is obvious that this knowledge will be directly relevant to the story unfolding at the frontier of AI development.
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe -
"Distillation attacks" is a horrible term for what is happening right now. Yes, some Chinese labs are hacking or jailbreaking APIs to attempt to extract more signal from model APIs, and stopping this is important to maintain the U.S.'s lead in AI capabilities. But referring to this as a distillation attack is going to irrevocably associate all distillation with this behavior, and distillation generally is a core technique needed to diffuse AI capabilities broadly through academic and economic activities.
We went through this sort of language transition with the open source vs open weight debate. All the terms just reduced to open models, and very few people in the large AI community know exactly how open-source differs from open-weights. And terminology matters, as the less informed people who still care about, and influence, the technology are bound by the different terms they use. If we're not careful with the discourse around distillation, many people could associate this broad technique, used for research and development of new models, with an act at the boundary of corporate manipulation and crime.
I've recently written a more technical piece on estimating how impactful state-of-the-art distillation methods are on leading Chinese models, and this piece follows to push for caution in any hasty actions to target the methods with policy. To set the stage, recall Anthropic's recent blog post where they detailed "distillation attacks" made by 3 Chinese labs.
These labs used a technique called "distillation," which involves training a less capable model on the outputs of a stronger one. Distillation is a widely used and legitimate training method. For example, frontier AI labs routinely distill their own models to create smaller, cheaper versions for their customers. But distillation can also be used for illicit purposes: competitors can use it to acquire powerful capabilities from other labs in a fraction of the time, and at a fraction of the cost, that it would take to develop them independently.
This is a clever paragraph, where they normalize distillation generally and explain how a few people can use it illicitly, without detailing how illicit use often involves other more explicit behavior like jailbreaking, hacking, or identity spoofing of the API.
Distillation itself is an industry standard. It's used extensively, primarily in post-training, by smaller players to create specialized or smaller models. In my book coming this summer, I describe it as follows:
The term distillation has been the most powerful form of discussion around the role of synthetic data in language models. Distillation as a term comes from a technical definition of teacher-student knowledge distillation from the deep learning literature.
Distillation colloquially refers to using the outputs from a stronger model to train a smaller model.
In post-training, this general notion of distillation takes two common forms:
* As a data engine to use across wide swaths of the post-training process: Completions for instructions, preference data (or Constitutional AI), or verification for RL.
* To transfer specific skills from a stronger model to a weaker model, which is often done for specific skills such as mathematical reasoning or coding.
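To make the technical definition concrete, here is a minimal sketch of the classic teacher-student objective, where the student is trained to match the teacher's temperature-softened output distribution. It is an illustration in plain Python, not any lab's pipeline: the function names, the temperature value, and the toy logits are all assumptions made for the example. The colloquial LLM usage described above usually means training on text sampled from the stronger model rather than matching its logits.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw logits into probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the usual convention for keeping gradient magnitudes
    comparable across temperatures; the default of 2.0 is an arbitrary
    choice for this sketch.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical next-token logits over a three-token vocabulary.
teacher = [2.0, 1.0, 0.1]
good_student = [2.0, 1.0, 0.1]  # matches the teacher exactly
poor_student = [0.1, 1.0, 2.0]  # ranks the tokens in reverse

print(distillation_loss(teacher, good_student))
print(distillation_loss(teacher, poor_student))
```

A student that reproduces the teacher's distribution gets a loss of zero, and the loss grows as the two distributions diverge, which is the signal a training loop would minimize.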
With this definition, it's easy to see how distillation takes many forms. Of course, if you just take the outputs from GPT-5.5 and train a recent open-weight base model with them to host a competitive product, that's one thing. But, a lot of the things that fall under the bucket of distillation are complex, multi-stage processes that muddle the exact impact of the model you distilled from.
Modern LLM processes could look like using a GPT API to build an initial batch of synthetic data to build a specialized small data-processing model. A good example is a model like olmOCR (or many other models in this category) that are trained to convert PDFs to clean text. This specialized model would be used to create large amounts of data. Finally, you train another model (often from scratch) with the new data you created. Is this final model distilled from GPT?
When done via a closed, API-based model, distillation sits in the grey area of the terms of service that you agree to when signing up to the Claude or GPT platform. They generally forbid the use of the API to create competing language model products, but this term has largely gone unenforced. The open-source community used to worry deeply at being cut off from these cutting-edge APIs for doing research or creating public datasets, but to date only one prominent case of corporate accounts being restricted exists (at least until the recent Chinese companies).
This is all to say that distillation is an industry standard technique, and the use of closed APIs to perform distillation has always been a grey area. Nvidia's latest Nemotron models, as one of the only models with open post-training datasets, are technically in large part distilled from Chinese, open-weight models. The Olmo models we've built at Ai2 are distilled from a mix of open and closed models. This grey area was brought to the forefront again when it turned out that xAI has been distilling from OpenAI. Quoting from the recent trial proceedings between Elon and OpenAI:
OpenAI's counsel asked Musk whether xAI has ever "distilled" technology from OpenAI.
Musk: "Generally AI companies distill other AI companies."
"Is that a yes?" Savitt asked.
Musk: "Partly."
xAI is likely the largest and most successful AI company willing to tread the grey area that is distillation from their competitors. On the other side, the majority of startups and research groups with fewer resources than them have very likely engaged in distillation in some capacity from Claude, GPT, or Gemini models.
Interconnects AI is a reader-supported publication. Consider becoming a subscriber.
In the above Anthropic blog post, the problem with the distillation attacks by a few Chinese labs is less the distillation and more the means of attack. It is documented that Chinese labs are actively working to get around the intended use of the API, e.g. to provide additional reasoning data that is very useful for training.
Of course no one should be able to access information from a model that a developer didn't intend to reveal in their APIs (e.g., reasoning traces which would be helpful for training). But associating all of distillation, which is to date an industry standard for post-training from open and closed models alike, with these attacks will be a massive own goal.
What these few labs are doing should be referred to as jailbreaking or abuse, rather than distillation.
The discourse around these actions is creating a troubling discussion that's marching towards a mix of regulatory capture and regulatory exuberance that's most likely to harm the U.S.'s ecosystem more than China's. Even if we ban this type of API abuse, most likely through potential legal action and other penalties, the Chinese companies will likely still do it. We've seen this playbook with Chinese multimedia models taking a flexible view of copyrighted content that no U.S. player is willing to take the risk on.
This distillation discussion has quickly snowballed, with a bill moving out of a committee in Congress, an executive order pushing for action, and congressional oversight targeting U.S. companies building on Chinese models (which are downstream of distillation). This multi-pronged regulatory environment could yield truly horrible outcomes, such as figuring out a way to effectively ban open-weight models in the U.S. that are built in China by groups abusing closed LLM APIs.
It is obvious that no bill will literally ban open models, but they can create grey area that exposes entities to unwanted risk or require certain provisions that are bureaucratically very challenging to fulfill, squashing small open source contributors.
In that scenario, the groups who lose are Western academics and smaller companies building models for the long-tail of AI uses. The ecosystem here could be made permanently irrelevant with the removal of nearly all Chinese open-weight models. There is no immediate substitute and building new models with meaningful community adoption has a lead time measured in 6+ months. In the time it takes to build a new domestic open-source ecosystem, countless researchers would've moved onto closed training platforms or into new areas.
Altogether, I'm hoping this flurry of discussion around distillation becomes a nothing-burger and not a hasty, multi-pronged policy push. We need to avoid two things:
* A wholesale negative connotation of the word distillation, which is used extensively across the AI ecosystem.
* A domestic ban of the open-weight models built by organizations engaged in some portion of distillation.
In addition to this, I want the leading U.S. AI companies to be able to provide their APIs without having their IP leak. They should share more information on why it is hard for them to secure their APIs, but that's an issue out of scope for my expertise.
I'll conclude with a proposal from my friend Kevin Xu at Interconnected Capital (and great Substack) on why this current distillation dynamic may actually be good for the leading labs.
If all the Chinese companies are addicted to distillation as a way of getting close to the frontier, then they'll never actually learn the techniques needed to take an outright lead. If we cut off the Chinese labs' obvious crutch in model building, we'll gain a short-term lead in AI, but in the long-term that may be what they needed to get on a more competitive long-term trajectory.
This is the same debate we're having with other technologies where the U.S. currently has a lead, e.g. with advanced semiconductor technologies. So I understand the trade-offs, but we should not crack down on all of distillation.
-
We're living through the period of time when we'll learn if open models can keep up with closed labs. The obvious answer is that no, they won't. This answer is a form of saying they won't keep up in every area. This framing closes off a popular prediction where the open models completely catch up, as in all models saturate and open and closed models only become increasingly similar. In living through this, it's evidently very unclear when the longer-term stable balance of capabilities will solidify.
This is a very complex dynamic, where the core point we monitor is a capability gap between models. At the same time, this gap is intertwined with evolving dynamics in the funding of open models, who builds open models, how techniques like distillation that enable fast-following translate through new application domains, potential regulation hampering the open-source AI ecosystem, and of course who actually uses open models.
The capabilities gap is one signal in a complex sea of forces, pushing supply and demand into different shapes. In many cases the demand (obviously tons of individuals, organizations, and sovereigns want, or need, open models) is largely separated from supply. Supply is fully dictated by economics. The question of "which business strategies support releasing open models" is still at stake.
With this complexity, I wanted to distill my key beliefs down into a clear list. These are downstream of 10+ pieces I've written or recorded on open models this spring (which are linked throughout).
* It's surprising that the top closed models did not show a growing capability margin over open models, based on compute differences for training and research, especially in the second half of 2025 and through today.
* Open model labs are technically very strong at keeping pace on well-established benchmarks. This will continue and reflects a balance of abundant talent and sufficient computing power.
* Chinese open-weight labs focus slightly more on benchmark scores than comparable closed labs in the U.S. Distillation helps the Chinese LLM companies do so, but it's not a panacea. Changes in the distillation dynamic (e.g. regulation) will not be a determining factor in the balance of capabilities. This increase in focus is a natural evolution of their incentives to keep alive the narrative of keeping up with the frontier, which is crucial to fundraising and adoption.
* To date, closed models tend to be more robust and generally useful than similarly scoring open models. Closed models have certain hard-to-measure qualities that are not well captured in current or past benchmarks. This will be key to enabling closed models to dominate in markets where an individual user constantly presents new challenges, i.e. supporting knowledge workers as a direct assistant.
* The open vs. closed model race, as monitored through benchmarks, will largely be a game of economic staying power and fast-following, until the market structure constricts. I expect Chinese open-weight labs to face funding difficulties first, as soon as later this year. Funding difficulties will be seen in different capability trajectories 3-9 months later.
* The RL dominated training era has increased the relevance of distribution to real-world use-cases as a key factor in continued capabilities improvements. These are tasks where users directly use tools like Claude Code or Codex to solve problems in their job with agents. This is the first clear technical area that closed labs can dominate open-weight models on capabilities, potentially leveraging online RL directly based on user feedback.
* Open models will be increasingly adopted in repetitive automation tasks, as measured in the relative share of the API market, for repetitive tasks across the ecosystem. This takes the form of many new AI-native applications, business backend automation, etc. The success of this will drive more investment in domain-specific, efficient open models.
This is a complex picture, where the long-term trajectory is more of an economics question than an ability one. Many other outlets can paint a far more simplistic narrative that "China will assuredly catch us in AI" and get more distribution because it is a simple story. The reality is complex. Only real AI revenue begets more investment, and eventually that'll be linked to the ability to keep improving models at a rapid rate. Economic realities have not yet impacted scaling open models, as a general category.
This economic-focused angle relates to my positions on the open model ecosystem more broadly.
* Recurring calls to ban certain types of open models will continue to come but are in practice impossible to implement. Training strong AI models (i.e. near but not at the frontier) is a relatively small cost compared to large-scale deployments. E.g. if the U.S. bans open models over a certain compute threshold, another sovereign entity will eventually train them and release them publicly, with the models entering the U.S. market with less oversight.
* The second derivative of influence on open models has shifted, and the U.S. will slowly regain ground in adoption metrics of open models starting in early 2027 (it takes a long time for China's velocity to slow, then flip). Examples include Google's Gemma 4 (a wild success), Nvidia's Nemotron, and Arcee AI.
* As ever-stronger closed models are built, previewed, and released, there will be more safety-shocks saying that open-weight versions of the strongest AI models never can be allowed to exist, similar to reactions to Claude Mythos. These can spur burdensome regulation on open models.
* With the above, there will also be increased long-term interest in open models, as sovereign entities and existing power structures realize the coming, super powerful AI tools cannot land in the hands of only one or a few companies. These entities will see open models as a different governance paradigm.
* New funding structures for open models will emerge, as many stakeholders realize dependencies on single, for-profit companies for access to intelligence are unreliable.
* Local agents, OpenClaw, and other personal agents represent a large, to date, mostly ignored market for open model usage. It is a sort of dark matter, with pervasive, massive potential for influence on the balance of open-to-closed models.
A single word governs this post and is intentionally repeated: complex.
This complex reality has been driving me to think more deeply about how to clearly describe the open model gap, and why I can hold it in my head that I expect American closed labs to clearly draw ahead, despite the fairly unequivocal evidence in support of the capabilities of recent open-weight models. More on the nuance in the open-closed gap in another piece coming soon, so please subscribe!
Let me know any positions that I missed.
-
Recently, I was talking with Percy Liang, Stanford professor and lead of the Marin project (another fully-open model lab), and it set in on me that there will eventually be a consortium of companies funding a foundational set of open models used across industry. It's not clear when this'll emerge, and Nemotron (Coalition) is Nvidia's attempt to bankroll and bootstrap this approach within a single wealthy company, but a consortium is the only long-term stable path to well-funded, near-frontier open models.
In recent months, we've seen a lot of turnover in open model labs, with high-profile departures at Qwen and Ai2 (my comment). This shouldn't be super surprising to followers of the ecosystem. It's happened before with Meta shifting its focus away from Llama, and it'll only happen more as the cost of trying to keep pace at the frontier of AI only increases. The other leading labs with models available today include Chinese startups such as Moonshot AI, MiniMax, and Z.ai, all of which look precarious on their ability to fund continued growth in the cost of training or R&D. Releasing one's strongest models openly today is in active tension with the option of spending focus and resources on AI products that can currently generate meaningful revenue (and profits).
We're going to see business models emerge around releasing some, or even many, models openly, but these will largely be smaller models that enable a long-tail of functionality, rather than models at the absolute frontier. This class of companies that'll release many, strong fine-tunable models will include the likes of Arcee AI, Thinking Machines, OpenAI, Google with Gemma, and more in that class. The cost and relative advantage of keeping the best models closed in a business environment with many opportunities for revenue are too high. To summarize: there will be an ever increasing number of companies releasing models that are good for creating a lively niche of smaller, custom models, but an ever decreasing number of companies willing to release fully open, near-frontier models.
This is the core thesis of why I'm pushing hard for more people to do more research on how these smaller models can complement the best closed agents, the science of finetunability, etc. See my related post, which is about creating a sustainable open model ecosystem, whether or not the frontier of open keeps pace with closed.
It'll take years for this equilibrium to become more obvious, seen through the lens of more open model families coming and going. This year, it seems likely we'll see Nvidia's Nemotron reach new heights, Reflection AI challenge some of the Chinese models with a strong, large MoE, maybe Meta releases a new open-weight model, and so on. True pressure to change strategy will only come when the capital environment punishes the less efficient spend on resources (e.g. giving away your competitive advantage, in having an in-house model). This pressure will likely hit Chinese startups training these models first.
All of Moonshot AI, MiniMax, and Zhipu AI will show signs of financial challenge in the coming years if they retain their strategy, on top of their models falling further behind the best open models in terms of generality. This is inevitable pressure to evolve open models to areas that are profitable and complementary of the frontier of AI.
Nvidia, which is best positioned to support the open ecosystem in the near term to support its core GPU business, could face many pressures to pull back its open model efforts. It could:
* Realize it's too competitive with its biggest customers as it succeeds too much with Nemotron.
* Fall to competition on its core business and lose the free cash flow buffer needed to fund this (e.g. it's 2031 and OpenAI, Anthropic, Google, and the other frontier labs are worth so much they build their own chips).
* Start succeeding beyond its initial goals and keep the chips for itself to build ASI, as a closed-weight model.
The pressures for new funding mechanisms for open models are based on the assumptions of continued, substantive progress on the capabilities of frontier models. Mechanisms such as self-improvement and scaling all stages of the training pipeline are underway. This progress of capabilities will only increase the potential profit in selling models as and in products, not giving them away. The scale of investment required has already begun to push away non-profits from the game of making truly frontier-scale models. Capitalism is designed to make companies ruthless and chase down leads on profitability, not donate technology as charity.
As the economic environment shifts companies away from releasing the strongest models openly, more companies that rely on these models will look for an outlet of securing model access into the future. This is going to be compounded by a growing group of companies who come to rely on open-weight models for their workflows.
These points loop back into how model training is getting more expensive, so where desire to have the models will go up, ability to procure them will go down for many players. There are x-factors that could multiply the demand for institutions to ensure the existence of open models, such as the best frontier models not even being available via API (such as if Claude Mythos never goes general access).
As training relevant models is shifting to cost billions of dollars, rather than millions, few companies will be able to afford it. Many companies will bite at paying 1/10th of the cost to train a frontier model, or if the consortium works, 1/50th. The upside for companies will be some mechanism to steer development (e.g. model sizes) or getting early access to develop internal and open-source tooling for the model.
It is in my nature to, by default, say this idea will fail, as training models is inherently a complex and high-focus endeavor, one that requires integration of every part of the stack and focusing specifically on your own vision and needs, rather than trying to serve every possible user. Eventually, though, the need for open intelligence, and the economic pressure to build it, will make a model consortium inevitable.
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
-
With the announcement of the Claude Mythos model this week and the admittedly very strong stated abilities, especially in cybersecurity, a new wave of anti open-weight AI model narratives surged. The TL;DR of the argument is that our digital infrastructure will not be ready in time for an open-weight version of this model, which will allow attacks to be conducted by numerous parties.
The backlash against open models in the wake of the Mythos news conflates too many general unknowns into a simple, broad policy recommendation that could actually further weaken cybersecurity readiness.
We've been here before: open-weight models were discussed as being extremely dangerous when OpenAI withheld GPT-2 weights in 2019, and when OpenAI released GPT-4 in 2023. Both of these waves came and went. The core mistake being made is the composition of two issues: 1) accepting the open-closed model gap as static in time and 2) linking open-weight viability generally to specific issues.
I've written at length recently on how I think that the best, frontier-level open-weight models are going to fall behind the best closed models in overall capabilities in the near future. I've also written about how the open-weight ecosystem needs to adapt to accept this reality. This is one of the times for the AI industry where I will repeat that it's a total blessing to have the 6-18 month delay from when a certain capability is available within a closed lab to it being reproduced in the open. It's a good balance of safety and monitoring the frontier of AI systems while allowing a useful open-source ecosystem to exist and thrive.
The core argument I've focused on in the open-closed model time gap has been in general capabilities, i.e. for general purpose, frontier models such as Claude Opus 4.X or GPT Thinking 5.X. The abilities of these closed models to robustly solve and work in diverse situations as agents remain out of reach of the best open-weight models. What the open-weight models have tended to be better at is quickly keeping pace on key benchmarks (which admittedly is helped to some extent, but not necessarily substantially, by distillation). This discussion is entirely different. It has to do with whether open-weight models can keep pace on the specific skills related to cybersecurity, and when we could expect an open version of this model to be available to the world.
The case of a Claude Mythos level open weight model is admittedly more nuanced to me than the previous few anti-open weight narratives the community has experienced. Where GPT-4 was about a more hypothetical risk, especially in areas like bio-risk, the clear and present reality of cyber infrastructure being prone to attack is far more tangible. Still, much of this nuance in the moment comes down to not knowing the full details of what the system can actually do (i.e. Mythos), and the state of the environment it would act in (i.e. our digital infrastructure).
To properly assess this risk, we need to know what it takes to build and deploy a Claude Mythos scale model. This entails three pieces: 1) training and releasing the weights, 2) the harness that gives the model effective tools it knows how to use, and 3) the inference compute and software.
(Below I make some model size & price estimates to show my thinking, these should not be taken as ground truth.)
Current estimates put the size ranges of leading models like Claude Opus 4.6 or GPT 5.4 as being around 3-5T parameters. Currently, the largest open-source models, which have been coming from Chinese labs, are around 1T parameters. Claude Mythos's preview pricing is 5X Opus, which could come from a simple multiplicative increase in active parameters (with the same serving system design), far higher inference-time scaling, more complex harnesses that make inference less efficient, lower utilization expectations, and so on. The simplest guess is that it's a mix of all of the above, something like 2X bigger in parameters and much less efficient to serve. That's a huge model, likely something similar to GPT 4.5, but actually post-trained well (GPT 4.5 was ahead of its time, infra-wise).
With size comes the challenge of actually training the model, as bigger models always come with new technical problems that must be solved to unlock the capabilities. For the case of cybersecurity, my guess is that most of the capabilities can be learned by training a model to be superhuman on coding. Unlike some capabilities such as knowledge work, medicine, law, etc., coding can be studied and improved substantially with public data like GitHub. I'm far more optimistic about open-weight models staying fairly close to the frontier in narrow domains of code execution and processing, but I don't understand the full scope of skills needed to be superhuman in cybersecurity understanding. How much expert knowledge and special sauce went into training Claude Mythos? That's a substantial source of my error bars on the impact.
Second, we know nothing about how the model works under the hood. Today, models are complex systems that entail far more than just weights. They require complex tools and infrastructure to run them, of which Claude Code is the one we are most used to. Mythos very likely has its own innovations here.
My estimate for how many GPUs you'd need to serve an 8T parameter, modern MoE is something like O(100) H100 GPUs, which costs something like $10K a day (and this may be very slow in terms of tok/s). Heck, the official marketing copy for the Nvidia GB200 NVL72 rack is "Unlocking Real-Time Trillion-Parameter Models." Does Mythos fit on one rack? The point isn't to rely on my specific estimate as a policy reference, but to repeat that running leading AI systems is very expensive and not something you can just do on a laptop or self-service cloud portals.
There are far fewer actors who can get their hands on these resources, relative to those who can download the model. Of course, there are still many, but it's important to flesh out all the details of what it would take to proliferate the capabilities of a Mythos-like model. In summary, tools like Mythos will give the best attackers more powerful tools of the trade, but it won't be handing a nuke to every teenager connected to the internet.
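To show how an order-of-magnitude estimate like this can be put together, here is a minimal back-of-the-envelope sketch. Every input is an assumption for illustration, not something stated above: weights stored at 1 byte per parameter (FP8), 80 GB of memory per H100, a 25% memory overhead for KV cache and activations, and a hypothetical rental price of $3.50 per GPU-hour.

```python
import math


def gpus_needed(total_params: float, bytes_per_param: float,
                gpu_mem_gb: float, overhead: float) -> int:
    """GPUs required just to hold the weights, padded by an overhead factor.

    `overhead` is a rough multiplier for KV cache and activations (assumed).
    """
    weight_gb = total_params * bytes_per_param / 1e9
    return math.ceil(weight_gb * overhead / gpu_mem_gb)


def daily_cost_usd(num_gpus: int, usd_per_gpu_hour: float) -> float:
    """Cost of renting the GPUs around the clock for one day."""
    return num_gpus * usd_per_gpu_hour * 24


# Assumed inputs: 8T params, FP8 weights, 80 GB H100s, 25% overhead.
n = gpus_needed(total_params=8e12, bytes_per_param=1.0,
                gpu_mem_gb=80.0, overhead=1.25)
cost = daily_cost_usd(n, usd_per_gpu_hour=3.5)  # hypothetical price
print(f"{n} GPUs, about ${cost:,.0f} per day")
```

Changing the precision, the overhead, or the hourly price moves the answer by small multiples, but not out of the "hundred-ish GPUs, thousands of dollars a day" regime.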
Personally, I do acknowledge there's a chance that cybersecurity abuse is a red line that makes releasing open-weight text models above a certain capability threshold morally grey. Many people thought this red line would come far earlier, somewhere in between GPT-2 and GPT-4, through the harm axis of mis/disinformation, but that had different bottlenecks. For image generation models, we're well past the first red line which is enabling non-consensual AI deepfakes with readily available open-weight models. We're balancing the reality of these fears having come and gone before with a technology that's becoming increasingly capable.
So, my second large source of error bars is "how bad is it actually" with respect to the state of cybersecurity. How much can humans clean up in the most important software with months of private access to a model like Claude Mythos? What will never get fixed?
For example, if we get open-weight models that are close to the capabilities of Claude Mythos, could those be fine-tuned by organizations to harden the security of their tools?
Currently, it's too soon to call it as a general reason to stop progress in open models. When Claude Mythos is restricted to so few partners, in some ways having strong open models close to the threshold makes assessing the danger easier. Having to rely fully on a single private company to determine the security of essential, international infrastructure is not a tenable equilibrium.
So, in conclusion, I urge people to further study three things:
* How do we measure cybersecurity-related capabilities across open and closed models? With this, are open models truly keeping up at a 6-9 month lag, or are they only maintaining performance relevance in other areas of coding?
* How do we independently measure the true impact of Claude Mythos and Project Glasswing on existing cybersecurity concerns?
* If it is the case that the models are keeping up and the defensive capabilities of Claude Mythos are weak, how do we better monitor (and if needed, try to regulate) the targeted capabilities of open-weight models in narrow domains?
The goal is to encourage fears about open models remaining very specific. Any general ban on open models in a nation will immediately and likely irrevocably remove that entity's ability to influence a crucial and amorphous technology. If we stop building the best open models in the U.S., then another country will do this and become the center of the technology. There's no way to fully kill open models, only to influence, understand, and steer them.
-
Having written a lot of model release blog posts, I find there's something much harder about reviewing open models when they drop relative to closed models, especially in 2026. In recent years, there were so few open models that when Llama 3 was released most people were still doing research on Llama 2 and super happy to get an update. When Qwen 3 was released, the Llama 4 fiasco had just gone down, and a whole research community was emerging to study RL on Qwen 2.5, so it was a no-brainer to upgrade.
Today, when an open model releases, it's competing with Qwen 3.5, Kimi K2.5, GLM 5, MiniMax M2.5, GPT-OSS, Arcee Large, Nemotron 3, Olmo 3, and others. The space is populated, but still feels full of hidden opportunity. The potential of open models feels like dark matter: a potential we know is huge, but with few clear recipes and examples for how to unlock it. Agentic AI, OpenClaw, and everything brewing in that space is going to spur mass experimentation in open models to complement the likes of Claude and Codex, not replace them.
Especially with open models, the benchmarks at release are an extremely incomplete story. In some ways this is exciting, as new open models have a much higher variance and ability to surprise, but it also points at some structural reasons that make building businesses and great AI experiences around open models harder than the closed alternatives. When a new Claude Opus or GPT drops, spending a few hours with them in my agentic workflows is genuinely a good vibe test. For open models, putting them through this test is a category error.
Something else to be said about open models in the era of agents is that they get out of the debate of integration, harnesses, and tools and let us see close to the ground what exactly the ability of just a model is. Of course, we can't test some things like search abilities without some tool, but being able to measure exactly the pace of progress of the model alone is a welcome simplification to a systematically opaque AI space.
The list of factors I'd use to assess a new open-weight model I'm considering investing in includes:
* Model performance (and size): how this model performs on benchmarks I care about and how it compares to other models of a similar size.
* Country of origin: some businesses care deeply about provenance, and if a model was built in China or not.
* Model license: if a model needs legal approval for use, uptake will be slower at mid-sized and large companies.
* Tooling at release: many models release with half-broken, or at least substantially slower, implementations in popular software like vLLM, Transformers, SGLang, etc., due to pushing the envelope of architectures or tools.
* Model fine-tunability: how easy or hard it is to modify the given model to your use-case when you actually try and use it.
The core problem is that some of these are immediately available at release, e.g. general performance, license, origin, etc., but others such as tooling take day(s) to week(s) to stabilize, and others are open research questions, with no group systematically monitoring fine-tunability.
In the early era of open models, the days of Llama 2 or 3 and Qwen pre v3.5, the architectures were fairly simple and the models tended to work out of the box. Some of this was due to the extremely hard work of the Llama, Qwen, Mistral, etc. developer teams. Some is due to the new models being genuinely harder to work with. When it comes to something like Qwen 3.5 or Nemotron 3, with hybrid models (either gated delta net or mamba layers), the tooling is very rough at release. Things you would expect to "just work" often don't.
I've been following this area closely since we released Olmo Hybrid with a similar architecture, and Qwen 3.5 is just starting to work well in the various open-source tools that need to all play nice together for RL research. That's 1.5 months after the release date! This is just to start really investing more into understanding the behavior of the models. Of course, others started working on these models sooner by investing more engineering resources or relying on partially closed software. The fully open and distributed ecosystem takes a long time to get going on some new models.
All of this is lead-in for the most important question for open models: how easy is it to adapt to specific use-cases? This is a different problem for different model sizes. Large MoE open-weight models may be used by entities like Cursor who need complex capabilities in their domain, e.g. Composer 2 trained on Kimi K2.5. Other applications can be built on much smaller models, such as Chroma's Context-1 model for agentic search, built on GPT-OSS 20B.
The question of "which models are fine-tunable" is largely background knowledge known by engineers across the industry. There should be a thriving research area here to support the open ecosystem model. The first step is to understand characteristics of different base and post-trained models to understand what they look like. The second step is to tune pretraining recipes for open models so they're more flexible.
For The ATOM Project and other Interconnects endeavors, we've put substantial effort into measuring adoption trends in the open ecosystem. Everything takes a long time to unfold after a model is first publicly available, and adaptability is why. What we know for sure now, when Qwen has been going from strength to strength with its releases, is that technical staff across the industry have gotten comfortable working with Qwen models. Countless research methods and datasets were made to work with Qwen. It'll take patience for any other model family to get to this point, a patience I'm not sure many open model builders have.
This takes us to Gemma 4, Google's latest open models. Gemma 3 was released more than a year ago, in March of 2025, and is a bit underrated. Gemma 4 comes in 4 sizes for now, with a bigger, MoE model of over 100B total parameters rumored but not released yet. The models we have today come in sizes of ~5B dense, 8B dense, 26B total 4B active MoE, and 31B dense.
I'm most excited that they're finally adopting a standard Apache 2.0 open source license. This'll massively boost adoption. The standard of better licenses for strong open-weight LLMs was set by mostly Chinese open model labs in the last 1-2 years, and now U.S. companies are following suit. I will personally be so happy if the horrible Llama licenses and Gemma terms of service were an ~18-month transient dynamic of the industry being nervous about releasing strong open models.
The Gemma 4 scores look very solid: the small models have incredible benchmark scores (especially in general domains like LMArena) and the 31B model rivals the recent Qwen 3.5 27B, which is the leading member of that class. The ~30B size range is an important one, as it's accessible both to researchers and to enterprises looking to deploy the model in real use-cases. Where the 7B model scale is the default for tinkering and research, a 30B model is the default for seeing if an open model can unlock substantial value in your specific workflow: a good mix of intelligence, low price, tractability for downstream training, etc.
This takes us back to the above adoption criteria I mentioned for open models and the bigger question: do I think Gemma 4 will be an overwhelming success? Previous Gemma models have been plagued by tooling issues and poorer performance when being finetuned.
Gemma 4's success is going to be entirely determined by ease of use, to a point where a 5-10% swing on benchmarks wouldn't matter at all. It's strong enough, small enough, with the right license, and from the U.S., so many companies are going to slot it in.
I'm cautiously optimistic that Gemma 4 is going to work better here. Winds are shifting for open models built in America. We saw GPT-OSS go through a bumpy launch to become an overwhelming success. There's a collective energy around the likes of Reflection, Arcee, Nemotron, Gemma, Olmo, and peers that shows substantial demand for building new stacks around open models. There's capital to be spent on AI stacks across the economy by those who want more ownership of everything, including the model.
After launching The ATOM Project 240 days ago, the conversation is shifting into the next stage. Summer of 2025 was a crisis moment where the U.S. AI scene realized it can't wait and figure out open models after building AGI. The two markets will capture different areas and proceed in parallel. Now that more companies in the U.S. are releasing strong models, we need to improve the ecosystem so that these models are easy to use, understand, and build value around. It's hard work to build another inflection point in these adoption plots I've been updating consistently, but that's the work to be done. Join me in it.
More data coming soon! Here's a sneak peek:
-
Fast takeoff, the singularity, and recursive self-improvement (RSI) are all top of mind in AI circles these days. There are elements of truth to them in what's happening in the AI industry. Two, maybe three, labs are consolidating as an oligopoly with access to the best AI models (and the resources to build the next ones). The AI tools of today are abruptly transforming engineering and research jobs.
AI research is becoming much easier in many ways. The technical problems that need to be solved to scale training large language models even further are formidable. Super-human coding assistants making these approachable is breaking a lot of former claims of what building these things entailed. Together this is setting us up for a year (or more) of rapid progress at the cutting edge of AI.
We're also at a time where language models are already extremely good. They're in fact good enough for plenty of extremely valuable knowledge-work tasks. Language models taking another big step is hard to imagine: it's unclear which tasks they're going to master this year outside of code and CLI-based computer-use. There will be some new ones! These capabilities unlock new styles of working that'll send more ripples through the economy.
These dramatic changes almost make it seem like a foregone conclusion that language models can then just keep accelerating progress on their own. The popular language for this is a recursive self-improvement loop. Early writing on the topic dates back to the 2000s, such as this 2008 blog post devoted entirely to the topic:
Recursion is the sort of thing that happens when you hand the AI the object-level problem of "redesign your own cognitive algorithms".
And slightly earlier, in 2007, Yudkowsky also defined the related idea of a Seed AI in Levels of Organization in General Intelligence:
A seed AI is an AI designed for self-understanding, self-modification, and recursive self-improvement. This has implications both for the functional architectures needed to achieve primitive intelligence, and for the later development of the AI if and when its holonic self-understanding begins to improve. Seed AI is not a workaround that avoids the challenge of general intelligence by bootstrapping from an unintelligent core; seed AI only begins to yield benefits once there is some degree of available intelligence to be utilized. The later consequences of seed AI (such as true recursive self-improvement) only show up after the AI has achieved significant holonic understanding and general intelligence.
It's reasonable to think we're at the start here, with how general and useful today's models are.
Generally, RSI can be summarized as when AI can improve itself, the improved version can improve even more efficiently, creating a closed amplification loop that leads to an intelligence explosion, often referred to as the singularity. There are a few assumptions in this. For RSI to occur, it needs to be that:
* The loop is closed. Models can keep improving on themselves and beget more models.
* The loop is self-amplifying. The next models will yield even bigger improvements than the current ones.
* The loop continues to run without losing efficiency. There are no added pieces of friction that kneecap the exponential into an early sigmoid.
While I agree that momentous, socially destabilizing changes are coming in the next few years from sustained AI improvements, I expect the trend line of progress to be more linear than exponential when we reflect back. Instead of recursive self-improvement, it will be lossy self-improvement (LSI): the models become core to the development loop but friction breaks down all the core assumptions of RSI. The more compute and agents you throw at a problem, the more loss and repetition shows up.
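The difference between a self-amplifying loop and a lossy one can be made concrete with a toy model. This is only an illustrative sketch, not a forecast: the function `simulate`, the per-generation `gain`, the `amplification` factor, and the `friction` term are all hypothetical knobs invented here. Each generation multiplies capability by `1 + gain`, then the gain itself is scaled by `amplification * (1 - friction)`.

```python
def simulate(generations: int, gain: float,
             amplification: float, friction: float) -> list[float]:
    """Toy capability curve for a self-improvement loop.

    Each generation multiplies capability by (1 + gain). The gain is then
    scaled by amplification * (1 - friction): above 1.0 the loop compounds
    faster every step (RSI-like); below 1.0 the gains shrink (lossy).
    """
    capability = 1.0
    history = [capability]
    for _ in range(generations):
        capability *= 1.0 + gain
        gain *= amplification * (1.0 - friction)
        history.append(capability)
    return history


# Same starting gain and amplification; only the friction differs.
rsi = simulate(generations=10, gain=0.1, amplification=1.2, friction=0.0)
lsi = simulate(generations=10, gain=0.1, amplification=1.2, friction=0.3)

for step, (a, b) in enumerate(zip(rsi, lsi)):
    print(f"gen {step:2d}  frictionless {a:7.2f}  lossy {b:5.2f}")
```

In the frictionless run each generation's improvement is larger than the last. In the lossy run the models still improve every generation, but each step buys less than the one before it, which is the shape the argument above is pointing at.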
I'm still a believer that the complexity brake on advanced systems will be a strong counterbalance to the reality that AI models are getting substantially better at every narrow task we need to compose together in making a leading AI model. I quoted this previously in April of 2025 in response to AI 2027.
Microsoft co-founder Paul Allen argued the opposite of accelerating returns, the complexity brake: the more progress science makes towards understanding intelligence, the more difficult it becomes to make additional progress. A study of the number of patents shows that human creativity does not show accelerating returns, but in fact, as suggested by Joseph Tainter in his The Collapse of Complex Societies, a law of diminishing returns. The number of patents per thousand peaked in the period from 1850 to 1900, and has been declining since. The growth of complexity eventually becomes self-limiting, and leads to a widespread "general systems collapse".
There are plenty of examples in how models are already trained, the deep intuitions we need to get them right, and the organizations that build them that show where the losses will come from. Building leading language models is incredibly complex, and only becoming more so. There are a few core frictions in my mind.
1. Automatable research is too narrow
First, it is clear that language models this year will already be useful tools for optimizing localized tasks like lowering the test loss of a model. Andrej Karpathy recently launched his autoresearch that popularized doing just this. This allows AI agents to play directly on GPUs to target tasks like lowering the loss on the test set. This approach works in narrow domains, i.e. one general test loss or one overall reward. The problem is that there's a long-standing gap between an on-paper more accurate model and models that users find more productive. The most provocative case is for pretraining, which was discussed more at length around scaling laws. Scaling laws show us that the loss will continue going down, but we don't know if that'll be economically more valuable.
In post-training, reinforcement learning algorithms are at least more directly tied to specific performance gains as most RL training environments can be used directly as an evaluation. Still, I worry about generalization and tying back to models that are better at the specific task of improving themselves. It's a big leap from models getting better at some things to that necessarily translating into models that are better at building themselves and designing experiments. We've seen many AI capabilities sort of saturate at certain levels of human taste, such as writing quality. AI research is a bit different here, as there is a very high ceiling to climb up to. Where models mostly saturate on writing because there's inherent tension in preferences, models will saturate on research because the search space and optimization target are too wide.
The early benchmarks for measuring this sort of ability all fall prey to the same problem: narrow scope. Agents will do well at optimizing single metrics, but the leap required to navigate many metrics at once is a very different skill set. That is actually what the best researchers do: they make many scalable ideas work together.
The most related benchmark we have to measure this is PostTrainBench, which is quite fun, but progress will very rapidly get distorted on this. Over 90% of the challenge in doing post-training well is getting the last 1-3% of performance, especially without cooking the model in out-of-domain tasks. Post-training a general, leading model is extremely complex, and only getting more complex.
I could go on and on about this. Another example is from during my Ph.D. (2017-2022), when there was immense hype around a field called "AutoML" which aimed to use techniques like Bayesian Optimization to find new architectures and parameters for models. The hype never translated into changing my job. Language models will do more than this, but not enough to take jobs away from top AI researchers any time soon. The core currency of researchers is still intuition and managing complexity, rather than specific optimization and implementation.
2. Diminishing returns of more AI agents in parallel
The biggest problem for rapid improvement in AI is that even though we'll have 10,000 remote workers in a datacenter, it'll be nearly impossible to channel all of them at one problem. Inherently, especially when the models are still so similar, they're sampling from the same distribution of solutions and capabilities while being bottlenecked by human supervision. Adding more agents will hit a strict saturation in the amount of marginal performance that can be added: the intuition of the best few researchers (and time to run experiments) will be the final bottleneck.
A common idea to illustrate this is Amdahl's law, which is taken from computer architecture and shows that the speedup on a given task is capped by how much of the work can be parallelized, no matter how many parallel workers exist. An illustration is below:
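As a small numeric sketch of the law: if a fraction p of a task can be parallelized across N workers, the speedup is 1 / ((1 - p) + p / N), which can never exceed 1 / (1 - p). The 90% parallel fraction used here is an arbitrary assumption for illustration, not an estimate of how parallelizable AI research actually is.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup from Amdahl's law: 1 / ((1 - p) + p / N)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumed for illustration: 90% of the work parallelizes perfectly.
p = 0.9
for n in (1, 4, 40, 400, 10_000):
    print(f"{n:>6} workers -> {amdahl_speedup(p, n):.2f}x speedup")

# The serial 10% caps the speedup at 1 / (1 - p), however many workers exist.
print(f"upper bound: {1.0 / (1.0 - p):.2f}x")
```

Going from 1 to 4 workers is a large gain, while going from 400 to 10,000 barely moves the number, because the serial portion (here, standing in for human supervision and experiment time) dominates.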
In AI this should be relatively easier to convey, as the low-level operating details of computers are fairly mysterious. Consider an AI researcher on the transition from writing code by hand to using AI autocomplete assistance to now using autonomous coding agents. These are all massive gains. Let us continue. Now this researcher uses 3-4 agents working on different sub-tasks or approaches to the problem at hand. This is still a large gain. Now consider this single researcher trying to organize 30-40 agents with tasks to do every day. Some people can get more value out of this scale, but not many.
How many people do you think could come up with 300-400 tasks for AI agents every day? Not many. This problem will hit the AI models soon enough as well.
3. Resource bottlenecks and politics
Fundamentally, all the AI companies are walking a fine line of acquiring substantial capital, converting new compute resources to revenue via sufficient demand, and repeating the process, all the while spending an extreme amount on research. With the scale of resources here, there will always be political bottlenecks on who gets resources and what gets bet on. In this layer, research leadership sits above the AIs and the researchers. Even as models continue to improve, this source of friction will never get removed. It isn't a substantial friction, but the AI models are fundamentally operating in organizations where humans are the bottleneck on resources.
The early scale of improvements with language models is local optimizations, where the resources used cost
-
I'm a little late to this model review, but that has given me more time to think about the axes that matter for agents. Traditional benchmarks reduce model performance to a single score of correctness; they always have because that was simple, easy to quickly use to gauge performance, and so on. This is also advice that I give to people trying to build great benchmarks: it needs to reduce to one number that is interpretable. This is likely still going to be true in a year or two, and benchmarks for agents will be better, but for the time being it doesn't really map to what we feel because agentic tasks are all about a mix of correctness, ease of use, speed, and cost. Eventually benchmarks will individually address these.
Where GPT 5.4 feels like another incremental model on some on-paper benchmarks, in practice it feels like a meaningful step in all four of those traits. GPT 5.4 in Codex, always on fast mode and high or extra-high effort, is the first OpenAI agent that feels like it can do a lot of random things you can throw at it.
I havenāt been particularly deep in software engineering over the last few months, so most of my working with agents has been smaller projects (not totally one-off, but small enough where Iāve built the entire thing and manage the design over weeks), data analysis, and research tasks. When you embrace being agent-native, this style of work entails a lot of regular APIs, background packages (like installing and managing LateX binaries, ffmpeg, multimedia conversion tools, etc), git operations, file management, search etc. Prior to GPT 5.4, I always churned off of OpenAIās agents due to a death by a thousand cuts. It felt like rage quits. Iād feel like I was getting into GPT 5.2 Codex, but it would fail on a git operation and have me (or Claude) need to reset it. Those hard edges are no longer there.
The other subtle change in GPT 5.4ās approachability ā the biggest reason I think OpenAI is much more back in the agent wars ā is that it just feels a bit more āright.ā I classify this differently to the routine tasks I discussed above, and it has to do with how the product (i.e. the model harness) presents the model outputs, requests, and all that to you the user. It has to do with how easy it is to dive in. This has always been Claudeās biggest strength in its astronomical growth. Not only has Claude been immensely useful, but it has a charm and entertainment value to it thatāll make new people stick around. GPT 5.4 has a bit of that, but the underlying model strengths of Claude still leave it feeling warmer.
Where Claude is a super smart model, with character, a turn of phrase in a debate, and sometimes forgetting something, OpenAI's models in Codex feel meticulous, slightly cold, but deeply mechanical. I'd use Claude for things I need more of an opinion on and GPT 5.4 to churn through an overwhelmingly specific TODO list. The instruction following of GPT 5.4 is so precise that I need to learn to interact with the models differently after spending so much time with Claude. In some domains, you come to see that Claude has an excellent model of your intent. GPT 5.4 just does what you say to do. These are very different philosophies of "what will make the best model for an agent." Claude will likely appeal to the newcomers, but GPT 5.4 will likely appeal to the master agent coordinator that wants to unleash their AI army on distributed tasks.
Outside of charm, and dare I say taste, a lot of the usability factors are actually better on OpenAI's half of the world. The Codex app is compelling: I don't always use it, but sometimes I totally love it. I suspect substantial innovation is coming in what these apps look like. Personally, I expect them to eventually look like Slack (when multiple agents need to talk to each other, under my watch).
OpenAI also natively offers fast mode for their models with a subscription and very large rate limits. I've been on the $100/month Claude plan and $200/month ChatGPT plan for quite some time. I've never been remotely close to my Codex limits with fast mode and xhigh reasoning effort, whereas I hit my Claude limits from time to time. There's definitely a modeling reason for this: most of OpenAI's release blogs showcase each iterative model being substantially more concise in the number of tokens it takes to get peak benchmark performance. This is a measure of reasoning efficiency. This 2D (or more) benchmark picture is exactly where the world is going.
Here's a plot from Cursor, which sadly doesn't have all the GPT 5.4 reasoning efforts, but it confirms this point in a third party evaluation. What is missing across model families is the speed and price (a proxy for total compute used) to get there.
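To make the "2D benchmark picture" concrete, here is a minimal sketch of how one might pick out the runs that are not dominated on both score and token usage. The `Run` type, the labels, and every number below are made up for illustration; none of them come from OpenAI's or Cursor's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str      # hypothetical model/effort label
    score: float   # benchmark accuracy, higher is better
    tokens: float  # mean output tokens per task, lower is better


def pareto_frontier(runs):
    """Return the runs that no other run beats on both score and tokens."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on equal tokens, higher score first.
    for run in sorted(runs, key=lambda r: (r.tokens, -r.score)):
        # A run earns its place only if it scores higher than every cheaper run.
        if run.score > best_score:
            frontier.append(run)
            best_score = run.score
    return frontier


# Illustrative, invented numbers only.
runs = [
    Run("model-a-low", 0.52, 4_000),
    Run("model-a-high", 0.61, 12_000),
    Run("model-b-low", 0.50, 6_000),
    Run("model-b-high", 0.63, 30_000),
]

for run in pareto_frontier(runs):
    print(run.name, run.score, run.tokens)
```

The same idea extends to more axes (latency, price), at which point a single leaderboard number stops being a fair summary.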
The final benefit of GPT 5.4, and OpenAI's agentic models in general for that matter, is much better context management. In using them regularly now I feel like I've never hit the context wall or context anxiety point. The reasoning efficiency I suspect above just lets the model do way more with its initially empty context window. Then, when GPT 5.4 does compact, it's been less noticeable.
The one problem I've been having with both Claude Opus 4.6 and GPT 5.4 is a light forgetfulness. If you give the models multiple TODOs in a single message outside of planning mode, I find them often dropping them. Sometimes it feels like the models glitch and try to solve a previous problem rather than the recent ones. I'm not sure what in the model or the harness is the exact cause. Sometimes I like to queue up a few messages as I see the model working on something, to refine the task, but currently this tends to be pretty risky except in the simplest use-cases.
These days I've been using both GPT and Claude extensively, mostly based on my mood, and have been getting more done than ever. Having a GPT 5.4 Pro integration directly with Codex, e.g. like \ultrathink, would be a big differentiator for OpenAI. Those models have been incredible.
All in, I see GPT 5.4 as an agentic model that brings a ton more simple usability and "agentness" to the very strong software foundation of GPT 5.3 Codex. It's a big step, and I'm unbelievably excited for which of these two companies releases an update next. On paper, listing the strengths of GPT 5.4 across better top end coding performance, better speed, better context management, better rate limits, it's a testament to how nuanced choosing a model is. I genuinely still enjoy Claude a bit more for ways that'll never show up on benchmarks. This makes me type claude into my terminal at the start of my day, rather than codex.
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe -
2025 was the year where a lot of companies started to take open models seriously as a path to influence in the extremely valuable AI ecosystem, the adoption of a strategy that was massively accelerated downstream of DeepSeek R1's breakout success. Most of this is being done as a mission of hope, principle, or generosity.
Very few businesses have a real monetary reason to build open models. Well-cited reasons, such as commoditizing one's complements for Meta's Llama, are hard to follow up on when the cost of participating well is billions of dollars. Still, AI is in such an early phase of technological development, mostly defined by large-scale industrialization and massive scale-out of infrastructure, that having any sort of influence at the cutting edge of AI is seen as a path to immense potential value.
Open models are a very fast way to achieve this: you can obtain substantial usage and mindshare with no enterprise agreements or marketing campaigns, just by releasing one good model. Many companies in AI have raised a ton of money built on less.
The hype of open models is simultaneously amplified by the mix of cope, disruptive anticipation, and science fiction that hopes for the world where open models do truly surpass the closed labs. This goal could be an economically catastrophic success for the AI ecosystem, where profits and revenue plummet but the broader balance of power and control of AI models is long-term more stable.
There's a small chance open models win in absolute performance, but it would only be on the back of either a true scientific breakthrough that is somehow kept hidden from the leading labs or the models truly hitting a wall in performance. Both of them are definitely possible, but very unlikely.
It is important to remind yourself that there have been no walls in progress to date and all the top AI researchers we discuss this with constantly explain the low-hanging fruit they see on progress. It may not be recursive self-improvement to the singularity (more on that in a separate post), but large technology companies are on a direct path to building definitionally transformative tools. They are coming.
The balance of power in open vs. closed models
The fair assessment of the open-closed gap is that open models have always been 6-18 months behind the best closed models. It is a remarkable testament to the open labs, operating on far smaller budgets, that this has stayed so stable. Many top analysts like myself are bewildered by the way the gap isn't bigger. Distillation helps a bit in quality, benchmaxing more than closed labs helps perceptions, but the progress of the leading open models is flat out remarkable.
The reality is that the open-closed model gap is more likely to grow than shrink. The top few labs are improving as fast as ever, releasing many great new models, with more on the docket. Many of the most impressive frontier model improvements relative to their open counterparts feel totally unmeasured on public benchmarks.
In a new era of coding agents, the popular method to "copy" performance from closed models, distillation, requires more creativity to extract performance. Previously, you could use the entire completion from the model to train your student, but now the most important part is the complex RL environments and the prompts to place your agents in them. These are much easier to hide, and all the while the Chinese labs leading in open models are always complaining about computational restrictions.
As the leading AI models move into longer-horizon and more specialized tasks, mediated by complex and expensive gate-keepers in the U.S. economy (e.g. legal or healthcare systems), I expect large gaps in performance to appear. Coding can largely be "solved" with careful data processes, scraping GitHub, and clever environments. The economies of scale and foci of training are moving into domains that are not on the public web, so they are far harder to replicate than early language models.
Developing frontier AI models today is more defined by stacking medium to small wins, unlocked by infrastructure, across time. This rewards organizations that can expand scope while maintaining quality, which is extremely expensive.
All of these dynamics together create a business landscape for open models that is hard to parse. Through 2026, closed models are going to take leaps and bounds in performance in directions that it is unlikely for open models to follow. This sets us up for a world where we need to consider, fund, use, and discuss open models differently. This piece lays out how open models are changing. It is a future that'll be clearly defined by three classes of models.
* True (closed) frontier models. These will drive the strongest knowledge work and coding agents. They will be truly remarkable tools that force us to reconsider our relationship to work.
* Open frontier models. These will be the best open-weight, large models that are attempting to compete on the same directions as above. There will be plenty of use-cases that they don't work for relative to the best models, but countless use-cases where they work remarkably well. For many use-cases, even ones as valuable as some subsets of coding, these will work great. The AI ecosystem will still take years to understand what it means to have intelligence of this magnitude served in private, at the marginal cost of electricity for individuals, as assistants, coaches, companions, and more. OpenClaw provided a glimpse behind the mirror that will expand and grow. The class of models around GPT-OSS 120B, Nvidia Nemotron 3 Super, or MiniMax M2.5 are the balance of performance to price that can work as local models.
* Open, small models as distributed intelligence. The most successful open models will be complementary tools to closed agents. This is a path for open models to complement and accelerate the frontier of progress. AI is slotting in to automate many repetitive, niche tasks across the technology economy. There's a huge pressure to shift these tasks off of the best closed models (which frankly are still better at most of the things, across my conversations with businesses trying to build with open models) to small, open models that can be 10X faster and 100X cheaper. There aren't really people building data and fine-tuning engines for economically viable tasks on the smallest models possible. These models need to be almost brain-numbingly boring and specific. In a world dominated by coding agents, I want to build open models that Claude Code is desperate to use as a tool, letting its sub agents unlock entirely new areas of work. This is possible, but remarkably under-explored. Small models from the likes of Qwen and co. are still marketed on general-task benchmarks. The hype of "open models catching the frontier" distracts the world from this very large area of demand.
This is the sort of model that moves open models from just a few, crucial static weights to more of an ecosystem. It requires creativity and a new approach. The goal of this piece is to illustrate why and how to build these, with added context on where open models stand today.
All three of these model classes hint at different ways to use agents. It is absolutely definitional to how AI is going to be built going forward that they're not just model weights, but rather systems that think, search, and act. The weights only define one portion of those abilities.
Open weights as part of an AI system
To start, consider what are the most impactful and impressive things that language models can do without a suite of tools at their side. When was the last time that you were blown away by something that was just autoregressive token outputs? Unless you're doing a substantial amount of work on mathematical proofs or competition code, it seems like that situation has changed little since GPT-4's release in 2023. The AI systems we use today are about far, far more than weights.
In this world, closed models have a clear advantage. Closed models get to vertically integrate everything from the chips they run on, the inference software, the weights, the tools, and the user interface. Open models on the other hand need to work on every inference setup, with many tools, and in many use-cases. This vertical integration is best expressed today in the joy of using Claude Code with Opus 4.6 or OpenAI's Codex with GPT 5.4. Open models haven't passed this point. Some are starting to focus on specific interfaces, e.g. OpenCode, but there's an inherent tension in making an open model work only in your blessed product roadmap.
At the same time, this change could point to more of the latest AI systems being open! If you can do less with the weights alone, maybe more labs will release them.
The way to think about AI systems today is as a mix of weights, tools, and harnesses. The weights portion is familiar. The tools are the deeply integrated environments the models act in at deployment time (best typified by search and code sandboxes) and the harness is how these two fit together with a product that the user sees.
In this world, there are two things to consider: 1) Is there an equivalent, open system to the closed products that people are using today (I mean truly equivalent, where every level of the stack can be modified and controlled; more on this later), and 2) How does this systems view impact different future decisions in the open ecosystem?
Still looking for open model business strategies
To understand how the business and practicality of open models will evolve, let me take a tour back in time to foundational writing on the role of open-source in modern technology companies. The first is a Google blog post, The Meaning of Open, which was originally an internal memo by Jonathan Rosenberg. It sparked an intense internal debate that later resulted in it becoming public. To start, here's a basic assessment of how open systems can work:
Open systems have the potential to spawn industries. They harness the intellect of the general population and spur businesses to compete, innovate, and win based on the merits of their products and not just the brilliance of their business tactics.
I've long believed that the company who will benefit most from the ecosystem of open models is the one who understands it best. This entails being deeply involved with open research and experimentation in how to use the models. So far, most of the open model company business models are not this. Rosenberg expands on this in his 2009 post, comparing the dynamics of open systems to closed products:
[Open systems] are competitive and far more dynamic. In an open system, a competitive advantage doesn't derive from locking in customers, but rather from understanding the fast-moving system better than anyone else and using that knowledge to generate better, more innovative products. The successful company in an open system is both a fast innovator and a thought leader; the brand value of thought leadership attracts customers and then fast innovation keeps them. This isn't easy - far from it - but fast companies have nothing to fear, and when they are successful they can generate great shareholder value.
We've known for some time that open weight models are not actually enough to constitute a product. Models are a product in the sense that they have tools and harnesses, so we don't actually have fully open systems; we have systems that are partially open and partially closed, making moats messy. vLLM and a model like GLM 5 are pieces of a system, but it still takes more to deploy them: expensive private GPUs and some tools with local business data.
It may turn out to be that AI is too complex and expensive to have any analogous open system to previous generations of technology. If there was a fully open system, it would win by default, as many historical generations of technology have shown us. This fully open analog does not yet exist, so we have constant debates on the role of open-source AI.
Bill Gurley recounts how Google's free products have exemplified the open or free strategies across technology. Gurley wrote on the open-source operating system, Android, and the free browser, Chrome, in 2011:
So here is the kicker. Android, as well as Chrome and Chrome OS for that matter, are not "products" in the classic business sense. They have no plan to become their own "economic castles." Rather they are very expensive and very aggressive "moats," funded by the height and magnitude of Google's castle. Google's aim is defensive not offensive. They are not trying to make a profit on Android or Chrome. They want to take any layer that lives between themselves and the consumer and make it free (or even less than free).
Because these layers are basically software products with no variable costs, this is a very viable defensive strategy. In essence, they are not just building a moat; Google is also scorching the earth for 250 miles around the outside of the castle to ensure no one can approach it.
In the same post, Gurley reflects on the limits of Google's openness:
In this open manifesto, Jonathan opines over and over again that open systems unquestionably result in the very best solutions for end customers. That is with one exception. "In many cases, most notably our search and ads products, opening up the code would not contribute to these goals and would actually hurt users." As Rodney Dangerfield said in Caddyshack, "It looks good on you, though."
Essentially, Google open-sourced so much, and in fact paid people to use its products (e.g. paying phone makers to use Android), to keep the funnel leading to the search profit center. This is the virtuous loop that the search business still funds to this day.
AI is still nothing like this, but signs of change are emerging. The default belief on the value of models to these companies is that the model is the product. This is obvious with products like hosted APIs, where releasing the model weights would be business suicide, but this is softening as interfaces like Claude Code, Codex, Cursor, etc. get vastly popular. It could be a path to more openness, at least in parts of the stack. We can see this with the coding plans offered by Moonshot and Z.ai, where the demand is very high for the businesses, even though the model is open. Most people will just use the cheap interface with inference, instead of figuring out how to use the model themselves (as long as the business is mostly consumer or per-head services).
All of this doesn't leave me optimistic on the direction of companies becoming more open in the coming years. I'd expect the opposite still. Nvidia has the one great reason to be open (to sell more GPUs to people building on open models and understand what they need to build next), but there's no one else obvious on this list. Until there are more specific economic reasons to build open models, the companies building these at the frontier will have fewer resources to spend on the models and face a consolidation to the best few.
In the face of consolidation at the open frontier, the investment in the models should shift to areas where the models can have more differentiated upside relative to the best closed frontier models.
Open models that are specific, cheap, fast, and ubiquitous
There's too much obsession with the best companies building open models to try and compete at the frontier. There's a vastly underserved market of enterprises that want cheap, reliable models for repetitive use-cases in their systems. Picture this: one small model with a series of LoRA adapters that specialize the model to internal skills. This can be deployed very cheaply as tools and a complement to the frontier closed models that are orchestrating agents.
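As a rough sketch of that picture, here is one way the routing layer could look: a single base model, a registry mapping internal skills to adapters, and a fallback to the frontier model when no adapter exists. Everything here (`AdapterRouter`, the skill names, the adapter ids) is hypothetical and stands in for whatever serving stack actually loads the adapters.

```python
from dataclasses import dataclass, field


@dataclass
class AdapterRouter:
    """Maps narrow internal skills to LoRA adapters on one shared base model."""
    base_model: str                               # hypothetical small base model id
    adapters: dict = field(default_factory=dict)  # skill name -> adapter id

    def register(self, skill, adapter_id):
        """Record which adapter specializes the base model for a skill."""
        self.adapters[skill] = adapter_id

    def route(self, skill):
        """Return (base_model, adapter_id), or None to fall back to the frontier model."""
        adapter_id = self.adapters.get(skill)
        if adapter_id is None:
            return None
        return (self.base_model, adapter_id)


# Hypothetical names, for illustration only.
router = AdapterRouter(base_model="small-base-4b")
router.register("summarize_logs", "lora-logs-v1")
router.register("extract_invoice_fields", "lora-invoices-v3")

for skill in ["summarize_logs", "draft_strategy_memo"]:
    target = router.route(skill)
    print(skill, "->", target if target else "frontier model")
```

The point of the design is that the orchestrating agent only needs to know skill names; which small model and adapter sits behind each one can change without touching the agent.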
Every task that a frontier agentic model does tens to hundreds of times can potentially be outsourced to a small model. There are ancillary benefits to this, e.g. privacy of a local model reading your files and summarizing to Claude, but almost no one is pushing hard in this direction. The leading model family of capable, customizable small models to date is Qwen, but that's now shrouded in uncertainty with the departures of key personnel. Gemma, Phi, Olmo, etc. are all major steps down in quality, and therefore potential for modification.
There are a few obvious examples of why this can be scaled up. There was a recent thread and discussion on how the new Qwen 3.5 4B model arguably bests the original ChatGPT model. On the research side, there are already recipes for finetuning open models on specific code-bases to match performance of much bigger models. Moondream.ai is a startup made by a friend of mine, Vik, who builds some of the best small multimodal models on a tiny budget. They compete with Qwen and Llama on real world tasks. This is the tip of an iceberg.
Intelligence compression hasn't been explored with nearly as much depth (or resources) because it is less exciting than keeping track of the progress of the best few models. Investigating these areas is the standard technological diffusion process that is slow and why we're still early in understanding how people will build with AI. My contention is that too many people building open models are slightly deluded in their perception of their competitiveness. The best few models will win on general capabilities and there are still plenty of underserved niches elsewhere.
Taking this to the next level involves releasing open models that are scoped to be truly excellent at 1-3 tasks, as I hinted at the beginning of this piece. Too many people try to compete with Qwen and show that their small model does great on frontier AI benchmarks. The right benchmark here is savings in compute and time.
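A back-of-the-envelope version of that benchmark might look like the sketch below: count savings only when the small model clears a quality bar on the task, then report dollars and seconds saved over a batch of calls. The function name, the accuracy gate, and all the numbers are my own illustrative assumptions; the example simply plugs in a model that is 100X cheaper and 10X faster.

```python
def estimate_savings(calls, frontier_cost, frontier_seconds,
                     small_cost, small_seconds,
                     small_accuracy, min_accuracy):
    """Estimate savings from moving `calls` requests to a small model.

    Costs are dollars per call and times are seconds per call. If the small
    model misses the accuracy bar for the task, nothing is outsourced and the
    savings are zero.
    """
    if small_accuracy < min_accuracy:
        return {"cost_saved": 0.0, "seconds_saved": 0.0}
    return {
        "cost_saved": (frontier_cost - small_cost) * calls,
        "seconds_saved": (frontier_seconds - small_seconds) * calls,
    }


# Invented numbers: 1,000 calls, small model 100X cheaper and 10X faster.
result = estimate_savings(
    calls=1000,
    frontier_cost=0.01, frontier_seconds=2.0,
    small_cost=0.0001, small_seconds=0.2,
    small_accuracy=0.97, min_accuracy=0.95,
)
print(result)
```

Reporting a number like this per task, rather than a general-purpose leaderboard score, is what would make a narrow model legible to the people who would actually deploy it.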
It'll take years for this transition to slowly become reality. Part of why I am so excited about it is that it is driving innovation on open models being more about diversity, specialization, and curiosity, rather than the standard "one model to rule them all" that the frontier models presume.
Models vs. ecosystems. Consolidation vs. creativity.
So long as the open source ecosystem for AI is defined by a bunch of model providers trying to chase after the closed labs, it will largely lose. It will face pain on funding and substantive adoption. The same consolidation that will come for closed AI companies will come for open model builders, likely even sooner.
Open systems at their best allow many people to participate and many approaches to flourish.
The world of open models needs to be more of an ecosystem. I've discussed in the past how China is closer to this type of environment by having a variety of companies, but the variety in approaches is still too low.
Ecosystems are self-reinforcing, whereas individual models are static artifacts in time. Ecosystems showcase clear, constant opportunities for what's next that have growing value propositions.
The path forward for open models is to solve different problems than the frontier labs, to find places where open models are effectively free alternatives, to show ways of using specialized models that the closed labs cannot offer. The world of open models needs to embrace creativity, before building powerful AI systems grows too expensive and prices out many of the prized open labs of today.
-
Watching history unfold between Anthropic and the Department of War (DoW), it has been obvious to me that this could be a major turning point in perspectives on open models, but one that'll take years to be obvious. As AI becomes more powerful, existing power structures will grapple with their roles relative to existing companies. Some in open models frame this as "not your weights, not your brain," but it points to a much bigger problem when governments realize this.
If AI is the most powerful technology, why would any global entity let a single U.S. company (or government) control their relationship to it?
I got Dean W. Ball of the great Hyperdimensional newsletter onto the SAIL Media weekly Substack live to discuss this. In the end, we agree that the recent actions by the DoW, especially the designation of Anthropic as a supply chain risk (which Dean and I both vehemently disagree with), point to open models being the 5-10 year stable equilibrium for power centers.
The point of this discussion is:
* Why do open models avoid some of the power struggles we've seen play out last week?
* How do we bridge short term headwinds for open models towards long-term strength?
* The general balance of capabilities between open and closed models.
Personally, I feel the need to build open models more than ever and am happy to see more constituencies wake up to it. What I don't know is how to fund and organize that. Commoditizing one's complements is a valid strategy, but it starts to break down when AI models cost closer to a trillion dollars than a hundred million. With open models being very hard to monetize, there's a bumpy road ahead for figuring out who builds these models in the face of real business growth elsewhere in the AI stack.
Enjoy and please share any feedback you have on this tricky topic!
Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other Interconnects interviews, go here.
Chapters
* 00:00 Intro: is the Anthropic supply chain risk good or bad for open models?
* 04:03 Funding open models and the widening frontier gap
* 12:33 Sovereign AI and global demand for alternatives
* 20:55 Open model ecosystem: Qwen, usability, and short-term outlook
* 28:20 Government power, nationalization risk, and financializing compute
Transcript
00:00:00 Nathan Lambert: Okay. We are live and people will start joining. I'm very happy to catch up with Dean. I think as we were setting this up, the news has been breaking that the official supply chain risk designation was filed. This is not a live reaction to that. If we get any really, really interesting news, we'll talk about it. I think one of the undercurrents that I've felt that this week where everything happened is gonna touch on is open models, but there's not an obvious angle. I think I will frame this to Dean to start, which is how does-- Like, there's two sides of open models. One is that there's the kind of cliche like, not my weights, not your weights, not your mind, where like somebody could take it away if not an open model, which people are boosting like, "Oh, like Anthropic's gonna take away their intelligence." But the other side is people worried about open models existing that the Department of War can just take and use for any purpose that it wants. And I feel like both of these are a little cliche. And the core question is like, is this type of event where more control is coming towards AI and more multi-party interest, like is that gonna be good or bad for the open weight model ecosystem?
00:01:12 Dean Ball: My guess is that in the long run, this is probably profoundly good for open weight AI. And like the whole reason I got in, like, so I became interested in frontier AI governance. I did something totally different with my time before. I wrote about different kinds of policy and studied different kinds of policy. And the reason I got into this was because it immediately occurred to me that the government was gonna... I was like, okay, let's assume we're building super intelligence soon or whatever, like very advanced AI that seems like really important and powerful. That's gonna be something that I depend on, like for my day-to-day life. I'm gonna need it for all kinds of things. It's gonna profoundly implicate my freedom of expression as an American and my exercise of my liberty and all that. And yet it's also gonna profoundly implicate national security. And so the government's gonna have its hands all over it, and they also might not like me using it because I might use it, and others might use it to challenge the status quo in various ways, to challenge the existing power structures which the government is a part of. So we have a political problem on our hands here, in my view.
00:02:36 Dean Ball: It immediately occurred to me that we're gonna have this huge problem of like, this is gonna be a conflict because this is something that's gonna enormously implicate American speech and liberty, and also it's gonna have legitimate national security issues, and also the government's gonna want it because of bad power-seeking reasons. And so that's always a part of the picture. And my view was this is just a fight that's gonna play out over the coming decades, and I wanna be a part of this fight. But number two, in that fight, you have to have an insurance policy, and open weight is the insurance policy. Open weight is the way we can always say yes, but we can build the open ecosystem. We can do that. And so I think in the fullness of time, this is gonna be beneficial, but the problem is there's a lot of coordination and economic problems that have to be solved here. It's not just a matter of hoping that Google and Meta or whomever else, or the Chinese companies, by virtue, out of the goodness of their hearts continue to open-source things. That's not scalable. There has to be a reason to do it. So what are the institutional dynamics open weight gonna look like in the long term? I don't really know, but it feels deeply under-theorized.
00:04:03 Nathan Lambert: I think it's hard to fund is the thing. I mean, we saw Qwen had their turmoil this week, which is timely, and I'm not that surprised because the stakes for these companies is so high, and they all are trying to make sure their companies win in it. And people will say like, "Oh, Meta should commoditize their complements and release open models." But no one's ever commoditized their complements with something that costs a trillion dollars to make. Like, that's a line item. Like, is Apple gonna commoditize... Apple commoditizing their complement would be them doing the... They could spend just as much as all the other tech companies are on CapEx and spend hundreds of billions of dollars, but they're choosing not to. And I just like, I agree that long term it should be better, but if we never bridge that gap, does it actually materialize? Like, the crank is being turned of these models getting better and better. GPT 5.4 released today, excited to try it.
00:05:02 Nathan Lambert: But like, where does it go? Like, what I'm working on is totally falling behind the frontier. We're the foundation of research, but it's like I see it already slipping.
00:05:13 Dean Ball: So I kinda think, yeah, I mean, look, I think it's gonna get bad in the short term, it's gonna be bleak, right? There's just no doubt about that in my view. Because we're in this period, like I think the pace of frontier progress is gonna continue. My own view is that, like, just 'cause I peer in and use the open weight Chinese models on a fairly regular basis, and I kinda just feel as though the gap has widened between the US frontier and the open frontier. Unfortunately, it's so sad that US frontier and open frontier are increasingly distinct things. But I do feel as though that probably is true. And that's probably gonna continue because in the next, like, in the early stages of a new technology, you would expect for the vertically integrated players to be the ones who do the best. And over time, the modular players can win, and part of that is 'cause eventually you do get to good enough, right? Like, eventually, I think most people think the iPhone is good enough now. There was a time when every year the iPhone upgrade was like, "Oh my God, this is so much better." Intelligence is maybe different, but maybe not for a lot of things.
00:06:37 Nathan Lambert: Well, like, there's no iPhone that you can buy from anyone. Nothing you can buy from anyone but Apple is nearly as good. That's the concern. It's like, is it gonna be Anthropic that like, yeah, it stopped getting better, but you can't rebuild it. Like, you can't make the open source version.
00:06:51 Nathan Lambert: I also think I had a later question, which is like, the weights are so much less of a concern for me. So like, somebody dropping a two-trillion-parameter model that's open weights and way better than anything else that somebody has built and released in the open, it almost doesn't matter if you don't understand the harness and the tools and the setup you need to make it into a Claude-like system. Like, you need what, eighty nodes of H100s that cost a hundred thousand dollars a day to run and expertise to make it a system. It's like the shifting away from weights is also happening. I don't think it's happening in this open versus closed ecosystem at the surface level of the discussion. So that's why I'm just like, I don't know if it's gonna exist. The thing that I could see happening is that open weights models are niche, and they help these Claude-like models, but there's not an alternative in that universe. So it's like, is the government capable of actually making this alternative exist? I don't know. Like, I don't know if you can Manhattan Project this, and I wouldn't advocate for it.
00:07:53 Dean Ball: I actually think about it from the opposite perspective, because I think that what happens if the government follows through on what they've threatened with Anthropic, which is to make it so that basically any military contractor cannot have any commercial relations with Anthropic, which means NVIDIA can't sell GPUs to them for anything. Amazon can't sell cloud services to them. Amazon and NVIDIA also can't be invested in them, by the way, if you take any commercial relations at its face value. Now, that's not a power the government actually has, but nonetheless, if this harassment campaign continues, I think what it probably does... You know, I spend a lot of time in international policy, dealing, talking to foreign governments and civil society in foreign countries, and they already have major trust issues with respect to the US closed source models because they think the US government is gonna come in and disable the models. Like, the American president will get mad at Brazil, say, and in addition to putting tariffs or sanctions, the US president will say, "Yeah, we're also gonna turn off all your public services that are dependent upon American closed source models." Right? So people view that as this profound threat, and people are legitimately scared of that in other countries.
00:10:00 Dean Ball: I think this turns that fear up another meaningful degree, and probably not incorrectly, by the way, probably rightfully so. And so I kinda look at this and I think, well, now a lot of American companies might also have that concern, and so you certainly have a demand side of people who are gonna be like, "I get this. It is a risk to use anything where I have a commercial relationship. 'Cause once I have a commercial relationship, the government can regulate that. Can I find some way of getting out of it?" I think there's gonna be demand for that. Whether or not that demand produces supply, I think will depend on... It might just not be possible, that's true. But I think you've never had a more favorable demand picture, and I suspect that on the margin, this probably will favor open in the longer run.
00:10:44 Nathan Lambert: Yeah. So there's a few ways that I think about this. I have this thing, like ATOM Project and all this other stuff I do, and it's like, how do I meaningfully advocate for this? I think there's something, like I work at AI2, and AI2 has budgets of order of a hundred million dollars and can train decent models. But if I wanted to redo an AI2, like my method for getting that type of money, it's mostly gonna be like befriending a billionaire. And it seems like philanthropy dice roll in the near term is a way to get it. But then, like, maybe it really is some long slog of a multi-industrial consortium that takes a couple years off the ground and slowly, like, Google's, or all these Netflix and all these five hundred billion dollar smaller companies are gonna give millions of dollars to have somebody else do it because they can't get the billion dollars themselves, but they know they need to have it existed.
00:11:31 Dean Ball: And sovereign wealth funds. Right. Sovereign wealth funds everywhere can do that, right? There's trillions of dollars in sovereign wealth. There's pension funds, public employee pension funds. A lot of people can chip into this and it's possible. This is like, Yann LeCun thinks this is the inevitable outcome. He thinks that the future is gonna be that some sort of global consortium gets together and builds this, because no one country is gonna be able to own it, because it's gonna be too important. I've always kinda doubted that, and I've always thought that that outcome is probably a bad outcome for the world, honestly.
00:12:06 Nathan Lambert: That's a bad outcome for how good the AI is.
00:12:09 Dean Ball: That's correct. It's a socialist outcome, you know? It's not communism, but it is democratic socialism, and I'm not a democratic socialist, so I'm not a super big fan of that. But at the same time, I have to be honest that I kinda think that this probably does increase the odds of that precise outcome coming to bear.
00:12:33 Nathan Lambert: I think something that comes sooner is that a lot of these super wealthy countries are gonna realize they can have real... Like, they can do some sort of sovereign AI and make some sort of noise, particularly starting with open models. I think there's the Institute for Foundation Models, which is based on the UAE university system. Like, that's--
00:12:53 Dean Ball: That's very UAE-coded, yeah.
00:12:55 Nathan Lambert: They've been playing that for years, and they can keep doing this. Their models are gonna be pretty good, and I think there's gonna be more people that do this. There's the SWISS initiative in EU, which is on one hand doing a good job, on the other hand plagued by the most obvious European limitations of talent cycling and consortium life. I think these things are gonna become more of a thing in the next year, but I don't know exactly how they impact the... They don't impact the frontier of AI, but maybe they're just like how the geopolitics and power of AI evolves. And I for some reason feel like open models need to be the thing that they're gonna do because if they have a closed model that's not as good, it doesn't really give them any sort of power. But I don't have a good enough world view for what that actually does, and if there's more EU models, if India actually has their act together and trains a solid model. I don't know what that does, but I feel like it's probably gonna happen.
00:13:54 Dean Ball: Yeah. I mean, it's really super interesting 'cause I think the other thing-- that will be inherently... I mean, it will be a Linux compared to a macOS, you know? It will not be as good of an experience for people. But then it becomes strange. Like, I don't think macOS is as appealing of a thing if it's viewed to be owned by the US government, right? And in fact, part of the reason I think that Apple is able to make its case quite credibly to consumers and businesses is they have resisted US government pressure to turn things over before. People might remember about a decade ago, there was this shooter in San Bernardino, California, and the FBI tried to force Apple to release iPhone data, and Apple said, "No, we're not gonna expose this information." Now, I think the FBI eventually just hacked it anyway, but that's a separate issue. It's a matter of principle here.
00:15:01 Dean Ball: So yeah, I think it's an interesting question: do we expect for the gap between the open frontier and the American closed frontier to widen in the near future, especially just because of how much compute they're gonna have?
00:15:30 Nathan Lambert: A hundred percent. And data and talent. Like, a hundred percent. It's happening.
00:15:34 Dean Ball: Data, talent. And it's compounding, right? I mean, this has always been my view. And how much, I'm not sure, but I think it could be quite significant because these things are compounding benefits. And so if you expect them to just continue compounding, then all of a sudden it gets pretty bleak pretty quickly, would be my fear.
00:16:00 Nathan Lambert: One of the... I mean, what's your take on this? Why has it not compounded so much faster? Like, I feel like these three companies are spending, I don't know, 10X what the Chinese labs are spending, and you only get like a little bit better model. Like, I believed so full-heartedly that Claude and ChatGPT and all these models are much better, and I expect them to become better by increasing margin, but it's still confusing why they're not already more ahead.
00:16:29 Dean Ball: I go back and forth on this. Sometimes I think they are that ahead, and it's just difficult to show up in benchmarks for the obvious reasons that benchmarks get chased. And like, I do feel that with the coding agents and with certain use cases, I do just feel like, wow, the American frontier is just way ahead, profoundly ahead of the Chinese frontier there. But there's a lot of other things where you do kinda saturate how good you can be. I suspect that a very large fraction of AI usage is essentially glorified Google search. Even though I don't think AI is glorified Google search, I suspect that a lot of what people use it for is that, at the consumer level. And it isn't obvious to me how much better you can get at things like that. But my guess would be that over the next five years, I would guess the American labs really take off, in part because of compute, data, internal deployments for recursive self-improvement style stuff. And also, it's amazing how we talk about that as just a normal thing now.
00:18:05 Nathan Lambert: I think there will be a ceiling on it. Like, they're gonna get a ton of improvement-- The gains are insane. It's like, personally, at my job, I've been a lot of a research manager and just chasing s**t down to get a model out the door. But now I can take on hard engineering tasks because I'm like, "Okay, might as well do this at the same time." Like, going from zero to a hundred software engineers at anyone's fingertips is worth a lot in terms of exploration. But the next, like, from a hundred to ten thousand is like, people can mess that up type thing. But that's a huge gain.
00:18:37 Dean Ball: I kind of agree. I think there'll be a sigmoid there too. But then the other thing that will happen is, like, what I sort of wonder is will the AI companies, will the current model vendors, will they eventually become more like true infrastructure companies where what they actually do is they have models that design their own chips and models that design their own data centers and models that design their own successors. And so it's this hugely vertically integrated thing, and what you're really getting access to is not just the model itself, but you're getting access to this highly optimized hardware, physical world infrastructure. And again, that's kind of already the case, but does that become even more the case? And then that's truly insurmountable for any open player. That's definitionally insurmountable for an open player, and that becomes scary too. But again, this is why I've always felt so good about the position of the US closed source labs. This is why I've always been pretty bullish on them and have my concerns about open.
00:20:07 Dean Ball: But to the extent the US government makes it impossible to trust closed source models, you do provide an advantage to open there. You're giving a shot in the arm. If you like open source, you should hope that the supply chain risk designation against Anthropic is quite broad.
00:20:09 Nathan Lambert: It's a rough thing to hope for.
00:20:09 Dean Ball: I mean, you shouldn't actually hope for it, but I just mean, like, if that's the only thing you care about in the world is open source, then--
00:20:17 Nathan Lambert: I would say that anyone that only cares about open source probably is not thinking through any of these principles. It just gets really bad if you only have-- Like, AI is not gonna be meaningful lift to the economy and nor sustainable if everything is open. Like, if models are truly commoditized, things look kind of rough out there.
00:20:36 Dean Ball: I think a world where models get commoditized is a really bleak world too, actually. And yeah, this is why I'm very worried about what the US government is doing. But I think that it helps on the margin, though. It probably helps on the margin in terms of waking people up. That still is my view.
00:20:55 Nathan Lambert: I am a little surprised by the Qwen stuff, but I think there's-- It's like, at some point, I knew there was gonna be a year where a lot of the open model efforts just died because they're just too expensive and too similar. But at the same time, having a lot of efforts that are somewhat similar but exploring a lot of the minor permutations in modeling space to figure out what works for people who use open models is actually quite good. I'm very bearish on the reflection style approach, which is build a lab, build an incredible model, drop it, make a bank selling it on-prem. Because on-prem is not that distinct from a business model as having a closed model. You could sell a closed model on-prem with the right IP controls. But then the person who actually wins open is by trying a whole bunch of tiny different things, understanding what is actually a meaningful differentiator in private data, in certain deployments and whatever, and then really iterating on that with a community. And that's why I was like, Qwen is the closest to doing this by being so close to the community, and it's so distinct from what a lot of the other labs are betting on.
00:22:05 Nathan Lambert: But I see the pressure going away and kind of reducing diversity onto standards, because standards also make inference more efficient. Using open models is really rough. I think some of the best open models have really had rough launches. I think GPT-OSS had a horrible launch in terms of usability and is now one of the most popular models of all time. Qwen 3.5, it's like researchers I work with are like, "Oh, let's see if we can do some basic RL baselines on it," and all the software stack is kinda broken. It takes a few weeks to get it going. And this is 'cause all the models change differently, and closed labs just have such an advantage there 'cause they should conceivably ship things on day one that work. I mean, don't talk about Claude's runtime, but that's fine.
00:22:42 Dean Ball: And don't talk about the GPT-5 auto router either. But yeah, no, totally. I think that's right.
00:22:53 Dean Ball: I think fullness of time, I'm bullish on open source in the long run, fairly bearish in the next five years. The next five years are gonna matter quite a bit. And there is a lot of cope in both open source world and also... I don't really hear it so much in open source world. I think open source world is actually more honest about this. But where the cope is so bad is in global civil society discourse. Like, I was in India for the AI Impact Summit recently, and they are just smoking the copium, being like, "We are gonna do everything on subfrontier open source models, and we're just gonna diffuse those, and that's all we're gonna need in our economy." And I just think that's, if you're India, that's really not the bet you wanna make. I understand these are resource-constrained countries. They have a lot of acute constraints that they face, but nonetheless, I think that's probably not a good bet.
00:24:05 Nathan Lambert: Well, it's even if those long tail models will work like manufacturing has worked, where it's like Apple has put hundreds of billions of dollars into the manufacturing ecosystem in China to get absolute fine margins and scale. Like, if you really-- these things are gonna be used so much that that fine margin is actually gonna matter a lot, and it is not cheap to get that fine margin. You can't just YOLO a DeepSeek V3 and spend five million dollars in compute and be done. It's still gonna be expensive for a long time.
00:24:34 Dean Ball: Yeah, it requires-- I think the Chinese approach, in the long run, if China's gonna continue its strategy and they want to be competitive with the American frontier, they're gonna have to fully socialize that, I think. I don't think DeepSeek alone is gonna be able to do this, and I don't think even Alibaba alone is gonna be able to do this. I think they're going to need some sort of collective effort. Especially because of the export controls, the American export controls. They're gonna have to centralize compute. They're gonna have to centralize all these things, and talent and data and all that.
00:25:17 Nathan Lambert: I don't see it happening. Like, maybe someone gets officially AGI pilled, and I don't know that much about China. But the things I know about China, it seems like that would be a big lift, and it would take a lot of time to actually do it. Like, all the companies would have to give up their biggest... All the cloud companies are like tech companies making a lot of money. They would be like, "We have to give up what?"
00:25:42 Dean Ball: No, it would be a tough sell. Obviously, if the Chinese government decides they want to do it, they absolutely will. But in total, it will be a tough sell. My experience having had diplomatic engagements of many sorts with Chinese government-- and a lot of Chinese tech policy is actually not directly set by the government. It's actually more kind of civil society, academia and civil society adjacent to government. Had a lot of conversations with folks like that, and they're definitely... It's largely not a very AGI-pilled crew. I think AGI-pilled-ness probably has a rough correlation with GDP per capita, and I think China is about where you would expect based on their GDP per capita, maybe a little bit ahead, but not very so. But if they ever do get AGI pilled, that's the kind of thing that they could consider, but then that's still a pretty extraordinary outcome because the Chinese government would have to be willing to make these things and then give it away. And I kinda just don't think they will.
00:27:11 Nathan Lambert: Yeah. I mean, all the politics of control with how everybody thinks AI is so powerful are pointing to very value-destructive actions economically in order to achieve the end state that people determine to be right. It's like supporting open source to the extent that you can to avoid situations like Anthropic being labeled a supply chain risk and having interactions like that totally decimating runway of AI productivity. Like, if the companies are really gonna commit to open source for other things, then they're gonna lose money. And I see this in-- China's economy would be taking a gigantic hit doing this. And that's kind of a common theme of what we're talking about is that the interface of AI in an economic fashion is gonna make the next few years really weird.
00:28:06 Dean Ball: I hope so.
00:28:09 Nathan Lambert: I think things are gonna be weird, but I haven't spent a ton of time thinking about how that interacts with political institutions. I thought about socially weird a lot, but I haven't thought about power weird a lot.
00:28:20 Dean Ball: Oh, power weird is what I worry about all the time. What I worry about the most is I think it's plausible that what we're seeing... I've always had this concern. I have this dual problem of-- maybe I'm talking out of both sides of my mouth. Maybe that's just the critique, and it's a fair critique. But I routinely complain about how people in government aren't really... They pretend to take AI seriously, but they don't take it that seriously. And they don't really own the implications of advanced, of near term advanced AI and all that. I think we basically have transformative AI right now, but they don't own that, because it's annoying, it's difficult, it's conceptually challenging.
00:29:08 Dean Ball: But the flip side of that is that if people do start to take it very seriously, there's the risk that they sort of lash out, that they get scared, and they lash out and do things that are rash, in a rush. And that actually creates very, very bad, much worse outcomes than you otherwise might have gotten. I think that's a very fair risk, and I think it's possible that you might see things like that happen within the U.S. I don't think this particular incident with Anthropic is quite an example of that. But it's possible that you do see that in the coming years, and that is in and of itself a pretty scary outcome because if the U.S. government decides that they want to nationalize the frontier labs, I think it could be one of the most tyrannical things we ever see happen in this country.
00:30:16 Nathan Lambert: Yeah. It's like, I don't know how to reply to this. I think things are... It's serious times and I see so many... It feels like such a Sisyphean task to make more open models exist, but all the broader trends seem to point to that being a more stable equilibrium in a lot of ways. Like, good enough open models and keeping up with what we all feel happening in the closed model land.
00:30:50 Nathan Lambert: So I don't know. I stay motivated, but I feel increasingly lost in terms of achieving it.
00:30:56 Dean Ball: I don't think you should be. I think, look, I suspect the US government will not actually do it, and the best thing about America is that our general sort of-- I don't wanna say incompetence, but the general sort of chaos of American institutions and decentralized confusingness of it all, it can often be quite frustrating, and it can sometimes be a detriment, but it can also be really great because we tend to not execute and follow through on our very worst ideas. And so I don't think we're going to do that. It doesn't feel very American to do it. I worry about it because I worry about these rash reactions, and that's why I fight as heavily as I do on things like this, despite not insignificant cost to me to do it, politically speaking. But that's totally worth it because I care about this. I think everything, I think that will probably be fine. But yeah, I do agree. It's a major risk. It's a major risk, and it's a weird world to think about, I'll tell you that much.
00:32:16 Nathan Lambert: Yeah. I don't have a lot more to add. I'm sure we'll continue this discussion. I think it warrants the space of it 'cause that's the... It's one of the longer term things, but it's not in the news cycle whatsoever, at least the open model angle. There's just so many layers. People have to talk. Like, send feedback, people listening. I'll even send this out as a podcast as well and just like, what do people think? How do we get to the places we want to get to?
00:32:46 Dean Ball: Well, one thing I'm particularly interested in is-- one of the items in the Trump administration action plan, which I worked on for those who don't have that context, is this idea of financializing compute, creating a financial market, like basically a commodities market for compute so that you can buy, you know, like really robust. In the same way that you can buy electricity spot, electricity futures and electricity on the spot market and things like this, the wholesale. Could you do something like that for compute? That could really profoundly change the dynamics and the economics of AI production. It's not gonna turn them over. It doesn't flip them on their head, but it changes it quite meaningfully. And I'm very excited by that prospect.
00:33:48 Dean Ball: And that's the kind of thing that I would be increasingly doing if this sort of interference of government into the frontier continues. What I suspect I'll do is start developing some of those ideas which I developed earlier. I'm only one person. If those things start to seem relevant again, I totally will. Because anything to make it easier to produce AI for people that don't have trillions of dollars will be extremely important.
00:34:38 Nathan Lambert: Yeah. I think that... I don't know. I'm happy to leave it there.
00:34:43 Dean Ball: Cool.
00:34:45 Nathan Lambert: I can let you get on your trip. It's good to catch up. I'm early in the process of potentially coming to DC in a few months, so I will let you know if I do.
00:34:52 Dean Ball: Oh, please do. It'd be great to see you. We can record an episode of my podcast live.
00:34:58 Nathan Lambert: Sounds good. Okay. Thanks everybody for listening.
00:35:03 Dean Ball: Talk to y'all later. Bye.
-
So-called hybrid architectures are far from new in open-weight models these days. We now have the recent Qwen 3.5 (previewed by Qwen3-Next), Kimi Linear last fall (a smaller release than their flagship Kimi K2 models), Nvidia's Nemotron 3 Nano (with the bigger models expected to drop soon), IBM Granite 4, and other less notable models. This is one of those times when a research trend looks like it's getting adopted everywhere at once (maybe the Muon optimizer too, soon?).
To tell this story, we need to go back a few years to December 2023, when Mamba and Striped Hyena were taking the world by storm and asking the question: Do we need full attention in our models? These early models fizzled out, partially for the same reasons they're hard today (tricky implementations, open-source tool problems, more headaches in training), but also because the models fell over a bit when scaled up. The hybrid models of the day weren't quite good enough yet.
These models are called hybrid because they mix these new recurrent neural network (RNN) modules with the traditional attention that made the transformer famous. They all work best with this mix of modules. The RNN layers keep part of the computation compressed in a hidden state to be used for the next token in the prediction (a summary of all information that came before), an idea that has an extremely long historical lineage in deep learning, e.g. back to the LSTM. This setup avoids the quadratic compute cost of attention (i.e. it avoids the KV cache that the attention operator expands with every token), and can even assist in solving new problems.
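A toy sketch of that difference, in plain Python, may help. It only counts how many numbers each kind of layer has to hold on to as tokens arrive. The function names, the sizes, and the simple decay-plus-outer-product update are assumptions chosen for illustration; this is not the Gated DeltaNet update or anything from the Olmo Hybrid code.

```python
# Toy illustration only: sizes and the update rule are assumptions,
# not the Olmo Hybrid (or Gated DeltaNet) implementation.

def attention_cache_sizes(num_tokens, head_dim):
    """Floats stored by one attention head's KV cache after each token."""
    kv_cache = []
    sizes = []
    for _ in range(num_tokens):
        key = [1.0] * head_dim
        value = [1.0] * head_dim
        kv_cache.append((key, value))  # the cache grows by one entry per token
        sizes.append(sum(len(k) + len(v) for k, v in kv_cache))
    return sizes


def recurrent_state_sizes(num_tokens, head_dim, decay=0.9):
    """Floats stored by one recurrent head's state after each token."""
    state = [[0.0] * head_dim for _ in range(head_dim)]
    sizes = []
    for _ in range(num_tokens):
        key = [1.0] * head_dim
        value = [1.0] * head_dim
        # Fold the new token into the existing state:
        # state = decay * state + outer(value, key)
        for i in range(head_dim):
            for j in range(head_dim):
                state[i][j] = decay * state[i][j] + value[i] * key[j]
        sizes.append(sum(len(row) for row in state))
    return sizes


print(attention_cache_sizes(4, 8))  # [16, 32, 48, 64]
print(recurrent_state_sizes(4, 8))  # [64, 64, 64, 64]
```

The attention count keeps climbing with every token, while the recurrent state stays the same size no matter how long the sequence gets.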
The models listed to start this article use a mix of RNN approaches: some models (Qwen and Kimi) use a newer idea called Gated DeltaNet (GDN) and some still use Mamba layers (Granite and Nemotron). The Olmo Hybrid model we're releasing today also falls on the GDN side, based on careful experimentation and on theory that GDN is capable of learning features that attention or Mamba layers cannot.
Introducing Olmo Hybrid and its pretraining efficiency
Olmo Hybrid is a 7B base model, with 3 experimental post-trained checkpoints released, starting with an Instruct model, with a reasoning model coming soon. It is the best open artifact for studying hybrid models, as it is almost identical to our Olmo 3 7B model from last fall, just with a change in architecture. With the model, we are releasing a paper with substantial theory on why hybrid models can be better than standard transformers. This is a long paper that I'm still personally working through, but it's excellent.
You can read the paper here and poke around with the checkpoints here. This is an incredible, long-term research project led by Will Merrill. He did a great job.
To understand the context of why hybrid models can be a strict upgrade on transformers, let me begin with a longer excerpt from the paper's introduction, emphasis mine:
Past theoretical work has shown that attention and recurrence have complementary strengths (Merrill et al., 2024; Grazzi et al., 2025), so mixing them is a natural way to construct an architecture with the benefits of both primitives. We further derive novel theoretical results showing that hybrid models are even more powerful than the sum of their parts: there are formal problems related to code evaluation that neither transformers nor GDN can express on their own, but which hybrid models can represent theoretically and learn empirically. But this greater expressivity does not immediately imply that hybrid models should be better LMs: thus, we run fully controlled scaling studies comparing hybrid models vs. transformers, showing rigorously that hybrid models' expressivity translates to better token efficiency, in agreement with our observations from the Olmo Hybrid pretraining run. Finally, we provide a theoretical explanation for why increasing an architecture's expressive power should improve language model scaling rooted in the multi-task nature of the language modeling objective.
Taken together, our results suggest that hybrid models dominate transformers, both theoretically, in their balance of expressivity and parallelism, and empirically, in terms of benchmark performance and long-context abilities. We believe these findings position hybrid models for wider adoption and call on the research community to pursue further architecture research.
Essentially, we show and argue a few things:
* Hybrid models are more expressive. They can represent, and therefore learn, more types of functions. An intuition for why this would be good: more expressive models suit deep learning because we want to make the model class as flexible as possible and let the optimizer do the work rather than impose constraints on the learner. Sounds a lot like the Bitter Lesson.
* Why does expressive power help with efficiency? This is where things are more nuanced. We argue that more expressive models will have better scaling laws, following the quantization model of neural scaling.
All of this theory work is a great way to go deeper, and frankly I have a lot more to learn on it, but the crucial part is that we transition from theory to clear experiments that back it up. Particularly the scaling laws for designing this model were studied carefully to decide on the final hybrid architecture. The final performance is very sensitive to exactly which RNN block is used and in what quantity.
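For readers who have not fit a scaling law before, here is a minimal sketch of the basic move: fit a power law to (compute, loss) points and compare the fitted curves across architectures. Everything in it is an assumption for illustration. The function name is hypothetical, the points are synthetic (generated from a known curve, not taken from the paper), and the simple form `loss ≈ a * compute**(-b)` ignores the irreducible-loss term that real scaling fits usually include. It is not the paper's fitting procedure.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**(-b) by least squares on log-log values."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic points generated from a known curve, purely to exercise the fit.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [5.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With one fit per architecture, a lower curve at the same compute is the kind of gap the scaling experiments are looking for, and the interesting question is whether it persists as compute grows.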
In scaling experiments, the results showed that for Olmo, the hybrid GDN (3:1 ratio of layers) > pure GDN (all RNN layers) > standard transformer (all attention) > hybrid Mamba2 > pure Mamba2. The crucial point was that these gaps held when scaling to more parameters and compute. A visual summary of the different types of architectures studied is below.
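To make the 3:1 ratio concrete, here is a minimal sketch of how such a layer layout could be written down. The function name and layer labels are hypothetical, I am reading 3:1 as three GDN layers for every attention layer, and placing the attention layer last in each group of four is a guess rather than the released configuration.

```python
def hybrid_layer_pattern(num_layers, rnn_per_attention=3):
    """Return layer types, repeating `rnn_per_attention` GDN layers then one attention layer."""
    period = rnn_per_attention + 1
    return [
        "attention" if (i + 1) % period == 0 else "gdn"
        for i in range(num_layers)
    ]


print(hybrid_layer_pattern(8))
# ['gdn', 'gdn', 'gdn', 'attention', 'gdn', 'gdn', 'gdn', 'attention']

# Setting rnn_per_attention=0 recovers the standard all-attention transformer.
print(hybrid_layer_pattern(4, rnn_per_attention=0))
# ['attention', 'attention', 'attention', 'attention']
```

Under these assumptions a 32-layer model would have 24 GDN layers and 8 attention layers.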
In terms of this specific model, the pretraining gains were giant! Relative to Olmo 3 dense, it represents roughly a 2X gain in training efficiency. When you look at evaluation performance for pretraining, there was also substantial improvement in performance, particularly after long context extension (the final 2 rows of Table 2 in the paper, highlighted below).
The journey to post-training Olmo Hybrid
Most of the experience in post-training Olmo models has been climbing up a steep curve in base model capabilities with minor tweaks to architecture. Our recipes from Tulu 2, Tulu 3, and the Olmo 3 reasoning work (building substantially on OpenThoughts 3) all worked in a fairly straightforward, off the shelf manner. Olmo Hybrid is our first experience in post-training a substantially different architecture, and the results were mixed.
1. Benchmark performance
Following the Olmo 3 recipe, we got some substantial wins (knowledge) and some substantial losses (extended reasoning) relative to the dense model. All together these still represent a very strong fully open model; it's just that the pretraining gains didn't translate as obviously. The results are below.
The exact reason why this happens is a research question. Our best guess is that the Olmo Hybrid base model is just a sufficiently different student model, where most of our post training data at early stages is learning from stronger āteacherā models (a recap of this method, called distillation, appeared recently in Interconnects).
There is a lot of other research ongoing in the community around what makes a strong teacher model ā generally, the best overall model is not the best teacher. In other words, training on data outputted from the model with best evaluation scores today is unlikely to unlock the ceiling in performance for your new base model. A second factor, which is even less explored, is how different base models likely need different teachers to learn from. This is why Olmo Hybrid could perform very differently, where itās behavior is downstream of an architecture-based learning change, where the pretraining data is almost identical.
Thereās A LOT more work to dig into here, some empirical work in generating better data and other work in understanding how different training stages fit together. I am confident this Olmo Hybrid base model is solid and more performance can be extracted, but it takes more careful work adapting existing datasets.
2. Open-source tooling
The frank reality of new architectures for open models is that the open-source software tooling support is horrific. There are the paper cuts that people are familiar with, e.g. random errors in popular libraries (as people experienced with GPT-OSS) that slow adoption, but there are also deeper problems.
A large part of the potential benefit of hybrid models is the reduction in memory usage for long-context generation, which is crucial for reinforcement learning and agentic tasks. It should be a huge win for post-training! This, unfortunately, is far from the case, and will likely take another 3-6 months to get right for this batch of GDN models.
The core problem is that the open-source inference tools, e.g. vLLM, are relying on far less developed kernels (and other internals) when compared to standard transformers. This comes with two challenges: throughput slowdowns and numerical issues. Numerical issues can be combatted with a variety of inference flags. Quoting the paper again:
The two key flags in VLLM we needed to get maximum performance with the post-training model were --disable-cascade-attn, which disables cascade attention (an optimization for shared prompt prefixes), and --enforce-eager, which turns off CUDA graphs. These two flags have been used in our RL setup dating back to Olmo 3, but are new additions to evaluations. Scores for the released models drop precipitously without them. We also evaluated our final models with the hybrid model cache in the richer FP32 datatype, to improve stability via --mamba_ssm_cache_dtype following NVIDIA.
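For readers who want to try these settings, here is a minimal sketch that assembles a server launch command with the three flags named in the quote. The flag spellings are copied from the quote as written; the `vllm serve` entry point, the `float32` value, and the model name are assumptions on my part, so check them against the vLLM version you have installed.

```python
import shlex

# Hypothetical placeholder, not a real checkpoint name.
MODEL = "your-org/your-hybrid-model"


def vllm_serve_command(model: str) -> list[str]:
    """Build an argv list for a vLLM server using the flags from the quote."""
    return [
        "vllm", "serve", model,             # assumed entry point
        "--disable-cascade-attn",           # turn off cascade attention
        "--enforce-eager",                  # turn off CUDA graphs
        "--mamba_ssm_cache_dtype", "float32",  # assumed value for the FP32 cache
    ]


# Print a copy-pasteable shell command rather than launching anything.
print(shlex.join(vllm_serve_command(MODEL)))
```

Building the command as a list keeps each flag easy to toggle when comparing throughput with and without the stability settings.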
Essentially, we used these to make sure the model was numerically stable. The downside is that the inference throughput plummets, so the potential gains in compute efficiency are erased. A comparison of numbers is below.
Effectively, the 7B hybrid model today takes more compute to train with RL than our 7B dense model (that doesn't even have a common memory saving technique, GQA). The total compute estimate from the table at different context lengths is below (more visuals in the slides from my recent CMU talk).
The good news is that these are solvable problems, and improving the tooling could even improve benchmark numbers, but it's going to take a good bit of time and hard work in the OSS community.
This leads to my final question. If I'm optimistic about the open ecosystem evolving to support these models with ease, motivated by the better fundamental scaling of the architectures and a large cluster of leading open model builders already using it, are closed models like GPT and Claude built like this?
To be clear, this answer is a total guess (which I don't normally do), but with the evidence I have I'd put the chance of one of the 3 frontier models being an RNN at around a coin flip. I'll let you know if I learn for sure either way. If the scaling advantages hold at frontier scale, the economic case becomes hard to ignore, but they could already have architectures that are efficient like RNNs, but with even more benefits.
I'm going to follow up this post with more architecture discussions, particularly on why Mixture of Experts (MoE) models are a major headache to post-train, so make sure to subscribe if that sounds interesting to you!
Thanks to Will Merrill and Finbarr Timbers for some discussions that helped inform this post.
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe -
Distillation has been one of the most frequent topics of discussion in the broader US-China and technological diffusion story for AI. Distillation is a term with many definitions; the colloquial one today is using a stronger AI model's outputs to teach a weaker model. The word itself is derived from a more technical and specific definition of knowledge distillation (Hinton, Vinyals, & Dean 2015), which involves a specific way of learning to match the probability distribution of a teacher model.
The distillation of today is better described generally as synthetic data. You take outputs from a stronger model, usually via an API, and you train your model to predict those. The technical form of knowledge distillation is not actually possible from API models because they don't expose the right information to the user.
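To make the difference concrete, here is a minimal sketch of the two objectives for a single next-token prediction. The first needs the teacher's full logits, which an API does not return; the second only needs the token the teacher actually emitted. The function names, the temperature of 2.0, and the toy logits are all illustrative assumptions, and the KL form is in the spirit of classic knowledge distillation rather than an exact reproduction of any particular paper's loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Requires the teacher's full distribution over the vocabulary.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))


def sft_loss(student_logits, sampled_token):
    """Cross-entropy on one token the teacher emitted (all an API gives you)."""
    q = softmax(student_logits)
    return -math.log(q[sampled_token])


# Toy 3-token vocabulary; the numbers are made up for illustration.
teacher = [4.0, 3.5, 0.1]
student = [1.0, 1.0, 1.0]
print("knowledge distillation loss:", kd_loss(student, teacher))
print("synthetic-data (SFT) loss:  ", sft_loss(student, sampled_token=0))
```

Note that the teacher rates tokens 0 and 1 as nearly equally good, which the first loss can see and the second cannot: sampling token 0 throws that information away.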
Synthetic data is arguably the single most useful method that an AI researcher today uses to improve the models on a day to day basis. Yes, architecture is crucial, some data still needs exclusively human inputs, and new ideas like reinforcement learning with verifiable rewards at scale can transform the industry, but so much of the day to day life in improving models today is figuring out how to properly capture and scale up synthetic data.
To flesh out the point from the start of this piece, the argument has repeatedly been that the leading Chinese labs are using distillation for their models to steal capabilities from the best American API-based counterparts. The most prominent case to date was surrounding the release of DeepSeek R1, where OpenAI accused DeepSeek of stealing their reasoning traces by jailbreaking the API. (They're not exposed by default. For context, a reasoning trace is a colloquial term of art referring to the internal reasoning process, such as what open weight reasoning models expose to the user.) Fear of distillation is also likely why Gemini quickly flipped from exposing the reasoning traces to users to hiding them. There was even very prominent, early reasoning research that built on Gemini!
This all leads us to today's news, where Anthropic named and directly accused a series of Chinese labs of elaborate distillation campaigns on their Claude models. This is a complex issue. In this post we unpack a series of questions, beginning with the impact, and ending with politics. The core question is: how much of a performance benefit do Chinese labs get from distilling from American models?
To start, let's review what Anthropic shared. From the blog post, emphasis mine:
We have identified industrial-scale campaigns by three AI laboratories--DeepSeek, Moonshot, and MiniMax--to illicitly extract Claude's capabilities to improve their own models. These labs generated over 16 million exchanges with Claude through approximately 24,000 fraudulent accounts, in violation of our terms of service and regional access restrictions.
These labs used a technique called "distillation," which involves training a less capable model on the outputs of a stronger one. Distillation is a widely used and legitimate training method. For example, frontier AI labs routinely distill their own models to create smaller, cheaper versions for their customers. But distillation can also be used for illicit purposes: competitors can use it to acquire powerful capabilities from other labs in a fraction of the time, and at a fraction of the cost, that it would take to develop them independently.
Much like the models themselves, the benefits of distillation are very jagged. For some capabilities, particularly if you don't have a full training pipeline set up for it, quickly distilling some data from the leading frontier model in that area can yield massive performance boosts. This can definitely help the lab distilling from the API catch up much more quickly than they otherwise would. Most distillation is rather benign, using many tokens of an LLM to help process and refine existing data, putting a lot of compute into getting a few, high quality training tokens out. This sort of raw data processing work can be done on many different APIs, but one tends to be best.
When we go into what Anthropic says the three Chinese LLM builders actually used the Claude API for, the actual impact of the operations is very mixed. (As an aside, Anthropic didn't confirm whether the attack was done through the API, the chat app, or Claude Code.) It's hard to know how much untracked usage these labs deployed for other projects (or other American models).
To start, Anthropic puts DeepSeek first in their blog post because they're the household name in the US for Chinese AI. The extent of their use is actually quite small, showing how this post is more about the big picture than the details:
DeepSeek
Scale: Over 150,000 exchanges
The operation targeted:
* Reasoning capabilities across diverse tasks
* Rubric-based grading tasks that made Claude function as a reward model for reinforcement learning
* Creating censorship-safe alternatives to policy sensitive queries
In the scale of training a language model, 150K samples is only scratching the surface as a substantive experiment. It looks like they were experimenting with some rubrics, which could've been for an online RL run, but that's extremely unlikely with how distributed the access was, and then some minor stuff on completions for sensitive queries. This usage of Anthropic's API will have a negligible impact on DeepSeek's long-rumored V4 model (or whichever model the data here contributed to). This was also very likely a small team at DeepSeek and unknown to much of the broader training organization.
The other two labs, Moonshot AI (makers of the Kimi models) and MiniMax, reflected much broader usage.
Moonshot AI
Scale: Over 3.4 million exchanges
The operation targeted:
* Agentic reasoning and tool use
* Coding and data analysis
* Computer-use agent development
* Computer vision
MiniMax
Scale: Over 13 million exchanges
The operation targeted:
* Agentic coding
* Tool use and orchestration
The role of distillation is constantly changing. Distilling from Claude today for its agentic behavior is much more valuable than versions of Claude have been as a teacher in the past. Claude Opus 4.6 has a well-rounded agentic navigation that none of the other models quite match. Why not try training on some of the model outputs to see if your model absorbs it? Over the next few months, that'll be less differentiated. It's sort of like how all the models are way better at math today than most people need, so there are plenty of places to distill from.
Estimates will vary, but if each exchange had 10-25K tokens, the total tokens across these two labs, mostly with MiniMax, would be 150-400 billion tokens. This is a substantial amount, which could meaningfully improve a model's post-training. For example, in Olmo 3 we had an SFT dataset of 20 billion tokens that could be built like this, and increasing it by 10X would be very reasonable.
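The back-of-the-envelope math behind that range can be written out as below. The exchange counts are the "over 3.4 million" and "over 13 million" figures from Anthropic's post, treated here as exact for simplicity, and the 10-25K tokens per exchange is my assumed range rather than a reported number.

```python
# Reported exchange counts (both stated as "over", so these are lower bounds).
MOONSHOT_EXCHANGES = 3_400_000
MINIMAX_EXCHANGES = 13_000_000

# Assumed tokens per exchange, not a reported figure.
TOKENS_LOW, TOKENS_HIGH = 10_000, 25_000

total_exchanges = MOONSHOT_EXCHANGES + MINIMAX_EXCHANGES
low_total = total_exchanges * TOKENS_LOW
high_total = total_exchanges * TOKENS_HIGH

# Olmo 3 SFT dataset size from the text, and the 10X scale-up mentioned.
OLMO3_SFT_TOKENS = 20_000_000_000

print(f"exchanges: {total_exchanges / 1e6:.1f}M")
print(f"tokens:    {low_total / 1e9:.0f}B to {high_total / 1e9:.0f}B")
print(f"10X Olmo 3 SFT: {10 * OLMO3_SFT_TOKENS / 1e9:.0f}B")
```

The result, roughly 164 to 410 billion tokens, rounds to the 150-400 billion quoted above and brackets a 10X scale-up of a 20 billion token SFT set.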
These numbers are just scratching the surface of total synthetic data generation across APIs hosted by US companies. At the same time, quantity is a pretty crude way to measure impact. Just taking the outputs from Claude and figuring out how to add them to your model pipeline isn't easy. The research community has seen many cases where taking outputs from a certain teacher model unexpectedly makes the student worse: subtle interactions between the data make it variable and tricky to do this type of distillation. It's fundamentally a research problem.
This is what I'm sure the Chinese labs are innovating at. There's an argument that Chinese frontier labs are substantially more efficient than their Western counterparts, but this is misleading.
The labs operate under different constraints. The Chinese labs are likely slightly more efficient out of necessity in being lower on resources, but overall the picture of talent access is very similar. The Chinese labs also approach benchmarks differently, making it appear that they're a bit closer than they really are (and appearing as if they're potentially surpassing). This is needed to get momentum and brand recognition in the AI market.
The Chinese labs likely innovate greatly on distilling from leading API models, due to their restricted access to GPUs. GPUs could be used to construct synthetic data, but for organizations with more funding than they can spend on research compute (being supply limited), using API-based models is one of the few other options for effectively getting more compute. It's way easier to figure out getting access to "banned" API models than it is to smuggle tens of thousands of physical GPUs and get them set up.
It's not only the Chinese labs that operate like this. Synthetic data from a model you don't own is all arguably distillation. Distillation is a shortcut to more compute for anyone. It's also a far less risky cost, as having a big cluster for research requires a very large financial commitment, where APIs are pay-as-you-go. For example, in Olmo 3 we used millions of GPU hours on the Frontier supercomputer and Azure credits through NAIRR for synthetic data. We didn't have the equivalent in GPUs (or really the cash, thank you research credits!).
All together, it's very fair for Anthropic to be concerned about this. I still wouldn't say it is a crucial factor in these Chinese labs' post-training capabilities, especially not one that'll be easy to measure in a time gap to matching the model they're distilling from a la the US-China performance lag.
If we take a step back, there was even a time when Claude Sonnet was the flagship model ahead of Opus (I think this was with Sonnet 3.5). Much of this comes from it being well distilled internally from Opus checkpoints. Fast iteration and high-quality data can go very far, letting student models surpass the teacher. Frontier labs use this to their advantage, by having internal-only models for generating synthetic data, but saying that Chinese models could never pass the US frontier due to data distillation is like saying that Claude Sonnet could never beat Opus. It's unlikely, and it depends a lot on release times, but with AI models making dramatic progress, weirder things like this have already literally happened.
The biggest factor unaddressed here is how distillation from stronger teacher models is harder in an era when reinforcement learning at scale is needed to train the best models. You can spend compute carefully crafting and filtering prompts, but you still need to train the model yourself with substantial, on-policy inference: generation is the majority of the compute cost for RL and it can't be generations from another model. For this reason, I expected this story to die down a bit. It's clear from their open research that Chinese labs have excellent RL infrastructure, despite the compute shortages.
The reason I expected it to fade is that distilling models for "competitive purposes" has violated the terms of service for API models for quite some time. Academics and open model builders in the US used to greatly worry about and debate this (and I've written about it multiple times in 2022 and 2023). Only later in 2024 did that worry die down in the community (and no action has been taken against any smaller model builders).
This action from Anthropic represents another step ratcheting up the AI geopolitical tension. Kneecapping model distillation will be far harder than restricting the shipments of physical goods like GPUs. In many ways, fully restricting distillation through distributed access methods seems almost impossible, and restricting GPU sales would be far more impactful.
Anthropic and the AI industry should choose their battles. When API endpoints are available for the best models, other entities will use that to train variants of said model. This is a natural evolution of AI models. If AI models are so precious that distillation is an extreme risk, then the models will be restricted to first-party products. Anthropic has a choice to do this with their latest models. The market for API-based model alternatives may be so competitive that some companies go this path (likely in part due to Chinese models undercutting on price), but an API is a fundamental offering that no leading lab will risk walking back from anytime soon.
-
Last Thursday, February 5th, both OpenAI and Anthropic unveiled the next iterations of their models designed as coding assistants, GPT-5.3-Codex and Claude Opus 4.6, respectively. Ahead of this, Anthropic had a firm grasp of the mindshare as everyone collectively grappled with the new world of agents, primarily driven by a Claude Code with Opus 4.5-induced step change in performance. This post doesn't unpack how software is changing forever, Moltbook is showcasing the future, ML research is accelerating, and the many broader implications, but rather how to assess, live with, and prepare for new models. The fine margins between Opus 4.6 and Codex 5.3 will be felt in many model versions this year, with Opus ahead in this matchup on usability.
Going into these releases I'd been using Claude Code extensively as a general computer agent, with some software engineering and a lot of data analysis, automation, etc. I had dabbled with Codex 5.2 (usually on xhigh, maximum thinking effort), but found it didn't quite work for me across my broad, horizontal set of tasks.
For the last few days, I've been using both of the models much more evenly. I mean this as a great compliment, but Codex 5.3 feels much more Claude-like, where it's much faster in its feedback and much more capable in a broad suite of tasks from git to data analysis (previous versions of Codex, including up to 5.2, regularly failed basic git operations like creating a fresh branch). Codex 5.3 takes a very important step towards Claude's territory by having better product-market fit. This is a very important move for OpenAI, and between the two models, Codex 5.3 is the one that feels far more different from its predecessors.
OpenAI's latest GPT, with this context, keeps an edge as a better coding model. It's hard to describe this general statement precisely, and a lot of it is based on reading others' work, but it seems to be a bit better at finding bugs and fixing things in codebases, such as the minimal algorithmic examples for my RLHF Book. In my experience, this is a minor edge, and the community thinks that this is most apparent in complex situations (i.e. not most vibe-coded apps).
As users become better at supervising these new agents, having the best top-end ability in software understanding and creation could become a meaningful edge for Codex 5.3, but it is not an obvious advantage today. Many of my most trusted friends in the AI space swear by Codex because it can be just this tiny bit better. I haven't been able to unlock it.
Switching from Opus 4.6 to Codex 5.3 feels like I need to babysit the model in terms of more detailed descriptions when doing somewhat mundane tasks like "clean up this branch and push the PR." I can trust Claude to understand the context of the fix and generally get it right, where Codex can skip files, put stuff in weird places, etc.
Both of these releases feel like the companies pushing for capabilities and speed of execution in the models, but at the cost of some ease of use. I've found both Opus 4.6 and Codex 5.3 ignoring an instruction if I queue up multiple things to do; they're really best when given well-scoped, clear problems (especially Codex). Claude Code's harness has a terrible bug that makes subagents brick the terminal, where new messages say you must compact or clear, but compaction fails.
Despite the massive step by Codex, they still have a large gap to close to Claude on the product side. Opus 4.6 is another step in the right direction, where Claude Code feels like a great experience. It's approachable, it tends to work in the wide range of tasks I throw at it, and this'll help them gain much broader adoption than Codex. If I'm going to recommend a coding agent to an audience who has limited-to-no software experience, it's certainly going to be Claude. At a time when agents are just emerging into general use, this is a massive advantage, both in mindshare and feedback in terms of usage data.
In the meantime, there's no cut-and-dried guideline on which agent you need to use for any use-case. You need to use multiple models all the time and keep up with the skill that is managing agents.
Assessing models in 2026
There have been many hints through 2025 that we were heading toward an AI world where benchmarks associated with model releases no longer convey meaningful signal to users. Back in the time of the GPT-4 or Gemini 2.5 Pro releases, the benchmark deltas could be easily felt within the chatbot form factor of the day: models were more reliable, could do more tasks, etc. This continued through models like OpenAI's o3. During this phase of AI's buildout, roughly from 2023 to 2025, we were assembling the core functionality of modern language models: tool-use, extended reasoning, basic scaling, etc. The gains were obvious.
It should be clear with the releases of both Opus 4.6 and Codex 5.3 that benchmark-based release reactions barely matter. For this release, I barely looked at the evaluation scores. I saw that Opus 4.6 had a bit better search scores and Codex 5.3 used far fewer tokens per answer, but neither of these were going to make me sure they were much better models.
Each of the AI laboratories, and the media ecosystems covering them, has been on this transition away from standard evaluations at its own pace. The most telling example is the Gemini 3 Pro release in November of 2025. The collective vibe was that Google is back in the lead. Kevin Roose, self-proclaimed "AGI-pilled" NYTimes reporter in SF, said:
There's sort of this feeling that Google, which kind of struggled in AI for a couple of years there -- they had the launch of Bard and the first versions of Gemini, which had some issues -- and I think they were seen as sort of catching up to the state of the art. And now the question is: is this them taking their crown back?
We don't need to dwell on the depths of Gemini's current crisis, but they have effectively no impact at the frontier of coding agents, which as an area feels the most likely for dramatic strides in performance (dare I say, even many commonly accepted definitions of AGI that center around the notion of a "remote worker"?). The timeline has left them behind 2 months after their coronation, showing Gemini 3 was hailed as a false king.
On the other end of the spectrum is Anthropic. With Anthropic's release of Claude 4 in May of 2025, I was skeptical of their bet on code. I was distracted by the glitz of OpenAI and Gemini trading blows with announcements like models achieving IMO Gold medals in mathematics or other evaluation breakthroughs.
Anthropic deserves serious credit for the focus of its vision. They were likely not the only AI lab to note the coming role of agents, but they were by far the first to shift their messaging and prioritization towards this. In my post in June of 2025, a month after Claude 4 was released, I was coming around to them being right to deprioritize standard benchmarks:
This is a different path for the industry and will take a different form of messaging than we're used to. More releases are going to look like Anthropic's Claude 4, where the benchmark gains are minor and the real world gains are a big step. There are plenty of more implications for policy, evaluation, and transparency that come with this. It is going to take much more nuance to understand if the pace of progress is continuing, especially as critics of AI are going to seize the opportunity of evaluations flatlining to say that AI is no longer working.
This leaves me reflecting on the role of Interconnects' model reviews in 2026. 2025 was characterized by many dramatic, day-of model release blog posts, with the entry of many new Chinese open model builders, OpenAI's first open language model since GPT-2, and of course the infinitely hyped GPT-5. These timely release posts still have great value, since they center the conversation around the current snapshot of a company vis-a-vis the broader industry, but if models remain similar, they'll do little to disentangle the complexity in mapping the current frontier of AI.
In order to serve my role as an independent voice tracking the frontier models, I need to keep providing regular updates on how I'm using models, why, and why not. Over time, the industry is going to develop better ways of articulating the differences in agentic models. For the next few months, maybe even years, I expect the pace of progress to be so fast and uneven in agentic capabilities, that consistent testing and clear articulation will be the only way to monitor it.
-
One of the big stories of 2025 for me was how Nvidia massively stepped up their open model program: more releases, higher quality models, joining a small handful of companies releasing datasets, etc. In this interview, I sat down with one of the 3 VPs leading the effort of 500+ technical staff, Bryan Catanzaro, to discuss:
* Their very impressive Nemotron 3 Nano model released in Dec. 2025, and the bigger Super and Ultra variants coming soon,
* Why Nvidia's business clearly benefits from them building open models,
* How the Nemotron team culture was crafted in pursuit of better models,
* Megatron-LM and the current state of open-source training software,
* Career reflections and paths into AI research,
* And other topics.
The biggest takeaway I had from this interview is how Nvidia understands their unique role as a company that can both build open language models and directly capture the value they get from building them, giving them a uniquely sustainable advantage.
Bryan has a beautiful analogy for open models this early in AI's development, and how they are a process of creating "potential energy" for AI's future applications.
I hope you enjoy it!
Guest: Bryan Catanzaro, VP Applied Deep Learning Research (ADLR), NVIDIA. X: @ctnzr, LinkedIn, Google Scholar.
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Nemotron Model Timeline
2019–2022 - Foundational Work
* Megatron-LM (model parallelism framework that has become very popular again recently; alternatives: DeepSpeed, PyTorch FSDP).
* NeMo Framework (NVIDIA's end-to-end LLM stack: training recipes, data pipelines, evaluation, deployment).
Nov 2023 - Nemotron-3 8B: Enterprise-ready NeMo models. Models: base, chat-sft, chat-rlhf, collection. Blog.
Feb 2024 - Nemotron-4 15B: Multilingual LLM trained to 8T tokens. Paper.
Jun 2024 - Nemotron-4 340B: Major open release detailing their synthetic data pipeline. Paper, blog. Models: Instruct, Reward.
Jul–Sep 2024 - Minitron / Nemotron-Mini: First of their pruned models, pruned from 15B. Minitron-4B (base model), Nemotron-Mini-4B-Instruct. Paper, code.
Oct 2024 - Llama-3.1-Nemotron-70B: Strong post-training on Llama 3.1 70B. Model, collection. Key dataset: HelpSteer2, paper.
Mar–Jun 2025 - Nemotron-H: First hybrid Mamba-Transformer models for inference efficiency. Paper, research page, blog. Models: 8B, 47B, 4B-128K.
May 2025 - Llama-Nemotron: Efficient reasoning models built on top of Llama (still!). Paper.
Sep 2025 - Nemotron Nano 2: 9B hybrid for reasoning, continuing to improve in performance. 12B base on 20T tokens (FP8 training) pruned to 9B for post-training. Report, V2 collection.
Nov 2025 - Nemotron Nano V2 VL: 12B VLM. Report.
Dec 2025 - Nemotron 3: Nano/Super/Ultra family, hybrid MoE, up to 1M context. Super/Ultra H1 2026. Nano: 25T tokens, 31.6B total / ~3.2B active, releases recipes + code + datasets. Papers: White Paper, Technical Report. Models: Nano-30B-BF16, Base, FP8.
Nemotronās Recent Datasets
NVIDIA began releasing substantially more data in 2025, including pretraining datasets, making them one of few organizations releasing high-quality pretraining data at scale (which comes with non-negligible legal risk).
Pretraining Data
Collection: CC-v2, CC-v2.1, CC-Code-v1, Code-v2, Specialized-v1, CC-Math-v1. Math paper: arXiv:2508.15096.
Post-Training Data
Core post-training dumps (SFT/RL blends):
* Llama Nemotron Post-Training v1.1 (Apr 2025)
* Nemotron Post-Training v1 (Jul 2025)
* Nemotron Post-Training v2 (Aug 2025)
2025 reasoning/code SFT corpora:
* OpenMathReasoning (Apr 2025)
* OpenCodeReasoning (Apr 2025), OpenCodeReasoning-2 (May 2025)
* AceReason-1.1-SFT (Jun 2025)
* Nemotron-Math-HumanReasoning (Jun 2025), Nemotron-PrismMath (Apr 2025)
NeMo Gym RLVR datasets: Collection
Nemotron v3 post-training (Dec 2025): Collection
HelpSteer (human feedback/preference):
* HelpSteer (Nov 2023)
* HelpSteer2 (Jun 2024)
* HelpSteer3 (Mar 2025)
And others, not linked here.
Chapters
* 00:00:00 Intro & Why NVIDIA Releases Open Models
* 00:05:17 Nemotron's two jobs: systems R&D + ecosystem support
* 00:15:23 Releasing datasets, not just models
* 00:22:25 Organizing 500+ people with "invitation, not control"
* 00:37:29 Scaling Nemotron & The Evolution of Megatron
* 00:48:26 Career Reflections: From SVMs to DLSS
* 00:54:12 Lessons from the Baidu Silicon Valley AI Lab
* 00:57:25 Building an Applied Research Lab with Jensen Huang
* 01:00:44 Advice for Researchers & Predictions for 2026
Transcript
00:00:06 Nathan Lambert: Okay. Hey, Bryan. Iām very excited to talk about Nemotron. I think low-key, one of the biggest evolving stories in twenty-five of open models, outside the obvious things in China that everybody talks about, that gets a ton of attention. So th- thanks for coming on the pod.
00:00:22 Bryan Catanzaro: Oh, yeah, itās my honor.
00:00:23 Nathan Lambert: So I wanted to start, and some of these questions are honestly fulfilling my curiosity as a fan. As like, why does NVIDIA, at a basic level, release Nemotron as open models?
00:00:39 Bryan Catanzaro: Well, we know that itās an opportunity for NVIDIA to grow our market whenever AI grows, and we know that having access to open AI models is really important for a lot of developers and researchers that are trying to push AI forward. you know, we were really excited by efforts from some other companies around the industry to push openly developed AI forward. You know, Meta did some amazing work, obviously, with Llama and you know OpenAI released GPT OSS, which was exciting. And the Allen Institute, of course, has been, you know, really leading the charge for research, open research and, you know, also things like the Marin Project and OpenAthena. You know, like thereās, thereās a bunch of things that weāre always excited to see develop.
And, you know, as we think about where AI is gonna go, NVIDIA believes that AI is a form of infrastructure. AI is a very useful technology when it's applied, but on its own, it's kind of a foundation and infrastructure. We think that technology generally works better when there's openness to the infrastructure so that people can build things in different ways. You know, the way that the internet transformed every aspect of the world economy is pretty profound, and we're not done yet.
But the way that, for example, retail uses the internet is different from the way that healthcare uses the internet. And the fact that different sectors of the economy were able to figure out how to incorporate the internet into the beating heart of their businesses in different ways was possible because the internet was built on open technologies that allowed people to try different things. And we think AI is gonna evolve in a similar way, that organizations across every sector of the world economy are gonna find new and surprising and fun and important things to do with AI, and they'll be able to do that better if they have the ability to customize AI and incorporate it directly into the work that they do. And by the way, this is not to detract from any of the more closed approaches to AI, you know, the APIs that we see from a number of leading labs that are just extraordinary and have amazing capabilities. We're excited about those, too.
You know, NVIDIA loves to support AI in all of its manifestations, but we feel like right now the closed approaches to deploying AI are doing pretty well, and the openly developed AI ecosystem could use some more energy, and so that's why we've been putting more effort into it this past year.
00:03:42 Nathan Lambert: Yeah. So I'm definitely gonna dig into this a lot 'cause I have seen this. We're sitting here recording in January 2026, which is in the midst of the rollout of these Nemotron 3 models. I think the Nano was released in the fall, which was probably one of the biggest splashes the org has made, and everybody's eagerly awaiting the larger Super and Ultra variants.
And it's like, how far are you willing to push this Nemotron platform? Is it just depending on the users and the uptake and the ecosystem? Is there a North Star in this? If you listen to a lot of other open labs, they're like: "We want to build open AGI," which I don't necessarily think is grounded, but it's a very unifying vision.
Is there something that you try to set the tone with that goes through the organization? I mean, AI2, it's like-
00:04:31 Bryan Catanzaro: You know, my North-
00:04:32 Nathan Lambert: .. academics is so-
00:04:34 Bryan Catanzaro: For Nemotron.
00:04:36 Nathan Lambert: Okay, go ahead.
00:04:37 Bryan Catanzaro: Oh, sorry. Go ahead.
00:04:39 Nathan Lambert: I was just gonna compare to AI2, where we have a very specific vision of being so open, because I think research is so needed, and there are so few recipes to build on with really credible research. So there's a research infrastructure. And then when you have something like Llama, it was built on Zuckerberg's vision, and he changed his mind. I actually thought his vision was excellent, the way he articulated the need for open models, and it kind of faded. So is there a way to set a vision for an org that permeates everyone and is really compelling and exciting?
00:05:17 Bryan Catanzaro: Right. Well, we built Nemotron for two main reasons. The first is because we need to for our main product line. So what do I mean by that?
Well, accelerated computing, what NVIDIA does, we build fast computers, right? But the point of building fast computers is to help people do new things. And actually, every fast computer is also a slow computer. The observation that it would be nice if computers were faster and could do more things isn't new. That's been around since the beginning of computing. So what makes accelerated computing different from standard computing is that we're prioritizing, we're focusing, we're deciding we're gonna accelerate this workload. This other workload, which is like ninety-nine percent of all of the workloads, we're gonna let somebody else do that, right?
So, like, you do not buy NVIDIA systems to do any general purpose computation. You buy them for a purpose, right? Which, these days, is all about AI. But when you think about the compute workloads involved in AI, there's a lot of diversity, and there's a lot of really important parameters, hyperparameters, or algorithmic approaches that all have enormous impacts on the systems that we need to build for AI.
So things like numeric precision, or MoE architecture, which of course influences network design. You know, we're dreaming about sparsity. We've had sparse neural network acceleration in the GPU since Ampere. I don't think that it's being used enough. So how do we figure out how to use that? These sorts of things have an enormous impact on the future of NVIDIA's main product line, and we have to understand the answers to those questions deeply ourselves in order to know what we're going to build.
We can't just go to our customers and do a survey and say, "Hey," you know, Meta, for example, since we were just talking about them, "what would you like to see in a future product line from NVIDIA?" Of course, Meta's always trying to help us as much as they can, but there's limits to what they can tell us because a lot of the information that influences the design of these systems is very expensive to derive, and so it's very closely held. And so we need to be able to understand these questions very deeply in order to understand what kind of systems to build, in order to understand what we're accelerating in AI and what we're not gonna worry about. And so that's kind of the first job for Nemotron models: to make it possible for NVIDIA to continue to exist as a company. And I think it's important that the community knows that, because that's the reason why NVIDIA is making the investments in Nemotron: we believe it's essential for the future of our company. And as much as it feels good to say, you know, NVIDIA believes in openly developed AI because we're so charitable, actually, that's not the case. This is actually a business decision-
00:08:34 Nathan Lambert: It's smart.
00:08:34 Bryan Catanzaro: ... like, for NVIDIA, our business needs us to know about AI very deeply. And so the amount of investment that is justified to carry on NVIDIA's ongoing business, I think, is large. So that's job number one for Nemotron. Now, job number two for Nemotron is to support the ecosystem more broadly outside of NVIDIA. And, you know, NVIDIA has a special position in the AI landscape. Of all of the big AI companies, I think we're the one that works with the most other companies. We support every company, small and large, from AI-native company to old established enterprise.
We work with hyperscalers, we work with tiny little startups, we work with countries around the world. So we have this unique position, and I think also a unique responsibility and maybe also a unique opportunity: whenever AI is able to grow in any sort of direction, in any capability, that's an opportunity for us to grow our business. Obviously, it's not automatic, right? The AI market is diverse, and it's getting more diverse, and it should be, 'cause it's the most important market in the history of humanity. So we acknowledge that, and at the same time, we know that it's in our interest to develop the AI ecosystem. The more people that are building, inventing, and deploying AI, the more opportunity that we have as a company.
So that's job number two for Nemotron.
00:10:17 Nathan Lambert: Yeah. I really appreciate you saying it so directly. I launched this thing, the ATOM Project, last summer, which is trying to get more investment in US open models, and the only company that has an obvious business model for open models is something like NVIDIA, where you need to make sure that the open models and the research ecosystem play nicely on CUDA, because then you're so many steps closer to the research that's happening. There's such an advantage to having research happen mostly on NVIDIA GPUs relative to AMD or anything like this.
00:10:49 Bryan Catanzaro: Well, you know, we're not thinking about how to prevent competition. You know, we welcome competition. There's lots of competition. There should be more competition in this space, but we are very self-interested in staying engaged with the community.
You know, it's very important. With CUDA, not many people remember this because it happened so long ago, but CUDA started out with a lot of outreach from NVIDIA to the academic and industrial community saying, "Hey, we have this new way of doing computing. We'd love to see what you can do with it." In fact, I started using CUDA in 2006 when I was a grad student at Berkeley because David Kirk, who was the chief scientist of NVIDIA at the time, came over to Berkeley and said, "Hey, we just released this new GPU, and it has this new programming model called CUDA. You should give it a try." At the time, I was working on machine learning on FPGAs, and I had been working on this one particular piece of support vector machine training on the FPGA, and I decided to take that little piece and write it in CUDA, and it took me like fifteen minutes, and then I ran it, and it was like two hundred times faster than my single-threaded CPU code, and I was like: "Whoa, that was way easier than what I was doing before. I'm just gonna go do that," right?
So, like, my own personal involvement with CUDA and NVIDIA came about because of this outreach that NVIDIA conducted right from the beginning of CUDA. Of course, that led to a lot of great things for NVIDIA, including AlexNet, which was another academic project, where Alex Krizhevsky and Ilya Sutskever were thinking about: "How do we train larger neural networks on more data? We're gonna go write a bunch of GPU code that uses the GPU in a kinda new and clever way, so that we can train a better image classification model." And, you know, that had such astonishing results, it kicked off the deep learning era for the whole community. And again, that's not something that could have been done top-down. That was very much a result of NVIDIA supporting open development and research in parallel computing and artificial intelligence. And so we remember that, and we're thinking about, in 2026, what does it look like to help the Alex Krizhevsky of the future, who's a grad student in a lab somewhere, invent the next technology that changes the world? It seems really difficult to do that without something like Nemotron or the other openly developed AI projects out there. I also wanna say, in regards to this, Nemotron is not trying to be the only project out there.
We're part of the community. We love other people doing great work in openly developed AI. We learn from things that other people do. So we're trying to support the community because it's in our interest, but we're very happy to see other people contributing as well.
00:13:57 Nathan Lambert: Yeah, I mean, I can transition into something I wanted to ask about. I don't wanna use the word maturing, 'cause I wanna ask you about how it feels in the org, but in 2025 Nemotron's output reached levels that were more noticed by the community and people building with models. And there's a lot of ways that can happen, but one of them is, in my niche community, I've been using Nemotron datasets a lot. When we redo our post-training recipe, one of the only people we look at is NVIDIA. Nemotron has released a lot of high-quality, openly licensed post-training data. This year, you also started releasing some pre-training data, which got a lot of notice at AI2. What is that? Is that a distinct shift within Nemotron?
Is that something that you've wanted to do for a while and finally just did? It is a zero-to-one moment, where releasing pre-training data comes with legal risk for any company, and so few people do it. On my side of the world, it's normally pretty easy to say what the best pre-training dataset is, and for a long time it had oscillated between Hugging Face, AI2, and DCLM, and there were literally only two or three options. So in terms of fundamental research, I think that's a big step for an org to support the community and take on some risk. So if you have any story you can tell, or I can just say I appreciate it. That's all I got.
00:15:23 Bryan Catanzaro: Well, yeah. I mean, so I think it'd be great if more people could understand that Nemotron is not just a model, right? Like, what we're trying to do with Nemotron is to support openly developed AI, because, again, that's our big opportunity, right? Now, there's a lot of organizations that are incentivized to build a model, and the model is maybe the thing that runs their business, right?
But at NVIDIA, the model is not the thing that runs our business, it's the systems. So when we're thinking about how do we support the ecosystem, it's clear to us that the ecosystem needs more than just a model. There's a lot of models out there already, you know? And of course, we want Nemotron to be awesome, but if Nemotron can convince other people to work on AI because of a dataset or a technique, you know, we're trying to be very open with all of the things we learn.
I mean, we do a lot of expensive experiments in order to figure out how to do blending for our datasets, or to optimize our settings, these sorts of things. We're very happy for other people to pick that up and run with it if it's useful to them. And so that makes Nemotron a different kind of AI effort. Of course, there is a model component, and that's a tangible thing, and it's easy to focus on that, but we see Nemotron as an effort that includes models, but also includes datasets, techniques, all of the research that goes into Nemotron. And again, we're a unique kind of AI organization. Because of the way that we work with AI companies around the industry and because of the way that our business works, we can afford to be more open with some of these things than maybe some other organizations could be.
Now, to your question about, like, does it take some courage in order to be open? Yeah, absolutely it does. One of the things that's happened in 2025 is that there's been an evolving understanding within NVIDIA about the benefits of openness, and that has really enabled the company to make some investments that perhaps it was a little gun-shy to make in the past. And so that's really encouraging for me. It's something that I've advocated for a while, and so it's great to see the company kind of lining up behind it. Also, to your point about 2025 being a year where Nemotron really made some strides, I want to say thank you for noticing that, and then maybe tell you a little bit about how that happened, because I think it's instructive for me about how the work is gonna go forward in the future.
So you know, NVIDIA is a very decentralized company with a lot of volunteers. You know, everybody that works at NVIDIA is a volunteer. And what do I mean by that? Well, I mean, look, the industry is moving quick.
You know, people can always move from one job to the next. So the way that we think about the work that we do is, it's very decentralized, it's very much let smart people figure out what they should be doing and then kind of self-organize. Now, one of the challenges of self-organization in a field that's moving quickly is that sometimes a whole bunch of people decide to do similar, overlapping things but aren't really coordinated. And that's okay at the beginning because, in a place like NVIDIA, it's just great to have some energy. It took us a while, I think, as a company to figure out that Nemotron was better together.
Rather than having, like, this group has a model and that group has a dataset, and then we end up publishing papers that don't really acknowledge each other and aren't really coordinated. And then, of course, along with that, we need to have k times the GPUs, where k is the number of independent efforts. We realized that in building AI, you really do need to figure out how to collaborate. The AI efforts that are built from teams of people focused on the overall effort succeeding, rather than their own particular piece of the project succeeding, those are the ones that really change the world. And, of course, NVIDIA works that way for the systems that we build, right? So, like, the people working on the memory controller on the GPU know that they also have to work with the people working on the SM that does the math, right?
Like, you can't make a GPU where it's just like, "Well, we've got an awesome memory controller," if the math doesn't work, right? It all has to kinda work together. And that coordination, I think in the field of AI, took us a little bit longer than you might imagine it could have, and I think that slowed the progress for Nemotron. So I give a lot of credit to the Nemotron team for realizing over the past, I don't know, year and a half or so, that it was really time to join up and build one thing and make it awesome, and deeply understand that the success of the Nemotron project was more important than the success of any individual piece of that project. And the reason why I'm telling you all of this is because I think that's actually true more broadly than just inside NVIDIA, and I think it's difficult. Researchers, like those of us with PhDs, for example, we are taught how to be independent, you know, and how to build up our Google Scholar profile, and there's an incentive to go ahead and focus on that.
And a lot of successful academics and researchers manage to push that pretty far and get some pretty amazing results. But I do believe that in the 2020s, the best research is done as part of a larger team. So how do we figure out how to work together? How do we figure out how to put the success of the team first? That is a thing that is challenging to do, but if we can achieve it, I think it yields significant results.
And, you know, to the extent that we made progress in that part of the organization, I think we also saw progress in the technology. That gives me great hope for 2026 for Nemotron, because the way the team is working together, I think, is pretty extraordinary. There's just an enormous number of brilliant people that have decided that they're gonna volunteer to make Nemotron awesome, and we're starting to see some pretty great things come together.
00:22:25 Nathan Lambert: I agree with everything you said. Do you have any advice for making the orgs come together? I think there's two classes of AI companies right now. One is the startup that does everything, and you have a model in six months, but you're building from zero, and everybody agrees when they start that they do this. And then you have Google's famous long-winded reorgs, which they actually eventually got right. Like, they got it very right with what's going on with Gemini and Google DeepMind right now. So do you have any advice on doing this? I'm at AI2 also advocating for this, but it's very hard. I think personally-
00:22:58 Bryan Catanzaro: It's-
00:22:58 Nathan Lambert: ... I mean, I'm a special case 'cause I'm also visible, where it's very easy for me to turn internet activity into, like, reputation points because of algorithms and size. But it's very hard to do bottom-up technical work and get all of this and get all the culture alignment. So do you have any advice on what actually works in this domain?
00:23:20 Bryan Catanzaro: You know, what's worked for us is invitation and not control. One thing that, for a while, I kinda wanted to try to implement was, like, nobody gets to publish any papers in AI unless they're clearly part of Nemotron. So this is kind of a top-down, like, we're gonna make you do it, right? We never implemented this, by the way, but I came to the realization that this was a bad idea because it would just breed resentment, and, you know, NVIDIA is a company of volunteers. Everybody here is a volunteer.
So what we need to do is create the conditions by which it makes sense for people to volunteer to be part of Nemotron. And the way that we went about doing that, first of all, involved some top-level agreements between me and some of the other leaders of Nemotron, for example, John Cohen and Kari Briski. I work very closely with the two of them. And, you know, that hadn't always been the case.
Like, we kind of had all come to this place independently. But we realized, like, Nemotron, better together, all three of us, and then we started telling our teams: "You know, we really think Nemotron is gonna be better together." So that top-down alignment, I think, was really helpful. Again, we weren't telling people exactly what to do, but we were just sending a constant message: "Nemotron's better together." And then we built some structures that facilitated collaboration. In the past, decisions in the Nemotron project tended to be made in kind of an opaque way, and the reason for that is just that it's hard to tell everybody about the middle of the sausage-making process. You know, it's messy and difficult, and so it's natural.
Like, researchers, we're used to doing this, right? It's a fait accompli. Like, "Here's my ICML paper," and the fact that you spent, like, two years failing at that task before you finally succeeded, and then you tied a bow around it and gave it to the ICML committee, you don't really talk about that, right? And so it's difficult for researchers to be open about the middle of the process of research.
There's a lot of failure, and it's hard for people to feel like they're not looking amazing. But what we decided to do is structure the project into about twenty different areas. Each of them has a clear leader, what we call a pilot in command.
The job of the pilot in command is to land the airplane. You know, you just want the airplane to land, okay? If you're landing an airplane, there might be multiple pilots on board, but only one of them is gonna land the airplane at any time, right? Because it would be chaos if two of them tried to land at the same time. People would die.
So this is not a committee structure; it is a delineated responsibility structure. And the purpose of that pilot in command for each of these sections is to gather together all the best ideas, help the group of people that are interested in working on that space come up with data-driven answers to what we should do and what technical decisions we should make, and then document that in a way that other people can review. And the thing that's been really great about that is that it is inviting to people, because when they see, okay, here's the group of volunteers that are working on this area of Nemotron, and they want to contribute, it's much clearer how they could go about doing that, and it's also clearer what the group needs, because these meetings are being held in the open. We actually have a website where all of the ideas are submitted. They each get a unique identifier, and then they get engaged with. The PIC is trying to understand what the implications are, and what kinds of experiments need to be run in order to prove or disprove the idea. How do we do what I call integration studies? Integration studies are so key for bringing researchers together, and they're so opposite of what we are taught when we're learning how to do ablations as a graduate student. Rather than isolating the particular contribution of one idea, integration studies are about putting a hundred ideas together and seeing if they're better than what we had before. Doing that in a structured way, and in an open way internally, has made it possible for more people to volunteer, and that has generally raised the rigor of the experiments and also, I think, the outcome of the work.
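[Editor's sketch, not from the conversation: the contrast Bryan draws between ablations and integration studies can be made concrete with a toy example. Everything here is an assumption for illustration: the idea names, the `toy_score` function, and its numbers are invented, and a real study would train and evaluate models rather than call a lookup. The toy only shows why ideas that each help in isolation still need a combined test.]

```python
from typing import Callable, Dict, FrozenSet

# Hypothetical per-idea gains, invented for this illustration only.
TOY_GAINS = {"data_blend": 2.0, "new_optimizer": 1.0, "long_context": 0.5}


def toy_score(enabled: FrozenSet[str]) -> float:
    """Stand-in for 'train a model with these ideas and evaluate it'."""
    total = 50.0 + sum(TOY_GAINS.get(name, 0.0) for name in enabled)
    # Invented interaction: two ideas that conflict when used together.
    if {"new_optimizer", "long_context"} <= enabled:
        total -= 3.0
    return total


def ablation_study(score: Callable[[FrozenSet[str]], float],
                   baseline: FrozenSet[str],
                   ideas: FrozenSet[str]) -> Dict[str, float]:
    """Isolate each idea: gain of baseline + one idea over the baseline."""
    base = score(baseline)
    return {idea: score(baseline | {idea}) - base for idea in sorted(ideas)}


def integration_study(score: Callable[[FrozenSet[str]], float],
                      baseline: FrozenSet[str],
                      ideas: FrozenSet[str]) -> float:
    """Combine all ideas at once and compare against the baseline."""
    return score(baseline | ideas) - score(baseline)


if __name__ == "__main__":
    ideas = frozenset(TOY_GAINS)
    print(ablation_study(toy_score, frozenset(), ideas))
    print(integration_study(toy_score, frozenset(), ideas))
```

[In this toy, every idea looks positive on its own, yet the combined gain is much smaller than the sum of the isolated gains because two of the ideas interfere. That gap is what a combined test is meant to expose.]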
00:28:15 Nathan Lambert: Yeah, this is great. I think that over the last few years, there's been more consensus on things that work for research. We also do integration tests very regularly, like, is this feature gonna land for the model? And that's kind of a...
It's a nice mirror to ablations, where we know research is changing so much. There's a lot of turmoil in the academic research community, and it's nice to have things that are tangible as ways that are a little bit different when you're doing these large-scale projects. So you still need to do ablations. But then it needs to survive an additional test in order to land into the model.
So it's an additional type of work that needs to be done, and I just like to have words to describe what is actually happening. On the Nemotron 3 Nano front, I do a lot of analysis looking at basic adoption metrics, and we created what we called a relative adoption metric, which is essentially looking at downloads over time for models, because it's easy to know which models that were released a while ago have a ton of downloads. But to look at the trajectory of downloads changing over time... this is a mouthful. It's kind of an aside, but Nemotron 3 Nano, in the 30B size range, was on track to be one of the top ten most-downloaded models of all time.
The reason I bring this up, other than to just flatter you, is: do you think last-mile adoption takes a substantial amount of work beyond making a very functional model? Do you need to change the recipe that you're making, and put a lot of focus on evaluation, and change this over time so that you actually get people to really use the model, rather than, like, "Oh, the benchmarks are good, look at NVIDIA flying high"?
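[Editor's sketch, not from the conversation: Nathan does not define the relative adoption metric here, so the following is only one plausible reading, labeled as an assumption. It compares a model's cumulative downloads at a fixed number of days after its own release with the median of peer models at the same age, which removes the head start that older models have. The function names and the sample numbers are invented.]

```python
from statistics import median
from typing import List, Sequence


def cumulative_at(daily: Sequence[int], day: int) -> int:
    """Total downloads over the first `day` days after release (1-indexed)."""
    if day < 1 or day > len(daily):
        raise ValueError("day must fall within the recorded history")
    return sum(daily[:day])


def relative_adoption(model_daily: Sequence[int],
                      peers_daily: List[Sequence[int]],
                      day: int) -> float:
    """Ratio of a model's cumulative downloads to the peer median at equal age.

    Assumed definition for illustration; values above 1.0 mean the model is
    ahead of the typical peer at the same number of days since release.
    """
    peer_totals = [cumulative_at(p, day) for p in peers_daily if len(p) >= day]
    if not peer_totals:
        raise ValueError("no peer has enough history for this day")
    return cumulative_at(model_daily, day) / median(peer_totals)


if __name__ == "__main__":
    # Invented daily download counts, aligned to each model's release day.
    new_model = [100, 300, 600]
    peers = [[50, 50, 100], [100, 100, 200], [400, 400, 400]]
    print(relative_adoption(new_model, peers, day=3))
```

[Aligning every series to its own release day is the design choice that matters: a raw leaderboard of total downloads mostly rewards age, while this kind of comparison looks at trajectory.]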
00:30:03 Bryan Catanzaro: Right. Yeah, I mean, wow, it has taken the whole company coming together in order to make Nano V3 have more of an impact than the models that we released before. And there's so many different aspects to that. Obviously, there's a lot of technical aspects, where frankly I think we have more work to do. So, like, making sure that on day zero, when we release something, all the best quantizations are out there, that the speed on all of the important inference frameworks is there, that it runs flawlessly on all of the edge devices that we care about, that the install experience is great. This kind of work is extraordinarily important because it's a crowded world.
There's so many different things that people could choose to work with, and any amount of friction that gets in the way of people even evaluating something that you do is gonna blunt the results, no matter how good that technology is. I don't think that we're amazing at this yet, so this is something that I anticipate we're gonna see a lot more investment in as more people at NVIDIA from all over the company, from marketing, from developer relations, from software engineering, all come together in support of this effort. So yeah, it does take an enormous amount of work. And then something that I'm particularly interested in is, how do we engage with the community in a new way to make future Nemotron models even stronger? If the only things that we were to optimize for with a Nemotron model were academic benchmarks that are highly cited, it's likely that the model wouldn't be general enough to really be useful. What we're trying to build is a technology that other people can extend and deploy, and that means we need to have other ways of understanding the strength of a model besides a handful of academic benchmarks.
I think we have a lot of room to grow here. I'm hoping over time that we develop the muscle of being able to engage with the community and learn from them. Like, okay, this particular thing that I tried to do with Nemotron, it didn't work. It did this other thing that I wasn't expecting, and it was wrong. Well, that can become feedback that is then used to make the next version better.
I think we've got a lot of work to do in that regard.
00:33:10 Nathan Lambert: Do you think there's any magic to it? I'm blown away by how successful OpenAI's two open-source models are. Like, yes, they're obviously the number one name brand in AI, but on the same metric where I see you guys overperforming what I would expect, where I'm like, "Wow, great job, NVIDIA," they're totally off the charts, on track to beat Llama's most-downloaded numbers ever with these two GPT OSS models.
And even on release, they had hiccups where people were pretty negative on it. But for whatever reason, people figured it out, and it just clicked, and for a company to say so little about it. Like, Meta put so much effort into Llama being adopted, and you obviously are putting a lot of effort into this.
Like, I'm just like, did OpenAI just crack the code, or is there sometimes a bit of luck?
00:33:59 Bryan Catanzaro: Well, I don't think about OpenAI as a lucky company. I think of them as a visionary company that works incredibly hard, and I think their success is well deserved. I love the GPT OSS models. They're definitely an inspiration for us here at Nemotron. And I think OpenAI also has some other ways of engaging with the community, just because of the large number of people that use their services, and that helps them learn things about what people are trying to do with AI, which they can then address when they're building models. Obviously, people talk about that as a flywheel. I think that's really interesting and really important.
NVIDIA is never going to have the same kind of flywheel as OpenAI does. Weāre not trying to build a service like ChatGPT. What weāre trying to do is help the ecosystem, you know, be strong and enduring. we think that itās important for there to be this openly developed AI ecosystem, and also weāre, weāre trying to build our next generation of systems, and so we have our own reasons for doing this. But weāre not ever going to have the same exact user base or flywheel that OpenAI does.
On the other hand, you know, we are able to work with institutions around the world in our own way, that I think offers us different opportunities and hopefully, that helps us make things that are, that are useful, too.
00:35:38 Nathan Lambert: Yeah, this makes me realize, I'm having a lot of conversations on this. There are many open model efforts, especially even among people that are fully open, and it's like, how do we better coordinate? Especially at the smaller scale, it's like AI2 and Hugging Face. Those are not big teams.
Like, how do we make sure we're not doing the same data project, the same exact thing, at the same time? And I wonder if there are opportunities for open companies. Like, LM Arena has historically released a lot of user data to better help us close this kind of "what are people using models for" flywheel. But it's very hard to build cross-organizational model improvement pipelines, I think. I think models become pretty vertical in terms of somebody at NVIDIA getting the feedback and making the model better.
So that's something I would like to see this year, but I don't have ideas for doing it well.
00:36:28 Bryan Catanzaro: Yeah. You know, at NVIDIA, we have a tradition of working really closely with organizations that use our technology. And, you know, we have teams of engineers whose job is to enable success for our customers. In fact, there's more people at NVIDIA that care about the success of people outside of NVIDIA than I feel like sometimes there are people that care about the success of things inside NVIDIA. So sometimes I'm like: "Hey, could we use a little bit of that energy to support Nemotron?" And the answer is yes, and NVIDIA is doing that. But I think as Nemotron matures, we're gonna find that, you know, the organizations that work with NVIDIA to make Nemotron awesome for their business, for their use case, are gonna have a say in how Nemotron evolves, and hopefully that helps Nemotron address their needs.
00:37:29 Nathan Lambert: Yeah, a basic question: how many employees does it take to build all the different versions of Nemotron? I haven't brought this up, but you also have other great types of models. I think our open model analyst, Florian, is obsessed with the Parakeet model, because he's much faster at speaking than typing.
So there's a lot of other.. I don't have the full list of other NVIDIA models off the top of my head, but you are releasing a lot of varieties of models. So there's a bit more context to my original question: I think about language models because I just think AI's progress is gonna continue to go very fast, so I focus on that as the engine. But how many people does it take to put this kind of movement into place?
00:38:16 Bryan Catanzaro: Yeah. Well, it's hard to know exactly, and as I said, NVIDIA is a company of volunteers. But also, these days, things are changing, right? So the Parakeet team, which is an excellent team, by the way, I would say a year ago they wouldn't have really considered themselves so much part of the core Nemotron effort, but these days they absolutely are, for the obvious reason that, you know, LLMs these days need to be able to consume all sorts of data, right?
Including audio data. And so, you know, as the capabilities of Nemotron models expand, obviously the number of people contributing is gonna expand. I'd say right now there's about five hundred people that are working pretty much full-time on Nemotron technologies in different ways. This is everything from numerics and quantization recipes to speech recognition or image understanding or, you know, pre-training, post-training, RL systems, inference software. You know, there's a whole bunch of different dimensions, right?
So I'd say it's about five hundred people. But also, we're having our Nemotron all-hands meeting this week, and so I took a look to see how many people were invited to that all-hands meeting, and it was about two thousand. So those are people around the company that are interested in working with Nemotron and either expanding its capabilities or helping its adoption. And so I think, you know, the number is somewhere in between, and it's hopefully gonna keep growing as Nemotron matures.
00:40:07 Nathan Lambert: Yeah, I mean, that's one of the greatest attestations to what you're saying: if the interest inside the company is four times as big as the people doing it, you're gonna keep scaling up, it seems. People are gonna find ways to help. One of the other things I'm interested in, on the point of five hundred: it sounds like a lot of people, but with how many things you have going on, it also seems very few. 'Cause I'm transitioning to thinking about the long-standing open-source software that you've had for NeMo, and I think Megatron, and they've been around for a long time. I think Megatron has gone through many eras. I have a note here.
It's like this software has been going around since, like, 2019 in some form. And it's, it-
00:40:51 Bryan Catanzaro: Publicly. We had our first public release in 2019, but we started earlier.
00:40:56 Nathan Lambert: And something that I've found is that when I started doing language models (I was a late bloomer, and we'll transition to some career talk in a few minutes) at Hugging Face, Megatron had, like, a bad rap of being very hard to use. But now, like three years later, I hear from anyone that's founding a new language modeling startup, they're like, "Just use Megatron." Like, do you pick up on things like this? Is it just, like, random-
00:41:22 Bryan Catanzaro: Well, we-
00:41:22 Nathan Lambert: ...but it's like-
00:41:22 Bryan Catanzaro: We work hard on it. You know, we're trying really hard to make Megatron easier to use. It's difficult. Megatron is a complicated piece of technology, and, you know, when we originally started Megatron, the point was to show the community that you could make state-of-the-art large transformer language models with NVIDIA.
I don't know if you recall, but there were some assertions by some other companies back in 2017, when the transformer was invented, that they could only be made without NVIDIA. In fact, there were statements to that effect on official blog posts, which I think got redacted later on. But it was important for NVIDIA to show up and say, "We love language models. We love transformers. Let's see what we could do, you know, if we partitioned the work properly on lots of GPUs with an amazing interconnect, what kinds of models could we train?" And so that's where the Megatron project started.
You know, I actually came up with the name Megatron. One of my proudest moments, I suppose. I was thinking about it, I was like: This is a really big transformer. What's the biggest and baddest transformer? Oh, it's Megatron.
So that's, you know, where the name came from. But if you think about it, that had nothing to do with usability, right? Like, I wasn't thinking about, like, how do we make a platform that's really easy for other people to use? I was just trying to show the world that, like, NVIDIA systems could be awesome for transformers. You know, that was my goal.
Over the years, you know, it has evolved. We have a lot more people trying to use Megatron. We got a lot of complaints about how hard it was to use, and then we did a lot of work to try to improve the software engineering around Megatron. You know, these days Megatron software engineering is actually shared between about four different teams at NVIDIA, and we have to coordinate that work very closely.
That has also not been easy. There have been times when, you know, people wanted to fork Megatron, and then there were times when we, like, had to bring it back together, and it's like: Look, I know forking things is always tempting, but look, better together. It's better for all of us to keep working together. And so I feel like Megatron, and especially Megatron Core, which is like a subset of Megatron that's especially protected, and we try to put more software engineering into that, has gotten dramatically better since we started paying more attention to it as a company. Are we done yet? No, there's a lot, a lot, a lot more work.
00:43:52 Nathan Lambert: A basic question: is Megatron or Megatron Core what Nemotron is trained on? And it's also something that many of the hottest AI startups are training their models on. I would guess that there's nothing else that does that. So, like, could you summarize why it's so hard?
00:44:11 Bryan Catanzaro: Well, you know, there's a lot of other great frameworks out there. Megatron's not the only one, and, you know, we're happy about that. NVIDIA doesn't need to control the space. What we do wanna do is make sure that we're putting our products forward in the best light, you know, and it's a challenging problem.
We've got so many things going on with precision and, you know, the networking. Like, the software is so complicated. These days, you know, we're pre-training our Nemotron-3 Super and Ultra models using FP4, which is a thing that, you know, hasn't been done publicly anyway, and something that we're pretty excited about because our GPUs have really awesome FP4 throughput. But obviously, the numerical challenges of trying to train a state-of-the-art language model using four bits are non-trivial. So, you know, all of that work has to go into Megatron and into Transformer Engine, which is another open-source project that Megatron relies on. And coordinating all of that, making sure that we can actually deliver the benefits of NVIDIA systems to people that are trying to make state-of-the-art models, that's really important to us.
And, you know, of the five hundred or so people working on Nemotron, a pretty good fraction of them are working on these kinds of systems issues, right? Because NVIDIA, at its core, is a systems company. And Nemotron's first job really is about systems, you know, and so we care deeply about that.
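[Editor's sketch] To give a feel for why four bits is numerically tight, here is a minimal illustration of rounding a block of values to a 4-bit floating-point grid with one shared scale. It assumes the E2M1 layout (one sign bit, two exponent bits, one mantissa bit), whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4 and 6. The function names and the simple max-based scaling are hypothetical and chosen for clarity; this is not NVIDIA's training recipe and not the Megatron or Transformer Engine API.

```python
# Representable magnitudes of a 4-bit E2M1 float (sign bit handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block_fp4(block):
    """Round a block of floats to E2M1 values using one shared scale.

    Hypothetical helper: the scale maps the largest magnitude in the
    block onto 6.0, the largest representable E2M1 magnitude.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block), 1.0
    scale = max_abs / E2M1_VALUES[-1]
    quantized = []
    for x in block:
        target = abs(x) / scale
        # Pick the closest representable magnitude, then restore the sign.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        quantized.append(nearest if x >= 0 else -nearest)
    return quantized, scale


def dequantize_block_fp4(quantized, scale):
    """Map E2M1 values back to the original range."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    values = [0.2, -2.2, 6.4, 12.0]
    q, s = quantize_block_fp4(values)
    print("quantized:", q, "scale:", s)
    print("restored: ", dequantize_block_fp4(q, s))
```

With only eight magnitudes available per block, small values next to a large outlier collapse toward zero, which is one reason real low-precision recipes need much more careful scaling than this sketch.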
00:45:51 Nathan Lambert: Yeah. I mean, from my perspective, I was at Hugging Face before AI2, and Hugging Face is, like, the best company at doing public work. But switching to AI2, we're focused on the output artifact the most. It's such a different type of work, going from trying to build a tool that's good for training models to building a tool that's good for everybody else and whatever the heck their use case is.
00:46:13 Bryan Catanzaro: It's different.
00:46:13 Nathan Lambert: So I think-
00:46:13 Bryan Catanzaro: Yeah. Different work.
00:46:14 Nathan Lambert: To do both is like... I'm happy that AI2's repos aren't that popular in terms-
00:46:21 Bryan Catanzaro: Oh,
00:46:21 Nathan Lambert: ...of open-source adoption because, like, we can't handle it. We just can't. It's so hard because it ends up being researchers that are supporting it, and we don't have the ability to scale the organization structure. So I just think that's a very fun turnaround for me, to think of all these things happening at once.
00:46:39 Bryan Catanzaro: Yeah. Well, thanks for noticing we're putting effort in. I would say Megatron is still not nearly as user-friendly as Hugging Face libraries. Like, Hugging Face libraries are legendary, and I admire the work they've done to make the community so productive. People, you know, are able to get so much research done thanks to the work that Hugging Face has put into their library. So, you know, my hat's off to them as well.
00:47:06 Nathan Lambert: Yeah. One of my hot takes, you don't have to reply, is that Hugging Face and NVIDIA have been very good partners.
00:47:10 Bryan Catanzaro: Oh, absolutely.
00:47:10 Nathan Lambert: And it's like bringing that Hugging Face culture to the NVIDIA stuff would be so good. It's just so hard, so I don't know how that would work, but-
00:47:17 Bryan Catanzaro: We're trying, you know, and it is challenging. NVIDIA is always a company that is gonna prioritize speed, like hardware speed, above really anything else, 'cause that's, like, who we are. I am always trying to make the case that developer speed is important, too, right? It's like there's different ways of thinking about speed. And it is definitely the case that a lot of NVIDIA's software is so cumbersome to use that, you know, people can't get the actual hardware speed as fast as it should be because they just give up.
You know, they just don't even figure out how to use it. So I think NVIDIA's making strides there. I think the company is understanding more deeply how important developer experience is, and I hope we continue to push that, so that the benefits of all of the systems technology that NVIDIA works so hard on can be more widely used. But at the same time, you know, there is gonna be a tension between those things. It's not gonna go away, and, you know, to a certain extent, I think that's just life on planet Earth.
00:48:26 Nathan Lambert: It is. I think you're doing a good job, and I'm gonna kind of shift gears in this interview. In becoming a person that works in language models again, I've seen your name more and more times.
I was like, "Bryan Catanzaro, where have I seen this?" And then I went and did the research on the Berkeley PhD. It says in April of 2021, you gave a Berkeley EECS Colloquium titled "Applications of Deep Learning in Graphics, Conversational AI, and Systems Design." I'm not even gonna posit that I actually went, but that's definitely where I remembered the name from in grad school. And we both have backgrounds that aren't traditionally in AI and ended up working in language models. I just wanted to ask: what have you learned from your path through NVIDIA about what people should be thinking about with AI or open models today?
This could be career reflections, like technical reflections. I just think that there are actually a lot of people that come from all over the, like, STEM field to work in AI, so giving it-
00:49:29 Bryan Catanzaro: Sure
00:49:29 Nathan Lambert: ...space to think about is useful, even if it's just like, it was the big problem, and I wanted to go solve it.
00:49:31 Bryan Catanzaro: Well, I think, you know, I've had a lot of opportunity and a lot of luck in my career. I think in hindsight, it seems like an extraordinarily lucky thing that, you know, I did my first internship at NVIDIA in 2008, and I was, like, building machine learning models on the GPU, and I went to NVIDIA, and nobody else was really doing that. And I was like, "Hey, like, we should have more people doing machine learning on the GPU.
I think this could be an opportunity." And, you know, it took a few years for me to make any headway. NVIDIA didn't really wanna listen to me. I was a brand-new PhD. I was in the research organization, which is very independent, but, you know, sometimes struggles to change the way that the bigger company thinks about things.
And yet, I just had this conviction, you know, I was just following my heart about what I think is gonna be important, what do I think could really change the world? And that has been, I think, the thread that has taken me through my whole career: I'm constantly trying to refine my beliefs about what matters and then hold to them. I don't know how helpful it is to say that, but I feel like sometimes people, you know, tend to follow whatever the thing is that people are talking about on Twitter.
And I've done a lot of unpopular things during my career because I believed in them, you know? I remember I published my first paper in 2008 at ICML, on training support vector machines on the GPU, and I actually had somebody at the conference (it was in Helsinki), at dinner, you know, we were all telling each other what we're doing, and I was like: Yeah, I wanna help people train bigger models on bigger data sets with GPUs. And I had, you know, a couple of people just say, "Well, why are you here at ICML? That just doesn't really feel like a good thing for us." And in 2008, ICML was mainly about new mathematical frameworks for thinking about data, and, you know, maybe if you trained a model at all, you would train one on your laptop.
You know, that was the state of machine learning in 2008. So for somebody to come in and say, "I think I want to focus on, like, parallel computing, new kinds of hardware for machine learning, programming frameworks for machine learning, so that, you know, more people can try inventing new models on complicated machines with a lot more compute throughput on bigger data sets," that was an unpopular thing. At least it felt very unpopular. I felt very marginalized at the time by the community.
But I believed in it, you know? I have this sense of, like, where do I think technology is going? I knew that traditional computing was running out of steam.
You know, I had done a few internships at Intel, and I was trying to help Intel make processors that ran at, like, ten gigahertz back in 2001, and, you know, it was clear that they were running into a wall. And I was thinking: Okay, so if the compute hardware is gonna have to be different, it's gonna be more restricted. It's not gonna be able to be so general-purpose in order to get speed. What kinds of applications are gonna have, like, an infinite need for more computing?
And I thought, well, machine learning and AI, that could really change the world if it ever actually worked. But, you know, back then, it kinda worked inside of Google. Outside of Google, it kind of didn't work. And so I had kinda these signals, like it was possible, but it was hard. It was a little weird. It was a little niche.
I was a little bit caught in between different fields, like the systems people didn't think I was systems enough, and the machine learning people didn't think I was machine learning enough. But I believed in what I was doing, and I found a way to keep following that belief. And, you know, ultimately it was very rewarding when all of a sudden NVIDIA decided, "Hey, deep learning is changing the world. What do we know about deep learning?" And then it was like: Oh, well, Bryan's been doing that for several years, and he's written some libraries that we could turn into a product.
Let's go do that. And, you know, so that all happened really quickly after many years of nothing happening, you know? And that was obviously an amazing opportunity for me. Another thing that was important to me: I left NVIDIA in 2014 to go work at the Silicon Valley AI Lab at Baidu with a group of really talented people, including Andrew Ng and Dario Amodei and Awni Hannun and Adam Coates, and, you know, this was a really once-in-a-lifetime opportunity, I think, for me to learn some things that would have been hard for me to learn on my own. I felt at the time at NVIDIA that although I had this great opportunity to help NVIDIA become an AI company, and I was doing that, and I was succeeding at that back in 2013 and 2014, I also felt like I really wanted to learn from a broader community of people applying machine learning and AI to solve really important business problems. And so going to work at Baidu really gave me that chance. I was there for a couple of years and learned a ton. I'm very grateful to the team there, especially to Andrew Ng, who encouraged me to join with him on that. And then, you know, I ran into limits of what I could do in California, working for a Chinese company.
I was thinking about, you know, what should I do next? And Jensen asked me to come back and build an applied research lab at NVIDIA in 2016. And I wasn't sure if that was a good idea. I thought NVIDIA's already grown so much, you know.
In the years from 2014 to 2016, NVIDIA actually grew a lot. These days you look back at it, and you're like: It was still really tiny. But back then, I was like: I don't know, maybe NVIDIA's already tapped out. I don't know if you recall, in 2016 there were already, like, ten different companies making GPU competitors, right? The TPU had already been out for a while, and, you know, it wasn't clear that NVIDIA was gonna become as large as it has.
But I believed in the opportunity. I believed in the people. You know, one of the things I loved about NVIDIA was that it's a very stable organization. So Jensen, he's been running it since he founded it in 1993. My boss, Jonah Alben, who's an absolutely extraordinary person, has been here for, you know, quite a long time, almost since the very beginning of NVIDIA. And these people, a lot of the leadership at NVIDIA, they love the work.
Their heart is in the work. Jensen and Jonah and many other leaders at NVIDIA, they don't need to be doing this, right? They have earned the right to go sit on a beach and drink mai tais all day, but their heart is in the work, and they work incredibly hard. I feel like if there was an Olympics for email, you know, Jensen would get the gold medal.
You know, it's unfathomable to me how much information he's able to process. And it's a skill that he's built up over a long time running this company, but it's also a reflection of his commitment to the work. And I felt like working at a place where we've got this very stable organization that loves the work, that really wants to change the world. You know, why does Jensen get up in the morning? Well, this is his chance to do something meaningful.
I thought, associating with these people, you know, I could do worse. I think I could learn from this as well. And so I came to NVIDIA, and back then it was really hard to explain to people why I was trying to build an AI lab inside of NVIDIA. At the time, NVIDIA wasn't doing very much AI, and so I had to kind of develop a vision for that and then explain it to people. That's ended up being a really good idea for me as well.
You know, the lab, I think, has really helped NVIDIA. Megatron, I think, has really shown the industry how valuable NVIDIA systems can be for language modeling, which is awesome. DLSS, you know, I'm continuing to push DLSS forward. Very excited about making graphics more efficient with AI. These days, you know, fifteen out of every sixteen pixels a gamer sees are rendered by AI models that my team developed, and that then makes the GPU ten times more power efficient.
This is a really exciting thing for me to be involved with, something that I've, you know, dreamed about for years. So that's the kind of thing that continues to push me forward: I have strong beliefs about what I think is possible, where I think technology's going, and I'm willing to do things that are weird and unpopular but, you know, basically following my convictions. I'm very much always thinking about the people I'm working with, the tribe. You know, I think tribes matter enormously.
So back when I was a grad student, I was working on programming models for machine learning. I joined the Python tribe. There were other people that were in the Scala tribe, and the people that did their work in the Scala tribe, trying to make programming models for machine learning in, like, 2010, you know, that work, although a lot of it was technically excellent, didn't matter to the community as much as the work of the people who were in the Python tribe. And, you know, it kind of sucks sometimes that the world is tribal like this, but it's just the case.
You know, the people that you work with, the community that you work with, has a big impact on the problems you think about and then the impact that your work has. So I think a lot about the people and the tribes that I'm collaborating with or that I'm part of. And, you know, that's kind of been the thread that has carried me through my career.
00:59:56 Nathan Lambert: Yeah. Thanks for sharing this full arc. I think you've said things that I tell people, but in different language. The first one, the early days: it seems like there can be space in between fields, where two fields will have their way of describing things, but both of them are probably incomplete, and there can be space there, which is a lot of what I was doing transitioning from novel robots to model-based RL, where I didn't sit in BAIR, the actual AI lab, but I started doing AI with my, like, total electrical engineering friends. And then the second thing, and I'd wholeheartedly recommend this to people, is to choose your work based on the people, and people that sincerely are in it for what they want to do, and a lot of-
01:00:41 Bryan Catanzaro: And follow your beliefs. You know, think about it. What do you believe in? And it's okay to change your mind, you know, but, like, figure out what it is that you believe in.
Ask yourself every day: Do I still believe in that? If I do, what next? You know. If I don't, well, what do I believe in?
You know, that's been really important to me. I think too many people end up kind of just following trends. That's not usually helpful because the trends are too late. So if you wanna change the world, you need to be ahead of the trends, and you need to know, you know, I don't think trends in computing are just fashion.
I think there's truth that drives those trends. Not always, but often. You know, there's kind of an inevitable force of gravity. It just can be really hard to parse out the noise and figure out what is the truth that is gonna push the industry forward, and how can you push with it.
You know, if you can join with that, you can accomplish great things.
01:01:36 Nathan Lambert: Yeah, I agree. I think in building language models, it's like you want to build a model that the community wants in six months. I think if you're building a model to compete with the models that are already out, you're not gonna keep up. And I think that what the right thing to build in open language models is in six months, and where you need to try to steer things, is one of the hardest problems that I think about. So if you want to close with any predictions of where you see open models going, if we're gonna be here at the end of 2026, if there's anything you think will be far more obvious than it is today, or any bets that you want to make, I think it's kind of a good place to wrap.
01:02:18 Bryan Catanzaro: Well, predictions are always hard, and I don't feel like I'm very good at making predictions. But I feel like I am good at identifying what I believe in, and what I believe in right now is that compute remains one of the fundamental challenges behind AI. It has been that way for a very long time, and I think it continues to be. I think as we find new ways to apply compute to AI, we discover new forms of scaling laws that help AI become more useful, and therefore it becomes more widespread.
So I'm gonna keep thinking about compute. I continue to believe that, you know, the way to think about AI is not just in terms of absolute intelligence, but rather intelligence per second. You know, there's some sort of normalization in there that relates to how fast a model can think, how fast a model can be trained or post-trained. Models that kind of incorporate this compute acceleration characteristic, where they're thinking about intelligence per unit time, those are gonna end up winning because they end up getting trained on more data, they end up getting post-trained with more cycles, they end up with more iterations during thinking when they're deployed. And, you know, of course, if they happen to fit the hardware really well, whatever hardware that is, then that can have a pretty non-trivial effect on the intelligence as well.
So that's something that I really believe in. I really believe in AI as an infrastructure. You know, there's different ways of thinking about AI. I think some people believe AI is more like the singularity, like once AGI has been declared, then the whole world is different forever, and all humans have lost their jobs and, you know, there's a lot of things about AI that people believe that I personally don't believe.
You know, I believe, first of all, that intelligence is very multifaceted, that it is not easy to pin down, that as soon as we try to pin down intelligence, we find that there are many more forms of intelligence that aren't covered by that. So, for example, a model that achieves gold medal status on the International Math Olympiad, that's an extraordinary achievement, but it doesn't make me have no job, right? Like, I'm actually not solving math problems all day, even though having the ability to solve math problems is clearly very useful. And, you know, it's also the case that intelligence is kind of like a potential energy; it's not a kinetic energy, right?
In order to transform intelligence into kinetic energy, it needs to have a platform. It needs to be applied in the proper way. And, you know, that is why I believe in open models and openly developed and deployed intelligence. I believe every company, every organization, has secrets that only they know. They have special data, they have special ways of thinking about their problems, their customers, their solutions, and they're gonna know how to apply AI better than anyone else.
And so AI as infrastructure that transforms companies, turbocharges them, allows them to take the things they know and multiply their impact, that's something that I believe in more than AI as an event that one day, when it happens, makes everyone obsolete. I just don't believe in that. You know, I often joke that if, for example, the CEO were to retire at some point, and we needed to find a replacement, you know, handing out an IQ test or asking who has the highest SAT score would not be a very good way of finding a replacement, you know? Intelligence is just far too complex for that. And so, you know, these beliefs, you can disagree with me about anything that I just said, and I'm not offended by that.
I have a lot of friends that do. But, you know, I'm asking myself, well, if I believe that intelligence has these characteristics and that AI is gonna change the world by turbocharging institutions that exist and also creating new applications that we haven't even dreamed of yet, rather than replacing all humans, then, you know, how do I go about building that? And so that's kind of the direction that I'm on right now.
01:07:00 Nathan Lambert: Yeah, I love it. I agree that we're entering an interesting area where the open models are taking so many different shapes and sizes and have so many different strengths and trade-offs that there can start to be interesting interplay as an ecosystem, where there's just so many different things going on. And I like your idea of potential energy, and you have to build things where it's kind of unclear what.. It's like you have to build the energy in a way, and you don't really know what the goal is, but you have to try to build these good models. So I appreciate it, and-
01:07:30 Bryan Catanzaro: Yeah, and then let people apply it. Let them make the kinetic energy happen.
01:07:35 Nathan Lambert: I agree. Thanks for coming on.
01:07:37 Bryan Catanzaro: Thanks so much for inviting me. It's been a great conversation.
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe