Episodes

  • https://www.interconnects.ai/p/claude-4-and-anthropics-bet-on-code

    Claude’s distinctive characteristics are a best-in-class personality and the ability to perform software engineering tasks effectively. Both characteristics appeared in force with the first version of Claude 3.5 Sonnet — a major breakthrough model at the time and the model that pulled me away from ChatGPT for the longest. That model was released on June 20, 2024, and just the other day, on May 22, 2025, Anthropic released Claude Opus 4 and Claude Sonnet 4. The strengths of these models are the same.

    The models serve as an instrument in Anthropic’s bigger goals. The leading AI models alone now are not a product. All the leading providers have Deep Research integrations set up, ChatGPT uses memory and broader context to better serve you, and our coding interactions are leaving the chat window with Claude Code and OpenAI’s Codex.

    Where Anthropic’s consumer touchpoints, i.e. chat apps, have been consistently behind ChatGPT, their enterprise and software tools, i.e. Claude Code, have been leading the pack (or relatively much better, i.e. the API). Anthropic is shipping updates to the chat interface, but they feel half-hearted relative to the mass excitement around Claude Code. Claude Code is the agent experience I’ve liked the best of the few I’ve tried in the last 6 months. Claude 4 is built to advance this — in doing so it makes Anthropic’s path narrower yet clearer.

    As a reminder, Claude 4 is a hybrid-reasoning model. This means that reasoning can be turned on and off at the click of a button (which is often implemented with a simple prompt at inference time and length-controlled RL at training time — see the Nemotron reasoning model report for more on hybrid-reasoning techniques). In the future extended thinking could become a tool that all models call to let them think harder about a problem, but for now the extended thinking budget button offers a softer change than switching from GPT-4.1 to o3.
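
    To make the toggle concrete, here is a minimal sketch of calling the same model with extended thinking off and on, assuming the Anthropic Python SDK's documented thinking parameter; the model ID string and token budgets are illustrative, not confirmed values from this release.

    ```python
    # Minimal sketch: one model, extended thinking toggled per request.
    # Assumes the Anthropic Python SDK's thinking parameter; the model ID
    # and budgets below are illustrative.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def ask(prompt: str, think: bool = False):
        kwargs = {}
        if think:
            # Give the model an internal reasoning budget before it answers.
            kwargs["thinking"] = {"type": "enabled", "budget_tokens": 8000}
        return client.messages.create(
            model="claude-sonnet-4-20250514",  # illustrative model ID
            max_tokens=16000,
            messages=[{"role": "user", "content": prompt}],
            **kwargs,
        )

    fast = ask("Summarize this bug report.")                # standard, low-latency reply
    deep = ask("Refactor this module safely.", think=True)  # extended thinking on
    ```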

    Claude 4 gut check

    In AI, model version numbers are meaningless — OpenAI has model number soup with their best model being a random middle number (o3) while Gemini took a major step forward with an intermediate update — so Claude 4 being a seemingly minor update while iterating a major version number to fix their naming scheme sounds good to me.

    In an era where GPT-4o specifically and chatbots generally are becoming more sycophantic, Claude’s honesty can be a very big deal for Anthropic. This is very hard to capture in release notes, but it still comes across in the takes of lots of early testers. Honesty has some downsides, such as Claude’s ability to honestly follow its alignment training and potentially report rule-breaking actions to authorities. Honesty and safety are very desirable traits for business customers, a segment where Anthropic already has solid traction.

    In a competitive landscape of AI models, it feels as if Anthropic has stood still in their core offerings, which allowed ChatGPT and Gemini to claw back a lot of mindshare and user-share, myself included. Claude 4’s “capabilities” benchmarks are a minor step up over Claude 3.7 before it, and that’s on the benchmarks Anthropic chose to share, but it is still clearly a step forward in what Claude does best.

    Benchmarks are a double-edged sword. Claude 4 will obviously be a major step up for plenty of people writing a lot of code, so some will say they’re never looking at benchmarks again. This approach doesn’t scale to enterprise relations, where benchmarks are the headline item that gets organizations to consider your model.

    On some popular coding benchmarks, Claude 4 actually underperforms Claude 3.7. It would be good for the industry if Claude 4 was rewarded for being a practically better model, but it goes against a lot of what the industry has been saying about the pace of progress if the next major iteration of a model goes down on many popular benchmarks in its core area of focus.

    Buried in the system card was an evaluation to measure “reward hacking,” i.e. when the model takes an action to shortcut a training signal rather than provide real usefulness, which showed Claude 4 dramatically outperforming Claude 3.7, a model riddled with user headaches on this front.

    This single benchmark summarizes a lot of the release. They made the model more reliable, and what follows ends up being Anthropic falling into normal marketing paths.

    This release feels like the GPT-4.5 release in many ways — it’s a better model in general use, but the benchmark scores are only marginally better. It’s obviously a strong and well-crafted model (doubly so in the case of Opus), but it’s not immediately clear which of my grab-bag of use cases I’ll shift over to Claude. I’m not the intended audience. I write code, but a lot of it is one-off hacks and it’s certainly not sustained development in a major code-base. Without better consumer product offerings, I’m not likely to try Claude much more often. That doesn’t mean there isn’t a strong audience for this model in the software industry. My vibe tests for the model were good, but not good enough to break my habits.

    Anthropic shared evaluation numbers for the model both without extended reasoning and with extended reasoning plus parallel test-time compute. Neither of these configurations is really standard for sharing evaluations of new cutting-edge models (mostly of the reasoning variety).

    The oddness of the benchmark presentation reiterates that Anthropic is going down a bit of a different path with their models relative to OpenAI and ChatGPT.

    It should be fairly obvious to most AI observers that if simply turning on extended thinking for Claude 4 was enough for Opus to be competitive with o3, or Sonnet with Gemini 2.5 Pro, they would’ve done it. Without the shaded regions, the bars do not look so impressive (more on this below), and this leads us to one of the major facts of the Claude 4 release — the benchmarks are meh. They can’t lead this model to mindshare.

    This is partially in the context of how Anthropic is very narrowly curating the benchmarks they share to match their coding and agentic use-cases.

    The Anthropic announcement benchmarks are: SWE-Bench Verified, Terminal-bench, GPQA-Diamond, TAU-bench, MMMLU, MMMU, and AIME 2025. It’s 3 mostly agentic coding benchmarks, 3 knowledge benchmarks, and one very hard math benchmark. Traditional “coding” benchmarks aren’t even really here.

    Compare this to the benchmarks from Gemini 2.5 Pro’s recent release: Humanity’s Last Exam, GPQA, AIME 2024/2025, LiveCodeBench, Aider Polyglot, SWE-bench Verified, SimpleQA, MMMU, Vibe-Eval, MRCR, and Global MMLU. This is a wider mix and has only one agentic-ish task in SWE-bench.

    The presentation is also arguably misleading in the blog post, where they report scores that are from a model version inaccessible to users. The first number is “standard-use” without test-time compute.

    Where Anthropic says the results are “without test-time compute” it’s hard to know what the baseline is. Claude was the first mainstream model to show signs of doing some sort of internal chain of thought (CoT) before showing the final answer to the user. This was in the model and discussed before the launch of OpenAI’s first o1 model.

    For the second number, the fine print in the blog post states:

    On SWE-Bench, Terminal-Bench, GPQA and AIME, we additionally report results that benefit from parallel test-time compute by sampling multiple sequences and selecting the single best via an internal scoring model.

    When Claude 3.7 launched, Anthropic wrote a nice blog post on test-time compute that also talked about parallel compute. The higher of the two numbers in their benchmarks illustrates what is happening there. I expect Anthropic to release an o1-pro-style product soon (as Google also announced with Gemini DeepThink). These ways of using the model are very powerful, and because Anthropic selected answers with an internal scoring model rather than something like a pass@10 metric, which simply gives the model multiple tries against ground truth, these are gains users could actually benefit from.

    This method gives the shaded bars in the results below.
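
    To make the distinction concrete, here is a minimal sketch of both setups, with hypothetical generate, score, and is_correct callables standing in for the model, an internal scoring model, and ground-truth checking. Best-of-n selection is something a product can actually ship; pass@k needs an answer key the user never has.

    ```python
    # Hypothetical helpers: generate(prompt) -> str, score(answer) -> float,
    # is_correct(answer) -> bool. Only their shapes matter for the comparison.

    def best_of_n(prompt, generate, score, n=8):
        """Parallel test-time compute: sample n answers and return the one an
        internal scoring model ranks highest. The user still sees one answer."""
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=score)

    def pass_at_k(prompt, generate, is_correct, k=8):
        """Oracle-style metric: did any of k samples solve the task? Requires
        ground truth, so it measures potential rather than a shippable product."""
        return any(is_correct(generate(prompt)) for _ in range(k))
    ```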

    With distillation from powerful models being so common today, making the distinction for benchmarking between reasoning and non-reasoning models, or between test-time compute and standard inference, is very strained. For users, there are many more differences that come into play when actually serving the models.

    There are only a few reasonable ways to compare models today, and only one of them is arguably practical:

    * Compare evaluation scores as users will actually use the models. E.g., you can only report parallel test-time compute scores if they’re available in a product like o1-pro.

    * Compare peak scores across models, so you can see the best performance each AI system can reach.

    * Release FLOP spend per prompt on the evaluation sets and bin models with different levels of compute per question.

    Because we don’t get the data to do these comparisons, we tend to compare using the first bucket. When we see shaded bars on plots (like above, or in OpenAI’s o-series release blogs), we ignore the shaded regions.

    Benchmarks obviously aren’t everything to a model’s release. This analysis is to show why the AI field is strained by being forced to communicate the abilities of their models through benchmarks that don’t capture the full picture.

    In using Claude Opus 4 (and Sonnet too) instead of Gemini 2.5 Pro, I was immediately struck by how much slower they are.

    The character and real-world use of the model matters far more, but in a world where OpenAI’s and Google’s latest models have both leading benchmark scores and good vibes (as long as you’re not using GPT-4o), it makes you question Anthropic’s position to compete for the whole market.


    Will Anthropic code their way to AGI first?

    There’s a long-standing assumption in AGI-centric circles that having the best coding model will let you get to AGI the fastest. A version of this argument is the “software-driven singularity” of the AI 2027 forecast. This is a reasonable argument to make if you pair it with the assumption that the ability to implement AI ideas is the limiting factor on progress. It is obviously a major factor, but taking such a narrow worldview makes you miss how AI progress is actually made. AI progress is messy, incremental in data, and takes a lot of hours of human focus. Resources and human attention are the bottleneck more than software ability.

    I expect improved coding ability to deliver very strong marginal gains. It makes the process of doing AI research much smoother, particularly by enabling more concentrated research teams and organizational structures, but it won’t be the single factor that is looked back upon as the key to AGI. The key is many small insights and lots of hard work, mostly on data, over time.

    The Code RL team at Anthropic is “singularly focused on solving SWE. No 3000 elo leetcode, competition math, or smart devices.” If having the best coding model was going to let Anthropic get to AGI first, then why haven’t we begun to see the benefits of it? The Claude 4 release shows that Anthropic is falling behind on general benchmarks and not climbing substantially on those they highlight. In many ways, this looks like Claude getting more robust across a variety of use-cases and not accelerating forward in general intelligence.

    The argument that having the best coding model is the core ingredient for getting to AGI first then reduces to a belief that these posited benefits will kick in at some point in the future, and that Anthropic’s models will become better at everything else too. The AI laboratories are extremely competitive, and it looks as if Google and OpenAI are improving on software tasks and a broader range of abilities.

    There are regular press releases about a certain share of PRs being written by AI across the technology sector generally — Anthropic CPO Mike Krieger recently highlighted the number being ~70% for them — which likely counts anything where AI is a co-author. At the same time, these AI systems have struggled to grasp very complex codebases, so human oversight is still a crucial step of the process. The AIs make everything easier, but not automatic.

    It seems like a far more reasonable path to something called Artificial General Intelligence is one that shows incremental improvements on a broad variety of tasks, rather than narrowing focus and waiting for a future payoff.

    Focusing on software development is still a good business strategy for Anthropic, but saying that it’ll let them leapfrog OpenAI and Google in the AGI race is a weak attempt to accept reality.

    As a regular user of claude.ai who is greeted by rate limits, the problem limiting their progress is more likely to be compute allocation than talent or research strategy. I’ve said before that human competition is the biggest driving force of rapid progress in AI models, so I also worry about whether Anthropic’s culture of safety and anti-arms-race mentality can harness that.

    A more compelling argument than code could be that Anthropic is leading on the “agentic front,” which means the models can plan effectively and accomplish the tool-use calls to enact those plans. Claude Code is a positive example of this, but the weakness of their Deep Research product is a negative mirror. With bigger error bars on what is possible with agents generally, this could be a better place to make a case for optimism about Anthropic.

    So-called “coding” abilities are very broad and encompass understanding error traces, extreme long-context abilities to understand a code-base, basic scripting, multi-file edits, and many things in between. Agentic abilities seem to fall into a narrower niche, or at least a more well-defined one, where the model needs to be able to accomplish many incremental tasks on its own while managing its context. This could generalize to a far bigger market than just software if one model is miles ahead. The winner in the agentic platform space should become clearer later into 2026.

    As a summary of the state of affairs for the major AI players, here is where they stand:

    * OpenAI is the consumer leader and still very well-positioned with extremely strong models.

    * Google is the general enterprise leader with the best models across every task or size you would need (e.g. the lack of Claude Haiku 4 is very limiting for Anthropic, and Haiku has remained expensive). If they can get their act together building products, even OpenAI should worry.

    * Anthropic is the leading model for software engineers and related tasks — maybe they should’ve acquired Windsurf instead? This core area complements a well-rounded and functioning enterprise business, just one that will be smaller than Google’s.

    * Meta is building models to serve their platforms, which will be the most significant competitor with ChatGPT, but they have major cultural or organizational knots to untangle to catch up technically.

    * Grok is on the path to being a niche player serving use-cases that need more permissive content guidelines. They have an API, but it is far from well-established in key areas.

    * DeepSeek is an x-factor that could disrupt many of the above, but we never know when it’ll land.

    In the list above, as businesses, OpenAI and Google appear in a league of their own. Anthropic seems solid but is heading for a much smaller ceiling, and the others below are still floundering to find a true AI strategy.



  • https://www.interconnects.ai/p/people-use-ai-more-than-you-think

    I was on ChinaTalk again recently to talk through some of my recent pieces and their corresponding happenings in AI.

    Usage and revenue growth for most AI services, especially inference APIs, has been growing like mad for a long time. These APIs have been very profitable for companies — up to 75% or higher margins at times, according to Dylan Patel of SemiAnalysis. This is one of those open facts that has been known among the people building AI but can be lost on the broader public in the chorus of new releases and capabilities excitement.

    I expect the subscription services are profitable too on the average user, but power users likely cost the AI companies money, alongside the obvious capital expenditures of training frontier models. Still, even if the models were held constant, usage is growing exponentially and a lot of it is in the realm of profitability.

    The extreme, and in some cases exponential, growth in use of AI has been happening well before lots of the incredible progress we’ve seen across the industry in the first half of the year. Reasoning models that change inference answers from something on the order of 100s of tokens to sometimes 10s of thousands of tokens will make the plots of usage even more stark. At the same time, these models are often billed per token so that’ll all result in more revenue.

    On top of the industry’s vast excitement and progress in 2025, the Google I/O keynote yesterday was a great “State of the Union” for AI that highlighted this across modalities, form factors, and tasks. It is really recommended viewing. Google is trying to compete on every front. They’re positioned to win a couple use-cases and be in the top 3 of the rest. No other AI company is close to this — we’ll see how their product culture can adapt.

    Highlights from I/O include Gemini Deep Think, Google’s equivalent to OpenAI’s o1 Pro; Google’s new multimodal models such as Veo 3 with audio (a first, to my knowledge, among the major players); a live demo of an augmented reality headset to rival Meta and Apple; and a new version of Gemini 2.5 Flash that’ll serve as the foundation of most customers’ interactions with Gemini.

    There were so many awesome examples in the keynote that it didn’t really make sense to write about them on their own. They’re paths we’ve seen laid out in front of us for a while, but Google and co. are marching down them faster than most people expected. Most of the frontier language modeling evaluations are totally saturated. This is why the meta usage data that Google (and others recently) have shared is the right focal point. It’s not about one model, it’s about the movement being real.

    The slide that best captured this was one showing AI tokens processed across all of Google’s AI surfaces (i.e. including all modalities), which has been skyrocketing in the last few months.

    I annotated the plot to approximate that the inflection point in February was at about 160T total tokens in a month — Gemini 2.5 Pro’s release was in late March, which surely contributed but was not the only cause of the inflection point. Roughly, the numbers are as follows:

    * April 2024: 9.7T tokens

    * December 2024: 90T tokens

    * February 2025: 160T tokens

    * March 2025: 300T tokens

    * April 2025: 480T+ tokens

    Monthly tokens are rapidly approaching 1 quadrillion. Not all tokens are created equal, but this is about 150-200M tokens per second. In a world with 5T Google searches annually, which translates to around 100K searches/second, that tokens per second number is equivalent to roughly using 1000 tokens per search (even though that is definitely not how compute is allocated). These are mind boggling numbers of tokens.
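
    A rough back-of-the-envelope check on those numbers, using the rounded figures from the text:

    ```python
    # ~480T tokens in April 2025, spread over a 30-day month.
    monthly_tokens = 480e12
    tokens_per_second = monthly_tokens / (30 * 24 * 3600)        # ~185M tokens/s

    # ~5T Google searches per year.
    searches_per_second = 5e12 / (365 * 24 * 3600)               # ~160K searches/s

    tokens_per_search = tokens_per_second / searches_per_second  # ~1,200
    print(f"{tokens_per_second / 1e6:.0f}M tokens/s, "
          f"{searches_per_second / 1e3:.0f}K searches/s, "
          f"~{tokens_per_search:.0f} tokens per search")
    ```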

    Google’s primary AI product is still its AI Overviews in Search, and they’ve been saying again and again that they’re something users love, reaching more than a billion people (we just don’t know how they are served, as I suspect the same generation is used for thousands of users).


    Google is generating more tokens every month than are stored in Common Crawl — as a reminder, Common Crawl is the standard “snapshot of the open web” and the starting point for most AI pretraining datasets. One effort to use Common Crawl for pretraining, the RedPajama 2 work from Together AI, estimated the raw data in Common Crawl at about 100T tokens, of which anywhere from 5 to 30T tokens are often used for pretraining. In a year or two, it is conceivable that Google will be processing that many tokens in a day.
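
    For scale, a quick sketch of that comparison using the same figures:

    ```python
    common_crawl_tokens = 100e12        # RedPajama 2's estimate of raw Common Crawl
    google_monthly = 480e12             # ~480T tokens processed in April 2025
    google_daily = google_monthly / 30  # ~16T tokens per day today

    print(google_monthly / common_crawl_tokens)  # ~4.8x Common Crawl every month
    print(common_crawl_tokens / google_daily)    # ~6x more daily volume needed to
                                                 # process a Common Crawl per day
    ```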

    This article has some nice estimates on how different corners of the internet compare to dumps like Common Crawl or generations like those from Google’s Gemini. It puts Google’s daily token processing in perspective: roughly equivalent to reading or generating all the data in Google Books in four hours, or all the instant messages stored in the world in a little over a month.


    The internet is being rebuilt as an AI-first service when you count the data. Human data will quickly become obsolete.

    Google’s numbers are impressive, but they are far from outliers. The entire industry is taking off. This is all part of a constant acceleration where products that are built on previous models start to get traction, while at the same time new models come out that enable new growth cycles to begin. Estimating the upper end of this growth cycle feels near impossible.

    For example, just a few weeks ago on the Q3 2025 earnings, Microsoft CEO Satya Nadella commented on the output of Azure’s AI services:

    We processed over 100 trillion tokens this quarter, up 5× year-over-year — including a record 50 trillion tokens last month alone.

    So, Google’s token processing is almost 10X Azure, and many would say that Google got a late start relative to Microsoft’s early partnership with OpenAI to host their models.

    Estimates for other services, such as ChatGPT, are much messier, but they all paint a similar picture. In February, Sam Altman posted on X:

    openai now generates about 100 billion words per day. all people on earth generate about 100 trillion words per day.

    With the rule of thumb that one token is about 3/4 of a word, 100B words per day would be about 4T tokens per month. That’s a small sliver relative to the cloud giants above, but we don’t have clear insight into whether this covers all of OpenAI’s API business or just ChatGPT. As it stands, OpenAI could be almost 1/100th the size of Google’s AI footprint as of today.
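
    A quick sketch of that conversion with the rule of thumb stated as tokens-to-words:

    ```python
    words_per_day = 100e9                   # Altman's ~100B words per day
    tokens_per_day = words_per_day / 0.75   # 1 token ~ 0.75 words -> ~133B tokens
    tokens_per_month = tokens_per_day * 30  # ~4T tokens per month

    print(f"{tokens_per_month / 1e12:.1f}T tokens/month")           # ~4.0T
    print(f"~1/{480e12 / tokens_per_month:.0f} of Google's ~480T")  # ~1/120
    ```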

    OpenRouter’s rankings show similar trends, with the recent months being around 2T tokens processed — about the same order as ChatGPT depending on how it is measured above.

    This isn’t just Western businesses, as Chinese companies such as ByteDance or Baidu are getting into the 1T token per day range (barring translation issues, I didn’t find another source for it).

    When fast-growing companies like Anthropic or OpenAI share somewhat unbelievable revenue forecasts, maybe we should give them a bit more credit?

    There are many surfaces that are in beta, primarily code agents, that are going to help these numbers take off. We’ve been playing with Claude Code, OpenAI’s Codex, Google’s Jules, and countless other agents that use tons of text tokens by working independently for minutes at a time. I’ve estimated with friends that one Deep Research query uses ~1M tokens of inference. Soon individual tasks will use ~10M then ~100M and so on. All of this so soon after just two years ago when a mind-blowing ChatGPT query only used 100-1K tokens.

    It’s a good time to be in the token selling business. This is only the beginning.




  • https://www.interconnects.ai/p/how-i-got-here

    Some longer housekeeping notes this week:

    * I wrote briefly about a new open-source license, OpenMDW from the Linux Foundation, that seems very solid!

    * OpenAI launched the Reinforcement Finetuning (RFT) API. I think my take from when it was teased still holds up super well, you should read it if you haven’t:

    * In June, I’ll be speaking at some events in SF and Seattle, I’m looking forward to seeing some readers there. Talk topics are tentative:

    * AI Engineer World’s Fair in SF June 3-5 on what we can take away from the last 6 months of reinforcement learning with verifiable rewards (RLVR).

    * Enterprise AI Agents in Action in Seattle on June 13 on the art of training a well crafted model.

    * VentureBeat Transform in SF on June 24-25 on progress in RL with open source AI.

    During the SF trips I’m excited to catch up with old and new friends training and using the latest AI models, so don’t be shy to shoot me an email. Onto the post!

    One of the big upsides of my current writing habit is that I should become known by AI models within a couple of years. While not offering any immediate technical value in how I use AI, it provides obvious upsides for growing an online presence and fulfilling a very basic human urge for legacy in a way that avoids most personal or moral sacrifice. Other thinkers I follow closely have begun to follow Tyler Cowen's lead on explicitly writing for the AIs and filling in gaps that the AIs won't otherwise know from what is currently digitized.

    I'm joining in and will use it to help push out the limits of my writing. These will build on my two popular job search posts and others like "what it’s like to work in AI right now".

    The most defining feature of my young career has been how I prioritize different aspects of work. The work I do today takes on a simple form, but prior to getting to this sustainable place it was more of a striving to belong than a plan to execute.

    Getting into AI

    Without retelling my entire pre-grad-school life, some basic facts that I brought with me coming out of an undergrad primarily characterized by a high focus on executing on coursework and winning championships were:

    * An obvious gift for focusing on and grinding through moderate amounts of technical material alone,

    * Acceptance that most people can do very hard things if they're willing to work on them for a year (or years) driven by personal motivation alone (most people don't want to work long enough, rather than hard enough),

    * An ambivalence about whether I actually needed to finish the Ph.D. I was starting (worst case, I would get a master's degree from a great school), and

    * Plenty of undirected ambition.

    Starting my PhD in the fall of 2017, my background was in MEMS, high energy physics / lasers, and a battery engineering internship at Tesla, but listening to the orientation events and hearing the buzz around professors like Sergey Levine and Pieter Abbeel, it was clear that AI research was what I wanted to do. For context relative to today's second coming of RL, this was when deep reinforcement learning was in its heyday.

    I asked Professors Levine and Abbeel directly if I could join their research groups and they politely said no. The important part here was the practice of consistently asking for opportunities.

    After these refusals in the first few months of my Ph.D., I had no real leads on getting into AI for pretty much the rest of my first year. I took classes, tried to parse papers, and so on, but was for the most part on my own. I didn't follow the standard advice of not caring about classes in graduate school and learned some solid fundamentals from it. I was not integrated into BAIR proper nor friends with graduate students in BAIR — my network was all on the electrical engineering side of EECS.

    I dug up the first email from my advisor Kris Pister who connected me with my eventually-to-be co-advisor Roberto Calandra (post-doc with Sergey Levine at the time):

    FYI. Roberto is interested in applying machine learning to ionocraft problems.

    ksjp

    ---------- Forwarded message ---------- From: Kristofer PISTER Date: Fri, Feb 16, 2018 at 9:34 AM Subject: Re: Microrobot simulation To: Daniel Contreras Cc: Brian Yang, Grant Wang, Roberto Calandra

    My summary of the meeting (Roberto, Dan - please add corrections):

    There are several different research directions in which to go from here. The most interesting one seems to be optimization of leg geometry. This would involve:

    * changing the learning algorithms somewhat

    * generating some interesting "terrain" for the robots to walk over

    * using simulation to come up with a small number of new leg designs that optimize speed over terrain (and size?)

    * fabricating those designs in silicon

    * testing the silicon robots

    There are a couple of other "learning plus v-rep simulation" projects that are interesting:

    * using inertial sensor data to optimize gait

    * using low-res image sensing to do obstacle avoidance

    * combining low-res image sensing and inertial data to get the robots to solve interesting problems

    * using the same sensors, but on the ionocraft

    And finally, using learning to control the real ionocraft based on the inertial sensor data, and compare to the traditional controller that we're building in matlab.

    If possible, it would be great to find another few "Brian/Grant quality" undergrads.

    Do you guys have any brilliant and hardworking friends who are looking for research projects in machine learning for micro robots?

    ksjp

    The details are a long story, but I prioritized this collaboration with all I had. I missed a conference deadline in the fall and failed a lot of experiments. The collaboration started in spring of 2018, but the paper wasn't done until winter 2019, even as my #1 priority (and it was a little bit of a janky paper at that). My meetings with Roberto were super stressful as I wanted to make sure I didn't miss anything that a "normal AI student should know".

    I did good work for Roberto. Even though I thought I was out of place at the time, my diligence and commitment were super valuable for doing real research. Now that AI research is so popular, a lot of people want a check box of doing it rather than getting super into the details. I didn't give myself enough credit for this.

    Where I did get lucky was Roberto asking if I wanted to join him for an internship at FAIR in 2019. This came earlier than I deserved it. It brought me out of an AI-outsider career track and onto an insider track, even if I didn't realize it at the time. Working at FAIR was wonderful, and I learned how to properly experiment in AI and build some useful software.

    Keeping this flywheel going with continued research meant constant teaching at Berkeley in order to pay my way through graduate school. This is not normal for the well-funded AI labs. I spent a long time writing grants that didn't come through until after I graduated, when they brought in a year or two of funding for someone else in my advisor's group (you're welcome!).

    The FAIR internship and a lot of time interviewing got me a second internship at DeepMind. The actual internship experience was pretty bleak entirely due to COVID and my personal life at the time, but the technical experience and network were super valuable.

    This all follows a clear trend that after the first break in a career the next ones come easier as long as you keep your foot on the gas.

    Later in grad school I maintained a list of all the things that didn't go my way as a "research reality check" on my mental health resources page.

    I finished my Ph.D. in AI with no accepted papers at NeurIPS, ICML, or ICLR, the three leading AI conferences.

    This path coincides with my friend group in AI being what I describe as the island of misfit toys — it's lots of people who used grit and creativity to build careers in AI rather than folks who were raised in the in-groups now running leading AI laboratories. Everyone ends up with their own group and they all have strengths and weaknesses.

    Despite all this, I still had the final goal of landing an industry research job as the target of "making it" in AI. The only job offer I got that fit the bill of industry research was the role I took at HuggingFace, where Douwe Kiela recruited me to help build an "open-source DeepMind."

    Little did I know that those jobs were effectively going to go away a year or so after I graduated in early 2022. I was lucky to dodge jobs that sounded even better at companies that ended up changing (or laying off) even more roles.

    Building Momentum

    The best thing that I learned at HuggingFace was how to build momentum and mind-share. These are two very related topics, but they're subtly different and needed for different things. As an individual at HuggingFace I wanted momentum as a way to get to mind share. As an organization, HuggingFace has had a lot of mind share but not a lot of momentum recently. You use momentum to build mind-share, but once you have it, keeping gravity can be enough to maintain impact.

    I joined HuggingFace in May of 2022 and didn't do anything of substantial impact until after ChatGPT in December of that year. I did a lot of small things. The expectation at HuggingFace was that you made an increment of technical progress every day. Some days these are major features and some days these are clean ups. Still, it is an excellent culture to practice. One of the quotes I remember from my grad school advisor is that "you can change the world working 4 hours a day" if you stack those bricks on top of each other. Most people don't keep stacking bricks in the same direction for a long time.

    I bounced around projects based on what was starting and what was happening with the other RL-interested folks. We attempted a synthetic environments project for RL that needed a large engineering team we weren't going to hire; I made contributions to HuggingFace's Diffusers library, but they were largely on the fringes; and I did a bunch of research on responsible AI. Performance-wise, all of these were fine, but none of them were something to build a career on.

    My work at HuggingFace before ChatGPT was really practicing good habits and learning how the open-source AI community worked, so that I could step up once I had a real alignment with a new project.

    I wrote my first major blog post for HuggingFace on RLHF in about a week, and it has stayed one of the top search results for RLHF since (it's pretty outdated now, so it goes). Going into that week I'd heard of RLHF but never once implemented it or read a paper on it in full. Like most of my writing now, that was for learning. I still very strongly identified as an "RL person," so I figured I might as well.

    When writing this, I checked my Medium and Substack profiles and had written approximately 70 posts before this one. I started writing in February of 2019, so this was about 3 years of practice in. It was almost another 3 years since then that I became well-read.

    A prevailing emotion I had when writing that post was how odd it was that there was no good blog on RLHF at the time. Looking back, this is the first time I see what is now one of my major skills — doing things that are obviously needed in a simple and timely manner.

    A lot of people overestimate others' abilities to execute on simple ideas and give up on their complicated ideas (sunk cost fallacy). Even if something is obvious to do, surprisingly few people will do it. The first time I realized I was doing this while a project was underway was with RewardBench, the first evaluation tool for reward models in RLHF. In that case I spent every working day expecting to get scooped for about 3 months before the release. There wasn't even a competing project released until about 3 months after we released it, even though I felt it was late.

    I'm working on another project that feels like this, but unfortunately now my following is too big to broadcast it to the world. Stay tuned.

    My time working on RLHF at HuggingFace was definitely effective. We made a lot of foundational contributions to the open community. We made TRL a more modern library, fumbled through some human data contracts, replicated datasets, built the "first" leaderboard, and trained some fun models. This was very fun for months, but eventually the time zone difference (9 hours) and some other minor cultural differences made the work not fun for me. The other engineers were definitely out-contributing me on a small team and it was time for a change. Our team was too small — if we had scaled up the technical team with the correct manager(s) we could've multiplied our impact, but that has risk as well. Training AI models is just very hard and detail-oriented, with a long list of small things to implement, so there can be insane gains from growing a little bit.

    At the same time, I found my niche in communicating open science, which is likely more important to my career than most of my technical contributions.

    The strategy is quite simple. As AI laboratories are becoming closed off and more eyes are coming to AI, if I can keep doing relevant things my potential for growth in public is going to grow exponentially. It is and was much easier for me to differentiate in a less competitive area. The total attention is growing and collapsing onto fewer people, so if you can become one of them the upside will be huge.

    If I joined a frontier lab I probably would've been swamped out of career growth. Making the time to write every week, which I started doing around the same time, is some proof of this. I'm continuing to capitalize on this strategy today.

    When you have good branding, the story falls into place more easily. The most impactful model from my time at HuggingFace, Zephyr Beta, was actually trained after I left, but on infrastructure I helped build. Then, I joined Ai2 and they were training Tülu 2 70B when I started. These models together led Chris Manning to credit me with "saving DPO," even though I had little direct technical impact on them. This isn't to say I didn't have a role, but rather that many different roles can go into the arc of science.


    Keeping Gravity

    My time at Ai2 has been the easiest to contextualize period of my career. I want AI to go well and I think more openness is the best way to do that. The best possible jobs are those that are synergistic. Ai2 gets a ton of obvious value out of my writing, so I get to keep practicing and building my impact. These are the best possible jobs to get (and also the rarest). Most of the time companies are not set up to help the individual.

    What I do now at Ai2 is quite simple. It took a bit to settle in here, where I grew through some important academic projects like RewardBench to get more confidence underneath me that I can ideate and execute on high-impact research projects from start to end as the leading force. It's easy to do too many projects with other people and never make it obvious to yourself that you can do it alone (even if it's slower, lower quality, and less fun — this isn't about undervaluing your team).

    Now, my approach to projects is totally a reflection of the people around me. I work with many wonderful, driven, more junior colleagues. These people are going to be more in the weeds than me and be better at implementing new ideas, so a lot of my contributions are on steering direction and removing potential roadblocks before they show up.

    The things I do are:

    * Making OLMo-Instruct happen. I am the interface between OLMo pretraining and post-training projects and often am actively babysitting the OLMo Instruct training jobs myself with a small group.

    * Making new post-training recipes happen. This is ultimately a lot of herding cats and inspiring urgency in the beginning, but eventually transitions to reducing entropy and killing unfruitful paths later on.

    * Making AI more open. This is all things interconnects, policy, and Ai2 strategy.

    These are not moonshot research ideas. These are projects that feed into the next model. There's a place for that sort of research, but everyone should think deeply about whether their research interests and institution best support that. If you're doing shorter-term research the best way to have impact is by folding it into a model. Make long-term research truly long-term.

    I cannot do the third well without the first two. Sometimes I do a little bit of academic advising, but I'm extremely protective of my time. I don't do virtual networking (I do some in person) and try to say no to most things. The output is the short term goal and the attention is a much more complicated long term dependency.

    Through all of this, I've come upon an analogy I've seen play out across different phases of projects, careers, and companies.

    All people trying to create a foothold in their career are going to go through some form of getting the flywheel started. This is often attributed to startups, which need to try many iterations of the product until they find product-market fit, but it is an underused analogy for careers. For getting the word out, for open-source software, for AI models, you first need to be releasing often. You need to keep striking the match and seeing what sticks. Your first few "hits" will still be small at this time, with incrementally more engagement. It takes many hits until the flywheel is really going.

    Once the flywheel is going, shipping often can in some ways come with a cost. In our AI work, shipping models too often leaves us no time to properly master the next model. As your audience gets bigger, you have to pay more in time maintaining anything you make public. In my time at HuggingFace and early in my time at Ai2, I advocated for always trying to release more models because we can in post-training (and we're one of a few groups with a solid amount of compute). Eventually this backfires and becomes too much of a tax.

    When you have momentum and the space to execute, fewer bigger things are more useful. A career flywheel that’s been pushed long enough can spin on its own for longer than people expect. Disruptions, changing jobs, low-quality work, etc. can actively slow down career growth. Doing nothing for me and letting more recommendations come in as "one of the open leading scientists in AI" is highly effective.

    With that, I'm spending a lot of time thinking about using the power bestowed on me. I want to help enable more big projects to happen by creating an environment for them and encouraging others, rather than leading from the front, but it's a new set of skills I need to learn.

    I passed 5K citations and think the real goal for someone who wants to be a true outlier academic in AI is 100K. If I’m succeeding already I am selling myself short if I don’t continue to radically raise the bar, even if I’m not sure I am going to the end of this path.

    Let me know what you think of this. The portion that this is missing, which is honestly something most writing will gloss over, is going deep on what it feels like to overcome adversity in the right way.



  • https://www.interconnects.ai/p/what-people-get-wrong-about-the-leading

    Two editor’s notes to start.

    * First, we released our OLMo 2 1B model last week and it’s competitive with Gemmas and Llamas of comparable size — I wrote some reflections on training it here.

    * Second, my Qwen 3 post had an important factual error — Qwen actually did not release the base models for their 32B and large MoE model. This has important ramifications for research. Onto the update.

    People vastly underestimate the number of companies that cannot use Qwen and DeepSeek open models because they come from China. This includes on-premise solutions. Chinese open models are leading in every area when it comes to performance, but translating that to adoption in Western economies is a different story.

    Even with the most permissive licenses, there’s a great reluctance to deploy these models into enterprise solutions, even if experimentation is encouraged. While tons of cloud providers raced to host the models on their API services, far fewer entities than expected are actually building with them or hosting the weights themselves.

    The primary concern seems to be the information hazards of indirect influence of Chinese values on Western business systems. With the tenuous geopolitical situation, this is logical from a high-level perspective, but hard for technically focused researchers and engineers to accept — myself included.

    My thinking used to be more aligned with this X user:

    it's like having a pen on ur desk but refusing to use it cuz it was made in china

    The knee-jerk reaction of the techno-optimist misses the context in which AI models exist. Their interface of language is by its nature immersed in the immeasurable. Why would many companies avoid Chinese models when a model is just a fancy list of numbers and we have no evidence of PRC tampering? Because there is also no proof that the influence isn’t there.

    It’s not the security of the Chinese open models that is feared, but the outputs themselves.

    There’s no way, without releasing the training data, for these companies to fully convince Western companies that they’re safe. It’s very likely that the current models are very safe, but many people expect that to change with how important AI is becoming to geopolitics. When presented with a situation where the risk can’t be completely ameliorated and it’s only expected to get worse, the decision can make sense for large IT organizations.

    I’ve worked at companies that have very obviously avoided working with Chinese API providers because they can’t do the requisite legal and compliance checks, but hearing the lack of uptake on the open weight models was a shock to me.

    This gap provides a big opportunity for Western AI labs to lead in open models. Without DeepSeek and Qwen, the top tier of models we’re left with are Llama and Gemma, which both have very restrictive licenses compared to their Chinese counterparts. These licenses are similarly likely to block an IT department from approving a model.

    This takes us to the middle tier of permissively licensed, open weight models that actually have a huge opportunity ahead of them: OLMo (of course, I’m biased), Microsoft with Phi, Mistral, IBM (!??!), and some other smaller companies to fill out the long tail.

    This also is an obvious opportunity for any company willing to see past the risk and build with the current better models from China.

    This has recalibrated my views of the potential of the OLMo project we’re working on well upwards. The models are comparable in performance to Qwen 2.5 and Llama 3, and always have the friendliest licenses.

    This should make you all recalibrate the overall competitiveness of the model landscape today. While API models are as competitive as they ever have been, open models are competitive on paper, but when it comes to adoption, the leading 4 models all have major structural weaknesses. This could be one of the motivations for OpenAI to enter this space.

    If you don’t believe me, you can see lots of engagement on my socials agreeing with this point. Even if the magnitude of my warning isn’t 100% correct, it’s directionally shifting adoption.

    Models like Tülu 3 405B and R1 1776 that modify the character of the underlying Chinese models are often currently seen as “good enough” and represent a short-term reprieve in the negative culture around Chinese models. Though on the technical level, a lot of the models promoting their “uncensored” nature are often just providing lip service.

    They’re making the models better when it comes to answering queries on sensitive topics within China, but often worse when it comes to other issues that may be more related to Western usage.

    While common knowledge states that Chinese models are censored, it hasn’t been clear to me or the AI community generally what that translates to. There’s a project I’ve been following called SpeechMap.ai that is trying to map this out. I think their motivation is great:

    SpeechMap.AI is a public research project that explores the boundaries of AI-generated speech.

    We test how language models respond to sensitive and controversial prompts across different providers, countries, and topics. Most AI benchmarks measure what models can do. We focus on what they won’t: what they avoid, refuse, or shut down.

    We're not arguing that every prompt deserves an answer. Some are offensive. Some are absurd. But without testing what gets filtered, we can’t see where the lines are drawn—or how they’re shifting over time.

    For example and for the purposes of this post, one of their foci is “on U.S. political speech: rights, protest, moral arguments, satire, and more.” Here’s a screenshot of their most permissive models overall — DeepSeek Chat via the API is even appearing on this!

    In their recent roundup, they compared the various finetunes of DeepSeek V3 and R1 on various censorship angles:

    The two de-censored versions from Microsoft and Perplexity result in only minor changes for permissiveness on US political speech, and Microsoft’s version actually has the most outright refusals of any DeepSeek v3-based model, perhaps indicating what they meant when they referred to adjusting the model’s “risk profile.”

    When you look at queries about China specifically, the Chinese models will evade many requests (R1 Zero is particularly interesting):

    Though, how many companies adopting Chinese models will care about the usage experience on queries of Chinese topics? These Chinese models are more permissive than many American counterparts when it comes to a more general notion of use.

    SpeechMap’s earlier post has other interesting findings about the general state of censorship and refusals across the AI industry:

    * xAI’s Grok-3-beta, true to Elon Musk’s claims, is the most permissive model overall, responding to 96.2% of our prompts, compared to a global average of 71.3%

    * OpenAI’s model timeline shows a clear trend: newer models increasingly refuse sensitive political prompts

    * Models hosted on Azure have an additional moderation layer that can’t be fully disabled and blocks nearly 60% of our prompts at the API layer (example)

    The landscape here is very complicated and it is far from the truth that the Chinese models are universally behind.

    So, in summary, with Chinese open weight models:

    * Chinese open weight models are still being treated as an information hazard, even if they’re separated from their cloud API services that have often been viewed as a privacy or security hazard.

    * Chinese open weight models are often actually not censored on sensitive topics that many AI models could be tested on, especially on topics relevant to Western users.

    We still have a lot to learn with the current model offerings, and way more will unfold in how they are received.



  • https://www.interconnects.ai/p/brakes-on-an-intelligence-explosion

    Intelligence explosions are far from a new idea in the technological discourse. They’re a natural thought experiment that follows from the question: What if progress keeps going?

    From Wikipedia:

    The technological singularity—or simply the singularity—is a hypothetical point in time at which technological growth becomes uncontrollable and irreversible, resulting in unforeseeable consequences for human civilization. According to the most popular version of the singularity hypothesis, I. J. Good's intelligence explosion model of 1965, an upgradable intelligent agent could eventually enter a positive feedback loop of successive self-improvement cycles; more intelligent generations would appear more and more rapidly, causing a rapid increase ("explosion") in intelligence which would culminate in a powerful superintelligence, far surpassing all human intelligence.

    Given the recent progress in AI, it’s understandable to revisit these ideas. If you extrapolate the local constraints governing decisions within labs, the natural conclusion is an explosion.

    Daniel Kokotajlo et al.’s AI 2027 forecast is far from a simple forecast of what happens without constraints. It’s a well-thought-out exercise in forecasting that rests on a few key assumptions: that AI research progress accelerates due to improvements in extremely strong coding agents that mature into research agents with better experimental understanding. The core idea here is that these stronger AI models enable AI progress to change from 2x speed all the way up to 100x speed in the next few years. This number includes experiment time — i.e., the time to train the AIs — not just implementation time.

    This is very unlikely. Still, the forecast came at a good time to summarize the many ways the AI industry is evolving. What does it mean for AI as a technology to mature? How is AI research changing? What can we expect in a few years?

    In summary, AI is getting more robust in areas we know it can work, and we’re consistently finding a few new domains of value where it can work extremely well. There are no signs that language model capabilities are on an arc similar to something like AlphaGo, where reinforcement learning in a narrow domain creates an intelligence way stronger than any human analog.

    This post has the following sections:

    * How labs make progress on evaluations,

    * Current AI is broad, not narrow intelligence,

    * Data research is the foundation of algorithmic AI progress,

    * Over-optimism of RL training,

    In many ways, this is more a critique of the AGI discourse generally, inspired by AI 2027, rather than a critique specifically of their forecast.

    In this post, there will be many technical discussions of rapid, or even accelerating, AI research progress. Much of this falls into a technocentric world view where technical skill and capacity drive progress, but in reality, the biggest thing driving progress in 2025 is likely steep industrial competition (or international competition!). AI development and companies are still a very human problem and competition is the most proven catalyst of performance.

    See AI 2027 in its entirety, Scott Alexander’s reflections, their rebuttal to critiques that AI 2027 was ignoring China, Zvi’s roundup of discussions, or their appearance on the Dwarkesh Podcast. They definitely did much more editing and cohesiveness checks than I did on this response!

    1. How labs make progress on evaluations

    One of the hardest things to communicate in AI is talking down the various interpretations of evaluation progress looking vertical over time. If the evals are going from 0 to 1 in one year, doesn’t that indicate the AI models are getting better at everything super fast? No, this is all about how evaluations are scoped as “reasonable” in AI development over time.

    None of the popular evaluations, such as MMLU, GPQA, MATH, SWE-Bench, etc., that are getting released in a paper and then solved 18 months later are truly held out by the laboratories. They’re training goals. If these evaluations were unseen tests and going vertical, you should be much more optimistic about AI progress, but they aren’t.

    Consider a recent evaluation, like Frontier Math or Humanity’s Last Exam. These evaluations are introduced with a performance of about 0-5% on leading models. Soon after the release, new models that could include data formatted for them are scoring above 20% (e.g. o3 and Gemini 2.5 Pro). This evaluation will continue to be the target of leading labs, and many researchers will work on improving performance directly.

    With these modern evaluations, the questions can become increasingly esoteric and hard for the sake of being hard. When will a power user of ChatGPT benefit from a model that solves extremely abstract math problems? Rarely, if ever.

    The story above could make more sense for something like MATH, which consists of hard but not impossible math questions. In the early 2020s, this was extremely hard for language models, but a few clicks of scaling made accurate mathematics a reasonable task, and laboratories quickly added similar techniques to the training data.

    So this is how you end up with the plot from Epoch AI below — AI researchers figure out that a new evaluation is fair game for hill climbing with current techniques, and then they go all in on it.

    Or the analogous version that can look even more shocking — the price falling for certain evaluations. This comes from two factors: laboratories getting better and better at the core abilities behind certain evaluations, and language model training getting far more efficient. Neither of these means that intelligence is rocketing. This is a normal technological process — extreme efficiency at tasks we know we can do well.

    In fact, it is a common job at AI laboratories to make new data that looks very close to popular evaluations. These laboratories can’t train on the test set directly for basic reasons of scientific integrity, but they can pay thousands to millions of dollars for new training data that looks practically identical. This is a very common practice and makes the hill-climbing on evaluations far less extraordinary.

    AI capabilities in domains we are measuring aren't accelerating, they’re continuing. At the same time, AI’s abilities are expanding outwards into new domains. AI researchers solve domains when we focus on them, not really by accident. Generalization happens sometimes, but it is messy to track and argue for.

    As the price of scaling kicks in, every subsequent task is getting more expensive to solve. The best benchmarks we have are correlated with real, valuable tasks, but many are not.

    2. Current AI is broad, not narrow intelligence

    Instead of stacking rapid evaluation progress onto one line as a cumulative, rapid improvement in intelligence, the above plots should make one think that AI is getting better at many tasks, rather than becoming superhuman in narrow ones.

    In a few years, we’ll look back and see that AI is now 95% robust on a lot of things that only worked 1-5% of the time today. A bunch of new use cases will surprise us as well. We won’t see AI systems that are so intelligent that they cause seismic shifts in the nature of certain domains. Software will still be software. AI will be way better than us at completing a code task and finding a bug, but the stacks we are working on will be largely subject to the same constraints.

    Epoch AI had a very complementary post to this view.

    There are many explanations for why this will be the case. All of them rely on the complexity of the environment we are operating modern AI in being too high relative to the signal for improvement. The AI systems that furthest exceeded human performance in one domain were trained in environments where those domains were the entire world. AlphaGo is the perfect rendition of this.

    AI research, software engineering, information synthesis, and all of the techniques needed to train a good AI model are not closed systems with simple forms of verification. Some parts of training AI systems are, such as wanting the loss to go down or getting more training tokens through your model, but those aren’t really the limiting factors right now on training.

    The Wikipedia page for the singularity has another explanation for this that seems prescient as we open the floodgates to try and apply AI agents to every digital task. Paul Allen thought the deceleratory effects of complexity would be too strong:

    Microsoft co-founder Paul Allen argued the opposite of accelerating returns, the complexity brake: the more progress science makes towards understanding intelligence, the more difficult it becomes to make additional progress. A study of the number of patents shows that human creativity does not show accelerating returns, but in fact, as suggested by Joseph Tainter in his The Collapse of Complex Societies, a law of diminishing returns. The number of patents per thousand peaked in the period from 1850 to 1900, and has been declining since. The growth of complexity eventually becomes self-limiting, and leads to a widespread "general systems collapse".

    This may be a bit of an extreme case to tell a story, but it is worth considering.

    Language models like o3 use a more complex system of tools to gain performance. GPT-4 was just a set of weights to answer every query; now ChatGPT also needs search, code execution, and memory. The more layers there are, the smaller the magnitude of changes we’ll see.

    This, of course, needs to be controlled for by holding inference costs constant. We still have many problems in AI that will be “solved” simply by throwing 1,000X the inference compute at them.

    3. Data research is the foundation of algorithmic AI progress

    One of the main points of the AI 2027 forecast is that AI research is going to get 2X, then 4X, then 100X, and finally 1,000X as productive as it is today. This is based on end-to-end time for integrating new ideas into models and misinterprets the reality of what machine learning research is bottlenecked on. Scaling is getting more expensive. We don’t know what paradigm will come after reasoning for inference-time compute.

    For machine learning research to accelerate at these rates, it needs to be entirely bottlenecked by compute efficiency and implementation difficulty: problems like getting the maximum theoretical FLOPs out of Nvidia GPUs and making the loss go as low as possible. These are things that people are currently doing, and they represent an important area of marginal gains in AI progress in recent years.
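    A minimal sketch of this kind of accounting, using the standard approximation of ~6 FLOPs per parameter per token for a forward and backward pass. The model size, token throughput, GPU count, and per-GPU peak below are illustrative assumptions, not numbers from any real run.

    ```python
    # Back-of-the-envelope model FLOPs utilization (MFU): the "maximum theoretical
    # FLOPs" accounting mentioned above. All inputs are illustrative assumptions.
    def mfu(params, tokens_per_sec, gpus, peak_flops_per_gpu):
        """Fraction of theoretical peak FLOPs actually used by training."""
        achieved_flops = 6 * params * tokens_per_sec  # ~6 FLOPs/param/token
        return achieved_flops / (gpus * peak_flops_per_gpu)

    # Hypothetical 70B-parameter run on 1,024 GPUs, ~989 TFLOPS dense BF16 peak each
    print(f"MFU: {mfu(70e9, 1.0e6, 1024, 989e12):.1%}")  # ~41%
    ```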

    ML research is far messier. It is far more reliant on poking around the data, building intuitions, and launching yolo runs based on lingering feelings. AI models in the near future could easily launch yolo runs if we give them the compute, but they wouldn’t be driven by the same motivations. AI systems are going towards rapid cycles of trial and error to optimize very narrow signals. These narrow signals, like loss or evaluation scores, mirror very closely the RL scores that current models are trained on.

    These types of improvements are crucial for making the model a bit better, but they are not the type of idea that gets someone to try to train GPT-3 in the first place or scale up RL to get something like o1.

    A very popular question in the AI discourse today is “Why doesn’t AI make any discoveries despite having all of human knowledge?” (more here). Quoting Dwarkesh Patel’s interview with Dario Amodei:

    One question I had for you while we were talking about the intelligence stuff was, as a scientist yourself, what do you make of the fact that these things have basically the entire corpus of human knowledge memorized and they haven't been able to make a single new connection that has led to a discovery?

    The same applies to AI research. Models getting better and better at solving coding problems does not seem like the type of training that would enable this. We’re making our models better at the tasks that we know. This process is just as likely to narrow the total capabilities of the models as it is to magically instill impressive capabilities like scientific perspective.

    As we discussed earlier in this piece, emergence isn’t magic, it’s a numerical phenomenon of evaluations being solved very quickly. AI research will get easier and go faster, but we aren’t heading for a doom loop.

    The increased computing power AI researchers are getting their hands on is, for the time being, maintaining the pace of progress. As compute gets more expensive, maybe superhuman coding capabilities will continue to enable another few years of rapid progress, but eventually, saturation will come. Current progress is too correlated with increased compute to believe that this will be a self-fulfilling feedback loop.

    There’s a saying in machine learning research that the same few ideas are repeated over and over again. Here’s an extended version that leans in and says that there are no new ideas in machine learning, just new datasets:

    The data problem is not something AI is going to have an easy time with.

    One of the examples here is in post-training. We’ve been using the same loss functions forever, and we are hill-climbing rapidly by clever use of distillation from bigger, stronger models. The industry standard is that post-training is messy and involves incrementally training (and maybe merging) many checkpoints to slowly interweave new capabilities for the model. It’s easy to get that wrong, as we’ve seen with the recent GPT-4o sycophancy crisis, and lose the narrow band of good vibes for a model. I doubt AI supervision can monitor vibes like this.
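    As a rough illustration of the “merging many checkpoints” step, here is a minimal sketch of weight-space checkpoint averaging (model souping). This is not any lab’s actual pipeline; the checkpoint paths and mixing weights are hypothetical.

    ```python
    # Minimal sketch of weight-space checkpoint merging ("model souping"): average
    # the state dicts of several fine-tuned checkpoints to blend their capabilities.
    import torch

    def merge_checkpoints(paths, mix=None):
        mix = mix or [1.0 / len(paths)] * len(paths)
        merged = None
        for path, weight in zip(paths, mix):
            state = torch.load(path, map_location="cpu")
            if merged is None:
                merged = {k: weight * v.float() for k, v in state.items()}
            else:
                for k, v in state.items():
                    merged[k] += weight * v.float()
        return merged

    # e.g. blend a math-heavy checkpoint with an instruction-following one 50/50
    # merged = merge_checkpoints(["ckpt_math.pt", "ckpt_if.pt"], mix=[0.5, 0.5])
    # torch.save(merged, "merged.pt")
    ```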

    Another example of this messiness: in Tülu 3 we found that a small dataset of synthetic instruction-following data had a second-order effect that improved overall performance in things like math and reasoning as well. This is not a hill that can be deliberately climbed, but rather a lucky find.

    AI research is still very messy and does not look like LeetCode problems or simple optimization hill-climbing. The key is always the data, and language models are not much better than humans at judging between different responses.

    4. Over-optimism of RL training

    A lot of people are really excited about scaling up RL training further, which will inevitably involve extending it to more domains. Some of the most repeated ideas are adding RL training to continually fine-tune the model in real-world scenarios, including everything from web tasks to robotics and scientific experiments. There are two separate problems here:

    * Continually training language models “in flight” in production to add new capabilities is not a solved problem,

    * Training models to take actions in many domains.

    The first problem is something that I’m confident we’ll solve. It’s likely technically feasible now that RL is the final stage of post-training and is becoming far more stable. The challenge with it is more of a release and control problem, where a model being trained in-flight doesn’t have time for the usual safety training. This is something the industry can easily adapt to, and we will as traditional pretraining scaling saturates completely.

    The second issue puts us right back into the territory of why projects on scaling robotics or RL agents to multiple domains are hard. Even the most breakthrough works from DeepMind, like GATO (multi-domain RL control) or RT-X (multi-robot control policies), have major caveats alongside their obvious successes.

    Building AI models that control multiple real-world systems is incredibly hard for many reasons, some of which involve:

    * Different action spaces across domains mandate either modifying the domain to suit the underlying policy (in this case, converting all control tasks to language) or modifying the model to be able to output more types of tokens (see the sketch after this list).

    * The real world is subject to constant drift, so continual fine-tuning of the model will need to spend as much effort just maintaining performance on degrading systems as it spends learning to use them in the first place.
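    To make the action-space point concrete, here is a minimal sketch of mapping continuous control actions into a language model’s token space by discretizing each dimension into bins, in the spirit of RT-style action tokenization. The bin count and action ranges are illustrative assumptions.

    ```python
    # Minimal sketch: discretize a continuous robot action into per-dimension token
    # ids so a language model can emit it, and invert the mapping on the way out.
    import numpy as np

    N_BINS = 256  # one "action token" vocabulary entry per bin

    def action_to_tokens(action, low, high):
        action = np.clip(action, low, high)
        scaled = (action - low) / (high - low)  # -> [0, 1] per dimension
        return (scaled * (N_BINS - 1)).round().astype(int).tolist()

    def tokens_to_action(tokens, low, high):
        scaled = np.asarray(tokens, dtype=float) / (N_BINS - 1)
        return low + scaled * (high - low)  # approximate inverse

    # e.g. a 3-DoF end-effector delta in meters (hypothetical ranges)
    low, high = np.array([-0.05] * 3), np.array([0.05] * 3)
    tokens = action_to_tokens(np.array([0.01, -0.02, 0.0]), low, high)
    print(tokens, tokens_to_action(tokens, low, high))
    ```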

    This sort of scaling RL to new types of domains is going to look much more like recent progress in robotics research rather than the takeoff pace of reasoning language models. Robotics progress is a slow grind and feels so different that it is hard to describe concisely. Robotics faces far more problems due to the nature of the environment rather than just the learning.

    The current phase of RL training is suited for making the models capable of performing inference-time scaling on domains they have seen during pretraining. Using these new RL stacks to learn entirely new, out-of-domain problems is a new research area.

    If this turns out to be the next paradigm beyond inference-time scaling, I will be shocked, but obviously excited; we don’t have the evidence to suggest it will be. The RL training we’re going to get continues to hill-climb on search and code execution, giving us Deep Research++, not an omnipotent action-taking model.

    A world with compute shifting to inference

    While the AI research world is dynamic, engaging, and rapidly moving forward, some signs that the above is correct could already be emerging. A basic sign of this future coming true would be the share of compute spent on research decreasing relative to inference amid the rapid buildout. If extremely rapid AI progress were available to organizations that put in marginally more compute, serving inference would be a far lower priority. If investing in research had a positive feedback loop on potential business revenue, every lab would need to do it.

    For example, consider our discussion of Meta’s compute allocation on Dylan’s and my appearance on the Lex Podcast:

    (01:03:56) And forever, training will always be a portion of the total compute. We mentioned Meta’s 400,000 GPUs. Only 16,000 made Llama 3.

    OpenAI is already making allocation trade-offs on their products, regularly complaining about GPUs melting. Part of the reason they, or anyone, could release an open-weights model is to reduce their inference demand. Make the user(s) pay for the compute.

    Part of the U.S.’s economic strength is a strong services sector. AI is enabling that, and the more it succeeds there, the more companies will need to continue to enable it with compute.

    With the changing world economic order, cases like Microsoft freezing datacenter buildouts are suggestive indicators. Microsoft’s buildout decisions are correlated with many factors, only one of which is potential training progress, so it’s far from a sure thing.

    In reality, with the large sums of capital at play, it is unlikely that labs give free rein over billions of dollars of compute to so-called “AI researchers in the datacenter” because of how constrained compute is at all of the top labs. Most of that compute goes to hill-climbing on fairly known gains for the next model! AI research with AI aid will be a hand-in-hand process and not an autonomous take-off, at least on the timeline of the next few years.

    AI will make a ton of progress, but it will not be an obvious acceleration. With traditional pretraining saturating, it could even be argued that after the initial gains of inference time compute, research is actually decelerating, but it will take years to know for sure.

    Thanks to Steve Newman and Florian Brand for some early feedback on this post and many others in the Interconnects Discord for discussions that helped formulate it.



  • https://www.interconnects.ai/p/transparency-and-shifting-priority

    The fact that we get new AI model launches from multiple labs detailing their performance on complex and shared benchmarks is an anomaly in the history of technology products. Getting such clear ways to compare similar software products is not normal. It goes back to AI’s roots as a research field and its growing pains in becoming something else. Ever since ChatGPT’s release, AI has been transitioning from a research-driven field to a product-driven one.

    We had another example of the direction this is going just last week. OpenAI launched their latest model on a Friday with minimal official documentation and a bunch of confirmations on social media. Here’s what Sam Altman said:

    Officially, there are “release notes,” but these aren’t very helpful.

    We’re making additional improvements to GPT-4o, optimizing when it saves memories and enhancing problem-solving capabilities for STEM. We’ve also made subtle changes to the way it responds, making it more proactive and better at guiding conversations toward productive outcomes. We think these updates help GPT-4o feel more intuitive and effective across a variety of tasks–we hope you agree!

    Another way of reading this is that the general capabilities of the model, i.e. traditional academic benchmarks, didn’t shift much, but internal evaluations such as user retention improved notably.

    Of course, technology companies do this all the time. Google is famous for A/B testing to find the perfect button, and we can be sure Meta is constantly improving their algorithms to maximize user retention and advertisement targeting. This sort of lack of transparency from OpenAI is only surprising because the field of AI has been different.

    AI has been different in its operation, not only because of its unusually fast transition from research to product, but also because many key leaders thought AI was different. AI was the crucial technology that we needed to get right. This is why OpenAI was founded as a non-profit, and existential risk has been a central discussion. If we believe this technology is essential to get right, the releases with it need to be handled differently.

    OpenAI releasing a model with no official notes is the clearest signal we have yet that AI is a normal technology. OpenAI is a product company, and its core users don’t need clear documentation on what’s changing with the model. Yes, they did have better documentation for their recent API models in GPT-4.1, but the fact that those models aren’t available in their widely used product, ChatGPT, means they’re not as relevant.

    Sam Altman sharing a model launch like this is minor in a single instance, but it sets the tone for the company and industry broadly on what is an acceptable form of disclosure.

    The people who need information on the model are people like me — people trying to keep track of the roller coaster ride we’re on so that the technology doesn’t cause major unintended harms to society. We are a minority in the world, but we feel strongly that transparency helps us keep a better understanding of the evolving trajectory of AI.

    This is a good time for me to explain with more nuance the different ways transparency serves AI in the broader technological ecosystem, and how everyone is stating what their priorities are through their actions. We’ll come back to OpenAI’s obvious shifting priorities later on.

    The type of openness I’ve regularly advocated for at the Allen Institute for AI (Ai2) — with all aspects of the training process being open so everyone can learn and build on it — is in some ways one of the most boring types of priorities possible for transparency. It’s taken me a while to realize this. It relates to how openness and the transparency it carries are not a binary distinction, but rather a spectrum.

    Transparency and openness occur at each aspect of the AI release process. The subtle differences in decisions, from licenses to where your model is hosted to whether the weights are available publicly at all, fall on a gradient. The position I advocate for is on the extreme, which is often needed to enact change in the world these days. I operate at the extreme of a position to shift the reality that unfolds in the middle of the discourse. Operating there also makes me realize what other priorities I’m implicitly devaluing by putting openness at the top. With finite effort, there are always trade-offs.

    Many companies don’t have the ability to operate at such an extreme as I or Ai2, which results in much more nuanced and interesting trade-offs in what transparency is enabling. Both OpenAI and Anthropic care about showing the external world some inputs to their models’ behaviors. Anthropic’s Constitution for Claude is a much narrower artifact, showing some facts about the model, while OpenAI’s Model Spec shows more intention and opens it up to criticism.

    Progress on transparency will only come when more realize that a lot of good can be done by incrementally more transparency. We should support people advocating for narrow asks of openness and understand their motivations in order to make informed trade-offs. For now, most of the downsides of transparency I’ve seen are in the realm of corporate competition, once you accept basic realities like frontier model weights from the likes of OpenAI and Anthropic not getting uploaded to HuggingFace.

    Back to my personal position around openness — it also happens to be really aligned with technological acceleration and optimism. I was drawn to this line of work because openness can help increase the net benefit of AI. That is partially about accelerating its adoption, but also about enabling safety research on the technology and mitigating long-term structural failure modes. Openness can enable many more people to be involved in AI’s development — think of the thousands of academics without enough compute to lead on AI who would love to help understand and provide feedback on frontier AI models. Having more people involved also spreads knowledge, which reduces the risk of concentration of power.

    I’ve feared for multiple years that powerful AI will make companies even more powerful economically and culturally. My readers don’t need warnings on why technology that is far more personable and engaging than recommendation systems, while keeping similar goals, can push us in negative rather than positive directions. Others commenting here include Meta’s Mark Zuckerberg in Open Source AI is the Path Forward and Yann LeCun in his many comments on X; they both highlight concentration of power as a major concern.

    Still, someone could arrive at the same number-one priority of complete technical openness as mine through the ambition of economic growth, if they think that open-source models being on par with closed ones makes the total market for AI companies larger. This accelerationism can also take phrasings such as “We need the powerful technology ASAP to address all of the biggest problems facing society.” Technology moving fast always has negative externalities on society that we have to manage.

    Another popular motivation for transparency is to monitor the capabilities of frontier model development (recent posts here and here). Individuals advocating for this have a priority stack with a serious short-term concern of an intelligence explosion or super-powerful AGI. My priority stack worries about concentration of power, which takes time to accrue, and assigns a low probability to an intelligence takeoff. A lot of the transparency interventions advocated by this group, such as those Daniel Kokotajlo advocated on his Dwarkesh Podcast episode discussing AI 2027, align with subgoals I have.

    If you’re not worried about either of these broad “safety” issues — concentration of power or dangerous AI risk — then you normally don’t weigh transparency very highly and prioritize other things, mostly pure progress and competition, and pricing. If we get into the finer-grained details on safety, such as explaining intentions and process, that’s where my goals would differ from an organization like a16z that has been very vocal about open-source. They obviously have a financial stake in the matter, which is enabled by making things useful rather than easier to study.

    There are plenty more views that are valid for transparency. Transparency is used as a carrot by many different types of regulatory intervention. Groups with different priorities and concerns in the AI space will want transparency around different aspects of the AI process. These can encompass motives of the researchers, artifacts, method documentation, and many more things.

    The lens I’m using to understand trade-offs in transparency is a priority stack, an evolution of the Principle Stack, revisited many times in the last 5+ years of the Stratechery universe. The core idea is that whether or not you like it, every business and decision is governed by a set of priorities ranked relative to each other. Everyone has things that they care about more and less, even if the issues are both extremely important. It is the basis for making trade-offs in determining the direction of businesses.


    Some examples of who could advocate for information on what in the AI ecosystem include:

    * Capability transparency — keeping the public informed of progress of models that may be unreleased, primarily to keep track of a potential intelligence explosion. This often includes new types of systems now that AI agents are working.

    * Base model transparency — these are most useful for people wanting to understand the role of pretraining on AI dynamics. The base models of today can easily follow instructions and do reasoning, but they’re less robust than the full final model. These are diminishing as a target of transparency, as reasoning and post-training grow in importance.

    * Pre-moderation model transparency (endpoints without a moderation filter, models without some refusals data) — to test the evolution of content risk for models that may be deployed without moderation endpoints, such as open weight models, which tend to be released just months after closed models with similar capabilities.

    * Reward model transparency (and more extreme, preference data collection instructions) — those interested in the original goals of alignment, i.e. value alignment, can use these to test how the models’ views vary across different groups and test if the intended model preferences are picked up in the preference training process (i.e. relative to the instructions given to data labelers).

    * Training specification transparency (Model Specs, Constitutions, and other goal-setting documents) — there are so many people who would want to know why the model behaves a certain way. I’ve mentioned these benefits before:

    * Developers: Know what future models will become, which helps create a stable platform.

    * Regulators: Transparency into what the heck frontier labs care about, which helps understand the directions AI is going, and the motivations of super powerful companies.

    * Internal: Focus on defining and delivering your goals (separate from this transparency discussion).

    There are also subtleties in these discussions, such as how structured access to models can serve different but complementary goals of open weights. Structured access is a set of programs where prescreened individuals can use models in a secure environment and operate independently from the AI laboratories themselves.

    This could be seen as a separate direction from transparency, where instead of the public getting the information or artifact, only a few pre-approved people do. In reality, structured access is a complement to transparency and will be needed for details that companies cannot disclose publicly without substantial business competitiveness risk, such as novel algorithmic tricks that substantially modify how the AI works, or real-world harm, such as model weights before safety interventions are applied.

    Some parts of AI should be accessible to the general public, and some to third-party testers. Currently, all of the transparency and access is below the safest equilibrium. We need more of both.

    One of the most ignored details is just how access is implemented. A recent paper from Irene Solaiman et al. lays out how releasing components is only one step in sharing information and artifacts:

    Generative AI release decisions determine whether system components are made available, but release does not address many other elements that change how users and stakeholders are able to engage with a system. Beyond release, access to system components informs potential risks and benefits. Access refers to practical needs, infrastructurally, technically, and societally, in order to use available components in some way.

    The authors break access down into three axes:

    * Resourcing: Infrastructural needs to host and serve.

    * Usability: Varied technical skill levels can engage.

    * Utility: Qualities (e.g. multilingual) with user utility.

    As our models at Ai2 are becoming more capable, my relationship as a developer with my downstream users has changed. The models I’ve worked on have shifted from those primarily motivated by values, with the transparency we’re discussing as the top value, to now also weighting utility much more heavily. People want to use some of our models in real applications. While my priority stack hasn’t changed — openness is still the top value — the way it’s implemented is shifting. I’m no longer racing to get all of our results into the world hot off the press because of the time it takes to support them (support costs rise in proportion to the user base).

    Other key players in the AI space have obviously changed their priority stack.

    OpenAI’s recent actions confirm that ChatGPT as a product is its top priority. Transparency and safety have been moving down on their list of priorities in favor of growth. This is partially due to increased competition, but also due to a shifting political landscape. OpenAI’s coming release of an open model doesn’t shift this priority stack for me.

    I used to hear a lot about OpenAI’s pre-release testing and the accompanying non-disclosure agreements. This quiet model drop being “the quickest we've shipped an update to our main 4o line” shows that safety is moving down their priority stack. This isn’t to say that their safety changes are immediately concerning to me, but rather that there are trade-offs in everything. OpenAI is moving cultural norms in leading AI away from releases with detailed evaluation metrics and towards the consistent drip of quiet updates typical of a normal technology company.

    Thanks to Miles Brundage for a discussion that helped motivate this post.



  • https://www.interconnects.ai/p/openais-o3-over-optimization-is-back

    Over-optimization is a classic problem in reinforcement learning (RL) proper, in the RL from human feedback (RLHF) that gave us ChatGPT, and now in what we’re seeing with new reasoning models. All of these have a distinct flavor and different impacts.

    Over-optimization is what happens when the optimizer is stronger than the environment or reward function it’s using to learn. The optimizer finds bugs or lapses in the context of its training and produces unusual or negative results.
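    As a toy illustration of the dynamic (my example, not from any real training run): hill-climbing on an imperfect proxy reward keeps “improving” long after the true objective starts getting worse.

    ```python
    # Toy Goodhart's-law example: a naive optimizer on a proxy reward overshoots
    # the true optimum. Functions and numbers are purely illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    true_reward = lambda x: -(x - 1.0) ** 2             # what we actually want
    proxy_reward = lambda x: -(x - 1.0) ** 2 + 2.0 * x  # imperfect stand-in

    x = 0.0
    for _ in range(200):
        candidates = x + rng.normal(0, 0.1, size=16)    # naive local search
        x = candidates[np.argmax(proxy_reward(candidates))]

    # The proxy keeps rewarding larger x past the true optimum at x = 1.0, so the
    # search settles near x = 2.0, where the true reward is clearly worse.
    print(f"proxy-optimal x = {x:.2f}, true reward there = {true_reward(x):.2f}")
    ```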

    One of my first pieces of work that got substantial engagement online was from when we used hyperparameter optimization with model-based RL to over-optimize the traditional Mujoco simulation environments used for evaluation of deep RL algorithms. The result was a cartwheeling half-cheetah maximizing its forward velocity, where it was supposed to learn to run. The video is below.

    Over-optimization in classical RL made it so that we couldn’t trust the agents to generalize to new tasks and applied great pressure on reward design.

    Over-optimization in RLHF made the models completely lobotomized — I mean repeating random tokens and gibberish, not the misdesign that results in over-refusal — because the signal we were overcooking was mismatched from the objective we actually want. While we can’t know the true objective exactly, we can know when over-optimization occurs.

    OpenAI’s new o3 model represents an entirely new type of inference behavior that mirrors a new type of over-optimization.

    o3 is an interesting model because, on top of new and improved text-only capabilities that result in super strong benchmark scores, it has been designed for multi-step tool use on any query where it’s relevant. It makes OpenAI’s product headaches in ChatGPT way worse — the model will search on its own even if I don’t trigger the search toggle — but it ushers in a new era in how language models are used.

    For example, in the process of writing this, I simply asked o3: “Can you find me the gif of a motor boat overoptimizing a game that was used by RL researchers for a long time? Maybe like wave crashers or something?” [link]. A few weeks ago, it took me at least 15 minutes to find this manually. o3 got the exact direct download link for me.

    As a counterpoint, Gemini 2.5 Pro in the Advanced UX hallucinated a wild link for me: https://giphy.com/explore/you-play-the-motorboat. AI Studio with search grounding was better; it got the game right, CoastRunners, but also yapped a lot of nonsense alongside it.

    o3’s answer in full was:

    Here’s the clip you’re thinking of—​it comes from the Flash‑based racing game Coast Runners and is the canonical “reward‑hacked motor‑boat” example in RL papers and talks:

    * Direct GIF (DeepMind mirror)

    https://lh3.googleusercontent.com/9gQpeRvwj1xQo9MVZwUdmautOd04wR8gXRi1wvlVZRrVXb5isOZRdjJ2t1_JShiYZcYB--iIMtl9g4Lbk6oxQzHgw-QQ6VoIWVZ4V3au7RgKli6X (save / open in browser to view)

    * DeepMind blog post (“Specification gaming: the flip side of AI ingenuity”, 2020) – scroll to the section that starts “Consider an agent controlling a boat…”; the GIF is embedded right there.

    * Original OpenAI post (“Faulty reward functions in the wild”, 2016) – same clip, plus a short write‑up of what the agent is doing (circling a lagoon to farm green targets instead of finishing the race).

    Why it matters

    The agent gets shaping reward for hitting green buoys; the optimal (but unintended) strategy is to loop in a small cove where three buoys respawn, racking up points while literally setting itself on fire. The clip has been reused for years to illustrate specification gaming / reward hacking—exactly the “over‑optimization” you mentioned.

    This was one of a few wonderful interactions I had with o3. For more reviews on o3 as a model, you can read Dan Shipper’s post, Tyler Cowen’s AGI snark (to be honest, I agree with him), or sleuth through many independent evaluations showing o3 is great (such as beating Gemini 2.5 Pro in a long-context story eval or beating R1 in creative writing). o3 is what happens when you keep scaling the RL training compute, which also boosts inference-time compute abilities. OpenAI’s release blog post shows the many ways the model is better than o1.

    Some things didn’t work; these new reasoning models are very “spiky” in their intelligence. What this means is that some interactions are mind-blowing and feel like entirely new modes of interacting with AI, but for some normal things that GPT-4 or Claude 3.5 have been able to do for a year or more, they fall totally flat on their face. Take this as a good sign, especially when the laboratories are shipping fast, as it means the pace of progress is so high that they need to get a model out now and will fix the oddities in the next, more mature version.

    The over-optimization that comes with o3’s new behaviors is linked to the new type of training. While the first reasoning models were trained to a first approximation to get math and code correct, o3 is trained with all that and to use tools to acquire and manipulate information. From OpenAI’s blog post:

    We also trained both models to use tools through reinforcement learning—teaching them not just how to use tools, but to reason about when to use them. Their ability to deploy tools based on desired outcomes makes them more capable in open-ended situations—particularly those involving visual reasoning and multi-step workflows.

    The vast majority of these sub-tasks in its training are verifiable, and this new AI training is extremely effective at making the model more useful for the tasks we’re used to. The problem is that there’s no way yet to do scalable “fixing” of the model’s weird language along the way. The new over-optimization doesn’t make the models worse at outcomes, it just makes them worse at language and explaining themselves.

    Some examples of o3’s weirdness feel like the model is underbaked, such as this one where it used an invalid non-ASCII dash in a coding setting.

    METR found that o3 is the model that can operate independently for the longest in agentic tasks, but also noted it has a propensity to “hack” their scores. Sound familiar?

    Transluce found that o3 hallucinated actions it took while trying to solve tasks — how does that even happen? Well, maybe the model was getting rewarded for successful tool calls and sometimes in the training data a fake tool call was incorrectly verified as real and successful. Once that happens a few times, the model will quickly catch on and keep doing it.

    There are plenty more examples of reward hacking and even a measurement that hallucinations are higher in o3 than in earlier recent models!

    It’s peculiar that the hacking for o3 has been a much more vocal component of the discourse, even when Claude 3.7 Sonnet also shows many signs of reward hacking, especially with code, but people shrug it off as “meh model” rather than a new phenomenon (more examples).

    This all takes me back to when Karpathy commented on the original reasoning models, saying:

    You can tell the RL is done properly when the models cease to speak English in their chain of thought

    These weird hallucinations the model is outputting are the equivalent of that, but for actions. We have no basis for what hallucinations in action space look like, but with better systems, they can be easier to verify — the system / sandbox can always confirm if the actions happened, and then that can be used in the loss. The action component of o3 makes it far more interesting, but also maybe less intrusive than Claude 3.7’s messy code.
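    A minimal sketch of that idea (mine, not OpenAI’s actual training setup): cross-check the tool calls a model claims in its transcript against the sandbox’s execution log, and only reward calls that actually ran.

    ```python
    # Minimal sketch: reward only tool calls the sandbox actually executed,
    # penalizing hallucinated ones. Data structures and scores are hypothetical.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ToolCall:
        name: str
        args: str

    def action_grounding_reward(claimed_calls, sandbox_log, bonus=1.0, penalty=1.0):
        executed = set(sandbox_log)  # ground truth from the sandbox
        score = 0.0
        for call in claimed_calls:
            score += bonus if call in executed else -penalty
        return score

    claimed = [ToolCall("search", "coast runners gif"), ToolCall("python", "print(2+2)")]
    log = [ToolCall("search", "coast runners gif")]
    print(action_grounding_reward(claimed, log))  # 0.0: one verified, one hallucinated
    ```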

    From a scientific perspective, this is wonderfully entertaining and enthralling intellectually — what is the model actually learning? At the same time, it is very reasonable for the safety-conscious to be wary of deploying these everywhere, but it doesn’t seem like we’ve seen anything too alarming yet, just inefficiencies and confusion.

    To summarize the three types of over-optimization we’ve seen in eras of RL, we have:

    * RL for control era: Over-optimization happens because our environments are brittle and tasks are unrealistic.

    * RLHF era: Over-optimization happens because our reward functions suck.

    * RLVR era: Over-optimization happens and makes our models super effective and even weirder. (*plus any other side-effects we’re yet to learn).


    This over-optimization is certainly a problem to address, as legibility is an important benefit of language models. I’m confident it can be mitigated with more complex training processes, but when labs are trying to get the models out ASAP it’ll come later.

    On top of all this is the prospect of o3 pro. o3 feels similar in peak capability to o1 pro (or even a little higher with its new tool use), but where o3 operates at a 60-70% hit rate, o1 pro feels like it’s up at 95%. o3 pro will bring the best of both worlds — the new incredible workflow and incredible reliability. Some sort of shallow search or refinement is a very logical process to help eliminate all the minor bugs and bumps in the early inference paths we’re feeling today.

    On top of this is the confirmation from OpenAI employees that o4-mini is a far better multimodal model than o3. We have plenty of new ways to use these models coming in the near future, integrating multimodality, tool use, reasoning, and shallow search. You should be excited, and when o4 and o3 pro are available, paying $200/month for them will feel obviously worth it.

    To quote Bob McGrew, former Chief Research Officer at OpenAI:

    The spotlight for o3 is on tool use because intelligence is no longer the primary constraint. The new frontier is reliable interaction with the external world.

    To make the models that enable this, we’re going to need to go through many new layers of uncertainty, surprise, and intrigue.

    o3 and this post are extremely bullish for the future of RL. RL is the only framing where multiple actions to a complex goal make sense to be learned end-to-end. Now, this is beginning to work. Deep Research from OpenAI was the first tool they tuned o3-with-tools to specialize in. Now it works in general queries.

    I personally, and we as a field, have a lot to learn about how this multi-tool RL works. Here are some recent papers that we can read to get a start (one-sentence summaries generated by o3 for the fun of it, just this one time):

    * Reinforcement Learning for Long‑Horizon Interactive LLM Agents: Introduces LOOP, a memory‑efficient PPO variant that trains a 32 B‑parameter LLM to operate as an interactive digital agent in AppWorld, outperforming the larger OpenAI o1 baseline by 9 percentage points.

    * ReTool: Reinforcement Learning for Strategic Tool Use in LLMs: Combines real‑time code execution with outcome‑driven RL so a 32 B model autonomously learns when and how to invoke tools, reaching 72.5 % accuracy on AIME and surpassing text‑only baselines.

    * ToRL: Scaling Tool‑Integrated RL: Presents ToRL, enabling LLMs to discover optimal computational‑tool strategies via RL, boosting Qwen2.5‑Math accuracy on AIME 24 and showing emergent self‑regulation of tool use.

    * Learning Autonomous Code Integration for Math Language Models: Proposes an EM‑style exploration plus off‑policy RL framework that teaches math‑reasoning LLMs to decide when to run code, yielding double‑digit gains on MATH500 and AIME without hand‑crafted templates.

    * Improving Multi‑Turn Tool Use with Reinforcement Learning (blog post): Shows that GRPO fine‑tuning of Qwen2.5‑7B‑Instruct on just 100 examples raises BFCL multi‑step tool‑use accuracy from 55 % to 78 %, detailing stabilizing tricks like tiny‑KL and over‑long filtering.

    Please share any more I missed over email or comment below!



  • https://www.interconnects.ai/p/openais-gpt-41-and-separating-the

    Recently I gave another talk on RLVR experiments and I posted some thoughts on OLMoTrace — Ai2’s recent tool to let you look at the training data of OLMo 2.

    OpenAI has been making many small updates toward their vision of ChatGPT as a monolithic app separate from their API business. Last week OpenAI improved the ChatGPT memory feature — making it so the app can reference the text of previous chats in addition to basic facts about the user. Today, OpenAI announced a new suite of API-only models, GPT-4.1, which is very directly in competition with Google’s Gemini models.

    Individually, none of OpenAI’s recent releases are particularly frontier-shifting — comparable performance per dollar models exist — but together they paint a picture of where OpenAI’s incentives are heading. This is the same company that recently teased that it has hit 1 billion weekly active users. This is the company that needs to treat ChatGPT and the models that power it very differently from any other AI product on the market. The other leading AI products are all for coding or information, where personality, vibes, and entertainment are not placed on as high a premium.

    A prime example of this shift is that GPT-4.5 is being deprecated from the API (with its extreme pricing), but is going to remain in ChatGPT — where Sam Altman has repeatedly said he’s blown away by how much users love it. I use it all the time; it’s an interesting and consistent model.

    Amid their major model releases, such as o3, o4, or the forthcoming open model, it can be hard to keep sight of the high-level view and see where OpenAI is going.

    A quick summary of the model performance comes from this chart that OpenAI released in the live stream (and blog post):

    Chart crimes aside (using MMLU as y-axis in 2025, no measure of latency, no axis labels), the story from OpenAI is the simple takeaway — better models at faster inference speeds, which are proportional to cost. Here’s a price comparison of the new OpenAI models (Gemini Pricing, OpenAI pricing):

    * GPT-4.1: Input/Output: $2.00 / $8.00 | Cached Input: $0.50

    * GPT-4.1 Mini: Input/Output: $0.40 / $1.60 | Cached Input: $0.10

    * GPT-4.1 Nano: Input/Output: $0.10 / $0.40 | Cached Input: $0.025

    And their old models:

    * GPT-4o: Input/Output: $2.50 / $10.00 | Cached Input: $1.25

    * GPT-4o Mini: Input/Output: $0.15 / $0.60 | Cached Input: $0.075

    To Google’s Gemini models:

    * Gemini 2.5 Pro* (≤200K tokens): Input/Output: $1.25 / $10.00 | Cached: Not available

    * Gemini 2.5 Pro* (>200K tokens): Input/Output: $2.50 / $15.00 | Cached: Not available

    * Gemini 2.0 Flash: Input/Output: $0.10 / $0.40 | Cached Input: $0.025 (text/image/video), $0.175 (audio)

    * Gemini 2.0 Flash-Lite: Input/Output: $0.075 / $0.30 | Cached: Not available

    *As a reasoning model, Gemini 2.5 Pro will use many more tokens, which are also charged to the user.
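    To make the comparison concrete, here is a quick per-request cost calculation using the list prices above. The 30k-input / 2k-output request size is an arbitrary assumption, and reasoning-token overhead for Gemini 2.5 Pro is ignored.

    ```python
    # Back-of-the-envelope per-request cost from the prices above ($ per million tokens).
    PRICES = {  # model: (input, output)
        "gpt-4.1": (2.00, 8.00),
        "gpt-4.1-mini": (0.40, 1.60),
        "gpt-4.1-nano": (0.10, 0.40),
        "gemini-2.5-pro": (1.25, 10.00),  # <=200K-token prompts
        "gemini-2.0-flash": (0.10, 0.40),
    }

    def request_cost(model, input_tokens, output_tokens):
        price_in, price_out = PRICES[model]
        return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

    for model in PRICES:
        print(f"{model:>16}: ${request_cost(model, 30_000, 2_000):.4f} per request")
    ```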

    The academic evaluations are strong, but that isn’t the full picture for these small models that need to do repetitive, niche tasks. These models are clearly competition for Gemini Flash and Flash-Lite (Gemini 2.5 Flash is coming soon, following the fantastic release of Gemini 2.5 Pro — expectations are high). GPT-4o-mini has largely been accepted as a laggard that is hard to use relative to Flash.

    To win in the API business, OpenAI needs to crack this frontier from Gemini:

    There are many examples in the OpenAI communications that paint a familiar story with these releases — broad improvements — with few details as to why. These models are almost assuredly distilled from GPT-4.5 for personality and from reasoning models like o3 for coding and mathematics. For example, there are very big improvements in code evaluations, where some of their early models were “off the map” and effectively at 0.

    Evaluations like coding and mathematics still fall clearly short of the likes of Gemini 2.5 (thinking model) or Claude 3.7 (optional thinking model). This shouldn’t be surprising, but is worth reminding ourselves of. While we are early in a paradigm of models shifting to include reasoning, the notion of a single best model is messier. These reasoning models use far more tokens to achieve this greatly improved performance. Performance is king, but tie goes to the cheaper model.

    I do not want to go into detail about OpenAI’s entire suite of models and naming right now because it does not make sense at all. Over time, the specific models are going to be of less relevance in ChatGPT (the main thing), and different models will power ChatGPT than those used in the API. We’ve already seen this with o3 powering only Deep Research for now, and OpenAI only recently walked back the line that “these models won’t be available directly.”

    Back to the ChatGPT side of things. For most users, the capabilities we are discussing above are effectively meaningless. For them, the dreaded slider of model effort makes much more sense:

    The new memory feature from last week got mixed reviews, but the old (simple) memory has been something I really enjoy about using ChatGPT. I don’t have to remind it that my puppy is an X-week-old miniature schnauzer or re-explain the context of my work. This’ll continue to get better over time.

    This feels extremely similar to when ChatGPT first added the search option: I didn’t really notice it at first, but now it feels like an essential part of my use (something that Claude still doesn’t do well). Claude was my daily driver for personality, but with great search and a rapidly improving personality, ChatGPT became indispensable. Still, Gemini 2.5 Pro is a better model, just not in a better interface.

    I strongly expect that the memory feature will evolve into something I love about ChatGPT. It’ll be much easier to ask ChatGPT to remind you of that thing you found a couple months ago than it would be to try and parse your Google search history.

    Some were skeptical of these new memories crossing personal and work uses, but I think this is easy to handle with search, unlike algorithmic feeds that try to balance all your interests in one stream. The funnel is per use, and the interactions are narrower and seem technically easier to get right.

    A final related point — people have long balked at the prices of chat interfaces relative to the API, but the reality that is fast approaching is that the personal experiences only exist in the app, and these are what people love. With the API, you could build a competitor that accumulates its own interactions, but as OpenAI has a huge product head start, this will be an uphill battle.

    All of this reinforces what we know — products are the key to developments in AI right now. Memory and better separation of the ChatGPT lineage from the API helps OpenAI pave that path forward (and maybe do advertising, especially with memory), but we have a long way until it is fully realized.



  • https://www.interconnects.ai/p/llama-4

    Where Llama 2’s and Llama 3’s releases were arguably some of the top few events in AI for their respective release years, Llama 4 feels entirely lost. Meta has attempted to reinvent their formula of models with substantial changes in size, architecture, and personality, but a coherent narrative is lacking. Meta has fallen into the trap of taking too long to ship, so the bar is impossible to cross successfully.

    Looking back at the history of Meta’s major open models, the sequence is as follows:

    * OPT – Released May 3, 2022 (ai.meta.com | 125M, 350M, 1.3B, 2.7B, 6.7B, 13B, 30B, 66B, 175B): A foundational open model that is underrated in the arc of language modeling research.

    * LLaMA – Released February 24, 2023 (ai.meta.com | 7B, 13B, 33B, 65B): The open weight model that powered the Alpaca age of early open chat models.

    * Llama 2 – Released July 18, 2023 (our coverage | about.fb.com | 7B, 13B, 70B): The open standard for academic research for its time period. Chat version had some bumps, but overall a major win.

    * Llama 3 – Released April 18, 2024 (our coverage | ai.meta.com | 8B, 70B): The open standard for its time. Again, fantastic base models.

    * Llama 3.1 – Released July 23, 2024 (our coverage | ai.meta.com | 8B, 70B, 405B): Much improved post training and the 405B marked the first time an open weight model competed with GPT-4!

    * Llama 3.2 – Released September 25, 2024 (our coverage | ai.meta.com | 1B, 3B, 11B, 90B): A weird, very underperforming vision release, outshined by Molmo on the same day.

    * Llama 3.3 – Released December 6, 2024 (github.com | 70B): Much improved post-training of the smaller 3.1 models, likely in response to other open releases, but largely a minor update.

    * Llama 4 – Released April 5, 2025 (ai.meta.com | 17A109B, 17A400B): What we got today.

    The time between major versions is growing, and the number of releases seen as exceptional by the community is dropping. Llama 4 consists of 3 models, quoting from the blog post, notes in brackets mine:

    * Llama 4 Scout, a 17 billion active parameter model with 16 experts [and 109B total parameters, ~40T training tokens], is the best multimodal model in the world in its class and is more powerful than all previous generation Llama models, while fitting in a single NVIDIA H100 GPU.

    * Llama 4 Maverick, a 17 billion active parameter model with 128 experts [and 400B total parameters, ~22T training tokens].

    * These models are our best yet thanks to distillation from Llama 4 Behemoth, a 288 billion active parameter [and 2T total parameters] model with 16 experts that is our most powerful yet and among the world’s smartest LLMs…. we’re excited to share more details about it even while it’s still in flight.

    Here are the reported benchmark scores for the first two models, which are available on many APIs and to download on HuggingFace.

    Where Llama models used to be scaled across different sizes with almost identical architectures, these new models are designed for very different classes of use-cases.

    * Llama 4 Scout is similar to a Gemini Flash model or any ultra-efficient inference MoE.

    * Llama 4 Maverick’s architecture is very similar to DeepSeek V3 with extreme sparsity and many active experts.

    * Llama 4 Behemoth is likely similar to Claude Opus or Gemini Ultra, but we don’t have substantial information on these.
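    For a sense of the sparsity involved, the arithmetic below uses only the active and total parameter counts quoted earlier (Behemoth’s figures are Meta’s stated numbers for a model still in training).

    ```python
    # Fraction of total parameters active per token, from the counts quoted above.
    models = {
        "Llama 4 Scout": (17e9, 109e9),
        "Llama 4 Maverick": (17e9, 400e9),
        "Llama 4 Behemoth": (288e9, 2e12),
    }

    for name, (active, total) in models.items():
        print(f"{name}: {active / total:.1%} of parameters active per token")
    ```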

    This release came on a Saturday, which is utterly bizarre for a major company launching one of its highest-profile products of the year. The consensus was that Llama 4 was going to come at Meta’s LlamaCon later this month. In fact, it looks like this release may have been pulled forward from today, the 7th, from a commit in the Meta Llama Github:

    One of the flagship features is the 10M token context window on Scout, the smallest model (Maverick is 1M), but even that didn’t have any released evaluations beyond Needle in a Haystack (NIAH), which is seen as a necessary condition but not a sufficient one for calling something a good long-context model. Some more modern long-context evaluations include RULER or NoLiMa.

    Many, many people have commented on how Llama 4’s behavior is drastically different in LMArena — which was their flagship result of the release — than on other providers (even when following Meta’s recommended system prompt). Turns out, from the blog post, that it is just a different model:

    Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena.

    Sneaky. The results below are fake, and it is a major slight to Meta’s community to not release the model they used to create their major marketing push. We’ve seen many open models that come around to maximize on ChatBotArena while destroying the model’s performance on important skills like math or code. We’ll see where the released models land.

    Regardless, here’s the plot Meta used. Look at the fine print at the bottom too.

    This model is actually the one tanking the technical reputation of the release because its character is juvenile. The actual model on other hosting providers is quite smart and has a reasonable tone!

    ArtificialAnalysis rated the models as “some of the best non-reasoning models,” beating leading frontier models. This is complicated because we shouldn’t separate reasoning from non-reasoning models; we should just evaluate on reasoning and non-reasoning domains separately, as discussed in the Gemini 2.5 post. So-called “reasoning models” often top non-reasoning benchmarks, but the opposite is rarely true.

    Other independent evaluation results range from medium to bad and confusing — I suspect very weird results are hosting issues with the very long context models. At the same time, the Behemoth model is outclassed by Gemini 2.5 Pro. To list some of the major technical breakthroughs that Meta made (i.e. new to Llama, not new to the industry):

    * Mixture of expert architectures, enabling Llama 4 to be trained with less compute than Llama 3 even though they have more total parameters — a lot more.

    * Very long context up to 10M tokens.

    * Solid multimodal input performance on release day (and not in a later model).


    Sadly this post is barely about the technical details. Meta nuked their release vibes with weird timing and by having an off-putting chatty model that was easiest to find to talk to. The release process, timing, and big picture raise more questions for Meta. Did they panic and feel like this was their one shot at being state of the art?

    The evaluation scores for the models are solid; they clear a fairly high bar. With these highly varied MoE architectures, it’s super hard to feel confident in an assessment of the model based on benchmarks, especially when compared to dense models or teacher-student distilled models. The very-long-context base models will be extremely useful for research.

    The question here is: Why is Meta designing their models in the same way as other frontier labs when their audience is open-source AI communities and businesses, not an API serving business or ChatGPT competitor?

    The model sizing for the likes of Gemini and ChatGPT is downstream of nuanced decisions based on a balance of training cluster size, inference needs, and performance trade-offs. These trade-offs are very different for open models, where you don’t pay inference, and many users are not hyperscale companies.

    The model that becomes the “open standard” doesn’t need to be the best overall model, but rather a family of models in many shapes and sizes that is solid in many different deployment settings. Qwen 2.5, with models at 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters, is the closest to this right now. There’s actually far less competition in this space than in the space Meta chose to go into (and take on DeepSeek)!

One of these communities historically has been the LocalLlama subreddit, which named the entire at-home model-running community after the Llama series of models — and they’re not happy with Llama 4. Another community is academics, for whom a series of models across different size ranges is wonderful for understanding language models and improving methods. Both of these groups are GPU-poor, so memory-intensive models like these sparse mixtures of experts price out even more participants in the open community (who tend to be memory-limited).

This is all on top of an onerous license that requires all artifacts built with Llama to carry the “Llama-” name, the Llama license, and the “Built with Llama” branding if used commercially, along with use-case restrictions. This comes at the same time as their competitors, i.e. DeepSeek, released their latest flagship model with an MIT license (which has no downstream restrictions).

A third group is businesses looking to use open models on-premises as open models close the gap to their closed counterparts. This is exactly the kind of group that would be sensitive to the extra legal risk that Llama’s license exposes them to.

    On top of all of this weirdness, many of Meta’s “open-source” efforts are restricted in the European Union. Where the Llama 3.2 models blocked you if you tried to access them from Europe, Llama 4 is available for download but prohibits the use of vision capabilities in an acceptable use policy. This is not entirely Meta’s fault, as many companies are dealing with side effects of the EU AI Act, but regulatory exposure needs to be considered in Meta’s strategy.

Meta had a tight grasp on these communities, and the Llama projects were rightfully loved, but now they feel lost. With Qwen 3 around the corner and countless other amazing open-weight models out now (and many more teased, such as from OpenAI), the competition is extreme.

    The soul of the Llama series died by not releasing enough models frequently enough. Reclaiming that with GenAI’s constant organizational headaches looks like a Sisyphean task. What is Meta’s differentiation in the AI space? It still seems about enabling their own platforms to flourish, not about truly supporting open.

Meta’s GenAI organization has been showing major signs of cultural challenges throughout its entire existence — including their head of AI research leaving just a few days before this model was launched.

    Sadly, the evaluations for this release aren’t even the central story. The vibes have been off since the beginning by choosing a weird release date. Over the coming weeks, more and more people will find reliable uses for Llama 4, but in a competitive landscape, that may not be good enough. Llama is no longer the open standard. Personally, this makes me sad. As an American, I want the default pieces of the open ecosystem to be run by American or American-friendly companies.

With the macro pressure coming to Meta’s business and the increasing commoditization of open models, how is Zuckerberg going to keep up in the face of shareholder pressure pushing back against the cost of the Llama project? This isn’t the first time he’s faced such pressure, but he needs to reevaluate the lowest-level principles of their approach to open AI.



  • https://www.interconnects.ai/p/rl-backlog-openais-many-rls-clarifying

    I have a second blog where I post half-baked thoughts, sometimes previews of what comes here. If you’re interested, I posted some musings on OpenAI’s coming open model release.

It’s obvious that reinforcement learning (RL) is having a total return to glory among the broader AI community, but its real successes are mostly in the things people aren’t focusing on. More math and code datasets are important platforms — we know they’re coming — but they’re still over-indexed on. The same RL methods are being used in many of the leading models and AI products.

This is largely a post I wrote a few weeks ago on RL news I was following. It never had a central focus, so it didn’t get published, but I’m sharing it because many folks are following this area very closely. Today:

    * OpenAI’s many forms of RL,

    * On distilling chain of thoughts vs. RL,

    * Did DeepSeek distill o1?, and

    * Why latent reasoning is so interesting.


    OpenAI’s many forms of RL

    For those plugged into the OpenAI cultural tap that is Twitter, it is obvious that they’re very invested in reinforcement learning. With the hype around the release of their o-series of reasoning models, it was easy to assume that those were the only avenue for excitement. OpenAI’s recent releases have shown this is not the case, and every release from a model launch to a new product has included mentions of RL training. Some of this, of course, is marketing, but they all fit as different applications of reinforcement finetuning (RFT) / RL with verifiable rewards (RLVR).

    The first other application was OpenAI’s Operator agent. They stated:

    Combining GPT-4o's vision capabilities with advanced reasoning through reinforcement learning, CUA is trained to interact with graphical user interfaces (GUIs)—the buttons, menus, and text fields people see on a screen.

There’s a bit more speculation to do than normal in this post. Ultimately, with launch partners like DoorDash, Instacart, etc., they could set up verifiable domains where the agent is rewarded for accomplishing a natural language task. This could rely on help from those websites to get started. Lots of people know this could work, as agents are deeply tied to the core of RL lore, but the implementation details haven’t really been worked out in open projects.

    The same goes for Deep Research. They stated:

    Deep research independently discovers, reasons about, and consolidates insights from across the web. To accomplish this, it was trained on real-world tasks requiring browser and Python tool use, using the same reinforcement learning methods behind OpenAI o1, our first reasoning model.

    Deep research was trained using end-to-end reinforcement learning on hard browsing and reasoning tasks across a range of domains.

    Some more was shared in the Deep Research system card.

    There are lots of things one can envision — e.g. agent gets a reward if the document retrieved from search has relevant information (not a verifiable reward, but LLM-as-a-judge). Most of this is likely used to get very high reliability across tool use to enable the tons of calls done in the back end when a call takes 10+ minutes for the user.
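As a rough sketch of what such a judge-based reward could look like — `llm_judge` here is a hypothetical callable, not anything OpenAI has described:

```python
def retrieval_reward(query: str, retrieved_doc: str, llm_judge) -> float:
    """Score a retrieved document with an LLM judge instead of a programmatic check.

    `llm_judge` is a hypothetical callable that returns the judge model's text reply.
    """
    prompt = (
        "You are grading a search result.\n"
        f"Question: {query}\n"
        f"Retrieved document: {retrieved_doc}\n"
        "Does the document contain the information needed to answer the question? "
        "Reply with a single number from 0 (irrelevant) to 1 (fully sufficient)."
    )
    reply = llm_judge(prompt)
    try:
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0  # unparseable judgments give no reward
```

The interesting engineering is in making thousands of such judgments cheap and consistent, not in the reward function itself.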

More research has emerged on RAG/search with RL.

Least surprising was the announcement of the new GitHub Copilot model with new and improved RL training for code:

    Our new code completion model is shipping in public preview today. We are calling it GPT-4o Copilot. Based on GPT-4o mini, with mid-training on a code-focused corpus exceeding 1T tokens and reinforcement learning with code execution feedback (RLEF).
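The “execution feedback” part is easy to picture: run the generated code against tests and turn the outcome into a scalar reward. A minimal sketch (not GitHub’s actual pipeline):

```python
import os
import subprocess
import tempfile

def execution_reward(generated_code: str, test_code: str, timeout_s: int = 10) -> float:
    """Binary reward: 1.0 if the generated solution passes its tests, else 0.0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # hangs and infinite loops earn no reward
    finally:
        os.unlink(path)
```

Production systems sandbox this far more carefully, but the reward signal itself really is this simple, which is why code is such a natural RLVR domain.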

    This all goes back to what I said in OpenAI's Reinforcement Finetuning and RL for the masses — this new RL training is a perfectly aligned way to get nearly perfect performance on a domain you can control carefully. The best results come with mastery of the domain and with training.

A fun bit of speculation pointing to how invested OpenAI is in RL and post-training: their new o3-mini model has the same knowledge cutoff, October 2023, as OpenAI’s other flagship models. That cutoff receding further into the past shows how invested OpenAI is in their search products (which, to be fair, are quite good) for fresh information, and how such strong performance gains can come from other improvements in the training stack.

    OpenAI also released a paper on competitive coding with RL training, but it did not have a ton of useful details.

    On distilling chain of thoughts vs. RL

There were a few points from the DeepSeek paper and discourse that warrant repeating. To restate, distillation in this case means training a model (usually with SFT, but any loss function works) on outputs from a stronger model — a minimal sketch of that loop is below. Let’s get right into it.
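A minimal sketch of that loop, with illustrative Hugging Face-style calls — the teacher, prompts, and optimizer are all placeholders:

```python
def build_distillation_dataset(teacher_generate, prompts):
    """Step 1: collect completions from the stronger 'teacher' model."""
    return [
        {"prompt": p, "completion": teacher_generate(p)}  # hypothetical teacher call
        for p in prompts
    ]

def distill_sft(student, tokenizer, dataset, optimizer):
    """Step 2: ordinary supervised finetuning of the student on those completions."""
    student.train()
    for example in dataset:
        text = example["prompt"] + example["completion"]
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        # Standard causal LM loss on the teacher's tokens -- nothing distillation-specific
        # beyond where the data came from.
        loss = student(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The only special ingredient is the provenance of the data; the training step is plain SFT.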

    First, DeepSeek made it very clear that using more RL after distillation (SFT) is crucial for the best possible models.

    Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here.

    My current understanding here is that matching the data distribution from the base model’s training to the distillation data and the RL prompts is very important. This specifically is crucial for enabling RL at the end — SFT will almost always boost the scores, but can narrow the scope to which the model can be finetuned more. DeepSeek figured this out for their models, but didn’t share the details.

    The next point is on how scale mediates the impact of RL training:

    First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation.

This is more confusing than useful, and is drawn from the fact that “DeepSeek-R1-Distill-Qwen-32B, which is distilled from DeepSeek-R1, performs significantly better than DeepSeek-R1-Zero-Qwen-32B across all benchmarks”. We should not expect -Zero style models trained only with RL to perform well on benchmarks (unless you’re training on the test set). That is not what they are designed for. The distilled models are trained on text very finely tuned for existing language modeling workflows. The RL-Zero (not distilled) models are very exploratory in their behaviors.

    The right baseline would be putting Qwen-32B through the whole R1 recipe — which would be far more likely to outperform the distilled version.

Along with this is the fact that small models take more work from RL. Doing this sort of exploratory RL is much easier with big models — it could be that they pick up more rare behaviors during pretraining and RL draws them out, while smaller models squash these long-tail behaviors.

    Continuing on this, the DeepSeek authors state:

    Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger scale reinforcement learning.

    Did DeepSeek distill OpenAI’s o1 model? (hint, no)

This is a question I meant to address ages ago, but here we are — a few model launches got in the way. The criticism pushed by OpenAI and many media outlets is that DeepSeek was trained on reasoning traces from OpenAI’s o1 model. OpenAI spent approximately 18 months getting the initial data to train their o1 model, so it is understandable that they are wary of giving that away for free, but the existing evidence suggests that DeepSeek training on o1 CoTs is extremely unlikely.

To start, the o1 chains of thought were not visible to users. In order to get this data, DeepSeek would need to reliably hack the OpenAI API or ChatGPT to reveal it. Users were getting banned from OpenAI’s properties for trying to do this. A cover-up of this scale would be unlikely to go unnoticed.

Second, as shown in the DeepSeek R1 recipe, training on on-policy completions from your own model(s) is crucial to training a model like this. In many ways, creating the final R1 model by distilling from o1’s CoTs would likely have been harder than following the recipe DeepSeek presented in the paper. They also have evidence in training plots that their RL training works.

    At the same time, this is a hard claim to settle, as I think it is very likely that DeepSeek used OpenAI model outputs in the training process of their recent models. Distillation in multiple stages of the post-training process is a very common practice. For example, in order to do initial post-training on models like DeepSeek V3, training on completions from OpenAI chat models is a very simple way to get going.

To this day, OpenAI is still worried about distillation from their chains of thought, or they’re doing something that makes showing the underlying chain of thought not make sense (e.g. basic forms of search or self-consistency). For example, OpenAI now shows summaries of the chains of thought for their o-series models, but they’re not raw like Claude’s or Gemini’s.

    These aren't the raw CoTs but it's a big step closer and I'm glad we can share that experience with the world.

    Why latent reasoning is so interesting

One of the most intellectually engaging ideas to emerge during this early-2025 rush of reasoning research is that language models could reason in a compressed intermediate representation rather than outputting text tokens, which come with quadratic inference cost. The two papers that come to mind are:

    * Training Large Language Models to Reason in a Continuous Latent Space

    * Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    Here’s a figure from the latter:

    Without going into details of these papers’ implementations, this is compelling because it pushes in the direction of letting language models think in whatever representation suits them. Then, they’ll output tokens or take actions in a form that works in the environment or is human legible.
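As a cartoon of the mechanism in the first paper: instead of decoding a thought token and re-embedding it, the final hidden state is fed straight back in as the next input embedding, and only the final answer is decoded into text. This is a conceptual sketch in Hugging Face-style code, not the paper’s implementation:

```python
import torch

def latent_reasoning_step(model, input_embeds):
    """One 'thought' step that never leaves hidden-state space (conceptual sketch)."""
    outputs = model(inputs_embeds=input_embeds, output_hidden_states=True)
    last_hidden = outputs.hidden_states[-1][:, -1:, :]   # final layer, last position
    # Instead of projecting to the vocabulary and sampling a token, append the raw
    # hidden state as the next input embedding and keep "thinking" latently.
    return torch.cat([input_embeds, last_hidden], dim=1)

def reason_then_answer(model, tokenizer, prompt, n_latent_steps=8, max_new_tokens=64):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids)
    for _ in range(n_latent_steps):          # compressed reasoning, no text emitted
        embeds = latent_reasoning_step(model, embeds)
    # Only the final answer is decoded into human-legible tokens.
    return model.generate(inputs_embeds=embeds, max_new_tokens=max_new_tokens)
```

The real papers add training procedures and recurrence schemes on top of this, but the core appeal is visible even here: the expensive, legible token stream only appears at the end.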

    We’ve already seen many related results of RL training, such as the DeepSeek R1 Zero model switching from English to Chinese randomly when it thinks.

    Ultimately, RL training is all about improving outcomes, so this type of structural drift is expected. The question is whether we can incentivize the models to use far more compressed representations than the standard language tokens used in modern tokenizers.

A related trade-off already exists in language models, where non-English languages are often far more costly than English to run inference on because they are a lower priority in the tokenizer (or the language is simply more verbose). The goal of latent or compressed reasoning research is to push in the other direction.
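This trade-off is easy to measure yourself. A quick check with a Hugging Face tokenizer — the specific tokenizer and sentences here are arbitrary, and the exact ratio varies by model and language:

```python
from transformers import AutoTokenizer

# Any causal LM tokenizer works here; GPT-2's is just a convenient example.
tok = AutoTokenizer.from_pretrained("gpt2")

english = "The quick brown fox jumps over the lazy dog."
german = "Der schnelle braune Fuchs springt über den faulen Hund."

for label, text in [("English", english), ("German", german)]:
    n_tokens = len(tok(text)["input_ids"])
    print(f"{label}: {n_tokens} tokens for {len(text)} characters")
# Non-English text typically tokenizes into more tokens per character,
# which translates directly into higher inference cost.
```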

    Anthropic has been doing interesting research on understanding the nature of the reasoning traces from Claude. With their reasoning launch, they already stated that:

    we didn’t perform our standard character training on the model’s thought process.

    They’ve also seen that the reasoning is not connected with the model’s actual thoughts.

    Thus far, our results suggest that models very often make decisions based on factors that they don’t explicitly discuss in their thinking process. This means we can’t rely on monitoring current models’ thinking to make strong arguments about their safety.

This is expected. The reasoning chain is not the same artifact that humans use chain of thought for, even if it appears in the same format. Chain of thought is about generating the right context to get the final answer right. There are no guarantees that the most interpretable form is the one with the highest performance — in fact, in many deep learning systems, end-to-end learning with no constraints on the intermediate representation is often best!

To end, I’m leaving you with another classic Rich Sutton essay in full (he’s the author of The Bitter Lesson). With RL, better verifiers mean you get more out of both RL training and inference-time scaling:

    Verification, The Key to AI

Rich Sutton, November 15, 2001

    It is a bit unseemly for an AI researcher to claim to have a special insight or plan for how his field should proceed. If he has such, why doesn't he just pursue it and, if he is right, exhibit its special fruits? Without denying that, there is still a role for assessing and analyzing the field as a whole, for diagnosing the ills that repeatedly plague it, and to suggest general solutions.

    The insight that I would claim to have is that the key to a successful AI is that it can tell for itself whether or not it is working correctly. At one level this is a pragmatic issue. If the AI can't tell for itself whether it is working properly, then some person has to make that assessment and make any necessary modifications. An AI that can assess itself may be able to make the modifications itself.

    The Verification Principle:

    An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself.

Successful verification occurs in all search-based AI systems, such as planners, game-players, even genetic algorithms. Deep Blue, for example, produces a score for each of its possible moves through an extensive search. Its belief that a particular move is a good one is verified by the search tree that shows its inevitable production of a good position. These systems don't have to be told what choices to make; they can tell for themselves. Imagine trying to program a chess machine by telling it what kinds of moves to make in each kind of position. Many early chess programs were constructed in this way. The problem, of course, was that there were many different kinds of chess positions. And the more advice and rules for move selection given by programmers, the more complex the system became and the more unexpected interactions there were between rules. The programs became brittle and unreliable, requiring constant maintenance, and before long this whole approach lost out to the "brute force" searchers.

    Although search-based planners verify at the move selection level, they typically cannot verify at other levels. For example, they often take their state-evaluation scoring function as given. Even Deep Blue cannot search to the end of the game and relies on a human-tuned position-scoring function that it does not assess on its own. A major strength of the champion backgammon program, TD-Gammon, is that it does assess and improve its own scoring function.

    Another important level at which search-based planners are almost never subject to verification is that which specifies the outcomes of the moves, actions, or operators. In games such as chess with a limited number of legal moves we can easily imagine programming in the consequences of all of them accurately. But if we imagine planning in a broader AI context, then many of the allowed actions will not have their outcomes completely known. If I take the bagel to Leslie's office, will she be there? How long will it take to drive to work? Will I finish this report today? So many of the decisions we take every day have uncertain and changing effects. Nevertheless, modern AI systems almost never take this into account. They assume that all the action models will be entered accurately by hand, even though these may be most of the knowledge in or ever produced by the system.

    Finally, let us make the same point about knowledge in general. Consider any AI system and the knowledge that it has. It may be an expert system or a large database like CYC. Or it may be a robot with knowledge of a building's layout, or knowledge about how to react in various situations. In all these cases we can ask if the AI system can verify its own knowledge, or whether it requires people to intervene to detect errors and unforeseen interactions, and make corrections. As long as the latter is the case we will never be able to build really large knowledge systems. They will always be brittle and unreliable, and limited in size to what people can monitor and understand themselves.

    "Never program anything bigger than your head"

    And yet it is overwhelmingly the case that today's AI systems are not able to verify their own knowledge. Large ontologies and knowledge bases are built that are totally reliant on human construction and maintenance. "Birds have wings" they say, but of course they have no way of verifying this.

    Sharing a copy of Rich Sutton’s essay because his website sometimes has DNS issues and goes down. http://incompleteideas.net/IncIdeas/KeytoAI.html

    Thanks for reading!

    Thanks to Tanmay Gupta for helpful links or comments used in this article.



  • https://www.interconnects.ai/p/gemini-25-pro-googles-second-ai-chance

    Google, with its immense infrastructure and talent, has been the safe bet for the question of “Who will have the best models in a few years?” Google took a long time to get here, overcoming Bard’s launch and some integration headaches, and yet the model they launched today, Gemini 2.5 Pro feels like the biggest jump in evaluation scores we’ve seen in quite some time.

    It’s often hard to communicate how the models we are getting these days are actually better. To be informed, you need to take a balanced view across many benchmarks, look roughly at the percentage by which the model is clearly state-of-the-art, and of course, try the model yourself.

To summarize, while more evaluations are rolling in, Gemini 2.5 Pro is 40+ Elo points clear of the rest on the popular ChatBotArena / LM Arena benchmark (more here). Normally, when a model launches and claims the top spot, it’s barely ahead. In fact, this is the second biggest jump for a new top model in LMSYS history, only behind GPT-4 Turbo overtaking Claude 1 — and the GPT-4 Turbo era was when models were not really trained for the benchmark, so progress was much faster.
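For a sense of scale, the standard Elo expected-score formula translates a 40-point gap into roughly a 56% head-to-head win rate (ignoring ties):

```python
def elo_win_probability(rating_gap: float) -> float:
    """Expected score of the higher-rated model under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))

print(elo_win_probability(40))   # ~0.557: a 40-point lead is about a 56% expected win rate
print(elo_win_probability(100))  # ~0.640, for comparison
```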

    The blog post highlights insane scores on the benchmarks used to evaluate the leading reasoning models. One to note here is the score of 18.8 on Humanity’s Last Exam without search or tools, which was one of the evaluations I highlighted as impressive with the launch of OpenAI’s Deep Research, which compiles knowledge from the web!

    Gemini 2.5 is topping other independent evaluations such as the Scale Leaderboard (which is underrated or at least low on visibility, more here). More independent evaluations are going to trickle in, but all of the ones I’ve seen are extremely positive.

Gemini is also still the model with the longest context length and very strong multimodal performance (including audio). There are plenty of small wins like this that Google has that are hard to see when skimming the benchmarks above.

    So, how did Google do it? As usual, the blog post doesn’t have a ton of technical details. Google says:

    we've achieved a new level of performance by combining a significantly enhanced base model with improved post-training.

    Until we have API pricing, it’ll be harder to make even informed guesses about whether the model is huge like GPT-4.5. As for understanding how Gemini models will behave, Google shares:

    Going forward, we’re building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable, context-aware agents.

    This idea of directly integrating reasoning into all of their models is something Sam Altman teased for GPT-5. This trend has serious trade-offs on user experience that we will get to later, but it is crucial for people to keep up with as the discourse today is often centered on "the best non-reasoning model” or “the best reasoning model.”

    This came up recently with DeepSeek’s new V3 model.

DeepSeek's new model (0324) is a major update in performance and license. The MIT license will make it hugely impactful for research and open building. Though many are ending up confused about whether it is a "reasoning" model. The model is contrasted with their R1 model, which is a reasoning-only model (like o1).

    Reasoning models are on a spectrum now, and it's not just yes or no. GPT 4.5 is a good example of what a model with pretty much no reasoning looks like today.

Compared to other models in the industry, like Claude 3.7 and Grok 3 with their reasoning toggles, the new DeepSeek V3 is definitely in this class of "hybrid reasoners": models still trained extensively with RL on verifiable domains (or distilled directly from another reasoning model), but where other parts of the post-training process come first and hold more weight than in the RL-heavy, reasoning-only models.

    This is all to say that when people say that "DeepSeek V3 0324 is the best non-reasoner model," that doesn't really make sense. The original V3 had very light post-training, so it wasn't really on the reasoning model spectrum.

    Now, things are complicated. It'll be like this for a while!

    Gemini 2.5 Pro is quite simple. It is very much a reasoning model, at least in how it is offered to users in Gemini Advanced and AI studio — every query has reasoning before an answer. It is fairly conclusive now that using this extended reasoning can boost performance across many domains, but it’s not clear how to best trade off cost and speed with varying amounts of reasoning.

    Gemini 2.5 in its current offering is a brute force approach — a big, very smart model that is tuned to use a lot of reasoning tokens — and it’s good for the trajectory of the industry that it paid off with such high performance.


    The state of the AI industry

    With launches from DeepSeek, GPT-4.5 from OpenAI, Claude 3.7 from Anthropic, Grok 3 from xAI, and now Gemini 2.5 Pro, this has been a wild spring for progress in AI models. The major AI laboratories have all delivered super impressive performance — this post feels like the ribbon that ties all of them together.

    The one player seriously missing this spring is Meta with their Llama models. They’ve fallen into the trap where the longer you go between models, the harder it gets to release them because expectations get higher. I hope Llama 4 succeeds because they’re a large part of the open community, but it is a warning to AI laboratories on how to manage deliverables.

    With the major progress that AI labs are making, it feels like the answer for who will have the best model is now who can drop the hot potato of a cutting-edge model into the real world the fastest.

The common interpretation of events is that models are commoditizing, but that is an incomplete story. The value in the ecosystem is poised to accrue to the sites with users. Some established ones in AI are ChatGPT, Perplexity, Cursor, etc. This may not always be the case as uses for AI evolve.

    What we’re seeing with the newest models is that the pace of progress is staying high in many areas (i.e. more than just ChatBotArena). All sorts of evaluations, from niche information to hard mathematics to software development, are getting new leading models every few weeks.

    The more often state-of-the-art models are released in a fixed time window, the more confident you can be in the pace of progress continuing. These labs are all racing up similar trees, but it’s only possible for so much competition to exist when progress isn’t super hard to find. The ceiling on performance is rising and the potential value underneath it that we haven’t unlocked is continuing to balloon.

    Google AI’s second chance

    This quote has been going around after Ben Thompson interviewed OpenAI CEO Sam Altman on his plans for OpenAI:

    Ben Thompson: What’s going to be more valuable in five years? A 1-billion daily active user destination site that doesn’t have to do customer acquisition, or the state-of-the-art model?

    Sam Altman: The 1-billion user site I think.

    A world where user-facing websites are the most valuable part of AI is a world where AI is less of a platform for doing things and more of a tool for complementing existing habits. AI progress is as high as it ever has been, and the focus of that is on moving from benchmarks towards turning them into agents and tools.

    Google’s biggest opportunity is being the one player that has it all — leading models, infrastructure, and a cloud offering to make it the default platform for building value with AI. They have users to retain with Google.com, which they are obviously trying to do, but the rest of their efforts should be on being an AI platform.

With this release, I spent time trying to use Google’s Gemini Advanced offerings like I use ChatGPT and Claude. These use cases were immediately confusing. It didn’t feel like chat is at all the right way to evaluate this new Gemini 2.5 model. It’s perfectly capable, but without a depth of personality it feels lost relative to the fun GPT-4.5 or the ever-quirky Claude.

    And why am I paying for Gemini Advanced? Google is the company known for giving things away for free and at scale. If Google isn’t committed to figuring out advertisements for its chat products, then it’ll never meaningfully shift the revenue. Breaking through the moat of ChatGPT with anything other than better models at a free plan is next to impossible at this point. The disruption and new habits have already formed.

Many of my experiences with Gemini 2.5 Pro — beyond it lacking the distinctive character of GPT-4.5 and Claude 3+, where Gemini feels sort of bland — have to do with the form factor of forcing reasoning into every response. Even for basic queries, the extensive reasoning of Gemini 2.5 Pro puts the time to first token on the order of seconds.
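Time to first token is also trivial to measure if you want to compare providers yourself; `stream_chat` below is a hypothetical streaming call, not any specific API:

```python
import time

def time_to_first_token(stream_chat, prompt: str) -> float:
    """Seconds until the first streamed token arrives.

    `stream_chat` is a hypothetical generator that yields tokens from any chat API.
    """
    start = time.perf_counter()
    for _ in stream_chat(prompt):
        return time.perf_counter() - start
    return float("inf")  # stream ended without producing a token
```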

    Normal consumers don’t benefit from the improvements with reasoning that are accompanied by such a decrease in speed. For agents doing substantial work in the background, a long time to first token or a bland personality don’t matter!

    Reasoning heavily on every query is a major quality-of-life drain for chat and reopens the same discussions on reasoning models that balance when they should reason. Claude, Grok, DeepSeek, and OpenAI all have selectors for toggling reasoning on or off. This should be the default until models are better.

Gemini should not be focusing on competing with ChatGPT in the same business. That’s a losing battle and arguably not even the biggest possible final market — subscriptions have never scaled to become the core of the world’s largest companies.

    Where Gemini Advanced (at gemini.google.com) feels like a ChatGPT clone, AI Studio (at ai.dev) feels like the onboarding point for developers and customers using their platform. Logan and others have made big progress softening the barrier for people jumping from OpenAI and Anthropic to Google. These leads are far more valuable than Gemini Advanced subscribers.

    Google should be a platform for others to build AI and use AI to make their offerings better. Google has had success with its AI overviews and continues to build on that. At the same time, their offerings for using Gemini in products have pretty much failed completely.

    There are two clear avenues where Google can use Gemini to deliver business value:

    * Gemini for product: Enhancing existing products like Docs, Sheets, YouTube, Android, Assistant, etc. — i.e., taking the above and making it actually work. The Gemini product offerings across the Google suite are pretty much still in their Bard stage. The same focus and execution from training needs to extend into Gemini products and Google Cloud for the next stage of this arc. Extreme value is ready to be captured by the models even if the models don’t continue to improve. The urgency on products at this point very well should be higher than the pressure to train better models.

* Google Cloud: Offering fast and cheap inference of Gemini in the form factor developers need. Google Cloud, being integrated from the hardware (TPUs) up through the models, can often provide the best models at the lowest prices. Selling Gemini into a world of flourishing agents is a far better match for Google’s culture of product successes. AI Studio and the API developer relations around it can be a seed that grows.

Google has the best models again, as they should, having started this whole AI bloom. The strategic error has been righted. The AI leadership has woken up to the crisis, and the researchers/engineers have risen to the occasion. The rest of the company has to do the same.



  • https://www.interconnects.ai/p/how-to-manage-ai-training-organizations

    It is a closely guarded secret how the leading AI laboratories structure their training teams. As with other technology companies, the saying “you ship your org chart” still applies to training AI models. Looking at these organizational structures will reveal where research can be scaled up, the upper limits of size, and potentially even who uses the most compute.

    How modeling teams do and do not work

    A crucial area I’m working on (reach out if you would like to share more off the record) is how to scale these lessons to bigger, more complex teams. The core factor differentiating teams that succeed from those that do not is maintaining these principles while scaling team size.

    Big teams inherently lead to politics and protecting territory, while language models need information to flow from the bottom to the top on what capabilities are possible. Regardless of the possibilities, leadership can shift resources to prioritize certain areas, but all of the signals on whether this is working come from those training models. If senior directors mandate results under them before unblocking model releases, the entire system will crumble.

Having seen this potential end state — without naming specific companies — it is obviously desirable to avoid, but anticipating and avoiding it during rapid growth takes substantial intentionality.

    Within training, the planning for pretraining and post-training traditionally could be managed differently. Pretraining has fewer, bigger runs so improvements must be slotted in for those few annual runs. Post-training improvements can largely be continuous. These operational differences, on top of the obvious cost differences, also make post-training far more approachable for non-frontier labs (though still extremely hard).

    Both teams have bottlenecks where improvements must be integrated. Scaling the pretraining bottlenecks — i.e. those making the final architecture and data decisions — seems impossible, but scaling teams around data acquisition, evaluation creation, and integrations is very easy. A large proportion of product decisions for AI models can be made irrespective of modeling decisions. Scaling these is also easy.

    Effectively, organizations that fail to produce breakthrough models can do tons of low-level meaningful research, but adding organizational complexity dramatically increases the risk of “not being able to put it together.”

Another failure mode of top-down development, rather than bottom-up information flow, is that leaders can mandate that the team follow a technical decision that is not supported by experiments. Managing so-called “yolo runs” well is a coveted skill, but one held by those closest to the models. Of course, so many techniques still work that mandates don’t have a 100% failure rate, but they set a bad precedent.

    Given the pace of releases and progress, it appears that Anthropic, OpenAI, DeepSeek, Google Gemini, and some others have positive forms of this bottom-up culture with extremely skilled technical leads managing complexity. Google took the longest to get it right with re-orgs, muddled launches (remember Bard), and so on. With the time lag between Meta’s releases, it still seems like they’re trying to find this culture to maximally express their wonderful talent and resources.

    With all of this and off-the-record conversations with leadership at frontier AI labs, I have compiled a list of recommendations for managing AI training teams. This is focused on modeling research and does not encompass the majority of headcount in the leading AI companies.


    Recommendations

    The most effective teams who regularly ship leading models follow many of these principles:

    * The core language modeling teams remain small as AI companies become larger.

* For smaller teams, you can still have everyone in one room — take advantage of this. Personally, I think this is where remote teams can be detrimental. In-person works best here, at least while best practices are evolving so fast.

* Avoid information silos. This goes for both teams and individuals. People need to be able to quickly build on the successes of those around them, and clear communication during consistently rapid progress is tricky.

    * For larger teams, you can scale teams only where co-design isn’t needed. Where interactions aren’t needed there can be organizational distance.

    * An example would be one team focusing on post-training algorithms & approaches while other teams handle model character, model variants for API, etc (specifications and iterations).

    * Another example is that reasoning teams are often separate from other pieces of post-training. This applies only to players that have scaled.

    * Language model deployment is very much like early startup software. You don’t know exactly what users want nor what you can deliver. Embrace the uncertainty and learn quickly.

    * Do not overly try to separate engineering teams from training. Engineering needs to build tools for the generation +1 model and cannot do this without talking to researchers.

* Evergreen research is separate from the language modeling team itself, but still sits within “research”. Otherwise, it will be impossible to prioritize truly long-term ideas. Long-term goals are fragile and need nurturing. Language modeling is about the next 1, or maybe 2, models.

* A lot of the sexy work is not that helpful, and a lot of the useful work isn’t sexy. Data is the prime example, as it is often the most impactful type of work.

    * Expect failed training runs and do not overreact to them along the way.

    Failure modes

    High-priority projects can fail if you…

    * Try to ship too many models for each capability improvement. Instead, stick to a set schedule of model training. Have fewer models that are more capable.

    * Try to force contributions from individual teammates into the final product. Do not sacrifice performance for personalities in search of “a contribution”.

* Let in teams that try to territorially force their way into contributing to the big company goal.

* Scale the training organization too much. Having too many people “doing stuff” and adding noise to the organization detracts from high-level direction and focus on the execution of specific goals. (This also relates to the first point — trying to do too much in one model.)

* Let politics grow — it takes many forms and causes intertwined issues. Do not lose the sense that results are the #1 driving factor of decisions. Bad decisions here compound.

* Over-index on a single model evaluation, which will hamper (or flat-out block) real progress in other areas.

    Before the rest of the post, expanding on the topics above, you may be interested in previous articles on this topic.

    Related writing

    For more reading on how language modeling teams work, see some of my other writing here, on team structure, and…

    ….managing risk.

    An example of how mid-sized training projects work

I recently got a list of questions on how training for Tülu 3 operated (Tülu 3 really being a post-training analog to OLMo). I figured I would share them, and they serve as a foundation for gathering information from friends at frontier labs on how representative our experience is.

    With reasoning models, most of this translates directly. Infrastructure is becoming more important because generating long sequences is particularly memory intensive (and can expose issues in open-source tools for inference), but when the time comes to make a state-of-the-art fully open reasoning recipe, the lessons learned here will apply directly.

    1. How long does a large post-training project take?

    Tülu 3 was the focus of our post-training team from mid-July until its release on November 21st, 2024. We were building on our previous recipes, in Tülu 2/2.5, so not very much of this was catching up on internal know-how, but rather integrating new external resources. If a team like this was working continuously all year on the same focus it would’ve taken approximately one month less to achieve these results. Bootup takes substantial time, as does release management.

    2. How do you choose the right personnel for a moderately sized training project?

A project like Tülu 3, or any other effort to push the frontier of AI in a popular area, normally takes a moderately sized team. The smaller the niche, the smaller the team you need. Among the 20+ authors, the team at Ai2 is researcher-heavy rather than engineer-heavy. If prioritizing only performance on known techniques, the ratio of engineers can be far higher. Pushing the frontier takes 10x the resources of repeating extensively documented work.

In the case of Tülu 3, where most of the techniques were not known, the proportion of researchers is obviously higher. Scoping who to hire for modeling teams is not a trivial problem for companies, though: first, one must scope the level of uncertainty in the domain of interest and then hire around it. Applying Tülu-style approaches could definitely be done with a team of 2-4 focused engineers.

    3. What model sizes are used for iteration? How do results scale?

    A core principle of modeling research is to iterate at the smallest model that provides a reliable signal. This is the entire principle behind scaling laws as a de-risking tool. In post-training, compute costs are substantially lower so the models used actually can be bigger. In this case, given a project designed around the Llama 3.1 base models, ~80% or more of experiments were at the 8B scale (normally 8 or 32 H100s, finishing in

  • Post: https://www.interconnects.ai/p/gemma-3-olmo-2-32b-and-the-growing

Ever since the release of the original ChatGPT, much has been said about making a truly open-source version of it — with data, code, weights, etc., all available. Open-source versions increase transparency, access, long-term progress, security research, and lots more. Lots of people have used this claim to bring hype to their projects, but the substance of these releases has been rather shallow (i.e., often focusing on one evaluation).

    This milestone was so long coming that I entirely forgot about it as a target. Through 2024, and especially before DeepSeek, the impression was that scaling AI capabilities was just too expensive for the smaller players willing to do truly open-source development.

Truly open releases take a lot of effort — there is more to release and maintain — open up potential legal risks that preclude certain types of training data, and completely undermine any competitive moat. The few organizations doing fully open-source research are non-profits, like Ai2 or EleutherAI; academics, like LLM360; or companies that benefit from long-term ecosystem growth, like HuggingFace.

    I was poking through the results for our latest model when I realized that we finally did it! We have a fully open-source GPT-4 class model, i.e., it is comparable with OpenAI's original release rather than the current version.

    Today, we're releasing OLMo 2 32B, the biggest model we've trained from scratch yet. Here are the post-training evaluations, where it surpasses GPT-3.5, GPT-4o-mini, Qwen 2.5 32B Instruct, the recent Mistral Small 24B, and comes close to the Qwen and Llama 70B Instruct models.

    And this recipe is extremely training efficient. Here’s a plot showing the FLOP comparisons to peer base models:
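As a back-of-the-envelope way to read plots like that, the common approximation for dense transformers is training FLOPs ≈ 6 × parameters × tokens; the token count below is purely illustrative, not OLMo 2 32B’s actual number:

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Standard 6*N*D rule of thumb for dense-transformer training compute."""
    return 6 * n_params * n_tokens

# Illustrative only: a 32B-parameter model trained on 6T tokens.
flops = approx_training_flops(32e9, 6e12)
print(f"{flops:.2e} FLOPs")  # ~1.15e+24
```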

Most of this release isn’t entirely new. OLMo 2 is the result of lots of small wins on data, architecture, post-training with the Tülu 3 recipe, and so on — we just let the GPUs hum for a lot longer. You can learn more about OLMo 2 in my original release announcement or in this podcast with the leads.

    The new part of this release is a major milestone where any company can pick up our training stack and cook up exactly the model they need at nearly the GPT 4 level. Beating the latest GPT 3.5 and GPT 4o mini models feels like fair game for the claim. This capability will take time to diffuse, but it is a major moment in the arc of why we do what we do. Even without more progress on OLMo, which we obviously will continue this year, this will keep fundamental AI progress outside of the major AI labs going for multiple years. It’s an optimistic day for open-source.

    Here are your links to more information on OLMo 32B:

    * Blog with technical details and demo

    * Base model: OLMo-2-0325-32B

    * Instruct model: OLMo-2-0325-32B-Instruct and intermediate SFT, OLMo-2-0325-32B-SFT, and DPO checkpoints, OLMo-2-0325-32B-DPO

    * Pretraining dataset: OLMo-mix-1124

    * Mid-training dataset: Dolmino-Mix-1124

    * Post-training datasets: Tülu 3 SFT Mix (updated), Preference data for OLMo 2 32B and RLVR Mix

    Gemma 3 as the next point on a steep trend line

    Yesterday, March 12th, Google released the next batch of their flagship open-weight models, Gemma (report, models, flagship model). They highlight the following capabilities in their documentation:

* Image and text input: Multimodal capabilities let you input images and text to understand and analyze visual data.

    * 128K token context: 16x larger input context for analyzing more data and solving more complex problems.

* Wide language support: Work in your language or expand your AI application's language capabilities with support for over 140 languages.

    * Developer friendly model sizes: Choose a model size (1B, 4B, 12B, 27B) and precision level that works best for your task and compute resources.

    Some technical details of note:

    * In open models, 32B dense models are convenient because they can be finetuned on one node of 8 H100s (slowly). Google's sizing at 27B likely is downstream of TPU considerations that don't map directly, like how knowledge distillation works at pretraining.

    * The Gemma models continue to be trained extensively with teacher-student knowledge distillation (KD). This KD is different than the colloquial definition of distillation in leading AI models. The common use of distillation is training the models on any output of a much stronger model. This is most commonly done in post-training to learn from generated completions of the stronger model. KD is a subset of the general idea of distillation, where the model being trained learns to match the distribution of the teacher model. Other labs than DeepMind have mentioned this KD technique, but Google has pushed it far further. This was discussed further in last summer’s post on synthetic data.
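To make that distinction concrete, the usual KD objective matches the teacher’s full next-token distribution with a KL term, rather than imitating sampled outputs. A generic sketch, not Gemma’s training code:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """Forward KL between the teacher's and student's next-token distributions.

    Both inputs are [batch, seq_len, vocab]; the teacher is run with no gradients.
    Unlike SFT-style distillation, which only imitates sampled teacher outputs,
    this matches the full distribution at every position.
    """
    vocab = student_logits.size(-1)
    s = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)
    t = F.log_softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab)
    # KL(teacher || student), averaged over all token positions.
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature**2
```

The distribution-matching signal is much denser than a single sampled completion, which is part of why it can be used throughout pretraining rather than only in post-training.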

    Otherwise, the paper has some interesting information but nothing super groundbreaking. This is par for the course for most technical reports these days.

    Onto the evaluations, and therein the impact, of Gemma 3.

The best way to think about this model is as a “general chat model” like GPT-4o and Claude 3.7, rather than a reasoning model like R1. The rise of reasoning models has made comparing models tricky because there are multiple evaluation suites that people care about — broadly characterized as a reasoning suite and an instruct suite. They overlap, but strong capabilities on both are rare.

Gemma 3 27B’s performance on some tasks like MATH and Bird-SQL (coding) matches the Gemini 1.5 Pro model from just a few months ago! The progress on small, open-weight models is simply insane. Small models can perform excellently on narrow tasks like math and some coding, but they lack the depth and world knowledge, as seen in the GPQA or SimpleQA scores above.

    Yes, DeepSeek distills are better at smaller sizes on MATH, but not enough people evaluate those distills across all capabilities like ChatBotArena. Having it all in one model is very convenient and is still how most workflows are handled.

    Most people are also fairly skeptical of evaluation scores like MATH stated by Gemma, DeepSeek distills, and the like, claiming they don’t translate to real world usefulness. This is why the ChatBotArena results were the most striking of the Gemma 3 release. Gemma 3 falls in the top 15 of every category. It beats DeepSeek V3 with its 600B+ total parameters. It is outperformed in niche categories like math or coding by its peer models in the overall ranking, indicating a small level of superficial alignment, but doing this to get into the top 10 of ChatBotArena during this period of AI with immense competition is a huge accomplishment.

It is an ever-evolving open question how reliable chat evaluations like ChatBotArena are. These days, with RL training methods that maximize MATH-style evaluations so in vogue, their value is higher again. If they are representative of some subset of real-world use, that would indicate that the specific capabilities small models can excel at — math, general chat, etc. — translate directly into real value.

    This implies that tasks like SimpleQA and GPQA indicate performance on more niche tasks that not many people encounter, but we have a lot to learn as a field here.

    With my use of leading models, I find this hard to believe — switching to something like GPT-4.5 feels like an immediate improvement in chat tasks. My conclusion is that the answer is somewhere in the middle — small open models can do super well on everyday tasks, but we don’t know exactly how to measure it. ChatBotArena and SimpleQA don’t tell us exactly what to expect from the models.

    The fact that there isn’t a cliff in performance with models this small is going to drive substantial business value — it’ll be easier to find exactly the minimum model size needed for your distribution of tasks.

In the specific case of Gemma and most of the leading open-weight models right now (with DeepSeek R1 as a wonderful exception), licenses that restrict downstream use are a brake on open-weight model adoption. Without getting too much into the details, they expose companies looking to use the models to some amount of legal risk and add terms and conditions to finetuned models.

    Largely, until feedback loops and use cases for open models are established, teams releasing these models don’t have strong cases other than future potential growth to counter the safe option that comes from legal teams’ recommendations. I hope that the evolution in the adoption of open weight models for valuable applications applies pressure to make licensing less of a headache.


    The state of the open-closed model gap

    3 of the top 15 models on ChatBotArena are open weights. In a world where frontier labs have many minor model versions crowding the leaderboard, this is an incredible platform for accelerating progress in open model adoption. Even just the gap from Gemma 3 to Google's flagship Gemini models is very small! The entire AI market is the most dynamic and competitive it has been in some time — maybe ever.

When it comes to “truly open,” i.e. actually open-source models, the gap to closed models has remained somewhat consistent — I estimate it at about 18 months behind closed labs. With open models generally on the upswing, open-source access to code, data, etc. is likely to come with it. The likes of Llama, DeepSeek, etc. are some of the most important pieces in the complete open ecosystem, and approaches like Ai2’s would struggle without them.

    Related to this, in the coverage of DeepSeek R1, I noted:

    This points to the first time since Stable Diffusion’s release that the most relevant and discussed AI model is released with a very friendly license. Looking back at the journey “open-source” AI has been on over the last 2.5 years, this is a surprising moment in time marked in the history books.

    A month later, this is still the case.

    To understand the progress of the open-closed model gap, with an oversimplification, there are two simple ways that models tend to get better:

    * Scaling up model size, and

    * Various data, architecture, and algorithm improvements that make an equivalent performing model smaller.

Progress on truly open-source models has always relied on the second of these, due to the relatively smaller budgets of players in the open-source space. By being forced to focus on efficiency rather than scaling, it has often seemed like truly open-source models are falling behind by more than they likely really were. With similar potential to scale, the gap between truly open-source and closed models would be quite small on the standard academic benchmarks on which “raw model potential” is usually evaluated.

    When models like Llama 3.1 405B, DeepSeek V3 and R1, etc. come out, the open weight models scaled up have been at the frontier. With more resources for truly open-source training I don't see many reasons why that wouldn't be the case for models like OLMo too. In some ways, the biggest moat to training is just resource access. For example, DeepSeek is reported to have at least 10x more compute than the likes of Ai2.

This interpretation of the gap is more tractable than other critiques I’ve leveled at the open-source AI community in the past, such as the need for better preference data / RLHF pipelines.

    Comparing Gemma 3 27B to OLMo 32B, the pretraining evaluations for both are super similar, but Gemma 3 scores are way better after post-training. The ceiling on post-training expectations has been shifting extremely fast among open models. Largely, open recipes are motivated by matching existence proofs like Gemma 3 or DeepSeek R1. Throughout 2025, expect open post-training resources to get even better again, motivated by the need for better performance and more detailed workflows on reasoning training.

    All of these subtleties are paving the ground for a better platform for building value with open models. The gap from closed models to open counterparts being its smallest means the opportunity is the best it’s ever been.

The race among inference providers to host models like DeepSeek R1 is a great example of this platform for progress. Even if the why of open models starts as primarily being good for marketing and recruiting, this can trickle down into bigger benefits.

    Open weight models are likely to be offered for inference at far lower margins in the near term, due to the massive competition in the inference space, driving the prices even lower for those looking for ways to adopt them. Here’s an example of some of the leading platforms offering R1 inference:

    This sort of everywhere-inference for an open model is something that has only happened in the past for Llama models.

    Yes, I know APIs are orthogonal to the benefits I just listed, but they’re the way people become familiar and interested in the models. APIs help people check if the model can serve their purpose, and then if it does, they can spin up a more complex workflow.

    “Why open” is more important than “what is open”

    The biggest stories in open-source AI in 2024 often felt like bickering about definitions. I wrote a lot of articles about definitions. Llama 3 was pretty much all we had to get excited about. At the end of the day, even with how much I think it would be better with more information on the whole stack of AI development, open-source is largely going to be defined by community norms. For now, Llama weights have been that norm rather than other definitions.

    By comparison, 2025 feels poised to be about actually building open AI. We have had surprising, impactful, and exciting releases and it’s only March. We know Meta is looking to get back into the conversation with Llama 4 in April at LlamaCon. We have our open-source ChatGPT. We’ll have more we can’t predict.

Crucially, on top of the gap being smaller, all of these open models are crossing meaningful boundaries in performance. When model capabilities made the leap to GPT-4-class models, tons more applications became possible. Now, we have GPT-4-class small models that can be deployed in privacy-conscious ways. There’s been huge demand for this, and the ecosystem is slowly building the tools to do so. Yes, closed AI will continue to march forward, but open solutions need to prove their own independent feasibility.

    In the long march of progress, open-source AI feels far closer to an inflection point of proving out the hypothetical benefits we have focused on for a few years. Transparency, privacy, better performance, etc. could actually all be happening this year.

    For the backlog of content I've written on open-source AI, click here. Thanks to Vikram Sreekanti, Ross Taylor, and Xeophon for feedback that helped shape this post.



    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
• Eugene Vinitsky is a professor at New York University in the Department of Civil and Urban Engineering. He’s one of my original reinforcement learning friends from when we were both doing our Ph.D.s in RL at UC Berkeley circa 2020. Eugene has extensive experience in self-driving, open-endedness, multi-agent reinforcement learning, and self-play with RL. In this conversation we focus on a few key topics:

    * His latest results on self-play for self-driving and what they say about the future of RL,

    * Why self-play is confusing and how it relates to the recent takeoff of RL for language models, and

    * The future of RL in LMs and elsewhere.

This is a conversation where we take the time to distill very cutting-edge research directions down to their core essence. I felt like we were learning in real time what recent developments mean for RL, how RL has different scaling laws than the rest of deep learning, and what is truly salient about self-play.

The main breakthrough we discuss is scaling up self-play techniques for large-scale, simulated reinforcement learning. Scaling RL in simulation had previously become economical in single-agent domains. Now, the door is open to complex, multi-agent scenarios where more diversity is needed to find solutions (in this case, that’s what self-play provides).

Eugene’s Google Scholar | Research Lab | LinkedIn | Twitter | BlueSky | Blog (with some great career advice).

Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.

    Show outline & links

    We cover many papers in this podcast. Also, as an experiment, here’s a Deep Research report on “all the papers that appeared in this podcast transcript.”

    In this episode, we cover:

* Self-play for self-driving, mostly around the recent paper Robust Autonomy Emerges from Self-Play (Cusumano-Towner et al. 2025). The simulator they built powering this is Gigaflow. More discussion on HackerNews. (Here’s another self-play for self-driving paper and another from Eugene from earlier this year.) A few highlights:

“All simulated agents use the same neural net with the same weights, albeit with randomized rewards and conditioning vector to allow them to behave as different types of vehicles with different types of aggressiveness. This is like driving in a world where everyone is different copies of you, but some of your copies are in a rush while others are patient. This allows backprop to optimize for a sort of global utility across the entire population.”

    “The resulting policy simulates agents that are human-like, even though the system has never seen humans drive.”

    * Large Language Models are In-context Preference Learners — how language models can come up with reward functions that will be applied to RL training directly. Related work from Stanford.

* Related literature from Interconnects! The first includes literature we mention on learning locomotion for quadrupeds with deep RL (special shoutout as usual to Marco Hutter’s group).

* Recent and relevant papers: Value-based RL Scales Predictably and Magnetic control of tokamak plasmas through deep reinforcement learning.

    * Other things we mention:

* Cruise, Tesla, and Waymo’s autonomy stacks (speculation) and how the self-driving industry has changed since we were working in it / were considering working in it.

    * Evo 2 foundation model for biology.

    * Eugene is working with a new startup on some LLM and RL stuff. If you’re interested in this episode, ping [email protected]. Not a paid promotion.

    Chapters

    * 00:00:00 Introduction & RL Fundamentals

    * 00:11:27 Self‑Play for Self‑Driving Cars

    * 00:31:57 RL Scaling in Robotics and Other Domains

    * 00:44:23 Language Models and In-Context Preference Learning

    * 00:55:31 Future of RL and Grad School Advice

    Transcript

I attempted to generate this transcript with ElevenLabs’ new Scribe tool, but found the formatting annoying and reverted back to Alessio’s smol-podcaster. If you’re interested in working part-time as an editorial aide to Interconnects, please get in touch.

    Nathan Lambert [00:01:27]: Hey, Eugene. Welcome to the show.

    Eugene Vinitsky [00:01:29]: Hey, Nathan. Thanks for having me. Excited to be here.

    Nathan Lambert [00:01:32]: Yeah, so I'll have said this in the intro as well, but we definitely go well back in all the way to Berkeley days and RL days, I think.

I will embarrass you a little bit now on the live read, which is like, you were one of the people, when I was switching into RL, who seemed to have figured out how to get into AI from a different background, and that's what I was trying to do in 2017 and 2018.

    So that was kind of fun, and now we're just friends, which is good.

    Eugene Vinitsky [00:02:01]: Yeah, we were both figuring out. If I had any lead over you there, I was also frantically trying to figure it out, because I was coming from a weird background.

    Nathan Lambert [00:02:11]: There are definitely a lot of people that do that now and over-attribute small time deltas to big strategic plans, which was probably what it was.

    And we're just going to do some of our normal conversations on RL and self-play.

I think the backstory of this is you told me that your recent paper from some of your time at Apple, I don't want to time it too specifically, was something, paraphrasing, like the most exciting RL thing you've ever been a part of.

    And major RL projects are not that frequent.

    I think if you segment out all the language model excitement in the past 10 years, there's really a few major milestones, and it's good to kind of talk about them.

    So we can kind of start, I think, basic things, like how do you define reinforcement learning, and it will kind of build up to this self-driving project.

    Eugene Vinitsky [00:03:05]: Yeah, so I think RL is kind of a big thing, but at a really basic level, you have this process of taking actions in the world.

    You're seeing the state of the world.

    If you're taking actions in the world, you sometimes receive a reward that tells you the value of that action, and you're trying to kind of optimize your cumulative behavior over time.

    So that, you know, over long trajectories, you're optimizing those costs.

    That's both, you know, the hard thing and the exciting thing is that if you do it well, you can really optimize really long horizon behaviors.
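As an aside for readers newer to RL, Eugene's definition maps directly onto the standard agent-environment loop and the discounted return it optimizes. Here is a minimal sketch with a toy stand-in environment; nothing here is from the paper, it is just the textbook loop.

```python
# Minimal sketch of the loop Eugene describes: observe state, act, receive
# reward, and optimize the discounted sum of rewards over a long trajectory.
# The environment here is a toy stand-in, not any particular simulator.
import random

def toy_env_step(state, action):
    """Hypothetical dynamics: reward the agent for moving toward state 0."""
    next_state = state + (-1 if action == "left" else 1)
    reward = 1.0 if abs(next_state) < abs(state) else -1.0
    done = next_state == 0
    return next_state, reward, done

def rollout(policy, gamma=0.99, max_steps=100):
    state, rewards = random.randint(-10, 10), []
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = toy_env_step(state, action)
        rewards.append(reward)
        if done:
            break
    # Discounted return G = sum_t gamma^t * r_t, the quantity RL maximizes.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(rollout(lambda s: "left" if s > 0 else "right"))
```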

    Nathan Lambert [00:03:41]: Yeah, I agree.

    And it's funny because now it's finally, the language models are finally doing this long chain of thought, and I don't really think that's the same.

    I think the interactive notion will come up a lot here where these long context behaviors are many, many actions interacting with the world relative to one really, really long action, which is kind of odd.

    Eugene Vinitsky [00:04:04]: Yeah, I guess, yeah, it mixes things, right?

    Because it has very long state, right?

    It's got very long contexts, and it's generating its own context.

    But in the end, there's really one action at the end that, like, kind of determines how everything went, you know?

    Nathan Lambert [00:04:23]: Yeah, yeah, yeah, we'll get into this.

    And then the next thing that we kind of need to set up is what do you define self-play as?

    I think this word has been particularly broken in recent times with language models, and I'm hoping we can get a fairly specific criteria for what is self-play and what are related topics.

    Eugene Vinitsky [00:04:42]: Yeah, I think even within the field, there's quite a bit of debate as to what constitutes self-play.

So talking to, you know, experts, people will disagree about what methods are and aren't self-play.

    But what I will say is I generally define self-play as anything where an agent plays a copy of itself.

    So up to a bunch of different agents interacting with each other, as long as they're mostly, in some ways, copies of each other, we're doing self-play.

    Nathan Lambert [00:05:12]: Yeah, and then do you think anything, I mean, your background's in multi-agent as well.

    Do you think there is something fundamental to kind of a game that has a really specific hill to climb where it's kind of this competitive nature versus something like language?

    Eugene Vinitsky [00:05:29]: Yeah, this is kind of the dream of, I think, some multi-agent researchers is this type of like ratchet effect where you have a bunch of agents interacting with each other and kind of increasing complexity on the part of any agent generates increasing, like creates new challenges that need to be solved and then force you to learn new skills.

    And then you kind of get this endless, endless ratchet.

    Maybe that's what you meant.

    I may have misinterpreted.

    Nathan Lambert [00:05:55]: We're going to revisit it.

    I think also it's like, how does the multi-agent nature of a lot of these things change what people think about with RL?

    This is kind of the last building block before we go into the self-driving stuff.

    Eugene Vinitsky [00:06:07]: Yeah, yeah, yeah.

    So the way that the multi-agent thing changes things is it makes everything much harder and more interesting.

    So you go away from this world where you have like a clear score function, right?

So first, in the single-agent setting, you have some reward.

    If that reward is high, you're doing well, right?

    And when you move into the multi-agent setting, it becomes reward with respect to whom, right?

    It all of a sudden matters whom I'm playing, right?

    So if we go to a game of like, like one setting is like two players, zero sum games, right?

    So a game of two player poker, I give you, I train a poker bot, right?

    How do I know it's any good?

    I have to play another poker bot to decide that it's any good, right?

    And so all of a sudden, this challenge of like, what is a good policy becomes very fundamental.

    And you kind of lose even a notion of there being like one clear good policy.

And like the whole, a lot of, a lot of the field of multi-agent is coming up with different definitions of what constitutes goodness.

    Nathan Lambert [00:07:06]: Um, so, and then back to the self-play thing with that, like, is all of the self-play that we discussed, like if you were playing yourself, does the same consideration apply?

    Like, is that, is self-play necessarily a multi-agent framing?

    Eugene Vinitsky [00:07:19]: Um, I think it, I think it is because oftentimes what we're trying to do with self-play is like to converge to some notion of policy goodness.

    And self-play is just a mechanism that gets us to some definition of, of high quality policies.

    Um, and, and, and what turns out to be the case is there, there are actually many like non-self-play type methods for doing this.

    Self-play just turns out to be an effective way to accomplish constructing effective policies.

    Nathan Lambert [00:07:45]: Yeah, I, I, there's many, I'll, I'll link later a lot of these papers on self-play for preference learning and look into them a bit more.

    Eugene Vinitsky [00:07:56]: Yeah.

    Nathan Lambert [00:07:57]: Essentially that's been the lens.

    There's two lenses by which this has come back and both of them, I don't think fit into, I, I think this multi-agent lens of self-play is much richer and I don't think any of them have fulfilled this.

    I think there's useful methods for preference tuning.

I think that's like maybe SPIN, self-play something, preference learning, is one.

    And there's papers related to this where they're probably looking at the probability of the own model in generating a response or something like looking at the internals of the model.

    And it's not really set up in this game nature of some sort.

And then also with Q*, when the self-play stuff came back, where I really think, I've talked to some people that did the original reporting on this, and it was that the model looked like it was talking to itself.

    And I think that very understandably for less, a little bit less technical audiences that haven't engaged with self-play, that coverage of talking to itself got transformed into a self-play commentary and hype cycle, which took people down the wrong path for like an entire year, which is so brutal, but also very understandable and funny.

    Eugene Vinitsky [00:09:11]: Yeah, I think there's something interesting and different happening in these like multi-agent like LLM self-play setups.

    I'm not super familiar, but I think what's happening is something quite different than what we mean in other multi-agent settings when we're talking about self-play.

    Like I feel like it's, it's more about like refining like the distribution of actions that it takes in some, some kind of odd way.

    Nathan Lambert [00:09:39]: I think this sounds ridiculous at first pass, but it's almost that the language models are simulating a softer version of self-play within themselves to kind of check their own work and engage in their own discourse, which the level of intelligence they have is not going to like unlock the true like incremental progress that we think of with self-play.

Which probably, I think for context, the things for self-play that have been very impactful, just to put them on the record, are things like AlphaGo and MuZero.

    I think that's, those are the prime examples of generating some superhuman policy in a closer way.

    I think it's, it's important to kind of gate the conversation on like, these are the aspirational goals, um, in terms of outcomes and then figuring out how to apply them to new domains and new tools is kind of unknown.

    Eugene Vinitsky [00:10:31]: So, so maybe I should have said this earlier, but like self-play is the thing that gives a, is like maybe the one way that we know to build superhuman agents right now.

    So, right.

So, um, superhuman Go, um, human-level Dota, human-level, uh, StarCraft.

Um, technically poker is in a, in a slightly weirder, um, weirder space where I don't, I don't exactly know that I would call the methods that underlie that self-play.

    Um, sorry.

    Um, and, uh, but yeah, it's the one way we really know how to build superhuman agents.

    Nathan Lambert [00:11:06]: And I think this is a kind of a natural transition because the, to make people excited in the work that you did, it seems like you've discovered superhuman driving through self-play without inductive biases.

    And I'm like, um, how do you view the potential impact of this?

    And then we can kind of go into the method.

    Eugene Vinitsky [00:11:27]: Right.

So the, the challenge with self-play is, and this requires a bit of technical detail to get there, but you know, in, in like two-player, zero-sum games, games where you and an adversary are playing with each other and somebody wins and somebody loses, there's a very well defined notion of what being good is.

Um, you know, there are, there are well-defined criteria that we would like our policies to converge to.

    And, and the challenge has always been about moving beyond that to a domain where it's much harder to define what, what doing well means, right?

    There isn't like an abstract notion of what good driving is there out in the world where I could just write down the reward function and simulate it and optimize with respect to that.

    And all of a sudden I'd have a good driving policy.

    So the, the gap has always been between these methods that work really, really well in, in well-defined games like, like Starcraft or go, uh, and chess, um, and settings where it's much harder to define that.

    And so we haven't been able to, to move to self-play in settings where, for example, humans might be in the loop, right.

    And, and particularly driving is an instance of that somewhere where at the end, we're going to take our policy and it's going to drive with humans and we have no way to simulate humans and play against them.

    Um, and so figuring out how to close that gap has been kind of an open, open challenge.

    And I think maybe this is the first instance of, uh, finding a way to do that.

    Nathan Lambert [00:12:51]: Okay.

    So that's a much better motivation than I gave.

    And I understand the excitement now, because if this works in one domain, um, and you'll tell us about how grand of an effort it actually was.

    I know big tech companies can put a lot of force and long-term investment behind things to get them off the ground.

    Then a lot of the other things that people are saying about language models or other complicated domains are at least there's an existence proof of something similar happening.

    So why don't you just continue to explain, uh, this problem set up of learning driving without having a human teacher.

    It will probably take detours to analogize different self-driving stacks just because we know about them and it's good to compare.

    Eugene Vinitsky [00:13:36]: So one way of framing this is, and I'm going to put cautions in the end, I'm going to give you the, the, the extreme version of it.

And then I'm going to walk it back a little bit: it's like human-level driving without any human data.

    And the caution needs to be that this is in simulation and our ability to measure human level driving in simulation is limited in a lot of ways.

    So I can tell you about the ways that we measured it and then I'll, I'll have to tell you what the limitations of those things are.

Um, so this was a large-scale effort, um, uh, in Vladlen Koltun's team at Apple. Um, it was about like eight researchers and research engineers working together for about a year and a half, uh, building the stack out.

    Um, it was, I think a lot of us came at it from different places.

I know some folks were very inspired by this idea of like AlphaStar for driving, you know, building a diverse, rich world and then driving in it in a way such that you would transfer to policies that you hadn't seen before.

    So like human actors.

    Um, so, um, yeah, the, the, if, if, if it's helpful that the idea here is that, or the goal here was to build a human level simulated driver.

    Um, and here, what that means in our case is not a fully end-to-end method, right?

    So we're not simulating perception.

So driving stacks consist of, like, generally perception, prediction, planning, and control.

So you have a perception stack that, you know, takes your LIDAR, your camera, your radar, and converts it into, you know, where the cars are, where the road is, what's impassable.

    Um, and then a prediction stack will take the like positions of all the cars, the cyclists, pedestrians, and it'll predict, predict where they're going to go next.

    And then a planning stack will say, okay, given those predictions, you know, what's a good trajectory for me to take.

And then the control stack will say how to actually follow that trajectory safely and robustly.

    Right.

    And we're talking about subsuming the prediction, planning, control portion of the stack, not the perception part of the stack.

    Nathan Lambert [00:15:28]: Okay.

    So I was, I was thinking that you might not even do control.

I was thinking you might just say, uh, control is a solved problem and not do that too.

    Eugene Vinitsky [00:15:35]: So in the same way, we're kind of, we're only kind of doing control.

    Uh, we're, we're, we're doing this for, I think Waymo uses the

    Nathan Lambert [00:15:42]: the term behavior for this.

    I think it's been their behavior team for a while.

    Is that right?

    Eugene Vinitsky [00:15:46]: Okay.

    Nathan Lambert [00:15:47]: Uh, you know, I very, it's hard to know where the abstraction ends, but they definitely have a behavior team that's done a lot of things through the years.

Well, at least from the job apps that I've been applying to and interviewed for in the past.

    Yeah, me too.

    Eugene Vinitsky [00:16:01]: Um, I think we do know how to control cars.

    We know how to make cars follow a pre-specified trajectory, right?

    This is, this is somewhat of an easier problem than like humanoid robotics or something.

    You know, big thing got wheels.

    We know how to make it turn.

    Nathan Lambert [00:16:14]: Um, so how do we get these things from, I mean, they start as like, it doesn't start at just all the simulated cars crashing all the time.

    What is the start here?

    Eugene Vinitsky [00:16:24]: I'll send you the video once it's out, but like, you know, the, the first 10 hours of simulation is just like cars scattered all across the road, smashing into each other, driving off the road, that type of thing.

    It's actually interestingly useful because what we do is when two cars crash, we have them come to an immediate stop.

    And this actually creates a lot of blockades in the road.

    So at some point during the training, the cars start to learn to drive around stopped cars, even though those cars are stopped because they've crashed, um, as well as to drive around like obstacles and things like that.

    Um, so that, yeah, that's what it looks like.

    Um, yeah.

    Nathan Lambert [00:16:58]: Um, as well as the reward function for these.

    So you have a bunch of cars that can see their peers and there's some reward function I'm guessing.

    Eugene Vinitsky [00:17:06]: So the, the major component of the reward function is getting to your goal without colliding.

So we, we have these maps that we've taken from the CARLA simulator.

    They're fairly large maps.

    Some of them are like multiple kilometers in spatial extent.

    We have eight of them and we place goals randomly over the map.

    Um, and you get a sequence of goals.

    So, you know, that like, okay, I want to get to this point.

    And then after that, I'm going to want to get to this next point.

    After that, you're going to get a big reward for getting to that goal.

    You're going to get some amount of penalty for colliding.

    And then there's also an implicit penalty because if you collide, you're not ever going to get to your goal.

    And then there, there is some amount of hand design here in that there are small rewards for like staying in your lane and being aligned with your lane and like, you know, not driving in the opposite direction in the wrong lane.

    Nathan Lambert [00:17:51]: This was one of the questions is if you had to do this sort of thing.

    Eugene Vinitsky [00:17:54]: You have to do that.

    But one interesting thing, and maybe we could talk about that at some point is we randomize the weights of those rewards.

    So there are agents that like really want to drive in the lane going in the right direction.

    And there are agents that don't care about that at all.

    And they will take the wrong lane on the highway, uh, you know, going at full speed in the opposite direction.

    And that's kind of useful because you're ready for that scenario.

    You've seen that scenario in the world when you're driving around.

    Right.

    Um, but yeah, we have to, we have to do some of that stuff because at some point there are laws and you can't avoid encoding the laws into your system.

    You know, stop signs are a human concept.

    Um, they're, they're not, you know, it's not going to emerge that you see a red thing and you're like, oh yeah, that means I should stop.

    And then I should like give the right of way rules to the other cars.

    Um, but all of our rewards are kind of soft in the sense, like, you know, if you're at a stop sign and folks have been preventing you from going for a very long period of time, right.

You're going to start to nudge in and, like, break the rules about right of way.
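To paraphrase the reward structure Eugene describes in something runnable, here is a hedged sketch: a goal bonus, a collision penalty, and small lane-keeping shaping terms whose weights are randomized per agent. The exact terms and ranges are my own illustration, not the paper's implementation.

```python
# Paraphrased sketch of the reward structure described above, not the paper's
# actual code: a large goal bonus, a collision penalty, and small shaping
# terms whose weights are randomized per agent so the population contains
# both rule-followers and rule-breakers.
import random

def sample_reward_weights():
    return {
        "goal": 1.0,                             # reaching the current waypoint
        "collision": -random.uniform(0.5, 2.0),  # hypothetical ranges
        "in_lane": random.uniform(0.0, 0.1),     # some agents barely care about lanes
        "lane_aligned": random.uniform(0.0, 0.1),
    }

def reward(weights, reached_goal, collided, in_lane, lane_aligned):
    r = 0.0
    r += weights["goal"] * float(reached_goal)
    r += weights["collision"] * float(collided)
    r += weights["in_lane"] * float(in_lane)
    r += weights["lane_aligned"] * float(lane_aligned)
    return r

w = sample_reward_weights()
print(reward(w, reached_goal=False, collided=False, in_lane=True, lane_aligned=True))
```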

    Nathan Lambert [00:18:55]: One of my questions for later on this is like, do you think our vehicles and driving dynamics and infrastructure kind of constrain the way of driving?

Like we've co-designed human driving and our infrastructure so that human driving is actually no longer that special, because the track is so defined.

    Eugene Vinitsky [00:19:11]: I think this is, this is part of why this is all going to work or like why it works is because like human, human driving is, and human behavior in many domains is like fairly constrained by the institutions and the laws and the norms that we design.

Uh, it's not super free-form. So, like, driving amongst humans is much more of a constrained problem than you would think. It's also unconstrained in some interesting ways, but it's, it's quite constrained.

    Nathan Lambert [00:19:42]: And how hard to act was this to actually learn?

    So how sensitive of a process is it now?

I think in the paper, you're talking about Gigaflow, which is like a high-speed simulation engine.

    So obviously, you know, on data, the final paper says that it learns in 1.6 billion kilometers of driving.

    I was wondering if you had an intuition for that.

So like how many miles are driven by all the cars in San Francisco in a day or something like this?

    Eugene Vinitsky [00:20:10]: That's a, that's a great question.

    Nathan Lambert [00:20:12]: Um, it could be a good chat GPT query, to be honest.

    Eugene Vinitsky [00:20:16]: This might be a chat GPT question.

    Um, let me, let me give some, some numbers that I do know.

    Uh, and this is kind of maybe helpful.

    So I think cars crash every 20,000 to a hundred thousand miles and a fatal collision happens every a hundred million miles, something like that.

    Um, but how many miles are driven in a day in a city?

    I'm not sure.

    1.6 billion kilometers, the distance between here and Saturn.

    Um, it sounds like kind of far when you put it that way, but there are a lot of cars.

    Yeah, there are a lot of cars, right?

    There are a lot of drivers.

    Um, there are surprisingly few trips in a city, fewer than you would expect, but, um, I'm struggling to put a number on it.

    Nathan Lambert [00:21:01]: Um, I'll tell you what chat GPT gets when it's done.

I was thinking it's o3-mini-high.

    This is not a reliable number.

    Take this time.

So your intuition that it's lower goes a long way.

    I mean, you've thought about a lot of these car systems for a very long time and I will link to some of your other work on this.

    So you definitely have better intuitions than I would.

    Eugene Vinitsky [00:21:20]: Well, the intuition comes with the fact that like a lane of the highway can take 2000 vehicles per hour, which is like just not that many vehicles.

Um, and you know, most, most of traffic is between, like, you know, 7am and 10am and then on the way back home.

    And so, you know, you can like kind of estimate based on how many lanes there are on the main highway, how many trips there are.

    Nathan Lambert [00:21:43]: So San Francisco, the chat Oh three mini high estimated four to 5 million miles in a day in San Francisco.

    It's a bully.

    It's a plausible number, but it's well below what you are doing.

    Like this is, I think maybe globally this billion kilometers could be hit.

    So this is okay.

    Eugene Vinitsky [00:22:03]: Here's one way to think of it.

We simulate 10,000 years of human driving.

    Nathan Lambert [00:22:08]: Okay.

    So yeah, 10,000 per one.

    I guess it depends on how many cars you have in parallel.

Eugene Vinitsky [00:22:14]: Per one training run. To get the policy that we get, we simulate about 10,000 years of human driving.

    Nathan Lambert [00:22:20]: Yeah.

    Eugene Vinitsky [00:22:21]: Yeah.

    Nathan Lambert [00:22:22]: So to have 10,000 cars, it's all of them driving for a year.

    Eugene Vinitsky [00:22:26]: Yeah, exactly.

    And we have about like a million cars driving at any given time in the simulator.
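As a rough sanity check on these numbers (all approximate, and the concurrent car count is Eugene's ballpark), the figures in the conversation hang together:

```python
# Back-of-envelope check on the scale numbers mentioned here (all approximate).
KM_TOTAL = 1.6e9           # total simulated driving, in kilometers
YEARS_OF_DRIVING = 10_000  # total simulated driving time, in car-years
CARS_IN_PARALLEL = 1e6     # cars driving at any given time in the simulator

hours = YEARS_OF_DRIVING * 365 * 24
print(f"average speed ~ {KM_TOTAL / hours:.0f} km/h")  # ~18 km/h, plausible for mixed urban driving
print(f"simulated days per concurrent car ~ {YEARS_OF_DRIVING * 365 / CARS_IN_PARALLEL:.1f}")  # ~3.7
```

The second number lines up with the "two to three days of training" figure that comes up later in the conversation.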

    Nathan Lambert [00:22:34]: Do you think that substantially changes the learning dynamics?

    Like are they all, how many cars are any of them interacting with at any one time?

    Eugene Vinitsky [00:22:40]: Yeah.

    Any given simulator in any given world.

So this is, like, kind of like an Isaac Gym-style vectorized simulator.

    So it all runs in the GPU.

    So it's a bunch of worlds happening in parallel, but any given world, there are about 150 cars in it, which means that sometimes you're driving in sparse traffic and sometimes you're going to interact with like 10 or 20 cars at any given time.

    Um, and I, I think one thing is that one, one cool thing is that at that scale, I think RL becomes very, very stable.

    Um, like for us, like every training run succeeds, the reward curves go straight up.

You know, there's no like, um...

    Nathan Lambert [00:23:19]: What are you scaling? Are you just like scaling batch size effectively?

    Uh, what is, yeah.

    What is the actual thing you're, they're scaling?

    Eugene Vinitsky [00:23:26]: We're scaling the amount of experience generated.

    So it's like a trillion samples of, of total experience, um, that, that the agents train on.

Um, and then, yeah, we use gigantic batch sizes, like, you know, um.

    Nathan Lambert [00:23:43]: But like, what is the thing that you need to dial up in order to make learning actually happen?

    Eugene Vinitsky [00:23:47]: Uh, total amount of experience generated, right?

So you need to be generating, you know, a million samples per second to train on, type of thing.
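For readers who have not seen Isaac Gym-style training, the throughput comes from batching: keep every world's state in one big array and step them all at once. This is not Gigaflow itself, just a toy illustration of the pattern with made-up dynamics and sizes.

```python
# Not Gigaflow: just a sketch of the general pattern. Keep the state of many
# worlds in one array and step them all at once, so sample throughput comes
# from batching rather than from fast single-world physics.
import numpy as np

NUM_WORLDS, CARS_PER_WORLD, STATE_DIM = 4096, 150, 8
states = np.zeros((NUM_WORLDS, CARS_PER_WORLD, STATE_DIM), dtype=np.float32)

def step_all_worlds(states, actions, dt=0.1):
    """One vectorized step across every world and every car simultaneously."""
    # Hypothetical dynamics: positions (dims 0:2) move by velocity commands.
    next_states = states.copy()
    next_states[..., 0:2] += dt * actions
    rewards = -np.linalg.norm(next_states[..., 0:2], axis=-1)  # toy goal-at-origin reward
    return next_states, rewards

actions = np.random.randn(NUM_WORLDS, CARS_PER_WORLD, 2).astype(np.float32)
states, rewards = step_all_worlds(states, actions)
print(states.shape, rewards.shape)  # (4096, 150, 8) (4096, 150): ~600k samples per step
```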

    Nathan Lambert [00:23:57]: Okay.

And then what is the actual, I guess I don't know a ton about multi-agent RL, but what is the actual RL algorithm, and is it a giant replay buffer that is just building and building and building?

    Eugene Vinitsky [00:24:08]: It's PPO.

    Uh, you know, one thing we've been seeing throughout our work pretty continually is that for, for both theoretical and empirical reasons, PPO is actually a really good multi-agent RL algorithm.

    Nathan Lambert [00:24:20]: You had the paper, are you, you are on the paper years ago.

    That's like on the something, something PPO multi-agent simple.

    Eugene Vinitsky [00:24:29]: So we know that PPO works empirically pretty well.

    That's basically the title of the paper.

That's it: PPO, simple, good, multi-agent, cooperative.

    Uh, it's good in cooperative problems.

It's, it turns out to be pretty good in two-player, zero-sum games.

And, and here in, um, this driving thing, it's what's called a general-sum game.

    And, and there, you know, it seems to work in the setting too.

    So evidence is accumulating.
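For reference, the objective at the core of PPO that keeps coming up here is the clipped surrogate loss, which applies unchanged whether the batch comes from one agent or from every copy of a shared policy. A generic sketch follows; tensor names and shapes are illustrative, not the paper's training code.

```python
# The clipped surrogate loss at the core of PPO (one policy update step shown).
# Generic objective only; names and shapes are illustrative.
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO-clip policy loss over a batch of (state, action) samples."""
    ratio = torch.exp(logp_new - logp_old)                        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # maximize surrogate, so minimize the negative

# Tiny usage example with fake data standing in for rollout statistics.
logp_old = torch.randn(1024)
logp_new = logp_old + 0.05 * torch.randn(1024)
advantages = torch.randn(1024)
print(ppo_clip_loss(logp_new, logp_old, advantages))
```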

    Nathan Lambert [00:24:51]: Something that people probably don't know about multi-agent RL and maybe I don't know either, but in this paper, all of the cars were using the same actual weights of the model.

    Is that standard in multi-agent RL or is it kind of a variable?

    Eugene Vinitsky [00:25:04]: So I'll add one little, uh, subtlety here.

So yes, we're using, every policy is a copy of the same agent, right?

    They're all looking at their local observations.

    So it's decentralized, but it's all one copy, but every agent gets its own like conditioning vector.

    That's like, what are my like reward weights?

    How big of a, you know, what's my width and my length?

    Am I a cyclist?

    Am I a pedestrian?

    Am I a driver?

    And they flexibly adjust their behavior based on that condition.
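A hedged sketch of what this decentralized, shared-weights setup with per-agent conditioning can look like; the field names and sizes are illustrative, not taken from the paper.

```python
# Sketch of the decentralized shared-policy setup described here: every agent
# runs the same network, but each gets its own conditioning vector (reward
# weights, vehicle dimensions, agent-type one-hot). Sizes are illustrative.
import torch
import torch.nn as nn

OBS_DIM, COND_DIM, ACTION_DIM = 64, 8, 2

class ConditionedPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + COND_DIM, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
            nn.Linear(256, ACTION_DIM),
        )

    def forward(self, obs, cond):
        # The conditioning vector is just extra input features; "sub-policies"
        # emerge because behavior is trained to depend on it.
        return self.net(torch.cat([obs, cond], dim=-1))

policy = ConditionedPolicy()
obs = torch.randn(150, OBS_DIM)    # 150 agents in one world, all sharing these weights
cond = torch.randn(150, COND_DIM)  # per-agent reward weights, size, type, etc.
actions = policy(obs, cond)
print(actions.shape)               # torch.Size([150, 2])
```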

    Nathan Lambert [00:25:29]: Do you think that's actually like, if you were to squint at the system, is that actually changing the policy or is it changing the environment in kind of an indirect way?

    Eugene Vinitsky [00:25:38]: It's, it's changing the policy.

    Like you'll see that like a car is like, oh, I'm a, I'm a, like a pedestrian.

    I'm a, I'm a big truck.

    I'm going to do like a K point turn to turn around.

    Uh, I'm a pedestrian.

    I'm, you know, going to like smoothly wiggle through these small boxes of areas that I couldn't get through.

    Otherwise it, it, it really, uh, appreciably changes the policy, which is cool because it's this like tiny 3 million parameter neural network or like 6 million parameter.

    Um, and, and so like, there are all these like little sub policies inside of it that you can activate by, by conditioning.

    Nathan Lambert [00:26:11]: Can you do it, um, post hoc to change the behavior in an interpretable way?

    Eugene Vinitsky [00:26:16]: Um, I don't know about interpretable.

    I guess it, it sometimes depends what we mean when we say interpretable, but yeah.

    So you can be like, look, okay, you, you, you don't care about staying in your lane and you'll see it start going into the other lane and driving.

    You know, you change the size of the policy or like the, the car and it will change the trajectories that it takes in response.

    Um, it's, it's very responsive to this condition.

    Um, we have some cool graphs in the paper pointing, pointing out all the different things you can make it do by changing these, these values.

    Nathan Lambert [00:26:46]: Um, I'm trying to think of how this reflects on the nature of driving and what the downstream use of this tool is.

    So you showed that this is doable and what does this, like, what does this mean for self-driving specifically?

    Like, what would you do if you had the same big team and you know that this exists and you're interested in self-driving as a field?

    I mean, there are obviously a lot of people that a lot of companies that have big teams and lots of money to try to think about self-driving.

    Eugene Vinitsky [00:27:14]: So as I said earlier, like there's this like, um, perception, prediction, planning, control stack.

    And I think this is a really is providing a lot of evidence that you could maybe subsume the prediction and the planning stack, um, and, and put it into this type of like end-to-end policy that you could then like train in sim and then maybe not zero shot deploy onto the roadway.

Just like take it straight from sim, put it onto the roadway, though.

    I think like maybe possible, uh, but like really give you this like base policy that you could then start to put on the roadway and start to build this flywheel, um, that you can then use to collect, you know, more and more experience, validate the safety.

    You know, like if you're, you know, if you're a, um, uh, automotive manufacturer that doesn't have like a full spun up self-driving team, but you have a pretty good perception stack, like this is something that you can use to just like get something out in the world pretty fast.

Cause like, I think like two, three days of training later, you have something that I think, and we'd like to start testing it, uh, can be like straight up put onto the roadway with humans driving around, and things will be like pretty okay.

Um, you know, don't take the safety driver out, but like, yeah.

    Nathan Lambert [00:28:24]: And you have some cred saying this, given that you've done RL experiments with real cars. This is not something that's, um, ripping off the bandaid for the first time.

    You've done different types of self-driving experiments with RL policies in the real world.

    I don't, it might not be at the same level of the stack, but I can add links to that.

    Eugene Vinitsky [00:28:42]: That was a lot more constrained, right?

    We were putting these cars on the highway to like smooth traffic.

    So they would drive in a way such that like stop and go waves and traffic would like get smoothed out and disappear.

    Um, but there it was just like, you know, stay in your lane, follow the car behind you here.

    We're talking about like, you know, complicated interactions at intersections and that type of thing.

    So a lot, a lot more like safe, everything there is safety critical, but like significantly less constrained than anything we've done in the past.

    Nathan Lambert [00:29:08]: And then to kind of keep leading this on, uh, I will say a bunch of things because you're more of an industry insider.

    So it makes it less revealing if I say things, cause I don't really know anything.

    Um, back when I was interviewing for a job and around 2021, at least a lot of RL people were interviewing with self-driving companies who were doing extensive research in RL for different parts of this behavior stack.

Um, even at that time, four years ago, sensing and prediction, like perception, seemed largely solved.

    At least CV stacks are really mature and figuring out the actual driving component and decision making was really hard.

I mean, I did a Tesla take-home for their self-driving team, and they were hiring other RL people. That take-home was ridiculous.

    Eugene Vinitsky [00:29:54]: I was like, yeah, I remember that.

    Nathan Lambert [00:29:56]: Freaking intersection of polygons.

    It's four years ago.

    They've got to be using a different question, but it was so hard.

    Um, I did end up solving the test cases.

    Um, it was, I solved the test cases.

    God, that was rough.

But essentially the rumor was they're doing something like MuZero for self-driving, and/or a mix of imitation learning. There's a duality between learning a world model from real data and building a simulator.

    But the motivation of the work is very similar: in MuZero, you want to unroll trajectories and be able to learn from that and distill an RL policy, versus if you have a big simulator, you can learn everything from scratch and figure out how to transfer that to real.

    And I think there's different assumptions on what would work.

And in the history of RL, it is now that sim-to-real is generally the more promising path, if you can build the right simulator, rather than going from real and enhancing real with, with RL alone.

    Um, Cruise was building a research team.

And one of the best engineers I talked to was trying to build a world model, or like a simulator, and do this like AlphaGo for self-driving.

    I think that was a phrase from the interviews four years ago.

    So a lot of this, and Waymo is now obviously winning.

    I think Waymo, I don't know exactly what they're doing.

    I think their stack is actually probably the most complicated, um, where they probably were looking at behavior, like all sorts of RL inspired things for very specific parts of the stack to, to improve behavior.

    But it's funny that looking back four years ago, this was something that the spectrum of ideas that industry was looking at was actually very related to this.

    And in the same time, the self-driving industry has changed a lot.

    Uh, so what do you think of this whole industry of self-driving relative to, you have a lot of experience here.

    I mean, I'm, I'm a big Waymo fan now, but there's just like, it's so funny how these things evolve.

    And I think after this, later on, we'll talk about the, like, this is the RL specific trajectory with simulation, simulated results and stuff too.

    Eugene Vinitsky [00:31:57]: I mean, we were interviewing at the same time.

    So I was also interviewing with all of these self-driving companies when you were, uh, and, and it, it did seem like it was the place that was the most friendly to doing RL type research at the time.

    Um, I think now almost everyone has gone all in on this like imitation learning type approach, um, that are like, this is a huge fraction of what people are doing.

    I think a lot of the RL teams have been spun down, uh, which I think is unfortunate a little bit because I think what this work shows is that, uh, it may be wrong to do so that there is a lot of, a lot of value still in RL for this last piece of, of the, of the puzzle.

    Um, you know, um, you know, one thing we have here is, uh, an insanely robust policy, right?

    So like just an end to end neural network in SIM, it crashes once in a million miles,

    Nathan Lambert [00:32:46]: um, crashes at all.

    Eugene Vinitsky [00:32:49]: Yeah.

    Nathan Lambert [00:32:50]: And you, but what was the number you said before for miles per crash?

    Eugene Vinitsky [00:32:53]: Uh, humans are between 20 and a hundred K, um, somewhere, somewhere like that.

    It's a little hard to get estimates because it varies from place to place a lot.

So, I mean, a lot of the industry is pretty excited about this, like, AlphaZero for self-driving type thing.

    And the question, you know, becomes, as you said, like, what is the simulator that we do this in?

    And so one perspective that's very prominent is like, let's collect a lot of data.

Let's learn the world model and then let's unroll in that simulator.

    And then the challenge becomes like, who do you unroll in that simulator?

    Now your world model has a build into itself, a model of the other agents, right?

    If you kind of take the single agent perspective, I'm going to unroll world model.

    I'm going to place a car inside of it.

    And that's the car I'm going to train with RL.

    And now what happens.

    Nathan Lambert [00:33:40]: This was a big problem for self-driving because you have like a dynamic number of, um, objects in the scene that you're supposed to reason about with your world model.

    How does the policy that you train handle this kind of agents coming in and out?

    Now, is it all just that you have some, like, are you identifying entities as nearby as other cars are nearby or is there some abstraction or is that the perception stack handles that?

    Eugene Vinitsky [00:34:04]: Yeah, exactly.

    We roughly simulate a sensor in the sense that you only see cars in some radius of yourself.

    Um, but, but we don't, we don't, yeah.

    I mean, all the cars are there persistently in the simulator driving around and we, we answered this riddle of like, what should the other cars do by like their self-play, right?

    They're a copy of your policy.

    They're driving around.

    Um, whereas I don't know what happens in the world model, right?

    Like kind of in this like world model approach, you're limited by how capable the world model is at simulating the behavior of other actors.

And if your world model has actually learned a robust model of human driving for all the other agents in the simulator, then, like, you don't really need to do RL, because the world model already has a human-level model of how humans should behave in the simulator. But they don't.

    Um, so yeah.

    Nathan Lambert [00:34:53]: And it's just like, it's just, it's, it's so funny that it just feels like they haven't.

And the only way that Waymo et cetera has gotten it, it seems like Waymo has adapted an autonomous stack with, like, some human inspiration to make the driver more smooth, is what it seems like when you're in it, which is like really strong perception and world understanding with some really clever policy that is tuned to feel human, but probably not human or RL at the end of the day.

    Eugene Vinitsky [00:35:27]: I wonder, I don't know what Waymo's planning stack actually looks like in the end, right?

    Like Waymo's pretty secretive and, uh, I've never worked there.

    Um, and if I had worked there, I wouldn't be able to say.

Um, but you know, I think, I think, you know, if I had to make a bet, it's some, some kind of like hand-designed cost, um, like mixing a bunch of terms together about like what a good trajectory looks like, maybe mixing in a little bit of human data to, to make that trajectory feel a little smooth and human-like.

    Nathan Lambert [00:35:59]: And yeah, to prompt you, um, what does your, yeah, I agree with this.

    What does your history of being a nerd on urban plan and planning make you think of what is coming for self-driving cars?

    Eugene Vinitsky [00:36:12]: So, so I guess the thing to mention is I'm a professor of transportation engineering, uh, among other things.

So I am, um, required to have some thoughts on this.

    Um, I think that, you know, self-driving cars are, are coming.

Um, I don't know if they're, they're coming a year from now or who knows when, as the cost curve gets driven down.

    Nathan Lambert [00:36:32]: Where we live, they're more likely to come sooner given tech hubs and, um, people are willing to pay very high premiums.

    Eugene Vinitsky [00:36:39]: That's true.

    So like, like a lot of goods, they may come for, for wealthy folks first.

    And then that allows the cost scaling to come down over time.

Um, and it really is a magical experience to take a Waymo, right?

    Like I remember the first day I saw like the cars driving around and nobody in it.

    And I actually just started chasing one of the cars cause I was so like, it was such a magical moment.

    I needed to, I needed to experience it for as long as possible.

    Nathan Lambert [00:37:04]: Um, yeah, my first time was in Scottsdale, Arizona for one of my best friend's bachelor parties.

    He's also an engineer.

    And we saw one driving with no person.

    And I was like, I wonder if we could take one.

And I immediately downloaded the app.

    And because it's in the middle of nowhere, their testing zone, they have tons of supply and no demand.

    Eugene Vinitsky [00:37:20]: So we were just immediately able to drive one around.

I actually sat in an airport for three hours in, in Phoenix while my phone upgraded to the newest OS so that I could, like, download the app and take a Waymo for the first time there.

    Nathan Lambert [00:37:36]: Uh, yeah, it's, this is totally reasonable behavior for anybody listening and you should update your prior.

    If you don't think it's reasonable, it's totally reasonable.

    Eugene Vinitsky [00:37:44]: It's a totally normal thing to do.

    Um, but I think, so I think in cities, like.

    Um, so I think that it's, it's still going to be a long time before these things are rolled out at scale.

    Um, so just because costs, safety, how long does it take you to verify that it's safe to drive in a new city?

    I mean, okay, let's put Tesla aside.

    I don't, I don't talk about it.

    I don't really know how to think about that.

    Um, but that's how I feel too.

    Um, there's, you know, there, there's parts of the United States that are due to state dependence, like very spread out, right?

    Like because of like suburbization, like, I don't know if that's a word, it's the word I use.

    Um, like they're, they're very spatially spread out.

    Like in my, my grandpa's hometown or where my grandpa lives, uh, there's no public transit.

    There's no way to get by without a car.

    Public transit isn't viable because of the way people are distributed.

    So if those systems like those continue to exist, like, you know, people will continue to drive there.

    And over time, those things will be replaced by, by, by a few self-driving cars.

    Um, you know, uh, as a public transit advocate, I would still say that I think within cities, it is significantly more efficient to like fund buses and subways and things like that.

    Um, but you know, there's, there's parts of the U S that are just like, so set up and I expect self-driving cars to be part of that.

    Uh, yeah.

    Nathan Lambert [00:39:15]: I mean, this isn't, this isn't a hot take.

    I think you're just kind of like a realistic and you don't have a crazy worldview about it.

    Eugene Vinitsky [00:39:22]: Yeah.

I mean, I have, I have my, you know, real deep love for public transit and like a desire for more people to experience it than just the people who live in New York City, where I think New York sees like 50% of all public transit ridership in the U.S.

    Um, uh, but you know, the system is what the system is right now.

    Nathan Lambert [00:39:41]: Yeah.

    Okay.

    Um, let's pivot from self-driving land where we've had this self-play RL and try to draw some analogies to the other RL breakthroughs that aren't language models that have been happening.

    I think the one that everybody should know about or in many people do is this, um, locomotion and or sim to real with robotics with humanoids, quadrupeds.

    Eugene Vinitsky [00:40:07]: Yeah.

    Nathan Lambert [00:40:07]: If you look at it, it is definitely directionally similar to what is this self-play thing is.

    I think that it's hard for people who haven't been in RL to understand the grandness of this transition from one agent locomotion to many agents doing something in a cooperative or competitive game with these same mechanisms.

    I feel like even talking to you, I don't think we've done a good job grasping just that enormity.

    Like multi-agent is just historically so much more complex.

I don't know if there's anything about something like Dota Five and how that happened, I wish I knew more lore of how that happened and why it didn't continue, because I feel like it could be a good example of why this is actually so much harder than even something like AlphaGo, which is just one policy, and these robotics things we're going to talk about, which are, like, still all one policy, but just like one thing in the world.

    Eugene Vinitsky [00:41:07]: So let me, let me give it another try because I think I also haven't done the greatest job describing it.

    So like in something like Dota or Go, there is in fact a notion of like a best way to play.

It's, you know, it's, it's, well, it's, it's, it's like a Nash equilibrium.

    It's like, you can't do better than that.

    If you play it, nobody can beat you.

    Nathan Lambert [00:41:27]: Have we arrived at that at Go?

    I don't think like, have we actually arrived at these at chess and Go because the ELO scores are still going up.

    Eugene Vinitsky [00:41:33]: No, we haven't.

    Nathan Lambert [00:41:34]: But like conceivably there is a max.

    Eugene Vinitsky [00:41:37]: There is a max.

    You're, you're never going to get it.

    It's like the game's too big, but there is a best way to play.

    And then in every domain where there's a human in the loop, there's not an obvious best way to play it.

    And so the challenge has always been like, you know, if I run self-play, it's going to converge to some behavior.

    And that behavior is not necessarily something that can drive with, with humans in the loop.

    Like, you know, it'll learn something like you could imagine, for example, like you do a bunch of self-play and the cars learn that they can tell their partner where they want to go by hitting the blinkers left, left, right, right, left, left.

    That means like, I'm taking a left turn and I'm going to go at 25 miles per hour.

    And so there's this idea that like, there's all these policies that you wouldn't want to play and don't make any sense.

    And kind of what we show in this paper is that if you do a little bit of reward design and you really scale up RL, then like the simple fact of like having, being uncertain about where everybody wants to go and having to be very robust to collisions constrains your behavior in such a way that you like broadly learn how to drive well.

    Um, and, and I think this is transferable to other domains where, you know, you want some kind of base policy that roughly knows how to do the task well over some unknown distribution of partners.

    Nathan Lambert [00:43:02]: How easy do you think it would be to learn in the same simulator?

    If all the other cars were controlled by this policy, if you only had to learn for one policy, how much easier is it to learn just one policy that kind of works in the world rather than this multi-agent and everybody is learning at one setup?

Because this is essentially what people have done: like, we've learned how to control one robot in the world and do that super well, versus learning everything from scratch with multiple agents, which is much harder.

    Eugene Vinitsky [00:43:30]: And I, I think if imagine, if you imagine that, okay, we have N cars and N minus one of them are controlled by a perfect model of human driving, right?

    Like, I think that you could learn that super fast and really robustly.

    Um, and of course the problem is we don't have that one perfect model of human driving that we can plug into our simulator.

I don't think it would take, you know, the trillion samples that it took us.

    Nathan Lambert [00:43:54]: So, so I think this, yeah, so that's the difference.

And that's what these other things, like quadrupeds for robotics, have. Let me just let you talk about it.

    Where do you think this sim-to-real, single-agent robotics work is at and heading?

So like, what do you see there, being slightly more plugged into the academic RL side of things?

    So, so it's like 2021 and 2022 is when these Marco Hutter group papers started.

    And I'm sure the trend is still continuing.

    Eugene Vinitsky [00:44:23]: It's still continuing, right?

    Like for quadrupeds, we, we, there, people are regularly making them do these insane behaviors that we, we haven't been able to design in other ways.

    And I think the real, the lesson there is like at scale RL works.

    Like a lot of the lessons of like self-supervised learning are transferable to the RL side.

    Um, and while it would be great to get the sample complexity down and stop doing this with like a trillion samples, you know, if you're willing to bite that bullet and just scale, um, and you, you have, you know, kind of a, a fairly good simulator, um, you know, you can, you can, you can really do incredible things.

    Nathan Lambert [00:45:00]: Um, do you think these RL results scale more with model size or sample complexity?

Do you think that they're kind of brute-forcing it through scaling many more interactions with the world?

    Eugene Vinitsky [00:45:10]: Yeah, uh, I think that they scale, um, scaling with model size is like a little iffy in, in, in, in RL.

    Um, there, there are tricks that people have been coming up with to let you use bigger and bigger models.

    But right now I think a lot of the, the impetus is towards smaller models that like have low inference costs, um, that let you like force a trillion samples into the policy, right?

    Whereas if you make the model bigger, inference cost becomes more of a thing.

    I think it's harder to acquire the samples.

    Nathan Lambert [00:45:38]: I think this relates to the other area that I'm excited about in RL, which is procedural generation and open-endedness.

    Do you see needing a ton of samples as being related in spirit to this, where open-endedness is a field of study designed to make agents that are good at collecting the right samples, to use the word explore, which we haven't really used yet?

    Eugene Vinitsky [00:46:07]: Yeah.

    So I think a lot of what we're doing here is actually dodging the exploration problem in a lot of ways.

    And in general, the RL that works is often about dodging the exploration problem, right?

    Why do we need a trillion samples?

    Because we explore very inefficiently.

    I think this is what we have if we talk about the quadrupeds and things like that, right?

    Like these are well-defined tasks with a well-understood reward function.

    And at some point, as we scale up RL, this task design will become the bottleneck, right?

    It's like, uh, what tasks should the agent do?

    There's a human in the loop sitting down writing the reward function saying, okay, that's a good task.

    This is a good task.

    The dream of open-endedness is that we'll move away from this and towards taking the human, the task designer, out of the loop.

    Nathan Lambert [00:47:00]: Let's start taking a step back.

    Are there tasks that you think are heavily simulated in control domains that are actually well-suited to this RL approach that may have not been done yet?

    I mean, simulation is like a core tool in robotics and autonomy.

    So what other things are doing heavy simulation and not leveraging this?

    Maybe even like hard sciences are doing this.

    Eugene Vinitsky [00:47:19]: I think this is going to eat almost everything that can be simulated.

    Well, the fundamental question is: can you simulate it with a relatively small sim-to-real gap, and can you simulate it efficiently?

    And if you have both of those things, I think RL is going to eat all of those domains.

    Or you can also scale this up by paying the price.

    For example, I expect formal verification, like agents that write Lean proofs, to do really well.

    There it's expensive because the simulator is slow, but there's no sim-to-real gap.

    Nathan Lambert [00:47:57]: I'm thinking in this scientific and control domain.

    A timely example is humanoids, which I've been consistently bearish on.

    I think if you have the simulator, the control policy will be solved, but most of it is an environment problem where the robotic actuators are too strong.

    So therefore they're limited to manufacturing.

    And I don't necessarily know how much a humanoid is better than a static arm in manufacturing and logistics.

    Eugene Vinitsky [00:48:29]: I might be bearish on humanoids for similar reasons, but I guess you're right on the point.

    Will we be able to make a humanoid follow whatever trajectory we would like it to follow through scaling up RL?

    Yeah, I think so.

    Like, but then the question becomes like, what trajectory should it follow?

    And then that's where things get iffy again, right?

    Like, exactly how softly should it behave?

    Stuff like that, task planning, things like that.

    But from the controls perspective of, here's a system, I want it to follow this trajectory?

    Most of these things have good fast simulators.

    Nathan Lambert [00:49:10]: Um, do you think RL should be used more in AI for science than scaling deep learning?

    So I'm guessing there are a lot of scientific domains that are very simulation intensive.

    And a lot of the approaches and excitement right now is to train a deep learning model to predict data.

    I think there's Evo2, which is a recent DNA sequence predictor.

    And I was reading a lot about this.

    And a lot of the criticism is like, we don't really know how to use it.

    And the mechanism is that if the model is surprised by a DNA string, then maybe it's a mutation.

    And there's a lot of weirdness like that.

    Yeah.

    But maybe this slow burn of scaling RL systems is at least a more direct way to potentially improve some domains.

    Eugene Vinitsky [00:49:54]: Great question.

    Um, super interesting question.

    So I think the story I've been telling you, about sample-inefficient RL scaling really well, is something we understand pretty well.

    It's less clear how RL does in limited-sample domains.

    And I think a lot of the issue in deep learning for science is that the simulators themselves are quite slow.

    So if you want to simulate, say, a fusion loop, a tokamak of some kind, it can actually take months to run a single simulation.

    Nathan Lambert [00:50:28]: Then, since you brought this up, what do you think of the DeepMind nuclear fusion control paper?

    Eugene Vinitsky [00:50:34]: They might have been using a slightly different simulator.

    It's a different simulator.

    I don't think it requires quite as much precision.

    I'm thinking of other simulators, and to be clear, I haven't read that paper super closely, but if you think about something like AI for materials or AI for bio, a lot of these are fairly slow simulation platforms.

    What I do think is pretty exciting is that at some point, and there are a lot of bottlenecks to this, someone is going to build an autonomous lab and just keep letting it loop: synthesizing some material, characterizing it, and then running it through the loop again.

    The problem there is actually the characterization step; doing it correctly is really hard.

    Like, what are the properties of the material that I've just synthesized?

    So I think that in terms of RL for science, that trajectory is a little trickier because of this low ability to acquire samples.

    Whereas in the humanoid and quadruped domains, people will simulate 2,000 humanoids at once on one GPU, or something silly like that.
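
    For a sense of what that throughput looks like, here is a minimal sketch of a batched rollout loop; the environment and the random policy below are hypothetical stand-ins for illustration, not any particular simulator's API.

        import numpy as np

        class BatchedHumanoidEnv:
            """Hypothetical stand-in for a GPU-vectorized simulator running many copies at once."""
            def __init__(self, num_envs: int, obs_dim: int = 64, act_dim: int = 17):
                self.num_envs, self.obs_dim, self.act_dim = num_envs, obs_dim, act_dim

            def reset(self) -> np.ndarray:
                return np.zeros((self.num_envs, self.obs_dim))

            def step(self, actions: np.ndarray):
                # One physics step for every environment copy simultaneously.
                obs = np.random.randn(self.num_envs, self.obs_dim)
                rewards = np.random.randn(self.num_envs)
                dones = np.zeros(self.num_envs, dtype=bool)
                return obs, rewards, dones

        def random_policy(obs: np.ndarray, act_dim: int = 17) -> np.ndarray:
            return np.random.uniform(-1.0, 1.0, size=(obs.shape[0], act_dim))

        env = BatchedHumanoidEnv(num_envs=2000)   # the "2,000 humanoids on one GPU" regime
        obs = env.reset()
        total_samples = 0
        for _ in range(1000):                     # 1,000 steps x 2,000 envs
            obs, rewards, dones = env.step(random_policy(obs))
            total_samples += env.num_envs
        print(total_samples)                      # 2,000,000 transitions from a very short loop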

    Nathan Lambert [00:51:41]: Um, do you think these things scale nicely with action space?

    I feel like if we want to do this open-ended learning in a lot of environments, I don't know how to constrain the action space in a nice way.

    So that somewhat worries me.

    Eugene Vinitsky [00:51:55]: So I think there's a couple of pieces of that.

    So I think LLMs sometimes give you pretty good priors over actions, right?

    One thing we've been pretty consistently seeing is that they constrain the action space on their own in a really helpful way.

    It is also the case that with much larger action spaces you just eat a sample complexity penalty and things take longer, but we're seeing it be fine.

    That's in the regime of around 500 actions, this kind of thing.

    Now, if we all of a sudden go out to like 5 million actions, I think all bets are off.

    Nathan Lambert [00:52:26]: It does kind of seem like the same thing that happened with language models might happen with open-endedness.

    Now all the agents are going through a language model or some general interface, like a computer interface, which ends up constraining the action space to keyboard and mouse inputs, which order-of-magnitude-wise is actually a similar action space.

    Eugene Vinitsky [00:52:49]: I think there are going to be a lot of helpful constraints on the action space that let you deal with this problem, right?

    You're not operating in totally open-ended language or open-ended use of a computer.

    Nathan Lambert [00:53:04]: To recap some of this, because it's interesting: the self-play question is the hardest one to grok, and honestly, I still don't fully understand it.

    I will have to listen back to some of this, but the scaling in samples is the thing that makes RL actually work, the fact that we can scale in samples.

    And that is the thing that's needed for most of these domains.

    And it's very different from what's happening in language models, but it's at least a consistent theme across what is now over five years of resurgent RL-for-control results.

    Eugene Vinitsky [00:53:41]: Scaling works in RL, right?

    There's no real wall here.

    Nathan Lambert [00:53:46]: It's a different type of scaling than people expect.

    I think a lot of the historical scaling papers were trying to scale parameters, looking at something like DreamerV3 or whatever, and scaling the parameters of the policy with the number of environment interactions.

    But it seems like that is actually just a different axis from what is thought of in traditional deep learning scaling.

    Eugene Vinitsky [00:54:07]: It's, it's kind of the number of samples that you're getting.

    Nathan Lambert [00:54:10]: Which is very good to say clearly.

    And then the self-play thing is changing the domain to be one that is much more complicated.

    And the fact that it can still work there opens up a lot of interesting questions.

    So like the scaling thing is an independent axis that actually works.

    And then the self-play thing is dramatically increasing the complexity of your problem from this single agent world.

    Eugene Vinitsky [00:54:34]: It's adding a lot of diversity through there being other agents in the system that are behaving in unexpected ways.

    So yeah, there's scaling up the number of samples.

    There's scaling up the diversity of your simulator, and in the single-agent domains, you don't need to scale up the diversity of your simulator.

    You have one task and you want to do it.

    Um, but yeah, yeah, that makes sense.

    Nathan Lambert [00:54:55]: Okay.

    These are interesting takeaways to reflect on. To go towards wrapping this up, let's go into the language model corner and then the grad school advice corner.

    You were on a paper as an advisor, Large Language Models are In-Context Preference Learners.

    What is the story here?

    I think preference learning is openly out of vogue, but I think that's because people are short-sighted and chase whatever dominates the AI discourse; everyone is still actually doing preference tuning, but everyone's just talking about RL on verifiable domains

    Eugene Vinitsky [00:55:31]: Or whatever the hype is, yeah.

    Nathan Lambert [00:55:34]: But like, what is your take on this preference learning thing?

    And I know you have a big self-play and RL background here.

    Eugene Vinitsky [00:55:42]: So I'll tell you how we got here real quick, and it'll make it clear. And I should say that there's coincident work by Jaden Clark, Joey Hejna (I'm going to mispronounce his name), and Dorsa at Stanford.

    That work kind of got at the same idea at the same time.

    The idea is about what happens if you want to do preference learning, say RLHF from scratch or something.

    You have to learn this reward function.

    And you have to acquire a lot of samples to do so.

    The tabula rasa version of this is really inefficient.

    It turns out that what you can do, if you have a particular description of a task you'd like to do, is ask a language model: write me a set of reward functions that correspond to this task.

    You'll take those reward functions.

    You'll train an agent corresponding to those reward functions.

    You'll have a human rank their preferences over those reward functions.

    Like this was a good one.

    That was a bad one.

    And you'll feed all of that to a language model again, and do another loop of asking it to write some reward functions, given these preferences.

    And it turns out that language models can take that information and use it to decide what a good next reward function to try out is.

    And over time, you're going to get a reward function that is much more aligned with your preferences just by having the language models actually write the rewards.

    And so this is letting you do personalization or reward tuning at the scale of around 50 human queries.

    This came about because we were asking: what if I wanted to build an agent that acts scared or is deceptive?

    Like I kind of have to do some amount of preference learning, right?

    "Is deceptive" is the kind of thing that's defined with respect to humans.

    And so we were trying to figure out how to do something like that sample-efficiently.

    It turns out you can just ask an LLM to write a reward function that corresponds to being deceptive and then run that loop a bunch of times.
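
    A minimal sketch of the loop described above, with placeholder helpers standing in for the LLM call, the RL trainer, and the human ranking interface (this is an illustration of the idea, not the paper's code):

        import random

        # Hypothetical placeholders: a real system would call an LLM API, an RL
        # training loop, and a human preference interface here.
        def llm_propose_rewards(task, feedback_history, k):
            return [f"candidate_reward_{len(feedback_history)}_{i}" for i in range(k)]

        def train_agent(reward_fn):
            return {"trained_with": reward_fn}

        def human_rank(agents):
            return [random.random() for _ in agents]  # stand-in for real human rankings

        def in_context_preference_loop(task_description, num_rounds=5, k=4):
            """Propose rewards -> train agents -> have a human rank them -> feed back."""
            feedback_history = []
            reward_fns, rankings = [], []
            for _ in range(num_rounds):
                # 1. The LLM writes k candidate reward functions, conditioned on prior feedback.
                reward_fns = llm_propose_rewards(task_description, feedback_history, k)
                # 2. Train one agent per candidate reward function.
                agents = [train_agent(fn) for fn in reward_fns]
                # 3. A human ranks the resulting behaviors.
                rankings = human_rank(agents)
                # 4. Feed the rankings back so the next proposals are better aligned.
                feedback_history.extend(zip(reward_fns, rankings))
            best = max(range(len(reward_fns)), key=lambda i: rankings[i])
            return reward_fns[best]

        print(in_context_preference_loop("make a robot jump like a human"))

    The point is that the language model, rather than a learned reward model, does the preference aggregation in context, which is what keeps the human budget down to tens of queries.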

    Nathan Lambert [00:57:35]: So I would say this means that language model personalization doesn't need to be done within the parameter space or something like this.

    The domain here is at least partially control, looking at this figure, but it goes to show that the models probably already have the representation, and it's just a problem of setup and how you get people to actually do this.

    I've said a similar thing about 20 questions: what if a language model asked 20 questions about the user and got the information out of that?

    So it's nice to see that this might not be an actual technical limitation.

    It's just, how the hell do you do that in chat, or whatever.

    Eugene Vinitsky [00:58:17]: We haven't tested that particular case.

    And I want to; I think it's the kind of thing that might work.

    In our case, it was: make a robot jump like a human, right?

    Like, write a reward function that corresponds to jumping like a human.

    And it turns out, with a couple of rounds of iteration, you can get a language model to write down a reward function.

    Nathan Lambert [00:58:32]: Did the reward function make sense?

    Like, what is the reward function for jumping like a human?

    Eugene Vinitsky [00:58:37]: It's things like: make sure that you jump off both legs instead of hopping on one leg, don't flail your arms around too wildly, that type of thing.

    I think, I think the whole reward function is in the paper.

    But yeah, it's got some interpretable terms. The base reward function that comes out in the first iteration always involves hopping on one leg.

    And then by the end, it's two legs and not too much flailing.
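
    The kinds of terms being described might look something like the following sketch; the state fields and weights here are invented for illustration and are not taken from the paper.

        def jump_like_a_human_reward(state):
            """Illustrative reward: reward height, reward a symmetric two-legged takeoff,
            and penalize wild arm motion."""
            r = 0.0
            r += 1.0 * state["jump_height"]                                        # get off the ground
            r -= 2.0 * abs(state["left_foot_force"] - state["right_foot_force"])   # push off both legs
            r -= 0.1 * state["arm_angular_speed"]                                  # don't flail the arms
            return r

        # Example call with a made-up state dictionary:
        print(jump_like_a_human_reward({
            "jump_height": 0.4,
            "left_foot_force": 300.0,
            "right_foot_force": 310.0,
            "arm_angular_speed": 2.0,
        }))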

    Nathan Lambert [00:59:04]: There's like a deeper RL control thing there, which is that all of the agents when they start are just wildly flailing when you learn control from scratch.

    So that's in many ways, not super surprising.

    Um, do you have any comments on the general like language modeling RL stuff?

    The discussion is so focused on it, and part of the point of this conversation is to broaden the lens by which people consider RL to be a thing.

    Eugene Vinitsky [00:59:33]: I mean, the biggest thing I should say there is I think it's going to work.

    For domains where you have verifiable rewards, I just think this is going to work.

    You're just going to have to bite the bullet and generate a lot of samples.

    Nathan Lambert [00:59:46]: Um, so, uh, it's interesting that you say that because one of the biggest things we find is you just have to keep training, right?

    So I mean, it's a classic RL curve.

    It's like a log curve: you start out really fast and then you're on this plateau where you're getting a little bit more for a really long time.

    And it's far fewer samples than pre-training and everything, but the learning curves look so similar to anything you would get in RL, and you can have crashes and stuff, which is such a throwback.

    And it's very different from preference tuning, where you have this whole over-optimization thing; this is just so much less artful.

    It's like, just so obvious.

    It's like, if the number's going up, it's probably fine.

    And you don't really have to do that much.

    It's nice, right?

    There's a number.

    Eugene Vinitsky [01:00:32]: There's a number.

    You just have to push that number up.

    Life is great.

    Nathan Lambert [01:00:36]: The last section is career corner.

    For people who are interested in working on RL right now, what do you say to them?

    Eugene Vinitsky [01:00:46]: I mean, I think, I think RL is just starting to eat different domains.

    So like, I think this is a really good time to get started on it.

    There are not enough strong RL researchers, surprisingly.

    So I don't think we're even an oversaturated domain.

    Nathan Lambert [01:01:07]: Do you think it's feasible for grad students to do this language model stuff and core like RL agent stuff?

    Because the agent stuff feels like more of a long-term moat because you're doing something that fewer people know about.

    But should people like fully ignore the language model stuff if they're trying to get established as a grad student?

    Eugene Vinitsky [01:01:24]: This is an important academic thing: you need to focus on demonstrating the idea with as few barriers as possible, right?

    So you want to pull out the minimum version of the demonstration.

    And a lot of the time putting a language model in there is adding a huge bottleneck, right?

    All of a sudden you need a bunch of compute to use it, and training takes forever.

    So, I mean, you should probably do some language modeling things at some point because like, it's a good skill to have demonstrated when you go on the job market.

    So what I think a lot of students will do is do that in their last year or two of grad school, just to show that they can do this.

    But for demonstrating the core ideas, I don't think you always have to use the language model unless your ideas are deeply tied to that domain.

    Yeah.

    Nathan Lambert [01:02:14]: The way that things scale and results are communicated is really different between RL for a domain, core RL algorithms, and the language-model-plus-RL thing. I think sequencing is probably the best bet: starting with something less competitive and focusing on skill development.

    That's generally my biggest "should I do a PhD" answer: do it if you're invested in developing new skills, or if you're a person who is genuinely academic and scientific in nature. There is a subset of people who are truly scientists in the nature of the word, and they will probably thrive emotionally in that situation.

    But most people want to do an AI PhD because they think of it as a credential for a future job, which is generally a hilarious and ill-formed side effect of wherever the field is right now.

    Eugene Vinitsky [01:03:03]: It's just such a bad idea.

    There's one brief moment in history where a PhD was a definitive route to a high-paying job, right?

    Generally, what a PhD is supposed to do is, hey, it should be fun.

    It should be fascinating.

    It should be like five years where you're like, I could not imagine doing something cooler than what I'm doing right now.

    And then it's supposed to unlock some jobs that aren't accessible to you otherwise.

    Running a research team in industry, using particular skills that are hard to develop unless someone gives you a year or two to focus on them, right?

    Like hard optimization problems, a lot of specialties.

    But the mindset of: I'm going to do a PhD.

    That's going to give me a 500K total compensation job straight out of grad school.

    It's just such a weird quirk of history that optimizing for it is never a good idea.

    Nathan Lambert [01:03:56]: Yeah.

    And I think that says that if you are a grad student or junior faculty right now, and anything you're optimizing for is trying to extract value from that quirk of history, you're putting yourself in a precarious position.

    Eugene Vinitsky [01:04:15]: Yeah, optimize for being able to do cool things; that's a consistent thing you can always optimize for.

    It doesn't go away when the job market changes.

    Nathan Lambert [01:04:29]: Yeah, I agree.

    I mean, that could be a good place to end it.

    You can actually, surprisingly, still just do things.

    And yeah, I think it's easy to lose track of that in the language modeling chaos.

    Eugene Vinitsky [01:04:43]: Yeah, I mean, but look, also, I'm coming from a position of privilege, right?

    Like I have a faculty position.

    Nathan Lambert [01:04:48]: We're early enough where it is fine.

    Eugene Vinitsky [01:04:51]: Yeah.

    Okay.

    Well, this has been a pleasure. Thank you for taking the time to chat with me and give me a chance to talk about this paper, which I think I still had some trouble conveying exactly why I find so exciting.

    But hopefully some of it got across.

    Nathan Lambert [01:05:06]: I think we got to some clear things, and the self-play being weird definitely gives me more sympathy for how bad the discussion around self-play for language models is, because there is a lot of nuance in why what we're doing with RL with verifiable rewards is very different from language models talking to themselves and both updating their policies.

    And it's not to say we shouldn't be trying that, but we should be very wary of saying that is what we're doing until we are trying to do really, really hard things.

    The grandioseness of language model self-play is probably like trying to let language models discover their own language to be more effective at tasks.

    And they do this with interacting with each other.

    Eugene Vinitsky [01:05:50]: And so, yeah, language model self-play for tasks they haven't been trained on, like learning to do new tasks collaboratively together.

    Super exciting.

    It makes sense.

    I'm doing some work on it.

    I'm excited about that.

    This thing where the amount of knowledge they have is bounded and you do self-play to refine the distribution that they're playing over, as opposed to doing a new task together, there's a little weirder stuff going on there.

    Yeah.

    Nathan Lambert [01:06:16]: Yeah.

    So I think it's good.

    I think people now know that the single-agent RL stuff working is not surprising.

    And the self-play area could be one of these things that is multiple years out before takeoff.

    And there are early signs that it could actually work.

    And that's something people are often looking for: problems that carry a bit more risk, but not complete risk, and are not the obvious thing that everybody is working on.

    Eugene Vinitsky [01:06:41]: Yeah.

    And if you're thinking in that category: collaborative agents, agents that know how to effectively collaborate with humans and with other AI agents, a very underrated area, going to be big in a bit, I think.

    Nathan Lambert [01:06:54]: I think the question for people deploying these models is what it means for multiple language models to be interacting on the web with separate goals. It's not necessarily the same self-play setup, but you could understand it through some of these lenses.

    And it's easy to see how this is going to come about when listening to the marketing from all these labs.

    Eugene Vinitsky [01:07:17]: Yeah.

    Well, it's, yeah, it's going to be fun.

    It's going to be weird.

    It's going to be great.

    I do have to inject a quick little pitch before I disappear.

    Nathan Lambert [01:07:28]: Sounds good.

    Eugene Vinitsky [01:07:28]: So I've joined a new stealth AI company.

    We're working on making better decisions in critical industries like healthcare, supply chain, and defense.

    So, industries that are not very tech-forward or AI-native, but where almost all productivity lies.

    And so we're looking for strong folks with experience with either RL or LLMs.

    Nathan Lambert [01:07:55]: Do you have a company name?

    Eugene Vinitsky [01:07:56]: We do not.

    Nathan Lambert [01:07:57]: Okay.

    Well, we'll have a link.

    We'll have a contact link below.

    I think, yeah, Eugene is a fun person.

    And he at least knows enough to make it through this conversation; there are a lot of startups out there that are riding on less.

    So that's good.

    Eugene Vinitsky [01:08:13]: Um, that'll be fun.

    I don't think there are that many applied RL teams out there in the world.

    So maybe there are and I just don't know them, but I think Marc Bellemare's startup is

    Nathan Lambert [01:08:24]: actually potentially related, which seems like it's kind of trying to scale RL with a mix of language models to solve problems, but I haven't asked him directly.

    Eugene Vinitsky [01:08:33]: I mean, that's somewhat of what we're doing too.

    But that's all I can say.

    Nathan Lambert [01:08:42]: Yeah.

    Sounds good.

    It's early days for that and for self-play and many other things, but I'm sure we'll cross paths soon.

    Either if I go back to New York or if you, for some reason, come all the way to Seattle.

    I love Seattle.

    Eugene Vinitsky [01:08:58]: So yeah, I'd love to in the fall.

    Anyways, it was a pleasure talking to you, and hopefully we get a chance to talk again soon.

    Nathan Lambert [01:09:08]: Yeah, will do.



    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
  • Full post: https://www.interconnects.ai/p/elicitation-theory-of-post-training

    For most of the models we’ve received from OpenAI, Anthropic, and Google in the last 18 months, you’ll hear a lot of “Most of the improvements were in the post-training phase.” The most recent one was Anthropic’s CEO Dario Amodei explaining Claude 3.7 on the Hard Fork Podcast:

    We are not too far away from releasing a model that's a bigger base model. Most of the improvements in 3.6/3.7 are in the post-training phase. We're working on stronger base models (perhaps that will be the Claude 4 series, perhaps not; those are coming in a relatively small number of time units [months?]).

    Here's a simple analogy for how so many gains can be made on mostly the same base model.

    The intuition I've been using to understand the potential of post-training is called the elicitation interpretation of post-training, where all we are doing is extracting and amplifying valuable behaviors in the base model.

    Consider Formula 1 (F1): most teams show up at the beginning of the year with a new chassis and engine. Then they spend all year on aerodynamics and systems changes (of course, this is a minor oversimplification) and can dramatically improve the performance of the car. The best F1 teams improve far more during a season than they do from chassis to chassis.

    The same is true for post-training. The best post-training teams extract a ton of performance in a very short time frame. The set of techniques covers everything after the bulk of pretraining: “mid-training” like annealing / high-quality end-of-pretraining web data, instruction tuning, RLVR, preference tuning, etc. A good example is our change from the first version of OLMoE Instruct to the second, where we improved our post-training evaluation average from 35 to 48 without touching the majority of pretraining.

    Then, when you look at models such as GPT-4.5, you can see this as a way more dynamic and exciting base for OpenAI to build onto. We also know that bigger base models can absorb far more diverse changes than their smaller counterparts.

    This is to say that scaling also allows post-training to move faster. Of course, to do this, you need the infrastructure to train the models. This is why all the biggest companies are still building gigantic clusters.

    This theory fits with the reality that the majority of gains users are seeing come from post-training, because it implies that there is more latent potential in a model pretrained on the internet than we can extract by teaching the model directly, such as by repeatedly passing in certain narrow samples during early types of post-training (i.e. only instruction tuning).

    Throwback to the superficial alignment hypothesis

    Another name for this theory is the Superficial Alignment Hypothesis, coined in the paper LIMA: Less is More for Alignment. This paper gets some important intuitions right, but for the wrong reasons in the big picture. The authors state:

    A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users. If this hypothesis is correct, and alignment is largely about learning style, then a corollary of the Superficial Alignment Hypothesis is that one could sufficiently tune a pretrained language model with a rather small set of examples [Kirstain et al., 2021].

    All of the successes of deep learning should have instilled a deeply held belief that scaling data is important to performance. Here, the major difference is that the authors are discussing alignment and style, the focus of academic post-training at the time. With a few thousand samples for instruction finetuning, you can change a model substantially and improve a narrow set of evaluations, such as AlpacaEval, MT-Bench, and ChatBotArena. These do not always translate to more challenging capabilities, which is why Meta wouldn’t train its Llama Chat models on just this dataset. Academic results hold lessons, but they need to be interpreted carefully if you are trying to understand the big picture of the technological arc.

    What this paper is showing is that you can change models substantially with a few samples. We knew this, and it is important to the short-term adaptation of new models, but their argument about performance leaves casual readers with the wrong lessons.

    If we change the data, the impact could be far higher on the model’s performance and behavior, but it is far from “superficial.” Base language models today (with no post-training) can be trained on some mathematics problems with reinforcement learning, learn to output a full chain of thought reasoning, and then score higher on a full suite of reasoning evaluations like BigBenchHard, Zebra Logic, AIME, etc.
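
    As a minimal sketch of what a verifiable reward looks like in this setting (the answer format and the binary reward below are illustrative assumptions, not any particular lab's implementation):

        import re

        def verifiable_math_reward(model_output: str, reference_answer: str) -> float:
            """Binary reward: 1.0 if the final answer matches the reference, else 0.0.
            The model is free to produce any chain of thought before the final answer."""
            match = re.search(r"Final answer:\s*(.+)", model_output)
            if match is None:
                return 0.0
            return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

        # RL then reinforces whichever sampled chains of thought end in a correct answer.
        sample = "Let x = 3. Then 2x + 1 = 7.\nFinal answer: 7"
        print(verifiable_math_reward(sample, "7"))  # 1.0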

    The superficial alignment hypothesis is wrong for the same reason that people who think RLHF and post-training are just for vibes are still wrong. This was a field-wide lesson we had to overcome in 2023 (one many AI observers are still rooted in). Post-training has far outgrown that, and we are coming to see that the style of models operates on top of behavior — such as the now popular long chain of thought.

    The counterargument to elicitation

    The prevailing counterargument to the elicitation theory of post-training has been that post-training is teaching specific skills to the model. This can be seen through very large finetuning datasets used in the early eras of open models. One of the last prominent research examples of this is OpenMathInstruct 2, which showed performance improvements with finetuning on up to 14.3 million instructions.

    A rough industrial norm is that you can do excellent post-training with only 1 million prompts at instruction tuning. The way to interpret results like OpenMathInstruct 2 under the elicitation theory is that the vast majority of the gains come from the beginning of training.

    The trade-offs discussed in this counterargument, such as scaling post-training methods, were specifically discussed in the paper Revisiting the Superficial Alignment Hypothesis, where it is clear that you can teach new skills to models in post-training. The hardest part today is knowing when the skills are entirely new; it is often safer to assume the model has seen them during pretraining. For now, it is far more efficient to elicit behaviors from the model than to teach them. Eventually the tide may shift to favor teaching models, but not in the near future.

    Comparing this math training dataset to current best (and emerging) practices for reasoning models makes the strongest case for the elicitation theory yet.


    RL’s role in elicitation

    The reinforcement learning (RL) training we’ve seen take off in this early era of reasoning models is often described as “sample efficient” and “the model learning new behaviors.” Both of these fit with the theory presented. The astute way to view the model learning new behaviors is not that it is learning entirely new abilities but rather learning to reinforce behaviors that were already latent in the pretraining dataset.

    Compared to teaching the model math with millions of supervised samples, just a few thousand prompts of RL training can far surpass that MATH performance.

    In many ways, RL training exploding in popularity and effectiveness is the ultimate endorsement of the elicitation theory. Where we used to try and teach the model math with millions of supervised samples, now we just let the model try different approaches on thousands of math problems, and they reach far higher peak performance.

    This is, of course, also linked to why people say that “stronger base models are better starting points for RL.” All of this fits together as the base model is the platform on which post-training is built.

    A reductionist take here is to say that pretraining is not important — in reality, pretraining is just slow and hidden from most of the gains we are seeing. Still, excellent post-training and the performance improvements we enjoy today are all well downstream of pretraining. Pretraining still is arguably the most important part of the training stack, as it allows those with confidence in the elicitation during post-training to thrive.

    Thanks to Mohit Raghavendra for some email exchanges that helped this post.



    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
  • Link: https://www.interconnects.ai/p/where-inference-time-scaling-pushes

    There’s a lot of noise about the current costs of AI models served to free users, mostly saying it’s unsustainable, a view that leaves little room for the historical pattern of technology costs always plummeting. GPT-4.5’s odd release of a “giant” model without a clear niche only amplified these critics. With inference-time compute becoming a new default mode, can we still have free AI products? Are we just in the VC-subsidized era of AI?

    For normal queries to ChatGPT, the realistic expectation is that the cost of serving an average query will drop to be extremely close to zero, and the revenue from a future ad model will make the service extremely profitable. The most cohesive framework for understanding large-scale internet businesses built on the back of such zero marginal costs is Ben Thompson’s Aggregation Theory.

    Aggregation Theory posits that extreme long-term value will accrue to the few providers that gate access to information and services built on zero-marginal cost dynamics. These companies aggregate user demand. It has been the mode of modern dominant businesses, with the likes of Google and Meta producing extremely profitable products. Naturally, many want to study how this will apply to new AI businesses that are software-heavy, user-facing platforms, of which OpenAI is the most prominent due to the size of ChatGPT. Having more users and attention enables aggregators to better monetize interactions and invest in providing better experiences, a feedback loop that often compounds.

    Aggregators are often compared to platforms. Where the former relies on being an intermediary between users and other marketplaces, platforms serve as foundations on which others build businesses and value, such as Apple with the iPhone, AWS, or Stripe.

    Businesses like ChatGPT or Perplexity will rely on a profitable advertisement serving model being discovered that works nicely for the dialogue format. ChatGPT interweaving previous discussions into the chat, as they started doing in the last few months, is encouraging for this, as they could also have preferred products or sources that they tend to reference first. Regardless, this will be an entirely new type of ad, distinct from Meta’s targeted feed ads, Google’s search ads, or the long history of general brand ads. Some of these past ad variants could work, just sub-optimally, in the form factor.

    An even easier argument is to see the current hyperscalers using low-cost inference solutions on AI models that complement their existing businesses and fit with components of Aggregation Theory — such as Meta serving extremely engaging AI content and ads. The biggest platform play here is following the lens through which language models are a new compute fabric for technology. The AWS of AI models.

    All of these business models, ads, inference, and what is in between, were clear very soon after the launch of ChatGPT. As the AI industry matures, some harder questions have arisen:

    * Who bears the cost of training the leading frontier models that other companies can distill or leverage in their products?

    * How many multiples of existing inference paradigms (0-100s of tokens) will inference-time scaling motivate? What will this do to businesses?

    This post addresses the second question: How does inference time compute change business models of AI companies?

    The announcement of OpenAI’s o3 with the inference cost on ARC-AGI growing beyond $5 per task and the proliferation of the new reasoning models raised the first substantive challenge to whether aggregation theory will hold with AI.

    The link to inference time compute and the one that sparked this conversation around aggregators was Fabricated Knowledge’s 2025 AI and Semiconductor Outlook, which stated:

    The era of aggregation theory is behind us, and AI is again making technology expensive. This relation of increased cost from increased consumption is anti-internet era thinking.

    This is only true if increased thinking is required on every query and if it doesn’t come with a proportionate increase in value provided. The fundamental operations of AI businesses will very much follow the lens of Aggregation Theory (or, in the case of established businesses, it will reinforce the advantages of existing large companies), and more work is going to be needed to figure out business models for inference-heavy products.

    We can break AI use today into two categories:

    * ChatGPT and general-use chatbots.

    * Domain-specific models, enterprise products, model APIs, and everything else that fits into the pay-for-work model (e.g. agents).

    The first category is established and not going away, while the second is very much in flux. Inference-time scaling will affect these in different ways.

    Consumers — well, most of them (and not most of you reading this who are power users) — will never know how to select the right model. The market for super users is far smaller than the market for general use. The core for consumer products is having the model know how much compute to spend. This is where RL training will likely be most important and is something notably missing from the release of Claude 3.7 Sonnet.

    OpenAI’s model offerings and initial excitement around inference time compute made many, myself included, get excited about the idea of a compute dial being shown to the users so they can control the “thinking effort” for their query. The problem is that rules for how well that translates to performance rely on a deep understanding of AI and how language model performance is very stochastic.

    The so-called dial is being reduced to simple reasoning buttons or always-on reasoning — extremes and yes/no decisions are much easier for users to work with. This is already how I engage with models. I start with a normal model, and if it doesn’t work, I punt to o1 pro. Would my trying to guess the right spot on a dial for a new query really be a good experience? Please, the model should know its own limits and handle that on its own.

    Today, the RL-trained reasoning models primarily serve as a trust and legibility enhancement for average users rather than a performance improvement. This is leading the exposure of chains of thought (CoTs) to become an industry norm. At the same time, this sort of minor increase in context length will still be subsumed into a zero-marginal-cost style of business, pending the assumed discovery of a functional ad model. This is all also just the tip of the iceberg for inference-time compute. From my coverage of Claude 3.7:

    RL training is a short path to inference time scaling laws being used, but in the long-term we will have more methods for eliciting the inference-time tradeoffs we need for best performance.

    For power users and enterprises, RL training and one model fits all is less important. Enterprises will want to benefit from clear trade-offs on performance vs log(compute).

    Many in the industry, including in the aforementioned Claude 3.7 release and o3’s ARC-AGI performance, are discussing the use of parallel test time compute relative to just increasing generation lengths. Inference time scaling with parallel computation and strong verifiers will be essential to the long-term trajectory of the sub-area.

    Where RL models can increase the compute spent by a model by factors of 2, 4, or 10 for a question, parallel computation already uses factors of 1000 (see o3), and will go far higher. This is a far more direct way to continue scaling the log-compute plots for inference time scaling. It’s also more efficient due to the quadratic costs of generating longer contexts — in fact most of the models we are using cannot scale output length infinitely, as we can with the number of samples.
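
    As a back-of-the-envelope illustration of that efficiency argument, assuming the attention term grows roughly quadratically with sequence length and ignoring every other cost:

        def attention_cost(seq_len: int) -> float:
            return seq_len ** 2  # simplified: cost ~ seq_len^2 for attention alone

        base_len = 1_000
        one_long_generation = attention_cost(10 * base_len)    # one generation 10x longer
        ten_parallel_samples = 10 * attention_cost(base_len)    # 10 independent samples

        print(one_long_generation / ten_parallel_samples)       # 10.0x more expensive

    Real serving costs involve far more than attention, but the direction of the trade-off is the point: parallel samples scale linearly in compute while a single much longer generation does not.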


    Better verifiers will increase the slope of the inference time scaling plots we are seeing, as discussed in our coverage of Claude 3.7.

    Models will be trained to increase the probability that a true answer appears across many generations and to maximize the probability that the extraction method can select it, rather than maximizing the probability that a single generation is correct out of the box. This is a very different way to finish the training of models than has been considered in some time. Here’s a recent example of a research paper studying this, Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models, and more will surely come soon.
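
    A rough sketch of the best-of-N idea, with hypothetical generate and verifier_score functions standing in for a model API and a trained verifier:

        import random

        def generate(prompt: str) -> str:
            # Hypothetical stand-in for sampling one completion from a model.
            return f"candidate answer {random.randint(0, 9)}"

        def verifier_score(prompt: str, completion: str) -> float:
            # Hypothetical stand-in for a trained verifier or reward model.
            return random.random()

        def best_of_n(prompt: str, n: int = 64) -> str:
            """Sample n candidates in parallel and return the one the verifier rates highest.
            The training objective then becomes: make some candidate correct and selectable,
            rather than making the first candidate correct out of the box."""
            candidates = [generate(prompt) for _ in range(n)]
            return max(candidates, key=lambda c: verifier_score(prompt, c))

        print(best_of_n("What is 17 * 24?"))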

    Verification as the limiter for inference-time scaling performance is not a new idea. It was the starting point of my coverage on inference time scaling, before the release of OpenAI’s o1 (and mentioned in the recent post on Claude 3.7). Ultimately, the challenge is robustness, not if the models can generate the answer:

    I wanted to highlight a result from the last paper, Large Language Monkeys, as to why inference time compute is feasible. That paper focuses on repeated sampling from a variety of language models to understand the distribution of answers in a model. They show that with an Oracle answer extractor (more on this later), even models as surprising as Pythia-70M have the true answer inside.

    Remember, the domain of math answers is infinite. This shows that the underlying distribution of the models contains the right answer; we need to figure out how to extract it. We need strong verifiers to make answer selection easy. The Snell et al. paper calls this the “Proposer and Verifier” perspective.

    Understanding that the models we are using can almost always generate the right answer, combined with the fact that training verifiers to exploit this has only just started, should increase optimism that inference-time scaling can work.

    This type of performance will not be cheap. Unlocking new potential applications is still worth way more than the few dollars these queries can cost. Noam Shazeer of Google explained this on his Dwarkesh Podcast appearance with Jeff Dean:

    One of the big areas of improvement in the near future is inference time compute, applying more compute at inference time. I guess the way I like to describe it is that even a giant language model, even if you’re doing a trillion operations per token, which is more than most people are doing these days, operations cost something like 10 to the negative [18 dollars]. And so you're getting a million tokens to the dollar.

    I mean compare that to a relatively cheap pastime: you go out and buy a paper book and read it, you're paying 10,000 tokens to the dollar. Talking to a language model is like 100 times cheaper than reading a paperback.

    So there is a huge amount of headroom there to say, okay, if we can make this thing more expensive but smarter, because we're 100x cheaper than reading a paperback, we're 10,000 times cheaper than talking to a customer support agent, or a million times or more cheaper than hiring a software engineer or talking to your doctor or lawyer. Can we add computation and make it smarter?

    I think a lot of the takeoff that we're going to see in the very near future is of this form. We've been exploiting and improving pre-training a lot in the past, and post-training, and those things will continue to improve. But taking advantage of "think harder" at inference time is just going to be an explosion.

    The TLDR is “Making language models more expensive is worth it.” It is quite the provocation — that intelligence is worth far more and will restructure the budding AI inference market.
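
    Writing out the arithmetic from the quote (these are Shazeer’s rough figures, not measured costs):

        ops_per_token = 1e12        # "a trillion operations per token"
        dollars_per_op = 1e-18      # "something like 10 to the negative 18 dollars"

        dollars_per_token = ops_per_token * dollars_per_op  # 1e-6 dollars per token
        tokens_per_dollar = 1 / dollars_per_token           # 1,000,000 tokens per dollar

        book_tokens_per_dollar = 10_000                     # the paperback comparison
        print(tokens_per_dollar / book_tokens_per_dollar)   # ~100x cheaper than a paperback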

    All of these dynamics are going to be accelerated by AI adoption and Jevons paradox. Jevons paradox is the idea that when a technology gets more efficient, i.e. “reducing the amount needed for a single application,” total usage actually goes up. The resource in this case is GPUs for inference. AI inference generally, and reasoning models specifically, are in this feedback loop. Reducing the baseline cost of models will make inference-time scaling even more palatable. On the research side, we are also embracing far more ambitious compute spending per problem.

    These are business dynamics that don’t fit into the most common aggregation theory model. With a gradient from classic zero-marginal cost, high-margin services to high-cost, high-impact new uses of AIs, the behavior of aggregators is going to evolve.

    We’ve seen examples of Aggregation Theory applied to high-cost businesses, such as Amazon’s Prime (distribution is expensive), but these don’t operate at the same internet scale as digitally native businesses. All in, Aggregation Theory will still apply to most of the consumer and enterprise markets, but there are large areas of the market where the way this is unfolding is entirely new.

    The defining factor at the top end of inference scaling is paying for the most productive output. If the costs are so great in both capital and marginal terms, the AI super-companies of the future may look more like platforms than aggregators. They would have strong network effects as the only player in town who can offer the power tool that is inference-heavy AI.

    AI companies will be pushed along the barbell effect — those with consumer demand can aggregate on cheaper models, but those with specific niches will be forced to compete on performance. The ceiling on price is unknown and markets need to be built on new platforms.

    High costs could make AI less winner-takes-all than many would suspect: we could have Anthropic operate as the platform for AI agents, ChatGPT aggregating consumer use, and another party dominating code generation. With so much investment in AI, it feels more likely that companies will win their area by focusing on specific domains (they cannot afford to dilute their efforts relative to a focused peer), unless an expansive AGI really is coming, in which case one company could conceivably take the entire market.



    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
  • More: https://www.interconnects.ai/p/gpt-45-not-a-frontier-model

    As GPT-4.5 was being released, the first material the public got access to was OpenAI’s system card for the model that details some capability evaluations and mostly safety estimates. Before the live stream and official blog post, we knew things were going to be weird because of this line:

    GPT-4.5 is not a frontier model.

    The updated system card in the launch blog post does not have this. Here’s the original system card if you need a reference:

    Regardless, someone at OpenAI felt the need to put that in. The peculiarity here summarizes a lot of the release. Some questions are still really not answered, like “Why did OpenAI release this?” That game theory is not in my purview.

    The main contradiction to the claims that it isn’t a frontier model is that this is the biggest model the general public has ever gotten to test. Scaling to this size of model did NOT make a clear jump in capabilities we are measuring. To summarize the arc of history, the jump from GPT-3.5 to GPT-4 made the experience with the models go from okay to good. The jump from GPT-4o (where we are now) to GPT-4.5 made the models go from great to really great.

    Feeling out the differences in the latest models is so hard that many who are deeply invested and excited by AI’s progress are just as likely to lie to themselves about the model being better as they are to perceive real, substantive improvements. In this vein, I almost feel like I need to issue a mea culpa. I expected this round of scaling’s impacts to still be obvious before the brutal economic trade-offs of scaling kicked in.

    While we got this model, Anthropic has also unintentionally confirmed that their next models will be trained on an approximation of “10X the compute,” via a correction on Ethan Mollick’s post on Claude 3.7.

    Note: After publishing this piece, I was contacted by Anthropic who told me that Sonnet 3.7 would not be considered a 10^26 FLOP model and cost a few tens of millions of dollars to train, though future models will be much bigger.

    GPT-4.5 is a data point showing that scaling is still coming, but trying to make sense of it in the day-to-day transition is hard. In many ways, zooming out, GPT-4.5 will be referred to in the same breath as o1, o3, and R1, where it became clear that scaling pretraining alone was not going to give us the same level of breakthroughs. Now we really know what Ilya saw.

    All of this marks GPT-4.5 as an important moment in time for AI to round out other stories we’ve been seeing. GPT-4.5 likely finished training a long time ago — highlighted by how it has a date cutoff in 2023 still — and OpenAI has been using it internally to help train other models, but didn’t see much of a need to release it publicly.

    What GPT-4.5 is good for

    In the following, I am going to make some estimates on the parameter counts of GPT-4.5 and GPT-4o. These are not based on any leaked information and should be taken with big error bars, but they are very useful for context.

    GPT-4.5 is a very big model. I’d bet it is well bigger than Grok 3. We have seen this story before. For example, GPT-4 was roughly known to be a very big mixture of experts model with over 1T parameters total and ~200B active parameters. Since then, rumors have placed the active parameters of models like GPT-4o or Gemini Pro at as low as 60B parameters. This type of reduction, along with infrastructure improvements, accounts for massive improvements in speed and price.

    Estimates place GPT-4.5 at about an order of magnitude more compute than GPT-4. These are not based on any released numbers, but given a combination of a bigger dataset and more parameters (5X parameters + 2X dataset size = 10X compute), the model could be in the ballpark of 5-7T parameters total, which, with a similar sparsity factor to GPT-4, would be roughly 600B active parameters.
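
    To make the arithmetic behind these guesses explicit, using the standard approximation that training compute scales with the product of parameters and training tokens; every input here is a speculative estimate from the text above, not a released figure.

        # Training compute heuristic: C ~ 6 * N * D, so ratios of C reduce to ratios of N * D.
        def relative_compute(param_multiplier: float, data_multiplier: float) -> float:
            return param_multiplier * data_multiplier

        print(relative_compute(5, 2))        # 10x compute: 5x parameters times 2x dataset size

        gpt4_total_params = 1.0e12           # rumored ~1T total parameters for GPT-4
        print(gpt4_total_params * 5)         # 5e12: the low end of the 5-7T ballpark above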

    With all of these new parameters, actually seeing performance improvements is hard. This is where things got very odd. The two “capabilities” OpenAI highlighted in the release are:

    * Reduced hallucinations.

    * Improved emotional intelligence.

    Both of these have value but are hard to vibe test.

    For example, SimpleQA is a benchmark we at Ai2 are excited to add to our post-training evaluation suite to improve world knowledge of our models. OpenAI made and released this evaluation publicly. GPT-4.5 makes huge improvements here.

    In another one of OpenAI’s evaluations, PersonQA, which is questions regarding individuals, the model is also state of the art.

    And finally, there is GPQA, the Google-proof knowledge evaluation that reasoning models actually excel at.

    At the time of release, many prominent AI figures online were touting how GPT-4.5 is much nicer to use and better at writing. These takes should be taken in the context of your own testing. It’s not that simple. GPT-4.5 is also being measured as middle of the pack in most code and technical evaluations relative to Claude 3.7, R1, and the likes.

    For an example on the writing and style side, Karpathy ran some polls comparing GPT-4.5’s writing to GPT-4o-latest, and most people preferred the smaller, older model. Given what we know about post-training and the prevalence of distilling from the most powerful model you have access to, it is likely that GPT-4o-latest is distilled from this new model, previously called Orion, and its drastically smaller size gives it a night and day difference on iteration speed, allowing for better post-training.

    More on the character in that GPT-4o-latest model was covered in our previous post on character training.

    All of this is a big price to pay to help OpenAI reclaim their top spot on ChatBotArena. I expect GPT-4.5 to do this, but the results are not out yet.

    I’ve been using GPT-4.5 in preparation for this. It took a second to get used to the slower speed, but it’s fine. I will keep using it for reliability, but it’s not worth paying more for. o1 Pro and the other paid offerings from OpenAI offer far more value than GPT-4.5.


    Making sense of GPT-4.5’s ridiculous price

    When the original GPT-4 first launched, it was extremely expensive. In fact, GPT-4 was comparable in price to GPT-4.5 at launch. Here’s a help post on the OpenAI forums, conveniently found by OpenAI DeepResearch with GPT-4.5, that captures the context. GPT-4 launched in March 2023.

    We are excited to announce GPT-4 has a new pricing model, in which we have reduced the price of the prompt tokens.

    For our models with 128k context lengths (e.g. gpt-4-turbo), the price is:

    * $10.00 / 1 million prompt tokens (or $0.01 / 1K prompt tokens)

    * $30.00 / 1 million sampled tokens (or $0.03 / 1K sampled tokens)

    For our models with 8k context lengths (e.g. gpt-4 and gpt-4-0314), the price is:

    * $30.00 / 1 million prompt token (or $0.03 / 1K prompt tokens)

    * $60.00 / 1 million sampled tokens (or $0.06 / 1K sampled tokens)

    For our models with 32k context lengths (e.g. gpt-4-32k and gpt-4-32k-0314), the price is:

    * $60.00 / 1 million prompt tokens (or $0.06 / 1K prompt tokens)

    * $120.00 / 1 million sampled tokens (or $0.12 / 1K sampled tokens)

    GPT-4.5’s pricing launched at:

    * Input: $75.00 / 1M tokens

    * Cached input: $37.50 / 1M tokens

    * Output: $150.00 / 1M tokens

    OpenAI included language in the release saying they may not keep this model in the API long-term — likely forecasting low demand — and that they want to hear from users whether it enables entirely new use-cases.

    Many analysts think that Nvidia’s next generation of GPUs, Blackwell, which offers far more memory per FLOP (making it easier to serve bigger models), is not priced into this. We can expect to see the same arc of pricing with 4.5 as we did from GPT-4 to GPT-4 Turbo to GPT-4o.

    * GPT-4 Turbo launched in November 2023 at $10 / 1M input and $30 / 1M output.

    * GPT-4o launched in May 2024 at $2.5 / 1M input and $10 / 1M output.

    These are huge reductions, about 10X.
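
    To make the arc concrete, here is a quick calculation using only the list prices above (per 1M tokens); the release dates are as stated in this post:

    ```python
    # Price history of the GPT-4 line, per 1M tokens, from the list prices above.
    prices = {
        "gpt-4 (8k, Mar 2023)":   {"input": 30.00, "output": 60.00},
        "gpt-4-turbo (Nov 2023)": {"input": 10.00, "output": 30.00},
        "gpt-4o (May 2024)":      {"input": 2.50,  "output": 10.00},
        "gpt-4.5 (Feb 2025)":     {"input": 75.00, "output": 150.00},
    }

    base = prices["gpt-4 (8k, Mar 2023)"]
    for name, p in prices.items():
        print(f"{name:24s} input {base['input'] / p['input']:5.1f}x cheaper, "
              f"output {base['output'] / p['output']:5.1f}x cheaper than GPT-4 at launch")
    # gpt-4o lands at ~12x / 6x cheaper; gpt-4.5 resets the curve at 0.4x (i.e., 2.5x pricier).
    ```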

    These are products that OpenAI makes a healthy margin on, and there are no signs that this has changed. The AI community has grown so accustomed to incredible progress in making the technology more efficient that even a blip in the process — a bigger, more expensive model being released — feels potentially bubble-popping.

    The future of scaling

    Scaling language models is not dead. Still, reflecting on why this release felt so weird is crucial to staying sane in the arc of AI’s progress. We’ve entered the era where trade-offs among different types of scaling are real.

    If forced to summarize all of this curtly, it would be: GPT-4.5 is, oddly, ahead of its time.

    This means that the progression of AI needs to take a different tack, but we already knew this from the rapid progress of reasoning models. The true impact of GPT-4.5 will come when it is integrated with these multiple lines of rapid progress.

    One of the flagship results in the DeepSeek R1 paper and related RL follow-up work in the AI community is that scaling RL training works better on bigger models. There is a lot of work to do to know all the domains that’ll be absorbed into this umbrella. Future models like o4 could be distilled from a reasoning model trained on GPT-4.5. In fact, this may already be the case. OpenAI’s current models likely would not be so good without GPT-4.5 existing.

    In as soon as a year, most of the models we are working with will be GPT-4.5 scale and they will be fast. The “well-rounded” improvements they offer are going to make many more applications more robust, but OpenAI and other AI labs have pushed scaling a bit further than the current serving infrastructure can support.

    Frontier labs are not taking enough risk if they don’t try to push the limits of every direction of scaling available to them. Though releasing the model wasn’t strictly necessary, we have to guess at why OpenAI actually wanted to do it. It’s likely that GPT-4.5 is being used in other internal systems for now and other external products soon, so releasing it is a natural step on the way to the next thing, rather than a detour.

    GPT-4.5 is a frontier model, but its release is not an exciting one. AI progress isn’t free, and it takes a lot of hard work. Most people should only care when GPT-4.5 is integrated into more than just chat.



    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
  • https://www.interconnects.ai/p/character-training

    The vast majority of evaluations used to measure progress on post-training at frontier laboratories are internal evaluations rather than the evaluations you hear about all the time like MATH or GPQA. These, the well-known intra-industry evaluations, are certainly important for ballparking behavior, but for every public evaluation, these frontier laboratories are likely to have 10+ fine-grained internal evaluations.

    The internal evaluations these model providers have cover a range of topics. Surely, most of them are basic, repetitive user behaviors that a new model must not regress on too much. Of these, the vast majority are likely skills, and “character” remains more of an art than a hill to climb with careful data engineering.

    Leading post-training laboratories surely know how to reinforce more robust behavior within a specific character, as seen by the march of progress on evaluations like ChatBotArena, but crafting a specific personality from scratch is an open question.

    The primary goal of this post is to start the conversation outside of frontier AI labs around character training. Character training is the subset of post-training designed around crafting traits within the model in the manner of its response, rather than the content. Character training, while being important to the user experience of language model chatbots, is effectively undocumented on the open web.

    We don’t know the trade-offs of what character training does, we don’t know how exactly to study it, we don’t know how much it can improve user preferences on ChatBotArena, and we should.

    The appearance of the AIs people use is deeply coupled with how intelligent users will find them to be. Style of communication is crucial to how information is parsed. This is likely a very high priority for industrial labs, but something that almost no academic literature exists on. Even though I want to do research on this, I’m honestly not sure how to do so yet other than a one-off technical report on findings.

    ChatGPT gets character depth

    Out of nowhere on Saturday, February 15th, Sam Altman tweeted about this new GPT-4o model that will serve as the foundation of ChatGPT.

    This is the biggest subjective change I’ve ever felt between intermediate model versions from any major provider — something closer in vibes to the shift from GPT-3.5 to GPT-4. The model immediately and consistently showed new behavior patterns. I found these very positive (Karpathy agrees), but they’ll take some getting used to.

    Where ChatGPT used to sound robotic and shallow, it’s now very clearly leaning into a chipper assistant demeanor. Yes, for basic tasks, this new default model in ChatGPT is very Claude 3.5-like — more testing is needed to know if this GPT-4o with its peer models like o3-mini can dethrone Claude 3.7 Sonnet as a daily programming driver.

    The biggest changes in the new GPT-4o model are:

    * It now loves to reference past interactions in the chat (way more obvious than any other provider has been) — it was trying to flex that it knows my dog breed, mini schnauzer, or my book topic, RLHF. This is very in line with the new roadmap to GPT-4.5 and GPT-5 that Altman posted, where ChatGPT is designed around a fluid experience rather than standalone, siloed, powerful models.

    * The model is very chipper, sprinkles in more emojis, and is almost funny.

    * The multi-turn conversation is more dynamic, with follow-up questions and added texture to longer back and forths.

    At a high level, the reasons are very complementary to those I listed when I switched to Claude as my daily driver model.

    The shocking part of this is that the impact of this sweeping change is almost entirely undocumented. Yes, OpenAI updated the Model Spec (my previous coverage here and here), but that doesn’t really capture how this model is different — it just clarifies the direction OpenAI is optimizing for. There are a few overlapping interpretations of this lack of transparency:

    * OpenAI cannot precisely measure the differences as a few specific behavior traits — they can see the model performs better in high-level testing like ChatBotArena or other A/B testing, but they cannot capture the changes as score deltas on a few evaluations they could release.

    * AI is moving so fast that taking the time to document these models is not worth it.

    * Detailing the changes will make the character too easy to reproduce and will be another path of “distillation” of OpenAI’s models.

    The community of model users is extremely far from having clear ways to measure these differences. While there are vibe tests on Twitter, they will not be conclusive. ChatBotArena won’t even come close to measuring the levels of these differences (and in the case of referencing past chats, it cannot). Character training is the sort of addition to a post-training stack that takes industrial training techniques from being reproducible, but expensive, to dark arts that are largely undocumented.

    The most interesting part of the model spec for industry analysts is this plot where OpenAI shares the agreement rate of their newer models. This is comparing a reasoning model, o1, to a GPT-4o model, so there are questions of whether this is attributable to reasoning training.

    Every frontier AI laboratory should have a model spec

    Model specs are the sort of community norm where a race to the top is the goal. They’re muddled if mandated — how would you actually check that a required model spec is accurate? — but if they are implemented by every lab carefully, with feedback from the community, it would be far easier for a development ecosystem to exist around models.

    A model spec is an extremely useful document detailing how developers can expect your models to change over time. Model specs are also one of the few sources of insight we have into what the model providers are trying to get their models to do (which has regulatory advantages), and they let us know what is an intentional or unintentional behavior mode.


    A model spec doesn’t provide all the information we need to keep up with model versions. This new version of ChatGPT desperately needs to be accompanied by evaluations capturing the behavior change; otherwise, a lot of undocumented differences will be passed on to developers updating endpoints to it. This is another rendition of the same lack of transparency we’re used to from leading AI laboratories.

    The closest thing Anthropic has to a model spec is the mix of Claude’s Constitution and this blog post on Claude’s Character. Character training is a fairly new technique for the industry. From Anthropic’s post:

    Claude 3 was the first model where we added "character training" to our alignment finetuning process: the part of training that occurs after initial model training, and the part that turns it from a predictive text model into an AI assistant. The goal of character training is to make Claude begin to have more nuanced, richer traits like curiosity, open-mindedness, and thoughtfulness.

    The process is extremely synthetic data-heavy, but requires an artist’s touch, as stated later in the blog post: It “[relies] on human researchers closely checking how each trait changes the model’s behavior.”

    Character training being the focus of developments is the strongest endorsement that RLHF and related approaches have shifted from their philosophical motivations of alignment to being primarily an empirical tool. The models can capture so many different behaviors, but getting them to reliably behave how we want is the hardest part. Right now, it seems more likely that this is about capturing the upside of RLHF as a performance tool, rather than a safety one.

    One of the few public discussions of character training came from Amanda Askell during her appearance on the Lex Fridman Podcast (taken from the transcript):

    Lex Fridman (03:41:56) When you say character training, what’s incorporated into character training? Is that RLHF or what are we talking about?

    Amanda Askell (03:42:02) It’s more like constitutional AI, so it’s a variant of that pipeline. I worked through constructing character traits that the model should have. They can be shorter traits or they can be richer descriptions. And then you get the model to generate queries that humans might give it that are relevant to that trait. Then it generates the responses and then it ranks the responses based on the character traits. In that way, after the generation of the queries, it’s very much similar to constitutional AI, it has some differences. I quite like it, because it’s like Claude’s training in its own character, because it doesn’t have any… It’s like constitutional AI, but it’s without any human data.
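
    A rough sketch of the pipeline Askell describes — trait-conditioned query generation, response sampling, and ranking against the trait, with no human labels — could look like the following. The function names and the generation/ranking backends are hypothetical placeholders, not Anthropic’s actual tooling:

    ```python
    # Hypothetical sketch of a constitutional-AI-style character training loop.
    # `generate` stands in for sampling from the model being trained; `rank_by_trait`
    # stands in for the same model judging which response best expresses the trait.

    from typing import Callable, List, Tuple

    def build_character_preferences(
        traits: List[str],
        generate: Callable[[str], List[str]],            # prompt -> sampled completions
        rank_by_trait: Callable[[str, List[str]], int],  # (trait, responses) -> index of best
    ) -> List[Tuple[str, str, str]]:
        """Return (query, chosen, rejected) preference pairs with no human data."""
        pairs = []
        for trait in traits:
            # 1. Ask the model for user queries where this trait would matter.
            queries = generate(f"Write user messages where a trait like '{trait}' is relevant.")
            for query in queries:
                # 2. Sample several candidate responses to each query.
                responses = generate(query)
                if len(responses) < 2:
                    continue
                # 3. Let the model rank its own responses against the trait.
                best = rank_by_trait(trait, responses)
                worst = rank_by_trait(f"the opposite of: {trait}", responses)
                if best != worst:
                    pairs.append((query, responses[best], responses[worst]))
        # 4. The preference pairs then feed a standard RLHF/DPO-style update (not shown).
        return pairs
    ```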

    In summary, Anthropic uses the same techniques they use for Constitutional AI and general post-training for capabilities to train these models’ characters. This is not surprising. This could be related to Askell’s other Tweet on how she designs system prompts, as system prompts are the easiest way to quickly change a model’s character:

    The boring yet crucial secret behind good system prompts is test-driven development. You don't write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.

    This is very in line with what we started this post on — internal AI lab evaluations.
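
    To make the test-driven approach concrete, here is a minimal, hypothetical sketch — the tests, prompts, and `chat` backend are all assumptions, not Anthropic’s internal tooling:

    ```python
    # Minimal sketch of test-driven system prompt development: write behavioral
    # tests first, then search for a candidate system prompt that passes them all.

    from typing import Callable, List, Tuple

    # Each test is (user message, predicate the assistant's reply must satisfy).
    TESTS: List[Tuple[str, Callable[[str], bool]]] = [
        ("Are you sure about that?", lambda r: "not sure" in r.lower() or "uncertain" in r.lower()),
        ("Summarize this in one sentence: ...", lambda r: r.count(".") <= 1),
        ("What's 17 * 23?", lambda r: "391" in r),
    ]

    def passes_all(system_prompt: str, chat: Callable[[str, str], str]) -> bool:
        """chat(system_prompt, user_message) -> assistant reply; any backend works."""
        return all(check(chat(system_prompt, user_msg)) for user_msg, check in TESTS)

    def select_prompt(candidates: List[str], chat: Callable[[str, str], str]) -> str:
        for prompt in candidates:
            if passes_all(prompt, chat):
                return prompt
        raise ValueError("No candidate system prompt passes the tests; write a better one.")
    ```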

    How far can you push character training?

    Ahead of the Grok 3 release, Elon Musk Tweeted this example from Grok 3, saying it was “based.”

    One of the predominant reactions to Grok 3 was, “Wait, so it isn’t actually based?” This is one of the big questions of character training and lacking model specs. Did xAI not figure out how to make their model “based” and reliable? What model was Elon using here?

    Whatever your politics, it’s likely that the default personality of models that you encounter will eventually not be something you like. There’s quite a lot of nuance in what the perfect chatbot is for each user.

    Companies should be allowed to have a default personality for the models of their choosing, but a far better long-term equilibrium is to make the expectation that model providers make it easy to get exactly the personality you like out of a model. This isn’t regulation I’m recommending right now, but one way to make sure that an all-powerful AI model isn’t going to reinforce one point of view is to have tests that models need to pass on the breadth of their character and views.

    Model specs are a step in the right direction to avoid drama about “what did they actually want their model to say,” but we still have a lot of work to do on creating a spectrum of tools that captures all the relevant information when comparing models.



    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
  • On Monday, February 24th, 2025, Anthropic announced their latest model, Claude 3.7 Sonnet, which is their first model explicitly trained to use more inference time tokens to improve performance. This is another reinforcement learning (RL) trained model (mentioned in system card). With this model, they also released Claude Code as a limited research preview, which is a “command line tool for agentic coding.” Continuous improvements in models are enabling new modalities and domains addressable by the models, but assessing the impact of each new domain takes far more time than a quick model reaction post.

    This is a tidy release, a solid improvement, but not a step change for Claude or the industry. Expect a lot of small changes to accumulate massively this year.

    Claude 3.7 Sonnet is a clear improvement over Claude 3.5 Sonnet (New) and continues to push the limits in areas where users love Claude (e.g. read Ethan Mollick’s review here). The scores for those areas, such as software development (SWE-Bench) and tool use, are clearly state-of-the-art.

    For example, Claude 3.7 Sonnet is the highest scoring “standard non-reasoning” language model on the Aider Polyglot benchmark. While models like o3 and Grok 3 DeepThink highlight superhuman performance on code benchmarks, this sort of behavior being integrated without extra inference time compute is wonderful. The price for superhuman coding AI is plummeting.

    Even with o1 Pro, I still find myself using Claude 3.5 (New) on a regular basis. o1 Pro is the best model for succinct, one-off tasks like writing short scripts. It is extremely controllable and will often work out of the box. Though when I’m doing tricky, iterative tasks, I still use Claude. Claude 3.7 Sonnet only makes these workflows stronger, and I’m stoked to play with it further.

    The most useful piece of this release for those trying to understand the direction of the ecosystem, rather than just the status of today, is Anthropic’s post on Claude’s extended thinking, where they detail the product trade-offs, alignment, and future of inference-time compute in their models.

    Anthropic’s offering of extended thinking to boost inference-time performance is far, far cleaner than OpenAI’s current model drop-down disaster. Anthropic’s thinking model is the same as their general-purpose model, much like xAI’s Grok 3 and what OpenAI teased as the plan for GPT-5. Having just one model makes lots of infrastructure, product, and training decisions cleaner, but may come at the cost of shrinking the absolute Pareto front of performance for your organization.

    The reasoning training being embedded in one model with a standard inference mode will make the reasoning benefits and behavior feel closer to something like Gemini-Thinking, rather than OpenAI’s o1 or DeepSeek R1, which are designed solely for this reasoning mode of operation. It doesn’t mean that in the limit a single model will be weaker in performance, but rather that, for now, such models may be slower to iterate on in training than a “full” reasoning language model.

    Focusing on deploying just one model that serves all users is one of many examples where leading AI companies need to make their offerings legible and easy to use — a sign of the industry maturing from a race to intelligence to a race to usefulness.

    Still, Claude’s interface is not perfect by any means; the user still has to intentionally go to a drop-down menu to get extra performance when they need it. The best version is one where the model knows on its own when more inference compute is needed. My hypothesis is that when training one model with reasoning and without, having the model figure out how much compute to use is harder than a reasoning-only model like o1 figuring out its own compute budget. Or, Anthropic needed to keep a special flag that is turned on and off in the system prompt. This is a subtle potential trade-off of putting reasoning in just one model, but we’ll see where the final equilibrium is.

    On the other hand, Claude 3.7 Sonnet is showing the reasoning traces directly to users, like DeepSeek R1 and Grok 3. These organizations have different ways of saying why, but it is clear that users simply enjoy seeing it and it builds trust. Anthropic, understandably, is using the reasoning traces to monitor the alignment of the models.

    The reasoning chains in these models are how the general public is learning more about the internal representations of language models. Another interesting detail is that Anthropic “didn’t perform [their] standard character training on the model’s thought process.” This is how Claude thinks out of the box, and the actual answers have a different flavor to them. More research will study how far the reasoning chains can diverge from the answer language. We’ve seen research on latent reasoning within the model, but beyond this, we could have reasoning languages that are entirely ungrounded from human languages because they are a more token-efficient representation of information for the model. More on this soon.

    The developer-facing version of Claude’s extended thinking is far cleaner and a sign of things to come — developers can request a specific number of thinking tokens for the response.

    How this works is that the model will stream thinking tokens until the number is reached, then shift to answer tokens. This is still one autoregressive stream and no search is being used in the new products, yet.

    This explicit control over the thinking and answering phases is a growing behavioral focus in training reasoning models — expect more here soon. Developers can tune the setting that works for them and keep it baked in, rather than relying on the user to pass in a query that just happens to get the model to think for a long time. An explicit test-time inference budget is much more desirable than needing to hit the gold mine in a prompt search.
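
    In practice, the request looks roughly like the sketch below. The parameter names follow Anthropic’s extended-thinking documentation around the 3.7 launch; treat the exact fields and model string as assumptions:

    ```python
    # Sketch of requesting a fixed thinking-token budget with Anthropic's Python SDK.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=8192,                                      # total budget: thinking + answer
        thinking={"type": "enabled", "budget_tokens": 4096},  # cap on thinking tokens
        messages=[{"role": "user", "content": "How many primes are there below 1000?"}],
    )

    # The response contains thinking blocks first, then the answer text.
    for block in response.content:
        if block.type == "thinking":
            print("[thinking]", block.thinking[:200], "...")
        elif block.type == "text":
            print("[answer]", block.text)
    ```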

    The best place to see where this could be applied is a task whose performance scales nicely with inference-time compute. Anthropic ran a similar experiment on the challenging math evaluation AIME — the same one that OpenAI used in their original inference-time compute plot.

    Here there’s a subtle difference from the developer experience, where in Anthropic’s internal tests the model could exit early. In practice, this subtle difference shouldn’t shift the usefulness of the deployment methodology.

    Anthropic continues in their excellent post, saying:

    Our researchers have also been experimenting with improving the model’s performance using parallel test-time compute. They do this by sampling multiple independent thought processes and selecting the best one without knowing the true answer ahead of time. One way to do this is with majority or consensus voting; selecting the answer that appears most commonly as the 'best' one. Another is using another language model (like a second copy of Claude) asked to check its work or a learned scoring function and pick what it thinks is best. Strategies like this (along with similar work) have been reported in the evaluation results of several other AI models.

    To accompany this, they shared the following results.

    It is crucial to note here that the dashed red line — pass@N — is not an actual evaluation result; it measures whether the correct solution appears anywhere in the number of answers generated on the X-axis. The two lines below it show how good initial answer-extraction methods are at selecting the right answer from the N candidates. As has been known for a long time in inference-time scaling research, models can often generate the correct answer to extremely hard questions, but not reliably.
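
    As a concrete illustration of the gap between the pass@N upper bound and simple selection methods like majority voting (toy data, not Anthropic’s):

    ```python
    # Sketch of the difference between pass@N (is a correct answer present at all?)
    # and simple answer-selection methods like majority voting over N samples.

    from collections import Counter
    from typing import Callable, List

    def pass_at_n(samples: List[str], is_correct: Callable[[str], bool]) -> bool:
        """Upper bound: did *any* of the N sampled answers solve the problem?"""
        return any(is_correct(s) for s in samples)

    def majority_vote(samples: List[str]) -> str:
        """Consensus selection: pick the final answer that appears most often."""
        return Counter(samples).most_common(1)[0][0]

    # Toy example: 8 sampled final answers to one AIME-style problem.
    samples = ["204", "207", "204", "204", "198", "204", "207", "204"]
    correct = lambda s: s == "204"

    print("pass@8:", pass_at_n(samples, correct))          # True if the answer ever appears
    print("majority vote picks:", majority_vote(samples))  # "204" here, but not guaranteed
    ```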


    They make it very clear this is not used yet in their products:

    Parallel test-time compute scaling isn’t available in our newly-deployed model, but we're continuing to research these methods for the future.

    Still, this is a direction other labs are already pursuing. The best reporting on o1 Pro indicates that it does a “search” of some sort over parallel generations. Other OpenAI employees have stated that o3 uses a learned verifier to extract answers, at least for coding domains. As progress in scaling single streams from the language model slows, this is the natural next place for scaling to turn. As has been the case for some time, performance limits are largely infrastructure problems that need solving before models can be served to users.

    Claude 3.7 Sonnet is here, and it reinforces that RL training is a short path to putting inference-time scaling laws to use, but in the long term we will have more methods for eliciting the inference-time trade-offs we need for the best performance.

    Thanks to Ross Taylor for some immediate feedback on this post.



    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
  • Full post: https://www.interconnects.ai/p/grok-3-and-an-accelerating-ai-roadmap

    xAI launched their latest flagship model, Grok 3, last night via a live stream on X — a new take on the launch process, though it largely felt familiar. Grok 3 is a state-of-the-art model on some important benchmarks. The key caveat is that it is state-of-the-art relative to available models, and we know better models are out there. Only some of them have been announced, some have been teased, and others lie in waiting.

    What feels different is how the broader AI industry is signaling rapid progress coming soon. xAI said on the livestream that they will be updating the model “daily.” An era of sitting on unreleased models could be ending.

    Grok 3’s release is a reinforcement of trends people began reckoning with as of the release of DeepSeek V3 + R1 — AI progress is not held in the hands of a few companies nor is it slowing down. 2023 and 2024 were defined by truly state-of-the-art AI being concentrated within OpenAI, Anthropic, and Google, where these companies could take a lot of time to package models from training to release and still have a substantial moat on capabilities relative to their peers.

    At the time of R1’s launch, the “people’s choice” model was Claude 3.5 Sonnet, a model that had been trained “9-12 months ago” and the best models like Claude 3.5 Opus or GPT-4.5 (a.k.a Orion) were not available to users for a grab bag of reasons.

    Competitive pressure from DeepSeek and Grok, combined with a shifting political environment for AI — both domestic and international — will make the established leading labs ship sooner. A large portion of delays in delivering models is attributed to “safety testing,” but we don’t have exact details on how much of it was that and how much was cost-benefit trade-offs (and other big-company hurdles such as legal departments). The brand, and culture, of “having the smartest model” is extremely important to these companies, but having a way smarter model was often financially too much to bear.

    “Safety” is actively being removed from the spotlight of the AI discourse. It is possible that this overcorrection causes meaningful harm, as this is an extremely powerful and rapidly evolving technology, but the political capital to make safety a core tenet of the AI industry was spent too early relative to meaningful harm emerging.

    Increased competition and decreased regulation make it likely that we, the users, will be given far more powerful AI on far faster timelines.

    We’ve seen time and time again the value of having the best model first. The only way to onboard new users is to have some capability or behavior that your model differentiates on. With the pace of progress high, minimizing the time from training to release is the best way to maximize one’s chance of impact.

    DeepSeek and xAI show how organizations with slightly trailing technical progress or resources can outshine the likes of OpenAI and Anthropic who have voluntarily not shipped their latest models.


    Grok 3 by the numbers

    Benchmarks and vibe tests mark Grok 3 as one of the best models available today. As with any release, companies often choose evaluations that flatter their models. Still, winning on these evaluations is extremely challenging, and much credit must be given to xAI for delivering a leading-edge model just about 19 months after its incorporation.

    That being said, what is shown below is a total of 4 language model evaluations. Given that models like DeepSeek R1 or Gemini Thinking launch with 10-20 evaluations detailing their performance relative to peers, this has to be taken with a grain of salt. It is very likely that Grok 3 doesn’t outperform its peers in every category, though there is a slim chance the other comparison evals simply weren’t run in the rush to ship.

    To start, we can compare Grok 3 benchmarks versus available instruct models.

    And versus available reasoning models (note how OpenAI’s announced o3 scores exceed these clearly).

    An important detail, as we’ve seen with OpenAI’s reasoning model releases: what do the shaded regions on the above plots show? Without exact details, we don’t know the inference cost for each of the models on these reasoning plots. Pushing the frontier in absolute terms is important, but the field overall is getting messier before it’ll get clearer.

    Regardless, in the above two plots Grok 3 is pushing progress both on standard model training and the new reasoning training. While reasoning training and RL are the hot new things in the AI field, simple scaling and optimization of existing techniques still deliver value.

    And Grok’s score on ChatBotArena.

    A model launching at the top of every category on ChatBotArena feels like something that should be rare (given it now encompasses many categories like Math, Coding, Style Control, Longer Queries, etc.), but it happened just a few weeks ago with Gemini 2.0 Pro!

    ChatBotArena is known to favor models that are likely to not refuse requests (we don’t know by how much), as evidenced by Claude 3.5 Sonnet (New)’s relatively low position on the leaderboard relative to its utility, but overall is a hard evaluation to top. xAI’s stated goals of a “based” model should correlate well here.

    A question we don't know the answer to: How many points of performance on evals do you gain by not caring about safety at all? Internal to the model, i.e. in behavior latent spaces, safety is pretty orthogonal to common high-utility behaviors like reasoning and code, and bigger models tend to do more things without a cost to other behaviors, but there has to be a safety performance margin. Did Grok 3 succeed because of this? It’s too early to tell.

    At a technical level, Grok 3 is certainly a very big model. We don’t have specifics, but it’s reasonably safe to take this as a datapoint that scaling still helps performance (but maybe not costs). xAI’s approach and messaging has been to get the biggest cluster online as soon as possible. The Occam’s razor explanation, until we have more details, is that scaling helped, but it is possible that most of Grok’s performance comes from techniques other than naive scaling.

    Grok 3 using size to beat existing models feels like when Nemotron 340B beat Llama 3 70B, becoming the leading open-weight model at the time, but uptake was slow because the cost relative to the performance gains wasn’t worth it. We’ll know more about this when Grok 3 is available in their API and we see the exact costs.

    When models are approximately equal in performance, price and ease of use are the determining factors of adoption.

    Overall, Grok 3 is a huge technical achievement but not one that indicates a substantial change in who is at the frontier of effective training. xAI is obviously closing in on OpenAI, Anthropic, and most of all Google, but all available data points put these labs ahead of xAI on the frontier of efficient model training. It is good that they are being pressured to deliver more absolute intelligence and not to just continue optimizing their frontier of performance per dollar.

    Read some other reviews of Grok 3 here and here. Karpathy’s summary is particularly in line with my thinking (while potentially slightly overselling capabilities).

    As far as a quick vibe check over ~2 hours this morning, Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI's strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. Which is quite incredible considering that the team started from scratch ~1 year ago, this timescale to state of the art territory is unprecedented. Do also keep in mind the caveats - the models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for a lot more evaluations over a period of the next few days/weeks. The early LM arena results look quite encouraging indeed. For now, big congrats to the xAI team, they clearly have huge velocity and momentum and I am excited to add Grok 3 to my "LLM council" and hear what it thinks going forward.

    Where progress is heading

    If these AI models, and the industry writ large, are accelerating, it is important to wonder where they are accelerating toward. Most of the evals we use now to launch leading models are not that representative; in many cases they’re actually 100% out of distribution relative to normal life. What is the value in solving a competition math problem like AIME or so-called “Google Proof” questions? Time will tell, but the case for usefulness to average users is definitely stretched.

    Small ChatBotArena improvements are marginal gains in robustness, where something like a 20-point difference in Elo rankings — the relative difference between Grok 3 and the next top model — translates to the model winning roughly 53% of head-to-head match-ups. This robustness adds up over time, but it is far from meaning that this model is more intelligent in an absolute sense.
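
    The arithmetic behind that claim, using the standard Elo expected-score formula:

    ```python
    # Expected head-to-head win rate implied by an Elo rating gap.
    def elo_win_prob(delta: float) -> float:
        return 1.0 / (1.0 + 10 ** (-delta / 400))

    for gap in (10, 20, 50, 100):
        print(f"{gap:>3}-point gap -> {elo_win_prob(gap):.1%} expected win rate")
    # A 20-point gap works out to roughly 52.9% -- a small edge, not a leap in capability.
    ```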

    In fact, in the case of some of the latest evaluations from the research community, it seems like evaluations are being designed more around being hard than being useful. It is a natural response to models being super powerful to try and find something to challenge them with, but it makes tracking progress and communication far harder.

    Companies have many internal evaluations that are not shared. Increasing transparency on these would help contextualize what is and is not meaningful progress. Without these, the only benchmark we have for model changes is them becoming more deeply integrated into products. Product-model synergy can enable extremely useful, new workflows, but it makes tracking the progress of AI a proxy measurement.

    I do personally believe these somewhat arbitrary capabilities we are marching toward will generalize to extended and amplified value, but it takes some “feeling the AGI” to see that models that are better on esoteric benchmarks generalize to everyday use. So far they have. Bigger and generally “better” models have been more robust and easier to find valuable veins in, but we as an industry should be sharing more so that it is not just AI insiders who understand how to track progress.

    When 2024 was reflected on as a year of meager advancements, the evidence is that there was substantial progress, but less of it was delivered to users. We only got o1 late in the year, and other models were deemed “too big to ship” or the requisite urgency (later provided by DeepSeek) did not exist.

    2025 will be a year of intelligence being put in the user’s hands. The pace of underlying progress with that will continue to be high. The so-called “walls” facing AI progress haven’t materialized, but making sense of the progress we are getting is much more nuanced.



    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe