OpenAI's Yann Dubois: Why AI Progress Suddenly Feels Real | Edited Transcript
A professionally copyedited transcript of Yann Dubois on GPT-5.5, post-training, reinforcement learning, reasoning models, evals, continual learning, agent harnesses, and where startups can still build.
Chapter Timestamps
00:00 Cold open: reliability crossed the threshold where AI tools start feeling practically useful
00:34 Matt Turck introduces Yann Dubois, GPT-5.5, post-training, reasoning models, and continual learning
01:30 Why progress feels sudden: reliability, internal AI-use feedback loops, and RL moving into real-world work
04:13 Model reliability and the release cycle: reducing step-by-step error rates and shipping GPT-5.5
07:33 How OpenAI coordinates vertical teams, horizontal capabilities, and the Post-training Frontiers group
09:49 Efficiency and test-time compute: shifting the latency-performance curve left
12:32 Yann’s path from Switzerland, multilingual NLP, Stanford Alpaca, and impact-driven work
15:37 Reasoning in 2026: from verifiable rewards to messy user utility and realistic evals
18:34 GPT-5.5 Thinking versus Pro: more test-time compute, slower answers, higher correctness probability
20:09 How reasoning becomes more efficient: expert-like search, backtracking, and avoiding bad paths earlier
23:23 Pre-training scaling, larger models, token efficiency, and why the data wall has not fully arrived
27:03 Multimodal data, synthetic data, embodied AI, world models, and common-sense grounding
31:05 Mid-training and post-training: overweighting high-quality data, SFT, behavior cloning, and RL
37:21 Whether RL creates new capabilities or unlocks latent ones from pre-training
38:53 Why RL was once considered too finicky and why scale made it work
42:17 The frontier of scaling RL: sampling cost, agentic rollouts, credit attribution, and GRPO
43:09 Model building as craft and science: alchemy first, then systematic explanation and engineering
45:11 Task-specific post-training: fast final-stage iteration, vertical datasets, and reward design
48:21 Generalization across domains: horizontal capabilities, GDPval, and why real-world ambiguity is harder than contests
54:18 Hallucinations as a training problem: how SFT can reward confident invention and RL can penalize it
56:04 Negative generalization, conflicting instructions, and tradeoffs between helpfulness and precision
58:05 Scaling RL to law, medicine, finance, and the rest of the economy through domain priority and data
1:00:19 The evaluation bottleneck: model-as-judge, GDPval, expert grading, and trust in automated evals
1:04:21 Continuous progress versus discontinuous user experience, plus the unsolved continual-learning problem
1:08:49 Will foundation models absorb the agent harness, and what still belongs in external tools
1:11:23 Where startups should build: vertical last-mile workflows, integration, and domain-specific product judgment
AI suddenly feels like it has crossed a threshold, and Yann Dubois, co-lead of the Post-training Frontiers team at OpenAI, joins Matt Turck to explain why. Yann’s team has led the post-training behind the company’s reasoning models, including the recent GPT-5.5 release. In this conversation, we go inside the shift from raw model capability to useful, reliable systems: what changed with GPT-5.5, why reinforcement learning is moving beyond math and coding competitions into messy real-world work, how reasoning models like GPT-5.5 actually work, the difference between GPT-5.5 Thinking and GPT-5.5 Pro, why post-training has become one of the most important frontiers in AI, and why evals, model-as-judge, hallucinations, agentic workflows, GDPval, and continual learning are now central to the next phase of frontier models. Yann also shares why continual learning remains one of AI’s biggest unsolved problems three years after ChatGPT, and where startups still have massive room to build as frontier models race ahead.
Transcript
00:02-00:34
Yann Dubois: You need to reach a certain level of reliability to make these AI tools truly useful. I think we crossed that threshold around December of last year, at least at OpenAI. Now, we can trust these models to handle a significant portion of the work we’re doing. The last few months have been incredible; we’ve moved from focusing on competitions to providing real usefulness to users, and that’s what we’re experiencing right now. I believe the biggest challenge is often the “last mile.” There will always be plenty of room for innovation in that final stage across different verticals, and I highly encourage people to continue working on that.
00:34-01:27
Matt Turck: Hi, I’m Matt Turck. Welcome to the MAD Podcast. My guest today is Yann Dubois, who co-leads the Post-training Frontiers team at OpenAI. The recent release of GPT-5.5 was yet another major milestone in AI, and Yann’s team helped build it alongside OpenAI’s previous top reasoning models, including o3 and GPT-5 Thinking. Before joining OpenAI, Yann was at Stanford, where he co-authored Stanford Alpaca—the landmark project that kicked off much of the modern post-training research community. In this conversation, we go deep into what’s actually new in GPT-5.5, why reinforcement learning is moving from math and coding competitions into messy, real-world work, why AI progress can feel like a sudden step function, and why continual learning remains one of the big unsolved problems in AI three years after ChatGPT. Please enjoy this fantastic conversation with Yann Dubois.
01:27-01:29
Matt Turck: Hey Yann, welcome.
01:29-01:30
Yann Dubois: Hi Matt, thanks for having me.
01:30-01:54
Matt Turck: It’s been another wild few weeks in the world of frontier AI with the release of GPT-5.5 and the Claude Mythos preview. It feels like we’ve unlocked yet another step function in progress, particularly in cybersecurity and agentic coding. From your perspective, what’s the best way to think about this? Are things accelerating? What is happening?
01:57-03:43
Yann Dubois: Yeah, the last few months have been pretty wild. Internally, we really feel it too. I think anyone who is coding right now is truly feeling the shift. I believe there are three main reasons for this.
First, even though I think progress is actually quite continuous, you have to reach a certain level of reliability before these AI tools become truly useful. I think we crossed that threshold around December of last year. At least at OpenAI, that’s when I felt we reached the point where we could trust these models to do a significant portion of our work. So, it feels like a step function to the user, even though the underlying capability growth has been steady.
The second reason is that once you have models that are this good, you start to accelerate your own progress. This is especially true for coding. Since we all code internally, we use these models to help train subsequent models and build the research tooling we need to do our jobs. This creates a feedback loop where the last few months have felt faster and faster.
The third thing I think we’re feeling is the culmination of last year’s work. We spent a lot of time building these reasoning models and pushing hard on reinforcement learning (RL). Initially, models like o1-preview or even o3 were still primarily optimized for what we call “verifiable rewards”: scenarios where we have access to a ground truth and it’s easy to test whether the model is correct or not. This applies to things like math problems or coding competitions.
03:46-04:13
Yann Dubois: We’ve moved beyond just coding competitions. What we are realizing now is that we were able to take many of the tools we built for those verifiable reward cases and apply them more generally to reinforcement learning for real-world use cases. I think that’s why it feels so different right now—we’re seeing it in practical, real-world coding rather than just competitions. We’ve moved from winning contests to being genuinely useful to users, and that’s what people are feeling.
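Editor’s sketch (not from the conversation): the “verifiable rewards” Yann describes can be pictured as reward functions that compare a model’s output against a known ground truth. The minimal Python below is only an illustration under that assumption. The function names `verifiable_reward` and `unit_test_reward` are hypothetical and do not describe OpenAI’s actual training code.

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer matches the known answer, else 0.0.

    Hypothetical helper: a math-style check against a ground-truth string.
    """
    def normalize(text: str) -> str:
        # Ignore case and whitespace so " 42 " and "42" count as the same answer.
        return text.strip().lower().replace(" ", "")

    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0


def unit_test_reward(candidate_fn, test_cases) -> float:
    """Return the fraction of (args, expected) test cases that a function passes.

    Hypothetical helper: a coding-competition-style check using unit tests.
    """
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            # A crash counts as a failed test case.
            pass
    return passed / len(test_cases)


if __name__ == "__main__":
    print(verifiable_reward(" 42 ", "42"))
    print(unit_test_reward(lambda a, b: a + b, [((1, 2), 3), ((2, 2), 5)]))
```

Both checks are cheap and objective, which is what makes them easy to optimize against. The messy real-world tasks discussed in this conversation generally lack such a clean ground truth.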
04:13-04:21
Matt Turck: Okay, fascinating. We’re going to unpack a lot of this, particularly on the reinforcement learning (RL) side. Regarding the first thing you mentioned—reliability—is that an engineering challenge? What exactly makes a model reliable in the way you mean it?
04:21-05:11
Yann Dubois: It’s a little bit of everything. But in general, since these are agentic models, if you think about it, there’s a certain probability that they’ll make a mistake every couple of minutes. The longer they run, the higher the probability that the final answer will be wrong. That’s just something inherent in agentic models. What we’ve been pushing for is decreasing that probability of error at every step. While there is a lot of reliability work happening on the applied side—and the team at OpenAI has been doing an amazing job there—I’m specifically talking about the reliability of our models and ensuring we fundamentally decrease the likelihood of them being wrong.
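Editor’s sketch (not from the conversation): the compounding effect Yann describes can be illustrated with simple arithmetic. The Python below assumes, purely for illustration, that every step fails independently with the same probability and that the model never recovers from a mistake. Real agentic runs are messier, and the error rates shown are made-up numbers.

```python
def success_probability(per_step_error: float, num_steps: int) -> float:
    """Probability that a run of num_steps completes with no mistakes.

    Assumes independent steps with a constant error rate (a simplification).
    """
    return (1.0 - per_step_error) ** num_steps


if __name__ == "__main__":
    # Hypothetical per-step error rates, each over a 100-step task.
    for error_rate in (0.05, 0.01, 0.001):
        chance = success_probability(error_rate, num_steps=100)
        print(f"per-step error {error_rate:.3f} -> success over 100 steps: {chance:.3f}")
```

Under these assumptions, a 1% per-step error rate leaves only about a 37% chance of a clean 100-step run, which is why small per-step gains can feel like a step change on long tasks.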
05:11-05:31
Matt Turck: Great. So, GPT-5.5—formerly known as “Spud”—is a big deal. I’m curious, from the inside, what are you most proud of? What did you find the most challenging? Give us some color on how the team felt during that process.
05:34-06:43
Yann Dubois: To be honest, we’re all really excited about 5.5. It’s one of those models where everyone in the company was extremely involved in the building process. I think we really feel that impact now; we’ve received a lot of attention because of 5.5, and it feels like all the stars aligned. That doesn’t always happen, but this was just a great model for that.
It’s kind of funny because, in general, with every model that looks really good early on, we all get incredibly excited. Then, a wave of doubt starts to creep in. People think, “Wait, everyone is hyping this thing internally, but it’s actually underperforming in these other areas.” Then there’s another wave where people start under-hyping it. It goes through these cycles, and how people feel internally often depends on exactly where we are in that wave when we actually ship it. That’s true for most of our models. 5.5 wasn’t that different in that regard, but it definitely had a higher amplitude. People were very excited, then not as excited, but now that we’ve shipped it, people are very happy with the external reception.
06:43-06:56
Matt Turck: How long does that process take? Including those waves of excitement going up and down—I assume it depends on the specific release and its importance—but are we talking a few weeks or a few months?
06:56-07:32
Yann Dubois: It really depends. I can’t speak to the exact specifics of what went into 5.5, but it depends on which part of the training pipeline you’re in. We have various sub-teams, including pre-training, the mid-training stage, and post-training. Usually, the closer you get to the product, with post-training being the final stage, the faster the iteration cycle is. If you’re more upstream, the iteration cycle is slower. It can range from months down to days.
07:32-07:43
Matt Turck: GPT-5.5 was particularly strong in agentic coding, computer use, knowledge work, and early scientific research. How does that work internally?
07:43-07:49
Matt Turck: Do different people focus on those specific areas? How do you actually achieve those results?
07:49-08:15
Yann Dubois: Yes, we definitely have different teams focusing on specific use cases and pushing the boundaries in those areas. My team, specifically, is the one that takes all of these vertical improvements and integrates them into the final model. You can think of us as the “smoothing function.” We have all these individual improvements, but we need to ensure the model doesn’t feel too “spiky” or behave inconsistently across different verticals.
08:15-08:33
Yann Dubois: We also have teams—and this is primarily what my team does—working on horizontal improvements. These are things like instruction following, function calling, or determining how much a model should “think” about different problems. These are very horizontal capabilities that impact every use case. So, we have both vertical and horizontal teams.
08:33-09:01
Yann Dubois: Both are essential for improving the model. The great thing is that these areas can be improved orthogonally. You might have multiple teams working on various verticals, and for a specific model release, perhaps only half of those teams integrated their improvements into the latest run. Then, for the next version, the others might catch up.
09:02-09:49
Yann Dubois: At a high level, that’s how it works. Regarding what we’re most proud of with this model, I’d highlight two things. First is the efficiency. We’ve significantly improved the model’s performance; for most tasks, it now runs about twice as fast. That’s a huge win. Second, as I mentioned before, is the internal alignment of the company. Ensuring everyone is working toward the same goal is no small feat. It takes the entire organization focusing on that single “North Star”: building one exceptional model within a specific timeline. I’m very proud of how we executed that.
09:50-10:01
Matt Turck: That’s great. Speaking of efficiency, how do you actually optimize for it? Are we talking about efficiency per token, or are we looking at serving latency? How much of that is AI research versus pure engineering?
10:02-10:54
Yann Dubois: When I say it involves the entire company, I mean it really comes from everywhere. It requires inference optimizations, but it also requires the model to be more efficient in its “thinking time.” The standard way to visualize this is with a plot where the x-axis represents the number of tokens used for thinking and the y-axis represents performance. These are the test-time scaling curves we monitor. Research focuses on shifting that curve to the left—enabling the model to think less while maintaining the same level of accuracy. Meanwhile, the inference team works on that same x-axis but translates the number of tokens into actual latency. Ultimately, what users care about is the relationship between latency on the x-axis and performance on the y-axis. That is where everything converges.
10:55-11:03
Yann Dubois: This is really what happened with GPT-5.5. That’s why I always say I’m incredibly proud of the company for this release.
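Editor’s sketch (not from the conversation): the two levers Yann describes, fewer thinking tokens from research and faster tokens from inference work, can be combined in a toy calculation. All numbers and the names `shift_left` and `latency_seconds` are hypothetical. They are not measurements of any real model.

```python
def shift_left(curve, token_factor):
    """Return a curve that reaches the same accuracy with fewer thinking tokens.

    curve is a list of (thinking_tokens, accuracy) points; token_factor < 1
    models a research improvement that needs fewer tokens per accuracy level.
    """
    return [(tokens * token_factor, accuracy) for tokens, accuracy in curve]


def latency_seconds(thinking_tokens, tokens_per_second):
    """Convert a token count into wall-clock latency for a given decode speed."""
    return thinking_tokens / tokens_per_second


if __name__ == "__main__":
    # Hypothetical test-time scaling curve: (thinking tokens, accuracy).
    baseline = [(1000, 0.60), (4000, 0.70), (16000, 0.78)]
    improved = shift_left(baseline, token_factor=0.5)

    for (old_tokens, accuracy), (new_tokens, _) in zip(baseline, improved):
        before = latency_seconds(old_tokens, tokens_per_second=100.0)
        after = latency_seconds(new_tokens, tokens_per_second=150.0)
        print(f"accuracy {accuracy:.2f}: {before:.1f}s -> {after:.1f}s")
```

The accuracy values stay fixed while the latency for each one drops, which is the user-facing latency-versus-performance view described above.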
11:03-11:14
Matt Turck: Okay, great. Let’s talk about you for a minute. You are on the Post-training Frontiers team, which you described as a horizontal team. What does that team do in general?
11:14-12:32
Yann Dubois: Broadly speaking, we are part of the post-training organization, and my team specifically is Post-training Frontiers. We focus on three main things.
First, we decide what goes into the final run. As we discussed, there are many vertical teams, and someone needs to decide what makes the cut and what doesn’t. We also provide the scientific experiments that allow people to iterate on something representative of the final model.
Second, we bring everything together and actually execute the “big run.” As you can imagine, we train on a massive number of GPUs. This requires a significant amount of infrastructure work, but also a lot of machine learning work to ensure all the different components work well together.
Third, we work on horizontal improvements to the models. These are things that vertical teams don’t usually focus on. For example, “thinking time”—deciding how long a model should think before answering certain questions. We also handle instruction following, function calling, memory, and general improvements that apply across the entire stack. That is what the Post-training Frontiers team does, and I lead that team.
12:32-12:33
Matt Turck: Okay, great. And what was your journey to OpenAI?
12:37-14:18
Yann Dubois: It’s a long story, but I’ll try to keep it brief. I did my undergraduate degree in biomedical engineering in Switzerland. During an exchange program in Canada, I first learned about Word2vec. I’m not sure if you’re familiar with that algorithm, but it essentially takes discrete words and maps them into a vector space. You can think of it as a plane where words with similar meanings are placed closer together. It translates discrete language into a continuous, semantically meaningful space.
I was absolutely blown away by that algorithm, and that’s when I decided I wanted to work on Natural Language Processing (NLP) and language understanding. At the time—this was 2017, right before Transformers were introduced—I mistakenly thought English NLP was basically solved. Because of that, I decided to focus on under-researched languages where data was scarce.
I went to work for Grab in Singapore, where I built their NLP pipeline, working with Khmer, Bahasa, Thai, Vietnamese, and several other languages. Skipping ahead a bit, I did some academic work in various countries before ending up at Stanford for my PhD. After a short stint in the startup world, I eventually joined OpenAI.
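Editor’s sketch (not from the conversation): the Word2vec idea Yann mentions, that words with similar meanings sit close together in a vector space, can be illustrated with cosine similarity. The three-dimensional vectors below are hand-written for illustration only. They are not trained Word2vec embeddings, which typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen by hand so that the two animals are close.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

if __name__ == "__main__":
    print("cat vs dog:", round(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]), 3))
    print("cat vs car:", round(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]), 3))
```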
14:18-14:29
Matt Turck: Yes, and I remember seeing a note on your blog or personal page specifically telling quant firms not to reach out because you weren’t interested in hedge fund work.
14:31-14:42
Yann Dubois: I always think it’s very important for me to consider the positive impact I’m having on the world, or at least the impact I’m trying to have. That’s why that note is there.
14:43-15:06
Matt Turck: Yes. And as we were saying just before we started recording, people may have seen you in the GPT-5 video announcement. You did that very funny demonstration of an app built on the fly to teach your partner how to speak French. People should definitely go check that out.
15:06-15:17
Yann Dubois: Exactly. That was a fun one. GPT-5 wasn’t quite that reliable yet, so I was a little stressed that it wouldn’t work, but it ended up working out.
15:17-15:24
Matt Turck: So, was it truly live? I assume it was very, very rehearsed, but was it truly live?
15:24-15:37
Yann Dubois: Actually, right before we did it—during the last rehearsal—it didn’t work. I got slightly stressed about that, but the live take ended up working well.
15:37-16:18
Matt Turck: No pressure! But yeah, that landed perfectly. All right, let’s unpack some of the things we alluded to in the intro. We started by talking about reasoning, and I’m curious what reasoning means in 2025 or 2026 that’s different from the conversations we could have had about o1 or o3. In particular, one of the claims about GPT-5.5—and also my experience as a user—is that it’s particularly good with messy data. This seems to imply that it needs to reason through ambiguity more. What exactly has changed?
16:19-16:35
Yann Dubois: What I would say is that o1 and o1-preview were real breakthroughs in the research community. They gave us models that could “think,” where the longer they thought, the higher the likelihood they would be correct.
16:38-17:23
Yann Dubois: Initially, if you look at old blog posts, you would mostly see math evaluations and coding competitions, things where it’s very easy to test whether the model is correct or not. That gives you a sense of how we were training some of those models.
The way I see last year, especially the end of last year and the beginning of this year, is that we were able to take these algorithms that work with verifiable rewards, where we can objectively say you’re correct or you’re not, and apply them to the messy real world. We are now really optimizing for the utility we provide to users and for making them more productive. I think that’s what really changed.
17:23-17:30
Matt Turck: Okay. So it’s largely the post-training reinforcement learning part?
17:30-18:25
Yann Dubois: Yeah, I would say that’s a big part of it. There’s also another major factor. When you first develop a new method, it’s often fragile, not very reliable, and hard to productionize. That aspect has improved a lot.
Beyond that, we finally had a tool that we could start optimizing for different things. Initially, when we were developing this tool, we made a lot of simplifying assumptions about the real world. Now, we are removing those assumptions. In post-training, we are able to optimize for actual user utility and ensure the models are useful for real-world tasks. That’s also why current evaluations look much more realistic. Look at GDPval, or even things like GAIA or SWE-bench Pro.
18:27-18:34
Yann Dubois: These look much more realistic than, say, the Codeforces or coding competitions we were looking at with o1.
18:34-18:52
Matt Turck: Still on the topic of reasoning, what is ultimately the difference between GPT-5.5 Thinking and GPT-5.5 Pro? Is it just a matter of more test-time compute, more tokens, and more time invested in solving a problem?
18:52-20:10
Yann Dubois: Yes, basically. It’s just a question of how much test-time compute we pour into the model, or rather, into the entire system we’re shipping. We’ve seen time and again that the longer the model thinks, the better the answers we get. The problem is that these scaling curves are definitely not linear; there is a plateauing effect, and they look somewhat logarithmic depending on which evaluations you use.
You can pour in twice as much compute and only see small performance gains. Personally, I don’t use Pro that much because I’m pretty impatient and don’t like waiting that long. I know the probability of it being correct improves, but for my needs, it doesn’t improve enough to justify the wait. However, some people really love Pro, especially for academic research. I know many mathematicians who use it because they can just leave it running in the background for an hour or two. They don’t need to iterate quickly, and Pro is excellent for those use cases.
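Editor’s sketch (not from the conversation): the “somewhat logarithmic” shape Yann describes means each doubling of thinking compute buys roughly the same small gain. The toy curve below is an assumed functional form with made-up constants. It is not a fitted or measured scaling curve for any model.

```python
import math


def toy_accuracy(thinking_tokens, base=0.50, gain_per_doubling=0.04, ref_tokens=1000):
    """Hypothetical log-shaped scaling curve, clipped to the range [0, 1].

    base is the assumed accuracy at ref_tokens; every doubling of tokens adds
    gain_per_doubling. All constants are illustrative assumptions.
    """
    accuracy = base + gain_per_doubling * math.log2(thinking_tokens / ref_tokens)
    return min(max(accuracy, 0.0), 1.0)


if __name__ == "__main__":
    for tokens in (1000, 2000, 4000, 8000, 16000):
        print(f"{tokens:>6} thinking tokens -> toy accuracy {toy_accuracy(tokens):.2f}")
```

On a curve like this, a Pro-style setting that spends many times more compute gains only a few extra points, which matches the trade-off between waiting and correctness described above.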
20:10-20:26
Matt Turck: I’d love to reconcile that with what you mentioned earlier regarding efficiency per token. Is the idea that the model will be able to think longer while also being more efficient, thereby solving the task better? How do those two factors interact?
20:32-21:28
Yann Dubois: Yes. If you go back to the plot I was describing, where the x-axis represents latency and the y-axis represents performance, what we’re doing is shifting that curve further to the left as we improve efficiency. This means we’re spending less time to achieve the same level of performance.
What the “Pro” model does is extend that curve. It says, “I’m going to think for much longer, but I’ll have a higher likelihood of being correct.” However, every iteration of the Pro model also moves to the left, becoming more efficient over time. The important thing to remember is that there will always be tasks where you just want to maximize the probability of being correct and you don’t really care about latency. For example, if I start a job before going to sleep, the model has eight hours to work; it should think for as long as it possibly can. That is essentially what the Pro model offers you.
21:28-21:42
Matt Turck: In layman’s terms, what does that mean practically? How does it work? If the model starts heading in the wrong direction, does it interrupt itself earlier? Is that one of the ways it improves?
21:42-21:46
Yann Dubois: Are you asking specifically about what efficiency means in this context?
21:46-22:12
Matt Turck: Yeah, largely for efficiency. I’m just curious how reasoning actually becomes more powerful.
Yann Dubois: That’s a great question. Let me give you a metaphor involving humans. Imagine an expert in a certain domain compared to an undergrad who is just starting out. For a specific task, the undergrad might take much longer because they need to think through more possibilities before finding the right direction.
22:15-23:23
Yann Dubois: A task might take them one or two days because they have to think through all the possibilities and investigate, especially if they’ve never encountered that specific problem before. In contrast, an expert in that field will usually know which direction to take; they won’t spend time investigating ten different directions because they know which one is most likely to be correct.
This is the type of efficiency we’re talking about. We are essentially optimizing models for real-world problems. As a result, the model is trained to identify, with a higher likelihood, which paths of reasoning are most likely to be correct.
Beyond efficiency, there’s also the point you suggested: the model knowing when it’s going down the wrong path. This is something we can train the model for using reinforcement learning—teaching it to recognize, “Okay, this doesn’t seem like a great path. Let me backtrack and test something else.” If you train the model less, it might only realize it’s on the wrong path much later.
23:24-24:03
Matt Turck: Okay, all right. It seems like a lot of this goes back to reinforcement learning and post-training. Let’s talk about how the different components of modern AI systems work. We can cover pre-training, mid-training, and post-training, and perhaps spend more time on post-training since it’s so important.
Starting with pre-training at a high level—and realizing you may or may not be able to talk about how things were done specifically for GPT-5.5—the big narrative last year was that pre-training was hitting a wall and wasn’t going to yield much more progress. That seems to not be the case.
24:06-24:19
Matt Turck: Can you walk us through what is happening in pre-training and why it’s progressing now in a way that people hadn’t predicted last year?
24:19-26:03
Yann Dubois: Regarding pre-training, I can’t go into a lot of detail about what is happening internally, other than to say the team has been doing incredible work and our models are consistently getting better. One thing I want to highlight, specifically regarding efficiency, is that as you move to larger models, the amount of “thinking time”—the number of tokens required for reasoning—usually decreases.
You can think of it metaphorically: a larger model already “thinks” through its weights as it generates a specific token. By increasing the size of the model you’re training, you can actually decrease the number of explicit thinking tokens it needs to generate. Often, by pre-training larger models, you gain better overall efficiency.
The advantage of larger models is that they can be parallelized more effectively at inference time. You might assume that using a larger model would decrease system efficiency, even if it generates fewer tokens, but that isn’t necessarily true. The larger the model, the more opportunities you have to optimize for GPU inference. Ultimately, this makes the entire system more efficient. That’s one key point: larger models actually provide a lot of efficiency gains.
26:03-26:10
Yann Dubois: Otherwise, in terms of pre-training, I find it very interesting. I actually thought maybe two years ago that pre-training was starting to hit a wall.
26:13-27:03
Yann Dubois: If you look at Anthropic, for example, their models seem to be significantly larger based on the cost per token. Usually, that’s how you can tell if a model is bigger: you just look at the pricing. They are clearly achieving very high performance simply by increasing the model size. I think at least part of the field was surprised by that. There was a lot of talk about hitting a “data wall,” but it seems we haven’t quite hit it yet. The larger a model is, the more data it needs to ingest during training, and it seems different companies have found various ways to overcome the scarcity of data on the internet.
27:03-27:12
Matt Turck: Is the current frontier for data focused on multimodal data or synthetic data?
27:12-28:11
Yann Dubois: I think synthetic data can work well in a data-limited regime. Multimodal data is also an interesting one. I can’t speak to what we are doing internally, but I used to work on multimodal representation learning back in the day. I always believed that having a lot of multimodal data would significantly improve reasoning abilities. I still believe that, but if you look at Anthropic’s models, they aren’t necessarily the strongest in multimodality, yet they are still incredibly smart. So, it seems it might not be as strictly necessary as I once thought. However, I still believe that once we move toward embodied AI and agents, the models will learn a lot more about the world. We will likely see improvements in general intelligence and utility by teaching models how the physical world interacts with itself.
28:13-28:21
Yann Dubois: But looking at Anthropic’s models, for example, it seems they don’t need that much multimodal data to produce strong models.
28:21-28:34
Matt Turck: By “embodied intelligence,” do you mean robotics? For instance, if you use a video showing how gravity works and how a robot moves through space, the assumption is that would be more useful for the model. Is that the thinking?
28:34-29:29
Yann Dubois: Yes. The intuition that I—and many others—have held for a long time is that it’s difficult to understand the world through text alone. It’s hard to grasp what physics is without actually seeing it. For example, you can’t truly understand gravity without seeing things fall. When you look at our current models, they seem to understand gravity without having seen it, but it’s still not entirely intuitive. They still seem to be missing certain common-sense aspects. I believe we will improve a model’s common sense by having it interact with the real world, but we are still quite far from that—and by “we,” I mean the general academic and AI communities.
29:29-29:42
Matt Turck: Yeah. While we’re on the topic, as a quick detour, that leads us to the concept of world models. Taking your OpenAI hat off for a moment, are you bullish on world models?
29:42-30:00
Yann Dubois: World models in the sense of trying to replicate or simulate environments? Yes, but the problem is that simulations are always going to be extremely difficult to build and won’t ever be perfectly truthful.
30:01-31:05
Yann Dubois: I think there will always need to be a certain amount of training that happens in the real world to ensure the model recognizes the mismatches between a simulated environment and reality. As a field, we have a tendency to optimize for something simulated or unrealistic past the point of utility. We should be careful about that; we spend a lot of time and effort optimizing for a simulation, which is great initially, but eventually, if you over-optimize for something that isn’t representative of the real world, you’re just doing it out of habit. People need to realize when to stop. I don’t work with those types of synthetic environments as much because I don’t focus on embodied AI, so I’m not sure if we’ve reached that point yet.
31:05-31:18
Matt Turck: Okay, great. Going back to the pipeline of pre-training, mid-training, and post-training—let’s talk about mid-training. It’s a term people might have heard less often. What exactly is it, and why is it important?
31:18-32:00
Yann Dubois: As the name suggests, mid-training is the phase between pre-training and post-training. The core idea is that if you have high-quality data that is more representative of what you want in your final model, you should overtrain on that specific data. To take a step back: pre-training is basically trying to learn everything about the world by consuming the entire internet at a high level. The problem is that most things on the internet are not actually that useful.
32:00-32:44
Yann Dubois: If you look at Wikipedia, or at GitHub for coding data, there is significantly more useful information there than on some random forum. There are also a lot of ads on the internet, and you probably don’t want to train too much on those.
In pre-training, we train on everything. In mid-training, we basically overweight the high-quality data that we believe is more useful for training the final model. While I can’t speak for everyone, this is definitely happening across the academic community and in all open-source models; they all have this mid-training stage.
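Editor’s illustration (not from the conversation): the overweighting idea can be sketched as a change in sampling weights across data sources. The source names, token counts, and the 5x multiplier below are hypothetical, chosen only to show the arithmetic; real data mixtures are not described in this interview.

```python
def mixture_weights(token_counts, upweight=None):
    """Return the sampling probability of each data source.

    token_counts: dict mapping source name -> size (hypothetical units).
    upweight: dict mapping source name -> multiplier; missing sources use 1.0.
    """
    upweight = upweight or {}
    scaled = {src: n * upweight.get(src, 1.0) for src, n in token_counts.items()}
    total = sum(scaled.values())
    return {src: value / total for src, value in scaled.items()}


# Hypothetical corpus: mostly generic web text, a little code and reference text.
counts = {"web_crawl": 900, "code": 60, "encyclopedia": 40}

# "Pre-training" mix: sample in proportion to raw size.
pretrain_mix = mixture_weights(counts)

# "Mid-training" mix: overweight the sources assumed to be higher quality.
midtrain_mix = mixture_weights(counts, {"code": 5.0, "encyclopedia": 5.0})

for src in counts:
    print(f"{src:13s} pretrain={pretrain_mix[src]:.3f} midtrain={midtrain_mix[src]:.3f}")
```

In this toy setup the code share rises from 6% to roughly 21% of sampled data, while the generic web share falls from 90% to roughly 64%.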
32:44-32:52
Matt Turck: Great. Now, let’s move to post-training. Let’s start at a high level by defining what that is. There is reinforcement learning, but that isn’t the only part of post-training. What else is involved?
32:52-33:40
Yann Dubois: It depends on how you define the term and where you draw the boundaries. In my mind, taking a very broad view that includes reinforcement learning and the training for our reasoning models, post-training is the process of taking something that knows everything about the world and turning it into something useful for people.
The metaphor I like to use for pre-training is going into a library. You have books on every subject, and in theory, you can find any information you want there. However, it is much more useful to talk to an expert who has already read those books—someone you can ask questions of, who can provide answers, and who actually understands what you are looking for. That is essentially what post-training achieves.
33:41-34:04
Yann Dubois: At a high level, the goal is to create something that is genuinely useful to users and easier to interact with. There are multiple stages involved in this process. I’ll focus on the standard stages that occur outside of OpenAI’s proprietary methods. Usually, the process begins with SFT.
Matt Turck: Which stands for Supervised Fine-Tuning.
34:06-34:53
Yann Dubois: Exactly, Supervised Fine-Tuning. Early on, most models were primarily using SFT. The idea is that if you have humans who can provide the desired final answer—the “gold” answer—you can essentially clone the human’s behavior. We call this “behavior cloning.” The problem is that you will never surpass the quality of your ground truth data. Humans are limited in many ways, so the model will never outperform the human labelers you are working with.
34:54-35:40
Yann Dubois: The reinforcement learning stage moves beyond behavior cloning to focus on optimizing rewards. The concept here is: “I don’t necessarily know what the perfect ground truth answer is, but I can define the criteria for whether an answer is correct and what specific elements I want to see in it.” You then have the model optimize for those criteria to maximize its “reward function.” This allows the model to go beyond current human capabilities—or at least beyond the capabilities of the specific humans you’re working with. Those are the two major stages. Within reinforcement learning itself, the approach varies depending on which models are being trained.
35:42-37:19
Yann Dubois: In terms of how models are being trained, at least in the open-source community, there seem to be different approaches. First, there is reinforcement learning with verifiable rewards. This is where it’s very easy to determine whether an answer is correct or not, allowing for a binary reward system. This relates back to our previous discussions on o1 and o1-preview.
Then, you have reinforcement learning without verifiable rewards. In this case, I might use pairwise comparisons—stating that one answer is better than another—without necessarily being able to define what a “perfect” answer looks like. Of course, it’s a continuum with everything in between, but those are the three high-level concepts to keep in mind regarding post-training.
The standard approach in the open-source world is to start with Supervised Fine-Tuning (SFT). They clone behaviors collected online or from humans. Once the model reaches a sufficiently high level, they apply reinforcement learning to push it beyond current capabilities. Starting with reinforcement learning from scratch would be highly inefficient because the model essentially has to stumble across the right answer.
The way reinforcement learning works is by sampling many times from the model you’re training, identifying which responses are correct and which are not, and then instructing the model to produce more of the correct ones. Because you need to hit that right solution to learn, you’re much better off first getting as close as possible to the target behavior through behavior cloning before moving into reinforcement learning.
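Editor’s illustration (not from the conversation): a toy version of the loop described above, where you sample from the current policy, check which samples are correct, and shift probability toward the correct ones. The “policy” here is just a softmax over four hypothetical candidate answers, and the update is a plain policy-gradient step with a mean-reward baseline. This is an assumption made for illustration, not a description of how any production model is trained.

```python
import math
import random


def softmax(logits):
    """Convert logits into a probability distribution."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(candidates, is_correct, steps=50, batch=8, lr=0.5, seed=0):
    """Sample answers, score them with a binary verifier, reinforce the correct ones."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)  # start from a uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        picks = rng.choices(range(len(candidates)), weights=probs, k=batch)
        rewards = [1.0 if is_correct(candidates[i]) else 0.0 for i in picks]
        baseline = sum(rewards) / batch  # mean reward of this batch
        for i, r in zip(picks, rewards):
            advantage = r - baseline
            for j in range(len(logits)):
                # Gradient of log-probability of the sampled answer w.r.t. logit j.
                grad = (1.0 if j == i else 0.0) - probs[j]
                logits[j] += lr * advantage * grad / batch
    return softmax(logits)


# Hypothetical task: "What is 2 + 2?" with four candidate answers.
candidates = ["3", "4", "5", "22"]
final_probs = train(candidates, lambda answer: answer == "4")
print({c: round(p, 3) for c, p in zip(candidates, final_probs)})
```

Note that a batch in which every sample is wrong produces no update at all, which is the toy analogue of the point about needing to hit the right solution before the model can learn from it.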
37:21-37:29
Matt Turck: Does reinforcement learning create entirely new capabilities, or does it simply make the model better at using its existing capabilities?
37:29-38:52
Yann Dubois: That is a difficult question to answer scientifically. When a model is pre-trained on the entire internet, one could argue that all possible capabilities are already latent within it. However, if you look at the models we were pushing two years ago—for example, the Alpaca model I worked on in the open-source world—we used about 50,000 examples for Supervised Fine-Tuning (SFT).
Now, when you look at reinforcement learning (RL) in models like Kimi or DeepSeek, it seems they are using closer to a million data points. The industry has significantly scaled up the RL stage. Through this scaling, models have clearly developed new functional capabilities, such as reasoning—the ability to check their own answers, attempt to improve them, and “think” for longer periods to reach a more accurate conclusion. So, while everything might technically be present after pre-training, we have definitely been able to unlock more advanced capabilities through reinforcement learning over the last year and a half.
38:52-39:20
Matt Turck: I’ve heard several times that reinforcement learning is quite finicky and difficult to scale. Part of the reason the industry didn’t lean as heavily into RL during the initial LLM progress curve was precisely because it was so hard to make it work reliably. What exactly makes scaling RL so difficult? Is it a matter of the datasets, identifying where the rewards are, or something else entirely?
39:20-39:24
Yann Dubois: I would say most people who hadn’t worked in reinforcement learning within the academic or research communities generally thought it was too finicky to be practical.
39:26-40:06
Yann Dubois: That was the prevailing view up until about two years ago: reinforcement learning (RL) simply didn’t work, or was too finicky to be practical. I used to be one of those people. In fact, when ChatGPT first came out, I wasn’t at OpenAI yet. I saw their blog post mentioning they used reinforcement learning, and my first thought was that I could achieve the same results without it. I felt it was just an overcomplicated method. That was actually the problem we started working on with Alpaca: trying to reproduce those results using only Supervised Fine-Tuning (SFT) through behavior cloning. Yann LeCun famously used the metaphor that reinforcement learning is just the “cherry on top.” I think that really captured the intuition most people had at the time.
40:10-40:36
Yann Dubois: It seems that after crossing a certain scale—where models possess “good priors” and essentially know everything about the world—reinforcement learning suddenly started to work. This isn’t limited to LLMs, either. Robotics seems to be entering the same stage. Researchers are realizing that while RL used to be very finicky, it actually learns quite well when applied to models that already have a comprehensive understanding of the world.
40:38-41:14
Yann Dubois: To answer your question about what makes reinforcement learning complicated, the first issue is infrastructure. From a high-level systems perspective, RL requires you to sample many different answers and determine which are correct and which are not. This sampling process is incredibly expensive, and you have to perform it at a massive scale. Another issue that the open-source world is currently grappling with is what happens when we train more agentic models.
41:16-42:01
Yann Dubois: When we are training more agentic systems, you only know whether you were correct at the end of a very long rollout. This means you get very little information per token regarding whether you were on the right track. It makes credit attribution extremely difficult; it’s hard to pinpoint exactly which part of your entire answer led to the successful outcome.
This is a significant challenge from a machine learning perspective. In an ideal world, I could say, “This specific action was good; do more of that.” The problem with these agentic systems and reinforcement learning is that you don’t really know which parts were effective until you reach the very end. That is a major hurdle in reinforcement learning right now.
42:01-42:17
Matt Turck: What is the current frontier of reinforcement learning? It seems like there’s a jungle of acronyms out there, like GRPO and various other techniques. What are you using, what are you excited about, and what do you think is most promising?
42:17-43:09
Yann Dubois: I can’t discuss specifically what we are using at OpenAI, but in the open-source world, for example, GRPO seems to be working very well. People used to use various methods like PPO and DPO, but the community seems to have really converged on this one.
The big difference with GRPO is that it utilizes the simple method I mentioned earlier: sampling as many answers as possible and identifying which one is correct. In many ways, GRPO is a very simplistic method. Generally, we have seen time and again in machine learning that the simplest method—the one that allows you to scale up compute most effectively—is usually the one that ends up winning. That seems to be exactly what is happening here, at least in the open-source community.
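Editor’s illustration (not from the conversation): in the publicly described formulation of GRPO, several answers are sampled for the same prompt and each answer’s reward is compared against the others in its group, rather than against a separately learned value model. A minimal sketch of that group-relative scoring step follows. The function name, the `eps` constant, and the example rewards are assumptions for illustration, and the rest of the algorithm (the clipped policy update, any KL term) is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    rewards: list of scalar rewards for answers sampled from the same prompt.
    eps: small constant (an assumption here) to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers scored by a binary verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])

# If every answer in the group gets the same reward, there is no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
print(flat)
```

Correct answers receive positive advantages and incorrect ones negative, so the only ingredients are sampling and a way to score each sample.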
43:13-43:27
Matt Turck: A question crossed my mind—you often hear that AI systems aren’t really built, but rather grown. How would you characterize your day-to-day work? What part of it is science versus a craft, where you’re trying multiple things and simply keeping what works best?
43:27-45:09
Yann Dubois: That’s a great question. I think it usually starts as a craft. People try out many different things and begin building a mental model of what works and what doesn’t. Over time, we move from that “craft land” toward a more scientific approach. However, the purely scientific approaches are rarely the ones that work first. It’s very rare that you can take a strictly scientific approach, declare, “This is the optimal thing to do,” execute it, and have it just work.
There is definitely a sense of alchemy involved. People often have a good flair for something and make it work; then, either they or others start trying to improve upon that by being very scientific. I’d say this cycle happens over and over in machine learning: first craft, then science. Both are incredibly important, but they happen at different stages of the pipeline.
In terms of engineering, that is always a necessity. I would say most researchers have become quite adept at—well, I wouldn’t necessarily say they are all “good engineers” in the traditional sense—but they’ve become very good at working within complex systems and figuring out exactly what they need to test. As our systems and infrastructure have become increasingly complicated, the nature of the work required has definitely changed over time.
45:11-45:41
Matt Turck: Fascinating. So, circling back to reinforcement learning and some of the points you made at the beginning: if I want to make my model better at computer use, agentic coding, or any specific domain, would I spend a dedicated amount of time performing reinforcement learning specifically for that task? Does it involve putting together a specialized dataset and then coming up with rewards? Is that how it works—you just pick one problem and apply reinforcement learning specifically to it?
45:41-47:04
Yann Dubois: To be clear, I talk about reinforcement learning more because it’s the area I know best and have worked on for a long time. However, we discussed mid-training earlier, and all of these stages are extremely important. You can improve the model at different points in the pipeline. As I mentioned, the closer you are to the final stage of the model, the smaller the scale of training usually becomes. This allows you to iterate quickly—in terms of days rather than months.
Usually, people start with this fast iteration loop and then go deeper, making larger changes across the entire stack. I’m not saying that only reinforcement learning matters, but that is often where people start making changes before they permeate deeper into the stack. That’s how the process works. You see this in the open-source world as well; there are far more post-trained models than there are new pre-trained bases. You see many more algorithmic improvements there, which is why we hear about GRPO, DPO, PPO, and so many other “X-POs.” It’s because people can iterate very quickly on that final stage.
47:05-47:24
Matt Turck: So you can iterate quickly on this final stage of the pipeline. Does the “jagged” nature of these models stem from this approach of picking specific problems to solve? Does that make the model excellent at those particular tasks but weaker in other areas, or is that just a fundamental characteristic of AI models in general?
47:24-48:21
Yann Dubois: There’s definitely some of that. If you optimize heavily for specific types of problems, the model will certainly perform better in those settings. However, my intuition is that it’s less about the exact problems you’re optimizing for and more about the broader class of problems.
For example, if a model is very good at math competitions, it will likely be quite good at coding competitions as well. It’s not necessarily about the specific domain; it’s about the underlying skills and the way of thinking required—those horizontal capabilities needed to perform the tasks. Usually, when a model is bad at something, it’s actually bad at that specific skill across any domain or language. So, you have to think about the generalization of those skills rather than just domain-specific capabilities.
48:21-48:56
Matt Turck: Speaking of generalization, there’s been a clear evolution from success in math and coding to covering different areas. That leads into the whole “GDPval” concept, where model performance is being evaluated across various sectors of the economy. Is that progress a result of overall model improvement, or is it a deliberate effort where you decide, “Okay, now we’re going to take this specific part of the economy, build a dataset for it, and perform mid-training and post-training”? How does that progress actually work?
49:00-49:05
Matt Turck: Or does that progress come from starting in those very specific domains and then generalizing to the rest of the world?
49:06-50:26
Yann Dubois: It’s definitely something we actively push on. I think we, and other companies, are realizing that we’re moving toward a world where we want to create products that are genuinely useful—improving productivity and helping people in their day-to-day lives. There is a very active effort to decide which domains we should prioritize. Now that we know we have an algorithm we can apply in different places, our main constraints are collecting the right data and finding people who really care about a specific problem to work on it. However, there aren’t that many people with the expertise to do this, so you really have to prioritize. It’s a very proactive approach.
In general, I’d say the performance of the model depends on the number of people who care about the final output and are actively scrutinizing it. If they start focusing more on specific verticals, those verticals will improve very quickly. But again, we have a limited number of people capable of doing that work.
50:27-50:52
Matt Turck: To unpack something you alluded to a minute ago: do models actually generalize better now, especially from a reinforcement learning perspective? If you make a model very good at domain A or B, is it likely to make the model better at domain C, regardless of how much effort you put into developing rewards for domain C?
50:53-51:03
Yann Dubois: I think there are different axes of generalization. First, there’s algorithmic generalization. That’s really about whether I can take the algorithm I developed or this “black box” approach and apply it elsewhere.
51:03-51:27
Yann Dubois: Looking at the open-source world, it really seems like people can take something developed for domain A and apply it to domain B. They take an approach like GRPO, apply it in many different areas, and it just works. That type of algorithmic generalization seems to be quite strong, which is why we’re seeing so much progress. If it didn’t generalize that way, moving forward would be much more difficult.
51:27-52:14
Yann Dubois: Then there’s the generalization of the model itself when it’s trained on a specific dataset. My mental model for this is that generalization happens in terms of capability. If the underlying capability is the same, you’ll see generalization across domains—whether that’s different languages or coding. For example, you can optimize a model to be excellent at C++ with very little specific C++ training. This works because the pre-trained model has already seen all of C++ and fundamentally understands the basics of the language.
52:14-52:45
Yann Dubois: That type of generalization definitely happens. However, the generalization I think is harder to achieve is when we don’t have those horizontal capabilities. To give you a concrete example: if a model is highly “intelligent” in the sense that it performs well in math or coding competitions, we might assume it’s just generally smart. From a human perspective, we think that if someone is good at those complex tasks, they are smart enough to do almost anything else. But that isn’t always the case with models.
52:48-54:18
Yann Dubois: It is a common misconception that models generalize perfectly across all tasks, but that really isn’t true. Many expert domains where humans work are incredibly messy. Coding and math competitions are extremely well-specified, but in the real world, you need the capability to understand underspecified tasks and navigate ambiguity. You have to identify what resources are even required to answer a question.
In a math competition, everything you need is usually right there in the prompt—maybe five or fifteen lines of text containing all the necessary information. In the real world, if I’m a consultant or working in finance, I have to go onto the internet to find and extract various pieces of information just to understand the context before I can even begin the reasoning process. This type of horizontal capability is what allows for generalization, but in many cases, we don’t have that horizontal capability yet.
That is actually why we see hallucinations across every domain. If a model is bad at admitting it doesn’t know something, that behavior usually persists across the board. You won’t typically find a model that is extremely well-calibrated about its knowledge in one domain but completely uncalibrated in another.
54:18-54:31
Matt Turck: As a quick detour, is hallucination also a reinforcement learning problem? Can you solve it by rewarding the model for saying “I don’t know” when appropriate? I believe John Schulman has a great presentation about that.
54:32-56:04
Yann Dubois: John Schulman gave a great presentation on this a year or two ago. He explained that if you use behavior cloning—the Supervised Fine-Tuning (SFT) we discussed earlier—you can actually end up optimizing for hallucinations.
To be concrete: if a model doesn’t know about a specific research paper, but the ground-truth answer provided by a human includes a citation for it, you are essentially training the model to cite something it doesn’t know exists. You’re rewarding it for making things up.
In reinforcement learning, however, you sample from the model’s own distribution first. It is extremely unlikely that the model will sample something it doesn’t know and somehow have it be correct. Therefore, you never reward that behavior. Instead, when the model samples something it doesn’t know and gets it wrong, you penalize and eliminate that behavior. The general intuition is that while hallucinations can be introduced during the SFT portion of the pipeline, a robust reinforcement learning process should prevent them from happening often.
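Editor’s illustration (not from the conversation): one way to picture the incentive described above is a reward that scores a correct answer positively, a wrong answer negatively, and an explicit “I don’t know” as neutral. The specific values (+1, 0, and a penalty of 1) are assumptions chosen for the arithmetic; the interview does not specify any reward scheme.

```python
ABSTAIN = "I don't know"


def reward(answer, truth, wrong_penalty=1.0):
    """Hypothetical reward: +1 if correct, 0 for abstaining, -penalty if wrong."""
    if answer == ABSTAIN:
        return 0.0
    return 1.0 if answer == truth else -wrong_penalty


def expected_reward_of_guessing(p_correct, wrong_penalty=1.0):
    """Expected reward when the model answers with probability p_correct of being right."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


# Compare guessing against abstaining (which always scores 0 in this sketch).
for p in (0.2, 0.5, 0.9):
    print(f"p_correct={p:.1f} guess={expected_reward_of_guessing(p):+.2f} abstain=+0.00")
```

Under these assumed values, guessing only beats abstaining when the model is right more than half the time, so a fabricated citation that is almost never correct is consistently penalized.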
56:04-56:20
Matt Turck: Going back to the topic of generalization, are there examples where getting better at one specific domain actually makes the model worse at others?
56:22-56:26
Matt Turck: Some people are very good at math, while others are very good at English. Usually, they aren’t the same people across different domains.
56:26-58:05
Yann Dubois: Usually not. What happens, though, is that you make decisions based on which domain you choose to optimize for. If you focus on one domain, you’ll naturally be able to optimize less for another. It’s not necessarily that improving one thing makes the other worse; it’s just that you have finite resources. You are constrained by compute, data, and the human bottleneck involved in that work.
However, you can experience “negative generalization” or “negative transfer,” particularly with the horizontal aspects of the model. To give you a very concrete example: explicit instruction following versus implicit instruction following. We often hear that OpenAI models are excellent if you tell them exactly what you want. But as a result, we also hear they can be less effective if you aren’t specific.
For instance, if I say, “Change this file,” but I make a typo in the filename, a model that is extremely good at explicit instruction following will try to change the wrong file, the one with the typo. A human, on the other hand, would probably realize you made a mistake. In cases like that, explicit instruction following actually works against implicit understanding. You will encounter situations where these horizontal capabilities essentially conflict with one another.
58:10-58:27
Matt Turck: To continue our conversation on reinforcement learning, do you believe that as we progress from being excellent at coding and math to addressing the rest of the economy, those other sectors will be tractable problems? Do you think we can ultimately achieve the same level of performance there?
58:27-59:59
Yann Dubois: Yes, I do. I believe we can. I don’t think there is anything deeply special about coding or math that would prevent us from optimizing other domains in the same way. However, there are two main caveats.
First, most of the people building these models are highly proficient in coding and they care deeply about it because it’s a tool they use every day. There is nothing better than having the developer also be the end-user, because they intuitively understand the issues. For someone like me, it’s very difficult to know exactly what needs to change regarding, say, the legal aspects of a model if I don’t understand the legal domain myself.
The second point, which I mentioned briefly before, is the concept of verifiable rewards. There are certain domains where it is much easier to determine whether an answer is objectively correct. You mentioned cyber capabilities earlier; we’ve seen huge improvements there because in cybersecurity, it’s extremely easy to verify a result. If the model identifies a vulnerability, you can test it immediately to see if it’s a real issue. So, while reinforcement learning is simply easier to apply to certain fields right now, I don’t believe there is anything inherent in the model’s capacity that would limit it from succeeding elsewhere.
1:00:02-1:00:19
Yann Dubois: It isn’t necessarily the capacity of the model that’s the bottleneck; rather, it’s our ability to constrain the model to perform at a high level in specialized fields like law or medicine. The short answer is that we simply know less about these specific domains. There are definitely certain areas that are much easier to optimize for using reinforcement learning than others.
1:00:19-1:00:30
Matt Turck: Great. Let’s talk about evaluations, or “evals,” for a minute, as that’s a hugely important topic. To start, why is it so difficult to evaluate a model in the first place?
1:00:30-1:01:01
Yann Dubois: Evaluation has become increasingly difficult as models have improved. This is because the tasks we assign to these models are becoming more general and open-ended. For example, today I might ask a model to “build me a website that does X.” In the past, I might have just asked, “Is there a specific bug in this implementation?” It’s much easier to determine if a bug exists because I can have a human expert list the errors and then apply that feedback automatically.
1:01:03-1:01:53
Yann Dubois: In contrast, evaluating something like a website is very difficult because there isn’t one “optimal” answer; there are many valid ways to build a high-quality site. This open-ended nature of generative models makes evaluation significantly harder.
Another issue is that models are becoming better than the majority of humans in specific domains. As a result, we have fewer and fewer people capable of accurately evaluating the models’ performance in those specialized areas. That is definitely a major constraint.
To be honest, there is also a cultural hurdle. Most people want to improve the model and assume the best way to do that is through more training. In reality, identifying specific issues and ensuring we can accurately quantify improvements is just as critical.
1:01:56-1:02:41
Yann Dubois: The ability to quantify improvements is just as important as the improvements themselves—if not more so. However, there has always been a cultural gap regarding this. This was especially true in the academic world until about two years ago, when evaluations, benchmarks, and even datasets were largely static.
About four years ago, there was a major mentality shift where people realized that data is actually critical. Now, plenty of people are working on data, but I don’t think the focus on evaluation has quite caught up. Everyone acknowledges its importance, but people don’t fully grasp how impactful working on “evals” can be.
When I first joined OpenAI, my first project was focused specifically on data and evals. I chose that because I knew it was an area that wasn’t getting enough attention, which meant working on it would be incredibly impactful. The tide is finally starting to turn.
1:02:44-1:03:01
Matt Turck: The tide is shifting, but perhaps not fast enough. Is the pace of progress in “model-as-a-judge” and AI evaluating AI moving just as quickly? Is that a distinct area of research, or is it fundamentally based on the same ideas and techniques?
1:03:01-1:03:26
Yann Dubois: It’s fundamentally the same method. Most of what we do in evaluation—especially now that we have reinforcement learning—could be applied almost exactly as is during the training phase. That’s actually another reason why evaluations are so complicated: every time you build an eval, you’re essentially building a way to generate training datasets. Once you have that, you’re going to optimize for it. Even if you aren’t targeting that specific evaluation, you’ll be training on the same type of data.
1:03:28-1:04:16
Yann Dubois: And you’ll perform exceptionally well on it because of the generalization of capabilities I mentioned earlier. The model learns from one dataset and becomes so proficient at a specific evaluation that the eval itself becomes obsolete very quickly. That is a significant challenge with evaluations.
To return to your question, the “model-as-a-judge” concept is incredibly important. It’s likely one of the most critical developments because as we build better models, we create a self-reinforcing loop—a capability flywheel where superior models become better teachers for other models. This is vital for training, but you can apply the same logic to evaluation. A large part of my team focuses on this, and I believe developing these “model-as-a-judge” modules is absolutely critical.
1:04:21-1:04:49
Matt Turck: Okay, fantastic. As we head toward the end of this conversation, I’d love to zoom out a bit and get your sense of where things might be heading. Obviously, it’s incredibly hard to make predictions about AI years out, but let’s look at the next 12, 18, or maybe 24 months. Is your sense that things will continue to progress steadily, or are we heading toward something that could feel more like a discontinuity?
1:04:49-1:05:07
Yann Dubois: In terms of actual progress, as I mentioned before, I think it’s always continuous. However, the *feeling* of discontinuity will definitely happen. We saw it happen three or four months ago with coding, and I think we’ll see that across every other domain now. Most people haven’t yet felt the full impact of the capabilities and usefulness of our models.
1:05:08-1:05:48
Yann Dubois: We are seeing the capabilities and usefulness of our models expand in the same way that coding and software engineering are evolving right now. This progress will definitely permeate many other verticals. In terms of the capability bumps within the sectors we are already monitoring, I believe the growth will be more continuous. We rarely see massive, global discontinuities; usually, they are local. When you zoom out, the trajectory typically looks quite smooth. While that isn’t always the case, it has been the pattern most of the time. I certainly can’t predict when the next major discontinuity will occur.
1:05:48-1:06:02
Matt Turck: What is your sentiment on the general concept of accelerating loops in AI? I’m thinking of everything from continual learning—which makes models more current and able to learn faster—to the broader concept of AI building AI in an increasingly automated way.
1:06:05-1:06:09
Matt Turck: Is that fact or fiction? And beyond that, what are you personally excited about?
1:06:09-1:06:55
Yann Dubois: I’m extremely excited about continual learning. I don’t think we’ve quite cracked it yet. We have things like Codex memories, which are helpful, but that’s definitely not the end state. I have a friend who always talks about a specific type of plot we should be looking at: the x-axis is time, and the y-axis is the utility or usefulness the model provides to the user.
Right now, if you drop a model into a company on day zero, you could argue it’s more useful than most new employees; it starts at a higher baseline at T0. However, over time, its utility remains mostly constant because it doesn’t truly learn specific company knowledge or adapt.
1:06:57-1:07:32
Yann Dubois: Nor do models become more efficient over time at the tasks they perform. In contrast, humans learn very quickly. What matters is the “area under the curve,” the cumulative usefulness over time. Because of this, humans remain more useful in many scenarios. To bridge this gap, we need to solve continual learning. We need to make that utility curve monotonically increase over time, ensuring that models become increasingly useful the longer they operate within a specific environment.
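Editor’s illustration (not from the conversation): the “area under the curve” comparison can be made concrete with a simple trapezoidal sum over a utility-versus-time plot. The utility values below are invented for the example: a static model that starts higher but stays flat, and a learner that starts lower but improves each month.

```python
def area_under_curve(times, values):
    """Trapezoidal area under a piecewise-linear curve."""
    return sum(
        (t1 - t0) * (v0 + v1) / 2.0
        for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:])
    )


# Hypothetical utility curves over one year, sampled monthly.
months = list(range(13))
static_model = [0.6] * len(months)           # higher at day zero, never improves
learner = [0.3 + 0.075 * m for m in months]  # lower at day zero, keeps learning

auc_static = area_under_curve(months, static_model)
auc_learner = area_under_curve(months, learner)
print(f"static model: {auc_static:.2f}  learner: {auc_learner:.2f}")
```

With these made-up numbers the learner starts at half the static model’s utility yet accumulates more total utility over the year (about 9.0 versus 7.2).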
1:07:32-1:07:47
Yann Dubois: I’m extremely excited about this frontier, though I’m actually surprised we aren’t further along yet. Three years ago, when ChatGPT first launched, I was working on a startup with some friends. We were considering focusing on continual learning, personalization, and memory in general. At the time, we thought OpenAI would solve it quickly because they had the data, the users, and the models.
1:07:48-1:08:06
Matt Turck: Take memories, for example. We all thought, “OpenAI is going to solve that in the next six months. They have all the data, they have the users, and the models will learn incredibly quickly from those interactions.” Yet, three years later, I don’t think we’re there yet. In layman’s terms, what is the fundamental difficulty?
1:08:06-1:08:36
Yann Dubois: That’s a good question. To be completely honest with you, I don’t quite know why it’s taking us this long to figure it out. It’s the kind of domain where I believe if we truly put enough resources behind it, we would solve it. Of course, especially when you talk about memory within a corporate context, you run into major hurdles regarding permissions and privacy—specifically, what information can be shared and what cannot be accessed across different boundaries.
1:08:38-1:08:50
Yann Dubois: We aren’t quite there yet, even for a single user. I don’t fully understand why—at least not at a high level that I can discuss—but we haven’t solved it.
1:08:50-1:09:32
Matt Turck: What you’re bringing up is really interesting for AI builders, investors, and startups. It touches on the question of models getting increasingly smarter within an enterprise context. There is a tension between the core capabilities of the models and the infrastructure people build around them. A year or two ago, everyone was focused on RAG (Retrieval-Augmented Generation); these days, it’s all about agent harnesses. A lot of people are wondering if the models will eventually “eat” the harness, making those external frameworks temporary. From your perspective, how do you see that playing out?
1:09:35-1:10:26
Matt Turck: Where do you think this is headed?
Yann Dubois: I think agent harnesses can significantly improve a model’s capabilities right now. However, given the rapid progress we’re seeing in core model capabilities, I personally wouldn’t lean too heavily on the harness unless it’s for a very concrete, immediate goal.
Companies focused on a specific vertical might want to move from 80% reliability to 85%. A harness can provide that extra boost, which is very important. But they need to build it with the understanding that they’ll have to retune that harness in the future as the underlying models evolve, and that’s perfectly fine. If you’re trying to build a general-purpose harness, though, that’s a different story.
1:10:28-1:11:20
Yann Dubois: I don’t believe a general-purpose harness will be sustainable over the long term. Building harnesses for specific domains is a necessary short-term strategy, and there is always more you can do with them. In fact, I think everyone should be doing more of that if they have a specific problem in mind, because we are leaving so much potential on the table by not having good harnesses.
Arguably, if we froze our current models right now and focused entirely on perfecting the harness—perhaps even spending more time training specifically with a great harness—people would already feel the impact of AGI in every single domain. However, since we aren’t freezing development and will continue to train increasingly better models, we don’t yet fully understand what the final version of the harness will look like.
1:11:21-1:12:10
Matt Turck: The final harness is constantly evolving; it’s always going to change. I have a similar question regarding applications. We alluded to your progress in different verticals, specifically with GDPval, but also how you’ve benchmarked Telecom, which involves complex customer service workflows. You’ve also shown progress with finance agents, reaching 88.5% on internal investment banking modeling tasks and 51.1% on Office QA Pro.
Bit by bit, you are tackling more of these domains. Do you think people should still be building specific applications, or will all of this ultimately become part of the model’s core capabilities as we get closer to AGI?
1:12:10-1:12:21
Yann Dubois: There is still an enormous amount of space for external companies and startups to push into specific verticals.
1:12:23-1:13:15
Yann Dubois: A lot of people think of “intelligence,” in quotes, or raw capability as the real bottleneck, but I don’t think that’s true. Most of the time, the bottleneck is the “last mile.” It’s about ensuring the model has the right permissions, the right connectors, and things of that nature.
At OpenAI, we are going to be very focused on this general, horizontal aspect. I think other companies should focus more on the verticals, providing maximum value from what we currently have. There will always be plenty of space left for this last mile across different industries. I would highly encourage people to continue working on that. Perhaps one day we will stop making horizontal progress, but I don’t see that happening anytime soon.
1:13:18-1:13:22
Yann Dubois: We might start focusing on those verticals ourselves eventually, but that’s not what we’re working on right now.
1:13:22-1:13:34
Matt Turck: Okay, well, that feels like a very optimistic note to end on, at least for the startup ecosystem. Thank you so much, Yann. This was terrific; I really enjoyed it. Thank you for spending time with us.
1:13:34-1:13:36
Yann Dubois: Great. Thanks, Matt.
1:13:36-1:13:55
Matt Turck: Hi, it’s Matt Turck again. Thanks for listening to this episode of the MAD Podcast. If you enjoyed it, we’d be very grateful if you would consider subscribing—if you haven’t already—or leaving a positive review or comment on whichever platform you’re using to watch or listen. It really helps us build the podcast and attract great guests. Thanks, and see you in the next episode.

