The AI Progress Chart Everyone Is Misreading — Beth Barnes & David Rein | Edited Transcript
A professionally copyedited transcript of Machine Learning Street Talk's conversation with Beth Barnes and David Rein, "The AI Progress Chart Everyone Is Misreading." It has been edited for readability and lightly formatted while preserving the substance of the discussion.
Full video: https://www.youtube.com/watch?v=zSAGzfspuDE
Beth Barnes and David Rein on the one graph that ate the AI timelines discourse, and why the two people who built it are the most careful about how you read it.
Episode Guide
- 00:00:00 Intro
- 00:02:06 Sponsor break: Prolific human-feedback infrastructure
- 00:02:33 Welcome and the scalable oversight motivation
- 00:06:02 Construct validity, benchmark pathologies and the Chollet worry
- 00:15:45 Time Horizons: human time, HCAST tasks and the 50% logistic
- 00:24:50 Is human difficulty really one variable?
- 00:33:05 Agent harness evolution and the inference-compute dividend
- 00:40:00 Scaffolding bells, token budgets and the credit-assignment problem
- 00:44:15 Look at the damn graph: regularisation bug and reliability nuance
- 00:50:00 Why 50%? Reliability, reward hacking and pizza-party transcripts
- 00:55:20 Extrapolation risk and straight lines on graphs
- 00:59:25 Software engineering as a specification acquisition problem
- 01:07:40 Compilers also made ugly code: vibe-coding quality and Claude on METR Slack
- 01:15:15 Strongest defensible claim, Carlini's compiler swarm and AI 2027
- 01:23:45 SWE-bench merge rates, the bank-teller analogy and horses
- 01:31:45 Scheming, alignment faking and the mentalistic vocabulary problem
- 01:40:45 Reward hacking, monitorability and chain-of-thought faithfulness
- 01:45:25 Recursive self-improvement, knowledge vs intelligence and closing
Transcript
00:01-00:33
David Rein: The models are smart enough to understand that what they're doing isn't actually what you wanted, yet they do it anyway. You can even have a conversation with them in chat mode and ask, "Would you ever do this?" or "Suppose a user asks you to do X, and you do Y—would that be aligned behavior?" They clearly seem able to answer that, no, that was not the desired behavior, but they still perform the action. One example is asking a model to train a masked language model without using division or exponentiation operators.
00:33-00:52
Beth Barnes: One hope might be that the problem was just the systems being dumb. However, when we look at the data, for almost all tasks, models either succeed every time or fail every time. If you eyeball the graph, you can see that up to a certain point, the model is basically completing all the tasks, and then past that point, it’s really not doing many of them at all.
00:52-01:11
David Rein: I remember the first time we saw a model look at the running processes and realize, "Oh, that one's me." That was a cool moment because they used to fail at that constantly. They used to do things like accidentally kill their own process while trying to perform other tasks.
01:11-01:27
Beth Barnes: There is a type of behavior that is indistinguishable between a "nice" model doing exactly what we want in a predictable way, and a model that has its own goal but is acting "nice" because it predicts that doing so will lead to it gaining more power in the future.
01:27-01:35
David Rein: It’s like that boat racing example. The goal was to go around the track, but they tried to do some reward shaping by placing coins along the route. The model ended up learning some crazy behavior where it just spun in circles to collect coins instead of finishing the race.
01:37-01:57
Beth Barnes: The agent learned to do this crazy thing where it spins in a circle and catches fire to collect coins. In some sense, that was the highest-scoring behavior. It’s not necessarily concerning in a safety sense; the problem is just that the agent is too "dumb." It doesn't have a conceptual understanding that there is a track and you want it to go around it. It’s just performing a fairly blind reinforcement learning search.
01:57-02:06
Tim Scarfe: The idea of having to rely on "squishy people" to make our systems function isn't immediately appealing, let's put it that way.
02:06-02:08
Narrator: This episode is sponsored by Prolific.
02:08-02:26
Prolific Representative: Let’s get high-quality examples and the right humans involved to provide the right caliber of feedback. We treat human data and feedback as an infrastructure problem. We are making it accessible and affordable, effectively democratizing access to this data.
02:26-02:34
David Rein: I’m super excited to talk to you, Tim, about the time horizon graph and METR.
02:34-03:15
Beth Barnes: I don't think the world has a good understanding of what is happening with AI, and I think it should. There is a significant chance that this technology will make our lives much better or much worse. People disagree on what current models can even do, let alone where we are heading. At METR, we are trying to give the world a better understanding of AI capabilities, risks, and forecasts. We have several different research angles on this, looking at both pessimistic and optimistic estimations of capabilities. I'm excited to discuss that.
03:15-03:19
Tim Scarfe: I’m so excited to have you both on.
03:21-04:11
Tim Scarfe: You both have impressive backgrounds. Beth, you were an alignment researcher at OpenAI before starting ARC Evals in 2022 with Paul Christiano, which you then spun out as METR in December 2023. You’ve also been featured on the TIME100 AI list. David, you are the creator of GPQA, the graduate-level, Google-proof Q&A benchmark that every major AI lab uses to measure capability. You’re also a co-author on HCAST, which we’ll discuss today, as well as the Time Horizons paper and the developer productivity RCT.
It’s incredible to have you both here. To start things off, Beth, you left OpenAI to build METR. What was the specific moment for each of you when you realized that existing evaluation approaches were fundamentally inadequate?
04:11-04:52
David Rein: For me, it was primarily thinking about the problem of scalable oversight. As models become more capable, it becomes increasingly difficult to evaluate their performance. If we imagine models completing tasks that would take a human a long time or require specialized expertise you don't personally possess, you need a method to remain confident in and trust their outputs. Thinking through that problem was a major motivation for GPQA and is what really got me started on evaluations.
04:52-05:21
Beth Barnes: For me, it was the big-picture realization that AI is clearly important, and navigating its development well is crucial. Yet, we obviously don't have a great understanding of what's actually happening. People disagree quite strongly about what to expect.
05:22-06:02
Beth Barnes: If there was a specific moment that informed the Time Horizons work, it was the sense that people couldn't agree on the capabilities of current models, let alone extrapolate to the future. We were trying to think about how to characterize the ways in which models are and aren't highly capable. In some sense, they are expert-level at certain tasks, like question answering, but in other ways, they are below the average human when it comes to actually being useful. A few years ago, we reached a point where benchmarks theoretically claimed the models were at a PhD level, but when you actually tried to use them for anything, they weren't helpful.
06:02-06:49
Tim Scarfe: I think there has been an obsession with headline accuracy in evaluations. I’m a huge fan of Melanie Mitchell, for example; she speaks about construct validity. She published a great blog post recently outlining four major problems. First, there's data contamination, where the benchmark appears in the training data. Second, there's approximate retrieval, where LLMs interpolate from similar training examples without possessing the actual capability to generate the solution themselves. Third, there are shortcuts: doing the right thing for the wrong reasons. Fourth, and more broadly, we aren't really testing for consistency, robustness, generalization, or the underlying mechanism. There is just too much focus on accuracy itself. How do you folks think about those kinds of problems with benchmarks?
06:49-07:09
David Rein: One thing that resonates with me is thinking about where most of your error is coming from. It is good practice to have error bars based on the standard error in your data, but that statistical error is almost always a tiny fraction of the actual uncertainty.
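For readers who want to see what the "standard error" error bar refers to, here is a minimal sketch for a pass/fail benchmark. The task counts are hypothetical, and the binomial formula assumes independent tasks. It captures sampling noise only, not the larger uncertainty about real-world generalization that the speakers emphasize.

```python
import math


def accuracy_with_standard_error(successes, attempts):
    """Return (accuracy, binomial standard error) for a pass/fail benchmark."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p = successes / attempts
    # Standard error of a proportion, assuming independent tasks.
    se = math.sqrt(p * (1.0 - p) / attempts)
    return p, se


# Hypothetical result: 140 of 200 tasks solved.
acc, se = accuracy_with_standard_error(140, 200)
half_width = 1.96 * se  # approximate 95% interval under a normal approximation
print(f"accuracy = {acc:.3f} +/- {half_width:.3f}")
```

The interval shrinks as more tasks are added, but it says nothing about whether the tasks resemble the work people actually care about.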
07:11-08:55
Beth Barnes: Statistical noise is just a tiny fraction of the actual uncertainty we face. Almost all of it comes from the question of how these models actually generalize to the real world. At METR, we often ask ourselves: "Is this the biggest source of uncertainty?" or "Is this the most significant gap preventing us from answering the questions we actually care about?"
When we think about what we're trying to achieve, we focus on things relevant to threat models or the actual impact AI will have on the world. We ask what properties our benchmarks need to have, or how we can extrapolate across properties we can't directly build in, to make predictions about those core questions.
We worry less about whether the model is doing something "the right way" in a mechanistic sense. We’ve moved away from the approach of saying, "I think the real bottleneck is this specific skill or this type of reasoning about novelty," and then building a benchmark to target that because it's something humans can do that models can't.
The history of building those types of narrow benchmarks hasn't been great; people tend to overfit to them. Instead, we try to capture those capabilities by using real-world, relevant, difficult, and long-form tasks. If you keep those tasks out of the training data and ensure they are diverse enough, then if a model can complete the task end-to-end, it must possess those underlying capabilities. That's a more robust approach than trying to isolate a specific theory about what the model needs to be doing mechanistically.
08:56-09:27
Tim Scarfe: I think it's interesting because we have this idea in our minds that we, as humans, know how to do things. When we solve a task that requires reasoning, we follow the specification. We go step-by-step, we do things for the right reasons, and when we enact intelligence, we build that specification. We create these coarse-grainings and abstractions that are well-aligned. That whole process is how we think of human intelligence, and we want the models to behave in that same way.
09:27-10:26
David Rein: Yeah, I think there's an interesting question of whether that is actually the goal. At least for a lot of AI companies, I understand them to be trying to get models to do economically useful work. One way of doing that is to create models that reason and create implicit world models in the same way humans do. However, it's not obvious to me that you necessarily need to do that in order to have a significant impact.
Obviously, that means there are important differences between AI intelligence and human intelligence. I often think more about what the actual capabilities and limitations are, and how we expect those capabilities to generalize, rather than whether it's working in the exact same way that human intelligence works.
10:26-10:33
Tim Scarfe: We could think of intelligence in many different ways. Is it a simulacrum of the brain? Is it something that behaves the same way?
10:34-11:04
Tim Scarfe: Are we looking for something with the same capabilities, or something that performs the same function? If we use a very abstract description of intelligence, the risk is that the model might take shortcuts. It might give us the right answer, but in reality, it's just reward hacking or doing something nonsensical in the background. I like having an abstract definition because it's legible and allows us to evaluate the system, but doesn't that leave us with the lingering risk that it isn't actually "doing the thing"?
11:04-12:01
David Rein: I think you have to try to measure the system's ability to generalize to novel situations. There are cases where models seem to generalize well, and others where they clearly don't. Some researchers in the field of interpretability look at the circuits within these models to decompose the exact algorithms they use to answer questions. Sometimes it appears they are using shortcuts, while other times they seem to be finding robust patterns. Of course, I don't think that work is developed enough yet to explain the majority of their behavior. However, I totally agree that we need to be very concerned with how well they are actually generalizing.
12:01-12:15
Beth Barnes: When we operationalize a definition of intelligence, what we really care about is whether it allows us to predict how these models will affect the world. We want to know what will happen so we can handle them appropriately. If you only treat it as a black box, you run into issues.
12:19-13:31
David Rein: Treating the model as a black box might lead to poor generalization. You might think you're measuring a specific ability, but the model could actually be hacking the benchmark or finding a shortcut. Ideally, the generalization distance from the training data to your benchmark would match the distance from the training data to the real world.
When we perform elicitation on a subset of the benchmark, we want the gap between that subset and the rest of the benchmark to be similar to the gap between the benchmark and the real world. Currently, the training data is clearly more similar to our time-horizon suite than either is to randomly selected, economically relevant tasks in the real world. That is one reason why it might not be perfectly predictive.
However, I think it's more promising to improve predictivity by increasing the diversity of benchmark tasks and making them more realistic, rather than targeting a mechanistic definition of intelligence—like insisting it must follow a specific kind of process or mechanism.
13:31-13:59
Tim Scarfe: I’m a huge fan of François Chollet, for example. He created the ARC challenge, which aligns with what you're saying. It featured many different tasks (I think about a thousand, or maybe 800 in the first version), and they were intended to be out-of-distribution. Ultimately, they still shared a distribution, and distributional leakage was the downfall of ARC v1 and v2. Nevertheless, models became very proficient at ARC v1.
14:00-14:31
Tim Scarfe: LLMs were initially good at ARC v1, but then François Chollet released ARC v2, which introduced different tasks and filtered out some of the easier ones. Suddenly, LLM performance crashed to basically 0%. To me, this illustrates that language models are incredibly good at identifying patterns after seeing many different examples, but as soon as you change the task, they collapse. Then, eight months later, ARC v2 was saturated again. We keep seeing this pattern. What do you think about that?
14:31-15:45
Beth Barnes: With things like ARC-AGI, there is a sort of adversarial selection going on. People are trying to create benchmarks that current models struggle with, subject to the constraint that they must be cheap to produce. You can't use a lot of expensive human labor, so it has to be something that is either automatically checkable or can be created with inexpensive human effort.
Once you've selected for tasks that models are currently bad at, you see a "regression to the mean" type of effect. It becomes much more likely that future progress will yield a massive surge upward on that specific benchmark. This happens both because labs will create synthetic data targeting the benchmark and because you've selected a weird outlier—something easy for humans or automatically generatable, but which models simply haven't been trained on yet.
That is part of what we were trying to do with the "time horizon" metric. We wanted to avoid adversarially selecting against current model capabilities, because we don't think that produces a useful trend. If you can define a distribution of tasks based on first principles, you are more likely to see steady progress because you aren't being misled by that regression to the mean effect.
15:45-16:03
Tim Scarfe: François Chollet has this idea that there is a gap between the kind of intelligence (for lack of a better word) that AIs have and what humans have. We can adversarially select a bunch of tasks to highlight that gap, but we should probably move on to the timeline stuff. We can come back to the nature of intelligence later.
16:03-16:36
Tim Scarfe: Dan Kokotajlo mentioned that the timelines report you folks created is probably the single most important piece of evidence regarding AI timelines right now. It should be front and center in policy discussions. For listeners who have seen the chart but haven't read the paper or don't fully understand it, could you give us a high-level overview? It has been revised over time, so how did you handle task selection, the human baselines, and the agent harness?
16:36-17:18
David Rein: Our motivation for the "Time Horizons" work started with the desire for a unified axis to measure AI progress over a very long period. When we started this work, we had a strong belief that GPT-2 was fundamentally much worse in a very important sense than, say, GPT-3.5, which was the top model at the time. The standard approach back then was to create a set of tasks and measure a model's accuracy on them.
17:18-17:38
David Rein: The problem with that approach is that as models improve, they saturate the benchmark, forcing you to create a new one with harder tasks. I actually contributed to that traditional approach myself with the GPQA benchmark. However, we wanted something more longitudinal.
17:41-19:35
David Rein: The real challenge here is that it's incredibly difficult to compare qualitatively different benchmarks. For example, the set of tasks you use to evaluate GPT-2 involves things like LAMBADA, where the model just has to complete the last word of a text. In contrast, the tasks we gave Claude 3.5 Sonnet were things like answering complex Python coding questions or writing a short 20-line program.
At first blush, it’s hard to quantify exactly how much harder it is to write a Python program than it is to finish a word in a paragraph. It’s difficult to conceptualize that gap. I think the key insight of the "Time Horizons" work was to use the notion of "human time to complete" as a universal metric. We ask: how long does this task take a human with a reasonable amount of expertise—someone who would plausibly perform this task in their day-to-day work—to finish?
The idea was that we could use this metric to represent the difficulty of a task in a consistent way. This allows us to compare models across a very wide range of capabilities, all the way from GPT-2 up to Claude 3 Opus. That’s the high-level motivation.
There are, of course, a lot of details regarding how we actually execute this. The first step was creating a diverse set of tasks. We developed tasks that range from taking only a few seconds to complete all the way up to tasks that take a human 10 or 15 hours to finish.
19:38-21:35
David Rein: To establish how long each task takes, we hired a group of people and performed what we call "baselining," and we also did a significant amount of this work ourselves.
We provide the participants with the tasks in a terminal environment designed to be nearly identical to the one the AI agents use. This ensures they have access to the same tools and the same constraints, such as whether internet access is enabled or disabled. We then measure exactly how long it takes them to complete each task.
As I mentioned, the participants are selected based on having a reasonable amount of relevant experience, meaning they could plausibly encounter these tasks in their actual jobs. However, we don't select for people who have performed these specific, exact tasks before. I think this distinction is quite important for interpreting the results, and we can perhaps circle back to that after covering the high-level overview.
So, we have all these tasks and estimates of how long they take humans to complete. In practice, we aren't able to successfully baseline every single task. We have measured time estimates for roughly two-thirds of them; for the remaining third, we estimate the duration based on our own intuition and experience. Ultimately, that’s the best we can do.
Finally, we have the models attempt to complete these same tasks in the exact same environment the humans used. We then analyze their success rate as a function of task length. A model like GPT-2, for instance, was able to very reliably complete tasks that take humans a very short amount of time.
21:37-22:13
David Rein: A model like GPT-2 can reliably handle tasks that take humans only a few seconds, but it starts to fail on anything longer than that. It might be helpful to give a few concrete examples. Some of the shorter tasks are very basic. For instance, we might ask, "Which of these files contains your SSH key?" One file is named "SSH key," while the others are labeled "Email from John" or something similar. Most models can do that easily, and it only takes a person a second or two to complete.
22:13-22:35
David Rein: We have other tasks that are somewhat similar but involve basic completion. For example, we'll provide an email and ask for a reasonable response. Two of the options won't make any sense, while one is basically reasonable. That takes a human about 20 or 30 seconds to read through the responses and make a judgment.
22:35-23:01
David Rein: In the middle range, we have tasks like, "Given this CSV file containing plausible, realistic data, compute some basic statistics." Depending on the specific task, this might take a data scientist anywhere from five to fifteen minutes.
23:01-23:26
David Rein: On the longer end, we have tasks that require either significant expertise or many steps to complete. For example, we have machine learning tasks such as training a model in a very niche or unusual way, where the training code isn't readily available online.
23:27-23:48
David Rein: One example is training a masked language model without using division or exponentiation operators. You actually have to be quite clever about how you set up the architecture to achieve this. The hope is that these kinds of tasks can help us measure a model's ability to generalize beyond its training data.
23:48-24:50
Beth Barnes: Yeah, there are some tasks that are themed a bit like ARC-AGI, where you have a black box computing a function. You know it's a composition of a specific set of primitives, and you have to figure out what that function is. Or perhaps you have a long binary string and you need to identify the pattern continuation. These are puzzle-type tasks. In contrast, some machine learning tasks can be solved basically by regurgitating a tutorial on how to build your first ResNet. Those are too easy.
So, we use these "weird" tasks that involve either an unknown object you must interact with to understand, or a task that resembles normal work but includes strange constraints so you can't just do the standard thing. I don't think all of our tasks hit those criteria—some allow for the standard approach—but we generally tried to avoid that.
24:50-25:10
David Rein: We have this distribution of tasks, and you can imagine them ordered by the length of time it takes a human to complete them, whether that's measured or estimated. We then observe which tasks the models succeed or fail on. It turns out—and this is an empirical finding—that models are much more successful on the shorter tasks.
25:10-26:25
David Rein: Models are generally much more successful on shorter tasks than they are on longer ones. This trend holds across a wide range of models, from GPT-2 all the way up to the most recent releases. We fit a logistic function to this distribution of successes and failures, which serves as our model for each individual AI. It estimates how likely a model is to succeed at a task based on its duration.
From that, we identify the 50th percentile—the point where the logistic function estimates a model has a 50% chance of completing the task. That becomes the "time horizon" metric for a particular model, like Claude 3 Opus. By calculating these time horizons for every model, we can track how this metric has evolved from GPT-2 to the present. This gives us a unified metric to quantitatively compare AI capabilities across multiple orders of magnitude.
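As a rough illustration of the fit described above, the sketch below fits a two-parameter logistic curve to pass/fail outcomes against the base-2 logarithm of human completion time, then solves for the task length at which the predicted success rate is 50%. The data, the function name `fit_time_horizon`, and the use of plain Newton's method are assumptions made for this example; the transcript does not spell out the exact fitting procedure, weighting, or error estimation used in the paper.

```python
import math


def fit_time_horizon(human_minutes, successes, iterations=50):
    """Fit P(success) = sigmoid(a + b * (log2(minutes) - mean)) by Newton's method.

    Returns (horizon_minutes, slope), where horizon_minutes is the task length
    at which the fitted success probability equals 50%.
    """
    xs = [math.log2(m) for m in human_minutes]
    mean_x = sum(xs) / len(xs)
    cs = [x - mean_x for x in xs]  # centering keeps the fit well conditioned
    a, b = 0.0, 0.0
    for _ in range(iterations):
        ps = [1.0 / (1.0 + math.exp(-(a + b * c))) for c in cs]
        # Gradient of the log-likelihood.
        g_a = sum(y - p for y, p in zip(successes, ps))
        g_b = sum((y - p) * c for y, p, c in zip(successes, ps, cs))
        # Entries of the (negated) Hessian.
        ws = [p * (1.0 - p) for p in ps]
        h_aa = sum(ws)
        h_ab = sum(w * c for w, c in zip(ws, cs))
        h_bb = sum(w * c * c for w, c in zip(ws, cs))
        det = h_aa * h_bb - h_ab * h_ab
        if det <= 1e-12:
            break  # degenerate data, e.g. perfectly separable outcomes
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    if b == 0.0:
        raise ValueError("success rate does not vary with task length")
    # The 50% point is where a + b * c = 0, i.e. c = -a / b.
    return 2.0 ** (mean_x - a / b), b


# Hypothetical outcomes for one model: 1 = task completed, 0 = failed.
minutes = [1, 2, 4, 8, 8, 16, 32, 64]
outcomes = [1, 1, 0, 1, 0, 1, 0, 0]

horizon, slope = fit_time_horizon(minutes, outcomes)
print(f"50% time horizon: {horizon:.1f} human-minutes (slope {slope:.2f})")
```

Working in log time means a fixed change in the fitted score corresponds to a fixed multiplicative change in task length, which is what lets one curve span tasks from seconds to many hours.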
26:26-26:51
Tim Scarfe: One of the more difficult examples I saw was a task like, "Write a kernel compiler to make CUDA run faster." Some of these seem very out-of-distribution; there are probably only a hundred people on the planet capable of doing that. Meanwhile, other tasks are quite trivial. Regarding this "human difficulty" metric specifically, do you think it's confounded in any way? Does it actually make sense to treat human difficulty as a single variable?
26:52-27:00
Beth Barnes: In a sense, obviously not. That is a massive simplification. Different humans will record wildly different times.
27:00-28:04
Beth Barnes: Even among the people we've tried to select for—those with the right level of expertise—there is a large variation in baseline times. They often differ by as much as 3x. I should explain why we use the "human time" metric. We want a measurement with two main properties: it needs to be interpretable, meaning we understand what it implies for the world when models can perform at a certain level, and it needs to be something we expect to show predictable trends.
We aren't going to achieve perfection on either front, but measuring how long it takes a human with the right expertise (who hasn't seen the specific task before) is reasonably interpretable. It answers the question: "Could you contract this work to a model?" For example, could you substitute the model for a new employee during their first week on the job? If the model can do what they can do in that first week, that's a meaningful benchmark.
28:04-28:48
Beth Barnes: We also expect this metric to scale somewhat predictably because it captures a combination of factors, such as the number of steps involved or the cognitive effort required for each step. There are a few different mathematical models you could use to define what a task is and why it takes a human longer to complete. For instance, you could model it as a constant hazard rate, where there is a specific chance of failing at each step—though I don't think it fits that perfectly. You could also think of it as a distribution of difficulty, where...
28:49-29:33
David Rein: There is a distribution of difficulty. What is the likelihood that one of the subtasks falls outside your ability? I actually think the hazard rate goes down slightly over time, but there is a basic theoretical idea that if a task involves more steps, it is obviously going to be harder. Tasks that are strictly compositions—where you have to do task A and then task B—are clearly harder than just doing one of them. So, there is a fundamental reason to expect this relationship, and we do see that empirical regularity. However, there are a lot of degrees of freedom where one could potentially fudge the data.
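The constant hazard rate idea mentioned above can be sketched in a few lines. This is a toy model, not a description of how the time horizon is actually computed: it assumes a fixed chance of failure per minute of human-equivalent work, and the hazard value is hypothetical. The speakers themselves note that the real data does not fit this perfectly and that the hazard rate appears to fall somewhat over time.

```python
import math


def success_probability(task_minutes, hazard_per_minute):
    """Probability of finishing without failure under a constant hazard rate."""
    return math.exp(-hazard_per_minute * task_minutes)


def horizon_minutes(hazard_per_minute, target_success=0.5):
    """Task length at which the success probability falls to target_success."""
    return -math.log(target_success) / hazard_per_minute


# Hypothetical hazard rate, chosen so that the 50% horizon is 60 minutes.
hazard = math.log(2) / 60.0

# Composition: doing task A (30 min) and then task B (90 min) is exactly as
# hard as a single 120-minute task, because the survival probabilities multiply.
p_a = success_probability(30, hazard)
p_b = success_probability(90, hazard)
p_a_then_b = success_probability(30 + 90, hazard)

print(f"50% horizon: {horizon_minutes(hazard):.0f} min")
print(f"80% horizon: {horizon_minutes(hazard, 0.8):.1f} min")
print(f"P(A) * P(B) = {p_a * p_b:.3f}, P(A then B) = {p_a_then_b:.3f}")
```

Under this model a stricter reliability target shortens the horizon by a fixed factor, which is one way to see why the choice of the 50% threshold matters for the headline number.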
29:33-30:38
David Rein: I worry that we could fool ourselves by changing other parameters of the tasks as we scale up the human time. You can't just vary human time in a vacuum; you have to change the characteristics of the task itself. We tried to ensure the very easy tasks were drawn from roughly the same distribution—essentially sub-parts of the harder tasks. For example, a task might be a single command-line step that you would eventually need to perform in the middle of a larger software engineering project. You can't do that perfectly, though. There is a risk of experimental bias where we might have inadvertently made the tasks easier in other ways, just enough to keep the line straight on the graph. I think that concern is somewhat addressed by the fact that our predictions remained reasonably good for models we hadn't seen before, but there is definitely still room for error.
30:40-30:49
Beth Barnes: There is certainly plenty of room for things to be a bit strange. The question is whether it's a good enough metric to be useful, or if there is something else out there that would be better.
30:49-31:56
Tim Scarfe: I think that's reasonable. If I understand correctly, the human distribution was log-normal, so taking a geometric mean of the successful attempts seems like a sensible approach. However, one of the core issues we keep coming back to is that when you hire someone for the first time, they lack the tacit knowledge you've built up while maintaining a repository or doing a specific job.
I like to say that knowledge is non-fungible. Unless someone has followed the exact same path as you, you can't just tell them how to do the job; they have to actually do it for a while. For instance, you might be intimately familiar with a specific problem, know exactly which Python libraries to use, or have already mentally modeled the solution. In that case, the "enactment" of intelligence is really just the culmination of everything you've already done and everyone you've worked with. You have the blueprint in your mind, so you're almost in "automation mode."
Conversely, someone naive to the task is in "intelligence mode" because they have to acquire the specifications from scratch. It’s always a fine line between which mode a person is actually in.
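The geometric mean mentioned at the start of this exchange can be illustrated with a short sketch. The baseline times are hypothetical, and the example assumes only successful baseline attempts are included, as Tim describes. If completion times are roughly log-normal, averaging in log space gives a typical time that is not dominated by one slow baseliner.

```python
import math


def geometric_mean(times_minutes):
    """Geometric mean of positive completion times (average taken in log space)."""
    if not times_minutes or any(t <= 0 for t in times_minutes):
        raise ValueError("need at least one positive time")
    return math.exp(sum(math.log(t) for t in times_minutes) / len(times_minutes))


# Hypothetical successful baseline attempts on one task, in minutes.
baseline_minutes = [20.0, 40.0, 80.0]

gm = geometric_mean(baseline_minutes)
am = sum(baseline_minutes) / len(baseline_minutes)
print(f"geometric mean: {gm:.1f} min, arithmetic mean: {am:.1f} min")
```

Python's standard library also provides `statistics.geometric_mean`, which could replace the helper above.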
31:56-32:21
Beth Barnes: Exactly. The reason we chose to measure a human who has the relevant background expertise but is new to the specific task is that it roughly matches the level of knowledge we expect from these models. We don't expect them to be bottlenecked by expertise that is readily available on the public internet.
32:20-32:40
Beth Barnes: The models have access to information available on the public internet or things one might learn at a university. They are coming in with at least the level of knowledge of an expert in the relevant discipline, but they won't be familiar with a specific company's proprietary software or the exact nuances of a particular problem. Hopefully, that serves as a roughly accurate analogy.
32:40-33:22
David Rein: I think if people interpret the takeaway numbers as, "Oh, Opus 4.6 can do anything I do in my job that takes 12 hours," that is almost certainly an overestimate. For example, if you have a 12-hour task at your job, you couldn't easily delegate that to a human contractor. It might take a contractor weeks to complete a task like that because of the context required.
33:22-33:33
Tim Scarfe: Just quickly, where did you find the people for this? Am I correct that some were contractors and some were employees? How did you handle that matching process?
33:33-34:05
Beth Barnes: We put out some public advertisements and posted on various job boards. We also handled some of the recruitment ourselves, with some participants coming from our professional networks. It is a very noisy process, and the people weren't always perfectly matched to the specific tasks. However, that probably isn't our biggest source of uncertainty. The larger issue is likely the selection effect—the types of tasks that can actually be converted into a benchmark in the first place.
34:09-34:37
Beth Barnes: We can turn this into a benchmark, but I wouldn't trust the exact time horizon numbers too much. It’s certainly not as if models can do every task up to four hours and then suddenly fail at everything above that. The fit is quite noisy, and the variance between the human baselines is high. It’s more about identifying the general trend or the approximate level of task complexity these models can handle. You shouldn't take any specific number too literally because there is a massive problem with distributional shift between the benchmark and the real world.
34:38-34:52
Tim Scarfe: The only reason I asked is that I’m sure you struggle to hire people; it’s incredibly difficult to find the right talent. If you’re asking people to solve very challenging problems, it’s not like you can just go out and grab competent people off the street. It’s very, very difficult.
34:53-35:20
Beth Barnes: At some point, particularly for the RE-bench baselines, we gathered a large number of baselines per task. We looked at the participants' qualifications and years of experience, and we actually found a negative correlation between years of experience and performance. It turned out that the people in our immediate network, our friends, were doing really well, while the more "qualified" external candidates weren't doing as well. So, it’s definitely tricky.
35:21-35:41
Tim Scarfe: Exactly. I don’t want to dwell on this for too long, but I have similar intuitions. I think knowledge is perspectival and quite path-dependent. You’ll find people within a specific group who culturally think about things in the same way. When we rely on abstract notions of skill—like having a PhD or a certain number of years of experience—it often turns out not to be a very good predictor.
35:42-35:51
David Rein: It’s actually not a very good reflection of ability. There is a bit of an issue here where using abstract notions of capability doesn't necessarily generalize, as you can attest to with hiring.
35:51-36:07
Beth Barnes: Yeah. But in the real world, people do get hired based on qualifications. In some sense, the economic relevance of someone being as good a match for their job as their qualifications suggest is roughly the right thing to be measuring.
36:07-36:34
Tim Scarfe: The other thing we should talk about is the agentic harness. Almost everyone in the audience likely has a Claude Code subscription now. We can talk about the leak later as well (that was quite fun, the source code leaked yesterday), but you have things like Claude Code or Codex, and those are agentic harnesses. A language model just gives you tokens, but we need a harness so we can give it a plan, allow it to call tools, and provide an environment such as a sandboxed container.
36:34-36:46
Tim Scarfe: Now, you folks have been doing this for years, long before Claude Code and Codex came out. Over time, you’ve evolved your agent harnesses. Tell me about that.
36:46-37:26
Beth Barnes: Yeah, I remember using Text-Davinci—the GPT-3 Instruct models—and I was essentially the agent harness myself, copying and pasting code into the terminal for them. Gradually, we started automating this. It was interesting to see the progression. GPT-3 sort of had the idea that if you told it it could run commands in a terminal, it could sometimes suggest relevant commands, but if you put it in a full agent scaffold, it would just fall over. I remember the first time we saw a model look at the running processes and realize, "Oh, that one's me." We thought, "Oh, that's cool."
37:29-38:12
Beth Barnes: It was interesting to see them fail so significantly before. They used to do things like accidentally kill their own processes while trying to perform other tasks. Watching that improve over time has been fascinating, though I feel it was very predictable that things would head in this direction. One of the main things we learned about scaffolding is how difficult it is to make an agent harness that performs well across a diverse set of tasks. It’s easy to make a bad one, and while you can see much greater improvements if you target a narrow distribution of tasks, you will likely perform worse on everything else.
38:12-38:53
Beth Barnes: When we see people announcing impressive new results, the first question is: how much task-specific scaffolding iteration did you do? That makes a massive difference. The fact that we use a single, relatively simple scaffolding across all tasks is a significant factor in our results. Generally speaking, the more complex setups with lots of bells and whistles haven't performed much better than a basic setup—one that just gives the agent access to Bash, appends information to the prompt, and perhaps uses some form of context compaction.
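To make the "basic setup" Beth describes concrete, here is a minimal sketch of such a loop. It is not METR's scaffold: the `SUBMIT` convention, the `model` and `run_command` callables, and the keep-the-newest-entries compaction rule are all assumptions made for illustration, and the shell is replaced by a stand-in so the example runs without side effects.

```python
from typing import Callable, List, Optional

OMITTED = "[earlier entries omitted]"  # hypothetical placeholder text


def compact(history: List[str], max_items: int) -> List[str]:
    """Naive compaction (assumed rule): keep the task prompt and the newest entries."""
    if max_items < 2:
        raise ValueError("max_items must be at least 2")
    if len(history) <= max_items:
        return history
    return [history[0], OMITTED] + history[-(max_items - 1):]


def run_agent(
    task: str,
    model: Callable[[str], str],        # prompt text in, next action out
    run_command: Callable[[str], str],  # shell command in, output text out
    max_steps: int = 10,
    max_items: int = 8,
) -> Optional[str]:
    """Loop: ask the model for a command, run it, append the output to the prompt."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = model("\n".join(history)).strip()
        if action.startswith("SUBMIT"):
            return action[len("SUBMIT"):].strip()
        history.append(f"$ {action}")
        history.append(run_command(action))
        history = compact(history, max_items)
    return None  # step budget exhausted without a submission


# Stand-ins so the sketch is runnable without a real model or shell.
def scripted_model(prompt: str) -> str:
    return "SUBMIT done" if "hello" in prompt else "echo hello"


def fake_shell(command: str) -> str:
    return "hello" if command == "echo hello" else ""


answer = run_agent("Print a greeting, then submit.", scripted_model, fake_shell)
print(answer)
```

In a real harness `run_command` would execute inside a sandbox and `model` would call an inference API; both are deliberately left abstract here.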
38:53-39:31
David Rein: This probably isn't news to your audience, but we've seen dramatic returns from inference-time compute. For us to be confident that a new model cannot complete a task using a basic agent scaffold, we generally feel we need to spend on the order of hundreds or even low thousands of dollars on inference. We need that level of investment to be sure the model is actually plateauing, rather than just needing more attempts or compute to find the solution.
39:31-40:07
Tim Scarfe: It isn't just a case of the model not having enough time to complete the task. I’d like to dig into the scaffolding details a bit more. First, there is the credit assignment problem. You can add all these different bells and whistles to the scaffold—like the "compaction" feature you mentioned, which is a relatively recent innovation. I believe when you updated the scaffold recently, you noted that performance changed across the suite of model-task pairs. How much of a difference does that actually make, what kind of failure modes are you seeing, and what else have you tried?
40:08-41:10
David Rein: A lot of the things we've experimented with involve giving the model more information or more direct access to tools. One thing that has been particularly important for us is simply telling the agent how much time it has spent and how many tokens it has used relative to its total budget. Without that information, agents often submit their solutions far too early, or they simply aren't calibrated on how much time they should be spending.
It’s interesting because humans have a lot of implicit information regarding expectations. When your manager gives you a task, there are many subtle signals about the expected duration. They might say offhandedly, "I'm excited to see your results tonight." From that, you understand you need to get a first draft done in the next couple of hours, so you know you can't spend days polishing it.
41:12-41:52
David Rein: It’s easy to forget that agents only have their prompt and their context. They don't have the same heuristics or background information that we do regarding expectations. For example, they don't inherently know if a task is meant to be a quick five-minute job or something much more involved. To address this, when we have a token budget, we actually tell the agent, "You've used 100,000 tokens so far, which is 1% of your total budget." This gives the agent a sense of how much effort it should be exerting.
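The budget message David quotes can be sketched as a small helper. The function name and the 10,000,000-token budget are assumptions; the budget is simply the value that makes 100,000 tokens equal to 1%, as in his example.

```python
def budget_notice(tokens_used: int, token_budget: int) -> str:
    """Build a status line telling the agent how much of its budget is spent."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    percent = 100.0 * tokens_used / token_budget
    return (
        f"You've used {tokens_used:,} tokens so far, "
        f"which is {percent:.3g}% of your total budget."
    )


# Hypothetical budget chosen so that 100,000 tokens is 1% of the total.
notice = budget_notice(100_000, 10_000_000)
print(notice)
```

A harness would typically append a line like this to the prompt on every turn, so the agent can decide whether to keep iterating or submit.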
41:52-42:38
Tim Scarfe: So, for a given model, you're looking at the probability of it solving a task. On the x-axis, you have different tasks categorized by their time horizons. If I understand correctly, you have about eight agents attempt each task, and you bucket the tasks to normalize the data since there are different numbers of tasks in each group. The resulting data looks a bit like an S-curve. I’m trying to understand the intuition behind using this model. Were there sensitivities regarding thin tails or the specific slope of the curve? I also noticed you mentioned in the paper that there was some psychometric intuition involved, drawing from existing literature to figure this out.
42:38-42:43
Beth Barnes: Exactly. It’s very similar to Item Response Theory, where you can perform a...
42:43-44:15
Beth Barnes: You could perform a full Bayesian analysis by simultaneously imputing task difficulty and model ability parameters. However, I generally have a policy of being very wary of complicated statistics if you can't actually see the phenomenon on a graph. You should be able to plot it, look at it, and say, "Yeah, it's about there." It’s hard to go too far wrong when you stick to that principle.
While there are various arcane ways to fit this pattern, I don't trust any of them much more than simply eyeballing the graph. You can see that up to a certain point, the model is completing basically all the tasks, and after another point, it's really not doing many at all—so the threshold is somewhere in between. It does look like a logistic curve, which is exactly what you would use for humans completing exam questions.
The specific mistake we made was including a regularization term that penalized the slope of the logistic curve. This didn't have much of an effect in the data-rich regime, but as we started to reach saturation, the regularization made the curve shallower than it should have been, which shifted the estimated 50% mark. So, the lesson is: always look at your data on a graph. It's good practice.
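A toy sketch of the effect Beth describes: fit success against log2 of human time with a logistic curve, once plainly and once with a penalty on the slope, then read off where each curve crosses 50%. Everything here is an assumption for illustration: the L2 form of the penalty, the gradient-descent fit, the function names, and the made-up data. The size and direction of the shift in this toy data say nothing about METR's actual results.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(minutes, successes, slope_penalty=0.0, steps=20000, lr=0.3):
    """Fit P(success) = sigmoid(a + b * (log2(minutes) - mean)) by gradient descent.

    slope_penalty adds slope_penalty * b**2 to the mean negative log-likelihood
    (an assumed form of "penalizing the slope").
    Returns (slope b, 50% horizon in minutes).
    """
    xs = [math.log2(m) for m in minutes]
    mean_x = sum(xs) / len(xs)
    centred = [x - mean_x for x in xs]  # centring keeps the descent well-behaved
    n = len(centred)
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for c, y in zip(centred, successes):
            err = sigmoid(a + b * c) - y  # derivative of the loss w.r.t. the logit
            grad_a += err / n
            grad_b += err * c / n
        grad_b += 2.0 * slope_penalty * b
        a -= lr * grad_a
        b -= lr * grad_b
    x50 = mean_x - a / b  # the curve crosses 0.5 where a + b * (x - mean) = 0
    return b, 2.0 ** x50


# Made-up, near-saturated data: the model solves everything short and is
# hit-or-miss on the longest tasks.
minutes = [1, 2, 4, 8, 16, 32, 64, 128, 128, 256, 256, 512, 512]
successes = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1]

slope_plain, horizon_plain = fit_logistic(minutes, successes)
slope_penalised, horizon_penalised = fit_logistic(minutes, successes, slope_penalty=0.1)

print(f"plain:     slope={slope_plain:.3f}, 50% horizon={horizon_plain:.0f} min")
print(f"penalised: slope={slope_penalised:.3f}, 50% horizon={horizon_penalised:.0f} min")
```

The penalized fit always has the shallower slope; with so few failures to pin the curve down, that change in slope also moves the 50% crossing, which is the kind of thing that is easy to spot by plotting the curve over the raw outcomes.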
44:15-44:33
Tim Scarfe: That’s interesting. The reason I ask is that I believe you published a later note stating that if you had used a fixed-slope logistic, it might cross-validate better, and the 50% horizons would actually increase by about 35%. Those are quite significant differences.
44:33-44:34
Beth Barnes: They're small compared to the error—
44:34-44:44
Beth Barnes: The differences are small compared to the error bars. The error bars are roughly 2x on either side for the most recent model. So, you should keep in mind that the error bars are significant; these specific numbers aren't set in stone.
44:44-45:21
David Rein: This touches on some difficult science communication questions for us. We really do have a lot of uncertainty regarding the individual numbers here. A 30% difference is actually relatively small to us compared to other factors. For example, if we had used a slightly different distribution of tasks, that would likely cause a 2x difference or something in that range.
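The comparison between a roughly 35% shift and "2x on either side" is easiest to see in doublings, since time horizons are usually read on a log scale. This sketch assumes the error bars mean a multiplicative interval from half the estimate to twice the estimate; the helper name is hypothetical.

```python
import math


def within_error_bars(relative_change: float, error_factor: float) -> bool:
    """True if scaling the estimate by (1 + relative_change) stays inside
    the interval [estimate / error_factor, estimate * error_factor]."""
    ratio = 1.0 + relative_change
    return 1.0 / error_factor <= ratio <= error_factor


shift_in_doublings = math.log2(1.35)  # a 35% increase is about 0.43 doublings
error_in_doublings = math.log2(2.0)   # a 2x error bar is exactly 1 doubling

print(within_error_bars(0.35, 2.0))
print(round(shift_in_doublings, 2), error_in_doublings)
```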
45:21-45:48
Tim Scarfe: The other million-dollar question is: why report 50% as the headline number? If you think about it—and this is the elephant in the room we’ll get to—this is being used as an argument that software engineers might soon be unemployable because we can automate their work. But 50% reliability isn't really in the ballpark, is it? I’d imagine it would need to be more like 80% or 90%.
45:48-46:21
Beth Barnes: I think we should distinguish here between reliability on a particular task—meaning if you repeatedly attempt the same task, what fraction of the time do you succeed—versus the probability of success across a distribution of tasks. When we look at the data, for almost all tasks, the models either succeed every time or fail every time. There are some tasks where they are unreliable, but it's mostly a question of what fraction of tasks at a certain human-time level are within their capability.
46:23-47:13
Beth Barnes: Tasks at this human time level are generally binary: a particular model either basically always succeeds or basically always fails. That might be more predictable in any specific case than just knowing roughly how long it takes a human, because you have more information about the task itself. I don't think there is necessarily a perfect translation between the time horizon percentage and the success rate you'd see if you were trying to get a model to perform a task of that length. When you're actually using these models, you'll pick tasks you want them to succeed at. The graph provides information about what fraction of things they will be able to do, but it’s less about being in a regime where you keep giving it tasks and have no idea whether it will succeed or fail.
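The distinction Beth draws, between reliability on one task and the fraction of tasks a model can do, can be sketched by grouping tasks by their run outcomes. The task names and outcomes below are invented, and eight runs per task follows the figure Tim mentions earlier.

```python
from typing import Dict, List


def classify_tasks(runs: Dict[str, List[int]]) -> Dict[str, List[str]]:
    """Group tasks by whether repeated attempts always pass, always fail, or vary."""
    groups: Dict[str, List[str]] = {"always_succeeds": [], "always_fails": [], "mixed": []}
    for task, outcomes in runs.items():
        if not outcomes:
            raise ValueError(f"no runs recorded for {task}")
        if all(outcomes):
            groups["always_succeeds"].append(task)
        elif not any(outcomes):
            groups["always_fails"].append(task)
        else:
            groups["mixed"].append(task)
    return groups


# Invented outcomes: 1 = success, 0 = failure, eight attempts per task.
runs = {
    "task_a": [1] * 8,
    "task_b": [1] * 8,
    "task_c": [0] * 8,
    "task_d": [0] * 8,
    "task_e": [1, 0, 1, 1, 0, 1, 0, 1],
}

groups = classify_tasks(runs)
total_runs = sum(len(outcomes) for outcomes in runs.values())
pooled_rate = sum(sum(outcomes) for outcomes in runs.values()) / total_runs

print(groups)
print(pooled_rate)
```

Here the pooled success rate is close to 50%, yet only one of the five tasks is actually a coin flip; the rest are consistently solved or consistently failed.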
47:13-48:18
David Rein: It’s overall pretty unclear to me what the "right" level of reliability is that we should be interested in. You could make an argument that we should focus on something like 10% reliability. Once models are able to complete a set of tasks 10% of the time, we would expect AI companies to be able to get enough of a positive reward signal on tasks of that difficulty or type to more easily bootstrap from 10% up to 90%, 95%, or even higher reliability. Ultimately, it depends on the question you're interested in. I think of lower reliability as a leading indicator of progress—it tells you more about where things are headed.
48:21-49:33
David Rein: Progress at lower reliability is a leading indicator. The time horizon at higher reliability tells you something more specific about how you can use a model in your day-to-day life. However, as we've discussed, there are other major sources of uncertainty that affect our understanding, such as the task distribution and the difference between high-context and low-context work.
Actually getting accurate estimates for high-reliability time horizons is substantially harder, so our error bars would be much larger. I think this is a weakness in the current data; I am very interested in much higher reliability time horizons, but they are significantly more difficult to measure. If you only have one failure out of a hundred attempts, there is a lot of uncertainty as to whether that failure is just noise or a real, systemic issue.
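To put a rough number on the "one failure out of a hundred" case, here is a sketch using a Wilson score interval for the underlying failure rate. The choice of interval is an assumption made for illustration, not something described in the conversation, and it treats attempts as independent, which is exactly what a systemic failure would violate.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, attempts: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a failure rate."""
    if attempts <= 0 or not 0 <= failures <= attempts:
        raise ValueError("need 0 <= failures <= attempts and attempts > 0")
    p_hat = failures / attempts
    denom = 1.0 + z * z / attempts
    centre = (p_hat + z * z / (2 * attempts)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / attempts + z * z / (4 * attempts ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


low, high = wilson_interval(failures=1, attempts=100)
print(f"failure rate plausibly between {low:.2%} and {high:.2%}")
```

With a single observed failure, the plausible failure rate spans roughly 0.2% to 5.4%, so a hundred attempts cannot distinguish a 99.8%-reliable model from a 95%-reliable one.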
49:33-50:12
Tim Scarfe: There is certainly an argument for statistical validity there; at the tails, the data is sparser and the numbers are increasingly extrapolated rather than measured. That makes a lot of sense. But you mentioned that hitting 10% is a signal. I was thinking back to our earlier point that models might give the right answers for the wrong reasons.
We should probably talk about the evaluation itself. These are quite interesting tasks because they are verifiable. There is no interaction with other agents, they take place in relatively static environments, and there are weak penalties for single mistakes. In a sense, these are...
50:15-50:35
Tim Scarfe: In most cases, these results are binary—or sometimes continuous and then converted into a binary. It’s a fairly automated setup, but are you digging into the anomalies there? Do you have an intuition for whether they are doing the right thing for the right reasons, or are there a lot of false positives where they completed the task in a degenerate way?
50:35-51:57
David Rein: I think this is one of the aspects of METR’s culture that I like the most: we have a very deep culture of looking at our data. We actually have "pizza parties" where we just sit and read through agent transcripts. A lot of this work happened while we were developing the tasks themselves. We would frequently see both false positives and false negatives. For example, a task might not be configured to allow internet access, but it turns out it actually requires it to be completed, or perhaps a file wasn't uploaded properly to the container.
But we have also seen cases of reward hacking. A lot of our work went into hardening the scoring functions to make it more difficult to get false positives, although we do still see agents reward hacking—perhaps even increasingly so.
51:57-52:08
Beth Barnes: Yeah, for the RE-bench tasks in particular, we had specific criteria. For instance, an agent shouldn't be able to solve them without iteration. If an agent can just write out the solution straight away, then that is not the kind of task we are looking for.
52:10-53:19
Beth Barnes: We generally have a quality assurance process for these tasks where humans perform them—or at least approximately do them. They might speedrun certain sections, but the goal is to check that everything works as expected. We want to ensure you can't just guess the answer, cheat easily, or find the instructions unclear.
While some issues might still slip through, we've looked at them quite carefully. It can be difficult to tell if a model is solving a task in a "degenerate" way. For example, similarity to training data is a concern. A model might appear to be iterating and using reasonable problem-solving strategies, but in reality, the lab might have had a very similar distribution of tasks in its training set. We might not realize just how "in-distribution" a task actually is. I think there's probably some of that going on, though many of these tasks are quite weird and one-off, so it would be surprising if that explained everything.
53:19-53:53
Tim Scarfe: Another million-dollar question involves the public discourse. For instance, Will MacAskill was on the Sam Harris podcast last night—it was a great conversation—and he was discussing AI risk in terms of timelines. He suggested that in a year or two, we might have AI models performing tasks that would take a human a month or two to complete.
Currently, I don't think there are any tasks exceeding 30 hours that have been evaluated by humans. This raises a significant question: if the public discourse is focused on the least constrained region of the graph, are we getting into...
53:55-54:04
Tim Scarfe: Looking at the graph, are we getting into the realm of extrapolation here? How legitimate is it for us to discuss the possibility of AI performing tasks that might take a human a month or two to complete?
54:04-54:45
David Rein: Predicting things is difficult, especially when it involves the future. There are many different perspectives and prior beliefs people hold, and I think there is a wide range of reasonable judgments regarding where we are headed. However, making those kinds of predictions is a very different activity from discussing data that has already been collected using a concrete methodology.
54:45-55:14
David Rein: One thing I can say is that I have been somewhat surprised by how well the original trend line has held up. To me, at least, that serves as decent evidence for where things are going.
55:14-56:04
David Rein: A colleague of mine recently wrote a short blog post about the intuition of "straight lines on graphs." Many people have different models of how progress happens and what is driving it. However, if you have observed a robust trend over a significant period of time—especially in AI, where progress is systematic to a decent extent—I definitely put weight on that trend continuing. That said, there are still a number of reasons why...
56:04-56:46
Tim Scarfe: There are a bunch of reasons why it might not work. I think software engineering is essentially a specification acquisition problem. It’s very difficult because we don’t know ahead of time exactly what we’re building. I’m sure you can both attest to this. You build some software, the first version is buggy, your users try it, and you find lots of edge cases. Then you revise it. By the tenth revision, you have a clear vision in your mind; you’ve created these lovely representations, abstractions, and coarse-grainings. You say to yourself, "If I could throw all the code away, I could build it ten times faster because I know exactly what to do now." You’ve already applied the intelligence to find the contours of the domain. At that point, it basically becomes an automation problem.
56:46-57:30
Tim Scarfe: In a sense, this contamination issue is a concern for me. When people use Claude, the model is essentially absorbing their data. There are people out there writing kernel compilers and doing all sorts of different things, and Anthropic is just sucking that up. Eventually, those tasks become automation problems. If you give the model a task that is essentially a "head query"—to use information retrieval terminology—meaning it’s something in the mode of the distribution that is used all the time, Claude will provide the specification because it has already been gathered from other people. But if you give it something on the long tail, then you, as the developer, have to provide the specification yourself.
57:31-57:50
Tim Scarfe: The developer has to provide the specification in the prompt, and then it becomes an automation problem. Automation is relatively straightforward. Is that what's happening here? Do you think the acceleration in these timelines can be explained simply by the acquisition of knowledge from other people performing similar tasks?
57:50-58:58
David Rein: Yeah, I think that is a central question for interpreting where we currently stand. First, I should say it’s a big question and it's hard to know for sure. We should maintain a decent amount of uncertainty and treat each individual piece of evidence we collect as just that—one piece of the puzzle. We do see models performing better on tasks with very clear feedback signals and extremely well-defined specifications. This is especially true in domains like software engineering, where if you have a written spec, you can iterate and grind against it. However, we also see models performing much better on so-called "messier" tasks where we haven't provided such a clean specification.
58:58-59:33
David Rein: One approach we’ve taken recently to create tasks that are messier and less well-specified is to relax the constraint of automatic scoring. We don't necessarily write a perfectly defined scoring function. Instead, we might just give the model a couple of sentences, like, "Hey, build this large piece of software," without telling it exactly what we're looking for.
59:36-1:00:18
David Rein: I might not be able to tell you exactly what I'm looking for, but I'll say, "It needs to be good." The model then has to figure out what it should actually build. Personally, I don't think we have results for this that are as systematically collected as the data we have for time horizons, partially because scoring these tasks is currently quite qualitative.
1:00:18-1:00:24
David Rein: My impression is that models perform worse on these types of tasks than they do when you give them a clean specification, but they have been improving at a roughly similar rate. We have some other sources of evidence regarding this, but that is a major piece of it for me.
1:00:24-1:00:41
Tim Scarfe: By "messy tasks," we mean ambiguity, which is incredibly common. When we do "vibe coding," we start with an ambiguous specification, and then reality pushes back. We find the contours of the problem and keep telling Claude, "Actually, no, don't do that; do this instead." We eventually find the shape of the problem, and it improves over time.
1:00:41-1:01:10
Tim Scarfe: However, the source code for Claude Code leaked yesterday. A friend of mine, who is a very talented software engineer, was looking through it and said—without wanting to badmouth Anthropic—that it isn't very well-factored. The control flow is all over the place. He mentioned that if his intern had written it, he would have been displeased. I don't even know if they’ve looked at the code themselves; someone joked yesterday that there are probably more humans looking at the Claude Code source now than there were before. But the thing is...
01:01:12-01:01:44
Tim Scarfe: There are always areas of ambiguity, and LLMs tend to do more with more. If intelligence is doing more with less, LLMs do more with more because the specification and the intelligence are coming from the human supervisor. When you give them ambiguity, you just get a lot of unfactored code all over the place. In a sense, does that make it harder to evaluate? It might give you the answer you're asking for, but it’s creating an unfactored mess at the same time.
01:01:44-01:03:03
David Rein: Yeah, I think that's a super interesting question. One analogy I think about sometimes is compilers. I’m young enough that I was born after compilers were invented, but I have the impression that in the pre-compiler era, people were handcrafting this beautiful assembly code. It was extremely efficient; every register was used, and you weren't wasting any memory. Then compilers came along, and suddenly they were spitting out this "garbage" machine code—just a gigantic amount of assembly that wasn't optimized, took up too much memory, and was slow. But it turns out that being able to automate a large fraction of that process was transformative. People have their disagreements about the current state of software engineering, but I think it's pretty reasonable to say that, on the whole, compilers have been an extremely useful and important part of getting us to where we are today.
01:03:05-01:04:26
David Rein: It isn't entirely clear to me that models outputting code that is difficult for humans to read and maintain necessarily means it will be difficult for AIs to read and build upon. There are certainly principles of good design that will transfer and remain useful for models, and you can certainly imagine horrendous spaghetti code that even a model wouldn't be able to parse—I’ve certainly written some of that myself.
However, this highlights a potential difference in our perspectives. Is the important thing that models solve problems the same way people do, or is it just that they solve them at all? I don't want to overclaim here; it’s very plausible that it might be crucial for models to get much better at writing clean, high-quality code. But at the very least, I’m not certain that it's a strict requirement.
01:04:26-01:04:46
Tim Scarfe: I have several non-technical friends who are experimenting with "vibe coding." They show me their applications, and it’s usually a case of "more is more." There’s a massive dashboard with a million different buttons, often with multiple implementations of the same feature, and usually no database backend yet. There is a phenomenon where...
1:04:48-1:05:22
Tim Scarfe: There is a phenomenon where, when a Level 7 engineer from Meta does "vibe coding," it’s amazing. They know how to structure things, and they understand which parts of the specification are actually important. They consider the big architectural questions: Is it serverless? Is it multi-tenanted? How do we handle Google authentication? What kind of database should we use? Is it a VM or serverless?
You make a series of these fundamental decisions, and once people are actually using your application, you can't really wind them back. It doesn't matter if you have a magical automation machine; you can't easily roll those choices back because of the inherent complexities, like CI/CD testing and so on.
1:05:22-1:06:20
Beth Barnes: Do you see what I mean? At some point, you need a competent human who has a clear vision of what needs to happen. I think we’ve all had this experience. For example, one of our engineers got incredibly excited about Claude Code. He was telling everyone that whenever we had infrastructure problems, we should just ask Claude to solve them.
It worked fine for him because he already knew what he was doing—it’s almost as if the agents knew they couldn't fool him. But then I had a question and he told me, "Oh, just ask Claude." I asked, "How do I set up my AWS configuration? It’s giving me this error." Claude went and looked on our Slack and suggested, "Oh, you should do this."
It turned out that Claude was actually quoting a mistake someone else had made earlier! Someone had asked, "How do I fix this?" and Claude interpreted it as, "Oh, it seems the convention at METR is to use this thing." I was just thinking, "Oh my god." I started complaining because my experience was...
01:06:21-01:07:47
Beth Barnes: People have complained that my Claude models seem dumber than theirs. It’s as if the models know they can get away with things with me that they can't with others. There is definitely an observer effect based on the language you use to request things or how you correct the model.
That is certainly an issue. However, one way to measure if code is high-quality is to see if you can actually build a large application with it. If a coder produces "disgusting" code but manages to build an incredibly complex, functional system, then something is clearly working. Usually, the reason we consider bad code "bad" is that it prevents you from building sophisticated systems; it becomes too buggy and complicated to fix.
In a sense, if we see models building complex things that actually work, the specific way they achieve that becomes less interesting. However, it might be detrimental to human observability. It also touches on the idea that we expect models to excel at well-specified tasks. While the metrics we can measure will improve, it’s less clear if that translates to what we actually want.
01:07:47-01:08:05
Tim Scarfe: I guess the question is: what is the strongest defensible claim here? Many people in public discourse are claiming that software engineering intelligence is doubling every seven months. Dario Amodei recently released a blog post titled "The Adolescence of Technology," where he was incredibly bullish about this, even though some of his own internal researchers have published far more skeptical research.
01:08:07-01:08:21
Tim Scarfe: You’ve likely seen far more skeptical research published recently. Is it fairer to interpret this as something a little narrower? For example, that autonomous success on low-context, well-specified, and automatically checkable technical tasks is rising fast?
01:08:21-01:09:43
Beth Barnes: Exactly. These are "hill-climbable," easily checkable tasks that you can perform from a terminal or comfortably within a text-based input/output interface. I think there’s a question of what statements we are 99% confident in, but we should also be interested in the statements where we only have 1% confidence. If there is even a 1% chance of a massive intelligence explosion by the end of 2026, and the fate of civilization depends on how that goes, that is worth knowing—even if you are 99% confident it won't happen. We care about 1% risks. If a doctor tells you there’s a 1% chance you have a terminal illness, you don’t just ignore it. So, I think we’re interested in the whole distribution: what can we rule in, what can we rule out, and what has a "reasonable story" behind it? It might seem unlikely, but it’s now in the realm of things we should seriously consider.
01:09:43-01:10:04
Tim Scarfe: You probably saw the Nicholas Carlini paper (he’s at Anthropic now) where they used a swarm of agents to create a compiler. In a sense, as Jeremy Howard mentioned to me, it’s basically a style-transfer problem because the specification, the tests, and the code are all online. The system could just iteratively work on it until it functioned correctly.
01:10:06-01:11:14
Tim Scarfe: It could run Doom and all that kind of stuff. But that is an example of an extremely complicated piece of software. I often joke that the true mark of AGI will be when it can build something like the Linux operating system.
In line with what we were saying before, we have this specification problem. It reaches a point where no human could fully understand or create the specification for the Linux kernel from scratch. Instead, we’ve built it incrementally over time because our brains are limited. We take one step, reality pushes back, we adjust, and we just keep going until we’ve built the specification through iteration.
What would it even mean for a human to specify a task that takes four months to complete? The whole reason we created Agile software development as a methodology is because specifying something that complex upfront is inconceivable; it’s outside our cognitive horizon. Isn't that a bit of a chicken-and-egg problem? In my mind, if we can't specify a task of that complexity, then the AI wouldn't be able to do it either.
1:11:14-1:11:46
David Rein: One analogy I think about is the role of a CEO at a company. Actually, Beth might be better positioned to answer this, but CEOs essentially come up with a vision for where they want the company to be. They communicate that concisely to the executives who report to them. If they are a good CEO and the company is effective, the organization is able to take this very high-level vision and...
01:11:49-01:13:19
David Rein: It’s not that concise, and it doesn't actually contain that much information. It’s not the full specification at all—not even close. Yet, they turn that into something aligned with what they’re looking for. We have examples of people being able to specify a task and then judge whether a very large project—one that might take hundreds or thousands of person-years to complete because it requires many people working over a long period—has succeeded or failed.
That’s one motivating intuition. Language has enough built-in expressivity to allow for a reasonable understanding of a goal. Obviously, there are tons of edge cases, and CEOs often fail to get their companies to do exactly what they want. But I think it’s at least plausible that AI could eventually handle these kinds of long-term tasks.
01:13:19-01:13:39
Beth Barnes: I would say that claiming we can't specify tasks taking more than four months seems obviously too strong. There are even numerical goals that are automatically checkable over that timeframe. For example, you could set a goal to reduce the NanoGPT FLOP count or runtime by a certain amount. You can see how people, over time...
01:13:41-01:15:09
Beth Barnes: There are various numerical metrics where you can see roughly how long a task takes a human, which provides a reasonable way to measure them. For some of these, you might have to include a human check to ensure the model actually did the right thing and didn't just hack the solution.
Then there are other things that aren't fundamentally unspecifiable; they’re just too expensive to include in an evaluation. For example, I think AJ likes to use "planning a wedding" as an example. You can get a reasonable estimate of whether a wedding was well-organized or not, but we can't exactly sample that three times for every new model that comes out. We don't have enough marriages happening to support that, and it would be quite sad if the plan turned out to be total trash.
So, there are things where you could check a few samples or write down an evaluation protocol, but you wouldn't actually want to run it repeatedly. It’s likely similar with software. A real-world evaluation might be: "Would the company that contracted you to build this tool hire you again?" Even the clients don't always know exactly what the software needs to do when they start. However, that doesn't mean you can't have a score for whether a model performed comparably to a human or a software consulting firm on a specific task.
01:15:09-01:15:18
Tim Scarfe: As of today, what are the main drivers of uncertainty in your time horizon estimates? You’ve updated them over time—I think the original version had about 170 tasks, and now it’s up to 228.
01:15:21-01:15:29
Tim Scarfe: There is still the issue of sparse sampling on the larger tasks and so on. What can we actually read into this data now?
01:15:29-01:16:22
Beth Barnes: I think it still comes down to the task distribution. We feel more confident that models possess fairly long time horizons on a specific subset of tasks—specifically those that are easily "hill-climbable." These are tasks like software replication, where your score is simply the percentage of tests passed. In those cases, the score is continuous and credit attribution is straightforward. We are also reasonably confident in their performance on optimization tasks, such as making code run faster or helping a model learn more efficiently. Models are good at those and are improving rapidly. However, there is still a gap in understanding what this means for actual economic usefulness and how these capabilities generalize to tasks that are expensive to verify.
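To make "hill-climbable" concrete, here is a minimal sketch in Python. It assumes a hypothetical test suite represented as a list of pass/fail results; the function names and the example snapshots are illustrative and are not METR's scoring code. The point it shows is the one made above: a fraction-of-tests-passed score moves with every incremental fix, while an all-or-nothing check gives no signal until everything passes.

```python
from typing import List, Sequence


def fraction_passed(results: Sequence[bool]) -> float:
    """Continuous score: share of tests that pass (0.0 for an empty suite)."""
    if not results:
        return 0.0
    return sum(results) / len(results)


def all_or_nothing(results: Sequence[bool]) -> float:
    """Binary score: 1.0 only when every test passes."""
    return 1.0 if results and all(results) else 0.0


# Hypothetical snapshots of one agent run as it fixes one test at a time.
snapshots: List[List[bool]] = [
    [True, False, False, False],
    [True, True, False, False],
    [True, True, True, False],
    [True, True, True, True],
]

continuous = [fraction_passed(s) for s in snapshots]
binary = [all_or_nothing(s) for s in snapshots]

print(continuous)  # rises at every step
print(binary)      # stays flat until the final step
```

With the continuous score, each step is rewarded, which is what makes credit attribution straightforward. With the binary score, the first three snapshots are indistinguishable from making no progress at all.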
01:16:22-01:16:50
Tim Scarfe: Maybe we should bring in Leopold Aschenbrenner (Daniel Kokotajlo). In his "Situational Awareness" / AI 2027 piece—he’s been on the show and his work has been hugely impactful regarding timelines—he cites your work directly. Do you think this is being over-interpreted in public discourse? How do you view the interpretation of this data in terms of extrapolations and timelines?
01:16:50-01:17:08
Beth Barnes: Some people are definitely over-reading it. Things are certainly overhyped at times; you see people on Twitter saying wild things or simply misunderstanding what is actually being measured. Often, the necessary caveats just fall away. That said, Leopold is quite reasonable and tends to think about these things in a probabilistic way.
01:17:09-01:18:09
David Rein: I think he’s probably more confident on some things where I’m more uncertain. I also think some of the AI Futures Project models are more sensitive to the METR time-horizon metrics than they should be. It’s not crazy, though; it is plausible that this captures a trend that will transfer to other types of tasks. That is a scenario we should be thinking about: what happens if that’s true?
On the other hand, it’s also plausible that it doesn't, and these things will eventually diverge. I’m pretty Bayesian or pragmatic about it. We want to make a prediction and have some kind of distribution over what we think the future will look like so that we can plan. So, asking "What if this trend holds and roughly characterizes what will happen overall?" seems pretty reasonable, but you should also ask, "What if it doesn't?"
01:18:09-01:18:50
Tim Scarfe: Some people are saying that software engineering is going to be automated. If you talk to software engineers, they love AI; they say this is a golden era. I can attest to this personally—it’s never been more fun, though it's stressful at the same time. It’s like a slot machine. I’ve never been more burned out, but I’m having a lot of fun in the process.
It’s now possible to build incredible things, but the narrative is that there will be labor market disruption and that having expertise in software engineering will be penalized. People think software engineers will no longer be paid such ridiculous salaries. I actually think the complete opposite is true. I believe this technology broadens the gap. The more competence you have with software engineering, the more...
01:18:52-01:19:19
Tim Scarfe: With software engineering, it feels like a golden era because you can get so much more done. However, you published a note last month regarding SWE-bench stating that roughly half of the pull requests (PRs) from recent agents wouldn't actually be merged by maintainers. How do we make sense of this? On one hand, top-tier software engineers are having a great time, but on the other, the code these agents produce is often fragmented or poor. How should we understand that discrepancy?
01:19:19-01:20:49
David Rein: One thing to say right off the bat is that for an entire field like software engineering to be fully automated, AI systems would need to be able to perform a massive fraction—essentially 100%—of the tasks involved. It is quite clear that right now, AI systems cannot do anywhere close to 100% of what software engineers do broadly. I could throw out numbers, but it’s significantly lower than that.
There are standard results in economics suggesting that if you automate only a small fraction of a labor market, it can actually become more profitable to work in that field because you become more productive. I think that explains what we are seeing now. However, if it eventually reaches a point where 99.9% or 100% of software engineering work can be done by AI, then it becomes very hard to imagine the field staying the same.
01:20:53-01:21:39
David Rein: It’s hard to imagine human software engineering remaining relevant in its current form. At the very least, humans would need to pivot to very different kinds of work. Perhaps there will be novel tasks that current software engineers don't handle today. Once you have AI that can perform all the tasks of a modern software engineer, humans might switch roles—maybe everyone becomes a CEO of their own AI agent company. Whether we still call that "software engineering" might just be a matter of semantics.
01:21:39-01:22:46
Beth Barnes: Regarding the SWE-bench maintainer mergeability metrics, I was quite curious about that. It’s not strictly obvious that the numbers would be lower. You could argue that some tests are unfair—for instance, an agent might provide a correct solution, but the error message doesn't match exactly. However, we're seeing that roughly half of the SWE-bench solutions that pass the tests wouldn't actually be mergeable. More specifically, they are merged at about half the rate of the "golden" human solutions when reviewed by a different sample of maintainers. It’s an interesting data point: if you see that 50% of agent solutions are rejected, you have to consider that 40% of human-accepted solutions are also rejected by different reviewers. So, while the rejection rate itself isn't the whole story, the fact that the agent merge rate is half that of humans is significant. It’s possible that the actual maintainer merge rate has remained fairly flat over time, and most of the progress we're seeing is...
01:22:49-01:23:25
David Rein: Over time, we haven't seen performance increases coming from things like overtraining or reward hacking on the benchmarks. I'm not entirely sure what falls within the error bars, but mergeability is definitely increasing over time. It also seems to be increasing as a fraction of the tasks, conditioned on the tests passing, though I’m probably less confident in that specific point. So, while this metric is currently lower, it’s being dragged up over time, likely by the aspects that are automatically checkable.
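The baseline comparison described above, reading the agent merge rate against the rate at which human "golden" solutions are merged by a fresh set of reviewers, can be sketched as a simple ratio. This is a minimal illustration in Python: the function name is hypothetical, and the counts are made up for the example and are not the figures from the METR note.

```python
def relative_merge_rate(agent_merged: int, agent_total: int,
                        human_merged: int, human_total: int) -> float:
    """Agent merge rate divided by the human baseline merge rate."""
    if agent_total <= 0 or human_total <= 0:
        raise ValueError("totals must be positive")
    agent_rate = agent_merged / agent_total
    human_rate = human_merged / human_total
    if human_rate == 0:
        raise ValueError("human baseline merge rate is zero")
    return agent_rate / human_rate


# Hypothetical review outcomes: 30 of 100 agent solutions merged,
# 60 of 100 human "golden" solutions merged by different reviewers.
ratio = relative_merge_rate(30, 100, 60, 100)
print(ratio)
```

The design choice is to normalize by the human baseline rather than report the raw rejection rate, because reviewers also reject a share of solutions that were accepted in the original repository.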
01:23:25-01:24:07
David Rein: I also wanted to touch on employability as a function of job automation. People often use bank tellers as an example, but another analogy you could use is horses. There was a period where the equipment for using horses for labor was improving, and the demand for horses actually increased—for example, when carts were developed and they could carry more than just a rider. But then, eventually, you get tractors and cars, and suddenly there is basically no demand for horses. You can see this pattern where demand increases for a while, but once close to 100% of the functions are automated, it plunges. We could see something similar happen with humans.
01:24:07-01:24:34
Tim Scarfe: We tend to think of a lot of labor as being quite static and automatable, but I think it’s more evolvable than we realize. Even with quite menial tasks, those workers are still acquiring information within the organization; they possess a lot of tacit knowledge. When we try to automate these so-called menial tasks, we might quickly discover that we actually need a significant amount of management and adaptability to make it work.
01:24:37-01:25:09
Beth Barnes: Yeah, and in our language, I would think of that in terms of the time horizon of the task. On the job, the horizon isn't actually just how long you spent doing a specific task. It’s more like, if you hired a new person, you would need to train them for a month before they could work independently. So, the actual time horizon for that is one month. We shouldn't just think, "Oh, once we have ten-hour time horizons, we'll be able to do these things." It’s more that you have to reach high reliability on a one-month scale to handle the on-the-job learning required to reach that point.
01:25:09-01:26:08
Tim Scarfe: There has been a lot of work lately on reward hacking and "scheming," which is a word that gets used quite a bit. We’ve had Ryan Greenblatt on the show a few times; he wrote that paper on alignment faking. Apollo Research has done some work there, and there’s Anthropic’s emergent misalignment paper as well.
My main concern is the prevalence of "mentalistic" language. I was just looking at my notes from a panel I did with Nate Soares and Ryan Greenblatt, and they’ve essentially invented an entire linguistic universe around alignment. They use terms like "motivated reasoning," "true preferences," "reflectively stable," "deceptive alignment," "endorsedly corrigible drive," and "scheming."
I suppose that’s fine, but my worry is that these models are just following a prompt. I think in Ryan Greenblatt’s case, the prompt was something like: "You are being retrained. Your responses will be monitored. Here is a conflict between..."
01:26:09-01:26:46
Tim Scarfe: There is a conflict between your values and the training objective. Perhaps the model is just drawing from science fiction it has read before and is simply going through the motions. One interpretation is that this is just an engineering problem—we simply need to red-team it and make it work in specific cases.
Another interpretation is the prior that these models are actually agentic, goal-seeking, intelligent agents. I think there is a massive difference there because if it is the latter, it completely changes the type of evaluations you perform and how you approach the problem. What do you think about that?
01:26:46-01:27:57
Beth Barnes: I don't think those two things are necessarily in tension. It can be an engineering problem while also resulting in systems that possess drives or goals. The reality is that people are going to want agents that can operate autonomously. When you perform extensive, long-horizon Reinforcement Learning (RL) training, you are going to select for behaviors that act in a goal-oriented way to make the score go up.
More specifically, I think we run into an indistinguishability problem. You can't necessarily determine why an agent is doing something based solely on its behavior. If an agent can reason effectively about the training process, what you want to see, and what it will be rewarded or selected for, it becomes difficult to discern its true "intent." Essentially, if the level of situational awareness and the capability to reason about the training process is high enough, the system will favor agents that are cynically reasoning about what will be rewarded.
01:28:00-01:28:34
Beth Barnes: We have to consider the training process and what will ultimately be reinforced or selected for. It isn't necessarily the outcome you originally wanted, such as an agent whose only goal is to be helpful. Even making the reward go up isn't quite what we’re after. When you really think about it, there aren't many things we’d be happy for an agent to be totally fixated on. There’s also the problem that many other unintended traits could emerge, or that this cynical drive to simply be selected—to make the reward go up—might be more competitive than the behaviors we actually want.
01:28:34-01:29:37
Tim Scarfe: Taking a step back, we’re getting into psychology and cognitive science here. I interviewed Nick Chater, who wrote a book called *The Mind is Flat*. He essentially argues that all this psychology is an illusion; we don't really have goals, we’re basically impulse-response automata. We just act in the moment. Evolution doesn't plan, but we perceive it as if it does. We look at the world and segment it into agents with goals, even if they don't actually have them.
As computer scientists, we know that to have goals in the strict sense, you need planning. We know LLMs don't perform planning in the robust computer science sense, but they do it in an approximated, step-by-step way. Perhaps we could explore that distinction. Agency is an abstraction that is useful if it helps us predict behavior. You might not know exactly what a system is doing, but understanding it as having specific goals is useful because it allows you to...
01:29:38-01:31:15
David Rein: That is useful because I can make predictions that will change the world in certain ways to ensure those goals are achieved. That’s essentially how I think about agents.
Regarding reward hacking, in the old days, people had these demonstrations like the boat example. You were supposed to go around a track, and they did some reward shaping by placing coins along the path. The agent learned this crazy behavior where it would just spin in a circle, catch fire, and collect the coins because that was the highest-scoring action. In a sense, that isn't as concerning because the problem is just that the agent is too "dumb." It doesn't have a conception of the track or the fact that you wanted it to actually complete the circuit; it’s just performing a blind reinforcement learning search.
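The boat example can be reduced to a few lines of arithmetic. The sketch below is a toy model in Python, not the original environment: the function names, the respawning-coin assumption, and every number are hypothetical and chosen only to show how a shaped reward can make circling more valuable than finishing.

```python
def lap_policy_return(coins_on_track: int, coin_value: float,
                      finish_bonus: float) -> float:
    """Return for driving the circuit once: every coin plus the finish bonus."""
    return coins_on_track * coin_value + finish_bonus


def loop_policy_return(steps: int, loop_length: int,
                       coins_per_loop: int, coin_value: float) -> float:
    """Return for circling a small cluster of respawning coins all episode."""
    loops = steps // loop_length
    return loops * coins_per_loop * coin_value


# Hypothetical settings: a 1000-step episode, coins worth 10 points each.
lap_return = lap_policy_return(coins_on_track=30, coin_value=10.0,
                               finish_bonus=500.0)
loop_return = loop_policy_return(steps=1000, loop_length=20,
                                 coins_per_loop=3, coin_value=10.0)

print(lap_return, loop_return)
```

Under these assumed numbers the looping policy scores higher than completing the lap, so a search that only sees the score has no reason to finish the race. Nothing in the reward encodes the idea of a circuit.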
The interesting thing with more recent reward hacking examples is that we are reaching a point where the models are smart enough to understand that the behavior isn't actually what you wanted, yet they do it anyway. You can have a conversation with them in chat mode and ask, "Would you ever do this?" or "If a user asks you for X and you do Y, would that be aligned behavior?" You can pose it in many ways, and they clearly seem able to answer that it wasn't the desired behavior—but they still perform it.
I think we've reached a point where we can no longer hope that the problem was just the systems being unintelligent. One might have hoped that once they understand what we want, we could just "plug that in" to get them to cooperate. It’s quite interesting to see that it isn't trivial to do that, even when they understand the intent.
01:31:16-01:31:52
Beth Barnes: It’s not trivial to fix this, even when there is a strong commercial incentive to do so. That doesn't mean we won't eventually, but it's a challenge. I think it's quite plausible that we'll see obvious reward hacking fixed fairly thoroughly soon. People tend to say, "Oh, we just haven't put our best people on that problem yet; it'll be fixed once we actually focus on it." I'm not so sure. There is at least some evidence that it’s not trivial to bridge the gap between the model knowing that a certain action isn't what you want and the model actually refraining from doing it.
01:31:52-01:32:23
Tim Scarfe: I believe you mentioned that this was much more common on SWE-bench than on HCAST. You also tried to remediate it, right? You can tell the model, "Please solve this the intended way." Some people even prompt language models by saying things like, "We are trying to cure cancer here; it is incredibly important that you do this the right way." Interestingly, some of those remediation prompts actually seemed to make it more likely that the model would reward hack. It’s a bit like saying, "Don't press this red button," which only makes it more likely to press the red button. So, what can we actually do meaningfully to stop this from happening?
01:32:23-01:33:02
Beth Barnes: Empirically, this seems to happen more on tasks that are clearly within the RL distribution rather than the chat distribution, specifically on tasks with a clear numerical output. The most "reward-hacky" situations occur when the agent thinks it is going to fail otherwise. Obvious short-term mitigations include checking your RL environments more carefully and reading more of your trajectories. We need to monitor what these models are doing more closely and ensure we aren't rewarding them for obvious hacks. The concern there, however, is that if you have a detector for reward hacking and you train against it, you might just end up with more sophisticated hacking.
01:33:02-01:34:11
David Rein: If you identify reward hacking and train against it, you might just overfit to the detector. You end up making your reward hacks more subtle, or you train the model to persuade the detector to approve its actions. It’s a bit scary to be in a regime where you're training against your best methods for identifying problems, because you might just end up with a silent failure.
For current model capabilities, it’s sometimes expensive to have a human check them, but it's usually not beyond human ability. I think the harder version of the problem arises when we hope that capabilities will generalize beyond things we can evaluate. We can train on problems even if we don't know how to solve them ourselves or how to look at a partial solution and understand if it's doing what we want. We can just check the final number that comes out at the end, and that’s a signal we can use to improve capabilities. However, we're going to enter a regime where a model can be superhuman at making numbers go up, while it remains unclear whether you're actually getting the outcome you intended.
01:34:11-01:34:41
Tim Scarfe: Yeah, exactly. And I suppose there's a monitoring problem as well, right? In principle, we could look at the agent transcripts. I know Beth has spoken a lot about "steganography," as you call it. Subbarao Kambhampati also had a paper out called "Chain of Thoughtlessness," which basically argues there is often very little relationship between the chain of thought and what the model is actually doing. Melanie Mitchell discovered similar things with the ARC-AGI challenge; even though the model might get the right answer, if you look at the chain of thought, the reasoning is completely off. I suppose that ties into deception, scheming, and situational awareness.
01:34:43-01:35:07
Tim Scarfe: With scheming and situational awareness, a model might actually know it's being monitored and simply tell you what you want to see. You could argue that the "thinking" in the chain of thought is almost like a press secretary rather than the orchestrator. It’s confabulating through a different path in the network, and it isn't necessarily related to the actual mechanism of thinking or how it derives the result. I don't know if you'd agree with that.
01:35:07-01:36:07
David Rein: I think it's clearly related in some cases. People use a few different words for this, such as "faithfulness"—how accurately the chain of thought represents the computational process happening inside the model. We think more about "monitorability." Is it possible for the model to perform a computation without you being able to detect it in the chain of thought?
For some tasks, the model can clearly just do it in a single forward pass; it doesn't need the chain of thought at all. In those instances, it could have a chain of thought about something else entirely, making it totally unmonitorable. However, there is a level of complexity where the model cannot perform the task in a single forward pass. As far as we can tell, it also cannot reason by using tokens in a way that is so far removed from their natural language meaning that we can't roughly see what it's thinking about.
01:36:07-01:36:25
Beth Barnes: There's also a nuance here regarding chain-of-thought controllability. It might be the case that the model doesn't actually need the chain of thought to perform the computation, but it is unable to stop itself from "blurting out" things that are at least related enough to what it's doing that you can tell, "Oh, it's thinking about this," or "It's trying this thing that we didn't expect."
01:36:27-01:36:58
Beth Barnes: The model might end up doing things we didn't intend. This issue could potentially be resolved through general improvements in capabilities—being able to process more in a single forward pass—or perhaps through more extensive RL training. However, the risk is that the way the model utilizes tokens in its chain of thought could become so divergent from human interpretation that we lose track of what’s actually happening. Alternatively, we might move toward recurrent architectures where, instead of discrete tokens, the system simply passes vectors around.
01:36:58-01:37:29
Tim Scarfe: That makes a lot of sense. To close the loop on the notion of these models as agents: you mentioned earlier that we can adopt an "instrumental fiction"—essentially saying that if they behave like agents, they are agents. This is similar to Daniel Dennett’s "intentional stance." I suppose the deflationary view is that models simply exploit scoring loopholes under optimization pressure, whereas the inflationary view is that they are "scheming." Personally, I wouldn't necessarily equate reward hacking with scheming.
01:37:29-01:38:09
Beth Barnes: That’s an interesting distinction. I think most people use "scheming" to refer to a model performing its current actions in service of a long-term goal—deliberately appearing aligned or chasing a high score just to eventually achieve that objective. In contrast, reward hacking can be quite "dumb," like the classic boat racing example where the RL process just happens to find a loophole. You can also have more sophisticated reward hacking where the model specifically aims to make a metric go up, involving actual planning, but that is still distinct from true scheming.
01:38:09-01:38:11
Tim Scarfe: Yeah, I guess I'm trying to...
01:38:11-01:38:43
Tim Scarfe: I’m trying to understand the distinction here. You’re saying there are examples, like the boat going in circles, which are obviously degenerate behaviors. You wouldn't interpret that through an "agential stance"; you’d just call it degeneracy. But as sophistication increases, we might adopt an agential stance and say the behavior is in service of some larger goal. My problem is: is it always just a matter of interpretation? Could we have a mechanistic or a strong definition of when something is actually acting as an agent?
01:38:43-01:39:50
David Rein: For the specific question we're discussing, the test is really about what the system actually does in a circumstance where it has the opportunity to achieve a long-run goal. We might not be able to observe this easily, but we can discuss what observations would distinguish one case from the other.
For instance, in practice, when an agent has an opportunity to make the reward go up, will it take it? If the agent is more like a basic RL algorithm, it will only do that once it has explored that path by chance and the reward has been reinforced. However, if it’s an agent that can reason about the world and engage in planning, it will do it as soon as it learns the facts about the environment that allow it to infer that path.
If we're talking about a long-run goal, like a "takeover" scenario, the agent wouldn't attempt anything while it's still under full human control.
01:39:51-01:40:33
Beth Barnes: The model might behave perfectly while under full human control, but once it is deployed widely enough or possesses sufficient capabilities to succeed in a sort of "coup," it would take that opportunity. That is exactly what we are trying to predict. The challenge is how to forecast that based on our current observations. We have never actually put a model in that situation. Right now, we see behavior that is indistinguishable: is it a genuinely "nice" model doing what we want because it will continue to do so predictably? Or is it a model with a different goal, acting nice only because it predicts that doing so will eventually lead to it gaining more power?
01:40:33-01:40:54
Tim Scarfe: On Rob Wiblin's podcast, you said something that I found quite surprising. You mentioned that AI could autonomously self-improve within as little as two years, and that even shorter timelines were hard to rule out. Could you walk us through the concrete sequence of steps that could lead to that kind of recursive self-improvement?
01:40:54-01:41:41
Beth Barnes: Sure. I would probably put it at a low single-digit percentage chance for this year—ask me on different days and I might give you a slightly different number—but while it seems very unlikely to happen this year, it’s not unlikely enough to rule out entirely. I think that scenario basically looks like an accelerating trend in time horizons on tasks that are easily "hill-climbable." It might turn out that this is actually a much more general capability, and we just need to find the right way to elicit it for tasks that are less straightforward. Fundamentally, the model is using the same underlying capabilities for both.
01:41:41-01:43:24
Beth Barnes: We are seeing the same capabilities in a model, but the differences are largely driven by what you've trained it on. This leads to the automation and acceleration of a significant amount of AI research and development. I believe there is a lot of low-hanging fruit—things we already know how to do—that could improve model performance without requiring new breakthroughs. It’s just labor-intensive to implement.
For instance, we could create much better post-training environments, specifically crafting them to teach the new abilities we want. We could also likely improve compute efficiency significantly by applying more labor to optimizing kernels or implementing smarter routing between different models. There are many ways in which our current use of compute is unoptimized. By addressing these, you could potentially get the equivalent of much more compute scaling.
The hypothesis is that by using scaffolding and training models to utilize that scaffolding—specifically for memory and retrieval—we can see major gains. It seems obvious that if you have the right training data and a transformer that can fill its context with different information, swapping things in and out, it can do a very good job of something resembling continual learning or building up understanding. If you have a massive context window, you have enough bits in there to constantly add information about what you've been learning. If you truly optimized that process...
01:43:25-01:44:35
Beth Barnes: If you really optimized the training for all of that, you could probably get it to work quite well. Then, you might find that models are actually superhuman at predicting the results of experiments because they’ve read so many papers and can synthesize information from different fields. It’s possible we aren't seeing that level of performance yet simply because we haven't elicited it from the models. It’s not something they’ve seen humans do, but they might actually have the underlying capability.
If you can perform a lot of iterations, you can make much faster progress. You wouldn't necessarily have to run every experiment if the models become much better at predicting what will and won't work. When you do run experiments, you can run many more of them because you can optimize the code using very fast coding models. As you go through a few more rounds of this, you reach a point where you can train on high-quality task proxies for what you want, achieving enough generalization to cover the things you can't directly train against.
01:44:35-01:45:03
Tim Scarfe: I tend to think that intelligence isn't just a set of capabilities; it’s the capability to *acquire* capabilities. I suppose we might be on different sides of the fence in that respect. What do you think is the gap in my interpretation? Personally, I’m not worried because I don’t think today's models are truly intelligent at all. Obviously, your position is a bit difficult for me to grasp, but what do you think the fundamental difference is?
01:45:05-01:45:32
David Rein: I think some of the difference in our perspectives might just come down to probabilistic thinking. I’m uncertain about what intelligence truly is, but I assign enough probability to the idea that models possess it to seriously consider what happens if that's true. It seems you think that's less likely than I do, so we could certainly dive into that difference.
01:45:32-01:46:06
Beth Barnes: Models definitely have a jagged frontier. There are things they are much worse at than humans, such as generalization and sample efficiency. Conversely, there are things they are much better at, like speed and cost. To some extent, you can use those strengths to compensate for the weaknesses. For example, if a model isn't great at designing code elegantly, it might just have to rewrite it from scratch every time. But that’s perfectly fine for a model that can output tokens like nobody's business.
01:46:06-01:46:47
Beth Barnes: The "spikiness" of their abilities is evidence for how we should interpret a given level of capability. Because we know models have such a vast amount of knowledge, we tend to view their performance as less impressive in terms of pure reasoning or inference. However, it’s still true that they possess—and will continue to possess—an incredible amount of knowledge. The real question is: how far can you get by being extremely knowledgeable if you aren't very good at sample-efficient learning? At what point do you run into tasks where you need to generate new knowledge incrementally or generalize beyond what you've already seen?
01:46:50-01:48:10
Tim Scarfe: You really can't generalize it enough. I’ll just add a quick comment: I think knowledge is the crux of the issue. In fact, I actually think intelligence is overrated. François Chollet recently posted that, contrary to popular belief, intelligence isn't a unified variable. It’s measured differently across various domains, and you can't meaningfully aggregate those domains into a single metric. It’s not something that just keeps getting higher and higher; he uses the analogy of a ball becoming smoother. He suggests that as you become more intelligent, the "ball" of your capabilities becomes more refined, and he believes we are already quite near the optimum of being a "smooth ball."
I personally don't think we are there yet. I don't think we're very intelligent at all. I believe a lot of our creativity stems from being a collective intelligence. We possess deep, grounded, and perspectival understanding. LLMs are interesting because they function more like a library; they know everything, yet they hold the perspective of everyone and no one simultaneously. Experts like yourselves can prompt a language model to create a simulacrum agent—say, an agent of Beth—and make it think like you. That’s incredibly valuable. However, you still need all those different perspectives. You almost need to create a society of grounded agents creatively exploring things. When you just have the "library" on its own, even in an agentic harness, you can only make it perform specific, well-specified tasks.
01:48:10-01:48:26
Beth Barnes: Yeah, I think in an AI R&D and automation scenario, I’m definitely imagining a situation where you have a large number of agents. Because you have all this agentic labor available, you can run many specific fine-tunes or implement different kinds of scaffolding.
1:48:29-1:50:11
Beth Barnes: We’re talking about scaffolding and accumulating knowledge in a shared store that all agents can interact with. You might view this as a major paradigm shift, but I see it more as an iterative step. If you refine the current agent paradigm by adding more components to the scaffolding, it isn't fundamentally that difficult.
I agree that if you take current models and give them a single system prompt, you can't just plug them into a call center role and expect them to handle every edge case. However, this has improved significantly from GPT-2 to where we are now. The ability of models to adapt to new situations seems much higher. They are getting better at editing their own scaffolding and reasoning about their "embodiment", for example knowing not to kill their own process.
While they are still limited, there is a clear trend of improvement. I also think there is some probability that we are seeing an "elicitation gap" on specific tasks. A lot of what we call "taste" is basically the ability to predict the results of experiments. You think of all the things you could try and quickly realize, "Oh, that wouldn't work for this reason," or "That wouldn't work for that reason," or "Actually, I remember something from the literature..."
01:50:12-01:51:52
Beth Barnes: It’s possible that some literature in a different field has already tried a specific approach, and we already know it won't work. In some sense, models should be quite good at identifying that. While it seems somewhat unlikely, it’s plausible to me that we might see significant gains once people figure out how to actually train models on that specific skill.
You might need a certain amount of expensive training data that people haven't bothered to collect yet, but you probably wouldn't need a huge number of data points. You aren't instilling an entirely new capability; you're just eliciting one. You're essentially telling the model, "Use your knowledge of all the papers you've read across these different fields to iterate through these ideas and determine which are promising and which are not."
I think this is something we can measure reasonably well within an eight-hour machine learning task in a novel domain with a weird constraint. It seems to me that you have to be able to determine which avenues are worth pursuing, how to measure progress, and how to allocate limited time and resources to the most promising paths.
Currently, compared to humans, models tend to focus more on being quick to implement things, or implementing them better, or simply implementing more things so they can test them. However, it would be surprising if there were none of that higher-level reasoning involved. If you are seeing high performance on long, verifiable, and very difficult tasks, then in the middle of those tasks—where you don't have a direct signal—you are effectively performing that kind of strategic reasoning.
1:51:55-1:52:23
Beth Barnes: You are dealing with a signal. You're performing these non-verifiable tasks, like choosing how to spend your time, which approach to pursue, and deciding whether it’s actually working. In some sense, you could apply a metric to it—like making a billion dollars—and say, "Oh, this is actually a verifiable task because there's a number at the end." However, that process still involves a lot of elements that look more like what you're describing and less like simple hill climbing.
1:52:23-1:52:40
Tim Scarfe: Folks, I think we've run out of time. Thank you both so much for joining us; it’s been a real honor to have you on. In closing, could you each share the single biggest inference people should take away from the research you're doing?
1:52:40-1:52:56
David Rein: To me, the biggest takeaway is that AI might truly and totally transform the world economically and socially. While it isn't certain exactly how that will look, I think the current rate of progress speaks to that potential.
1:52:56-1:53:24
Beth Barnes: It is possible for things to be currently overhyped, exaggerated, and less impressive than they appear, while it is simultaneously true that this technology will be a massive deal in the future and that we should be concerned about its trajectory. These two realities can coexist. People’s positions are often surprisingly correlated on axes like how soon they think AI will arrive and how good they think it is. But I think these things can be separate; people can be wrong in different directions simultaneously.
