A Single Number Doesn’t Make Sense Anymore | Noam Brown | Edited Transcript
A professionally copyedited transcript of Noam Brown and Francois Chaubard on intelligence definitions, AI evaluation, inference compute, priors, academia, open source models, and Libratus.
Chapter Timestamps
00:05 Defining Intelligence and AGI Benchmarks
01:06 Challenges in Measuring AI Intelligence
02:14 AI Progress in Long-Form Tasks
03:28 Evaluating Intelligence: Benchmarks and Metrics
05:47 Limitations of Current Intelligence Evaluations
07:26 Diversity of Intelligence and Evaluation Approaches
08:26 Scaling, Data Diversity, and the Generator-Verifier Gap
10:00 Program Induction, Search, and Human Priors
14:56 Biological Plausibility and Simplicity in AI Algorithms
17:17 Role of Priors and Skill Acquisition in AI
19:15 Contrarian Views and the Importance of Inference Compute
23:00 Train-Time vs Test-Time Recurrence and Future Directions
25:45 The Role and Challenges of Academia in AI Research
36:26 Open Source Models, Personal Growth, and Career Lessons
Noam Brown and Francois Chaubard discuss definitions and measures of intelligence, AI evaluation benchmarks, the role of priors and compute, biological plausibility, academic challenges, open source models, and personal experiences in AI research.
Transcript
00:05-01:06
Francois Chaubard: I think we want to start with a definition of intelligence if we want to talk seriously about AGI or solving AGI, or getting the rest of the distance. Defining “the I” is probably the first step. So, how do you think about defining intelligence?
Noam Brown: I really—I don’t have a great answer here. I mean, I’m sure a lot of people have tried and failed to come up with a really concrete definition of intelligence. I remember, you know, I did a Reddit AMA back at the end of 2017, and somebody back then—even back then—was asking me, “Are you concerned about AGI?” You know, we have bots that play Go, and you just made a bot that plays poker, and it seems like the models are getting pretty intelligent. What do you think they won’t be able to do within the next 10 years? And I was like, to be honest, I’m not super worried about AGI right now. I can say with greater than 90% confidence that an AI will not be able to write a thought-provoking novel in the next 10 years. And if an AI could do that, then I’ll be very afraid of AGI.
Francois Chaubard: Interesting.
Noam Brown: And you know, that was 2017.
Francois Chaubard: Mhm.
Noam Brown: I have about a year and a half left now.
01:06-02:14
Noam Brown: I think it’s not looking great for my prediction. But I would say that’s a good example of intelligence, and I kind of stick to—I’m not going to move my goalpost on this. I think if we could have an AI that actually can write a thought-provoking novel, that’s a great sign of intelligence.
Francois Chaubard: Mhm, yeah. I guess, thought-provoking to whom?
Noam Brown: Yeah, that’s fair. I wouldn’t say this is an all-encompassing definition of intelligence, but I think it’s one of the examples I’ve come up with for where I think we need to go, and we’re not quite there yet.
Francois Chaubard: What do you think is the limiting step there? Why can’t it—why is novel production... We can do next-token prediction and just let it roll out until the end of the book. Is it just a prompt issue?
Noam Brown: I mean, in 2017, it was just unthinkable that you could even string a sentence together that was coherent. So we’ve made a lot of progress, and I think we’re actually getting pretty close. I guess the main challenge with—like, we have AIs that could probably write pretty thought-provoking short stories at this point. I think the main challenge with a novel is really the length of the task that’s involved. To actually write a novel—the novel itself is probably at least 100,000 tokens.
02:14-03:28
Noam Brown: And then you have to think about how long it takes a human to write a novel.
Francois Chaubard: If it’s a good novel, a good thought-provoking novel, they might spend more than a year working on it.
Noam Brown: And so you can look at things like the METR plot. I have my issues with it, but I think in some ways it’s a pretty good eval and a pretty good measurement. How long will it take until the AI can do stuff that takes a human a year to do?
Francois Chaubard: Mhm.
Noam Brown: I don’t think we’re quite there yet for most things, but we’re making progress.
Francois Chaubard: So I guess your rough proxy for an AGI benchmark would be writing a novel. Is there anything in there that normalizes or controls for the number of its prior experiences, or the joules used to produce the novel? Like, all else held equal, if it were equal amounts of thoughtfulness, but it did so with 100x fewer joules or 1,000x less train errors, or no tabula rasa and no training at all, wouldn’t that be more impressive?
Noam Brown: Uh, sure, yeah. I mean, I think there’s definitely ways you can make it more challenging or more impressive. I wasn’t—you know, this is all just kind of absurd to even debate about, like, “Okay, well, how do we make it sufficiently difficult?” Ten years ago, it didn’t really matter.
03:28-04:28
Noam Brown: I guess we’re at a point now where the details do matter a bit. I do actually think it is probably possible now, with the existing models, to scaffold them in a way where it actually will output a thought-provoking novel.
Francois Chaubard: Yeah. So, is that sufficient for AGI?
Noam Brown: My guess is probably not. I think I’d want to see a pretty light scaffold, if anything. But if you had told me in 2017 that, yeah, we’ll be at a point in 2026 where you could probably scaffold together a bunch of LLMs and they’ll be able to actually output a thought-provoking novel, I would have probably said, “Yeah, that’s fine, that’s sufficient for AGI under my definition.”
Francois Chaubard: Well, it’s good that you’re not moving the goalpost, which happens often. Of the commonly used measures of intelligence, the most common would be val perplexity. Then there is intelligence per watt, which came out of Chris Ré’s lab. It’s essentially val perplexity divided by the joules that went into producing those tokens.
04:28-05:47
Francois Chaubard: There is some talk about switching that to everything that happens after the hashtag in GSM8K. So, whatever happens before, I don’t really care. And now it’s the amount of joules to not just produce the next token, but produce the answer. And so then it’s like, well, how efficient were you with your tokens? So there are some measures there. Then there’s the cacophony of benchmarks, where you see the leaderboard of 15 different numbers, and they’ve all been benchmark-hacked. And then there are efforts like ARC Prize, and ARC 1, 2, and 3. What are your thoughts on these as proxies for intelligence?
Noam Brown: I mean, the reality is it’s a very difficult thing to measure intelligence, and this is why I just give one example of a way to measure intelligence. But I don’t think we even know how to measure intelligence in humans.
Francois Chaubard: You know, people talk about the different kinds of intelligence.
Noam Brown: I don’t think—if we’re at a point where we don’t have a way to measure intelligence in humans, it seems pretty hard to have an objective way to measure it in models.
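Editor's sketch (not part of the conversation): the two ideas Francois describes above, scoring only what follows the `####` marker in a GSM8K-style answer and dividing by the energy spent, can be illustrated in a few lines of Python. The function names, the sample strings, and the joule figures are all hypothetical, and this is not the published intelligence-per-watt metric, only a toy in the same spirit.

```python
def extract_final_answer(text):
    """Return the text after the last '####' marker, or None if there is none."""
    if "####" not in text:
        return None
    tail = text.rsplit("####", 1)[1]
    # Normalize whitespace and thousands separators so "1,000" matches "1000".
    return tail.strip().replace(",", "")


def score_per_joule(completions, references, joules_used):
    """Final-answer accuracy divided by total energy (a hypothetical metric)."""
    correct = 0
    for completion, reference in zip(completions, references):
        predicted = extract_final_answer(completion)
        # Everything before the marker is ignored; only the final answer counts.
        if predicted is not None and predicted == extract_final_answer(reference):
            correct += 1
    accuracy = correct / len(references)
    return accuracy / sum(joules_used)


if __name__ == "__main__":
    completions = ["16 + 2 = 18 #### 18", "3 + 4 = 7 #### 7"]  # made-up model outputs
    references = ["... #### 18", "... #### 8"]                  # made-up gold answers
    joules = [50.0, 150.0]                                      # made-up energy readings
    print(score_per_joule(completions, references, joules))
```

A model that reaches the same final answers with shorter reasoning would use fewer joules and score higher here, which is the token-efficiency point made above.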
05:47-07:24
Noam Brown: I think that all of these are valid. I think that there are also challenges with all of them. For example, if you’re using the GSM8K eval, then what happens if the model is specifically fine-tuned on math? Okay, it’ll get really good at this eval, but is it a more intelligent model if you care about things like writing novels? Probably not. Which is the point of Chollet’s measure of intelligence: that we have to control for prior experience, and that’s just a hard thing to do, specifically in humans. How much prior experience do you have?
Francois Chaubard: Yeah, and I guess, is there anything to that that you have applied at OAI, or is there anything to normalizing by prior experience? Like, yeah, I reward-hack, but if I don’t, is it more impressive that it gets equal performance?
Noam Brown: Yeah, I don’t have a super satisfying answer here, because I don’t have a definition of intelligence that I use day-to-day. I think I like that there’s a diversity of evals. I think that looking at the aggregate of these evals is probably the best way to measure intelligence, and I think that there are probably different aspects, and the different evals can capture those different aspects. And...
07:26-08:26
Noam Brown: It might be that there are some kinds of intelligence that you do care about and some kinds that are not as important. Maybe if you’re a mathematician, you really do care about, “Can this model prove my math problems?” And that’s the intelligence that I really care about.
Noam Brown: I think ARC captures another kind of intelligence. Like with ARC-AGI-3, it’s really about being able to adapt to novel environments. I think that’s another perfectly reasonable definition of intelligence that is very useful in some domains, and I think it really depends on what you want these models to do.
Francois Chaubard: Mhm. I’m sure you have at OAI some massive eval set that a new model has to go through before you release it. Maybe that’s static, maybe it is dynamic, maybe it includes human-in-the-loop, I don’t know. I’m not asking you to say that, but what does that entail? Is that a sprawling set of tasks that you’ve developed internally? OAI bench?
Noam Brown: Yeah, I probably can’t say.
Francois Chaubard: Okay. So, the next one: now we’ve talked about trying to define intelligence and trying to quantify intelligence. Now, thinking you have your own (I’m going to call it OAI bench, you can call it whatever you want), what are your most interesting... You know, there are many “isms” that we talked about before. There’s the Ilya-ism, which, to be fair to Ilya Sutskever, he doesn’t necessarily believe anymore, but at NeurIPS back in the day he definitely did, saying, “NTP is enough, all we need is more tokens, bro.” Then there is the “Noam-ism,” as I would call it, which is Ilya-ism plus the gen-verifier gap: since we have but one internet, if we have a bunch of verifiable problems, we can generate and then verify against them, and that creates a lot more data, and now we can do stuff. So, do you agree with that phrasing of “Noam-ism,” or is that...
08:26-10:00
Noam Brown: Yeah, I don’t think it captures the whole story. I mean, obviously we’re trying to distill a lot of stuff down into a single sentence. I think another key part, in my opinion, is inference compute: the fact that a lot of capabilities, especially reasoning capabilities, come from the ability to be more productive by thinking for longer. It is a form of next-token prediction, where you’re predicting a series of tokens.
There are, of course, other ways of scaling inference compute as well. But I would consider that a key ingredient to intelligence as well.
Francois Chaubard: Mhm. And I think that the counter-narrative on just test-time compute is that, well, if I have a contrived example—such as, I’m trying to learn sort, and mankind doesn’t know about merge sort yet, but we know about bubble sort. And so I have infinite traces of unsorted lists and then the procedure called bubble sort, and then the final answer, which is the sorted list. And I can turn on infinite amounts of those. That’s verifiable, right? Gen-verify—I can generate a lot of them. And I will never test-time compute my way to merge sort. Well, I think I would just get a really crappy version of bubble sort. Maybe it’ll be exact, perfect bubble sort, but how would I compress it to a different algorithm? It’s been trained to do bubble sort exactly. The reasoning trace was bubble sort.
Noam Brown: Sure, sure. If all you’ve trained it on in its entire existence is bubble sort, right, and that’s all it knows how to do, then yeah, it probably will not be able to do anything beyond that. But I think that’s why diversity—when doing things like pre-training, a big aspect of it is making sure that you have a diverse set of data for this reason, to make sure that there is sufficient exploration.
10:00-12:00
Francois Chaubard: Right. And so I guess you have to sample—if you’re sampling in token space, where you’re doing transduction, where the theta is the program and it’s actually calling itself and it’s calling the program, versus it’s doing program induction where it’s outputting some program and doing meta-learning on top of that. If it’s the former and it’s doing transduction, then what is the probability of the trace of merge sort, conditioned upon it being trained only on bubble sorts or insertion sort or whatever other...
Noam Brown: I think this is where the generator-verifier gap comes in, because if it’s a non-zero probability—you know, you have a million monkeys on a million typewriters just clanking away, and then one of them discovers merge sort, and it’s pretty easy to tell, “Huh, we just sorted this way faster than we ever did with bubble sort. Maybe we should look at that a little bit more.” It’s pretty easy to verify that you’ve stumbled upon something that’s meaningful, right?
And I feel like, in many ways, this is how all of human civilization has developed over time, right? Like, people—somebody figures out, “Oh, we made fire.” It took a long time. I’m sure a lot of people were just banging rocks for a long time, trying to figure out, “Do we get anything interesting?” Or maybe they weren’t even intending to get anything interesting, and then they noticed, “Oh, now there’s a fire.” It’s pretty easy to verify that something interesting just happened, something important just happened, right? And now that can inform progress.
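Editor's sketch (not part of the conversation): the sorting example can be made concrete. Finding merge sort is hard, but checking that a candidate procedure produced a correct result, and noticing that it needed far fewer comparisons than bubble sort, takes only a linear-time check. The code below is a minimal illustration with hypothetical function names; it makes no claim about how any lab actually verifies model outputs.

```python
import random
from collections import Counter


def bubble_sort(items):
    """Plain bubble sort; returns (sorted list, number of comparisons)."""
    data, comparisons = list(items), 0
    n = len(data)
    for i in range(n):
        for j in range(n - 1 - i):
            comparisons += 1
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
    return data, comparisons


def merge_sort(items):
    """Top-down merge sort; returns (sorted list, number of comparisons)."""
    data = list(items)
    if len(data) <= 1:
        return data, 0
    mid = len(data) // 2
    left, left_count = merge_sort(data[:mid])
    right, right_count = merge_sort(data[mid:])
    merged, comparisons = [], left_count + right_count
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


def verify(candidate, original):
    """Cheap verifier: output is ordered and is a permutation of the input."""
    ordered = all(a <= b for a, b in zip(candidate, candidate[1:]))
    return ordered and Counter(candidate) == Counter(original)


if __name__ == "__main__":
    rng = random.Random(0)
    values = [rng.randint(0, 10_000) for _ in range(200)]
    for name, sorter in [("bubble", bubble_sort), ("merge", merge_sort)]:
        result, cost = sorter(values)
        print(name, "verified:", verify(result, values), "comparisons:", cost)
```

The verifier never needs to know how the candidate was produced, which is the asymmetry Noam is pointing at: generation may be a long search, while recognizing a win is cheap.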
Francois Chaubard: Right. Um, yeah, and I guess the search over program space then is, you know, if it’s random, it’s extremely inefficient and we’re never really going to get to merge sort very quickly with that. But if you have some prior and it does an interesting or clever heuristic search and it has learned to search program space, then I would completely agree with you. But even then, there are ways to—you know, the people that win IMO every year are always—there’s like ten strategies that you learn, or the people that are really good at chess, and there’s like, you know, the pawn is worth one, rook is worth five, and...
Noam Brown: I don’t think it’s quite that simple for IMO or chess.
Francois Chaubard: Yes, okay. I mean, they try these—it’s a good way to get to at least 25, and it’s like you have these ten strategies, you do these things. I agree that it would be a pretty silly thing to get the rest of the distance. And I’m sure Magnus Carlsen would take some pause with, “Oh, all you need to know is that a pawn is worth one and a rook is worth five, and then you can beat me.” No. But you’re kind of bootstrapping off of someone else’s congealed knowledge, and then you’re running this algorithm, and maybe your hippocampus is just stronger than mine and you can do a lot more MCTS, you know, iterate faster and go deeper than me, and you have a better memory or something. There’s this—maybe it’s skill, maybe it’s learned, maybe it’s DNA. But I guess my point is that the heuristic of in-program search is still heavily informed by humans, because we’ve trained it on human programs as the prior to search the program space versus random.
12:00-14:00
Noam Brown: Mhm. I mean, I think that makes a lot of sense. It definitely helps to—you can have a million monkeys on a million typewriters typing away, and eventually you’ll stumble on something useful, but it takes a very, very long time. If you’re able to sharpen that prior so that it’s able to focus on things that are actually promising directions, you stumble upon useful things faster.
Francois Chaubard: Um, before we get off this one, how much do you actually—do you think the current...
14:56-15:22
Francois Chaubard: Do you think the current LLM setup is biologically plausible, and what isn’t? And do you really care?
Noam Brown: What do you mean by biologically plausible?
Francois Chaubard: Yeah, like, is the brain doing backpropagation? Is pre-training basically DNA? All these mappings that people would argue.
Noam Brown: I don’t understand—I mean, I’m not a biologist or neurologist or anything, so it’s a little hard for me to make claims here.
15:24-15:55
Noam Brown: I think the deep neural net aspect—from what I understand, there are some resemblances to the way the brain is structured. I don’t think it’s super important how close it is to the human brain, especially at this point. Clearly, there is something working here.
Francois Chaubard: Mhm.
Noam Brown: I think it is useful in the sense that maybe there are even more useful things that we haven’t discovered, but for that, I’m not the right person to say.
Francois Chaubard: Mhm. Part of your big innovation in your PhD was that leveraging test-time compute with a quite simple program that could run on a CPU was actually meaningfully better than a lot of the on-policy, you know, trained RL algorithms trained on massive amounts of poker games and all that stuff.
15:56-16:33
Francois Chaubard: And François Chollet actually said that, “I think we will end up finding the algorithm that is running the brain, and it will be a quite simple program, and it will likely have very little prior in it.” Do you agree with that, or do you disagree?
Noam Brown: I agree that it would probably be quite simple and elegant. I mean, I think a lot of the algorithms that we use today are quite simple and elegant. I think the idea that it doesn’t have much of a prior—I don’t know if I necessarily agree with that. It’s possible, but I also think there are different ways to interpret it.
16:34-17:16
Noam Brown: When a lot of people hear that, they think about something like AlphaZero. There was a big success with AlphaGo: it trained on a massive amount of human data, did Monte Carlo tree search, did self-play, and it beat top humans in Go. And then the follow-up to that was AlphaZero, which took out the learning from human data (the human data is the prior) and used just rules and compute.
Francois Chaubard: Yeah.
Noam Brown: And it ended up doing a lot better. But when that was attempted in something like StarCraft, that didn’t work out very well.
Francois Chaubard: Yeah. And you think that has to do with action space?
Noam Brown: I think so. So, I think in principle, if you did enough RL with a big enough network and did it for long enough, yeah, you could get away with just no prior. But it turns out that’s not very practical, and so far that prior has been extremely important for things like large language models.
17:17-18:16
Francois Chaubard: Right. Is it possible that one day we move away from the prior completely and just have everything learned from scratch with just RL?
Noam Brown: I think it’s, in theory, plausible. I don’t think it’s likely, and I don’t think it’s the right path to pursue for the foreseeable future.
Francois Chaubard: Why do you say that?
Noam Brown: I think the evidence is pretty clear. People were very focused, for a long time, on this idea of removing the human prior and learning from scratch. For years, this was the dominant paradigm in RL, ever since AlphaZero, from, I would say, 2017 through 2021. And it just became clear that being able to pre-train on large-scale internet data and build a useful prior was extremely effective, and it’s pretty hard to argue with the results. It’s possible that that changes, but the evidence is very strong in favor of the other direction right now.
18:17-19:14
Francois Chaubard: And Chollet would say that you’re just buying arbitrary levels of skill, but the real measure of intelligence is the rate of skill acquisition, the skill acquisition efficiency. So, by boosting up my prior, you’re basically just getting closer and closer to benchmark hacking.
Noam Brown: It’s not clear that building up this useful prior is hurting its ability to then learn online. So it’s not like you’re having a trade-off there necessarily. I mean, maybe you are, but I would want to see some evidence if that’s the case.
Francois Chaubard: Yeah, that’s a fair point. I guess with the continual learning stuff, what Sutton kind of proved out about backprop itself is that the more you train it, no matter what you’re training it on, the more it loses neuroplasticity as training continues. I’ve experienced this too: the further you train, the more dimensions become degenerate. If you sample random directions late in training versus early in training, the late ones don’t impact validation loss at all. So that would be an argument that it’s reduced neuroplasticity. But anyway, I want to make sure we get through the rest of these.
19:15-20:35
Francois Chaubard: So, the contrarian view on AI—I’m sure you’ve—we talked about this a little bit. It’s harder and harder to get. Do you have a contrarian view on AI today?
Noam Brown: Yeah, I don’t have many. I’m a strong believer that the models are going to keep getting better. Intelligence is going to keep improving pretty rapidly. I do think that one thing the field is still underestimating is the significance of inference compute, especially as we go to releasing more capable models. I mean, when we release a new model, we evaluate it with a single number on these different evals. So we say, “Okay, well, here’s GPQA, here’s a single number telling you how smart this model is on this benchmark.” I don’t think this makes sense anymore.
Noam Brown: It made sense with GPT-2, it made sense with GPT-3, and it kind of made sense with GPT-4. Even there, I think it was a little bit iffy.
Noam Brown: I don’t think it’s been true ever since you could prompt for chain-of-thought and get better performance. I think it was clearly not true once we had reasoning models, and now I just think it’s kind of silly that we’re still doing this. But I think people just kind of expect this, and so this is why it’s being done. I think that ARC has actually been quite good about moving away from this pretty quickly and measuring things with an x-axis of inference compute or cost, and I think that’s the right way to measure basically any benchmark, especially a reasoning-heavy one.
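Editor's sketch (not part of the conversation): a toy model shows why one number per benchmark hides information once extra inference helps. Assume, purely for illustration, a binary-answer task where each sample is independently correct with probability `p`, and the final answer is a majority vote over `k` samples. Real reasoning models and benchmarks are not this simple, and the function names are hypothetical, but the shape of the result, accuracy as a curve over inference budget, is the point.

```python
from math import comb


def majority_vote_accuracy(p, k):
    """Probability that a majority of k independent samples is correct.

    Toy assumptions: binary answers, each sample correct with probability p,
    odd k so that ties cannot occur.
    """
    if k % 2 == 0:
        raise ValueError("use an odd k to avoid ties")
    return sum(
        comb(k, i) * p**i * (1 - p) ** (k - i)
        for i in range(k // 2 + 1, k + 1)
    )


def accuracy_curve(p, budgets):
    """Accuracy at each inference budget k (number of samples per question)."""
    return [(k, majority_vote_accuracy(p, k)) for k in budgets]


if __name__ == "__main__":
    for k, acc in accuracy_curve(0.6, [1, 3, 5, 9, 17, 33]):
        print(f"samples={k:>2}  accuracy={acc:.3f}")
```

Under these assumptions, the same model scores 0.600 with one sample and 0.648 with three, so quoting either figure alone says little unless the budget is reported alongside it.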
20:36-22:35
Francois Chaubard: Yeah.
Noam Brown: Now, I think this actually matters a lot when you start looking at things like preparedness frameworks or responsible scaling policies, because there are thresholds to decide, like, “Okay, well, if we release a model, what are its capabilities, and where do the different thresholds lie?” And I think it’s actually pretty important, when you’re evaluating the capability of a model for release and deciding whether it is dangerous or exceeds a level of capability, to ask how much inference you put into that evaluation. And if you’re at a point where intelligence is purely a function of inference, right?
22:35-22:59
Francois Chaubard: Right? Then, um, you can buy arbitrary amounts of intelligence.
Noam Brown: Yes.
Francois Chaubard: And so, like, you should spend infinite money on your eval.
Noam Brown: Yeah. And so this is a real problem because, you know, you could release a model and say, “Well, we cap it at $10 of inference,” or whatever. But if somebody can easily scaffold a bunch of queries together and now they can spend $1,000 of inference, then they effectively have a more intelligent model than what you released, right? What if they scaffold together a million dollars of inference and now they have a much more capable model than what you released, effectively? Is it the same model as what you released? It’s debatable. Clearly, this wasn’t an issue with GPT-2 or GPT-3, because you can’t scaffold together a thousand GPT-3 queries and get something that’s substantially more intelligent than GPT-3. It was maybe slightly the case, but not really.
23:00-23:55
Noam Brown: This is increasingly a problem as the models become more capable.
Francois Chaubard: Yeah, that’s really interesting. I actually almost feel like the current trend, the current zeitgeist, is scaling scaffolding. That’s what Claude Code is doing: they’re injecting more and more scaffolding. And as the models are getting better, you can actually get away with less system prompt, and the system prompts will get smaller. You need less scaffolding and those kinds of things. But I actually want to go in another direction. I wrote this whole blog post about how the next scaling law will be train-time recurrence. You have test-time recurrence, meaning more test-time compute. If you take the argument that we’re trying to build a Turing machine, or a Turing-complete architecture, to be more pedantic, one of the requirements for being a Turing machine is unbounded recurrence. At test time, they do have unbounded recurrence, as you’re saying; they don’t at train time. It’s one forward pass and it’s teacher-forced. So you have exactly one forward pass, then you have to basically match whatever they’re saying, and maybe this is broken when you let it go, and then you have some validation, and there is some distilling back into it from the latent space that was discovered at test time if you’re doing this kind of recursive thing. But we’ve actually seen the most progress from a per-parameter standpoint on ARC 2 with train-time recurrence. It was HRM and TRM, and showing that this outer refinement loop was actually the main contributor there. What are your thoughts on train-time recurrence in the future?
23:56-24:58
Noam Brown: Well, I think the idea of spending more compute during training time—I mean, there’s a distinction here between things like pre-training versus post-training. Obviously, for things like post-training, there is a decent amount of compute spent. But it sounds like you’re talking more about pre-training here.
Francois Chaubard: Yeah. And I think you’re already baking so much human bias into it, right? I would say, versus letting it ponder and think before you snap it back by teacher-forcing it to be exactly my next token.
Noam Brown: Yeah, I think it’s a great direction to pursue. I think just because we have something that works today doesn’t mean that it’s the best way for it to be done, and certainly not the only way for it to be done. I think it’s valuable that different research directions are being pursued and investigated, because it’s entirely possible, if anything likely, that we will come up with something that is more effective than what we’re using today.
24:59-25:44
Francois Chaubard: What particularly are you most bullish will go away in the current stack?
Noam Brown: I mean, I’m not a pre-training person, so I’m probably not the right person to ask. And also, if I were, I probably wouldn’t be able to say anyway.
Francois Chaubard: Okay. Okay. Maybe this is the right next segue. I started in Fei-Fei’s lab in 2012 till 2014, and then now I’m back, and given the amount of air that LMs have sucked out of the room, the diaspora of different research ideas that were being pursued is just not really happening anymore. And now with this auto-researcher, I have like four different auto-researchers going on right now that are writing my old NeurIPS paper and things like that we talked about. What do you think the role of academia is? The role of conferences, the importance of submitting, going through the reviewers, whether it’s their agent or it’s a high schooler with their agent, even worse. What are your thoughts on the future of academia, the future of conferences, the importance of getting PhDs?
25:45-27:00
Noam Brown: Yeah, it’s a good question. I’ve been asked this question a lot over the past couple years, and I do have some thoughts on this. I mean, it’s not as hopeless as some researchers might say. I think there is actually valuable stuff to be done in academia. I’ll first talk about the problems. One is that a lot of the most impressive AI capabilities have been a consequence of scale, and that is a problem in academia because there just aren’t as many GPUs in academia, and that’s just a reality. I think it could be a bit different. Theoretical physics is a good example, where there is the Large Hadron Collider, and that’s not a private enterprise as far as I know. And so universities could get large amounts of compute and use that to do academic research.
27:01-28:10
Noam Brown: If I was in charge of a university, I would spend a billion dollars to get a large computing cluster. And honestly, if you are a university that’s not in the top 10 for computer science or AI and you want to quickly get into the top 10 or even be number one, you spend a lot of capital to get GPUs, and then you go to every star professor, every star student, and say, “Hey, we have more GPUs per researcher than any other place on Earth, any other academic institution on Earth.”
Francois Chaubard: Yeah.
Noam Brown: You’re going to get a lot of good talent very quickly.
Francois Chaubard: Yeah. My most famous tweet I’ve ever posted was after NeurIPS this year. We had a dinner with a bunch of professors and we just asked them about what’s compute like at your school, and I just went home and grabbed, you know, number of H100 or equivalent, weighted by dollar or whatever, divided by number of CS students, and then just posted the chart. It’s meaningfully below one for everyone except—and then I weighted it a little bit more for, do undergrads really need one H100? And it’s basically like, besides MIT and Harvard, everyone is below one.
28:11-29:57
Noam Brown: It’s pretty rough. Yeah, it’s pretty rough. And I’ve talked to faculty and I would casually mention just roughly how many GPUs I have access to whenever I want to spin up a job, and their head just explodes. I think a lot of people in academia do not understand how big the difference is right now between academia and industry.
Francois Chaubard: I completely agree. I’m funding my own, unfortunately, but shout out to University of Florida. No one knows, but they have this whole thing called Gator something, or Gator Cloud, or something like that. It’s a huge amount of GPUs. And UT Austin.
30:01-32:21
Francois Chaubard: Huge amount of GPUs at UT Austin. Huge amounts of compute. So, if you’re a PhD student, I would actually look to go there—one of those two, to be honest. And Stanford is actually really bad, despite a tens-of-billions-dollar endowment. That’s something I’ve talked to our professors about a lot.
Noam Brown: I do think it’s an important question for students to be asking when they’re deciding where to go: how much compute am I going to have access to? And probably do not just ask the professors. Ask the students how many GPUs they have access to, because you’ll get a more honest answer that way.
Francois Chaubard: I’m sure recruits ask you, if I’m coming to work at OpenAI, how much compute am I getting? Isn’t that an important factor in choosing where to work?
Noam Brown: I think so. Basically, you want to know how much compute the company has for research as a whole. I don’t get asked that very often because I think people are pretty confident that OpenAI has a ton of compute, which is the correct assessment. But it is a question that people probably should be asking.
Francois Chaubard: What do you think the solution is for academia to get the compute? Because clearly, we’re not getting particle-collider levels of money into data centers. I mean, we waited for Marlowe at Stanford, and when it was finally released, it was hundreds of GPUs for all of Stanford CS.
Noam Brown: I don’t have a great answer here. One possibility is that universities pool together and work together to do larger-scale experiments. I think it would be really cool to see an academic, purely open-source pre-training effort that is potentially competitive. The challenge is cultural. AI research right now is very much two-author papers, maybe five-author papers. If you’re going to do a large-scale project like this, how do you make sure that everybody is properly attributed for their contribution? That’s a very difficult cultural adjustment to make.
And this isn’t the only way to have an impact in academia. I want to be clear. If you want to do large-scale experiments, yes, compute is a real challenge. But there are things you can do without large amounts of compute. Evals are the one I consistently point to. At places like OpenAI, we get a lot of value from third-party evals. You used some compute to make ARC-AGI-3, but it is probably not a huge amount.
32:24-34:14
Francois Chaubard: It is not like pre-training a frontier model. I guess I’ll ask you: do you think something like ARC-AGI-3 could have been made in an academic institution?
Noam Brown: One, yes. Two and three, no. Two and three require a lot more attention to detail, money, and scaling. Also, no one really gets a PhD by writing maintenance code. It is hard to say, “I’m just going to write the next version of this.”
Francois Chaubard: I think a lot of people underestimate the impact of evals. Dan Hendrycks is a great example, where he really made a name for himself by making high-quality, useful evals. That is a viable way to have impact.
Noam Brown: Yeah. Out of our lab came LegalBench, for example, and there are a couple of other ones like that.
Francois Chaubard: It is hard. I think ARC-AGI is one of the highest-quality benchmarks, and a single PhD student, or even a group of them with funding, would still struggle to do what ARC Prize did. It was a remarkable amount of effort, talent, intelligence, and money that François Chollet and the team put into it, and I do not think it could have been done in academia.
Noam Brown: In hindsight, maybe you can say, “Oh yeah, we can make a bunch of games and make sure that they’re orthogonal in some way.” But academia is not rewarding that type of concerted effort.
Francois Chaubard: I agree. Academia does not reward this kind of effort, and hopefully that will change.
Noam Brown: But certainly there are great evals that come out of academia, and evals are something I pay attention to.
34:14-36:26
Francois Chaubard: There was another question you had about the role of conferences, and I think that is going to be very interesting to see evolve. The reality is that a lot of papers are currently being written by AI, and a lot of papers are currently being reviewed by AI.
Noam Brown: I do not necessarily see this as a bad thing. I’ve been talking to academics about this, and they basically say, “Look, the AI reviews are not that great, but they might actually be better than the average human review right now.”
Francois Chaubard: Completely agree. It is like actor-critic en masse, the biggest actor-critic thing happening in science. There are people in the loop making sure everything is copacetic.
Noam Brown: The variance in reviewers was so bad before that it is probably a net positive.
Francois Chaubard: Yeah. But I think you want to pair them with humans. It is actually a pretty good idea to have an AI review every paper, point out if there is some fatal flaw, and have that verified by a human. I think we will see this trend. Six months ago, the AI reviews were not that good, but the AIs are progressing rapidly. I would expect the latest models to do a very good job of reviewing a paper, especially if you point them to the references and they can do literature reviews. If they are not there currently, I would say by the end of the year they will be doing a great job. It is standard practice now to submit a paper to all the big AI models and get feedback. It is like an automatic review. You might as well do it to squeeze out little issues in the paper.
Noam Brown: I agree. And to be honest, I do not think it is going to be that much longer until the AIs are just writing the papers end to end.
Francois Chaubard: Yeah, they’re probably already doing pretty decent chunks.
Noam Brown: I do not see a barrier to them writing the papers end to end, and then also doing all the experiments. It is going to take a while. It is a lot of work. Like I said, how long does it take a human to write a good conference paper? Probably a couple of months at least.
Francois Chaubard: But I think we’ll get there. I pay for Overleaf credits. So there are two more questions.
36:26-37:33
Francois Chaubard: We talked about this O-shaped hole in the planet. When OpenAI stopped releasing weights and open-sourcing a bunch, there was some glimmer of getting back to that. What are your thoughts on DeepSeek, Qwen 3, and Chinese models in general being the de facto open-source weights for the world?
Noam Brown: This is not really my area of expertise or my call to make, so I probably do not have too much to add. But for what it’s worth, I do think there is a lot of value in open source. It is important that we have good open-source models out there.
Francois Chaubard: And I hope that hole is filled, either by us or by somebody else.
Noam Brown: I agree.
Francois Chaubard: Last question. You have gone through a tremendous amount of growth over the last decade and you’ve seen so much. Outside of AI, what has been the hardest period, or the period of most self-growth, that you’ve had to go through in the past decade?
37:44-40:05
Noam Brown: It was not outside of AI, but I would say the Libratus competition that I did in 2017 was a pretty pivotal moment in my life. I was a pretty unknown grad student at the time, and it was very clear to me that if I was successful in this competition, it would be a big deal and a big achievement. I also knew that it would take a ton of hard work. Basically, in early 2016, it became very clear to me what the path to success was going to be, and I had to spend a year executing on it super hard.
That year, I put a lot of hours in, basically working nonstop to make a really strong poker bot. It is really stressful to have your entire career come down to a poker game, because you do not know if your bot is any good. Poker is a super high-variance game. It is hard to tell just by looking at the hands whether it is doing a good job or a bad job.
We did not have the budget to hire a bunch of human poker players to stress test it. We played against previous bots, and we had some sense that it was doing well against them, but the variance was so high that we did not even have a good sense of the win rate. The real challenge was that humans are very adaptive. Once we played this against actual humans, would they adapt to the bot’s weaknesses? Would they find holes?
That was a very stressful period, and fortunately it ended up being successful.
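The point about variance can be made concrete with a rough calculation: the uncertainty in an estimated win rate shrinks only with the square root of the number of hands played. The sketch below is in Python and assumes independent hands and a normal approximation. The win rate, per-hand standard deviation, and hand count are hypothetical placeholders, not figures from the Libratus match or its testing.

```python
import math


def win_rate_interval(mean_per_hand, std_per_hand, num_hands, z=1.96):
    """Approximate 95% confidence interval for the per-hand win rate.

    Assumes hands are independent and uses a normal approximation.
    """
    stderr = std_per_hand / math.sqrt(num_hands)
    return mean_per_hand - z * stderr, mean_per_hand + z * stderr


def hands_needed(std_per_hand, margin, z=1.96):
    """Hands required for the interval half-width to shrink to `margin`."""
    return math.ceil((z * std_per_hand / margin) ** 2)


# Hypothetical numbers, in big blinds per hand.
observed_win_rate = 0.05   # assumed average winnings per hand
per_hand_std = 10.0        # assumed standard deviation per hand
hands_played = 10_000      # assumed sample size

low, high = win_rate_interval(observed_win_rate, per_hand_std, hands_played)
print(f"95% interval: {low:.3f} to {high:.3f} big blinds per hand")
print("hands needed for a +/-0.05 margin:", hands_needed(per_hand_std, 0.05))
```

Under these assumed numbers, the interval spans zero even after ten thousand hands, so the sample cannot distinguish a winning bot from a losing one.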
Francois Chaubard: How did you celebrate when you won?
Noam Brown: It took me probably a few weeks to calm down and realize, “Oh, it’s actually over. I can relax now.”
Francois Chaubard: No celebration? Didn’t go to Aruba or anything?
Noam Brown: I do not think I actually celebrated. I also had trouble believing it until the very end, that we actually were not going to suddenly lose. But somebody told me afterwards that I could have put in 90% of the effort and gotten 0% of the reward, and I thought that was a good life lesson. Sometimes you just have to work really hard to get something done.
Francois Chaubard: Awesome. Well, thank you so much. This was amazing.
Noam Brown: Yeah, thanks for having me.
Francois Chaubard: All right, thanks.