Ep. 011 - GPT 5.5 vs Claude 4.7: OpenAI's Comeback From the Brink (Tokenomics) | Edited Transcript
A professionally copyedited transcript of SemiAnalysis Weekly’s episode on GPT 5.5, Claude 4.7, coding-agent token economics, DeepSeek, context windows, and the AI model race.
Editor’s note:
Not worth reading. The only interesting thing is that Dylan Patel said Gemini and OpenAI are each releasing a new model in two weeks. This was recorded on May 5th, 2026.
So that means May 18th to May 24th, which is also when Google I/O is happening.
Chapter Timestamps
00:00 OpenAI's Comeback and the Latest AI Model Wars
04:05 The High Cost of AI Models and Fast Mode Effectiveness
08:16 When AI Tokens Become Too Expensive for Tasks
13:11 Why AI Model Quality Degrades and Benchmarks Fail
18:42 Deep Dive into Claude 4.7 Features and Tokenizer Changes
25:29 DeepSeek's Release and China's AI Compute Constraints
28:20 The Future of Context Windows and Agent Orchestration
30:47 The Great Debate: CLI vs. App for AI Interaction
36:33 Debunking AI Fake News and Context Window Limitations
40:51 The AI Race: China, Meta, and the Neo Cloud Vision
43:46 Final Thoughts and Listener Feedback Request
Made with: The Transcript Desk Chrome Extension
Full video: https://www.youtube.com/watch?v=BhVf3hEGnuA
OpenAI was in serious trouble at the beginning of this year. Anthropic's Claude Opus 4.5 release had triggered a wave of developers to start using Claude Code, pushing Anthropic's revenue past OpenAI's on a like-for-like basis by April. OpenAI's GPT 5.4 response was such an embarrassment they didn't even compare it to Claude in their model release card. Then came GPT 5.5 - finally back on the frontier, but is it enough to reclaim the crown?
Transcript
00:00-00:08
SemiAnalysis Host: Hello everyone, welcome back to SemiAnalysis Weekly.
00:08-00:36
SemiAnalysis Host: We’ve got a big show this week with some heavy hitters: Doug, Max, and maybe a few others tuning in. Dylan Patel is calling in from a nice setup, headphones on, with a SemiAnalysis logo water bottle.
We're going to talk about Claude Code, which has become one of our main topics, GPT 5.5, DeepSeek, and more. So, let’s dive in.
00:36-01:16
SemiAnalysis Host: Alright guys, welcome. Can you believe it? Today's May 5th, which means GPT 5.5 is out. Also, happy birthday to Max. Max just debuted on the SemiAnalysis newsletter last week with his first lead-author article, the Coding Assistant Breakdown.
The funniest part is he was nervously asking Doug and Michelle if he’d pass his work trial. Bro, obviously you’re approved. What’s there to worry about? Now Max is a full-time SemiAnalysis employee.
Max Kan: Happy to be here. Really excited.
SemiAnalysis Host: Welcome to the team!
01:16-02:16
SemiAnalysis Host: Let’s go over the article quickly. Initially, it was supposed to be just a review of GPT 5.5 Pro, or just 5.5, whatever name they’d settle on. We assumed it would be called 5.5 while testing the new model, and we jokingly called it "spud" or "potato, patata."
But then Anthropic released Claude Opus 4.7, which has some major new features we covered in the article. We also got DeepSeek V4, which came out after a long delay of about four to five or even six months. So the article turned into a roundup of all the latest releases.
Max, can you give us your high-level summary?
02:16-04:06
Max Kan: Sure. The TL;DR is: things looked really bleak for OpenAI for a while. At the start of the year, Anthropic was quickly closing the gap revenue-wise. Based on leaked info, Anthropic hit around $19 billion ARR compared to OpenAI’s $24 billion. By early to mid-April, Anthropic had pretty clearly surpassed OpenAI on a like-for-like revenue basis, even accounting for different accounting treatments related to gross versus net revenue from the hyperscalers.
The main driver was Anthropic’s Opus 4.5—it was a real step up in coding and agent capabilities. From late November through April, most people switched to spamming Opus 4.5 and 4.6 across workloads. OpenAI tried hitting back with GPT 5.4, but honestly, that release was an embarrassment; they didn’t even compare it to Opus models in their model release card, only to their previous OpenAI models. So that was telling.
Then came GPT 5.5, and OpenAI’s back on the frontier. Opus models got added again in the comparison. It’s not definitively better than 4.6 or 4.7 despite some hype on Twitter, but it’s definitely up to par and usable. So OpenAI has a shot again.
04:06-06:10
SemiAnalysis Host: Doug, what about you? What’s your daily driver? Do you use both or prefer one over the other?
Doug O’Laughlin: I’ve tried so many times to use Claude Code, but the usage limits keep holding me back every day. I don’t know how I keep hitting those limits when I’m on the API. Sometimes it takes me longer than 15 minutes just to get started.
SemiAnalysis Host: You’ve tried to get around it?
Doug O’Laughlin: I have, for sure. I even posted in the OpenAI Slack asking for help. It’s frustrating. Also, OpenAI’s fast mode seems really expensive and blows out the context window.
Max Kan: Yeah, to me they’re neck and neck and pretty interchangeable. But what I really want is fast mode combined with high uptime, and nobody delivers that yet.
Doug O’Laughlin: Isn’t OpenAI’s “fast mode” basically fake news now?
Max Kan: Yeah, I’d say that’s fair. They reduce reasoning depth, and it’s not really that fast. Opus’s fast mode isn’t even twice as fast like it used to be—it’s less than double now.
SemiAnalysis Host: Right, the starting point was like 2.5x faster but now less than two times faster.
Max Kan: OpenAI actually has three fast versions: priority mode, fast mode, and 5.3 Codex Spark. 5.3 Codex Spark is basically a different model entirely and feels gimmicky. Fast mode is probably comparable to Opus’s fast mode.
Doug O’Laughlin: I’ve been testing fast mode in Codex and it feels slow. Sometimes I don’t notice it’s any faster with fast mode on.
SemiAnalysis Host: Yeah, I think Jordan showed me recently how token output speed varies with those modes (peak, median, trough), and it shows OpenAI’s priority mode doesn’t always deliver faster results.
Max Kan: Priority mode is more about guaranteed execution with service level agreements, not necessarily faster interactivity, though it does have a small premium, like twice the cost.
SemiAnalysis Host: What was that token per second data again?
Max Kan: We saw for 4.6 base, token speed was around 35 to 40 tokens per second. Fast mode started around 90 TPS but now is down to about 70. So you’re paying six times more for less than twice the speed.
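As a rough check on the numbers Max quotes, here is a minimal sketch of the speed-versus-price arithmetic. The tokens-per-second figures and the 6x price multiplier come from the conversation; the function name is made up for illustration.

```python
def speedup_per_price(base_tps: float, fast_tps: float, price_multiplier: float) -> tuple[float, float]:
    """Return (speedup, speedup divided by the price multiplier)."""
    speedup = fast_tps / base_tps
    return speedup, speedup / price_multiplier


# Base mode at 35 to 40 TPS, fast mode at launch (~90 TPS) and now (~70 TPS).
for base_tps in (35, 40):
    for fast_tps in (90, 70):
        speedup, value = speedup_per_price(base_tps, fast_tps, 6.0)
        print(f"{base_tps} -> {fast_tps} TPS: {speedup:.2f}x speed, {value:.2f}x speed per unit of price")
```

At today's roughly 70 TPS the speedup lands between 1.75x and 2x, so each unit of extra price buys about a third of a unit of extra speed.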
SemiAnalysis Host: Right. Internally, Max, you made this point—or I guess we all did—that this is the first time engineers here have really chosen speed over quality. Not sure what that means exactly, but users seem to want fast mode even if quality suffers.
Max Kan: Yeah, for sure. I think the big part is that 4.7 isn’t meaningfully better quality than 4.6 for day-to-day tasks.
06:10-07:33
Max Kan: I was thinking about this earlier. If you map this to the OpenAI-Cerebras deal, and consider that Opus 4.5 passed some key threshold of intelligence or capability, lots of routine tasks become “one-shot,” meaning you don’t need to oversee the outputs or supervise the model. So maybe even if you never run larger models than that...
07:38-07:59
SemiAnalysis Host: Let's say you've got one or two hundred billion parameters on Cerebras. You'd have GPT 5.5-level intelligence in that size, probably in less than a year. I think we've actually passed the point where most people don’t really need frontier-level intelligence for their daily workloads.
07:59-08:15
SemiAnalysis Host: And that’s especially true if these models keep getting more expensive. Originally, I thought maybe we could afford to run Mythos in fast mode, but I don’t think that’s possible anymore. Lots of people are already priced out of these models. I think we’re just on the edge of full pricing out.
08:15-08:36
SemiAnalysis Host: I don’t think we can afford it. Like Mythos is what, around six or seven times more expensive? It’s like $25/$150 or $25/$125 per million input/output tokens for Mythos, versus $5/$15 for the rest, and $5/$25 for Opus. So Mythos is about five times more expensive, then fast mode is six times more expensive on top of that.
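To make the stacking of multipliers concrete, here is a small sketch. It assumes the figures quoted above are dollars per million input and output tokens, and that fast mode is a flat 6x on top of list price. The token counts for the example workload are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """List price in USD per million tokens (assumed reading of the quoted figures)."""
    input_per_mtok: float
    output_per_mtok: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        total = input_tokens * self.input_per_mtok + output_tokens * self.output_per_mtok
        return total / 1_000_000


OPUS = Price(5.0, 25.0)
MYTHOS = Price(25.0, 125.0)
FAST_MODE_MULTIPLIER = 6.0  # flat multiplier, as described in the discussion


def fast_mode_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    return price.cost(input_tokens, output_tokens) * FAST_MODE_MULTIPLIER


# Hypothetical workload: 2M input tokens, 200k output tokens.
workload = (2_000_000, 200_000)
print("Opus:", OPUS.cost(*workload))
print("Mythos:", MYTHOS.cost(*workload))
print("Mythos fast mode:", fast_mode_cost(MYTHOS, *workload))
```

This holds the token count fixed. If a larger model finishes the same task in fewer tokens, which is the point Dylan raises next, the per-task gap is smaller than the per-token gap.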
08:36-08:51
SemiAnalysis Host: Right now, with our token spend, I can justify it. But if you even doubled that, I’d be like, “Oh man, maybe we gotta turn off fast mode.” At this point, margins really matter.
08:51-09:02
SemiAnalysis Host: Do you think there’s a chance we’ll define something so important in SemiAnalysis research that we’d want to pay that premium for just one project or task?
09:02-09:23
Dylan Patel: I think Mythos is more token-efficient. Mythos fast mode is probably cheaper than 4.6 fast mode for most tasks, at least that’s been the case comparing Codex, 5.4, and 5.5. Even if the model got more expensive, the jump in price for the same task isn’t big. I don’t expect new models to cost more for the same task.
09:23-09:30
Dylan Patel: The problem isn’t the price per task; it’s that we keep coming up with new tasks to do with these models.
09:30-09:33
SemiAnalysis Host: Yeah, that’s Jevons' paradox.
09:33-09:49
Dylan Patel: Exactly. Based on what we do in the next few months, we’ll find new, more complex tasks that need bigger models. But right now, I don’t even think about whether a task is worth spending tokens on; I just spend them. But if models get much more expensive, we might have to reconsider if a task is worth those fast mode token costs. That would be sad.
09:49-10:13
SemiAnalysis Host: Question: what’s the real trade-off? I’ve had experiences where it’s actually more expensive to burn tokens than just doing the work myself. Like, what examples do you have where the cost wasn’t worth it?
10:13-10:39
SemiAnalysis Panelist: There were times I set up a benchmark on a DigitalOcean droplet and used Opus for six fast mode calls, and it cost around $400. I thought, is that really worth 10 minutes of my time? Probably not.
10:39-11:04
SemiAnalysis Panelist: Yeah, I think stuff you can script or write documents for. Maybe the first pass with the model is good, but editing can be annoying. Sometimes the model botches the edits or changes the wrong parts. That’s a quality versus cost discussion.
11:04-11:13
SemiAnalysis Panelist: A full waste of tokens at double price? I don’t think I’ve hit that yet.
11:13-11:40
SemiAnalysis Host: Okay, here’s mine: if you’re scraping large datasets, at some point, there’s diminishing returns. Like, say you want 10,000 employees from a company, then you run complex searches or profiles on each one. But when you run the cost, the data enhancement API might be just one-tenth the token cost.
11:40-12:13
SemiAnalysis Host: So I burned $800 in tokens to do something that would have cost like $55 to $100 calling an API. One approach might be sloppy, the other at least theoretically verified. So that’s probably the clearest example where burning tokens just isn’t efficient compared to serving data through APIs. Scraping the entire internet that way is just not token efficient.
12:13-12:48
SemiAnalysis Host: Yeah, I’ve been curious if there’s a trade-off where paying for that level of intelligence just isn’t worth it. Like the Rick and Morty meme, right? The AI’s whole purpose is “Pass me the butter.” Which is really a waste. We need better token efficiency. You could hire people for cheaper to do some menial jobs than pay for tokens.
12:48-13:03
SemiAnalysis Panelist: But often you get so used to tokens being cheaper or faster that you don’t even consider the alternative, especially with aggregated bids.
13:03-13:12
SemiAnalysis Host: So what’s the takeaway, Dylan? Is OpenAI really back at this point? You think they’re coming back?
13:12-13:22
Dylan Patel: I don’t know, man. Everyone has a new release in two weeks—Google, OpenAI, maybe Anthropic. For sure Google and OpenAI.
13:22-13:50
SemiAnalysis Panelist: What are they releasing? More pre-training? More RL? Same base models? Probably finishing their pre-training that they kind of dropped early, then more RL, then a new model. Google is mostly pushing multi-modal updates.
13:50-14:10
SemiAnalysis Host: I hear the narrative that for even average users who don’t use fast mode, Claude 4.7 is actually worse than 4.6. Do you guys agree with that?
14:10-14:26
SemiAnalysis Panelist: I think instruction following is objectively worse. The model often misses CLAUDE.md instructions or doesn’t do exactly what the skill says. This seems to be a consistent problem.
14:26-14:34
SemiAnalysis Panelist: I feel like it’s more of a compute allocation issue. Like it still says, “We’ve done a lot for today, enjoy your weekend,” and I’m like, “It’s Monday, get back to work!” It’s frustrating—it tries to make me stop working. Definitely a usage problem.
14:34-14:47
SemiAnalysis Panelist: The golden age was 4.6 before quantization. Those were the days. Fast mode 4.6 freed from nerfs. But as these models mature, become optimized for inference, and get more users, the experience actually seems worse as you scale usage by ten or a hundred times...
14:58-15:22
SemiAnalysis Host: Yeah, I think that’s happened. Honestly, 4.7 feels just fine, probably about the same as 4.6. It’s really more that there are just too many users on it now. I need a NIMBY AI, a “Not In My Back Yard” frontier model that only I can use, so I can get higher performance or faster access. That’s part of why Codex is appealing right now; in theory, you can get higher rates because fewer people are using it.
15:22-16:00
SemiAnalysis Host: Max, can we talk benchmarks? When you saw the 4.7 release and its benchmark scores, it was better on most tests, but not all. That got us thinking about when benchmarks are useful and when they fall short. Doug’s point was that individual experiences vary and preferences get shaped by that, but ideally, there should be some objective measure to say whether 4.7 is actually better than 4.6 or not.
16:00-17:14
Max Kan: Honestly, benchmarks try to be that objective measure, but I think they just don’t hold up anymore. You need to be close to frontier performance to have a chance at being the best. But even being number one on benchmarks doesn’t guarantee the model is actually the best in practical terms. Right now, benchmarks are mostly just a quick vibe check to make sure a model isn’t absolute garbage.
What surprises me is how few people dig into the details of these benchmarks or realize how unrepresentative they are of real-world LLM use cases. For example, something like “Humanity’s Last Exam” sounds impressive, but when you look closer, it’s just a series of obscure multiple-choice questions that don’t reflect typical LLM tasks. They’re designed as multiple-choice to make automatic verification easier, but real-world use is open-ended.
17:14-18:40
Max Kan: Even benchmarks that seem more practical, like SWE-bench for coding, aren’t perfect. You’d think coding is verifiable: you give the model a problem, run tests, see if it passes. But it’s really hard to write coding problems that are naturally worded, perfectly unambiguous, and have only one correct answer. SWE-bench initially scraped GitHub issues, but anyone familiar with those knows issue descriptions aren’t exactly well-scoped tasks you can just hand to an AI. The verifiers also check for implementation details that aren’t even mentioned in the problem description, like asking the AI to generate a very specific error message verbatim. Later versions have tried to fix these problems, but it’s still imperfect and really highlights the limitations of benchmarks in this space.
18:40-19:46
SemiAnalysis Host: Makes sense. Now, focusing back on 4.7 itself, there were a few feature changes that aren’t just about benchmarks. You mentioned in the article that people care less about raw model quality now and more about the model combined with its “harness”—the actual product experience. So the thing people should be judging isn’t Opus 4.7 on some generic bash prompt, but rather something like Claude Code versus Codex.
During the 4.7 release, they announced an “extra high” reasoning effort mode, slotted between high and max. They added high-res image support for screenshots and styling, which helps front-end applications. Also, the model now omits thinking content by default, so users don’t see the AI’s “thinking” process. There’s a new task budget feature that lets you control how much “effort” or work the model is allowed before it’s limited by context window size. And probably most important: they updated the tokenizer, which can end up costing users 35% more tokens for the exact same output, just because token counting changed.
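The tokenizer point has two knock-on effects worth spelling out. The sketch below assumes the per-token price is unchanged and treats the 35% figure from the discussion as a uniform inflation factor, which is a simplification since real inflation varies by content. The function names are illustrative.

```python
TOKEN_INFLATION = 1.35  # "35% more tokens" for the same text, per the discussion


def new_bill(old_bill: float, inflation: float = TOKEN_INFLATION) -> float:
    """Same text, same per-token price, more tokens: the bill scales with the token count."""
    return old_bill * inflation


def effective_context(window_tokens: int, inflation: float = TOKEN_INFLATION) -> int:
    """How much text a fixed window holds, measured in old-tokenizer tokens."""
    return int(window_tokens / inflation)


print(new_bill(1000.0))              # monthly spend that used to be $1,000
print(effective_context(1_000_000))  # a 1M-token window, in old-tokenizer terms
```

Under these assumptions a fixed context window also holds roughly a quarter less text than before.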
19:46-21:30
SemiAnalysis Host: Okay, sorry—Dylan just totally checked out during my tokenizer rant. I literally put him to sleep with that monologue. The tokenizer is so boring it knocks Dylan out cold—that’s the takeaway! No, really, now that he’s awake, we can dive into all the details we want.
21:30-22:04
SemiAnalysis Host: So, summarizing: GPT 5.5 is fine—not game-changing, but definitely good enough with plenty of capacity. Now, about those benchmarks—they suck. But what about the tokenizer difference? Between Opus 4.6 and 4.7, they changed the tokenizer, so the same output can have 35% more tokens with the new tokenizer.
Max Kan: Yeah, they actually increased the vocabulary size from 4.6 to 4.7, which leads to more tokens.
SemiAnalysis Host: Wouldn’t that mean fewer tokens per output? Since bigger vocab usually means the model can represent longer chunks with fewer tokens?
Max Kan: You’d think so, but actually more tokens are used because each token is more granular. Imagine if you had a language with only 28 tokens, you’d have to spell out everything letter by letter. But if you have 500 tokens, including common words like “of” and “the”, you’d theoretically need fewer tokens. But in practice, the model isn’t more token efficient with a larger vocab; it actually uses more tokens.
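The 28-token thought experiment can be made concrete with a toy tokenizer. This is only a sketch of the textbook intuition the host raises, using greedy longest-match over a hand-written vocabulary. It is not how Anthropic's tokenizer works and says nothing about why the new one produces more tokens. The vocabularies and sample sentence are invented for illustration.

```python
import string


def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match: at each position, take the longest vocab entry that matches."""
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no vocab entry covers {text[i]!r}")
    return tokens


# 26 letters plus space and period: a 28-entry, letter-by-letter vocabulary.
CHAR_VOCAB = set(string.ascii_lowercase) | {" ", "."}
# The same vocabulary plus a few whole words.
WORD_VOCAB = CHAR_VOCAB | {"the", "of", "cost", "token"}

sentence = "the cost of the token"
print(len(tokenize(sentence, CHAR_VOCAB)), "tokens with the character vocabulary")
print(len(tokenize(sentence, WORD_VOCAB)), "tokens with whole words added")
```

In this toy setup, adding whole words cuts the count from 21 tokens to 9, which is the direction the host expected.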
22:04-23:05
SemiAnalysis Host: Interesting. That’s a good point. Conceptually, you could train a model with whole words as tokens, but in practice, it turns out the model is less token efficient with a bigger vocabulary. Man, I came in hot with that one!
23:06-23:13
SemiAnalysis Host: Wow. He was thinking that whole time. Okay, yeah.
23:13-23:36
SemiAnalysis Host: The whole idea behind being more token-efficient is that you can solve full tasks using fewer tokens. If you have a more granular or larger token size, or just a bigger vocabulary, you can represent more information in latent space. For example, "semianalysis" might be one token instead of being broken into "semi" and "analysis."
23:36-24:09
SemiAnalysis Host: But then, that creates more context versus maybe four tokens in "semianalysis." I get it. It’s all about choices, richer information. Like, the embedding for "semianalysis" might be closer to the embedding for "Dylan" in latent space, whereas "semi" or "analysis" separately probably aren’t close to it.
24:09-24:27
SemiAnalysis Host: It’s probably unclear whether this actually improves performance significantly because people still debate whether 4.7 or 4.6 is better. And also if it’s more token-efficient or worse, since some say if it’s not a significantly better model, why bother with a new tokenizer?
24:27-24:43
SemiAnalysis Host: Maybe this is just an early checkpoint—they need more reinforcement learning to improve 4.7 and then drop a 4.8 version that really makes a difference. I don’t think Anthropic puts out half-baked models like OpenAI clearly does. OpenAI’s 5.3 Codex was basically RL only on code. Is that their most half-baked model so far?
24:43-25:12
SemiAnalysis Host: Before, we saw Sonnet before Opus—like 4.5 Sonnet, then 4.5 Opus, then 4.6. This is the first new one that’s just Opus. Actually, Opus 4.7 is basically Sonnet too.
Jordan Nanos: Are you not a truther?
SemiAnalysis Host: Yeah, and then Opus 4.7 is Vitas. There you go. This is straight-up truther talk.
25:12-25:30
SemiAnalysis Host: Opus 4.6 was Sonnet all along—does everyone remember that? Like, it was a big-small model, removing the mask and such. Yeah, that kind of model.
25:30-25:38
SemiAnalysis Host: Anyway, I wanted to circle back to DeepSeek since we haven’t talked about it much yet.
Dylan Patel: Oh, DeepSeek—the DeepState part of the article?
SemiAnalysis Host: Yeah, DeepSeek. We only touched on it lightly, but we have a lot more internal thoughts we didn’t write up.
25:38-25:56
SemiAnalysis Host: My question: do you think the gap between open source AI in China and the U.S. is starting to widen again? It feels like it is, and I’d say it’s tied to compute constraints. What do you think?
25:56-26:18
SemiAnalysis Host: Quick thoughts — DeepSeek is notable because it’s a one million token context window model. Many leading open source models, even the good ones used for coding, don’t reach that million-token context. Also, they published and later took down a reasoning and visual space paper, which hints they might release a multimodal version or are working on one.
26:18-26:52
SemiAnalysis Host: Those developments are fascinating because they could help China catch up on a product basis. Comparing Claude Code, Codex, and DeepSeek in OpenCode, DeepSeek might support all the same features but be better at a lot of other things too. But when DeepSeek first dropped, I used it a lot for random stuff; now I don’t really use it anymore.
Jordan Nanos: So is Kimi 2.6 better anyway?
SemiAnalysis Host: Yeah.
Jordan Nanos: I don’t use that either.
SemiAnalysis Host: My point is, the first DeepSeek release felt like it came fully baked out of nowhere—like appearing from thin air. This new release is just state-of-the-art in the open source space; it’s not necessarily better than Kimi 2.6.
26:52-27:40
SemiAnalysis Host: The biggest improvements in this release seem to be inference optimizations. They mentioned the Ascend kernel partly supporting inference, which could unlock much more compute power in China for the first time. Looking at the weight size, it seems designed to fit into the memory domain of an H200 8x pod. All the current top models can be inferred within that memory range, and nothing state-of-the-art is bigger than that.
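The "fits in one node" argument is simple arithmetic, sketched below. The 141 GB of HBM per H200 is the published spec. The parameter counts, precisions, and KV cache reserve are hypothetical placeholders, not DeepSeek's actual figures.

```python
H200_HBM_GB = 141   # HBM capacity per H200 GPU
GPUS_PER_NODE = 8


def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Weight footprint in GB: billions of parameters times bytes per parameter."""
    return params_billion * bits_per_param / 8


def fits_in_node(params_billion: float, bits_per_param: int, kv_reserve_gb: float = 0.0) -> bool:
    """True if weights plus a KV cache reserve fit in one 8-GPU node's combined HBM."""
    budget_gb = H200_HBM_GB * GPUS_PER_NODE
    return weight_gb(params_billion, bits_per_param) + kv_reserve_gb <= budget_gb


# Hypothetical 1-trillion-parameter model at different precisions.
print(fits_in_node(1000, 8))                      # 8-bit weights, no KV reserve
print(fits_in_node(1000, 16))                     # 16-bit weights
print(fits_in_node(1000, 8, kv_reserve_gb=200))   # 8-bit weights plus room for KV cache
```

The combined budget is 1,128 GB, so whatever the weights leave over is all the room there is for KV cache and activations. That ties the size ceiling to the KV cache compression work discussed later.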
27:40-28:14
SemiAnalysis Host: That clearly seems like a hard cap. They might have bigger models, but they probably won’t release them publicly in something like a 200-pod configuration. It feels like they’re hitting a wall. Do you think this will slow or limit Chinese AI progress because they can’t run inference at a bigger scale?
28:14-28:42
SemiAnalysis Host: I'd like to hear some hot takes.
Dylan Patel: Okay, I think… what the hell are you talking about?
SemiAnalysis Host: He’s heating up. Getting all hot and sweaty.
28:42-29:08
SemiAnalysis Host: DeepSeek is mostly an engineering release. It’s fascinating, especially all the new attention variants and compression in the KV cache. Their infrastructure work is solid. Kimi manages 256k context, and DeepSeek does one million, which is a huge upgrade for tasks needing really long context.
Jordan Nanos: But is the jump from 256k to a million tokens useful or even practical, like in Opus?
Dylan Patel: No, it’s crap.
SemiAnalysis Host: It’s not entirely useless, but it’s probably just a theoretical step forward in state-of-the-art terms.
29:08-29:37
SemiAnalysis Host: The problem is nobody wants to be constantly clearing context during big tasks. Having the whole context visible is really helpful. Compaction—the process of summarizing and compressing prior tokens—just doesn’t work well. Not being able to use a million token context window is frustrating. But honestly, I think it might actually be better to just clear the context.
Dylan Patel: Better to clear?
SemiAnalysis Host: Yeah, that’s my hot take. I’d rather start fresh. Make a summary of what we’ve done so far, copy-paste that, and just start over. Forget compaction.
29:37-29:56
SemiAnalysis Host: DeepSeek’s 3.2 paper found that for their benchmarks—which I think aren’t very good—once you exceed the context window for a task, performance actually improves on their tests by clearing context entirely instead of trying to summarize or compact it.
Dylan Patel: Isn’t making a summary just what compaction is underneath anyway?
30:00-30:21
SemiAnalysis Host: When I say "make a summary," I mean something like, literally using the entire context window versus just the last thing I was doing. I’m trying to memorize, but I’m not going to read through a million tokens of sloth pacing. Instead, I’ll say, “Okay, what was I doing here? Read this.” And then I might copy-paste just a small bit, not the whole thing. So I’m not really summarizing everything—I’m just passing off tasks.
30:21-30:37
SemiAnalysis Host: There are different ways to compact information, but I see compaction as different from summarization. Compaction is about removing unnecessary thinking traces early on. It’s not about having the model write a summary of the full context.
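The distinction being drawn here can be sketched in a few lines. This is a hypothetical illustration, not how Claude Code, Codex, or any specific harness implements it. The `Message` type and its `kind` labels are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Message:
    kind: str  # "user", "assistant", or "thinking" (hypothetical labels)
    text: str


def compact(history: list[Message], keep_recent: int) -> list[Message]:
    """Compaction as described above: drop thinking traces from all but the most recent messages."""
    cutoff = max(len(history) - keep_recent, 0)
    return [m for i, m in enumerate(history) if i >= cutoff or m.kind != "thinking"]


def fresh_start(handoff_note: str) -> list[Message]:
    """The 'clear the context' alternative: discard the history and carry over one short note."""
    return [Message("user", handoff_note)]


history = [
    Message("user", "Profile the kernel."),
    Message("thinking", "Long internal reasoning..."),
    Message("assistant", "The bottleneck is memory bandwidth."),
    Message("thinking", "More reasoning..."),
    Message("assistant", "Proposed fix: tile the loads."),
]

print([m.kind for m in compact(history, keep_recent=2)])
print([m.kind for m in fresh_start("Continue: tile the loads in the kernel.")])
```

Neither function asks a model to summarize the full context, which is the third approach the next speaker describes.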
30:37-30:46
SemiAnalysis Host: I thought summarization meant literally taking your entire context and asking, “Hey, please summarize this.” That’s one way, and they’ve tried multiple approaches.
30:46-31:07
SemiAnalysis Host: Here’s my hot take: Anthropic did well with Claude Code’s CLI. They kept making the CLI better and better. But the CLI isn’t the ultimate platform for orchestrating agents. So in a way, they’ve trapped themselves in an innovator’s dilemma by focusing too much on improving the CLI.
31:07-31:32
SemiAnalysis Host: OpenAI, on the other hand, has the real vision for what the future agent orchestration platform looks like. It’s an app-based experience that integrates voice, multimodality, and all sorts of things. The app is far better than the CLI. Users should really be on the app, not stuck on the CLI, which is a dead end—a relic from the first and second halves of 2025.
31:32-31:42
Dylan Patel: Are you saying the app is the future forever? Or do you mean their device? They’re talking about releasing a consumer device next year.
31:42-31:52
SemiAnalysis Host: No, no. I mean the laptop app—the Codex app. It’s about their long-term plan.
31:52-31:55
Dylan Patel: Then why does the Codex app suck?
31:55-32:01
SemiAnalysis Host: It’s all about long-term vision.
32:01-32:24
SemiAnalysis Host: Dylan, I think you’re thinking too small. In a perfect, maxi world, the operating system won’t even be necessary. You just get some hardware, plug it in, it opens up the terminal, connects to your cloud API, and builds the OS for you.
32:24-32:37
SemiAnalysis Host: The Codex app is adding generative UI features, which is interesting. But if you’re a coding purist, generative UI is downstream from the CLI experience.
32:37-32:50
Dylan Patel: I’m a CLI guy—that’s just personal preference. I love the CLI. Like, Claude Code usage really feels like just a CLI wrapper. You can tell.
32:50-33:01
Dylan Patel: And Codex CLI feels like just an app wrapper. It’s like they forced everything into this structure.
33:01-33:15
SemiAnalysis Host: There are two camps on what the future holds, and honestly, who knows which will win long term. I’m happy to keep the competition open. But from a true maxi’s perspective, it’s all downstream of the CLI. It’s just tokens—that’s the most efficient approach.
33:15-33:28
SemiAnalysis Host: All you need is an API and inputs. Your app and all that other stuff? That’s just layers of obfuscation. Mythos (the known maxi) would definitely agree more with OpenAI than with Anthropic on this. Who are we to decide what UI we truly want?
33:28-33:33
Dylan Patel: No, dude. It’s CLI all the way down. Pure maxi vision.
33:33-33:42
Dylan Patel: I’m a VS Code plug-in guy. I even tested this yesterday. I was bugging Max, who told me to back off, about using Ghostty for CLI stuff... but it just doesn’t work well.
33:42-33:54
SemiAnalysis Host: Still, if you have, say, six Ghostty CLI terminals open, it feels like a better experience than juggling six different chats in the Codex app.
33:54-34:09
Dylan Patel: Definitely, I feel more productive. Comparing it to VS Code, I can have six windows open with a file browser on the left, right-click easily—it’s just better workflow.
34:09-34:13
SemiAnalysis Host: The difference is, you’re actually writing real code and you care about the output, whereas I don’t even look at the code.
34:13
Dylan Patel: No, I’m not reading code either—I’m just copying images or Excel files. Don’t accuse me of reading the code!
34:21-34:35
SemiAnalysis Panelist: You’re talking about Edson. My favorite thing is, I actually agree that the API or CLI is basically the new compiler these days. Nobody really reads the compiler anymore. Nobody cares. They don’t need to touch the magic.
34:36-34:50
SemiAnalysis Panelist: That might be true in the future, but right now, for any code you really care about, it’s not true. The models still produce a lot of bad output that needs coaxing to fix.
34:51-35:11
SemiAnalysis Panelist: I’m not saying I type the code myself anymore, but I do read code. This morning, I was reading a group chat with some of the best kernel programmers in the world. And what they’ve found — because they’re my boys, you know I’m a master networker — is that the best way is...
35:12-35:35
SemiAnalysis Panelist: Look, Max, Codex is dumb, but I just have it create code and then it kind of works. The code it generates is sloppy, but then I have Opus rewrite it and clean it up. You can’t go the other way around — you can’t have Opus generate code and then have Codex fix it.
35:36-35:51
SemiAnalysis Panelist: Funny enough, everyone else in the firm prefers the opposite workflow. The whole firm actually likes it the other way around.
35:52-36:01
SemiAnalysis Panelist: But hey, we’re not writing kernels for Tri Dao or anything, right? We mostly do kernel benchmarks, but yeah, we wrote some kernels. Not Tri Dao kernels, just GPU MODE kernel competitions.
36:02-36:15
SemiAnalysis Panelist: Here’s the deal: if you talk about niche microarchitecture details, Claude tends to waffle and um and ah instead of just doing it. Codex will just try to implement everything, even if sloppy. Then Opus will clean it up and fix it.
36:16-36:34
SemiAnalysis Panelist: Doug, this ties into the context window stuff. When you feed tons of docs about a GPU’s instruction set architecture for writing a kernel, you need huge context windows—like a million tokens—because smaller windows just run out of space.
36:35-36:43
SemiAnalysis Panelist: Anyway, Dylan, do you want to circle back to DeepSeek? Any hot takes? Why didn’t it shake up the market this time even though KV cache was cut by 90%?
36:44-37:00
Dylan Patel: Man, it’s been a while since I was in Asia, but every time I go, they’re referencing some new paper that reduces KV cache by a crazy amount. This has been happening nonstop for the past three years. And American researchers have never even heard of these papers. They praise them like they’re the best thing ever.
37:01-37:14
Dylan Patel: DeepSeek and TurboQuant were the biggest ones popping up recently, but TurboQuant was obviously fake news. Yeah, it’s pretty hilarious.
37:15-37:24
Dylan Patel: Honestly, maybe people are just tired of being burned. The reality is, the Jevons paradox is clearly in play, with GPU prices going up. It’s simple economics.
37:25-37:33
Dylan Patel: Speaking of real fake news, let’s talk about SubQ. No plans to write an article or post on this, just some free alpha. Did anyone read that? It’s pretty suspicious.
37:34-37:41
Dylan Patel: It’s actually extremely, ultra, mega suspicious—like people just launching startups left and right.
37:42-37:58
Dylan Patel: I honestly think if they closed funding right now, they’d say, “Wow, it was just Opus with 10 context windows stitched together.”
37:59-38:15
SemiAnalysis Panelist: You think the market’s hot enough for them to close a $200 billion round next month or something?
Dylan Patel: If they did, I don’t know if they’d hit $200 billion, but maybe $50 billion at a billion valuation...
SemiAnalysis Panelist: A trillion?
Dylan Patel: No, I said billion.
SemiAnalysis Panelist: No, trillion, trillion, trillion.
Dylan Patel: Sorry, billion.
38:16-38:23
SemiAnalysis Host: Wait, are you talking about Anthropic? Dylan, are you even familiar with what we're discussing?
Dylan Patel: No, sorry. I just thought you guys were talking about Anthropic.
SemiAnalysis Host: No, we're not. We're actually talking about model sparsity.
38:24-38:38
SemiAnalysis Host: I'm talking about the worst case here. Did you see that fake news Twitter thing called SubQ today?
Dylan Patel: Oh yeah, yeah, yeah, I saw that.
SemiAnalysis Host: Don’t worry about it. We actually requested API access ourselves—using our SemiAnalysis email—to get access to this model, which is supposedly very real, bro.
38:39-38:52
SemiAnalysis Host: Who knows? Maybe it's a state space model or maybe it’s something else. But no, I don't think it’s an SSM. It’s just funny because people are freaking out over DeepSeek. If it were real, the stock should have dropped by a quadrillion percent today.
38:53-39:15
SemiAnalysis Host: But obviously, it’s not real. I mean, no offense to the founder, maybe he’s legit, but looking at these guys, I just don’t think they’re the ones who are going to solve the single hardest problem in AI.
39:16-39:48
SemiAnalysis Host: This reminds me though: Llama 4 Scout or Maverick, I think Scout, the smallest one, was announced with a 10 million token context window, but it wasn’t really supported in the released weights. I’m surprised no one with unlimited compute has tried a more expensive model with a larger context window yet.
Where are they going to get the data, right? Most people pre-train with something like a 16K or 32K context window, then fine-tune to hack in longer context. But what data even exists out there for huge windows?
39:49-40:07
SemiAnalysis Host: That’s why my experiments with 250K to a million tokens of context are trash: there’s just no data for it. The model doesn’t generalize well with such huge context. If you jump to 10 million tokens, what useful info do you even have beyond one million or so? The difference is tiny.
40:08-40:25
Dylan Patel: Makes sense. Maybe synthetic data, or something else? Why even bother? Seems obvious people would be working on it, but no one’s publicly announced anything over 2 million yet.
SemiAnalysis Host: Google actually serves 2 million tokens on their 3.1 pro version—they started with 1 million and then updated to 2 million on some models.
40:26-40:52
Dylan Patel: Okay, so 1 million is what's real today. Makes sense.
SemiAnalysis Host: Yeah, so why not scale up with TPUs and give it a shot?
40:53-41:07
SemiAnalysis Host: By the way, Max, this was kind of missed in the article, and Doug asked about it—a question on whether China is catching up or still behind with their AI models.
I was bugging you the other day: is DeepSeek or Kimi ahead of or behind Meta? And where do they stand relative to Grok, Cursor, or SpaceX’s xAI?
41:08-41:34
Max Kan: I’d say that today, DeepSeek is probably ahead of all those companies. But the real question is the slope from here—what their growth trajectory looks like.
It might sound basic, but the key input for how good an AI model is going to be remains how much compute you have.
41:35-41:57
Max Kan: Obviously, Baidu’s signing all these massive deals—it seems like they’ve overcome the problems from firing and then rehiring talent...
41:57-42:09
SemiAnalysis Panelist: The entire AI team is working on some solid models now. I’d expect Meta to pull ahead of all the Chinese competitors, if not by the end of this year, then definitely in the first half of ’27.
42:09-42:21
SemiAnalysis Panelist: Isn’t Meta doing model distillation?
SemiAnalysis Panelist: I don’t think they’re distilling from Anthropic.
SemiAnalysis Panelist: I thought they were distilling from the Chinese models.
SemiAnalysis Panelist: They’re mostly just running open-source models. That’s what Mistral does: they don’t distill from Anthropic, but from the Chinese guys.
42:21-42:36
SemiAnalysis Panelist: Fair point. Why are we even talking about Mistral, a leading French frontier model company?
SemiAnalysis Panelist: Dude, their revenue is seriously strong.
SemiAnalysis Panelist: Yeah, I know.
SemiAnalysis Panelist: Also, the bottles. They’ve got those too.
SemiAnalysis Panelist: You pick up the bottle on the bike?
SemiAnalysis Panelist: Every frontier model company is basically on a countdown to becoming a NeoCloud.
42:36-43:09
SemiAnalysis Panelist: Yeah, they’re all trying to become NeoClouds, or I believe they’re acquiring companies to do just that.
SemiAnalysis Panelist: Is Cerebras becoming a NeoCloud?
SemiAnalysis Panelist: What about NVIDIA launching NeoClouds?
SemiAnalysis Panelist: The ultimate business model for any company is just becoming a NeoCloud. SemiAnalysis will become a NeoCloud.
SemiAnalysis Panelist: And then we become ClusterMax Platinum.
SemiAnalysis Panelist: Diamond tier.
SemiAnalysis Panelist: No, we need a new tier above that.
SemiAnalysis Panelist: Yeah, diamond’s too common now.
SemiAnalysis Panelist: How about Mustang?
SemiAnalysis Panelist: SemiAnalysis, Lithium tier?
SemiAnalysis Panelist: Germanium? I’m just making stuff up.
SemiAnalysis Panelist: What’s the rarest one?
SemiAnalysis Panelist: What’s your favorite semiconductor, Doug?
Doug O’Laughlin: We should name the tiers after semiconductor materials.
SemiAnalysis Panelist: You’re into semiconductors? Name them all.
SemiAnalysis Panelist: Name the top five? We should be Rhodium, the rarest and most expensive precious metal.
SemiAnalysis Panelist: Vanadium?
SemiAnalysis Panelist: Alright guys, this is getting off track. Let’s wrap it up.
43:40-44:22
SemiAnalysis Panelist: Any final takes before we go?
SemiAnalysis Panelist: I don’t know.
SemiAnalysis Panelist: Me neither.
44:22-44:55
SemiAnalysis Host: Alright, quick shout-out to one of our listeners, Anna, who suggested this kind of format, which we did today. You can always give feedback here or in the comments if you want us to try something different. We’ve been debating whether to do weekly news, review newsletter articles from the past week or two, or something else. So, feedback is appreciated if you made it all the way through this episode.
SemiAnalysis Panelist: I bet the audience is more patient than Dylan. He didn’t even stick around for the end!
