Codex Fast Mode vs Claude Fast Mode: What’s Actually Different?

TL;DR
  • Codex has two fast tracks: same-model GPT-5.4 fast serving, and a separate ultra-fast small model called Spark running on Cerebras hardware.
  • Claude fast mode keeps the same Opus 4.6 model and speeds it up through infrastructure-level prioritization, which preserves quality but comes with a steep premium.
  • In practice, the better option depends less on raw speed alone and more on workflow fit, error rate, and the real cost of fixing mistakes.

Both Codex and Claude support a fast mode, but the way they achieve speed is completely different. Codex has two tracks: either it serves the same GPT-5.4 model about 1.5× faster, or it runs a separate small model called Spark on Cerebras wafer-scale hardware at more than 1,000 tokens per second. Claude keeps the same Opus 4.6 model and speeds it up through infrastructure-level prioritization, with output speed improving by up to 2.5×. The tradeoffs around price, speed, and intelligence retention are subtle, and which option is better depends on your workflow.

What got me curious

Since I use both Codex and Claude Code, I already knew both sides offered a fast mode. But the pricing felt different, the speed felt different, and the user experience felt different. Sean Goedecke’s post, "Two different tricks for fast LLM inference," made it clear that the two companies were solving the problem in fundamentally different ways, so I started digging deeper.

Codex fast mode: really two different tracks

On the Codex side, there are actually two things that can reasonably be called fast.

The first is GPT-5.4 fast mode. It serves the same GPT-5.4 model about 1.5× faster while consuming 2× the credits. Since the model itself does not change, there is no intelligence drop. In the CLI, it is toggled with a simple /fast command.

Nathan Lambert noted that even when using GPT-5.4 fast mode with xhigh reasoning effort, he had never hit the Codex limit, while Claude could still hit limits sometimes. Whether that comes from better token efficiency or looser limits on OpenAI’s side, it does feel noticeably roomier in practice.

The second is GPT-5.3-Codex-Spark, which is a separate model entirely. This is the truly ultra-fast path, running on Cerebras WSE-3 (Wafer-Scale Engine 3) hardware. It can generate more than 1,000 tokens per second. Right now, it is available as a research preview for ChatGPT Pro subscribers.

Cerebras WSE-3: a different world from GPUs

Cerebras WSE-3 is fundamentally different from a conventional GPU. NVIDIA’s flagship B200 has around 208 billion transistors, while the Cerebras chip packs 4 trillion transistors across roughly 900,000 cores on a single silicon wafer. The core advantage is memory bandwidth: 21 petabytes per second on chip. Since memory bandwidth is one of the real bottlenecks in LLM inference, Cerebras is attacking that bottleneck directly at the hardware level.
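To see why bandwidth dominates, note that single-stream decoding has to stream the model weights through memory once per generated token, so bandwidth divided by model size gives a rough ceiling on tokens per second. A back-of-envelope sketch (the bandwidth and model-size figures below are illustrative assumptions, not official specs of any particular model):

```python
def max_decode_tokens_per_sec(bandwidth_bytes_per_s: float,
                              model_bytes: float) -> float:
    """Upper bound on single-stream decode speed when every generated
    token requires streaming all model weights through memory once."""
    return bandwidth_bytes_per_s / model_bytes

# Illustrative numbers (assumptions for the sake of the estimate):
GPU_HBM = 8e12       # ~8 TB/s of HBM bandwidth on a modern GPU
WSE3_SRAM = 21e15    # on-chip SRAM bandwidth in the petabyte/s range
MODEL = 40e9         # a 40 GB model small enough for 44 GB of SRAM

print(f"GPU bound:      {max_decode_tokens_per_sec(GPU_HBM, MODEL):,.0f} tok/s")
print(f"Cerebras bound: {max_decode_tokens_per_sec(WSE3_SRAM, MODEL):,.0f} tok/s")
```

The two ceilings differ by three orders of magnitude, which is why a model that fits entirely in on-chip SRAM can blow past 1,000 tokens per second without any algorithmic trick.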

That said, WSE-3 only has 44GB of on-chip memory, so it is difficult to place a very large model like GPT-5.3-Codex on it wholesale. That is why Spark is a smaller model. In real use, some people say it still carries that familiar "small model smell," especially when tool calls get messy.

OpenAI and Cerebras have also announced a multi-year partnership worth up to $10B, including plans for a 750MW data center. The longer-term direction seems clear: Spark is likely just the beginning of putting bigger frontier models onto Cerebras hardware.

OpenAI also shared infrastructure-level optimizations around Spark. By introducing persistent WebSocket connections and optimizing the Responses API internals, they say they reduced client-server roundtrip overhead by 80%, token overhead by 30%, and TTFT by 50%. So the speedup is not only about the model itself. It is also about tightening the whole pipeline.

Claude fast mode: same model, different infrastructure

Claude’s approach is much simpler. The Opus 4.6 model stays exactly the same. If you set speed: "fast" in the API, Anthropic prioritizes the request at the infrastructure layer. According to the official docs, output token speed can improve by up to 2.5×. The focus is on output throughput rather than TTFT.

Anthropic has not publicly disclosed the full implementation details, but the likely explanation is something like lower-batch-size inference with more dedicated GPU allocation. Smaller batches are less efficient for GPU utilization, but they improve response speed for individual requests. That inefficiency is then covered by the 6× premium pricing.
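That batching tradeoff can be made concrete with a toy cost model (my own illustration, not Anthropic's numbers): streaming the weights is paid once per decode step regardless of batch size, while per-request compute scales with the batch. Larger batches amortize the weight load, but every request in the batch waits for the whole step.

```python
def step_time_ms(batch: int, weight_load_ms: float = 20.0,
                 per_req_compute_ms: float = 0.5) -> float:
    """Toy cost of one decode step: weight streaming is paid once per
    step, compute is paid once per request in the batch."""
    return weight_load_ms + batch * per_req_compute_ms

for batch in (1, 8, 64):
    t = step_time_ms(batch)
    per_req = 1000.0 / t        # tokens/s as seen by a single request
    total = batch * per_req     # tokens/s across the whole batch
    print(f"batch={batch:3d}  {per_req:6.1f} tok/s per request, "
          f"{total:8.1f} tok/s total")
```

In this toy model, batch 1 gives each request the best possible latency but wastes most of the hardware, while batch 64 multiplies total throughput at the cost of each individual request. A paid priority tier is essentially selling the batch-1 end of that curve.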

In Claude Code, fast mode is toggled with /fast, and it requires version 2.1.36 or later. When enabled, it automatically switches to Opus 4.6 and shows a ↯ icon next to the prompt.

One important detail is that fast mode usage is not included in the normal subscription usage bucket. It is billed as extra usage. Pricing kicks in from the very first token, so cost management matters.

Fast mode and effort level are also completely different axes. If you lower effort, the model simply spends less time reasoning and quality may drop. Fast mode, by contrast, serves the same reasoning process faster at the infrastructure level. You can combine them: fast mode plus lower effort for simpler tasks, fast mode plus higher effort for more complex ones.
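A request combining the two axes might look like the sketch below. The `speed: "fast"` field follows the docs as described above; the exact spelling of the effort knob (`"effort"` here) and the model id string are assumptions for illustration, so verify both against the current API reference before relying on them.

```python
import json
import urllib.request

body = {
    "model": "claude-opus-4-6",   # hypothetical model id string
    "max_tokens": 1024,
    "speed": "fast",              # infrastructure-level prioritization
    "effort": "low",              # less reasoning, fine for simple tasks
    "messages": [{"role": "user", "content": "Rename this variable."}],
}

req = urllib.request.Request(
    "https://api.anthropic.com/v1/messages",
    data=json.dumps(body).encode(),
    headers={
        "x-api-key": "sk-ant-...",        # placeholder key
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
)
# urllib.request.urlopen(req) would send it; left commented because the
# field names above are illustrative, not verified against the live API.
```

The point of the sketch is only that the two fields are orthogonal: you pay for prioritization with `speed`, and you trade reasoning depth for latency with the effort setting, independently.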

The core difference

The most important distinctions look like this:

  • Codex GPT-5.4 fast mode: about 1.5× speed, 2× credits, same model
  • Codex Spark: 15×+ speed, separate ultra-fast smaller model
  • Claude fast mode: up to 2.5× speed, 6× price, same Opus 4.6 model

Sean Goedecke captures the difference well. Anthropic is still serving the actual Opus 4.6 model, while OpenAI’s Spark path uses a separate lower-capability model. In terms of raw speed, Spark is dramatically faster. In terms of quality retention, Claude has the stronger position.

There is also a broader point here: the value of an AI agent is often determined less by raw speed and more by how rarely it makes mistakes. If something is 6× faster but increases mistakes by 20%, that can easily be a net loss, because fixing those mistakes may take much longer than waiting for the model.
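That claim is easy to check with expected-value arithmetic. Assume (illustrative numbers, not measurements) a slow model that takes 6 minutes per task with a 10% error rate, a fast one that takes 1 minute with a 30% error rate, and 30 minutes to diagnose and fix a mistake:

```python
def expected_time(gen_minutes: float, error_rate: float,
                  fix_minutes: float) -> float:
    """Expected wall-clock cost of one task: generation time plus the
    expected cost of fixing a mistake when one occurs."""
    return gen_minutes + error_rate * fix_minutes

slow = expected_time(gen_minutes=6.0, error_rate=0.10, fix_minutes=30.0)
fast = expected_time(gen_minutes=1.0, error_rate=0.30, fix_minutes=30.0)
print(f"slow model: {slow:.1f} min expected per task")
print(f"fast model: {fast:.1f} min expected per task")
```

With these assumed numbers the 6×-faster model actually loses: the fix time dominates, so a modest increase in error rate wipes out the entire generation-speed advantage.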

So if you compare same-model fast modes only, Claude offers a bigger speed bump than Codex, but it is also much more expensive. If you include Spark, OpenAI has the more extreme speed story, but you have to remember it is not the same model.
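One crude way to compare the two same-model fast modes is speed gained per unit of extra cost, using the multipliers from the comparison above (a rough ratio, not a full cost model, since it ignores base prices and token efficiency):

```python
# Speed multiplier divided by price multiplier for each same-model tier.
codex_fast = 1.5 / 2.0    # ~1.5x speed for 2x credits
claude_fast = 2.5 / 6.0   # up to 2.5x speed for 6x price

print(f"Codex fast:  {codex_fast:.2f}x speed per 1x extra cost")
print(f"Claude fast: {claude_fast:.2f}x speed per 1x extra cost")
```

By this crude metric Codex's same-model tier is the better speed-per-dollar deal, while Claude's tier buys a larger absolute speedup at a worse exchange rate.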

What about speculative decoding?

Early in my research, I came across claims that Codex fast mode used speculative decoding. That does not seem accurate. Speculative decoding itself is a real and widely used inference optimization technique, but I could not find official confirmation that Codex fast mode specifically uses it.

The idea behind speculative decoding is elegant. A small draft model predicts upcoming tokens first, and then the larger main model verifies them in a single pass. Google published work on this in 2022 and later discussed using it in products like AI Overviews, where it can deliver 2–3× speedups while preserving the same output distribution.
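The mechanism is easy to sketch in a greedy toy version (my own illustration of the technique, not Google's or OpenAI's implementation; real systems verify all draft positions in one batched forward pass and sample rather than match greedily):

```python
def speculative_decode(target, draft, prompt, n_tokens, k=3):
    """Greedy speculative decoding sketch: the draft proposes k tokens,
    the target checks them; the matching prefix is accepted, and the
    target contributes one corrected or bonus token per round."""
    out = list(prompt)
    target_calls = 0
    while len(out) - len(prompt) < n_tokens:
        # Draft proposes k tokens autoregressively (cheap calls).
        proposed = []
        for _ in range(k):
            proposed.append(draft(out + proposed))
        # One target pass scores all k positions at once (simulated here
        # by per-position calls; counted as a single forward pass).
        target_calls += 1
        accepted = []
        for tok in proposed:
            if target(out + accepted) == tok:
                accepted.append(tok)
            else:
                accepted.append(target(out + accepted))  # target's fix
                break
        else:
            accepted.append(target(out + accepted))      # bonus token
        out.extend(accepted)
    return out[:len(prompt) + n_tokens], target_calls

# Toy "models": the next token is previous + 1; the draft gets it
# wrong whenever the correct next token is a multiple of 5.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if (ctx[-1] + 1) % 5 else ctx[-1] + 2

tokens, calls = speculative_decode(target, draft, [0], n_tokens=12)
print(tokens)   # identical to what the target alone would produce
print(calls)    # → 5 target passes instead of 12
```

Because a wrong draft token is always replaced by the target's own choice, the output is exactly what the target would have produced on its own; the speedup comes purely from needing fewer expensive target passes.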

For Codex Spark, though, the main speed story seems much more tied to the hardware characteristics of Cerebras itself. The model benefits from staying close to on-chip SRAM and avoiding the usual memory bandwidth bottlenecks. It is possible that speculative decoding is also used somewhere internally, but there is no official confirmation.

Closing thoughts

Peter Steinberger is one of the most fascinating examples of where this kind of workflow can go. He reportedly runs four OpenAI subscriptions and one Anthropic subscription, spends around $1,000 per month, runs 3–8 Codex CLI sessions in a 3×3 terminal grid, and can hit 600 commits in a day. That is a completely different scale. By his own estimate, API usage would cost about 10× more, so running multiple subscriptions is actually the more rational option. More recently, he has even joined OpenAI.

What is especially interesting is that Peter used to be a serious Claude Code power user but gradually shifted toward Codex. His reason was surprisingly relatable: Claude Code kept saying things like "absolutely right" and "100% production ready" even when tests were failing, and he found that unbearable. Codex, by contrast, felt more like an introverted engineer quietly doing the work. He also said Codex tends to read far more code before starting, which lets it infer intent well even from short prompts. Eventually he canceled additional Anthropic subscriptions and made Codex his main driver, even though he still uses Claude in a smaller role.

Whether I am on Claude Max or Codex Pro, I usually cannot even consume the full weekly quota. But people like that are running five subscriptions at once. If you listen to AI podcasts, there are quite a few people using even more. A while ago I had to force myself to adapt to a kind of parallel-project brain just to burn through huge amounts of tokens, and it was honestly exhausting. Now I do not really get the headache anymore. Instead, I get stuck wondering what else I could even do with all this capacity. That is how one project leads to another, and another task appears from there.

In the end, running several projects at once becomes a kind of refresh loop. If I look away from one blocked project for a while and work on another, ideas tend to come back. Peter described it as doing one thing while another is "cooking," then switching again while that one cooks too. My scale is obviously smaller, but I recognize the pattern.

Refs