Project Peak
varies: model (4 levels)

The GPT-5 family, ranked by capability

How the GPT-5 variants compare on the capability ceiling for long-context code reasoning, from nano up to the full model and the Codex variant. Same instrument and scoring; only the model changes.

Held constant
query mode: multihop
reasoning effort: medium
reliability thresholds: sustain ≥ 90%, break < 80%
scoring: exact pass-rate
task: state tracking (T5)
Not held constant: provider (openai, openrouter).
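
The two reliability thresholds can be read as a three-way split of a rung's pass rate. The following is a minimal sketch, assuming pass rates are expressed as fractions between 0 and 1. The function name `classify_rung` and the middle label `"between"` are illustrative and not part of the instrument.

```python
SUSTAIN_MIN = 0.90  # sustain threshold from the study: pass rate >= 90%
BREAK_MAX = 0.80    # break threshold from the study: pass rate < 80%


def classify_rung(pass_rate: float) -> str:
    """Label one rung's pass rate against the two thresholds."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be a fraction between 0 and 1")
    if pass_rate >= SUSTAIN_MIN:
        return "sustain"
    if pass_rate < BREAK_MAX:
        return "break"
    # Hypothetical label for the 80% to 90% band, which meets neither threshold.
    return "between"


for rate in (1.0, 0.875, 0.5):
    print(rate, classify_rung(rate))
```

A pass rate of 87.5% lands in the middle band under this reading: it neither sustains nor breaks the rung.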

Score by level

Within each level, the model and the test are fixed. Across levels, only the model changes. A higher score is better, and a tighter confidence interval is better still.

#1
gpt-5
ceiling found
73.2/100
CI 63.4 to 73.8
Ladder rungs shown: H10 @ 32K, H14 @ 64K, H18 @ 128K, H24 @ 200K, H32 @ 350K, H40 @ 524K, H52 @ 786K, H64 @ 1049K
sustains H24 @ 200K · breaks H32 @ 350K · sharp cliff
#2
gpt-5-mini
ceiling found
66/100
CI 63.4 to 67.1
Ladder rungs shown: H10 @ 32K, H14 @ 64K, H18 @ 128K, H24 @ 200K, H32 @ 350K, H40 @ 524K, H52 @ 786K, H64 @ 1049K
sustains H18 @ 128K · breaks H24 @ 200K · sharp cliff
#3
gpt-5-codex
ceiling found
62.8/100
CI 58.5 to 66.4
Ladder rungs shown: H6 @ 8K, H8 @ 16K, H10 @ 32K, H14 @ 64K, H18 @ 128K, H24 @ 200K, H32 @ 350K, H40 @ 524K, H52 @ 786K, H64 @ 1049K
sustains H14 @ 64K · breaks H18 @ 128K · moderate
#4
gpt-5-nano
ceiling found
33.6/100
CI 29.6 to 36.6
Ladder rungs shown: H4 @ 8K, H6 @ 8K, H8 @ 16K, H10 @ 32K, H40 @ 524K, H52 @ 786K, H64 @ 1049K
sustains H6 @ 8K · breaks H8 @ 16K · moderate
Analysis generated by anthropic/claude-sonnet-4.6 · v1

How the GPT-5 family compares on long-context reasoning

Bigger models hold up better, and the jump is large: nano scores 33.6, mini 66.0, and the full model 73.2. The order isn't perfectly clean, though. Codex lands at 62.8, below mini, even though it's the code-specialized variant, and the full model has an odd dip in the middle of its curve.

What we compared

We measured four models in the GPT-5 family on the same task: gpt-5-nano, gpt-5-mini, gpt-5, and gpt-5-codex. The task is state tracking, where the model has to follow a chain of references through a long context. The only thing we set out to vary is the model. Everything else was held fixed: the same difficulty ladder, the same 0 to 100 scoring, exact pass-rate marking, and a reasoning effort of medium on every model. The one thing that wasn't fully controlled is the route: Codex is only reachable for us through OpenRouter, while the other three run on the OpenAI direct API, and through that route Codex's usable context tops out around 256K rather than the full model's 400K.

Model       | Score (CI)          | Holds through | Breaks at
gpt-5       | 73.2 (63.4 to 73.8) | H24 @ 200K    | H32 @ 350K
gpt-5-mini  | 66.0 (63.4 to 67.1) | H18 @ 128K    | H24 @ 200K
gpt-5-codex | 62.8 (58.5 to 66.4) | H14 @ 64K     | H18 @ 128K
gpt-5-nano  | 33.6 (29.6 to 36.6) | H6 @ 8K       | H8 @ 16K

The big story is the climb from nano to the full model: 33.6, then 66.0, then 73.2, about a 40-point spread. All four have a confirmed ceiling, so each number is bracketed on both sides.
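
The spread and the step sizes can be checked directly from the table. This short sketch uses only the four reported scores; the variable names are illustrative.

```python
# Scores copied from the table above (0 to 100 scale).
scores = {
    "gpt-5": 73.2,
    "gpt-5-mini": 66.0,
    "gpt-5-codex": 62.8,
    "gpt-5-nano": 33.6,
}

# Rank model names from highest to lowest score.
ranked = sorted(scores, key=scores.get, reverse=True)

# Spread between the best and the worst model.
spread = round(scores[ranked[0]] - scores[ranked[-1]], 1)

# Gap between each model and the one ranked directly below it.
gaps = {
    f"{upper} over {lower}": round(scores[upper] - scores[lower], 1)
    for upper, lower in zip(ranked, ranked[1:])
}

print(ranked)
print(spread)
print(gaps)
```

Most of the spread sits in the last step: the gap between Codex and nano is far larger than the gaps among the top three.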

Reading the numbers

Nano falls off early and steeply. It passes H6 @ 8K cleanly, drops to about 19% at H10 @ 32K, and scores zero above that. Its interval (29.6 to 36.6) is about 7 points wide, narrower than the full model's, and the ceiling is not in doubt.

The full GPT-5 has a wrinkle. It passes the harder H24 @ 200K at 100% but only 87.5% at the easier H18 @ 128K just before it. That 87.5% is under the 90% sustain threshold but above the 80% break threshold, and the next rung is passed in full, so the ceiling placement holds, but the dip is real. Its interval is also wide (63.4 to 73.8), so its edge over mini is not firmly established.

Codex degrades gracefully but earlier than mini. It holds H14 @ 64K cleanly, slips to 87.5% at H18 @ 128K, and is down to about 50% by H24 @ 200K. Its OpenRouter route can't take the 350K rung at all. So on this long-context reasoning task it sits just below mini, despite being the code-specialized model. If your work is tracking state across a long context rather than writing code, this comparison does not point to Codex.

What it means for picking a model

Going from nano to mini is the big practical step: about 32 points, and usable context stretches from 8K to 128K. The full model adds a few more points over mini, but the intervals overlap, so the gap between them is soft. Codex comes in just under mini here, with a smaller usable window through the route we have.
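
One rough way to see which gaps are soft is to check whether two models' intervals overlap. The sketch below uses the reported scores and intervals; the `Result` type and the `intervals_overlap` helper are illustrative names. Overlap is only a quick screen here, not a significance test.

```python
from typing import NamedTuple


class Result(NamedTuple):
    score: float
    ci_low: float
    ci_high: float


# Values copied from the results table.
results = {
    "gpt-5": Result(73.2, 63.4, 73.8),
    "gpt-5-mini": Result(66.0, 63.4, 67.1),
    "gpt-5-codex": Result(62.8, 58.5, 66.4),
    "gpt-5-nano": Result(33.6, 29.6, 36.6),
}


def intervals_overlap(a: Result, b: Result) -> bool:
    """True when the two confidence intervals share at least one point."""
    return a.ci_low <= b.ci_high and b.ci_low <= a.ci_high


pairs = [
    ("gpt-5", "gpt-5-mini"),
    ("gpt-5-mini", "gpt-5-codex"),
    ("gpt-5-mini", "gpt-5-nano"),
]
for first, second in pairs:
    print(first, second, intervals_overlap(results[first], results[second]))
```

On these numbers the top three models overlap with their neighbours, while nano's interval is separated from mini's.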

Method note

Scores come from a difficulty ladder where reference hops and context length rise together, and a model has to hit a 90% pass rate to hold a rung. Every model ran at medium reasoning effort with exact scoring, so the only thing moving across the table is the model itself (and, for Codex, the API route). This is one task type under one prompt setup, and it does not speak to other reasoning or coding benchmarks.
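
The 90% rule for holding a rung can be sketched as a search for the highest rung that meets the threshold. This is one reading of the rule, not the instrument's actual code, and `highest_sustained` is a hypothetical helper. The pass rates are the ones quoted in the analysis, with "cleanly" taken as 1.0 and "about 50%" as 0.50, both of which are assumptions.

```python
from typing import Optional

SUSTAIN_MIN = 0.90  # a rung is held at a pass rate of 90% or more


def highest_sustained(ladder: list[tuple[str, float]]) -> Optional[str]:
    """Return the last rung, in ladder order, whose pass rate meets the threshold."""
    best = None
    for rung, pass_rate in ladder:
        if pass_rate >= SUSTAIN_MIN:
            best = rung
    return best


# Codex rungs quoted in the text (1.0 and 0.50 are assumed readings).
codex = [("H14 @ 64K", 1.0), ("H18 @ 128K", 0.875), ("H24 @ 200K", 0.50)]

# The two gpt-5 rungs around its mid-curve dip, as quoted in the text.
gpt5_excerpt = [("H18 @ 128K", 0.875), ("H24 @ 200K", 1.0)]

print(highest_sustained(codex))
print(highest_sustained(gpt5_excerpt))
```

Under this reading a dip on a lower rung does not lower the ceiling as long as a higher rung still meets the threshold, which matches the reported placements for both models.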