Project Peak
Varies: reasoning effort (4 levels)

gpt-5-nano · reasoning effort

Same model and task; only the reasoning-effort setting changes. It shows how much long-context capability gpt-5-nano gains from more reasoning, from minimal (which cannot do the task at all) up to high.

Held constant
model: gpt-5-nano
provider: openai
query mode: multihop
reliability thresholds: sustain ≥ 90%, break < 80%
scoring: exact pass-rate
task: state tracking (T5)

Not held constant
reasoning effort: minimal, low, medium, high

Score by level

Each level holds the model and the test fixed. Only reasoning effort changes. A higher score is better, and a tighter confidence interval means the score is pinned down more precisely.

#1 high · ceiling found
Score: 33.6/100 (CI 29.6 to 37.5)
Ladder rungs: H4 · 8K, H6 · 8K, H8 · 16K, H10 · 32K, H40 · 524K, H52 · 786K, H64 · 1049K
Sustains H6 · 8K; breaks H8 · 16K; moderate

#2 medium · ceiling found
Score: 29.6/100 (CI 27.6 to 33.6)
Ladder rungs: H4 · 8K, H6 · 8K, H8 · 16K, H10 · 32K, H40 · 524K, H52 · 786K, H64 · 1049K
Sustains H6 · 8K; breaks H8 · 16K; moderate

#3 low · ceiling found
Score: 22.7/100 (CI 22.0 to 24.7)
Ladder rungs: H3 · 4K, H4 · 8K, H6 · 8K, H8 · 16K, H10 · 32K, H40 · 524K, H52 · 786K, H64 · 1049K
Sustains H4 · 8K; breaks H6 · 8K; moderate

#4 minimal · floored
Score: 0/100 (CI 0 to 0)
Ladder rungs: H2 · 2K, H3 · 4K, H4 · 8K, H6 · 8K, H8 · 16K, H10 · 32K, H40 · 524K, H52 · 786K, H64 · 1049K
Sustains n/a; breaks n/a; gentle decay

Analysis generated by anthropic/claude-sonnet-4.6 · v1

Reasoning effort gates how far gpt-5-nano gets on long context

At minimal reasoning, gpt-5-nano can't do the task at all. Turn reasoning up and real capability appears, climbing from nothing to a ceiling around H6 at 8K, then flattening out: medium and high land in nearly the same place. So reasoning effort is a genuine lever for nano on long-context work, with sharp diminishing returns past medium.

What we compared

Same model and the same task throughout: gpt-5-nano on state tracking, where it has to follow a chain of references through a long context. The ladder, the 0 to 100 scoring, and exact pass-rate marking were all held fixed. The only thing we changed is the reasoning-effort setting: minimal, low, medium, high. (The medium arm reuses the trials we already had on file, so it cost nothing to add.)

Reasoning   Score (CI)             Holds through
minimal     0.0 (floored)          fails even 2K
low         22.7 (22.0 to 24.7)    H4 @ 8K
medium      29.6 (27.6 to 33.6)    H6 @ 8K
high        33.6 (29.6 to 37.5)    H6 @ 8K

A note on the endpoints: through the OpenAI Chat Completions API we use, "off" behaves the same as minimal (gpt-5 always reasons a little), and "xhigh" behaves the same as high (xhigh is only exposed on the Responses API, which we don't call yet). So those two settings aren't separately measurable here, and minimal and high are the real floor and ceiling of this sweep.
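The aliasing described in this note can be sketched as a small lookup. This is only an illustration of the collapse from six nominal settings to four measurable ones; the helper names (`effective_effort`, `distinct_levels`) are hypothetical and are not taken from the study's harness.

```python
# Hypothetical helper names; not the study's actual harness.
CHAT_COMPLETIONS_EFFORTS = ("minimal", "low", "medium", "high")

# Settings the note describes as not separately measurable in this sweep.
ALIASES = {"off": "minimal", "xhigh": "high"}


def effective_effort(requested: str) -> str:
    """Return the effort level a requested setting behaves as in this sweep."""
    effort = ALIASES.get(requested, requested)
    if effort not in CHAT_COMPLETIONS_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {requested!r}")
    return effort


def distinct_levels(requested_levels):
    """Collapse requested settings to the distinct levels they behave as, keeping order."""
    seen = []
    for level in requested_levels:
        effort = effective_effort(level)
        if effort not in seen:
            seen.append(effort)
    return seen


print(distinct_levels(["off", "minimal", "low", "medium", "high", "xhigh"]))
# prints ['minimal', 'low', 'medium', 'high']
```

Six requested settings reduce to the four arms reported in the table, which is why minimal and high bound the sweep.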

Reading the numbers

At minimal effort nano is floored: it scores about 1 in 8 even on the easiest 2K rung and zero above that. With almost no reasoning budget it simply can't track state across the context. Low effort gets it onto the ladder, holding H4 @ 8K. Medium takes it to H6 @ 8K, which matches nano's reference number. High holds the same rung and does a little better at 16K, but the gain over medium is small (33.6 versus 29.6) and the intervals overlap (29.6 to 37.5 versus 27.6 to 33.6).
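The overlap claim can be checked directly from the reported intervals. The snippet below is a minimal sketch using the CI bounds from the table; the function name `intervals_overlap` is illustrative, and overlapping intervals are only an informal signal, not a formal significance test.

```python
# CI bounds copied from the score table; the helper name is illustrative.
CONFIDENCE_INTERVALS = {
    "low": (22.0, 24.7),
    "medium": (27.6, 33.6),
    "high": (29.6, 37.5),
}


def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


print(intervals_overlap(CONFIDENCE_INTERVALS["medium"], CONFIDENCE_INTERVALS["high"]))
# prints True: medium and high are not clearly separated
print(intervals_overlap(CONFIDENCE_INTERVALS["low"], CONFIDENCE_INTERVALS["medium"]))
# prints False: the step from low to medium is clearly separated
```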

What it means for picking a setting

The useful range is minimal to medium: that's where almost all the capability appears, going from "can't do it" to a real H6 @ 8K ceiling. Past medium the curve flattens, so for nano on this kind of work, medium is the sensible default and pushing to high (or, once supported, xhigh) buys very little.
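To make the flattening concrete, the step-by-step gains can be computed from the reported scores. This is a sketch of the arithmetic only; the scores come from the table, and the function name `marginal_gains` is hypothetical.

```python
# Scores copied from the table, ordered from least to most reasoning effort.
SCORES = [("minimal", 0.0), ("low", 22.7), ("medium", 29.6), ("high", 33.6)]


def marginal_gains(scores):
    """Score gained at each step up the effort scale, rounded to one decimal."""
    return [
        (f"{prev}->{nxt}", round(nxt_score - prev_score, 1))
        for (prev, prev_score), (nxt, nxt_score) in zip(scores, scores[1:])
    ]


for step, gain in marginal_gains(SCORES):
    print(f"{step}: +{gain}")

# Share of the high-effort score that is already reached at medium.
share_at_medium = round(29.6 / 33.6, 2)
print(share_at_medium)
```

The steps come out at +22.7, +6.9, and +4.0, and medium already reaches about 88% of the high-effort score, which is the diminishing-returns pattern described above.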

Method note

Scores come from a difficulty ladder where reference hops and context length rise together. A model has to hit a 90% pass rate to sustain a rung, and a rung counts as broken when the pass rate falls below 80%. Only the reasoning-effort setting varies across the table; the model, task, ladder, and scoring are identical. This is one task type under one prompt setup and does not speak to other benchmarks.
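As a rough illustration of the reliability thresholds (sustain ≥ 90%, break < 80%), the sketch below summarizes a ladder of per-rung pass rates. It assumes "sustains" means the last rung in an unbroken run from the bottom and "breaks" means the first rung under the break threshold; the report does not spell out these definitions or how the 0 to 100 score is derived, and the pass rates in the example are made up for illustration, not measured values.

```python
SUSTAIN_THRESHOLD = 0.90  # from the report: sustain >= 90%
BREAK_THRESHOLD = 0.80    # from the report: break < 80%


def summarize_ladder(ladder):
    """Return (last sustained rung, first broken rung) for rungs in rising difficulty.

    Assumed definitions: a rung is sustained only if every easier rung was
    sustained too; the break point is the first rung below BREAK_THRESHOLD.
    """
    sustained = None
    broke_at = None
    still_sustaining = True
    for rung, pass_rate in ladder:
        if still_sustaining and pass_rate >= SUSTAIN_THRESHOLD:
            sustained = rung
        else:
            still_sustaining = False
        if broke_at is None and pass_rate < BREAK_THRESHOLD:
            broke_at = rung
    return sustained, broke_at


# Hypothetical pass rates, chosen only to illustrate the logic.
example = [("H4 · 8K", 1.00), ("H6 · 8K", 0.95), ("H8 · 16K", 0.50), ("H10 · 32K", 0.10)]
print(summarize_ladder(example))
# prints ('H6 · 8K', 'H8 · 16K')
```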