varies: reasoning effort 4 levels

gpt-5-nano · reasoning effort

Same model and task; only the reasoning-effort setting changes. It shows how much long-context capability gpt-5-nano gains from more reasoning, from minimal (which cannot do the task at all) up to high.

Held constant

model: gpt-5-nano provider: openai query mode: multihop reliability thresholds: sustain ≥ 90%, break < 80% scoring: exact pass-rate task: state tracking (T5)

Not held constant: reasoning effort (high, low, medium, minimal) .

Score by level

Each level holds the model and the test fixed. Only reasoning effort changes. A higher score is better, and a tighter confidence interval is better still.

high

ceiling found

33.6/100

CI 29.6 to 37.5

16K

H10

32K

H40

524K

H52

786K

H64

1049K

sustains H6 · 8K breaks H8 · 16K moderate

medium

ceiling found

29.6/100

CI 27.6 to 33.6

16K

H10

32K

H40

524K

H52

786K

H64

1049K

sustains H6 · 8K breaks H8 · 16K moderate

low

ceiling found

22.7/100

CI 22 to 24.7

16K

H10

32K

H40

524K

H52

786K

H64

1049K

sustains H4 · 8K breaks H6 · 8K moderate

minimal

floored

0/100

CI 0 to 0

16K

H10

32K

H40

524K

H52

786K

H64

1049K

sustains n/a breaks n/a gentle decay

Analysis generated by anthropic/claude-sonnet-4.6 · v1

Reasoning effort gates how far gpt-5-nano gets on long context

At minimal reasoning, gpt-5-nano can't do the task at all. Turn reasoning up and real capability appears, climbing from nothing to a ceiling around H6 at 8K, then flattening out: medium and high land in nearly the same place. So reasoning effort is a genuine lever for nano on long-context work, with sharp diminishing returns past medium.

What we compared

Same model and the same task throughout: gpt-5-nano on state tracking, where it has to follow a chain of references through a long context. The ladder, the 0 to 100 scoring, and exact pass-rate marking were all held fixed. The only thing we changed is the reasoning-effort setting: minimal, low, medium, high. (The medium arm reuses the trials we already had on file, so it cost nothing to add.)

Reasoning	Score (CI)	Holds through
minimal	0.0 (floored)	fails even 2K
low	22.7 (22.0 to 24.7)	H4 @ 8K
medium	29.6 (27.6 to 33.6)	H6 @ 8K
high	33.6 (29.6 to 37.5)	H6 @ 8K

A note on the endpoints: through the OpenAI Chat Completions API we use, "off" behaves the same as minimal (gpt-5 always reasons a little), and "xhigh" behaves the same as high (xhigh is only exposed on the Responses API, which we don't call yet). So those two settings aren't separately measurable here, and minimal and high are the real floor and ceiling of this sweep.

Reading the numbers

At minimal effort nano is floored: it scores about 1 in 8 even on the easiest 2K rung and zero above that. With almost no reasoning budget it simply can't track state across the context. Low effort gets it onto the ladder, holding H4 @ 8K. Medium takes it to H6 @ 8K, which matches nano's reference number. High holds the same rung and does a little better at 16K, but the gain over medium is small and the intervals overlap.

What it means for picking a setting

The useful range is minimal to medium: that's where almost all the capability appears, going from "can't do it" to a real H6 @ 8K ceiling. Past medium the curve flattens, so for nano on this kind of work, medium is the sensible default and pushing to high (or, once supported, xhigh) buys very little.

Method note

Scores come from a difficulty ladder where reference hops and context length rise together, and a model has to hit a 90% pass rate to hold a rung. Only the reasoning-effort setting varies across the table; the model, task, ladder, and scoring are identical. This is one task type under one prompt setup and does not speak to other benchmarks.