Project Peak model rankings

How well a model holds up as its context fills with code.

Name: Project Peak: model capability ceiling for long-context code reasoning
Creator: Project Peak
License: https://creativecommons.org/licenses/by/4.0/

We run each model up a ladder where the task gets harder and the context gets longer at the same time. Its score is the hardest step it still gets right about 90% of the time. Every model is graded against how it does on the easy steps, so the number says something about the model and not about how hard we made the test.
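As a rough illustration of that scoring idea, here is a minimal sketch. It assumes the grading against easy steps is a simple ratio: each step's raw pass rate divided by the model's mean pass rate on the easy steps. That ratio, the function names, and the numbers are all hypothetical. They are not Project Peak's actual pipeline or data.

```python
from statistics import mean


def normalized_rates(pass_rates, easy_steps):
    """Divide each step's raw pass rate by the mean pass rate on the easy steps.

    Assumption: a plain ratio capped at 1.0 stands in for the real grading.
    """
    baseline = mean(pass_rates[step] for step in easy_steps)
    if baseline <= 0:
        raise ValueError("no baseline: the model fails the easy steps")
    return {step: min(rate / baseline, 1.0) for step, rate in pass_rates.items()}


def hardest_step_held(pass_rates, easy_steps, threshold=0.9):
    """Return the last step, in ladder order, before the first step under threshold."""
    held = None
    for step, rate in normalized_rates(pass_rates, easy_steps).items():
        if rate < threshold:
            break  # everything past the first miss is above the ceiling
        held = step
    return held


# Made-up raw pass rates, listed easy to hard. Not Project Peak data.
example = {"H10": 0.96, "H14": 0.94, "H18": 0.90, "H24": 0.88, "H32": 0.41}
print(hardest_step_held(example, easy_steps=["H10", "H14"]))
```

In this toy input the raw rate at H24 is below 0.90, but relative to the model's own easy-step baseline of 0.95 it still clears the line, which is the point of grading each model against itself.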

gpt-5: 73.2 / 100 (CI 63.4 to 73.8)
Hardest step it holds: H24 · 200K


Leaderboard

Model rankings

Each model is scored by the hardest step it holds on the ladder. The bars show how often it passed at every step we tested, with the easy ones on the left and the hard ones on the right. The two dashed lines mark 90% and 80% reliability. A higher score is better, and a slower drop past the top step is better still.
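To make the reading of the bars concrete, the sketch below takes one model's pass rate at each step and reports the hardest step held at each dashed line, plus how much the pass rate drops at the next step. The helper names and the sample rates are hypothetical and chosen only for illustration.

```python
def read_bars(rates, lines=(0.9, 0.8)):
    """For each reliability line, find the hardest step held before the bars dip under it."""
    summary = {}
    for line in lines:
        held = None
        for step, rate in rates.items():  # steps are listed easy to hard
            if rate < line:
                break
            held = step
        summary[line] = held
    return summary


def drop_past(rates, step):
    """Pass rate lost between `step` and the next tested step, or None at the end."""
    steps = list(rates)
    i = steps.index(step)
    if i + 1 >= len(steps):
        return None
    return rates[step] - rates[steps[i + 1]]


# Made-up pass rates per step, easy to hard. Not Project Peak data.
bars = {"H10": 1.0, "H14": 0.95, "H18": 0.92, "H24": 0.85, "H32": 0.30}
held = read_bars(bars)
print(held)
print(drop_past(bars, held[0.9]))
```

A small drop past the top step corresponds to a gradual falloff, while a large one corresponds to a cliff.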

gpt-5

ceiling found

73.2 / 100 (95% CI 63.4 to 73.8)

Steps charted: H10 · 32K, H14 · 64K, H18 · 128K, H24 · 200K, H32 · 350K, H40 · 524K, H52 · 786K, H64 · 1049K

Sustains: H24 · 200K
Breaks at: H32 · 350K
Decline begins: 73.2
Falloff: sharp cliff

gpt-5-mini

ceiling found

66 / 100 (95% CI 63.4 to 67.1)

Steps charted: H10 · 32K, H14 · 64K, H18 · 128K, H24 · 200K, H32 · 350K, H40 · 524K, H52 · 786K, H64 · 1049K

Sustains: H18 · 128K
Breaks at: H24 · 200K
Decline begins: 66
Falloff: sharp cliff

gpt-5-codex

ceiling found

57.2 / 100 (95% CI 56.1 to 58.2)

Steps charted: 16K, H10 · 32K, H14 · 64K, H32 · 350K, H40 · 524K, H52 · 786K, H64 · 1049K

Sustains: H14 · 64K
Breaks at: H32 · 350K
Decline begins: 57.2
Falloff: moderate

gpt-5-nano

ceiling found

33.6 / 100 (95% CI 29.6 to 36.6)

Steps charted: 16K, H10 · 32K, H40 · 524K, H52 · 786K, H64 · 1049K

Sustains: H6 · 8K
Breaks at: H8 · 16K
Decline begins: 33.6
Falloff: moderate