Project Peak
Project Peak model rankings

How well a model holds up as its context fills with code.

We run each model up a ladder where the task gets harder and the context gets longer at the same time. A model's score reflects the hardest step it still gets right about 90% of the time. Each model is graded relative to its own results on the easy steps, so the number says something about the model and not about how hard we made the test.
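The rule in the paragraph above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual scoring code: the function name `hardest_step_held`, the choice to divide each step's pass rate by the mean pass rate on the first two (easy) steps, and the pass rates in the example are all hypothetical.

```python
from statistics import mean


def hardest_step_held(pass_rates, easy_steps=2, threshold=0.90):
    """Return the label of the hardest step still held, or None.

    pass_rates: list of (label, rate) pairs ordered from easiest to hardest.
    ASSUMPTION: "graded against the easy steps" is modelled here as dividing
    each rate by the mean rate over the first `easy_steps` entries.
    """
    baseline = mean(rate for _, rate in pass_rates[:easy_steps])
    if baseline == 0:
        return None  # the model fails even the easy steps
    held = None
    for label, rate in pass_rates:
        if rate / baseline >= threshold:
            held = label  # still reliable at this step
        else:
            break  # first step below the threshold ends the climb
    return held


# Hypothetical pass rates, for illustration only.
example = [("H10", 1.00), ("H14", 0.96), ("H18", 0.94), ("H24", 0.92), ("H32", 0.40)]
print(hardest_step_held(example))
```

Stopping at the first failing step, rather than taking the highest passing step anywhere on the ladder, is a design choice of this sketch: it keeps a lucky pass on a later, harder step from raising the result.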

gpt-5 · score 73.2 /100 (CI 63.4 to 73.8) · hardest step it holds: H24 · 200K

Leaderboard

Model rankings

Each model is scored by the hardest step it holds on the ladder. The bars show how often it passed at every step we tested, with the easy ones on the left and the hard ones on the right. The two dashed lines mark 90% and 80% reliability. A higher score is better, and a slower drop past the top step is better still.
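As a rough sketch of how the "Sustains" and "Breaks at" entries in each card could be read off the bars, the snippet below walks the tested steps from easy to hard. The step labels and context sizes are the ones listed for gpt-5-codex, but the pass rates are invented for illustration, and the function names `band` and `sustain_and_break` are hypothetical.

```python
def band(rate):
    """Place a pass rate relative to the two dashed lines (90% and 80%)."""
    if rate >= 0.90:
        return "at or above 90%"
    if rate >= 0.80:
        return "between 80% and 90%"
    return "below 80%"


def sustain_and_break(steps, threshold=0.90):
    """Return (sustains, breaks_at) as 'H.. · ..K' strings, or None if absent.

    steps: list of (label, context, pass_rate) ordered easiest to hardest.
    ASSUMPTION: "Breaks at" is taken to be the first tested step whose
    pass rate falls below the threshold.
    """
    sustains, breaks_at = None, None
    for label, context, rate in steps:
        tag = f"{label} · {context}"
        if rate >= threshold:
            sustains = tag
        else:
            breaks_at = tag
            break
    return sustains, breaks_at


# Steps as listed for gpt-5-codex; the pass rates are made up.
codex_steps = [
    ("H6", "8K", 0.99), ("H8", "16K", 0.98), ("H10", "32K", 0.96),
    ("H14", "64K", 0.93), ("H32", "350K", 0.70), ("H40", "524K", 0.55),
]
print(sustain_and_break(codex_steps))
print([band(rate) for _, _, rate in codex_steps])
```

Because only tested steps are considered, the break point can sit several rungs above the sustained step, as in the gap between H14 and H32 here.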

#1 gpt-5 (ceiling found)
Score: 73.2 /100 (95% CI 63.4 to 73.8)
Steps tested: H10 · 32K, H14 · 64K, H18 · 128K, H24 · 200K, H32 · 350K, H40 · 524K, H52 · 786K, H64 · 1049K
Sustains: H24 · 200K
Breaks at: H32 · 350K
Decline begins: 73.2
Falloff: sharp cliff

#2 gpt-5-mini (ceiling found)
Score: 66 /100 (95% CI 63.4 to 67.1)
Steps tested: H10 · 32K, H14 · 64K, H18 · 128K, H24 · 200K, H32 · 350K, H40 · 524K, H52 · 786K, H64 · 1049K
Sustains: H18 · 128K
Breaks at: H24 · 200K
Decline begins: 66
Falloff: sharp cliff

#3 gpt-5-codex (ceiling found)
Score: 57.2 /100 (95% CI 56.1 to 58.2)
Steps tested: H6 · 8K, H8 · 16K, H10 · 32K, H14 · 64K, H32 · 350K, H40 · 524K, H52 · 786K, H64 · 1049K
Sustains: H14 · 64K
Breaks at: H32 · 350K
Decline begins: 57.2
Falloff: moderate

#4 gpt-5-nano (ceiling found)
Score: 33.6 /100 (95% CI 29.6 to 36.6)
Steps tested: H4 · 8K, H6 · 8K, H8 · 16K, H10 · 32K, H40 · 524K, H52 · 786K, H64 · 1049K
Sustains: H6 · 8K
Breaks at: H8 · 16K
Decline begins: 33.6
Falloff: moderate