A (Partial) Starburst Leaderboard

Will be updated with further LLMs and people

Chapin Lenthall-Cleary

Jun 18, 2025

Era 1

Era 2

Era 3

Era 4

----====ASI Threshold?====----

Era 5

Cole Era (6)

1 Human (very roughly 3σ above median)

Era 7

Era 8

Era 9

----====Crisis Threshold====----

Era 10

Era 11

Ryan Era (12)

1 Human (very roughly 1σ above median)

Era 13

----====AGI Threshold (If Above 50%)====----

Era 14

Estimated average adult American performance

o4-mini-high (3/10)

Gemini-2.5-Pro-6-5 (1/5)*

o3 (1/10)

Era 15

Gemini-2.5-Pro-3-25 (2/5)*

Gemini-2.5-Pro (full release) (1/10) [1]

----====Genuine Reasoning Threshold====----

Era 16

Grok 3 (Thinking) (5/5)

o3-mini-high (2/3)*

Claude Opus 4 Extended Thinking (2/5)*

Claude Sonnet 4 Extended Thinking (2/5)*

o1 (1/3)*

Claude Sonnet 3.7 Extended Thinking (1/5)*

Era 17

GPT-4*

GPT-4o*

Deepseek R1*

Era 18

GPT-3.5*

Era 19

Era 20

We expect a smart (by our sense of the term) person to solve Starburst in era 10, an average (adult American) person to solve it in era 14, and a dumb person to solve it in era 18. The smartest geniuses could probably solve it in era 4, and an arbitrary superintelligence could probably do better than them. If adapted to not require language, some animals could probably solve it in the last era. These estimates could easily be off by an era, but more than that would be quite surprising.

Models are listed by the earliest era in which they are capable of solving Starburst. For eras 16 and earlier, the fraction of attempts solved correctly at that era is also shown. Human and LLM scores in era 16, and especially in later eras, should not be treated as directly comparable, since LLMs may be able to gain a substantial advantage by throwing knowledge at the problem in those eras.
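
For readers who want to keep their own tally, here is a minimal sketch of the listing rule in Python. It assumes entries are ordered by earliest solving era first and, within an era, by fraction correct (highest first). That ordering is inferred from how the list above reads, not a stated rule, and the `Entry` type and `rank` function are hypothetical names used only for illustration. The sample values are copied from the leaderboard.

```python
from fractions import Fraction
from typing import NamedTuple


class Entry(NamedTuple):
    """One leaderboard row (hypothetical structure, for illustration only)."""
    name: str
    era: int        # earliest era in which the model solved Starburst
    correct: int    # correct solutions at that era
    attempts: int   # total attempts at that era

    @property
    def rate(self) -> Fraction:
        # Exact fraction, so 2/5 and 4/10 compare as equal.
        return Fraction(self.correct, self.attempts)


def rank(entries: list[Entry]) -> list[Entry]:
    # Assumed ordering: lower era number first, then higher success rate.
    return sorted(entries, key=lambda e: (e.era, -e.rate))


entries = [
    Entry("o3", 14, 1, 10),
    Entry("Grok 3 (Thinking)", 16, 5, 5),
    Entry("o4-mini-high", 14, 3, 10),
    Entry("Gemini-2.5-Pro-3-25", 15, 2, 5),
]

ranked = rank(entries)
for e in ranked:
    print(f"Era {e.era}: {e.name} ({e.correct}/{e.attempts})")
```

Using exact fractions rather than floats avoids spurious ordering differences between scores taken over different numbers of attempts.
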
Note on calling it an AGI threshold: a model that reaches this threshold has general intelligence roughly on par with or slightly above the average adult American. It might not (and, for a time, likely will not) be able to do everything the average human can do on a computer, due to non-intelligence limitations such as time horizons, tool use, or perceptual abilities.
More information about Starburst: https://pennheretic.substack.com/p/you-can-do-fictional-theoretical

Models marked with an asterisk were tested on an old version of the prompt with a subtle ambiguity. This had little to no effect on results, but is noted for completeness.

[1] According to Google’s documentation, the full release of Gemini-2.5-Pro is the same model as the 6-5 snapshot, but it performs markedly worse. This is likely because the full release usually expends roughly half the thinking tokens of the snapshot. Even setting a thinking budget in AI Studio doesn’t fix this issue.

The Pennsylvania Heretic
