Era 1
Era 2
Era 3
Era 4
----====ASI Threshold?====----
Era 5
Cole Era (6)
1 human, very roughly 3σ above the median
Era 7
Era 8
Era 9
----====Crisis Threshold====----
Era 10
Era 11
Ryan Era (12)
1 human, very roughly 1σ above the median
Era 13
----====AGI Threshold (If Above 50%)====----
Era 14
Estimated average adult American performance
o4-mini-high (3/10)
Gemini-2.5-Pro-6-5 (1/5)*
o3 (1/10)
Era 15
Gemini-2.5-Pro-3-25 (2/5)*
Gemini-2.5-Pro (full release) (1/10)¹
----====Genuine Reasoning Threshold====----
Era 16
Grok 3 (Thinking) (5/5)
o3-mini-high (2/3)*
Claude Opus 4 Extended Thinking (2/5)*
Claude Sonnet 4 Extended Thinking (2/5)*
o1 (1/3)*
Claude Sonnet 3.7 Extended Thinking (1/5)*
Era 17
GPT-4*
GPT-4o*
DeepSeek R1*
Era 18
GPT-3.5*
Era 19
Era 20
We expect a smart (by our sense of the term) person to solve Starburst in era 10, an average (adult American) person to solve it in era 14, and a dumb person to solve it in era 18. The smartest geniuses could probably solve it in era 4, and an arbitrary superintelligence could probably do better than them. If the puzzle were adapted so that it did not require language, some animals could probably solve it in the last era. These estimates could easily be off by an era, but more than that would be quite surprising.
Models are listed by the earliest era in which they are capable of solving Starburst, along with the fraction of solutions they got correct at that era (shown for eras 16 and earlier). Human and LLM scores at era 16, and especially at later eras, should not be treated as directly comparable, since LLMs may gain a substantial advantage in those eras by throwing knowledge at the problem.
Note on calling it an AGI threshold: a model that reaches this threshold has general intelligence roughly on par with, or slightly above, the average adult American. It might not (and for a time likely will not) be able to do everything the average human can do on a computer, due to non-intelligence limitations such as time horizons, tool use, or perceptual abilities.
More information about Starburst: https://pennheretic.substack.com/p/you-can-do-fictional-theoretical
Models marked with an asterisk were tested on an old version of the prompt with a subtle ambiguity. This had little to no effect on results, but is noted for completeness.
¹ According to Google’s documentation, this is the same model as the 6-5 snapshot, but it performs markedly worse, likely because the full release usually expends roughly half as many thinking tokens as the snapshot. Even setting a thinking budget in AI Studio doesn’t fix this.
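For anyone who wants to check this behavior outside AI Studio, below is a minimal sketch of setting an explicit thinking budget through the API with the google-genai Python SDK. The prompt placeholder, model id, and budget value are illustrative, not the settings used for the results above.

    # Sketch only: request a fixed thinking budget for Gemini 2.5 Pro via the API.
    # The prompt text and budget value are placeholders, not the test harness.
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up the API key from the environment

    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents="<Starburst prompt goes here>",
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(
                thinking_budget=16384,  # placeholder cap on thinking tokens
            )
        ),
    )
    print(response.text)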