If you wish to try Starburst, email me at chapinalc@gmail.com.
Era 1
Era 2
Era 3
Era 4
----====ASI Threshold?====----
Ethan Era (5)
1 human, very roughly 3.5σ above median
Cole Era (6)
1 human, very roughly 3σ above median
Era 7
Era 8
Era 9
----====Crisis Threshold====----
Era 10
Era 11
Ryan Era (12)
1 human, very roughly 1σ above median
Era 13
Grok 4 (2/10) [1]
----====(Weak) AGI Threshold (If Above 50%)====----
Era 14
Estimated average adult American performance
GPT-5 Thinking (3/10)
o4-mini-high (3/10)
o3-pro (1/5) [2]
Gemini-2.5-Pro-6-5 (1/5)*
o3 (1/10)
Era 15
Gemini-2.5-Pro-3-25 (2/5)*
Gemini-2.5-Pro (full release) (1/10) [3]
----====Genuine Reasoning Threshold====----
Era 16
Grok 3 (Thinking) (5/5)
o3-mini-high (2/3)*
Claude Opus 4 Extended Thinking (2/5)*
Claude Sonnet 4 Extended Thinking (2/5)*
o1 (1/3)*
Claude Sonnet 3.7 Extended Thinking (1/5)*
Era 17
GPT-4.1 (5/5)
GPT-4*
GPT-4o*
Deepseek R1*
Era 18
GPT-3.5*
Era 19
Era 20
We expect a smart (by our sense of the term) person to solve Starburst in era 10, an average (adult American) person to solve it in era 14, and a dumb person to solve it in era 18. The smartest geniuses could probably solve it in era 4, and an arbitrary superintelligence could probably do better than them. If adapted to not require language, some animals could probably solve it in the last era. These estimates could easily be off by an era, but more than that would be quite surprising.
Models are listed at the earliest era in which they can solve Starburst. For eras 16 and earlier, the fraction in parentheses is the fraction of solutions that were correct at that era. Human and LLM scores in era 16, and especially in later eras, should not be treated as directly comparable, since LLMs may gain a substantial advantage in those eras by throwing knowledge at the problem.
Note on calling it an AGI threshold: a model that reaches this point probably has general intelligence roughly on par with the average adult American. It might not (and, for a time, likely will not) be able to do everything the average human can do on a computer, due to non-intelligence limitations such as time horizons, tool use, hallucinations, or perceptual abilities. I’d call a model with every cognitive ability at or above that of the average adult American strong AGI, even if it’s only average-intelligence strong AGI.
More information about Starburst: https://pennheretic.substack.com/p/you-can-do-fictional-theoretical
Models marked with an asterisk were tested on an old version of the prompt with a subtle ambiguity. This had little to no effect on results, but is noted for completeness.
[1] At this era, Grok 4 returns a message like “Uh-oh, too much information for me to digest all at once. You know, sometimes less is more!” in roughly 20% of attempts. These are treated as errors and therefore excluded from the denominator.
[2] Unlike most other models on this list, o3-pro is mainly accessible only to users paying $200/month. This essentially gives it an unfair advantage over the $20/month models on the leaderboard.
[3] According to Google’s documentation, this is the same model as the 6-5 snapshot, but it performs markedly worse. This is likely because the full release usually spends roughly half as many thinking tokens as the snapshot. Even setting a thinking budget in AI Studio doesn’t fix this.
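
For readers who want to see how a leaderboard entry could be derived from raw attempts, here is a minimal sketch of the rules described in the notes above: a model is placed at the earliest era where it produces at least one correct solution, and attempts that end in an error message are dropped from the denominator. The function names, the outcome labels ("correct", "incorrect", "error"), and the sample data are all hypothetical and chosen for illustration; this is not the actual scoring script, and it assumes "earliest" means the lowest-numbered era.

```python
def tally(outcomes):
    """Count correct solutions among attempts, ignoring errored attempts.

    `outcomes` is a list of hypothetical labels: "correct", "incorrect",
    or "error". Errors are excluded from the denominator.
    """
    counted = [o for o in outcomes if o != "error"]
    correct = sum(1 for o in counted if o == "correct")
    return correct, len(counted)


def listing(results_by_era):
    """Return (era, "correct/total") for the earliest era with a solve.

    Assumes "earliest" means the lowest era number. Returns None if the
    model never solves the problem in any tested era.
    """
    for era in sorted(results_by_era):
        correct, total = tally(results_by_era[era])
        if correct > 0:
            return era, f"{correct}/{total}"
    return None


# Made-up attempts for an imaginary model, not real leaderboard data.
example = {
    12: ["incorrect"] * 5,
    13: ["correct"] * 2 + ["incorrect"] * 8 + ["error"] * 2,
    14: ["correct"] * 4 + ["incorrect"] * 1,
}

print(listing(example))  # (13, '2/10')
```

In this made-up case the model would be listed under era 13 as (2/10): the two errored attempts do not count, and the stronger result at era 14 is not shown because era 13 is the earliest era with a solve.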