Sam Altman has posted leaked internal benchmarks for GPT-5.2 “Thinking”, and quite frankly, the numbers are ridiculous. We aren’t talking about incremental gains here.
For reference:
- AIME 2025: 100.0%. It solved the whole thing. AIME is a major competition-math benchmark, and a perfect score means competition math is effectively “completed” for this model.
- ARC-AGI-2: This is the big one for the AGI purists. It jumped from 17.6% (GPT-5.1) to 52.9%, a massive leap in abstract reasoning and generalization, historically the Achilles’ heel of LLMs.
- GDPval (Knowledge Work): This is the metric that matters for the economy. It flew from 38.8% to 70.9%. The raw deltas are worked out just below.
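For a quick sense of the magnitude, here’s a minimal Python sketch that works out the jumps from the two before/after figures quoted in the list (it uses nothing beyond those numbers):

```python
# GPT-5.1 -> GPT-5.2 "Thinking" scores (percent), as quoted above
scores = {
    "ARC-AGI-2": (17.6, 52.9),
    "GDPval": (38.8, 70.9),
}

# Print the absolute gain in points and the relative multiplier
for name, (old, new) in scores.items():
    print(f"{name}: +{new - old:.1f} pts absolute, {new / old:.1f}x relative")

# Output:
# ARC-AGI-2: +35.3 pts absolute, 3.0x relative
# GDPval: +32.1 pts absolute, 1.8x relative
```

In other words, ARC-AGI-2 roughly tripled and GDPval nearly doubled in a single generation, which is why the “not incremental” framing holds up.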
It’s also worth noting what this says about the two tracks: this is a model run at maximum reasoning effort, so the gains show that scaling and reasoning are both still advancing. Lately it looked like OpenAI got caught with its pants down because Gemini scaled up and it worked, but this shows reasoning doing things that looked impossible.
For users, the thinking models aren’t that popular: they’re too slow for the everyday tasks where people would otherwise reach for Google. But for innovation, this is huge. What the dual releases show is that both tracks are still working, and ultimately there will be a ‘best of both’ model that unlocks something beyond this.
This is also big for the economy. GDPval tests well-specified knowledge-work tasks spanning 44 occupations.
The release is rolling out as we speak, and we’ll see whether the real-world use cases match the numbers. What we aren’t seeing yet is what the lesser models can do.