Reasoning Will Fall
GenAI just levelled up again! OpenAI’s o3 model has set new highs on several significant benchmarks, and that's a game-changer for all of us.
🔵 Here’s why it matters:
If AI can reason, code, and excel in maths and science, it’s only a matter of time before it starts reshaping tasks critical to nearly every business. Let’s dive into how o3 performed on key benchmarks:
🔵 ARC: Abstract Reasoning
The o3 model smashed the ARC-AGI benchmark—a tough visual reasoning test unbeaten since 2019. It scored:
- 75.7% in low-compute scenarios
- 87.5% in high-compute settings, on par with human performance (~85%).
This leap demonstrates o3's capacity for abstract reasoning, previously out of reach for AI. This is big news.
🔵 PhD-Level Science
On the GPQA Diamond benchmark, o3 achieved 87.7%, showcasing expert-level problem-solving in complex scientific domains.
🔵 Code
OpenAI’s o3 model excelled in competitive programming, achieving an Elo rating of 2,727 on Codeforces—ranking among the very best humans in the field.
🔵 Maths
The model delivered standout performance:
- 96.7% success on AIME, a challenging maths test.
- 25.2% of problems solved on Epoch AI's FrontierMath benchmark, far surpassing the previous best of 2%.
I’m not sharing this to promote OpenAI (other players will catch up soon) but to highlight how far AI has progressed in such a short time. We’re at a tipping point where soon:
💡 If a task can be benchmarked, then AI can excel at it, given enough data and compute 💡
🔵 What’s Driving This?
The secret? Scaling test-time compute. OpenAI reportedly spent $300,000 on GPU cycles to achieve its record-breaking ARC score. Costs will likely drop quickly, but the lesson is clear: spending more compute at inference time, rather than only during training, is unlocking new reasoning capabilities.
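To make "scaling test-time compute" a little more concrete, here's a toy sketch of one generic way extra inference compute can buy accuracy: sample many candidate answers and take a majority vote. This is purely my illustration and not a description of how o3 works. The names `noisy_solver`, `majority_vote` and `accuracy`, and the 40% per-sample success rate, are all made-up assumptions.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most common answer among the sampled candidates."""
    return Counter(answers).most_common(1)[0][0]


def noisy_solver(correct_answer, p_correct, rng):
    """Hypothetical stand-in for one model sample (assumption, not a real API).

    Returns the right answer with probability p_correct, otherwise one of
    several wrong answers chosen at random.
    """
    if rng.random() < p_correct:
        return correct_answer
    return rng.choice([41, 43, 44, 45])  # scattered wrong answers


def accuracy(n_samples, p_correct=0.4, n_tasks=2000, seed=0):
    """Fraction of toy tasks solved when voting over n_samples attempts each."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tasks):
        answers = [noisy_solver(42, p_correct, rng) for _ in range(n_samples)]
        hits += majority_vote(answers) == 42
    return hits / n_tasks


if __name__ == "__main__":
    # More samples per task means more compute spent at inference time.
    for n in (1, 5, 25):
        print(f"samples per task: {n:>2}  accuracy: {accuracy(n):.2f}")
```

In this toy setup, each individual attempt is wrong more often than it is right, yet voting over many attempts gets most tasks right because the wrong answers disagree with each other. The catch is the one the $300,000 figure hints at: every extra sample costs more compute.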
🔵 What This Means for Organisations
A period of dramatic change is on the horizon. Before being dazzled by benchmark-shattering models, ask yourself: is your data ready for AI?
💡 AI can only reason about your business if your data is in a reasonable state 💡
The window to capitalise on these powerful—though not yet super-powerful—AI models is narrowing. Success depends on organising and connecting all your data assets into a single, integrated framework—a kind of organisation-specific benchmark, if you will.
I used to think we had ten years to do this, but I’m now rethinking that. We may have less.