Reasoning Will Fall

GenAI just levelled up again! OpenAI’s o3 model has set new highs on several significant benchmarks, and that’s a game-changer for all of us.

🔵 Here’s why it matters:
If AI can reason, code, and excel in maths and science, it’s only a matter of time before it starts reshaping tasks critical to nearly every business. Let’s dive into how o3 performed on key benchmarks:

🔵 ARC: Abstract Reasoning
The o3 model smashed the ARC-AGI benchmark, a tough visual reasoning test that had gone unbeaten since its introduction in 2019. It scored:

- 75.7% in low-compute scenarios
- 87.5% in high-compute settings, on par with human performance (~85%).

This leap demonstrates o3's capacity for abstract reasoning, previously out of reach for AI. This is big news.

🔵 PhD-Level Science
On the GPQA Diamond benchmark, o3 achieved 87.7%, showcasing expert-level problem-solving in complex scientific domains.

🔵 Code
OpenAI’s o3 model excelled in competitive programming, achieving an Elo rating of 2,727 on Codeforces—ranking among the very best humans in the field.

🔵 Maths
The model delivered standout performance:

- 96.7% success on AIME, a challenging maths test.
- 25.2% solved on Epoch AI’s FrontierMath benchmark, where no previous model had scored above 2%.

I’m not sharing this to promote OpenAI (other players will catch up soon) but to highlight how far AI has progressed in such a short time. We’re at a tipping point where soon:

💡 If a task can be benchmarked, then AI can excel at it given enough data and compute 💡

🔵 What’s Driving This?
The secret? Scaling test-time compute. OpenAI reportedly spent $300,000 on GPU cycles to achieve its record-breaking ARC score. Costs will drop quickly, but the lesson is clear: spending more compute at inference time, not just during training, is unlocking new reasoning capabilities.
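
To make the idea of scaling test-time compute concrete, here is a minimal toy sketch in Python. It is not how o3 works (OpenAI has not published that level of detail); it simply assumes a hypothetical model that answers a question correctly 60% of the time, samples it several times independently, and takes the majority answer. The function name and the 60% figure are illustrative assumptions, and the model pessimistically treats all wrong answers as agreeing with each other.

```python
from math import comb


def majority_vote_accuracy(p_correct: float, n_samples: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Toy assumptions (not taken from any published o3 description):
    - each sample is correct independently with probability p_correct
    - all wrong answers coincide, so a wrong majority is always possible
    """
    if n_samples % 2 == 0:
        raise ValueError("use an odd number of samples to avoid ties")
    needed = n_samples // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )


if __name__ == "__main__":
    # More samples means more inference-time compute spent per question.
    for n in (1, 3, 9, 27, 81):
        print(f"{n:>3} samples -> {majority_vote_accuracy(0.6, n):.3f}")
```

Under these assumptions, a single sample is right 60% of the time, three samples with a majority vote reach 64.8%, and accuracy keeps climbing as more samples are drawn. The cost climbs too, since every extra sample is another full inference run, which is why headline scores like the ARC result can carry such large compute bills.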

🔵 What This Means for Organisations
A period of dramatic change is on the horizon. Before being dazzled by benchmark-shattering models, ask yourself: is your data ready for AI?

💡 AI can only reason about your business if your data is in a reasonable state 💡

The window to capitalise on these powerful—though not yet super-powerful—AI models is narrowing. Success depends on organising and connecting all your data assets into a single, integrated framework—a kind of organisation-specific benchmark, if you will.

I used to think we had ten years to do this, but I’m now rethinking that. We may have less.
