GPT-5.5 Beats Claude Fable 5 on Brutal New Agents' Last Exam Benchmark

OpenAI's GPT‑5.5 beat Anthropic's brand‑new Claude Fable 5 on the Agents' Last Exam benchmark, a grueling new test from UC Berkeley that measures whether AI can execute real, economically valuable professional workflows — and both models still fail most of the time.

A Surprise Upset on AI's Toughest New Test

In a result that surprised industry observers, OpenAI's GPT‑5.5 — released in April — has beaten Anthropic's brand‑new Claude Fable 5 on the Agents' Last Exam (ALE), a grueling new benchmark launched by UC Berkeley's Center for Responsible, Decentralized Intelligence. GPT‑5.5, operating through the Codex agent framework, scored a 24.0% pass rate, edging out Fable 5's 22.0%, VentureBeat reported.

The benchmark, developed with an advisory committee of over 300 domain experts, launched with 1,490 task instances across 55 non‑physical industry sub‑domains and is scaling toward 5,000 tasks. It is explicitly designed to measure whether AI can execute economically valuable, long‑horizon professional workflows — not just pass narrow coding challenges.

Both scores highlight how far AI still has to go. The most advanced models in the world are failing roughly three‑quarters of these real‑world professional tasks. But the ranking matters: GPT‑5.5, a two‑month‑old model, outperformed Anthropic's flagship release on a test specifically built to close the gap between benchmark hype and real labor impact.

How ALE Is Different From Every Other Benchmark

ALE represents a fundamental departure from traditional AI evaluation. Rather than testing models on isolated coding puzzles or static question‑answering, it maps capability across five functional layers: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate). An agent must use its "Eyes" and "Hands" to navigate Linux or Windows virtual machines, interleaving shell scripting with point‑and‑click operations inside professional software tools, according to VentureBeat.

Crucially, ALE rejects the "LLM‑as‑a‑judge" grading paradigm for 93.2% of its workflows, instead using deterministic, code‑based evaluation that compares the agent's actual output artifacts against expert ground‑truth references. This neutralizes the grade inflation and cheating vulnerabilities that have plagued earlier benchmarks like SWE‑Bench Pro, where some models were caught reading hidden answer keys rather than solving problems.
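To make the contrast with LLM-as-a-judge grading concrete, here is a minimal sketch of what deterministic, code-based evaluation can look like. It assumes a task whose output artifact is a JSON file of named fields; the function name, the field names, and the tolerance are hypothetical and are not taken from ALE's actual harness, which the reporting does not describe in detail.

```python
import json
import math


def grade_artifact(agent_json: str, reference_json: str, rel_tol: float = 1e-6) -> bool:
    """Pass only if every field in the expert reference is matched by the agent's artifact.

    Hypothetical grader for illustration; ALE's real checkers may differ.
    """
    agent = json.loads(agent_json)
    reference = json.loads(reference_json)

    for key, expected in reference.items():
        if key not in agent:
            return False  # a missing field is a failure, not partial credit
        actual = agent[key]

        # Numeric fields are compared within a tolerance (bools are excluded,
        # because bool is a subclass of int in Python).
        if isinstance(expected, (int, float)) and not isinstance(expected, bool):
            if isinstance(actual, bool) or not isinstance(actual, (int, float)):
                return False
            if not math.isclose(actual, expected, rel_tol=rel_tol):
                return False
        # Everything else must match exactly.
        elif actual != expected:
            return False

    return True
```

The same inputs always produce the same verdict, which is the property that separates this style of check from a judge model that may be swayed by how an answer is phrased.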

Tasks are anchored in the U.S. federal occupational taxonomy and sourced from practicing professionals. Agents are asked to perform 3D model creation in Siemens NX, scene setup in Unreal Engine, neuroimaging analysis in FSLeyes, and visual effects compositing in Adobe After Effects — authentic workflows, not academic contrivances.

Where Each Model Wins

While GPT‑5.5 took the top ALE spot, the broader benchmark picture is more mixed. Claude Fable 5 dominates traditional coding benchmarks: it posts 80.3% on SWE‑Bench Pro compared to GPT‑5.5's 58.6%, and on Cognition's FrontierCode Diamond split it reaches 29.3%, roughly five times GPT‑5.5's 5.7%, according to Vellum. Michael Truell of Cursor called Fable 5 "the state of the art model on CursorBench."

On visual reasoning, Fable 5 also leads: 29.8% on GDP.pdf versus GPT‑5.5's 24.9%. A Stripe team reported that Fable 5 compressed months of engineering work into days on a 50‑million‑line Ruby codebase migration.

GPT‑5.5's strengths lie elsewhere. It leads on long‑context reliability, scoring 74.0% on MRCR v2 at 512K‑1M tokens, whereas GPT‑5.4 collapsed to 36.6% at 128K, according to DataCamp. It also benefits from lower pricing ($5/$30 per million tokens versus Fable 5's $10/$50) and freedom from the classifier rerouting that can silently downgrade Fable 5 queries.
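The price gap is easier to judge with a worked example. The sketch below assumes that the paired figures are input and output prices per million tokens, which is the usual convention but is not stated outright in the reporting, and it uses a hypothetical workload of 2 million input tokens and 500,000 output tokens.

```python
def run_cost(input_tokens: int, output_tokens: int,
             input_price: float, output_price: float) -> float:
    """Dollar cost of a workload, with prices quoted per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload; the token counts are illustrative only.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 500_000

# Prices as reported, read as (input, output) per million tokens (an assumption).
gpt_cost = run_cost(INPUT_TOKENS, OUTPUT_TOKENS, 5, 30)
fable_cost = run_cost(INPUT_TOKENS, OUTPUT_TOKENS, 10, 50)

print(f"GPT-5.5: ${gpt_cost:.2f}")    # GPT-5.5: $25.00
print(f"Fable 5: ${fable_cost:.2f}")  # Fable 5: $45.00
```

Under those assumptions the same workload costs $25 on GPT‑5.5 and $45 on Fable 5. The ratio shifts with the mix of input and output tokens, so a real comparison should use a team's own traffic profile.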

The Codex Factor

GPT‑5.5's ALE victory came specifically through OpenAI's Codex agent framework, which lets the model chain tool calls, manage state across multi‑step workflows, and recover from errors. This architectural difference may explain why GPT‑5.5 outperformed Fable 5 on ALE despite trailing on pure coding benchmarks: ALE tests orchestration and multimodal tool use, not just code generation.

"OpenAI's Codex CLI gives GPT‑5.5 the orchestration layer that turns a strong reasoning model into an effective agent," DataCamp noted. Anthropic's Claude Code offers similar capabilities, but Fable 5's ALE run did not benefit from the same level of agentic infrastructure.

The ALE result suggests that for builders, the choice between models increasingly depends on use case: pure coding favors Fable 5, while complex multi‑tool autonomous workflows may favor GPT‑5.5 with Codex. The optimal strategy for production systems may be to route tasks to the model that performs best on that specific task type rather than committing to a single provider.
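As a rough illustration of per-task routing, the sketch below picks whichever model has the higher reported score for a given task type. The scores are the SWE‑Bench Pro and ALE figures cited in this article; the task-type labels, the model identifier strings, and the fallback default are hypothetical and are not real API model names.

```python
# Reported benchmark scores, keyed by a hypothetical task-type label.
SCORES = {
    "coding": {"claude-fable-5": 80.3, "gpt-5.5": 58.6},            # SWE-Bench Pro
    "agentic_workflow": {"claude-fable-5": 22.0, "gpt-5.5": 24.0},  # ALE
}


def pick_model(task_type: str, scores: dict = SCORES, default: str = "gpt-5.5") -> str:
    """Return the best-scoring model for a task type, or the default if the type is unknown."""
    by_model = scores.get(task_type)
    if not by_model:
        return default
    # max() over the dict's keys, ranked by each model's score.
    return max(by_model, key=by_model.get)


print(pick_model("coding"))            # claude-fable-5
print(pick_model("agentic_workflow"))  # gpt-5.5
```

A production router would more likely rank models on a team's own evaluation set than on public leaderboards, and would also weigh cost and latency, but the lookup structure is the same.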

The Bigger Picture: Models Are Still Failing

For all the competitive positioning, ALE's most important finding is sobering: the best AI systems in existence are correctly completing only about one‑quarter of real professional tasks. The benchmark's name — Agents' Last Exam — is not a boast but a wager: build a test so hard that when AI finally passes it, that achievement genuinely signals the end of meaningful human‑level work evaluation.

The 24.0% pass rate means AI agents still cannot reliably execute most economically valuable knowledge work without human supervision. For builders deploying AI in production today, the implication is clear: autonomous agent workflows remain experimental. Human‑in‑the‑loop patterns are still necessary for any task where correctness matters.

The ALE leaderboard will continue updating as new models are tested. With Anthropic's Claude Mythos 5 (the unrestricted version) still only available to Project Glasswing partners, and OpenAI's GPT‑5.5 Pro variant yet to be tested, the rankings could shift quickly. But for now, builders evaluating AI models for complex agentic workflows have a new North Star metric and a humbling reality check on how far the technology still has to go.

Sources

  1. VentureBeat (venturebeat.com)
  2. Vellum (vellum.ai)
  3. DataCamp (datacamp.com)
