Someone, at some point, perhaps as a joke, decided to call it “Humanity’s Last Exam”, or HLE. By the time it was published in Nature in January, its designers had already announced a replacement. The replacement is updated continuously. It has to be.

It is a benchmark of 2,500 questions assembled by nearly 1,000 experts from 500 institutions across 50 countries, introduced by researchers at the Center for AI Safety and Scale AI. At launch, the best AI models scored under 10 per cent. The live leader board now shows 38.3 per cent. The trajectory, more than the absolute score, is the point.

The benchmark was designed in response to a failure in existing tools for measuring AI capability. Frontier models had already exceeded 90 per cent accuracy on MMLU (massive multitask language understanding), once considered a serious challenge. When all the best systems can clear it, the measuring instrument does not tell you much about the differences between them, or where they may be the following year.

HLE’s design logic was deliberately adversarial. Questions were submitted by experts across more than a hundred disciplines. Before entering the dataset, each had to defeat all current frontier models and pass two rounds of expert review. The result was a set of questions that, by construction, no existing AI could reliably answer.
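
To make that filtering logic concrete, here is a minimal sketch, in Python, of how such a gate might work. The function names, signatures and review structure are hypothetical illustrations, not the actual HLE submission pipeline.

```python
# Illustrative sketch of the adversarial filter described above; the names and
# structure are hypothetical, not the real HLE submission pipeline.

from typing import Callable, Iterable

def passes_adversarial_filter(
    question: str,
    correct_answer: str,
    frontier_models: Iterable[Callable[[str], str]],
    reviewers: Iterable[Callable[[str, str], bool]],
) -> bool:
    """Keep a candidate question only if every frontier model fails it
    and it survives each round of expert review."""
    # Stage 1: the question must defeat all current frontier models.
    for model in frontier_models:
        if model(question).strip().lower() == correct_answer.strip().lower():
            return False  # any model answering correctly disqualifies the question
    # Stage 2: every review round must approve the question.
    return all(review(question, correct_answer) for review in reviewers)
```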

The evaluation results confirmed the difficulty. At launch, GPT-4o scored 2.7 per cent, Claude 3.5 Sonnet 4.1 per cent, and OpenAI’s o1 8 per cent. DeepSeek-R1 reached 8.5 per cent. These numbers were low by design. The speed of subsequent progress was not fully anticipated: GPT-5 scored 25.3 per cent and Gemini 2.5 Pro reached 21.6 per cent. Scores have continued to climb; the live leader board now shows Gemini 3 Pro at 38.3 per cent.

Calibration errors across models ranged from 50 per cent to 89 per cent. Calibration measures whether a model’s stated confidence matches its accuracy. On HLE, models routinely expressed high confidence while being wrong. This is not a quirk of one system. It holds across architectures, suggesting a structural feature of current AI design.
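
To make those calibration figures concrete, here is a minimal sketch, in Python, of expected calibration error, one common way the gap between stated confidence and accuracy is quantified. The helper function and the toy data are illustrative only, not the HLE evaluation code.

```python
# Illustrative sketch (not the HLE evaluation code): expected calibration error,
# a common measure of the gap between stated confidence and actual accuracy.
# The confidences and outcomes below are invented for demonstration.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence, compare average confidence with
    empirical accuracy in each bin, and return the weighted gap."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# A model that claims roughly 90 per cent confidence but is right about
# 30 per cent of the time shows a large calibration error, the pattern
# described above.
stated = [0.9, 0.95, 0.85, 0.9, 0.92, 0.88, 0.9, 0.93, 0.87, 0.91]
was_correct = [True, False, False, True, False, False, True, False, False, False]
print(f"ECE: {expected_calibration_error(stated, was_correct):.2f}")  # roughly 0.6
```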

Accuracy improved with more reasoning compute, but only to a point. Beyond roughly 16,000 output tokens, performance declined.

Dynamic testing

The paper reports an expert disagreement rate of 15.4 per cent, rising to 18 per cent in biology, chemistry and health. Nearly one in six questions could not be answered consistently even by specialists. The benchmark is harder than any existing AI can reliably handle, but it is also beyond what any single human expert can reliably answer.

The authors draw one boundary clearly: high performance on HLE would not constitute evidence of artificial general intelligence. It would demonstrate expert-level performance on closed-ended academic questions. That distinction is often lost in public commentary.

There is, however, a deeper problem. Every instrument built to measure AI capability has so far become a ceiling that models eventually reach. MMLU took years to saturate; HLE is showing pressure within months. The designers acknowledge this by announcing HLE-Rolling, a dynamically updated version intended to stay ahead of the models.

Once a benchmark is published, it becomes a target for developers and for the optimisation logic by which models are compared and sold. The instrument and the thing it measures cease to be independent. No static benchmark can escape this. The inability to build a stable yardstick suggests that capability is moving faster than the ability to define what is being measured.

Business decisions

For businesses, this has three consequences. First, investment and procurement decisions across sectors are being made on capability claims built, directly or indirectly, on benchmark performance. If the benchmarks are structurally unstable, so are those claims.

Second, the calibration finding makes this worse. A model that cannot signal its own uncertainty is unsuitable for deployment in contexts where errors compound, including credit assessment, medical triage and document-intensive knowledge work. Confident wrongness is operationally worse than uncertain wrongness because it removes the incentive for a human check.

Third, the pace of improvement means that any organisation’s current understanding of AI capability has a short shelf-life. Planning and regulatory assumptions require revision cycles that most institutions are not designed to support.

Scores are rising fast enough that it may soon not be possible to decide, in any meaningful way, which model is leading. At the margins of a benchmark this hard, differences between the best systems will fall within statistical noise. The yardstick will not so much have been beaten as dissolved.

Published on March 23, 2026