The latest race among advanced artificial intelligence (AI) systems is increasingly being fought through benchmark scores, with Japan’s Sakana AI and China’s Z.ai (formerly Zhipu AI) reporting stronger performance than Anthropic’s Claude Mythos on select evaluations.
However, the comparison is not entirely straightforward because the systems being measured are designed for different purposes.
Different models, different strengths
Anthropic offers different Claude models for different use cases, including Opus for complex reasoning and coding tasks, Sonnet for general-purpose workloads, and Haiku for faster and lower-cost applications. Mythos sits at the top end of that stack and is positioned as Anthropic’s most capable model for software engineering, reasoning, and agentic tasks.
Z.ai’s GLM-5.2 occupies a different position in the market. Rather than focusing primarily on raw benchmark performance, Z.ai has marketed the model around long-horizon task completion, autonomous coding workflows and agentic execution.
Why the distinction matters
It matters because many of the benchmark results now being cited compare fundamentally different approaches to AI development. While Anthropic is evaluating the capabilities of a frontier foundation model, Sakana AI is testing an orchestration system built on top of multiple models, and Z.ai is focusing on agentic systems designed for longer-duration workflows.
As a result, the benchmark leaders vary depending on what is being measured.
Where the models differ
The benchmark figures cited by Anthropic, Sakana AI and Z.ai are not entirely like-for-like: the systems being evaluated are built differently and, in some cases, the companies report results for different models within their product families.
| Benchmark | What it measures | Claude Mythos / Opus | Fugu / Fugu Ultra | GLM-5.2 |
| --- | --- | --- | --- | --- |
| SWE-Bench Pro | Real-world software engineering | 80.3% (Mythos) | 73.7% | 62.1% |
| Terminal Bench 2.1 | Agentic coding via terminal use | 80.4% (Mythos Preview) | 82.1% | 82.7% |
| GPQA Diamond | Graduate-level science reasoning | 94.6 (Mythos Preview) | 95.5 | 91.2 |
| CharXiv Reasoning | Scientific charts and figures | 86.1 (Mythos Preview) | 86.6 | Not reported |
| Humanity’s Last Exam* | Multidisciplinary reasoning | 64.7% (with tools) | 50.0% | 54.7% (with tools) |
| FrontierSWE | Long-horizon coding tasks (20 hrs) | 75.1% (Opus 4.8) | Not reported | 74.4% |
| PostTrainBench | Long-horizon agentic tasks | 37.2% (Opus 4.8) | Not reported | 34.3% |
| MCP-Atlas | Multi-step agent workflows | 77.8% (Opus 4.8) | Not reported | 76.8% |
| Tool-Decathlon | Tool use and workflows | 59.9% (Opus 4.8) | Not reported | 48.2% |
| ExploitBench | Cybersecurity and vulnerability tasks | 78.0% (Mythos) | Not reported | Not reported |

*Scores for Humanity’s Last Exam are reported under different evaluation settings and may not be directly comparable. Anthropic and Z.ai report tool-enabled scores, while Sakana AI reports a text-only score.
What the comparison shows
Anthropic’s models continue to lead several software engineering, cybersecurity and agentic workflow evaluations. The company reported an 80.3 per cent score for Claude Mythos on SWE-Bench Pro, a benchmark that tests whether AI systems can resolve real-world software issues in code repositories. Anthropic’s Opus 4.8 also leads benchmarks such as FrontierSWE, MCP-Atlas and Tool-Decathlon, according to the figures disclosed by the respective companies.
Sakana AI reported stronger performance on some scientific reasoning and agentic coding benchmarks. According to the company’s technical report, Fugu Ultra scored 95.5 on GPQA Diamond, a benchmark that evaluates graduate-level scientific reasoning, compared with 94.6 for Mythos Preview. Fugu Ultra also achieved 86.6 on CharXiv Reasoning, ahead of Mythos Preview’s 86.1. On Terminal Bench 2.1, which evaluates how effectively models can interact with computing environments through terminal commands, Fugu Ultra scored 82.1 compared with 80.4 for Mythos Preview.
Z.ai’s GLM-5.2 also reported stronger performance than Mythos Preview on Terminal Bench 2.1, scoring 82.7. The company further reported competitive results on long-horizon coding and agentic workflow benchmarks, including 74.4 on FrontierSWE and 76.8 on MCP-Atlas.
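For readers who want to check the per-benchmark leaders themselves, the company-reported figures can be tallied with a short script. The following is a minimal sketch, not an official scoring method: the names `SCORES`, `leaders` and `min_reported` are illustrative, and the "Anthropic" label merges Mythos, Mythos Preview and Opus 4.8 results in the same way the table does. Humanity’s Last Exam is left out because its scores were reported under different evaluation settings, and benchmarks with only one reported score are skipped by assumption.

```python
from collections import Counter

# Scores as reported by the respective companies; None means "not reported".
# Humanity's Last Exam is omitted because the settings differ across vendors.
SCORES = {
    "SWE-Bench Pro":      {"Anthropic": 80.3, "Fugu": 73.7, "GLM-5.2": 62.1},
    "Terminal Bench 2.1": {"Anthropic": 80.4, "Fugu": 82.1, "GLM-5.2": 82.7},
    "GPQA Diamond":       {"Anthropic": 94.6, "Fugu": 95.5, "GLM-5.2": 91.2},
    "CharXiv Reasoning":  {"Anthropic": 86.1, "Fugu": 86.6, "GLM-5.2": None},
    "FrontierSWE":        {"Anthropic": 75.1, "Fugu": None, "GLM-5.2": 74.4},
    "PostTrainBench":     {"Anthropic": 37.2, "Fugu": None, "GLM-5.2": 34.3},
    "MCP-Atlas":          {"Anthropic": 77.8, "Fugu": None, "GLM-5.2": 76.8},
    "Tool-Decathlon":     {"Anthropic": 59.9, "Fugu": None, "GLM-5.2": 48.2},
    "ExploitBench":       {"Anthropic": 78.0, "Fugu": None, "GLM-5.2": None},
}


def leaders(scores, min_reported=2):
    """Return {benchmark: top-scoring system}, skipping benchmarks where
    fewer than `min_reported` systems have a reported score."""
    result = {}
    for benchmark, row in scores.items():
        reported = {name: value for name, value in row.items() if value is not None}
        if len(reported) < min_reported:
            continue  # a single reported score is not a comparison
        result[benchmark] = max(reported, key=reported.get)
    return result


if __name__ == "__main__":
    by_benchmark = leaders(SCORES)
    for benchmark, system in by_benchmark.items():
        print(f"{benchmark}: {system}")
    # Count how many benchmarks each system leads.
    for system, wins in Counter(by_benchmark.values()).most_common():
        print(f"{system} leads {wins} benchmark(s)")
```

Under these assumptions, the Anthropic column leads five of the eight comparable benchmarks, Fugu leads two and GLM-5.2 leads one. The counts shift if the inclusion rules change, which is itself a reminder that the ranking depends on what is being measured.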
Why benchmark wins do not always translate into overall leadership
The results illustrate why multiple companies can simultaneously claim state-of-the-art performance.
A system that excels at software engineering may not necessarily lead in scientific reasoning, while a model optimised for long-duration autonomous execution may perform differently on knowledge or cybersecurity evaluations.
Anthropic’s models continue to hold advantages across software engineering, cybersecurity and several workflow-oriented benchmarks. Fugu has reported stronger performance on selected scientific reasoning and agentic coding tests, while GLM-5.2 has reported competitive or leading results on some agentic coding and long-horizon software engineering evaluations.
The results point to an increasingly fragmented AI landscape, where different systems are optimised for different tasks. Rather than producing a single dominant model, the latest generation of AI products is creating specialists in areas such as software engineering, scientific reasoning, cybersecurity and long-horizon agentic execution.