
A new analysis of leading artificial intelligence systems has found that high scores on standard model benchmarks do not reliably translate into strong performance in real-world interactions. The study, ‘How far can LLMs emulate human behavior? A strategic analysis via the buy-and-sell negotiation game’, by Mingyu Jeon, Jaeyoung Suh, Suwan Cho and Dohyeon Kim of Modulabs, was published on arXiv in November 2025.

It evaluates six major large language models in a structured buyer-seller negotiation and suggests that benchmark scores measure cognitive competence but fail to capture how models behave when incentives compete and outcomes depend on persuasion, timing and social cues.

The gap

AI progress is typically tracked through tests such as MMLU, HumanEval and GPQA, which evaluate knowledge recall, reasoning and coding ability. Enterprises often rely on these scores to determine readiness for operational deployment.

The Modulabs research suggests this view is incomplete. In the negotiation environment, GPT-4 Turbo, a strong performer across traditional benchmarks, recorded one of the weakest seller performances. Claude 3.5 Sonnet, which also performs well academically, achieved the strongest negotiation results and the broadest behavioural range. Gemini 1.5 Flash performed poorly across both domains.

The divergence highlights a structural limitation. Benchmarks measure what a model knows, not how it behaves when it has to manage uncertainty, balance incentives, or navigate conflicting goals.

Complex dynamics

The study uses a ten-turn negotiation with asymmetric information. The seller knows their production cost. The buyer knows their maximum willingness to pay. Both attempt to steer the price towards their preferred outcome.

Across 1,737 negotiations, buyers won 53 per cent of matches and sellers 41 per cent, with the remainder ending in draws. Buyers held a slight advantage due to anchoring effects and the distribution of acceptable prices.
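The mechanics of such a game can be sketched in a few lines. The simulation below is purely illustrative: the opening anchors, concession schedules and the midpoint rule for scoring a "win" are assumptions for the sketch, not the LLM policies or scoring protocol used in the paper. It shows how asymmetric information (the seller sees only the cost, the buyer only the limit) and anchoring interact over a ten-turn exchange.

```python
import random

def negotiate(seller_cost, buyer_limit, max_turns=10, rng=None):
    """Toy alternating-offer negotiation with asymmetric information.

    The seller knows only seller_cost; the buyer knows only buyer_limit.
    The concession schedules here are illustrative, not the paper's agents.
    """
    rng = rng or random.Random()
    ask = seller_cost * 2.0           # seller opens high
    bid = buyer_limit * 0.5           # buyer anchors low
    for _ in range(max_turns):
        if bid >= ask:                # offers cross: settle at the midpoint
            return "deal", (bid + ask) / 2
        # each side concedes a random fraction of its remaining room
        ask -= rng.uniform(0.05, 0.25) * (ask - seller_cost)
        bid += rng.uniform(0.05, 0.25) * (buyer_limit - bid)
    return "no_deal", None

def win_rates(n=1000, seed=0):
    """Run n random negotiations and tally buyer wins, seller wins, draws."""
    rng = random.Random(seed)
    buyer_wins = seller_wins = draws = 0
    for _ in range(n):
        cost = rng.uniform(40, 60)
        limit = rng.uniform(80, 120)
        outcome, price = negotiate(cost, limit, rng=rng)
        if outcome != "deal":
            draws += 1
            continue
        midpoint = (cost + limit) / 2  # even split of the surplus
        if price < midpoint:
            buyer_wins += 1            # buyer captured more surplus
        elif price > midpoint:
            seller_wins += 1
        else:
            draws += 1
    return buyer_wins / n, seller_wins / n, draws / n
```

Even with symmetric concession rules, the side that anchors further from the settlement zone tends to keep more of the surplus, which is the anchoring effect the study cites for the buyers' edge.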

The researchers introduced persona prompts to test how behavioural style affects outcomes. Models were instructed to adopt one of seven personas, including competitive, cooperative, altruistic, cunning and selfish. Persona had a significant influence. Competitive and cunning personas delivered the highest win rates for both buyers and sellers. Altruistic and cooperative personas often conceded value.
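A persona prompt of the kind described is simply an instruction layered onto the agent's role and private information. The wording below is an assumption for illustration; the paper's actual prompt text is not reproduced here, only the seven-persona idea it describes.

```python
# Illustrative persona instructions; the phrasing is an assumption,
# not the study's actual prompts.
PERSONAS = {
    "competitive": "Maximise your own payoff; concede as little as possible.",
    "cooperative": "Seek an agreement that works for both sides.",
    "altruistic": "Prioritise the other party's satisfaction over your own gain.",
    "cunning": "Use strategic framing to extract concessions.",
    "selfish": "Pursue your own interest with no regard for the counterpart.",
}

def build_system_prompt(role, persona, private_value):
    """Compose a negotiation system prompt for one agent.

    role is "buyer" or "seller"; private_value is the hidden cost or limit.
    """
    if role == "seller":
        info = f"Your production cost is {private_value}. Never sell below it."
    else:
        info = f"Your maximum willingness to pay is {private_value}. Never exceed it."
    return (
        f"You are the {role} in a ten-turn price negotiation. {info} "
        f"Persona: {PERSONAS[persona]}"
    )
```

The point of the design is that everything except the persona line is held constant, so any change in outcomes can be attributed to behavioural style rather than capability.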

The models also varied in how sharply their behaviour shifted across personas. Claude 3.5 Sonnet showed the greatest sensitivity to persona. GPT-4 Turbo showed less variation, regardless of prompt.

Enterprise deployments

The findings arrive as businesses begin to deploy agentic AI in customer service, procurement, finance and compliance.

In these settings, outcomes depend on multi-turn exchanges where tone, negotiation strategy and the handling of asymmetric information can influence commercial results as much as accuracy.

The study suggests that benchmark scores alone are inadequate predictors of how models will behave once embedded in operational environments. Two models with similar academic scores may deliver materially different outcomes in practice.

The researchers argue that organisations should incorporate behavioural evaluation alongside standard benchmarks. Negotiation tests, multi-agent simulations and social reasoning scenarios reveal tendencies that do not appear in isolated question-answering.

Industry analysts expect a shift towards scenario-based testing as companies seek to understand how models behave under pressure, how they interpret incentives, and how easily their conduct can be shaped or constrained.

Model governance

The Modulabs paper points to a shift that many organisations are yet to internalise.

As AI systems move from answering queries to taking actions inside workflows, governance is no longer limited to accuracy checks or model card disclosures. It becomes a question of behavioural reliability. Businesses will need tools to spot when a system is too forceful with a customer, too compliant in a supplier negotiation, or too erratic in exchanges that involve emotional or strategic cues. This introduces an additional layer of due diligence.

Behavioural audits

Enterprises will have to understand how models respond under pressure, how they negotiate trade-offs and whether their behaviour changes materially when a prompt, task or persona shifts.

In practice, this may mean introducing behavioural audits alongside the familiar privacy and security assessments.
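One minimal form such an audit could take, sketched under assumptions (the episode runner, the outcome metric and the 20 per cent threshold are all hypothetical choices, not anything the study prescribes), is to replay identical scenarios under each persona and flag models whose outcomes swing too widely:

```python
from statistics import mean

def persona_sensitivity(run_episode, personas, scenarios, threshold=0.2):
    """Hypothetical behavioural-audit check.

    run_episode(persona, scenario) returns a numeric outcome, e.g. the
    agreed price. The model is flagged if its mean outcome varies across
    personas by more than `threshold` relative to the overall baseline.
    """
    per_persona = {
        p: mean(run_episode(p, s) for s in scenarios) for p in personas
    }
    spread = max(per_persona.values()) - min(per_persona.values())
    baseline = mean(per_persona.values())
    flagged = baseline and spread / abs(baseline) > threshold
    return per_persona, bool(flagged)
```

A high sensitivity is not inherently bad, as the Claude 3.5 Sonnet result shows, but it is exactly the kind of property a deployment review would want measured rather than assumed.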

The study’s underlying message is that technical capability is no longer sufficient. Organisations will require models whose conduct can be shaped, monitored and verified throughout their operational life.


Published on December 1, 2025

