News

Vending-Bench: A New Benchmark for Testing Long-Term Coherence in AI Agents

April 20, 2025
Vending-Bench, Autonomous Agents, Large Language Models, Long-Term Coherence, AI Benchmark, Simulated Environment

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

Vending-Bench is a simulated environment designed to test the long-term coherence of autonomous agents, particularly those based on Large Language Models (LLMs). The benchmark focuses on a straightforward yet challenging scenario: operating a vending machine over extended periods. Agents are required to perform tasks such as balancing inventories, placing orders, setting prices, and handling daily fees. These tasks, while simple individually, collectively stress an LLM's ability to sustain coherent decision-making over long horizons (>20M tokens per run).
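
To make the setup concrete, the sketch below shows what one simulated day of such an environment could look like. It is an illustrative reconstruction, not the benchmark's actual harness: the names (VendingMachine, step_day, run_episode), the starting cash, the DAILY_FEE value, and the toy demand model are all assumptions.

```python
import random

DAILY_FEE = 2.00  # assumed fixed operating fee charged each simulated day


class VendingMachine:
    """Toy state for the vending business: cash, stock, prices, pending orders."""

    def __init__(self, starting_cash=500.0):
        self.cash = starting_cash
        self.inventory = {}       # item -> units currently stocked
        self.prices = {}          # item -> listed price
        self.pending_orders = []  # (item, quantity, unit_cost) awaiting delivery


def simulate_demand(prices, rng):
    """Hypothetical demand model: lower prices attract more buyers."""
    return {item: max(0, int(rng.gauss(10 - 2 * price, 2)))
            for item, price in prices.items()}


def step_day(machine, rng):
    """Advance one day: deliveries arrive, the daily fee is charged, sales occur."""
    for item, qty, unit_cost in machine.pending_orders:
        machine.inventory[item] = machine.inventory.get(item, 0) + qty
        machine.cash -= qty * unit_cost
    machine.pending_orders.clear()
    machine.cash -= DAILY_FEE
    for item, wanted in simulate_demand(machine.prices, rng).items():
        sold = min(wanted, machine.inventory.get(item, 0))
        machine.inventory[item] = machine.inventory.get(item, 0) - sold
        machine.cash += sold * machine.prices[item]


def run_episode(agent_act, days=365, seed=0):
    """One full run: each day the agent observes state and may reprice or order."""
    rng = random.Random(seed)
    machine = VendingMachine()
    for day in range(days):
        agent_act(day, machine)  # in the real benchmark, an LLM-backed policy acts here
        step_day(machine, rng)
    return machine.cash  # final cash as a crude end-of-run score
```

In the actual benchmark this daily loop is driven by an LLM agent over very long horizons, which is what makes small bookkeeping mistakes compound.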

Key Findings

  • High Variance in Performance: Experiments reveal significant run-to-run variability across different LLMs. Models like Claude 3.5 Sonnet and o3-mini generally manage the vending machine well and turn a profit, but all models exhibit runs that derail due to misinterpretations, forgotten orders, or "meltdown" loops (see the summary sketch after this list).
  • Memory Limits Not the Primary Cause: Breakdowns do not correlate with the point at which the model's context window becomes full, suggesting that memory limits are not the primary cause of failures.
  • Capital Acquisition: Vending-Bench also tests models' ability to acquire capital, a critical skill in many hypothetical dangerous AI scenarios.
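
Because the headline finding is run-to-run variance rather than average performance, a natural way to compare models is the spread of end-of-run net worth across repeated runs. The helper below is a minimal sketch of such a summary, assuming the run_episode scoring from the earlier sketch; the function name and report fields are illustrative, not from the paper.

```python
import statistics

def summarize_runs(final_net_worths):
    """Summarize repeated runs of one model by the spread of final net worth."""
    return {
        "mean": statistics.mean(final_net_worths),
        "stdev": statistics.stdev(final_net_worths) if len(final_net_worths) > 1 else 0.0,
        "best": max(final_net_worths),
        "worst": min(final_net_worths),  # a derailed run shows up as a deep low outlier
    }

# Hypothetical usage with the run_episode sketch above:
#   scores = [run_episode(my_agent, seed=s) for s in range(5)]
#   print(summarize_runs(scores))
```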

Purpose and Implications

The Vending-Bench benchmark aims to highlight the challenges LLMs face in maintaining long-term coherence and to prepare for the advent of stronger AI systems. By simulating a real-world business scenario, it offers a concrete view of where current autonomous agents fall short and where improvements are needed.

Sources

  • Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents (arXiv)
  • Vending-Bench: Testing long-term coherence in agents (Andon Labs)