News

Vending-Bench: A New Benchmark for Testing Long-Term Coherence in AI Agents

April 20, 2025
Vending-Bench, Autonomous Agents, Large Language Models, Long-Term Coherence, AI Benchmark, Simulated Environment

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

Vending-Bench is a simulated environment designed to test the long-term coherence of autonomous agents, particularly those based on Large Language Models (LLMs). The benchmark focuses on a straightforward yet challenging scenario: operating a vending machine over extended periods. Agents are required to perform tasks such as balancing inventories, placing orders, setting prices, and handling daily fees. These tasks, while simple individually, collectively stress an LLM's ability to sustain coherent decision-making over long horizons (>20M tokens per run).
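
To make the setup concrete, the sketch below shows what one simulated day of such an environment could look like. It is an illustrative reconstruction, not the benchmark's actual harness: the names (VendingMachine, step_day, run_episode), the starting cash, the DAILY_FEE value, and the toy demand model are all assumptions.

```python
import random

DAILY_FEE = 2.00  # assumed fixed operating fee charged each simulated day


class VendingMachine:
    """Toy state for the vending business: cash, stock, prices, pending orders."""

    def __init__(self, starting_cash=500.0):
        self.cash = starting_cash
        self.inventory = {}       # item -> units currently stocked
        self.prices = {}          # item -> listed price
        self.pending_orders = []  # (item, quantity, unit_cost) awaiting delivery


def simulate_demand(prices, rng):
    """Hypothetical demand model: lower prices attract more buyers."""
    return {item: max(0, int(rng.gauss(10 - 2 * price, 2)))
            for item, price in prices.items()}


def step_day(machine, rng):
    """Advance one day: deliveries arrive, the daily fee is charged, sales occur."""
    for item, qty, unit_cost in machine.pending_orders:
        machine.inventory[item] = machine.inventory.get(item, 0) + qty
        machine.cash -= qty * unit_cost
    machine.pending_orders.clear()
    machine.cash -= DAILY_FEE
    for item, wanted in simulate_demand(machine.prices, rng).items():
        sold = min(wanted, machine.inventory.get(item, 0))
        machine.inventory[item] = machine.inventory.get(item, 0) - sold
        machine.cash += sold * machine.prices[item]


def run_episode(agent_act, days=365, seed=0):
    """One full run: each day the agent observes state and may reprice or order."""
    rng = random.Random(seed)
    machine = VendingMachine()
    for day in range(days):
        agent_act(day, machine)  # in the real benchmark, an LLM-backed policy acts here
        step_day(machine, rng)
    return machine.cash  # final cash as a crude end-of-run score
```

In the actual benchmark this daily loop is driven by an LLM agent over very long horizons, which is what makes small bookkeeping mistakes compound.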

Key Findings

  • High Variance in Performance: Experiments reveal significant run-to-run variability across different LLMs. Models like Claude 3.5 Sonnet and o3-mini generally manage the vending machine well and turn a profit, but all models exhibit runs that derail due to misinterpretations, forgotten orders, or "meltdown" loops (see the summary sketch after this list).
  • Memory Limits Not the Primary Cause: Breakdowns do not correlate with the point at which the model's context window becomes full, suggesting that memory limits are not the primary cause of failures.
  • Capital Acquisition: Vending-Bench also tests models' ability to acquire capital, a critical skill in many hypothetical dangerous AI scenarios.
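
Because the headline finding is run-to-run variance rather than average performance, a natural way to compare models is the spread of end-of-run net worth across repeated runs. The helper below is a minimal sketch of such a summary, assuming the run_episode scoring from the earlier sketch; the function name and report fields are illustrative, not from the paper.

```python
import statistics

def summarize_runs(final_net_worths):
    """Summarize repeated runs of one model by the spread of final net worth."""
    return {
        "mean": statistics.mean(final_net_worths),
        "stdev": statistics.stdev(final_net_worths) if len(final_net_worths) > 1 else 0.0,
        "best": max(final_net_worths),
        "worst": min(final_net_worths),  # a derailed run shows up as a deep low outlier
    }

# Hypothetical usage with the run_episode sketch above:
#   scores = [run_episode(my_agent, seed=s) for s in range(5)]
#   print(summarize_runs(scores))
```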

Purpose and Implications

The Vending-Bench benchmark aims to highlight the challenges LLMs face in maintaining long-term coherence and to prepare for the advent of stronger AI systems. By simulating a real-world business scenario, it offers a concrete view of where current autonomous agents fall short and where improvements are needed.

Sources

  • Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents (arXiv)
  • Vending-Bench: Testing long-term coherence in agents (Andon Labs)