News

LLM Longform Creative Writing Benchmark v3 Released

April 10, 2025
Tags: LLM Creative Writing Benchmark, AI Narrative Generation, Evaluation Metrics

LLM Longform Creative Writing Benchmark v3

The Longform Creative Writing Benchmark v3 is a comprehensive evaluation tool designed to assess the capabilities of large language models (LLMs) in generating extended creative narratives. This benchmark focuses on several key aspects of creative writing, including brainstorming, planning, revising, and executing a short story or novella over multiple iterations.

Key Features of the Benchmark

  • Brainstorming & Planning: Models are required to develop a story concept from a minimal prompt and create a detailed plan.
  • Reflection & Revision: After planning, models must reflect on their initial ideas and make necessary revisions.
  • Writing Execution: The story is written over 8 turns, each producing approximately 1000 words.
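The turn-by-turn writing process can be sketched as a simple loop that feeds the plan and all prior chapters back into each new prompt. This is an illustrative sketch only: `generate` is a stub standing in for a real LLM call, and the prompt wording is an assumption, not the benchmark's actual harness.

```python
def generate(prompt: str) -> str:
    # Stub: a real harness would call an LLM API here.
    return f"[~1000 words continuing from: {prompt[:40]}...]"

def run_story(plan: str, turns: int = 8) -> list[str]:
    """Write the story over `turns` turns, feeding prior chapters back in."""
    chapters: list[str] = []
    for i in range(turns):
        context = plan + "\n\n" + "\n\n".join(chapters)
        prompt = f"{context}\n\nWrite chapter {i + 1} (about 1000 words)."
        chapters.append(generate(prompt))
    return chapters

chapters = run_story("Plan: a lighthouse keeper finds a message in a bottle.")
```

Because the full context grows with every turn, later chapters are generated under a much longer prompt, which is part of what the degradation metric below probes.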

Evaluation Metrics

The benchmark uses a scoring rubric evaluated by Claude 3.7 Sonnet as the judge, with the following metrics:

  • Length: Measures the average chapter length in characters.
  • Slop Score: Tracks the frequency of overused words or phrases ("GPT-isms"). Lower scores are better.
  • Repetition Metric: Assesses the tendency of the model to repeat words or phrases across tasks. Higher values indicate more repetition.
  • Degradation: Tracks chapter quality over the course of the story; the degradation score is the gradient of a trendline fitted to the per-chapter quality scores.
  • Overall Score (0-100): The final rating assigned by the judge LLM on a 0-100 scale. Higher scores indicate better performance.
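The degradation score described above can be computed as the least-squares slope of quality against chapter index. The fitting method and the example score scale here are assumptions for illustration; the benchmark's exact procedure is not shown in this post.

```python
def degradation_score(chapter_scores: list[float]) -> float:
    """Least-squares slope of quality vs. chapter index.

    Negative values mean quality declines as the story progresses.
    """
    n = len(chapter_scores)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(chapter_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, chapter_scores))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# A story losing one quality point per chapter yields a slope of -1.0:
slope = degradation_score([80, 79, 78, 77, 76, 75, 74, 73])
```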

Generation Settings

Models are typically evaluated using the following settings:

  • Temperature: 0.7
  • Min_p: 0.1
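Min_p sampling, as commonly implemented, discards tokens whose probability falls below `min_p` times the most likely token's probability, then renormalizes before sampling. A minimal sketch, using the 0.1 setting above and a made-up toy distribution:

```python
def min_p_filter(probs: dict[str, float], min_p: float = 0.1) -> dict[str, float]:
    """Keep tokens with probability >= min_p * max(probability), renormalized."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# With min_p=0.1, the cutoff is 0.1 * 0.5 = 0.05, so "xylophone" is dropped.
probs = {"the": 0.5, "a": 0.3, "xylophone": 0.02}
filtered = min_p_filter(probs, min_p=0.1)
```

Unlike a fixed top-p cutoff, the min_p threshold scales with the model's confidence: when the top token is very likely, more low-probability alternatives are pruned.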

Additional Resources

For more detailed information about the benchmark, you can visit the EQ-Bench Longform Creative Writing page or check out the Creative Writing v3 Leaderboard.

Sources

  • EQ-Bench Creative Writing v3 Leaderboard
  • Longform Creative Writing - EQ-Bench
  • LLM Benchmark for 'Longform Creative Writing' - Hacker News discussion