LLM Longform Creative Writing Benchmark v3
The Longform Creative Writing Benchmark v3 is a comprehensive evaluation tool designed to assess the capabilities of large language models (LLMs) in generating extended creative narratives. This benchmark focuses on several key aspects of creative writing, including brainstorming, planning, revising, and executing a short story or novella over multiple iterations.
Key Features of the Benchmark
- Brainstorming & Planning: Models are required to develop a story concept from a minimal prompt and create a detailed plan.
- Reflection & Revision: After planning, models must reflect on their initial ideas and make necessary revisions.
- Writing Execution: The story is written over 8 turns, each producing approximately 1,000 words (a minimal sketch of this loop follows the list).
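The benchmark's actual harness isn't reproduced here, but the turn-by-turn flow is straightforward to sketch. The following is a minimal, hypothetical Python illustration, assuming a `generate(messages)` helper that wraps whichever model API is under test; the real prompts and plumbing will differ:

```python
# Hypothetical sketch of the plan -> revise -> write-in-turns flow.
# `generate` is assumed to take a chat message list and return a string.

NUM_TURNS = 8          # the story is written over 8 turns
WORDS_PER_TURN = 1000  # ~1000 words per turn

def run_writing_task(generate, prompt: str) -> list[str]:
    """Drive one task: brainstorm/plan, reflect and revise, then write 8 chapters."""
    messages = [{"role": "user",
                 "content": f"Brainstorm and plan a story for this prompt: {prompt}"}]
    plan = generate(messages)
    messages += [{"role": "assistant", "content": plan},
                 {"role": "user", "content": "Reflect on your plan and revise it."}]
    revised_plan = generate(messages)
    messages.append({"role": "assistant", "content": revised_plan})

    chapters = []
    for turn in range(1, NUM_TURNS + 1):
        messages.append({"role": "user",
                         "content": f"Write chapter {turn} (~{WORDS_PER_TURN} words)."})
        chapter = generate(messages)
        messages.append({"role": "assistant", "content": chapter})
        chapters.append(chapter)
    return chapters
```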
Evaluation Metrics
Outputs are scored against a rubric by Claude 3.7 Sonnet acting as judge, using the following metrics:
- Length: Measures the average chapter length in characters.
- Slop Score: Tracks the frequency of overused words or phrases ("GPT-isms"). Lower scores are better.
- Repetition Metric: Assesses the tendency of the model to repeat words or phrases across tasks. Higher values indicate more repetition.
- Degradation: Tracks per-chapter quality over the course of the story; the degradation score is the gradient of the quality trendline (a sketch of the computation follows this list).
- Overall Score: The final rating assigned by the judge LLM, scaled from 0 to 100. Higher scores indicate better performance.
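The benchmark's exact formulas aren't reproduced here, but three of these metrics can be sketched in a few lines of Python. The slop-phrase list, the per-1,000-words normalisation, and the least-squares slope below are all assumptions for illustration, not the benchmark's published definitions:

```python
# Rough sketches of three metric computations (illustrative only).
from statistics import mean

SLOP_PHRASES = ["tapestry", "testament to", "shiver down"]  # illustrative list

def avg_chapter_length(chapters: list[str]) -> float:
    """Length metric: mean chapter length in characters."""
    return mean(len(c) for c in chapters)

def slop_score(chapters: list[str]) -> float:
    """Frequency of overused phrases per 1,000 words (lower is better)."""
    text = " ".join(chapters).lower()
    words = len(text.split())
    hits = sum(text.count(p) for p in SLOP_PHRASES)
    return 1000 * hits / max(words, 1)

def degradation(judge_scores: list[float]) -> float:
    """Gradient of the least-squares trendline over per-chapter scores."""
    n = len(judge_scores)
    x_bar, y_bar = (n - 1) / 2, mean(judge_scores)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(judge_scores))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den if den else 0.0
```

Under this reading, a story whose judged quality holds steady yields a degradation gradient near zero, while one that deteriorates chapter by chapter yields a negative gradient.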
Generation Settings
Models are typically evaluated with the following sampler settings (illustrated in the sketch below):
- Temperature: 0.7
- Min_p: 0.1
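As an illustration, these settings might be passed to an OpenAI-compatible chat endpoint as shown below. Note that `min_p` is not part of the official OpenAI API; it is accepted as an extension by some open-source inference servers, and the URL and model name here are placeholders:

```python
# Sketch of a generation request using the benchmark's sampler settings.
import requests

payload = {
    "model": "your-model",  # placeholder
    "messages": [{"role": "user", "content": "Write the opening of a short story."}],
    "temperature": 0.7,  # benchmark default
    "min_p": 0.1,        # extension supported by some inference backends
    "max_tokens": 1500,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```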
Additional Resources
For more detail, visit the EQ-Bench Longform Creative Writing page or the Creative Writing v3 Leaderboard.