LLM Longform Creative Writing Benchmark v3
The Longform Creative Writing Benchmark v3 is a comprehensive evaluation tool designed to assess the capabilities of large language models (LLMs) in generating extended creative narratives. This benchmark focuses on several key aspects of creative writing, including brainstorming, planning, revising, and executing a short story or novella over multiple iterations.
Key Features of the Benchmark
- Brainstorming & Planning: Models are required to develop a story concept from a minimal prompt and create a detailed plan.
- Reflection & Revision: After planning, models must reflect on their initial ideas and make necessary revisions.
- Writing Execution: The story is written over 8 turns, each producing approximately 1,000 words (a minimal sketch of this loop follows the list).
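The benchmark's actual harness isn't reproduced here, but the turn-by-turn flow is straightforward to sketch. The following is a minimal, hypothetical Python illustration, assuming a `generate(messages)` helper that wraps whichever model API is under test; the real prompts and plumbing will differ:

```python
# Hypothetical sketch of the plan -> revise -> write-in-turns flow.
# `generate` is assumed to take a chat message list and return a string.

NUM_TURNS = 8          # the story is written over 8 turns
WORDS_PER_TURN = 1000  # ~1000 words per turn

def run_writing_task(generate, prompt: str) -> list[str]:
    """Drive one task: brainstorm/plan, reflect and revise, then write 8 chapters."""
    messages = [{"role": "user",
                 "content": f"Brainstorm and plan a story for this prompt: {prompt}"}]
    plan = generate(messages)
    messages += [{"role": "assistant", "content": plan},
                 {"role": "user", "content": "Reflect on your plan and revise it."}]
    revised_plan = generate(messages)
    messages.append({"role": "assistant", "content": revised_plan})

    chapters = []
    for turn in range(1, NUM_TURNS + 1):
        messages.append({"role": "user",
                         "content": f"Write chapter {turn} (~{WORDS_PER_TURN} words)."})
        chapter = generate(messages)
        messages.append({"role": "assistant", "content": chapter})
        chapters.append(chapter)
    return chapters
```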
Evaluation Metrics
Outputs are scored against a rubric by Claude 3.7 Sonnet acting as judge, using the following metrics:
- Length: Measures the average chapter length in characters.
- Slop Score: Tracks the frequency of overused words or phrases ("GPT-isms"). Lower scores are better.
- Repetition Metric: Assesses the tendency of the model to repeat words or phrases across tasks. Higher values indicate more repetition.
- Degradation: Tracks per-chapter quality over the course of the story; the degradation score is the gradient of the quality trendline (a sketch of the computation follows this list).
- Overall Score: The final rating assigned by the judge LLM, scaled from 0 to 100. Higher scores indicate better performance.
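The benchmark's exact formulas aren't reproduced here, but three of these metrics can be sketched in a few lines of Python. The slop-phrase list, the per-1,000-words normalisation, and the least-squares slope below are all assumptions for illustration, not the benchmark's published definitions:

```python
# Rough sketches of three metric computations (illustrative only).
from statistics import mean

SLOP_PHRASES = ["tapestry", "testament to", "shiver down"]  # illustrative list

def avg_chapter_length(chapters: list[str]) -> float:
    """Length metric: mean chapter length in characters."""
    return mean(len(c) for c in chapters)

def slop_score(chapters: list[str]) -> float:
    """Frequency of overused phrases per 1,000 words (lower is better)."""
    text = " ".join(chapters).lower()
    words = len(text.split())
    hits = sum(text.count(p) for p in SLOP_PHRASES)
    return 1000 * hits / max(words, 1)

def degradation(judge_scores: list[float]) -> float:
    """Gradient of the least-squares trendline over per-chapter scores."""
    n = len(judge_scores)
    x_bar, y_bar = (n - 1) / 2, mean(judge_scores)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(judge_scores))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den if den else 0.0
```

Under this reading, a story whose judged quality holds steady yields a degradation gradient near zero, while one that deteriorates chapter by chapter yields a negative gradient.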
Generation Settings
Models are typically evaluated with the following sampler settings (illustrated in the sketch below):
- Temperature: 0.7
- Min_p: 0.1
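As an illustration, these settings might be passed to an OpenAI-compatible chat endpoint as shown below. Note that `min_p` is not part of the official OpenAI API; it is accepted as an extension by some open-source inference servers, and the URL and model name here are placeholders:

```python
# Sketch of a generation request using the benchmark's sampler settings.
import requests

payload = {
    "model": "your-model",  # placeholder
    "messages": [{"role": "user", "content": "Write the opening of a short story."}],
    "temperature": 0.7,  # benchmark default
    "min_p": 0.1,        # extension supported by some inference backends
    "max_tokens": 1500,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```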
Additional Resources
For more detail, visit the EQ-Bench Longform Creative Writing page or the Creative Writing v3 Leaderboard.