---
license: apache-2.0
datasets:
  - lars1234/story_writing_benchmark
base_model:
  - mistralai/Mistral-Small-24B-Instruct-2501
---

# Mistral-Small-24B-Instruct-2501-writer

Mistral-Small-24B-Instruct-2501-writer is a fine-tuned version of [mistralai/Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501), optimized specifically for creative writing tasks.

## Performance

The table below was generated by writing 568 stories from the same prompts as the lars1234/story_writing_benchmark dataset and scoring them with the benchmark's evaluator models.

| Metric | Mistral-2501 | Mistral-Writer | Gemma-Ataraxy |
|---|---:|---:|---:|
| Grammar & Spelling | 82.1% | 83.3% | 88.8% |
| Clarity | 63.0% | 64.1% | 65.8% |
| Logical Connection | 57.7% | 64.1% | 66.0% |
| Scene Construction | 56.1% | 62.0% | 64.1% |
| Internal Consistency | 67.2% | 73.1% | 75.1% |
| Character Consistency | 50.7% | 54.0% | 54.3% |
| Character Motivation | 44.6% | 49.8% | 49.2% |
| Sentence Variety | 57.7% | 64.4% | 64.0% |
| Avoiding Clichés | 24.6% | 33.3% | 31.2% |
| Natural Dialogue | 42.9% | 51.9% | 48.3% |
| Avoiding Tropes | 28.6% | 37.4% | 40.0% |
| Character Depth | 35.7% | 46.4% | 45.4% |
| Character Interactions | 45.0% | 52.0% | 51.7% |
| Reader Interest | 54.1% | 63.1% | 63.0% |
| Plot Resolution | 35.3% | 45.3% | 44.9% |
| **Average** | **49.3%** | **56.5%** | **56.1%** |

Mistral-Small-24B-Instruct-2501-writer outperforms the base Mistral model on every metric. Gemma-2-Ataraxy still shows higher creativity in some categories, as seen, for example, in its higher score on "Avoiding Tropes."

## DPO Dataset Creation

The model was fine-tuned using Direct Preference Optimization (DPO), which requires pairs of responses where one is preferred over the other. The pairs were created from the lars1234/story_writing_benchmark dataset using two approaches:
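For reference, DPO trains the policy directly on such pairs by minimizing the following loss (Rafailov et al., 2023), where $y_w$ is the chosen story, $y_l$ the rejected one, $\pi_{\text{ref}}$ the frozen base model, and $\beta$ the preference-strength parameter (0.1 here; see Training Methodology):

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$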

### 1. Language-Based Pairs

- **Correct vs. Incorrect Language:** For prompts requesting stories in a specific language (English, Spanish, or German), we identified cases where a model generated text in the wrong language.
- **Verification Process:** Used `fast_langdetect` to automatically verify the language with high confidence (threshold ≥ 0.8); see the sketch after this list.
- **Pair Creation:** For the same prompt, a story with correctly detected language was paired as "chosen" against a story with incorrectly detected language as "rejected".
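A minimal sketch of this step, assuming `fast_langdetect`'s `detect()` API; the story-dict layout and pairing logic are illustrative, not the actual benchmark script:

```python
from fast_langdetect import detect

CONF_THRESHOLD = 0.8  # minimum detector confidence, as described above

def language_ok(story: dict) -> bool:
    """True if the story text is confidently in the requested language."""
    # The underlying fastText detector expects single-line input.
    result = detect(story["text"].replace("\n", " "))
    return result["lang"] == story["expected_lang"] and result["score"] >= CONF_THRESHOLD

def make_language_pairs(stories_by_prompt: dict) -> list[dict]:
    """stories_by_prompt maps each prompt to its generated stories, assumed
    here to be dicts with 'text' and 'expected_lang' ('en'/'es'/'de') keys."""
    pairs = []
    for prompt, stories in stories_by_prompt.items():
        correct = [s for s in stories if language_ok(s)]
        wrong = [s for s in stories if not language_ok(s)]
        # Pair each wrong-language story against a correct one for the same prompt.
        for chosen, rejected in zip(correct, wrong):
            pairs.append({"prompt": prompt,
                          "chosen": chosen["text"],
                          "rejected": rejected["text"]})
    return pairs
```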

### 2. Quality-Based Pairs

- **Quality Scoring:** For stories with correctly detected language, we calculated quality differences based on four metrics:
  - `q1`: Grammar and spelling
  - `q11`: Avoiding tropes
  - `q12`: Character depth
  - `q14`: Reader interest
- **Minimum Threshold:** Only story pairs with a quality difference of at least 0.4 (on a 1-5 scale) were considered.
- **Greedy Selection:** The highest-rated story was selected as "chosen" and paired with a lower-rated story as "rejected" for the same prompt, as sketched after this list.
- **Uniqueness:** Each story was used in at most one pair.
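Continuing the sketch above, the greedy quality-based pairing might look like this (the metric keys follow the benchmark dataset; everything else is illustrative):

```python
QUALITY_METRICS = ("q1", "q11", "q12", "q14")  # grammar, tropes, depth, interest
MIN_DIFF = 0.4  # minimum quality gap on the 1-5 evaluator scale

def quality(story: dict) -> float:
    """Average the four evaluator scores for one story."""
    return sum(story[m] for m in QUALITY_METRICS) / len(QUALITY_METRICS)

def make_quality_pairs(stories_by_prompt: dict) -> list[dict]:
    pairs = []
    for prompt, stories in stories_by_prompt.items():
        # Only stories that passed the language check are eligible.
        ranked = sorted((s for s in stories if language_ok(s)),
                        key=quality, reverse=True)
        used = set()  # indices already consumed: each story is in at most one pair
        for i, chosen in enumerate(ranked):
            if i in used:
                continue
            for j in range(i + 1, len(ranked)):
                # Greedily take the first lower-rated story far enough below.
                if j not in used and quality(chosen) - quality(ranked[j]) >= MIN_DIFF:
                    pairs.append({"prompt": prompt,
                                  "chosen": chosen["text"],
                                  "rejected": ranked[j]["text"]})
                    used.update({i, j})
                    break
    return pairs
```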

The final JSONL dataset contained these pairs in the format:

```json
{"prompt": "Write a story about...", "chosen": "High quality story text...", "rejected": "Lower quality story text..."}
```

See this script for the code.
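Given the two sketches above, serializing the combined pairs to JSONL could look like this (file name illustrative):

```python
import json

all_pairs = make_language_pairs(stories_by_prompt) + make_quality_pairs(stories_by_prompt)
with open("dpo_pairs.jsonl", "w", encoding="utf-8") as f:
    for pair in all_pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```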

## Training Methodology

The model was fine-tuned using Axolotl with the following parameters (see the illustrative sketch after the list):

- **Base Model:** mistralai/Mistral-Small-24B-Instruct-2501
- **Adapter:** LoRA with `r=16`, `alpha=32`
- **DPO Beta:** 0.1
- **Learning Rate:** 1e-4
- **Optimizer:** AdamW with cosine scheduler
- **Training Epochs:** 1
- **Gradient Accumulation Steps:** 4
- **Micro Batch Size:** 2
- **Sequence Length:** 2048
- **Quantization:** 4-bit
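The actual run used Axolotl; purely for illustration, a roughly equivalent setup with TRL's `DPOTrainer` and PEFT might look like the sketch below. This is not the training script that was used, and argument names vary across TRL versions:

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

base = "mistralai/Mistral-Small-24B-Instruct-2501"
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit quantization
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base)

# JSONL with prompt/chosen/rejected fields, as produced above.
train_dataset = load_dataset("json", data_files="dpo_pairs.jsonl", split="train")

args = DPOConfig(
    output_dir="mistral-writer-dpo",
    beta=0.1,                        # DPO beta
    learning_rate=1e-4,              # AdamW is the default optimizer
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    per_device_train_batch_size=2,   # micro batch size
    gradient_accumulation_steps=4,
    max_length=2048,                 # sequence length
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,      # `tokenizer=` in older TRL versions
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```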

## Inference Parameters

A grid search was performed on inference parameters to find optimal generation settings:

- **min_p:** 0.05 (fixed)
- **temperature:** 0.5, 0.75, 1.0, 1.25

The most significant quality improvement was observed when increasing the temperature from 0.5 to 0.75; beyond that point, other aspects of quality began to suffer.
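For example, generation with the recommended settings might look like this (a sketch assuming a `transformers` version with `min_p` support; the repo id is assumed to be this model's):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lars1234/Mistral-Small-24B-Instruct-2501-writer"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a short story about a lighthouse keeper."}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(
    inputs,
    do_sample=True,
    temperature=0.75,  # the sweet spot found in the grid search
    min_p=0.05,        # held fixed during the search
    max_new_tokens=1024,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```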