---
license: apache-2.0
datasets:
- lars1234/story_writing_benchmark
base_model:
- mistralai/Mistral-Small-24B-Instruct-2501
---

# Mistral-Small-24B-Instruct-2501-writer

Mistral-Small-24B-Instruct-2501-writer is a fine-tuned version of `mistralai/Mistral-Small-24B-Instruct-2501`, optimized specifically for creative writing tasks.

## Performance

The following table was generated by writing 568 stories from the same prompts as the [lars1234/story_writing_benchmark](https://huggingface.co/datasets/lars1234/story_writing_benchmark) dataset and then evaluating them with the benchmark's evaluator models.

| Metric | Mistral-2501 | Mistral-Writer | Gemma-2-Ataraxy |
|--------|--------------|----------------|-----------------|
| Grammar & Spelling | 82.1% | 83.3% | **88.8%** |
| Clarity | 63.0% | 64.1% | **65.8%** |
| Logical Connection | 57.7% | 64.1% | **66.0%** |
| Scene Construction | 56.1% | 62.0% | **64.1%** |
| Internal Consistency | 67.2% | 73.1% | **75.1%** |
| Character Consistency | 50.7% | 54.0% | **54.3%** |
| Character Motivation | 44.6% | **49.8%** | 49.2% |
| Sentence Variety | 57.7% | **64.4%** | 64.0% |
| Avoiding Clichés | 24.6% | **33.3%** | 31.2% |
| Natural Dialogue | 42.9% | **51.9%** | 48.3% |
| Avoiding Tropes | 28.6% | 37.4% | **40.0%** |
| Character Depth | 35.7% | **46.4%** | 45.4% |
| Character Interactions | 45.0% | **52.0%** | 51.7% |
| Reader Interest | 54.1% | **63.1%** | 63.0% |
| Plot Resolution | 35.3% | **45.3%** | 44.9% |
| Average | 49.3% | **56.5%** | 56.1% |

Mistral-Small-24B-Instruct-2501-writer outperforms the base Mistral model on every metric. Gemma-2-Ataraxy still shows higher creativity in some categories; for example, it scores better on "Avoiding Tropes."

## DPO Dataset Creation

The model was fine-tuned using Direct Preference Optimization (DPO), which requires pairs of responses where one is preferred over the other. The pairs were created from the [lars1234/story_writing_benchmark](https://huggingface.co/datasets/lars1234/story_writing_benchmark) dataset using two approaches:

### 1. Language-Based Pairs

- **Correct vs. Incorrect Language**: For prompts requesting stories in specific languages (English, Spanish, or German), we identified cases where models generated text in the wrong language.
- **Verification Process**: Used `fast_langdetect` to automatically verify each story's language, keeping only detections with high confidence (threshold ≥ 0.8).
- **Pair Creation**: Stories with a correctly detected language were paired as "chosen" against stories with an incorrectly detected language as "rejected" for the same prompt (see the first sketch below).

### 2. Quality-Based Pairs

- **Quality Scoring**: For stories with a correctly detected language, we calculated quality differences based on four metrics:
  - q1: Grammar and spelling
  - q11: Avoiding tropes
  - q12: Character depth
  - q14: Reader interest
- **Minimum Threshold**: Only story pairs with a quality difference of at least 0.4 (on a 1-5 scale) were considered.
- **Greedy Selection**: The highest-rated story was selected as "chosen" and paired with a lower-rated story as "rejected" for the same prompt (see the second sketch below).
- **Uniqueness**: Each story was used in at most one pair.

The final JSONL dataset contained these pairs in the format:

```json
{"prompt": "Write a story about...", "chosen": "High quality story text...", "rejected": "Lower quality story text..."}
```

See [this script](https://github.com/lars76/story-evaluation-llm/blob/main/create_dpo_pairs.py) for the full code.
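To make the language-based pairing concrete, here is a minimal sketch. It is not the linked `create_dpo_pairs.py`: the story fields (`prompt`, `prompt_id`, `language`, `text`) are hypothetical stand-ins for the benchmark's real schema, and `fast_langdetect`'s `detect` call is used as documented upstream.

```python
# Sketch only: field names are assumed, not taken from the actual script.
from fast_langdetect import detect

CONFIDENCE_THRESHOLD = 0.8  # minimum detection score to trust the label

def detected_language(text: str) -> str | None:
    """Return the detected language code, or None when confidence is low."""
    result = detect(text.replace("\n", " "))  # the detector expects single-line text
    return result["lang"] if result["score"] >= CONFIDENCE_THRESHOLD else None

def language_pairs(stories: list[dict]) -> list[dict]:
    """Pair correct-language stories ("chosen") with wrong-language ones ("rejected")."""
    by_prompt: dict[str, list[dict]] = {}
    for story in stories:
        by_prompt.setdefault(story["prompt_id"], []).append(story)

    pairs = []
    for prompt_stories in by_prompt.values():
        correct = [s for s in prompt_stories
                   if detected_language(s["text"]) == s["language"]]
        wrong = [s for s in prompt_stories
                 if detected_language(s["text"]) not in (None, s["language"])]
        for chosen, rejected in zip(correct, wrong):  # each story used at most once
            pairs.append({"prompt": chosen["prompt"],
                          "chosen": chosen["text"],
                          "rejected": rejected["text"]})
    return pairs
```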
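The quality-based greedy selection can be sketched the same way, under the same assumed schema; `quality` below stands for the combined score over q1, q11, q12, and q14 on the 1-5 scale:

```python
# Sketch only: story["quality"] is assumed to hold the combined q1/q11/q12/q14 score.
MIN_QUALITY_GAP = 0.4  # minimum quality difference on the 1-5 scale

def quality_pairs(prompt_stories: list[dict]) -> list[dict]:
    """Greedily pair the best remaining story against a clearly worse one."""
    remaining = sorted(prompt_stories, key=lambda s: s["quality"], reverse=True)
    pairs = []
    while len(remaining) >= 2:
        chosen = remaining.pop(0)  # highest-rated story still available
        partner = next((s for s in remaining
                        if chosen["quality"] - s["quality"] >= MIN_QUALITY_GAP), None)
        if partner is None:
            break  # quality gaps only shrink further down the ranking
        remaining.remove(partner)  # uniqueness: each story appears in at most one pair
        pairs.append({"prompt": chosen["prompt"],
                      "chosen": chosen["text"],
                      "rejected": partner["text"]})
    return pairs
```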
## Training Methodology

The model was fine-tuned using Axolotl with the following parameters (a hedged reconstruction of the config appears at the end of this card):

- **Base Model**: mistralai/Mistral-Small-24B-Instruct-2501
- **Adapter**: LoRA with r=16, alpha=32
- **DPO Beta**: 0.1
- **Learning Rate**: 1e-4
- **Optimizer**: AdamW with cosine scheduler
- **Training Epochs**: 1
- **Gradient Accumulation Steps**: 4
- **Micro Batch Size**: 2
- **Sequence Length**: 2048
- **Quantization**: 4-bit

## Inference Parameters

A grid search over inference parameters was performed to find good generation settings:

- **min_p**: 0.05 (fixed)
- **temperature**: 0.5, 0.75, 1.0, 1.25

The largest quality improvement came from raising the temperature from 0.5 to 0.75; above 0.75, other quality aspects began to suffer.
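These settings can be passed to standard sampling APIs. Below is a minimal generation sketch using Hugging Face `transformers` (which supports `min_p` in recent versions); the repo id and loading details are assumptions, not part of this card:

```python
# Generation sketch with the recommended settings; repo id and loading are assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lars1234/Mistral-Small-24B-Instruct-2501-writer"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Write a short story about a lighthouse keeper."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.75,    # the sweet spot found by the grid search
    min_p=0.05,          # held fixed during the search
    max_new_tokens=1024,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```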
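Finally, the configuration sketch referenced under Training Methodology. This is a hedged reconstruction from the bullet list above, not the original Axolotl config; key names (in particular the dataset type and the DPO beta) vary across Axolotl versions:

```yaml
# Hedged reconstruction, not the original config; key names may differ by version.
base_model: mistralai/Mistral-Small-24B-Instruct-2501
load_in_4bit: true

adapter: lora
lora_r: 16
lora_alpha: 32

rl: dpo
dpo_beta: 0.1             # assumed key name for the DPO beta

datasets:
  - path: dpo_pairs.jsonl # the prompt/chosen/rejected JSONL described above
    ds_type: json
    type: chatml.intel    # assumed dataset type; depends on the pair format

sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 1
learning_rate: 0.0001
optimizer: adamw_torch
lr_scheduler: cosine
```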