	---
	license: apache-2.0
	datasets:
	- lars1234/story_writing_benchmark
	base_model:
	- mistralai/Mistral-Small-24B-Instruct-2501
	---

	# Mistral-Small-24B-Instruct-2501-writer

	Mistral-Small-24B-Instruct-2501-writer is a fine-tuned version of `mistralai/Mistral-Small-24B-Instruct-2501`, optimized specifically for creative writing tasks.

	## Performance

	The following table was generated by creating 568 stories based on the same prompts as in the [lars1234/story_writing_benchmark](https://huggingface.co/datasets/lars1234/story_writing_benchmark) dataset and then evaluating them using the benchmark's evaluator models.

	| Metric | Mistral-2501 | Mistral-Writer | Gemma-Ataraxy |
	|--------|--------------|----------------|---------------|
	| Grammar & Spelling | 82.1% | 83.3% | 88.8% |
	| Clarity | 63.0% | 64.1% | 65.8% |
	| Logical Connection | 57.7% | 64.1% | 66.0% |
	| Scene Construction | 56.1% | 62.0% | 64.1% |
	| Internal Consistency | 67.2% | 73.1% | 75.1% |
	| Character Consistency | 50.7% | 54.0% | 54.3% |
	| Character Motivation | 44.6% | 49.8% | 49.2% |
	| Sentence Variety | 57.7% | 64.4% | 64.0% |
	| Avoiding Clichés | 24.6% | 33.3% | 31.2% |
	| Natural Dialogue | 42.9% | 51.9% | 48.3% |
	| Avoiding Tropes | 28.6% | 37.4% | 40.0% |
	| Character Depth | 35.7% | 46.4% | 45.4% |
	| Character Interactions | 45.0% | 52.0% | 51.7% |
	| Reader Interest | 54.1% | 63.1% | 63.0% |
	| Plot Resolution | 35.3% | 45.3% | 44.9% |
	| Average | 49.3% | 56.5% | 56.1% |

	Mistral-Small-24B-Instruct-2501-writer outperforms the base Mistral model on every metric. Gemma-2-Ataraxy still scores higher in some categories, for example "Avoiding Tropes" (40.0% vs. 37.4%), which suggests higher creativity in those areas.

	## DPO Dataset Creation

	The model was fine-tuned using Direct Preference Optimization (DPO), which requires pairs of responses where one is preferred over the other. The pairs were created from the [lars1234/story_writing_benchmark](https://huggingface.co/datasets/lars1234/story_writing_benchmark) dataset using two approaches:

	### 1. Language-Based Pairs
	- Correct vs. Incorrect Language: For prompts requesting stories in a specific language (English, Spanish, or German), we identified cases where a model generated text in the wrong language.
	- Verification Process: We used `fast_langdetect` to verify the language automatically, accepting only high-confidence detections (threshold ≥ 0.8).
	- Pair Creation: For each prompt, stories in the correctly detected language were used as "chosen" and paired with stories in an incorrectly detected language as "rejected".
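
The language-based pairing described above can be sketched roughly as follows. This is an illustration only, not the linked script: `Story`, `language_pairs`, and the injected `detect` callable are hypothetical names, and the detector is passed in as a function returning `(language_code, confidence)` so the sketch does not rely on any particular `fast_langdetect` call signature. Skipping low-confidence detections and using each story in at most one pair are assumptions.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # threshold stated in the model card


@dataclass(frozen=True)
class Story:
    prompt: str
    requested_lang: str  # e.g. "en", "es", "de"
    text: str


def language_pairs(stories, detect):
    """Pair correct-language stories with wrong-language ones per prompt.

    `detect` is a hypothetical callable: text -> (language_code, confidence).
    """
    by_prompt = {}
    for story in stories:
        lang, score = detect(story.text)
        if score < CONFIDENCE_THRESHOLD:
            continue  # assumption: low-confidence detections are skipped
        bucket = by_prompt.setdefault(story.prompt, {"ok": [], "wrong": []})
        key = "ok" if lang == story.requested_lang else "wrong"
        bucket[key].append(story)

    pairs = []
    for prompt, bucket in by_prompt.items():
        # assumption: zip() uses each story in at most one pair
        for chosen, rejected in zip(bucket["ok"], bucket["wrong"]):
            pairs.append(
                {"prompt": prompt, "chosen": chosen.text, "rejected": rejected.text}
            )
    return pairs
```

In practice `detect` would wrap the language detector; any function with the same return shape works for testing.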

	### 2. Quality-Based Pairs
	- Quality Scoring: For stories with correctly detected language, we calculated quality differences based on four metrics:
	  - q1: Grammar and spelling
	  - q11: Avoiding tropes
	  - q12: Character depth
	  - q14: Reader interest
	- Minimum Threshold: Only story pairs with a quality difference of at least 0.4 (on a 1-5 scale) were considered.
	- Greedy Selection: For each prompt, the highest-rated story was selected as "chosen" and paired with a lower-rated story as "rejected".
	- Uniqueness: Each story was used in at most one pair.
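
One possible reading of the quality-based selection is sketched below. Several details are assumptions rather than facts from the linked script: the four metrics are combined with an unweighted mean, the "lower-rated" partner is taken to be the lowest-rated unused story, and the names `quality` and `quality_pairs` are hypothetical.

```python
from statistics import mean

QUALITY_KEYS = ("q1", "q11", "q12", "q14")  # metrics named in the model card
MIN_GAP = 0.4  # minimum quality difference on the 1-5 scale


def quality(story):
    # assumption: the four metrics are averaged with equal weight
    return mean(story[key] for key in QUALITY_KEYS)


def quality_pairs(prompt, stories):
    """Greedily pair stories for ONE prompt; each story is used at most once."""
    ranked = sorted(stories, key=quality, reverse=True)
    pairs = []
    best, worst = 0, len(ranked) - 1
    # Pair the best unused story with the worst unused story while the
    # gap between them still meets the minimum threshold.
    while best < worst and quality(ranked[best]) - quality(ranked[worst]) >= MIN_GAP:
        pairs.append(
            {
                "prompt": prompt,
                "chosen": ranked[best]["text"],
                "rejected": ranked[worst]["text"],
            }
        )
        best += 1
        worst -= 1
    return pairs
```

Pairing from both ends of the ranking maximizes the quality gap within each pair; a different partner rule (for example, the closest story that still clears the threshold) would yield harder preference pairs.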

	The final JSONL dataset stores one pair per line in the following format:
	```json
	{"prompt": "Write a story about...", "chosen": "High quality story text...", "rejected": "Lower quality story text..."}
	```
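
As a small illustration of this layout, the sketch below writes and reads such a file with the standard library only. The helper names `write_pairs` and `read_pairs` are hypothetical and are not taken from the linked script.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"prompt", "chosen", "rejected"}


def write_pairs(path, pairs):
    """Write one JSON object per line (JSONL)."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for pair in pairs:
            # ensure_ascii=False keeps Spanish and German characters readable
            handle.write(json.dumps(pair, ensure_ascii=False) + "\n")


def read_pairs(path):
    """Read a JSONL file and check that every record has the expected keys."""
    pairs = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            pairs.append(record)
    return pairs
```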

	See [this script](https://github.com/lars76/story-evaluation-llm/blob/main/create_dpo_pairs.py) for the code.

	## Training Methodology

	The model was fine-tuned using Axolotl with the following parameters:

	- Base Model: mistralai/Mistral-Small-24B-Instruct-2501
	- Adapter: LoRA with r=16, alpha=32
	- DPO Beta: 0.1
	- Learning Rate: 1e-4
	- Optimizer: AdamW with cosine scheduler
	- Training Epochs: 1
	- Gradient Accumulation Steps: 4
	- Micro Batch Size: 2
	- Sequence Length: 2048
	- Quantization: 4-bit
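
To make the role of the DPO beta concrete, here is a minimal sketch of the standard sigmoid DPO objective for a single preference pair. It is not Axolotl's code: the function name and arguments are hypothetical, and the inputs are assumed to be summed token log-probabilities of each response under the trained policy and the frozen reference model.

```python
import math

BETA = 0.1  # "DPO Beta" from the list above


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=BETA):
    """Standard DPO loss for one (chosen, rejected) pair."""
    # How much more (or less) likely each response became relative to the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable way.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

A smaller beta lets the policy drift further from the reference model for the same loss reduction. With a micro batch size of 2 and 4 gradient accumulation steps, each optimizer step covers 2 × 4 = 8 samples per device.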

	## Inference Parameters

	A grid search was performed on inference parameters to find optimal generation settings:
	- min_p: 0.05 (fixed)
	- temperature: 0.5, 0.75, 1.0, 1.25

	The largest quality improvement was observed when increasing the temperature from 0.5 to 0.75. At temperatures above 0.75, other quality aspects began to suffer.
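
For readers unfamiliar with these two parameters, the sketch below shows how temperature scaling and min-p filtering reshape a next-token distribution. It is a pure-Python illustration with a hypothetical function name, not the sampler used in the grid search. The order of operations (temperature first, then min-p) is an assumption and varies between inference libraries, and the default of 0.75 simply reflects the observation above.

```python
import math


def min_p_filter(logits, temperature=0.75, min_p=0.05):
    """Return renormalized token probabilities after temperature and min-p."""
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]  # subtract peak for stability
    total = sum(exps)
    probs = [value / total for value in exps]

    # min-p: drop tokens whose probability is below min_p times the top probability.
    cutoff = min_p * max(probs)
    kept = [p if p >= cutoff else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]
```

Because the cutoff scales with the most likely token, min-p prunes aggressively when the model is confident and keeps more candidates when the distribution is flat.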
