Model Card for GRPO Enhanced SmolLM-135M-Instruct

This model extends SmolLM-135M-Instruct for task-oriented language generation, with an emphasis on concise, well-formatted output. It was fine-tuned with GRPO (Group Relative Policy Optimization), a reward-based reinforcement-learning method.

Model Details

Model Description

This model was fine-tuned with the Transformers and TRL libraries from the HuggingFaceTB/SmolLM-135M-Instruct base model on the mlabonne/smoltldr dataset. Reward-based training (GRPO) was used to improve the specificity and formatting of generated text, making the model suitable for applications that require structured output.

  • Model type: Causal language model, fine-tuned with TRL's GRPOTrainer
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: HuggingFaceTB/SmolLM-135M-Instruct
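The card does not document the exact reward functions used during GRPO training. A common choice in GRPO recipes for the smoltldr dataset is a length-based reward that pushes completions toward a target length; the sketch below is a hypothetical example of that pattern, assuming a 50-character target.

```python
# Hypothetical length-based reward for GRPO training: the closer a completion
# is to the target length, the higher (less negative) the reward.
# The actual reward functions used for this model are not documented.
TARGET_LEN = 50  # assumed target completion length

def reward_len(completions, **kwargs):
    """Return one scalar reward per completion in the batch."""
    return [-abs(TARGET_LEN - len(completion)) for completion in completions]
```

With TRL, such a function is passed to the trainer via `reward_funcs=[reward_len]`, and GRPO compares rewards across the group of completions sampled for each prompt.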

Model Sources

  • Repository: eagle0504/finetune-mlabonne-smoltldr-using-HuggingFaceTB-SmolLM-135M-Instruct on the Hugging Face Hub

Uses

Direct Use

This model is ready to generate structured text outputs directly for applications such as automated reasoning or instructional content generation without the need for additional fine-tuning.

Bias, Risks, and Limitations

The model may inherently carry biases from the training data and the base model it was fine-tuned from. It might not handle out-of-distribution inputs well or generate less accurate outputs in languages other than English.

Recommendations

Continual monitoring and updating of the model with diverse datasets can help mitigate biases. Users should also validate model outputs before use in critical applications.

How to Get Started with the Model

To start using this model, integrate it within your Transformers-based pipeline, specifying its model ID on the HuggingFace Hub. Ensure your environment supports bf16 for optimal performance.
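A minimal loading sketch using the Transformers `pipeline` API; the model ID matches this repository, while the prompt and `max_new_tokens` value are illustrative choices.

```python
import torch
from transformers import pipeline

model_id = "eagle0504/finetune-mlabonne-smoltldr-using-HuggingFaceTB-SmolLM-135M-Instruct"

# bf16 keeps memory use low and matches the precision the model was trained in.
generator = pipeline("text-generation", model=model_id, torch_dtype=torch.bfloat16)

messages = [
    {"role": "user", "content": "Summarize in one sentence: GRPO is a "
     "reinforcement-learning method for fine-tuning language models."}
]
outputs = generator(messages, max_new_tokens=100)
print(outputs[0]["generated_text"])
```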

Training Details

Training Data

The model was trained on the mlabonne/smoltldr dataset, which consists of structured prompts and completions designed for text generation tasks.

Training Procedure

Preprocessing

Data preprocessing involved tokenization using the tokenizer from the base model, with text split into prompts and completions.
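The dataset schema is not reproduced in this card. Assuming records with `prompt` and `completion` fields (hypothetical names; the actual mlabonne/smoltldr columns may differ), preprocessing amounts to wrapping the raw prompt in the single-turn chat format the instruct-tuned base model expects:

```python
# Hypothetical record shape; the actual dataset column names may differ.
record = {
    "prompt": "Summarize the following post: ...",
    "completion": "A short TL;DR.",
}

def to_chat(record):
    """Wrap the raw prompt as a single-turn chat message list."""
    return {
        "prompt": [{"role": "user", "content": record["prompt"]}],
        "completion": record["completion"],
    }

example = to_chat(record)
```

The tokenizer's chat template then converts the message list into the token sequence seen during training.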

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • Total training time: ~494 seconds
  • Throughput: 0.101 steps per second
  • Batch size: 1 per device, with gradient accumulation over 2 steps (effective batch size 2)
  • Final training loss: ≈5.72 × 10⁻⁵
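The reported throughput and wall-clock time imply roughly 50 optimization steps in total; a quick sanity check:

```python
steps_per_second = 0.101   # from the hyperparameters above
total_seconds = 494        # approximate wall-clock training time

total_steps = steps_per_second * total_seconds  # ≈ 49.9, i.e. about 50 steps
print(round(total_steps))
```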

Evaluation

Testing Data, Factors & Metrics

Evaluated on a held-out portion of the mlabonne/smoltldr dataset.

Results

The final average training loss was ≈5.72 × 10⁻⁵, indicating that the model fit the configured reward objectives closely. Note that training loss alone does not measure downstream generation quality, so qualitative evaluation of generated outputs is still advisable.

Environmental Impact

Training was conducted on GPU hardware with bf16 support; mixed-precision training reduces compute time and energy consumption relative to full fp32 training.

Technical Specifications

Model Architecture and Objective

The model architecture is based on a causal language model with enhancements for token-level rewards to improve response quality in task-specific generations.

Compute Infrastructure

Training was performed on cloud-based GPUs with native bf16 support.

More Information

For more detailed documentation and usage examples, visit the repository link provided above.
