---
title: SmoLLMv2
emoji: 🐢
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.13.1
app_file: app.py
pinned: false
license: mit
short_description: Text generation using smollmv2-135M model
---
# SmoLLMv2: A Small but Efficient Language Model
- [Training Repo Link](https://github.com/Shilpaj1994/SmoLLMv2)
- [Gradio App Link](https://huggingface.co/spaces/Shilpaj/SmoLLMv2)
SmoLLMv2 is a 135M parameter language model designed for efficient text generation. It incorporates several modern architectural improvements while maintaining a small footprint.
## Features
- **Efficient Architecture**:
  - 30 transformer layers
  - 9 attention heads
  - 576 embedding dimension
  - Memory-efficient attention with reduced KV dimensions
  - Rotary Position Embeddings (RoPE)
  - SwiGLU activation function
- **Training Optimizations** (see the training sketch after this list):
  - Mixed precision training (16-bit)
  - Gradient accumulation
  - OneCycleLR scheduler
  - Streaming dataset support
  - Automatic model compilation (with PyTorch 2.0+)
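The sketch below shows, under stated assumptions, how these training optimizations typically map onto PyTorch Lightning and plain PyTorch. It is illustrative only: the hyperparameter values and helper names are assumptions and do not reproduce the training repo's actual code.
```python
# Minimal sketch (assumptions, not the repo's actual training script) of how
# the listed optimizations map onto PyTorch Lightning and plain PyTorch.
import torch
import lightning as L

def build_trainer() -> L.Trainer:
    """Trainer with mixed precision and gradient accumulation enabled."""
    return L.Trainer(
        precision="16-mixed",        # mixed precision training (16-bit)
        accumulate_grad_batches=8,   # gradient accumulation (factor is illustrative)
        max_steps=5_000,             # illustrative step budget
    )

def configure_optimizers(model: torch.nn.Module):
    """OneCycleLR is typically wired up like this inside a LightningModule;
    the learning rate and step count are illustrative values."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=3e-4, total_steps=5_000
    )
    return {
        "optimizer": optimizer,
        "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
    }

# Automatic model compilation with PyTorch 2.0+ is a one-liner:
# model = torch.compile(model)
```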
## Model Architecture
SmoLLMv2 incorporates several efficiency improvements:
1. **Reduced KV Dimensions**: Uses 189-dimensional key/value projections (instead of the full 576) to save memory and computation.
2. **RoPE Attention**: Implements Rotary Position Embeddings for better handling of positional information.
3. **SwiGLU Activation**: Uses the SwiGLU activation function in the MLP layers for better performance; both RoPE and SwiGLU are sketched after this list.
4. **Weight Sharing**: Shares weights between input embeddings and output projection.
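Below is a minimal, self-contained sketch of two of these components. The dimensions follow the numbers quoted in this README; the layer names, the intermediate MLP size, and the exact RoPE formulation are assumptions and may differ from the training repo's implementation.
```python
# Self-contained sketches of a SwiGLU MLP block and a RoPE helper.
# Dimensions follow this README (576 embedding dim); layer names and the
# intermediate size are assumptions and may differ from the training repo.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """SwiGLU feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim: int = 576, hidden: int = 1536):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

def apply_rope(q: torch.Tensor, k: torch.Tensor, base: float = 10000.0):
    """Apply rotary position embeddings to query/key tensors shaped
    (batch, heads, seq_len, head_dim)."""
    head_dim, seq_len = q.shape[-1], q.shape[-2]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=q.device).float() / head_dim))
    pos = torch.arange(seq_len, device=q.device).float()
    angles = torch.outer(pos, inv_freq)          # (seq_len, head_dim / 2)
    cos, sin = angles.cos(), angles.sin()

    def rotate(x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x[..., 0::2], x[..., 1::2]      # split even/odd feature pairs
        return torch.stack((x1 * cos - x2 * sin,
                            x1 * sin + x2 * cos), dim=-1).flatten(-2)

    return rotate(q), rotate(k)

# Weight sharing (point 4) simply ties the output projection to the
# input embedding, e.g.: lm_head.weight = token_embedding.weight
```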
## Configuration
The model's behavior can be customized through various configuration classes in `config.py` (a hypothetical sketch follows the list below):
- `SmollmConfig`: Core model architecture and training parameters
- `RoPEConfig`: Rotary Position Embedding settings
- `OptimizerConfig`: Optimization and learning rate settings
- `DataConfig`: Dataset and tokenizer configuration
- `TrainerConfig`: Training infrastructure settings
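The class names in the sketch below come from this README, but every field shown is an illustrative assumption; only the architecture numbers (30 layers, 9 heads, 576 embedding dimension) are quoted from the sections above. Consult `config.py` in the training repo for the real definitions.
```python
# Hypothetical sketch of the configuration classes in config.py; the class
# names come from this README, but every field shown here is an assumption
# (only 30 layers / 9 heads / 576 dims are quoted from the sections above).
from dataclasses import dataclass

@dataclass
class SmollmConfig:
    n_layer: int = 30       # 30 transformer layers
    n_head: int = 9         # 9 attention heads
    n_embd: int = 576       # embedding dimension

@dataclass
class RoPEConfig:
    base: float = 10000.0   # rotation base frequency (assumed default)

@dataclass
class OptimizerConfig:
    max_lr: float = 3e-4       # illustrative OneCycleLR peak learning rate
    weight_decay: float = 0.1  # illustrative value
```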
## Dataset
The model is trained on the Cosmopedia dataset, which is streamed during training to handle large-scale data efficiently.
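As a rough illustration, streaming with the 🤗 `datasets` library usually looks like the snippet below; the dataset ID and subset name are assumptions based on the public `HuggingFaceTB/cosmopedia` release, not values taken from the training code.
```python
# Rough illustration of streaming a large dataset with the datasets library.
# The dataset ID and subset are assumptions based on the public
# HuggingFaceTB/cosmopedia release, not values taken from the training code.
from datasets import load_dataset

stream = load_dataset(
    "HuggingFaceTB/cosmopedia",   # assumed dataset ID
    "web_samples_v2",             # assumed subset name
    split="train",
    streaming=True,               # iterate shard-by-shard, no full download
)

for example in stream.take(2):    # preview a couple of rows
    print(example["text"][:200])  # "text" field assumed
```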
## Requirements
See `requirements.txt` for full dependencies. Key requirements:
- PyTorch ≥ 2.0.0
- Transformers ≥ 4.30.0
- Lightning ≥ 2.0.0
- Gradio ≥ 5.13.1