metadata
title: SmoLLMv2
emoji: 🐢
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.13.1
app_file: app.py
pinned: false
license: mit
short_description: Text generation using smollmv2-135M model

SmoLLMv2: A Small but Efficient Language Model

Training Repo Link Gradio App Link

SmoLLMv2 is a 135M-parameter language model designed for efficient text generation. It combines reduced key/value projections, rotary position embeddings, SwiGLU MLP layers, and tied embedding weights while keeping a small footprint.

Features

  • Efficient Architecture:

    • 30 transformer layers
    • 9 attention heads
    • 576 embedding dimension
    • Memory-efficient attention with reduced KV dimensions
    • Rotary Position Embeddings (RoPE)
    • SwiGLU activation function
  • Training Optimizations:

    • Mixed precision training (16-bit)
    • Gradient accumulation
    • OneCycleLR scheduler
    • Streaming dataset support
    • Automatic model compilation (with PyTorch 2.0+)
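To make the OneCycleLR bullet concrete, here is a minimal, framework-free sketch of a one-cycle schedule with cosine annealing. It is not the repository's training code and does not reproduce `torch.optim.lr_scheduler.OneCycleLR` step for step. The function name `one_cycle_lr`, the step count, and the peak learning rate are hypothetical; the defaults mirror PyTorch's documented defaults (`pct_start=0.3`, `div_factor=25`, `final_div_factor=1e4`).

```python
import math


def one_cycle_lr(step, total_steps, max_lr,
                 pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` (0-based) for a simplified one-cycle policy."""
    initial_lr = max_lr / div_factor          # learning rate at step 0
    min_lr = initial_lr / final_div_factor    # learning rate at the last step
    warmup_steps = max(1, int(pct_start * total_steps))

    def cosine(start, end, frac):
        # Cosine interpolation: frac=0 gives `start`, frac=1 gives `end`.
        return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * frac))

    if step < warmup_steps:
        # Phase 1: ramp up from initial_lr to max_lr.
        return cosine(initial_lr, max_lr, step / warmup_steps)
    # Phase 2: anneal from max_lr down to min_lr.
    decay_steps = max(1, total_steps - 1 - warmup_steps)
    frac = min(1.0, (step - warmup_steps) / decay_steps)
    return cosine(max_lr, min_lr, frac)


# Hypothetical run: 100 optimizer steps with a peak learning rate of 3e-3.
schedule = [one_cycle_lr(s, total_steps=100, max_lr=3e-3) for s in range(100)]
```

When gradient accumulation is enabled, a schedule like this would typically be advanced once per optimizer update rather than once per micro-batch, so `total_steps` counts updates.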

Model Architecture

SmoLLMv2 incorporates several efficiency improvements:

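  1. Reduced KV Dimensions: Projects keys and values to 189 dimensions instead of the full 576-dimensional embedding width, which shrinks the key/value projection weights and the cached keys and values.
  2. RoPE Attention: Applies Rotary Position Embeddings to queries and keys, so attention scores depend on the relative distance between tokens rather than on absolute positions.
  3. SwiGLU Activation: Uses the gated SwiGLU activation function in the MLP layers, which typically improves model quality over a non-gated MLP at a comparable parameter count.
  4. Weight Sharing: Ties the input embedding matrix to the output projection, so the vocabulary weights are stored only once.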
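The relative-position behaviour of RoPE can be illustrated with a short pure-Python sketch. It is an illustration under stated assumptions, not the model's attention code: it uses the interleaved-pair convention and a frequency base of 10000, neither of which is confirmed by this README, and the helper names `apply_rope` and `dot` are hypothetical. The head size of 64 follows from 576 embedding dimensions split across 9 heads.

```python
import math


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    rotated = []
    for i in range(0, dim, 2):
        # Pair i // 2 uses frequency base^(-i / dim); later pairs rotate slower.
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return rotated


def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))


# Toy query/key vectors for a single head of size 64 (576 / 9 heads).
head_dim = 576 // 9
query = [math.sin(0.1 * i) for i in range(head_dim)]
key = [math.cos(0.2 * i) for i in range(head_dim)]

# Same offset (query two positions after key) at different absolute positions.
score_near = dot(apply_rope(query, 5), apply_rope(key, 3))
score_far = dot(apply_rope(query, 105), apply_rope(key, 103))
```

The two scores agree up to floating-point error, because rotating both vectors makes their dot product depend only on the position offset.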

Configuration

The model's behavior can be customized through various configuration classes in config.py:

  • SmollmConfig: Core model architecture and training parameters
  • RoPEConfig: Rotary Position Embedding settings
  • OptimizerConfig: Optimization and learning rate settings
  • DataConfig: Dataset and tokenizer configuration
  • TrainerConfig: Training infrastructure settings
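As a rough picture of how such configuration classes might be laid out, the sketch below uses Python dataclasses. Only the class names `SmollmConfig` and `RoPEConfig` and the numeric values (30 layers, 9 heads, 576 embedding dimensions, 189 key/value dimensions) come from this README. Every field name, the `base` default, and both helper methods are assumptions for illustration and may not match the real config.py.

```python
from dataclasses import dataclass, field


@dataclass
class RoPEConfig:
    base: float = 10000.0   # hypothetical default, not taken from config.py


@dataclass
class SmollmConfig:
    n_layers: int = 30      # transformer layers
    n_heads: int = 9        # attention heads
    n_embed: int = 576      # embedding dimension
    kv_dim: int = 189       # key/value projection width
    rope: RoPEConfig = field(default_factory=RoPEConfig)

    @property
    def head_dim(self) -> int:
        # Per-head query width; n_embed must divide evenly across heads.
        if self.n_embed % self.n_heads != 0:
            raise ValueError("n_embed must be divisible by n_heads")
        return self.n_embed // self.n_heads

    def kv_weights_saved_per_layer(self) -> int:
        # K and V each shrink from n_embed x n_embed to n_embed x kv_dim
        # (bias terms ignored).
        return 2 * self.n_embed * (self.n_embed - self.kv_dim)


config = SmollmConfig()                         # README defaults
wider = SmollmConfig(n_embed=768, n_heads=12)   # overriding selected fields
```

Keeping defaults in dataclasses like this lets a training script override single fields without touching the rest of the configuration.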

Dataset

The model is trained on the Cosmopedia dataset. Samples are streamed during training, so the full corpus does not need to be downloaded or held in memory up front.
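A streaming pipeline of this kind can be sketched as a generator that packs tokenized documents into fixed-size training blocks without materializing the dataset. This is a sketch, not the repository's data module. The helper names `pack_token_stream` and `cosmopedia_token_stream` are hypothetical, and the dataset id `HuggingFaceTB/cosmopedia`, the `stories` subset, and the `text` field are assumptions about which Cosmopedia release is used.

```python
def pack_token_stream(token_lists, block_size):
    """Lazily concatenate tokenized documents and yield fixed-size blocks."""
    buffer = []
    for tokens in token_lists:
        buffer.extend(tokens)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Leftover tokens shorter than block_size are dropped.


def cosmopedia_token_stream(tokenizer, subset="stories"):
    """Assumed wiring to the Hugging Face `datasets` streaming API."""
    from datasets import load_dataset  # imported lazily; needs `datasets`

    stream = load_dataset("HuggingFaceTB/cosmopedia", subset,
                          split="train", streaming=True)
    for example in stream:
        yield tokenizer(example["text"])["input_ids"]


# Offline demonstration with three already-tokenized toy documents.
toy_documents = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = list(pack_token_stream(toy_documents, block_size=4))
```

In training, `pack_token_stream(cosmopedia_token_stream(tokenizer), block_size)` would feed blocks to the data loader as they arrive, one document at a time.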

Requirements

See requirements.txt for full dependencies. Key requirements:

  • PyTorch ≥ 2.0.0
  • Transformers ≥ 4.30.0
  • Lightning ≥ 2.0.0
  • Gradio ≥ 5.13.1