---
datasets:
- stanfordnlp/imdb
language:
- en
library_name: swarmformer
---

# Model Card for SwarmFormer-Small

SwarmFormer-Small is a lightweight variant of the SwarmFormer architecture, designed for efficient text classification with minimal computational requirements.

## Model Details

### Model Description

A compact version of SwarmFormer with:
- Token embedding layer with dropout (0.3)
- Two SwarmFormer layers
- Mean pooling and a classification head
- Optimized for shorter sequences

- **Developed by:** Jordan Legg, Mikus Sturmanis, Takara.ai
- **Funded by:** Takara.ai
- **Shared by:** Takara.ai
- **Model type:** Hierarchical transformer
- **Language(s):** English
- **License:** Not specified
- **Finetuned from model:** Trained from scratch

### Model Sources

- **Repository:** https://github.com/takara-ai/SwarmFormer
- **Paper:** Takara.ai Research
- **Demo:** Not available

## Uses

### Direct Use

- Text classification
- Sentiment analysis
- Resource-constrained environments

### Out-of-Scope Use

- Text generation
- Machine translation
- Tasks requiring sequences longer than 256 tokens
- Tasks requiring high precision

## Training Details

### Training Data

- Dataset: IMDB movie reviews (stanfordnlp/imdb)
- Size: 50,000 samples
- Data augmentation techniques applied

### Training Procedure

#### Model Architecture Details

1. **Token Embedding Layer**
   - Embedding layer (vocab_size → 128)
   - Dropout rate: 0.3

2. **Local Swarm Aggregator**
   - Input dropout: 0.3
   - Local MLP:
     - Linear(128 → 128)
     - GELU
     - Dropout(0.3)
     - Linear(128 → 128)
   - Gate network with GELU

3. **Clustering Mechanism**
   - Cluster size: 8 tokens
   - Mean pooling per cluster

4. **Global Cluster Attention**
   - Q/K/V projections: Linear(128 → 128)
   - Attention dropout: 0.3

An illustrative sketch of the clustering and global cluster attention steps is included at the end of this card.

#### Training Hyperparameters

- Embedding dimension: 128
- Number of layers: 2
- Local update steps: 3
- Cluster size: 8
- Sequence length: 256
- Batch size: 96
- Learning rate: 4.76 × 10⁻⁴
- Weight decay: 0.0541
- Dropout: 0.30

## Evaluation

### Results

- Accuracy: 86.20%
- Precision: 83.46%
- Recall: 90.31%
- F1: 86.75%
- Inference time: 0.36 s (25k samples)
- Mean batch latency: 3.67 ms
- Throughput: 45k samples/s
- Peak memory: 8 GB

## Technical Specifications

### Compute Infrastructure

- GPU: NVIDIA RTX 2080 Ti
- VRAM: 8 GB minimum
- Training time: 3.6 minutes

### How to Get Started

```python
from swarmformer import SwarmFormerModel

model = SwarmFormerModel(
    vocab_size=30000,
    d_model=128,
    seq_len=256,
    cluster_size=8,
    num_layers=2,
    T_local=3
)
```

A fuller, illustrative inference example is included at the end of this card.

## Citation

```bibtex
@article{legg2025swarmformer,
  title={SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations},
  author={Legg, Jordan and Sturmanis, Mikus and {Takara.ai}},
  journal={Takara.ai Research},
  year={2025},
  url={https://takara.ai/papers/SwarmFormer-Local-Global-Hierarchical-Attention-via-Swarming-Token-Representations.pdf}
}
```

## Model Card Authors

Jordan Legg, Mikus Sturmanis, Takara.ai Research Team

## Model Card Contact

research@takara.ai
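## Illustrative Sketch: Clustering and Global Cluster Attention

The snippet below is a minimal, self-contained sketch of the clustering and global cluster attention steps described in the architecture details above, not the reference implementation: token representations are grouped into fixed windows of 8, mean-pooled into cluster vectors, attended over globally with 128 → 128 Q/K/V projections, and broadcast back to their member tokens. The tensor names and the single-head attention wiring are assumptions made for illustration; the local swarm aggregator and gating are omitted. See the repository for the actual code.

```python
import torch
import torch.nn as nn

d_model, cluster_size, seq_len = 128, 8, 256
batch = 2

# Token-level representations after the local swarm updates (random stand-in here).
tokens = torch.randn(batch, seq_len, d_model)

# Clustering mechanism: group tokens into fixed windows of 8 and mean-pool each window.
num_clusters = seq_len // cluster_size                     # 256 / 8 = 32 clusters
clusters = tokens.view(batch, num_clusters, cluster_size, d_model).mean(dim=2)

# Global cluster attention (illustrative single-head version, 128 -> 128 projections).
q_proj = nn.Linear(d_model, d_model)
k_proj = nn.Linear(d_model, d_model)
v_proj = nn.Linear(d_model, d_model)
attn_dropout = nn.Dropout(0.3)

q, k, v = q_proj(clusters), k_proj(clusters), v_proj(clusters)
scores = q @ k.transpose(-2, -1) / d_model ** 0.5          # (batch, 32, 32)
weights = attn_dropout(torch.softmax(scores, dim=-1))
cluster_out = weights @ v                                  # (batch, 32, 128)

# Broadcast the updated cluster representations back to their member tokens.
token_update = cluster_out.unsqueeze(2).expand(-1, -1, cluster_size, -1)
token_update = token_update.reshape(batch, seq_len, d_model)
print(token_update.shape)  # torch.Size([2, 256, 128])
```

The shapes mirror the configuration above: 256 tokens form 32 clusters of 8, each represented by a 128-dimensional vector.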
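## Illustrative Inference Example

Extending the constructor call shown under "How to Get Started", the sketch below runs an untrained model on a dummy batch. It assumes the model's forward pass accepts a `(batch_size, seq_len)` tensor of token IDs and returns per-class logits, and that label 1 corresponds to positive sentiment; the tokenizer, pretrained weights, and exact forward signature should be taken from the repository.

```python
import torch
from swarmformer import SwarmFormerModel

# Configuration used for SwarmFormer-Small.
model = SwarmFormerModel(
    vocab_size=30000,
    d_model=128,
    seq_len=256,
    cluster_size=8,
    num_layers=2,
    T_local=3,
)
model.eval()

# Dummy batch of two padded, pre-tokenized reviews (token IDs in [0, vocab_size)).
input_ids = torch.randint(0, 30000, (2, 256))

with torch.no_grad():
    logits = model(input_ids)        # assumed output shape: (batch_size, num_classes)
    preds = logits.argmax(dim=-1)    # assumed label order: 0 = negative, 1 = positive

print(preds.tolist())
```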