---
library_name: transformers
tags: []
---

# tinyllamas_92M

## Model Details

```python
max_seq_len = 256
vocab_size = 8192
dim = 768
n_layers = 12
n_heads = 12
n_kv_heads = 12
```

### Training Data

- https://huggingface.co/datasets/roneneldan/TinyStories
- Tokenized using: https://github.com/karpathy/llama2.c?tab=readme-ov-file#custom-tokenizers

#### Training Hyperparameters

```python
batch_size = 64  # if gradient_accumulation_steps > 1, this is the micro-batch size
dropout = 0.0
# adamw optimizer
gradient_accumulation_steps = 8  # used to simulate larger batch sizes
learning_rate = 1e-3  # max learning rate
max_iters = 34000  # total number of training iterations
weight_decay = 3e-4
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0  # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True  # whether to decay the learning rate
warmup_iters = 1000  # how many steps to warm up for
```

### Results

Trained on 4x V100 GPUs.

```
Run summary:
  iter        34000
  loss/train  0.8704
  loss/val    0.9966
  tokens      983040000
```
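
### Loading the architecture (sketch)

The values under Model Details follow llama2.c's model arguments and map onto a Hugging Face `LlamaConfig` roughly as sketched below. This is a sketch, not the exported config of this checkpoint: `intermediate_size` is not listed above and is assumed from llama2.c's default FFN sizing (2/3 of 4×`dim`, rounded up to a multiple of 32), and all other settings fall back to `transformers` defaults.

```python
from transformers import LlamaConfig

# Sketch of how the llama2.c model args above could translate to a
# transformers LlamaConfig. intermediate_size is an assumption based on
# llama2.c's default FFN sizing (2/3 * 4 * dim, rounded up to a multiple
# of 32); it is not stated in this card.
config = LlamaConfig(
    vocab_size=8192,              # vocab_size
    hidden_size=768,              # dim
    num_hidden_layers=12,         # n_layers
    num_attention_heads=12,       # n_heads
    num_key_value_heads=12,       # n_kv_heads (no grouped-query attention)
    max_position_embeddings=256,  # max_seq_len
    intermediate_size=2048,       # assumed: llama2.c default FFN sizing
)
print(config)
```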
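
### Learning rate schedule (sketch)

With `decay_lr = True`, the hyperparameters above correspond to the linear-warmup-then-cosine-decay schedule used by llama2.c/nanoGPT-style training loops. The sketch below fills in the listed values; the decay horizon and minimum learning rate are not stated in this card, so `lr_decay_iters = max_iters` and `min_lr = learning_rate / 10` are assumptions.

```python
import math

# Warmup + cosine decay, parameterized with the hyperparameters listed above.
# Assumptions (not stated in the card): decay runs for the full max_iters,
# and min_lr is learning_rate / 10.
learning_rate = 1e-3
warmup_iters = 1000
lr_decay_iters = 34000       # assumed equal to max_iters
min_lr = learning_rate / 10  # assumed

def get_lr(it: int) -> float:
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) after the decay horizon, hold at min_lr
    if it > lr_decay_iters:
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr in between
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

print(get_lr(0), get_lr(1000), get_lr(17500), get_lr(34000))
```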
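
### Usage (sketch)

Assuming the checkpoint and the custom 8192-token tokenizer are available in this repository in the Hugging Face format, generation works as sketched below. The repo id is a placeholder; substitute the actual path of this repository.

```python
# Minimal generation sketch. "<user>/tinyllamas_92M" is a placeholder
# repo id, not the actual path of this repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<user>/tinyllamas_92M"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "Once upon a time, there was a little"
inputs = tokenizer(prompt, return_tensors="pt")
# max_seq_len is 256, so keep prompt + generated tokens within that budget
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```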