---
library_name: transformers
tags: []
---

# tinyllamas_92M

## Model Details

```python
max_seq_len = 256
vocab_size = 8192
dim = 768
n_layers = 12
n_heads = 12
n_kv_heads = 12
```

### Training Data

- https://huggingface.co/datasets/roneneldan/TinyStories
- Tokenized using: https://github.com/karpathy/llama2.c?tab=readme-ov-file#custom-tokenizers

#### Training Hyperparameters

```python
batch_size = 64  # if gradient_accumulation_steps > 1, this is the micro-batch size
dropout = 0.0
# adamw optimizer
gradient_accumulation_steps = 8  # used to simulate larger batch sizes
learning_rate = 1e-3  # max learning rate
max_iters = 34000  # total number of training iterations
weight_decay = 3e-4
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0  # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True  # whether to decay the learning rate
warmup_iters = 1000  # how many steps to warm up for
```

### Results

Trained on 4x V100 GPUs.

```
Run summary:
  iter        34000
  loss/train  0.8704
  loss/val    0.9966
  tokens      983040000
```
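
### Loading the architecture (sketch)

The values under Model Details follow llama2.c's model arguments and map onto a Hugging Face `LlamaConfig` roughly as sketched below. This is a sketch, not the exported config of this checkpoint: `intermediate_size` is not listed above and is assumed from llama2.c's default FFN sizing (2/3 of 4×`dim`, rounded up to a multiple of 32), and all other settings fall back to `transformers` defaults.

```python
from transformers import LlamaConfig

# Sketch of how the llama2.c model args above could translate to a
# transformers LlamaConfig. intermediate_size is an assumption based on
# llama2.c's default FFN sizing (2/3 * 4 * dim, rounded up to a multiple
# of 32); it is not stated in this card.
config = LlamaConfig(
    vocab_size=8192,              # vocab_size
    hidden_size=768,              # dim
    num_hidden_layers=12,         # n_layers
    num_attention_heads=12,       # n_heads
    num_key_value_heads=12,       # n_kv_heads (no grouped-query attention)
    max_position_embeddings=256,  # max_seq_len
    intermediate_size=2048,       # assumed: llama2.c default FFN sizing
)
print(config)
```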
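
### Learning rate schedule (sketch)

With `decay_lr = True`, the hyperparameters above correspond to the linear-warmup-then-cosine-decay schedule used by llama2.c/nanoGPT-style training loops. The sketch below fills in the listed values; the decay horizon and minimum learning rate are not stated in this card, so `lr_decay_iters = max_iters` and `min_lr = learning_rate / 10` are assumptions.

```python
import math

# Warmup + cosine decay, parameterized with the hyperparameters listed above.
# Assumptions (not stated in the card): decay runs for the full max_iters,
# and min_lr is learning_rate / 10.
learning_rate = 1e-3
warmup_iters = 1000
lr_decay_iters = 34000       # assumed equal to max_iters
min_lr = learning_rate / 10  # assumed

def get_lr(it: int) -> float:
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) after the decay horizon, hold at min_lr
    if it > lr_decay_iters:
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr in between
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

print(get_lr(0), get_lr(1000), get_lr(17500), get_lr(34000))
```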
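
### Usage (sketch)

Assuming the checkpoint and the custom 8192-token tokenizer are available in this repository in the Hugging Face format, generation works as sketched below. The repo id is a placeholder; substitute the actual path of this repository.

```python
# Minimal generation sketch. "<user>/tinyllamas_92M" is a placeholder
# repo id, not the actual path of this repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<user>/tinyllamas_92M"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "Once upon a time, there was a little"
inputs = tokenizer(prompt, return_tensors="pt")
# max_seq_len is 256, so keep prompt + generated tokens within that budget
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```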