saneowl committed · verified
Commit 9e7c31d · 1 Parent(s): 6b0486c

Update README.md

Files changed (1):
  1. README.md +186 -1

README.md CHANGED
---
license: mit
pipeline_tag: text-generation
library_name: transformers
---

# Random-Llama-Small

## Model Overview

**Random-Llama-Small** is a randomly initialized transformer-based language model with approximately 2 billion parameters, built using the LLaMA architecture. It is designed for research purposes, providing a starting point for pretraining or fine-tuning on custom datasets. The model uses the tokenizer from `HuggingFaceTB/SmolLM2-1.7B-Instruct` and is configured for causal language modeling. As a randomly initialized model, it produces incoherent outputs until trained, making it ideal for researchers studying transformer training dynamics or developing custom language models.

---

## Key Details

- **Architecture:** LLaMA (Causal Language Model)
- **Parameters:** ~2B
- **Hidden Size:** 2304
- **Layers:** 22
- **Attention Heads:** 36 (with 9 key-value heads for grouped-query attention)
- **Intermediate Size:** 9216
- **Vocabulary Size:** 128256
- **Tokenizer:** Imported from `HuggingFaceTB/SmolLM2-1.7B-Instruct`
- **Precision:** bfloat16
- **Max Context Length:** 131,072 tokens (with RoPE scaling)
- **License:** MIT

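Because the checkpoint is nothing more than a randomly initialized LLaMA, an equivalent model can be rebuilt locally from the hyperparameters above. Below is a minimal sketch using `LlamaConfig` and `LlamaForCausalLM` from `transformers` (RoPE scaling is omitted for brevity; the full shipped configuration is reproduced in the Model Configuration section later in this card). The printed parameter count should come out at roughly 2B.

```python
# Minimal sketch: build an equivalent randomly initialized model from the
# hyperparameters listed above, then count its parameters.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=128256,
    hidden_size=2304,
    num_hidden_layers=22,
    num_attention_heads=36,
    num_key_value_heads=9,        # grouped-query attention
    intermediate_size=9216,
    max_position_embeddings=131072,
    tie_word_embeddings=True,
)

model = LlamaForCausalLM(config).to(torch.bfloat16)

total = sum(p.numel() for p in model.parameters())
print(f"parameters: {total / 1e9:.2f}B")  # ~2B with tied embeddings
```
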
---

## LLaMA Architecture

The LLaMA architecture, developed by Meta AI, is a family of efficient transformer-based models optimized for research. Random-Llama-Small follows this design, incorporating several key features:

### Core Components

- **Decoder-Only Transformer:** Predicts the next token in a sequence based on prior tokens, suitable for autoregressive tasks like text generation.
- **Grouped-Query Attention (GQA):** 36 attention heads share only 9 key-value heads, reducing memory and compute cost during attention.
- **Rotary Position Embeddings (RoPE):** Encodes positional information with scaling, enabling a context length of up to 131,072 tokens.
- **SwiGLU Activation:** Uses a gated feed-forward network with SiLU (Swish) activation for improved expressiveness.
- **RMSNorm:** Root Mean Square Layer Normalization replaces LayerNorm for stability and faster convergence.
- **Tied Embeddings:** Input and output embeddings share weights (`tie_word_embeddings=True`), reducing parameter count by ~295M.

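As a quick illustration of the GQA and tied-embedding points, the sketch below loads the checkpoint and inspects the head grouping and the weight tying; the arithmetic uses only the values listed above.

```python
# Sketch: inspect GQA grouping and embedding tying on the released checkpoint.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("reflex-ai/random-llama-small")
cfg = model.config

# Grouped-query attention: several query heads share each key/value head.
group = cfg.num_attention_heads // cfg.num_key_value_heads
print(f"{cfg.num_attention_heads} query heads, {cfg.num_key_value_heads} KV heads "
      f"-> {group} query heads per KV head")  # 36, 9 -> 4

# Tied embeddings: the LM head reuses the input embedding matrix.
print("tied:", model.get_output_embeddings().weight is model.get_input_embeddings().weight)
print("embedding parameters:", cfg.vocab_size * cfg.hidden_size)  # ~295M
```
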
---

## Benefits of LLaMA Architecture

- **Efficiency:** High throughput, low memory use.
- **Scalability:** Works well across model sizes.
- **Flexibility:** Long-context support and task adaptability.
- **Research-Friendly:** Well suited to exploring attention, positional encoding, and training dynamics.

---

## Random-Llama-Small Specifics

This model ships with random weights and:
- Has ~2B parameters across 22 layers.
- Uses a hidden size of 2304 and an FFN size of 9216.
- Uses a 128,256-token vocabulary and bfloat16 precision.
- Supports extended context lengths of up to 131,072 tokens.

---

## Intended Use

- Research on transformer dynamics, optimization, or architectural changes.
- Baseline for pretraining or task-specific fine-tuning.
- Experimentation with scaling laws or custom architectures.

---

## Out-of-Scope Use

- **Not for direct production deployment.**
- **Not suitable for tasks needing coherence or accuracy without training.**

---

## Usage

### Requirements

- `transformers >= 4.45.0`
- `torch >= 2.0`
- GPU with ≥ 6GB VRAM (24GB+ for training)

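A quick way to confirm an environment meets these requirements before downloading the weights is a check along the following lines (a convenience sketch, not part of the original card):

```python
# Quick environment check against the requirements listed above.
import torch
import transformers

print("transformers:", transformers.__version__)  # want >= 4.45.0
print("torch:", torch.__version__)                # want >= 2.0
if torch.cuda.is_available():
    gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU: {torch.cuda.get_device_name(0)} ({gib:.1f} GiB VRAM)")
else:
    print("No CUDA GPU detected; inference will fall back to CPU.")
```
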
---

### Inference Example

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="reflex-ai/random-llama-small")
print(pipe(messages))
```

> Note: Outputs will be random and incoherent due to the model’s untrained state.

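For lower-level control than the pipeline, the model and tokenizer can also be loaded directly and sampled with `generate()`. The following is a rough sketch using the same repository id as above, not an official example from the repository:

```python
# Rough sketch: load the model and tokenizer directly and sample a few tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "reflex-ai/random-llama-small"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Expect gibberish: the weights are random until the model is trained.
```
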
---

### Training Example

```python
from transformers import (
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
    LlamaForCausalLM,
    AutoTokenizer,
)

model = LlamaForCausalLM.from_pretrained("reflex-ai/random-llama-small")
tokenizer = AutoTokenizer.from_pretrained("reflex-ai/random-llama-small")

training_args = TrainingArguments(
    output_dir="./random_llama_small_finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    bf16=True,  # match the model's native bfloat16 precision
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,  # a tokenized dataset you provide (see sketch below)
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

trainer.train()
```

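`your_dataset` above is a placeholder for a tokenized dataset you supply. One possible way to build it with the `datasets` library is sketched below; the corpus (`wikitext`) and sequence length are arbitrary examples, not recommendations from the model authors.

```python
# Sketch: build a tokenized dataset usable as `train_dataset` above.
# The corpus and max_length are placeholders; `tokenizer` comes from the
# training example.
from datasets import load_dataset

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
raw = raw.filter(lambda ex: len(ex["text"].strip()) > 0)  # drop empty lines

# The data collator needs a pad token; fall back to EOS if none is set.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

your_dataset = raw.map(tokenize, batched=True, remove_columns=["text"])
```
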
---

## Limitations

- **Random Initialization:** Needs significant training to be useful.
- **Resource Intensive:** High computational cost.
- **No Pretraining Data:** Users must provide their own.
- **Tokenizer Constraint:** May not suit all domains.

---

## Benefits and Potential

- **Customizability:** A blank slate for full control of objectives and data.
- **Research Insights:** Ideal for understanding early-stage LLM behavior.
- **Scalable Baseline:** Balances size and research feasibility.
- **Extended Context:** Useful for long-form tasks post-training.

---

## Model Configuration

```json
{
  "architectures": ["LlamaForCausalLM"],
  "hidden_size": 2304,
  "num_hidden_layers": 22,
  "num_attention_heads": 36,
  "num_key_value_heads": 9,
  "intermediate_size": 9216,
  "vocab_size": 128256,
  "max_position_embeddings": 131072,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "torch_dtype": "bfloat16",
  "tie_word_embeddings": true
}
```

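The same values can be read back programmatically from the Hub; a small sketch (the per-head dimension is simply `hidden_size / num_attention_heads`):

```python
# Sketch: fetch the configuration from the Hub and inspect a few derived values.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("reflex-ai/random-llama-small")
print(cfg.hidden_size // cfg.num_attention_heads)  # per-head dimension: 2304 / 36 = 64
print(cfg.rope_scaling)                            # llama3-style RoPE scaling, factor 32.0
print(cfg.max_position_embeddings)                 # 131072
```
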
---

## Ethical Considerations

- **Untrained Safety:** The untrained model poses no immediate risk of harmful outputs, but ethical considerations apply to the data and objectives used to train it.
- **Environmental Impact:** Large-scale training consumes significant energy; optimize training and prefer low-carbon compute where possible.
- **Accessibility:** Resource requirements may limit use by smaller research teams.

---

## Contact

For questions or issues, please open an issue on the Hugging Face repository.

> *Model card created on April 20, 2025.*