---
license: mit
language: en
library_name: transformers
tags:
- text-generation
- mixture-of-experts
- moe
- from-scratch
- ag_news
---

# Mixture-of-Experts Foundation Model: AdbhutMOE

**AdbhutMOE** is a miniature, from-scratch Mixture-of-Experts (MoE) autoregressive language model based on the Mixtral architecture. This model was pre-trained on a sample of the `ag_news` dataset as part of a learning exercise to demonstrate the end-to-end pipeline for creating a sparse foundation model.

This model is intended for **educational purposes only**. It showcases how to configure and train an MoE model, which routes each token to a small subset of experts so that the total parameter count grows while the per-token computational cost stays close to that of a much smaller dense model.

- **Developed by:** [rohitnagareddy](https://huggingface.co/rohitnagareddy)
- **Model type:** Mixture-of-Experts Causal Language Model
- **Language:** English
- **License:** MIT

## How to Use

The model can be easily loaded for text generation using the `transformers` library pipeline.

```python
from transformers import pipeline

# Load the model from the Hugging Face Hub
generator = pipeline('text-generation', model='rohitnagareddy/AdbhutMOE')

# Generate text
prompt = "The latest discovery in space exploration is"
output = generator(
    prompt,
    max_length=50,
    num_return_sequences=1,
    do_sample=True,  # sampling must be enabled for temperature/top_k to take effect
    no_repeat_ngram_size=2,
    temperature=0.7,
    top_k=50
)

print(output[0]['generated_text'])
```

## Model Architecture

**AdbhutMOE** is a small-scale MoE model with the following configuration:
- **Number of layers:** 4
- **Hidden dimension:** 256
- **Number of attention heads:** 4
- **Vocabulary size:** 8000
- **Maximum sequence length:** 256 tokens
- **Total Experts per Layer:** 8
- **Activated Experts per Token:** 2

This architecture gives the model a significantly higher total parameter count than a dense model with a comparable per-token compute budget, which is the core benefit of the MoE approach.
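
For reference, the configuration above can be reproduced with the stock `MixtralConfig` from `transformers`. This is a minimal sketch: the `intermediate_size` and `num_key_value_heads` values below are illustrative assumptions, not values stated in this card.

```python
from transformers import MixtralConfig, MixtralForCausalLM

# Rebuild a comparable configuration from the numbers listed above.
# intermediate_size and num_key_value_heads are assumed values.
config = MixtralConfig(
    vocab_size=8000,
    hidden_size=256,
    intermediate_size=512,       # assumed per-expert FFN width
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=4,       # assumed: plain multi-head attention
    max_position_embeddings=256,
    num_local_experts=8,         # total experts per MoE layer
    num_experts_per_tok=2,       # experts activated per token
    output_router_logits=True,   # include the router's load-balancing auxiliary loss during training
)

model = MixtralForCausalLM(config)
print(f"Total parameters: {model.num_parameters():,}")
```

Instantiating the config this way makes the parameter-count comparison concrete: most of the weights sit in the eight expert FFNs per layer, while only two of them run for any given token.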

---

## Training Details

### Training Data

The model was pre-trained on a shuffled sample of the **`ag_news`** dataset.
- **Dataset:** `ag_news`
- **Sample Size:** 10,000 articles
- **Preprocessing:** The text of each article was extracted and used for training after filtering out empty examples (a sketch of this step follows the list).
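
A minimal sketch of this preparation step, assuming the standard `datasets` loader for `ag_news` (the shuffle seed is an assumption):

```python
from datasets import load_dataset

# Load the training split, shuffle, and keep a 10,000-article sample.
dataset = load_dataset("ag_news", split="train")
sample = dataset.shuffle(seed=42).select(range(10_000))  # seed chosen for illustration

# Drop empty examples, as described above.
sample = sample.filter(lambda ex: len(ex["text"].strip()) > 0)
print(sample)
```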

### Training Procedure

The model was pre-trained using the Hugging Face `Trainer` on a single GPU.

- **Framework:** PyTorch
- **Training Steps:** 100
- **Batch Size:** 4
- **Optimizer:** AdamW (default)
- **Objective:** Causal Language Modeling, with the router's auxiliary load-balancing loss added to encourage even utilization of the experts (a sketch of this setup follows the list).
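
The following is a sketch of such a setup rather than the exact training script. It assumes `model` is the `MixtralForCausalLM` built in the architecture sketch above, and that `tokenizer` and `tokenized_sample` are the project's 8,000-token tokenizer and the tokenized `ag_news` sample; `output_dir` and `logging_steps` are placeholders.

```python
from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hyperparameters mirror the list above; unlisted values are placeholders.
training_args = TrainingArguments(
    output_dir="adbhutmoe-pretrain",
    max_steps=100,
    per_device_train_batch_size=4,
    logging_steps=10,
    report_to="none",
)

trainer = Trainer(
    model=model,                     # MixtralForCausalLM from the sketch above
    args=training_args,
    train_dataset=tokenized_sample,  # tokenized ag_news sample (assumed)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

With `output_router_logits=True` set in the config sketch, the Hugging Face Mixtral implementation adds the auxiliary load-balancing loss to the language-modeling loss automatically during training.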

---

## Limitations and Intended Use

**This model is a proof-of-concept and is not suitable for any real-world application.**

The primary goal of this project was to learn and demonstrate the MoE training pipeline. As a result, it has significant limitations:

1.  **Limited Coherence:** With only 100 training steps on a small data sample, the output will lack long-range coherence, even though the MoE architecture provides more capacity than a dense model of comparable per-token compute.
2.  **Confined Knowledge:** The model's knowledge is restricted to the 10,000 news articles it was trained on.
3.  **Bias:** The model will reflect the biases inherent in the `ag_news` dataset.
4.  **No Safety Alignment:** This is a raw, pre-trained base model and has not undergone any instruction tuning or RLHF. It should not be used in a public-facing capacity.

The intended use is for studying the configuration and training behavior of Mixture-of-Experts models.