---
license: mit
language: en
library_name: transformers
tags:
- text-generation
- mixture-of-experts
- moe
- from-scratch
- ag_news
---

# Mixture-of-Experts Foundation Model: AdbhutMOE

**AdbhutMOE** is a miniature, from-scratch Mixture-of-Experts (MoE) autoregressive language model based on the Mixtral architecture. It was pre-trained on a sample of the `ag_news` dataset as part of a learning exercise to demonstrate the end-to-end pipeline for creating a sparse foundation model.

This model is intended for **educational purposes only**. It showcases how to configure and train an MoE model, which uses sparse activation to increase the parameter count while keeping the computational cost manageable.

- **Developed by:** [rohitnagareddy](https://huggingface.co/rohitnagareddy)
- **Model type:** Mixture-of-Experts Causal Language Model
- **Language:** English
- **License:** MIT

## How to Use

The model can be loaded for text generation with the `transformers` pipeline:

```python
from transformers import pipeline

# Load the model from the Hugging Face Hub
generator = pipeline('text-generation', model='rohitnagareddy/AdbhutMOE')

# Generate text
prompt = "The latest discovery in space exploration is"
output = generator(
    prompt,
    max_length=50,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    temperature=0.7,
    top_k=50
)

print(output[0]['generated_text'])
```

## Model Architecture

**AdbhutMOE** is a small-scale MoE model with the following configuration:

- **Number of layers:** 4
- **Hidden dimension:** 256
- **Number of attention heads:** 4
- **Vocabulary size:** 8000
- **Maximum sequence length:** 256 positions
- **Total Experts per Layer:** 8
- **Activated Experts per Token:** 2

This architecture yields a significantly higher parameter count than a dense model of similar computational cost, demonstrating the core benefit of the MoE approach.

---

## Training Details

### Training Data

The model was pre-trained on a shuffled sample of the **`ag_news`** dataset.

- **Dataset:** `ag_news`
- **Sample Size:** 10,000 articles
- **Preprocessing:** The text of each article was extracted and used for training after filtering out empty examples.

### Training Procedure

The model was pre-trained with the Hugging Face `Trainer` on a single GPU.

- **Framework:** PyTorch
- **Training Steps:** 100
- **Batch Size:** 4
- **Optimizer:** AdamW (default)
- **Objective:** Causal Language Modeling (including the router's auxiliary loss to encourage expert load balancing)

---

## Limitations and Intended Use

**This model is a proof of concept and is not suitable for any real-world application.** The primary goal of this project was to learn and demonstrate the MoE training pipeline. As a result, it has significant limitations:

1. **Limited Coherence:** While the model may be more capable than a dense model trained for the same number of steps, its output can still lack long-range coherence due to the limited training data and short training run.
2. **Confined Knowledge:** The model's knowledge is restricted to the 10,000 news articles it was trained on.
3. **Bias:** The model will reflect the biases inherent in the `ag_news` dataset.
4. **No Safety Alignment:** This is a raw, pre-trained base model and has not undergone any instruction tuning or RLHF. It should not be used in a public-facing capacity.

The intended use is for studying the configuration and training behavior of Mixture-of-Experts models.
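
As a starting point for such study, the architecture described above can be approximated with `transformers`' `MixtralConfig`. The sketch below is illustrative rather than a record of the actual training code: the listed values mirror the Model Architecture section, while `intermediate_size` and `num_key_value_heads` are assumptions that this card does not document.

```python
from transformers import MixtralConfig, MixtralForCausalLM

# Illustrative configuration mirroring the card's Model Architecture section.
config = MixtralConfig(
    vocab_size=8000,
    hidden_size=256,
    intermediate_size=512,     # assumption: expert FFN width is not stated in the card
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=4,     # assumption: no grouped-query attention
    max_position_embeddings=256,
    num_local_experts=8,       # total experts per layer
    num_experts_per_tok=2,     # experts activated per token
)

# Instantiate a randomly initialized model from the config and count parameters.
model = MixtralForCausalLM(config)
print(f"Total parameters: {model.num_parameters():,}")
```

Instantiating the configuration this way makes the MoE trade-off easy to inspect: the total parameter count grows with `num_local_experts`, while per-token compute is governed by `num_experts_per_tok`.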