*Zenthia GPT: Achieving Efficient Training with Superior Convergence*
Introducing *Zenthia GPT*, a highly efficient model that saves 95% of the compute typically required to train GPT-2 while achieving better convergence, reaching a validation loss of *2.86*.
*Key Features:*
- *MiniPile Dataset*: Trained on a compact corpus of just *1 billion tokens* from MiniPile, a filtered subset of The Pile, enabling much quicker training cycles.
- *Optimized Resource Usage*: Utilizes the *Adam-mini optimizer*, cutting optimizer VRAM usage by roughly *50%*, so the model can train efficiently even on hardware with limited memory.
- *Context Size Adjustment*: The model uses a *context size of 256* instead of GPT-2's standard 1024, chosen so training fits within *8 GB of VRAM*. The shorter window eases memory pressure, though it may limit performance on tasks that require longer contexts (the configuration sketch after this list shows how these settings might fit together).
- *Improved HellaSwag Score*: Despite the reduced context size, the model reaches a *HellaSwag score of 0.26*, above the 0.25 random-chance baseline for this four-way benchmark and competitive for its training budget.
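
For readers who want to see how these pieces might fit together, here is a minimal PyTorch sketch. It assumes a Hugging Face GPT-2 backbone, the `JeanKaddour/minipile` dataset id, and the `adam_mini` package's `Adam_mini` constructor; it is an illustration of the setup described above, not the exact Zenthia GPT training code.

```python
# Illustrative sketch only: the dataset id and the Adam_mini constructor
# arguments are assumptions, not the exact Zenthia GPT training code.
from datasets import load_dataset                    # Hugging Face `datasets`
from transformers import GPT2Config, GPT2LMHeadModel
from adam_mini import Adam_mini                      # assumed import from the adam-mini package

# GPT-2-small architecture with the context window reduced from 1024 to 256
# so activations fit comfortably in 8 GB of VRAM.
config = GPT2Config(
    vocab_size=50257,
    n_positions=256,   # reduced context size
    n_embd=768,
    n_layer=12,
    n_head=12,
)
model = GPT2LMHeadModel(config).cuda()

# MiniPile: a ~1B-token filtered subset of The Pile (dataset id is an assumption).
train_data = load_dataset("JeanKaddour/minipile", split="train")

# Adam-mini keeps roughly one second-moment value per parameter block instead of
# one per parameter, cutting optimizer-state memory roughly in half versus AdamW.
optimizer = Adam_mini(
    named_parameters=model.named_parameters(),
    lr=6e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    dim=config.n_embd,
    n_heads=config.n_head,
)
```

For scale, vanilla AdamW stores two fp32 moment tensors per parameter, roughly 2 × 124M × 4 bytes ≈ 1 GB of optimizer state for a GPT-2-small model, so dropping the per-parameter second moment is where most of the reported memory saving comes from.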
*Trade-offs and Considerations*:
- The reduced context length may hurt performance on *HellaSwag benchmark* examples whose context plus ending exceeds 256 tokens, since those examples must be truncated before scoring (see the evaluation sketch after this list).
- *Potential Overfitting*: Because the model is trained on a small, high-quality dataset, it shows signs of possible overfitting. The data quality is high, but the limited dataset size could restrict its ability to generalize.
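
To make the context-length caveat concrete, here is a minimal sketch of the usual way GPT-style models are scored on HellaSwag: each of the four candidate endings is appended to the context, the average per-token loss over the ending is computed, and the ending with the lowest loss is chosen (random chance is 0.25). The `tokenizer`, the Hugging Face-style `model(x).logits` call, and the left-truncation to 256 tokens are illustrative assumptions rather than the exact evaluation code used here.

```python
import torch
import torch.nn.functional as F

def score_ending(model, tokenizer, context: str, ending: str, block_size: int = 256) -> float:
    """Average cross-entropy of the ending tokens given the context (lower = preferred)."""
    ctx_ids = tokenizer.encode(context)
    end_ids = tokenizer.encode(" " + ending)
    ids = ctx_ids + end_ids
    # With a 256-token window, long examples must be left-truncated,
    # which discards part of the context the ending depends on.
    ids = ids[-block_size:]
    n_end = min(len(end_ids), len(ids) - 1)

    x = torch.tensor([ids[:-1]], device="cuda")
    y = torch.tensor([ids[1:]], device="cuda")
    with torch.no_grad():
        logits = model(x).logits                      # shape (1, T, vocab)
    losses = F.cross_entropy(logits[0], y[0], reduction="none")
    return losses[-n_end:].mean().item()              # loss over the ending tokens only

def predict(model, tokenizer, context: str, endings: list[str]) -> int:
    """Pick the candidate ending with the lowest average loss; chance accuracy is 0.25."""
    scores = [score_ending(model, tokenizer, context, e) for e in endings]
    return min(range(len(endings)), key=lambda i: scores[i])
```

Under this scoring scheme, any example that gets truncated loses exactly the context the model would need to discriminate between endings, which is why a 256-token window can depress the benchmark score independently of model quality.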
Overall, *Zenthia GPT* strikes an effective balance between computational efficiency and performance, pushing the boundaries of what's possible with constrained resources.