---
license: mit
datasets:
- mikasenghaas/wikitext-2
language:
- en
metrics:
- bleu
- rouge
- perplexity
- accuracy
base_model:
- openai-community/gpt2
tags:
- Quantized
- Pruned
- Small
- Nano
- SBC
pipeline_tag: text-generation
---
# Model Card: Pruned & Quantized GPT-2 Fine-Tuned on WikiText-2
## Model Summary
This model is a pruned and quantized version of the GPT-2 architecture, fine-tuned on the WikiText-2 dataset. The pruning and quantization techniques reduce the model's size and computational requirements, making it suitable for deployment in resource-constrained environments, such as edge devices or applications with limited computational power.
## Model Details
### Developed by
- **Developer:** SynSci
- **Contact:** [email protected]
### Model Description
- **Architecture:** GPT-2 (Generative Pre-trained Transformer 2)
- **Model Type:** Transformer-based language model
- **Base Model:** [openai-community/gpt2](https://huggingface.co/openai-community/gpt2)
- **Language:** English
- **License:** MIT
- **Fine-tuned on:** [mikasenghaas/wikitext-2](https://huggingface.co/datasets/mikasenghaas/wikitext-2)
- **Modifications:**
- **Pruning:** Redundant weights removed to decrease model size and inference time.
  - **Quantization:** Weights quantized to 8-bit integers to reduce memory footprint and improve inference efficiency (see the sketch below).
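For illustration, post-training dynamic quantization of this kind can be applied with stock PyTorch. The snippet below is a minimal sketch, not the exact script used to produce this checkpoint; note that GPT-2's transformer blocks use Hugging Face's `Conv1D` rather than `nn.Linear`, so as written only the `nn.Linear` layers are converted:
```python
import torch
from transformers import AutoModelForCausalLM

# Minimal sketch of post-training dynamic int8 quantization.
# Illustrative only: not the exact script behind this checkpoint.
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# quantize_dynamic swaps supported module types for int8 versions.
# GPT-2's blocks use Hugging Face Conv1D, which is outside the default
# mapping, so only nn.Linear layers are quantized here.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```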
### Direct Use
- Text generation
- Language modeling
- Autocomplete suggestions
- Educational purposes in NLP and model optimization techniques
### Downstream Use
- Integration into applications requiring efficient language models
- Deployment on devices with limited computational resources
### Out-of-Scope Use
- Generation of misleading or harmful content
- Applications requiring understanding of languages other than English
- Tasks demanding high-precision language understanding beyond the model's capabilities
## Bias, Risks, and Limitations
### Biases
The model inherits biases present in the GPT-2 architecture and the WikiText-2 dataset, which consists of Wikipedia articles. These biases may include underrepresentation of certain topics or perspectives.
### Risks
- Potential generation of biased or inappropriate content
- Misinterpretation of generated text as factual information
### Limitations
- Reduced performance compared to the full-sized GPT-2 model due to pruning and quantization
- Limited to English language understanding and generation
- Not suitable for tasks requiring real-time processing of large-scale data
### Recommendations
Users should:
- Implement content filtering mechanisms to prevent the generation of inappropriate content.
- Avoid using the model for critical applications without thorough evaluation.
- Be aware of the model's limitations in understanding nuanced language and context.
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the pruned & quantized model from the Hub
tokenizer = AutoTokenizer.from_pretrained("swayamsingal/NanoQuant")
model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")

# Generate a short continuation of a prompt
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details
### Training Data
- **Dataset:** [mikasenghaas/wikitext-2](https://huggingface.co/datasets/mikasenghaas/wikitext-2)
- **Description:** A collection of roughly 2 million tokens extracted from verified Good and Featured articles on Wikipedia (WikiText-2 is the small variant of the WikiText corpus; the ~100-million-token figure belongs to WikiText-103). The dataset is available under the Creative Commons Attribution-ShareAlike License.
### Training Procedure
- **Preprocessing:** Standard tokenization and formatting compatible with GPT-2 requirements.
- **Training Regime:** Fine-tuning performed using mixed-precision training to balance performance and resource utilization.
- **Pruning:** Magnitude-based pruning applied to remove weights below a magnitude threshold (see the sketch after this list).
- **Quantization:** Post-training dynamic quantization of the weights to 8-bit integers, as sketched under *Modifications* above.
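The pruning step can be illustrated with `torch.nn.utils.prune`; this is a minimal sketch under assumed settings, since the sparsity level used for this model is not published:
```python
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM
from transformers.pytorch_utils import Conv1D

# Sketch of magnitude-based (L1) unstructured pruning. The 30% sparsity
# is an assumed placeholder, not the value used for NanoQuant.
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
for module in model.modules():
    if isinstance(module, Conv1D):  # GPT-2's attention/MLP projections
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the zeros into the weights
```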
### Hyperparameters
- **Learning Rate:** 5e-5
- **Batch Size:** 32
- **Epochs:** 3
- **Optimizer:** AdamW
- **Weight Decay:** 0.01
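For reference, these settings map naturally onto Hugging Face `TrainingArguments`. The mapping below is a hypothetical reconstruction, since the actual training script is not published:
```python
from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters; output_dir is
# an assumed name. AdamW is the Trainer's default optimizer.
training_args = TrainingArguments(
    output_dir="nanoquant-gpt2-wikitext2",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,  # mixed-precision regime noted above; backend support varies
)
```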
### Speeds, Sizes, Times
- **Original Model Size:** ~500 MB
- **Pruned & Quantized Model Size:** ~6 MB
- **Training Time:** Approximately 2 hours on a single Apple Silicon GPU (PyTorch MPS backend)
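The on-disk size reduction can be checked by serializing the state dict; a minimal sketch:
```python
import os
import torch
from transformers import AutoModelForCausalLM

# Illustrative size check: serialize the weights and measure the file.
model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")
torch.save(model.state_dict(), "nanoquant.pt")
print(f"{os.path.getsize('nanoquant.pt') / 1e6:.1f} MB")
```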
## Evaluation
### Testing Data
- **Dataset:** [mikasenghaas/wikitext-2](https://huggingface.co/datasets/mikasenghaas/wikitext-2)
- **Split:** Validation set used for evaluation
### Metrics
- **Perplexity:** 155.43
- **BLEU Score:** 0.0498
- **ROUGE-1 Score:** 0.1836
- **Accuracy:** 93.2%
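The perplexity figure can be approximated with a standard chunked evaluation over the validation split. The sketch below assumes the dataset exposes a `text` column and a `validation` split, as the original WikiText-2 does; it is not the exact script behind the numbers above:
```python
import math
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative perplexity evaluation over non-overlapping 1024-token
# chunks (GPT-2's context window); not the exact evaluation script.
tokenizer = AutoTokenizer.from_pretrained("swayamsingal/NanoQuant")
model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")
model.eval()

data = load_dataset("mikasenghaas/wikitext-2", split="validation")
ids = tokenizer("\n\n".join(data["text"]), return_tensors="pt").input_ids

max_len, total_nll, total_tokens = 1024, 0.0, 0
for start in range(0, ids.size(1), max_len):
    chunk = ids[:, start:start + max_len]
    if chunk.size(1) < 2:  # need at least one predicted token
        break
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss  # mean NLL per predicted token
    total_nll += loss.item() * (chunk.size(1) - 1)
    total_tokens += chunk.size(1) - 1

print("perplexity:", math.exp(total_nll / total_tokens))
```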
### Results Summary
The pruned and quantized model remains usable for language modeling on the WikiText-2 validation set while cutting model size from roughly 500 MB to roughly 6 MB, with a corresponding reduction in inference time compared to the original GPT-2.
## Model Examination
While specific interpretability analyses were not conducted, the model's architecture remains consistent with GPT-2, and standard transformer interpretability techniques can be applied.
## Environmental Impact
- **Hardware Type:** Apple MacBook GPU (PyTorch MPS backend; no dedicated CUDA GPU was used)
- **Training Duration:** 2 hours
- **Energy Consumption:** Approximately 0.5 kWh
- **Carbon Emitted:** Estimated 0.2 kg CO₂
## Technical Specifications
### Model Architecture and Objective
- **Architecture:** Transformer decoder with 12 layers, 12 attention heads, and a hidden size of 768.
- **Objective:** Causal language modeling (predicting the next token in a sequence).
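Concretely, the model is trained to minimize the average negative log-likelihood of each token given its prefix:

$$
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
$$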
### Compute Infrastructure
- **Hardware:** Apple MacBook GPU via the PyTorch MPS backend (see Environmental Impact)
- **Software:** PyTorch, Transformers library by Hugging Face
## Citation
If you use this model, please cite:
```bibtex
@misc{NanoQuant,
  title        = {NanoQuant},
  author       = {swayamsingal},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/swayamsingal/NanoQuant}},
}
```
## Glossary
- **Pruning:** The process of removing weights from a neural network to reduce its size and computational requirements.
- **Quantization:** The process of reducing the precision of the weights in a neural network, typically to 8-bit integers, to decrease model size and increase inference speed.