---
license: mit
datasets:
- mikasenghaas/wikitext-2
language:
- en
metrics:
- bleu
- rouge
- perplexity
- accuracy
base_model:
- openai-community/gpt2
tags:
- Quantized
- Pruned
- Small
- Nano
- SBC
pipeline_tag: text-generation
---
# Model Card: Pruned & Quantized GPT-2 Fine-Tuned on WikiText-2
## Model Summary
This model is a pruned and quantized version of the GPT-2 architecture, fine-tuned on the WikiText-2 dataset. The pruning and quantization techniques reduce the model's size and computational requirements, making it suitable for deployment in resource-constrained environments, such as edge devices or applications with limited computational power.
## Model Details
### Developed by
- **Developer:** SynSci
- **Contact:** [email protected]
### Model Description
- **Architecture:** GPT-2 (Generative Pre-trained Transformer 2)
- **Model Type:** Transformer-based language model
- **Base Model:** [openai-community/gpt2](https://huggingface.co/openai-community/gpt2)
- **Language:** English
- **License:** MIT
- **Fine-tuned on:** [mikasenghaas/wikitext-2](https://huggingface.co/datasets/mikasenghaas/wikitext-2)
- **Modifications:**
- **Pruning:** Redundant weights removed to decrease model size and inference time.
  - **Quantization:** Weights quantized to 8-bit integers to reduce memory footprint and improve inference efficiency (see the sketch below).
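For illustration, post-training dynamic quantization of this kind can be applied with stock PyTorch. The snippet below is a minimal sketch, not the exact script used to produce this checkpoint; note that GPT-2's transformer blocks use Hugging Face's `Conv1D` rather than `nn.Linear`, so as written only the `nn.Linear` layers are converted:
```python
import torch
from transformers import AutoModelForCausalLM

# Minimal sketch of post-training dynamic int8 quantization.
# Illustrative only: not the exact script behind this checkpoint.
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# quantize_dynamic swaps supported module types for int8 versions.
# GPT-2's blocks use Hugging Face Conv1D, which is outside the default
# mapping, so only nn.Linear layers are quantized here.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```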
### Direct Use
- Text generation
- Language modeling
- Autocomplete suggestions
- Educational purposes in NLP and model optimization techniques
### Downstream Use
- Integration into applications requiring efficient language models
- Deployment on devices with limited computational resources
### Out-of-Scope Use
- Generation of misleading or harmful content
- Applications requiring understanding of languages other than English
- Tasks demanding high-precision language understanding beyond the model's capabilities
## Bias, Risks, and Limitations
### Biases
The model inherits biases present in the GPT-2 architecture and the WikiText-2 dataset, which consists of Wikipedia articles. These biases may include underrepresentation of certain topics or perspectives.
### Risks
- Potential generation of biased or inappropriate content
- Misinterpretation of generated text as factual information
### Limitations
- Reduced performance compared to the full-sized GPT-2 model due to pruning and quantization
- Limited to English language understanding and generation
- Not suitable for tasks requiring real-time processing of large-scale data
### Recommendations
Users should:
- Implement content filtering mechanisms to prevent the generation of inappropriate content.
- Avoid using the model for critical applications without thorough evaluation.
- Be aware of the model's limitations in understanding nuanced language and context.
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the pruned & quantized model from the Hub
tokenizer = AutoTokenizer.from_pretrained("swayamsingal/NanoQuant")
model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")

# Generate a short continuation of a prompt
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details
### Training Data
- **Dataset:** [mikasenghaas/wikitext-2](https://huggingface.co/datasets/mikasenghaas/wikitext-2)
- **Description:** A collection of roughly 2 million tokens extracted from verified Good and Featured articles on Wikipedia (WikiText-2 is the small variant of the WikiText corpus; the ~100-million-token figure belongs to WikiText-103). The dataset is available under the Creative Commons Attribution-ShareAlike License.
### Training Procedure
- **Preprocessing:** Standard tokenization and formatting compatible with GPT-2 requirements.
- **Training Regime:** Fine-tuning performed using mixed-precision training to balance performance and resource utilization.
- **Pruning:** Magnitude-based pruning applied to remove weights below a magnitude threshold (see the sketch after this list).
- **Quantization:** Post-training dynamic quantization of the weights to 8-bit integers, as sketched under *Modifications* above.
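The pruning step can be illustrated with `torch.nn.utils.prune`; this is a minimal sketch under assumed settings, since the sparsity level used for this model is not published:
```python
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM
from transformers.pytorch_utils import Conv1D

# Sketch of magnitude-based (L1) unstructured pruning. The 30% sparsity
# is an assumed placeholder, not the value used for NanoQuant.
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
for module in model.modules():
    if isinstance(module, Conv1D):  # GPT-2's attention/MLP projections
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the zeros into the weights
```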
### Hyperparameters
- **Learning Rate:** 5e-5
- **Batch Size:** 32
- **Epochs:** 3
- **Optimizer:** AdamW
- **Weight Decay:** 0.01
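For reference, these settings map naturally onto Hugging Face `TrainingArguments`. The mapping below is a hypothetical reconstruction, since the actual training script is not published:
```python
from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters; output_dir is
# an assumed name. AdamW is the Trainer's default optimizer.
training_args = TrainingArguments(
    output_dir="nanoquant-gpt2-wikitext2",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,  # mixed-precision regime noted above; backend support varies
)
```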
### Speeds, Sizes, Times
- **Original Model Size:** ~500 MB
- **Pruned & Quantized Model Size:** ~6 MB
- **Training Time:** Approximately 2 hours on a single Apple Silicon GPU (PyTorch MPS backend)
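The on-disk size reduction can be checked by serializing the state dict; a minimal sketch:
```python
import os
import torch
from transformers import AutoModelForCausalLM

# Illustrative size check: serialize the weights and measure the file.
model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")
torch.save(model.state_dict(), "nanoquant.pt")
print(f"{os.path.getsize('nanoquant.pt') / 1e6:.1f} MB")
```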
## Evaluation
### Testing Data
- **Dataset:** [mikasenghaas/wikitext-2](https://huggingface.co/datasets/mikasenghaas/wikitext-2)
- **Split:** Validation set used for evaluation
### Metrics
- **Perplexity:** 155.43
- **BLEU Score:** 0.0498
- **ROUGE-1 Score:** 0.1836
- **Accuracy:** 93.2%
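The perplexity figure can be approximated with a standard chunked evaluation over the validation split. The sketch below assumes the dataset exposes a `text` column and a `validation` split, as the original WikiText-2 does; it is not the exact script behind the numbers above:
```python
import math
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative perplexity evaluation over non-overlapping 1024-token
# chunks (GPT-2's context window); not the exact evaluation script.
tokenizer = AutoTokenizer.from_pretrained("swayamsingal/NanoQuant")
model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")
model.eval()

data = load_dataset("mikasenghaas/wikitext-2", split="validation")
ids = tokenizer("\n\n".join(data["text"]), return_tensors="pt").input_ids

max_len, total_nll, total_tokens = 1024, 0.0, 0
for start in range(0, ids.size(1), max_len):
    chunk = ids[:, start:start + max_len]
    if chunk.size(1) < 2:  # need at least one predicted token
        break
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss  # mean NLL per predicted token
    total_nll += loss.item() * (chunk.size(1) - 1)
    total_tokens += chunk.size(1) - 1

print("perplexity:", math.exp(total_nll / total_tokens))
```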
### Results Summary
The pruned and quantized model remains usable for language modeling on the WikiText-2 validation set while cutting model size from roughly 500 MB to roughly 6 MB, with a corresponding reduction in inference time compared to the original GPT-2.
## Model Examination
While specific interpretability analyses were not conducted, the model's architecture remains consistent with GPT-2, and standard transformer interpretability techniques can be applied.
## Environmental Impact
- **Hardware Type:** Apple MacBook GPU (PyTorch MPS backend; no dedicated CUDA GPU was used)
- **Training Duration:** 2 hours
- **Energy Consumption:** Approximately 0.5 kWh
- **Carbon Emitted:** Estimated 0.2 kg CO₂
## Technical Specifications
### Model Architecture and Objective
- **Architecture:** Transformer decoder with 12 layers, 12 attention heads, and a hidden size of 768.
- **Objective:** Causal language modeling (predicting the next token in a sequence).
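Concretely, the model is trained to minimize the average negative log-likelihood of each token given its prefix:

$$
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
$$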
### Compute Infrastructure
- **Hardware:** Apple MacBook GPU via the PyTorch MPS backend (see Environmental Impact)
- **Software:** PyTorch, Transformers library by Hugging Face
## Citation
If you use this model, please cite:
```bibtex
@misc{NanoQuant,
  title        = {NanoQuant},
  author       = {swayamsingal},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/swayamsingal/NanoQuant}},
}
```
## Glossary
- **Pruning:** The process of removing weights from a neural network to reduce its size and computational requirements.
- **Quantization:** The process of reducing the precision of the weights in a neural network, typically to 8-bit integers, to decrease model size and increase inference speed.