---
license: mit
datasets:
  - mikasenghaas/wikitext-2
language:
  - en
metrics:
  - bleu
  - rouge
  - perplexity
  - accuracy
base_model:
  - openai-community/gpt2
tags:
  - Quantized
  - Pruned
  - Small
  - Nano
  - SBC
pipeline_tag: text-generation
---

# Model Card: Pruned & Quantized GPT-2 Fine-Tuned on WikiText-2

## Model Summary

This model is a pruned and quantized version of GPT-2, fine-tuned on the WikiText-2 dataset. Pruning and quantization reduce the model's size and computational requirements, making it suitable for deployment in resource-constrained environments such as edge devices and single-board computers.

## Model Details

### Developed by

swayamsingal

### Model Description

- Architecture: GPT-2 (Generative Pre-trained Transformer 2)
- Model Type: Transformer-based language model
- Base Model: openai-community/gpt2
- Language: English
- License: MIT
- Fine-tuned on: mikasenghaas/wikitext-2
- Modifications:
  - Pruning: Redundant weights removed to decrease model size and inference time.
  - Quantization: Weights quantized to 8-bit integers to reduce memory footprint and improve efficiency.

## Direct Use

- Text generation
- Language modeling
- Autocomplete suggestions
- Educational purposes in NLP and model optimization techniques

## Downstream Use

- Integration into applications requiring efficient language models
- Deployment on devices with limited computational resources

## Out-of-Scope Use

- Generation of misleading or harmful content
- Applications requiring understanding of languages other than English
- Tasks demanding high-precision language understanding beyond the model's capabilities

## Bias, Risks, and Limitations

### Biases

The model inherits biases present in the pretrained GPT-2 model and in the WikiText-2 dataset, which consists of Wikipedia articles. These biases may include underrepresentation of certain topics or perspectives.

### Risks

- Potential generation of biased or inappropriate content
- Misinterpretation of generated text as factual information

### Limitations

- Reduced performance compared to the full-sized GPT-2 model due to pruning and quantization
- Limited to English language understanding and generation
- Not suitable for tasks requiring real-time processing of large-scale data

### Recommendations

Users should:

- Implement content filtering mechanisms to prevent the generation of inappropriate content.
- Avoid using the model for critical applications without thorough evaluation.
- Be aware of the model's limitations in understanding nuanced language and context.

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("swayamsingal/NanoQuant")
model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")

input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Training Data

- Dataset: mikasenghaas/wikitext-2
- Description: WikiText-2 contains roughly 2 million tokens extracted from verified Good and Featured articles on Wikipedia and is available under the Creative Commons Attribution-ShareAlike License.

### Training Procedure

- Preprocessing: Standard tokenization and formatting compatible with GPT-2 requirements.
- Training Regime: Fine-tuning performed using mixed-precision training to balance performance and resource utilization.
- Pruning: Magnitude-based pruning applied to remove weights below a chosen threshold (see the sketch below).
- Quantization: Post-training dynamic quantization of weights to 8-bit integers.
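
The following is a minimal, illustrative sketch of those two steps using PyTorch's built-in utilities (magnitude-based pruning via `torch.nn.utils.prune` and post-training dynamic quantization via `torch.quantization.quantize_dynamic`). The 30% pruning ratio is an assumed placeholder, not the exact threshold used here, and this is not the exact pipeline behind this checkpoint:

```python
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM
from transformers.pytorch_utils import Conv1D

# Start from the base GPT-2 checkpoint.
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# 1) Magnitude-based (L1) unstructured pruning of the attention/MLP projections.
#    GPT-2 stores these as Conv1D modules; the 0.3 ratio is only illustrative.
for module in model.modules():
    if isinstance(module, (Conv1D, torch.nn.Linear)):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the zeroed weights permanent

# 2) Post-training dynamic quantization: weights stored as int8, activations
#    quantized on the fly at inference time (applies to the nn.Linear modules
#    passed in, e.g. the LM head).
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```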

### Hyperparameters

- Learning Rate: 5e-5
- Batch Size: 32
- Epochs: 3
- Optimizer: AdamW (see the fine-tuning sketch below)
- Weight Decay: 0.01
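
For reference, a fine-tuning setup matching these hyperparameters might look roughly like the sketch below. The tokenization details (text column name, 512-token maximum length, filtering of empty lines) and the commented-out mixed-precision flag are assumptions, not an exact reproduction of the training script:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Assumes the dataset exposes a "text" column and a "train" split.
dataset = load_dataset("mikasenghaas/wikitext-2")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset["train"].column_names,
)
tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) > 0)  # drop empty lines

args = TrainingArguments(
    output_dir="nanoquant-finetune",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,  # the Trainer's default optimizer is AdamW
    # fp16=True,        # the card reports mixed precision; enable as the backend allows
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```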

### Speeds, Sizes, Times

- Original Model Size: ~500 MB
- Pruned & Quantized Model Size: ~6 MB
- Training Time: approximately 2 hours on a single Apple Silicon GPU (MPS backend)

## Evaluation

### Testing Data

WikiText-2 validation split.

### Metrics

- Perplexity: 155.43
- BLEU Score: 0.0498
- ROUGE-1 Score: 0.1836
- Accuracy: 93.2%

### Results Summary

The pruned and quantized model achieves competitive performance on the WikiText-2 validation set, with a significant reduction in model size and inference time compared to the original GPT-2 model.
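
As a rough way to reproduce the perplexity figure above, a chunked evaluation over the validation split can be run as sketched below. The split name, text column, and 512-token block size are assumptions; this is not the exact evaluation script behind the reported numbers:

```python
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("swayamsingal/NanoQuant")
model = AutoModelForCausalLM.from_pretrained("swayamsingal/NanoQuant")
model.eval()

# Concatenate the validation text and tokenize it as one long sequence.
val = load_dataset("mikasenghaas/wikitext-2", split="validation")
ids = tokenizer("\n\n".join(val["text"]), return_tensors="pt").input_ids[0]

block = 512
nlls, n_tokens = [], 0
with torch.no_grad():
    for start in range(0, ids.size(0) - 1, block):
        chunk = ids[start : start + block + 1].unsqueeze(0)
        if chunk.size(1) < 2:
            break
        # With labels == input_ids the model returns the mean next-token NLL.
        out = model(chunk, labels=chunk)
        n = chunk.size(1) - 1  # number of predicted tokens in this chunk
        nlls.append(out.loss * n)
        n_tokens += n

perplexity = math.exp(torch.stack(nlls).sum().item() / n_tokens)
print(f"perplexity: {perplexity:.2f}")
```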

## Model Examination

While specific interpretability analyses were not conducted, the model's architecture remains consistent with GPT-2, and standard transformer interpretability techniques can be applied.

## Environmental Impact

- Hardware Type: Apple MacBook, MPS backend (🙂‍↕️ can't afford a good CUDA GPU)
- Training Duration: 2 hours
- Energy Consumption: approximately 0.5 kWh
- Carbon Emitted: estimated 0.2 kg CO₂

## Technical Specifications

### Model Architecture and Objective

- Architecture: Transformer decoder with 12 layers, 12 attention heads, and a hidden size of 768 (see the config check below).
- Objective: Causal language modeling (predicting the next token in a sequence).
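
These dimensions correspond to the stock GPT-2 small configuration and can be confirmed directly from the published config; a quick sketch:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("swayamsingal/NanoQuant")
print(config.n_layer, config.n_head, config.n_embd)  # expected: 12 12 768
```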

### Compute Infrastructure

- Hardware: Single Apple Silicon GPU (PyTorch MPS backend)
- Software: PyTorch and the Hugging Face Transformers library

## Citation

If you use this model, please cite:

```bibtex
@misc{NanoQuant,
  title={NanoQuant},
  author={swayamsingal},
  year={2025},
  howpublished={\url{https://huggingface.co/swayamsingal/NanoQuant}},
}
```

## Glossary

- Pruning: The process of removing weights from a neural network to reduce its size and computational requirements.
- Quantization: The process of reducing the precision of the weights in a neural network, typically to 8-bit integers, to decrease model size and increase inference speed.