maharnab committed on
Commit 108499c · verified · 1 Parent(s): f6add3b

Create README.md

Files changed (1)
  1. README.md +112 -0
README.md ADDED
---
license: mit
datasets:
- flytech/python-codes-25k
tags:
- code
---
# GPT2 PyCode

<!-- Provide a quick summary of what the model is/does. -->

This model is a fine-tuned version of GPT-2 (124M parameters), adapted for Python code generation and intended for testing and experimentation. It was trained on a small corpus of 25,000 Python code samples.

### Model Description

<!-- Provide a longer summary of what this model is. -->

This project features a GPT-2 (Generative Pre-trained Transformer) language model with 124 million parameters that has been fine-tuned for Python code generation. Unlike larger variants such as GPT-2 Large or GPT-3, this small-scale model is designed primarily for testing and experimental purposes.

- **Developed by:** Maharnab Saikia
- **Model type:** Language model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** GPT-2 (124M)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

- **Research:** Studying the behavior of small-scale language models in code generation tasks
- **Benchmarking:** Providing a baseline for comparing different model architectures or training strategies
- **Rapid Prototyping:** Quick tests of code generation ideas without the overhead of larger models
- **Education:** Demonstrating the principles of fine-tuning language models for specific tasks

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

It's crucial to understand the limitations of this model:

- Limited knowledge base due to the small training corpus
- May struggle with complex or specialized Python code
- Not suitable for production-level code generation tasks
- Performance will likely be significantly lower than that of larger, more comprehensively trained models

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import re

# Load the tokenizer and model (replace 'gpt2' with this model's repository ID
# to use the fine-tuned weights and its special tokens).
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Wrap the prompt in the <sos><user>...</user><assistant> format used during fine-tuning.
prompt = "Replace me by any text you'd like."
encoded_input = tokenizer.encode_plus(f"<sos><user>{prompt}</user><assistant>", max_length=20, truncation=True, return_tensors="pt")

input_ids = encoded_input['input_ids']
attention_mask = encoded_input['attention_mask']

# Sample a completion with nucleus/top-k sampling.
output = model.generate(
    input_ids,
    max_length=512,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    temperature=0.7,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.pad_token_id
)

# Decode and keep only the text between the <assistant> tags.
generated_code = tokenizer.decode(output[0])
generated_code = re.search(r'<assistant>(.*?)</assistant>', generated_code, re.DOTALL).group(1)

print(f"Prompt: {prompt}\nGenerated Code:\n{generated_code}")
```

## Training Details

### Training Data

- **Model:** GPT-2 with 124 million parameters
- **Training Data:** 25,000 Python code samples from the flytech/python-codes-25k dataset
- **Fine-tuning:** Adapted specifically for Python code generation tasks

#### Training Hyperparameters

- **Epochs:** 5
- **Batch Size:** 8
- **Learning Rate:** 5e-5
- **Context Window:** 512 tokens

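The training script itself is not included in this card. The sketch below shows one plausible way to reproduce the setup with the Hugging Face `Trainer`, using the hyperparameters listed above and the `<sos><user>...</user><assistant>` prompt format from the inference example; the dataset column names (`instruction`, `output`), the exact set of added special tokens, and the output directory name are assumptions.

```python
# Minimal fine-tuning sketch (assumed details, not the exact script used for this model).
from datasets import load_dataset
from transformers import (GPT2LMHeadModel, GPT2Tokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Assumed special-token setup matching the prompt format used at inference time.
tokenizer.add_special_tokens({
    "pad_token": "<pad>",
    "additional_special_tokens": ["<sos>", "<user>", "</user>", "<assistant>", "</assistant>"],
})
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # account for the added tokens

dataset = load_dataset("flytech/python-codes-25k", split="train")

def to_features(example):
    # Wrap each sample in the same template the inference example expects.
    # 'instruction' and 'output' are assumed column names.
    text = f"<sos><user>{example['instruction']}</user><assistant>{example['output']}</assistant>"
    return tokenizer(text, truncation=True, max_length=512)  # 512-token context window

tokenized = dataset.map(to_features, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="gpt2-pycode",        # hypothetical output directory
    num_train_epochs=5,              # epochs from the card
    per_device_train_batch_size=8,   # batch size from the card
    learning_rate=5e-5,              # learning rate from the card
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```
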
## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** P100 GPU
- **Hours used:** 5
- **Cloud Provider:** Kaggle
- **Compute Region:** South Asia
- **Carbon Emitted:** 1.15 kg CO2eq

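The reported figure is consistent with a simple back-of-envelope estimate; the GPU power draw and grid carbon intensity below are assumptions for illustration, not the calculator's exact inputs.

```python
# Back-of-envelope check of the reported emissions (assumed values).
gpu_power_kw = 0.250      # Tesla P100 board power under load, in kW (approximate)
hours_used = 5            # training time reported above
grid_intensity = 0.92     # kg CO2eq per kWh (assumed regional average)

energy_kwh = gpu_power_kw * hours_used        # 1.25 kWh
emissions_kg = energy_kwh * grid_intensity    # ~1.15 kg CO2eq
print(f"~{energy_kwh:.2f} kWh, ~{emissions_kg:.2f} kg CO2eq")
```
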
## Acknowledgements

This project builds upon the GPT-2 model developed by OpenAI. We acknowledge their groundbreaking work in the field of natural language processing.