---
license: mit
datasets:
- flytech/python-codes-25k
tags:
- code
---
# GPT2 PyCode

<!-- Provide a quick summary of what the model is/does. -->

This model is a fine-tuned version of the GPT-2 124M model, adapted to Python code generation for testing purposes. It was trained on a small corpus of 25,000 Python code samples.

### Model Description

<!-- Provide a longer summary of what this model is. -->

This project features a GPT (Generative Pre-trained Transformer) language model with 124 million parameters that has been fine-tuned for Python code generation. Unlike larger models such as GPT-2 Large or GPT-3, it is a small-scale model designed primarily for testing and experimental purposes.

- **Developed by:** Maharnab Saikia
- **Model type:** Language model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** GPT2 124M

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

- **Research:** Studying the behavior of small-scale language models in code generation tasks
- **Benchmarking:** Providing a baseline for comparing different model architectures or training strategies
- **Rapid Prototyping:** Quick tests of code generation ideas without the overhead of larger models
- **Education:** Demonstrating the principles of fine-tuning language models for specific tasks

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

It's crucial to understand the limitations of this model:

- Limited knowledge base due to the small training corpus
- May struggle with complex or specialized Python code
- Not suitable for production-level code generation tasks
- Performance will likely be significantly lower than that of larger, more comprehensively trained models

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import re

# Load the tokenizer and model.
# Replace 'gpt2' with this model's Hugging Face repository id to use the fine-tuned weights.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

prompt = "Replace me by any prompt you'd like."

# Wrap the prompt in the chat-style tags this model expects.
encoded_input = tokenizer.encode_plus(
    f"<sos><user>{prompt}</user><assistant>",
    max_length=20,
    truncation=True,
    return_tensors="pt"
)

input_ids = encoded_input['input_ids']
attention_mask = encoded_input['attention_mask']

output = model.generate(
    input_ids,
    max_length=512,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    temperature=0.7,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no dedicated pad token
)

# Extract the reply from between the <assistant>...</assistant> tags.
generated_code = tokenizer.decode(output[0])
generated_code = re.search(r'<assistant>(.*?)</assistant>', generated_code, re.DOTALL).group(1)

print(f"Prompt: {prompt}\nGenerated Code:\n{generated_code}")
```

## Training Details

### Training Data

- **Model:** GPT-2 with 124 million parameters
- **Training Data:** 25,000 Python code samples from the `flytech/python-codes-25k` dataset (see the loading sketch below)
- **Fine-tuning:** Adapted specifically for Python code generation tasks

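
For context, here is a minimal, hypothetical sketch of how the training examples could be pulled from `flytech/python-codes-25k` and wrapped in the `<sos><user>...</user><assistant>...</assistant>` format that the inference snippet above expects. The column names `instruction` and `output` are assumptions about the dataset schema, not details confirmed by this card.

```python
from datasets import load_dataset

# Load the 25k Python code samples referenced in this card.
dataset = load_dataset("flytech/python-codes-25k", split="train")

def to_training_text(example):
    # Assumed column names; adjust to the dataset's actual schema.
    prompt = example["instruction"]
    code = example["output"]
    # Wrap each pair in the same tags the inference snippet uses.
    return {"text": f"<sos><user>{prompt}</user><assistant>{code}</assistant>"}

formatted = dataset.map(to_training_text)
print(formatted[0]["text"][:200])
```
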
#### Training Hyperparameters

- **Epochs:** 5
- **Batch Size:** 8
- **Learning Rate:** 5e-5
- **Context Window:** 512

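
One way these hyperparameters might translate into a training script, continuing from the dataset sketch above, is outlined below. This is a sketch under the assumption that the Hugging Face `Trainer` API was used; the author's actual training code may differ.

```python
from transformers import (GPT2LMHeadModel, GPT2Tokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token       # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

def tokenize(batch):
    # 512-token context window, as listed above.
    return tokenizer(batch["text"], truncation=True, max_length=512)

# `formatted` is the tag-wrapped dataset from the previous sketch.
tokenized = formatted.map(tokenize, batched=True, remove_columns=formatted.column_names)

args = TrainingArguments(
    output_dir="gpt2-pycode",
    num_train_epochs=5,              # Epochs: 5
    per_device_train_batch_size=8,   # Batch Size: 8
    learning_rate=5e-5,              # Learning Rate: 5e-5
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
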
## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** P100 GPU
- **Hours used:** 5
- **Cloud Provider:** Kaggle
- **Compute Region:** South Asia
- **Carbon Emitted:** 1.15

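
The calculator's estimate is essentially power draw × runtime × regional carbon intensity. The toy calculation below illustrates that structure only; the power-draw and grid-intensity values are placeholder assumptions for illustration, not figures taken from this card or from the calculator.

```python
# Rough sketch of the estimate's structure: energy used times grid carbon intensity.
gpu_power_kw = 0.25      # assumed ~250 W draw for a P100 (placeholder)
hours = 5                # training time reported above
carbon_intensity = 0.7   # assumed kg CO2eq per kWh for the compute region (placeholder)

emissions_kg = gpu_power_kw * hours * carbon_intensity
print(f"Estimated emissions: {emissions_kg:.2f} kg CO2eq")
```
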
## Acknowledgements

This project builds upon the GPT-2 model developed by OpenAI. We acknowledge their groundbreaking work in the field of natural language processing.