maharnab committed on
Commit 108499c · verified · 1 Parent(s): f6add3b

Create README.md

Files changed (1)
  1. README.md +112 -0
README.md ADDED
---
license: mit
datasets:
- flytech/python-codes-25k
tags:
- code
---
# GPT2 PyCode

<!-- Provide a quick summary of what the model is/does. -->

This model is a fine-tuned version of GPT-2 (124M parameters), adapted for Python code generation and intended for testing and experimentation. It was trained on a small corpus of 25,000 Python code samples.

### Model Description

<!-- Provide a longer summary of what this model is. -->

This project features a GPT-2 (Generative Pre-trained Transformer) language model with 124 million parameters that has been fine-tuned for Python code generation. Unlike larger variants such as GPT-2 Large or GPT-3, this small-scale model is designed primarily for testing and experimental purposes.

- **Developed by:** Maharnab Saikia
- **Model type:** Language model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** GPT-2 (124M)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

- **Research:** Studying the behavior of small-scale language models in code generation tasks
- **Benchmarking:** Providing a baseline for comparing different model architectures or training strategies
- **Rapid Prototyping:** Quick tests of code generation ideas without the overhead of larger models
- **Education:** Demonstrating the principles of fine-tuning language models for specific tasks

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

It's crucial to understand the limitations of this model:

- Limited knowledge base due to the small training corpus
- May struggle with complex or specialized Python code
- Not suitable for production-level code generation tasks
- Performance will likely be significantly lower than that of larger, more comprehensively trained models

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import re

# Load the tokenizer and model (replace 'gpt2' with this model's repository ID
# to use the fine-tuned weights and its special tokens).
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Wrap the prompt in the <sos><user>...</user><assistant> format used during fine-tuning.
prompt = "Replace me by any text you'd like."
encoded_input = tokenizer.encode_plus(f"<sos><user>{prompt}</user><assistant>", max_length=20, truncation=True, return_tensors="pt")

input_ids = encoded_input['input_ids']
attention_mask = encoded_input['attention_mask']

# Sample a completion with nucleus/top-k sampling.
output = model.generate(
    input_ids,
    max_length=512,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    temperature=0.7,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.pad_token_id
)

# Decode and keep only the text between the <assistant> tags.
generated_code = tokenizer.decode(output[0])
generated_code = re.search(r'<assistant>(.*?)</assistant>', generated_code, re.DOTALL).group(1)

print(f"Prompt: {prompt}\nGenerated Code:\n{generated_code}")
```

## Training Details

### Training Data

- **Model:** GPT-2 with 124 million parameters
- **Training Data:** 25,000 Python code samples from the flytech/python-codes-25k dataset
- **Fine-tuning:** Adapted specifically for Python code generation tasks

#### Training Hyperparameters

- **Epochs:** 5
- **Batch Size:** 8
- **Learning Rate:** 5e-5
- **Context Window:** 512 tokens

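The training script itself is not included in this card. The sketch below shows one plausible way to reproduce the setup with the Hugging Face `Trainer`, using the hyperparameters listed above and the `<sos><user>...</user><assistant>` prompt format from the inference example; the dataset column names (`instruction`, `output`), the exact set of added special tokens, and the output directory name are assumptions.

```python
# Minimal fine-tuning sketch (assumed details, not the exact script used for this model).
from datasets import load_dataset
from transformers import (GPT2LMHeadModel, GPT2Tokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Assumed special-token setup matching the prompt format used at inference time.
tokenizer.add_special_tokens({
    "pad_token": "<pad>",
    "additional_special_tokens": ["<sos>", "<user>", "</user>", "<assistant>", "</assistant>"],
})
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # account for the added tokens

dataset = load_dataset("flytech/python-codes-25k", split="train")

def to_features(example):
    # Wrap each sample in the same template the inference example expects.
    # 'instruction' and 'output' are assumed column names.
    text = f"<sos><user>{example['instruction']}</user><assistant>{example['output']}</assistant>"
    return tokenizer(text, truncation=True, max_length=512)  # 512-token context window

tokenized = dataset.map(to_features, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="gpt2-pycode",        # hypothetical output directory
    num_train_epochs=5,              # epochs from the card
    per_device_train_batch_size=8,   # batch size from the card
    learning_rate=5e-5,              # learning rate from the card
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```
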
## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** P100 GPU
- **Hours used:** 5
- **Cloud Provider:** Kaggle
- **Compute Region:** South Asia
- **Carbon Emitted:** 1.15 kg CO2eq

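The reported figure is consistent with a simple back-of-envelope estimate; the GPU power draw and grid carbon intensity below are assumptions for illustration, not the calculator's exact inputs.

```python
# Back-of-envelope check of the reported emissions (assumed values).
gpu_power_kw = 0.250      # Tesla P100 board power under load, in kW (approximate)
hours_used = 5            # training time reported above
grid_intensity = 0.92     # kg CO2eq per kWh (assumed regional average)

energy_kwh = gpu_power_kw * hours_used        # 1.25 kWh
emissions_kg = energy_kwh * grid_intensity    # ~1.15 kg CO2eq
print(f"~{energy_kwh:.2f} kWh, ~{emissions_kg:.2f} kg CO2eq")
```
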
## Acknowledgements

This project builds upon the GPT-2 model developed by OpenAI. We acknowledge their groundbreaking work in the field of natural language processing.