---
library_name: transformers
tags:
- code
license: mit
datasets:
- iamtarun/python_code_instructions_18k_alpaca
pipeline_tag: text-generation
language:
- en
---

# PyCodeGen 350M


This model is a fine-tuned version of [codegen-350M-mono](https://huggingface.co/Salesforce/codegen-350M-mono) by Salesforce, trained on a Python code [dataset](https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca) using the QLoRA method.


## Pretrained model description

[codegen-350M-mono](https://huggingface.co/Salesforce/codegen-350M-mono)

Codegen-350M-mono belongs to a family of autoregressive models for program synthesis developed by Salesforce.
The model was first trained on The Pile, an 825.18 GiB English text corpus.
It was then adapted to code generation by training on BigQuery, a large collection of open-source code in multiple programming languages.
Finally, the model was specialized to Python by training on the BigPython dataset.


## Training Data

[python_code_instructions_18k_alpaca](https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca)

The dataset contains problem descriptions and the corresponding Python code.
It is derived from sahil2801/code_instructions_120k, with an added prompt column in Alpaca style.
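
As a quick illustration, the data can be inspected with the `datasets` library (a minimal sketch; the column names `instruction`, `input`, and `output` are assumed from the Alpaca-style format):

```py
from datasets import load_dataset

# Load the instruction-tuning dataset from the Hugging Face Hub
dataset = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split="train")

# Each record pairs a natural-language task with Python code
example = dataset[0]
print(example["instruction"])
print(example["output"])
```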

## Intended uses

The model can be used to generate Python code that solves a given task, optionally using provided input data.


## Example of usage

```py
from transformers import AutoModelForCausalLM, AutoTokenizer


# Load the fine-tuned model and tokenizer from the Hub
model = AutoModelForCausalLM.from_pretrained('chincyk/PyCodeGen')
tokenizer = AutoTokenizer.from_pretrained('chincyk/PyCodeGen')

instruction = "Write a python class that represents a calculator, then use it to add two numbers."
input_data = "a = 5, b = 2"

# Build the prompt in the same Alpaca-style format used during fine-tuning
prompt = f"""
    ### Instruction:
    Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

    ### Task:
    {instruction}

    ### Input:
    {input_data}

    ### Response:
    """

input_ids = tokenizer(prompt, truncation=True, return_tensors="pt")['input_ids']
output = model.generate(input_ids=input_ids, max_length=200)

print(tokenizer.decode(output[0], skip_special_tokens=True))

```

## Training parameters

BitsAndBytes:
- load_in_4bit: True
- bnb_4bit_quant_type: nf4
- bnb_4bit_use_double_quant: True
- bnb_4bit_compute_dtype: torch.bfloat16
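
For reference, a minimal sketch of how these settings map onto a `transformers.BitsAndBytesConfig`:

```py
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```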

LoraConfig:
- r: 32
- lora_alpha: 16
- target_modules: all-linear
- lora_dropout: 0.1
- bias: none
- task_type: CAUSAL_LM
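
A corresponding `peft.LoraConfig` sketch with the values listed above:

```py
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules="all-linear",  # apply LoRA to all linear layers
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
```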

Finetuning:
- num_epochs: 15
- train_batch_size: 4
- eval_batch_size: 8
- gradient_accumulation_steps: 8
- learning_rate: 3e-4
- weight_decay: 0.01
- lr_scheduler_name: cosine
- num_warmup_steps: 190
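
Roughly, these hyperparameters correspond to a `transformers.TrainingArguments` setup like the following (a sketch, not the exact training script; the output directory is an assumption):

```py
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="pycodegen-350m-qlora",  # hypothetical output path
    num_train_epochs=15,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,
    learning_rate=3e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_steps=190,
)
```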