# CodeGen-ft-python

Generate Python code from natural language prompts.

## Model Details

### Model Description
This model is a fine-tuned variant of Salesforce/codegen-350M-mono, specialized for natural language to code generation in Python. It takes natural language instructions (e.g., “check MySQL database connection”) and generates the corresponding Python code snippet. The model was trained on a curated text-to-code dataset containing diverse programming instructions and function-level examples to improve semantic and syntactic accuracy.
- **Developed by:** Akshay Bharadwaj
- **Model type:** Transformer-based causal language model
- **Language(s) (NLP):** English (prompts) and Python (code outputs)
- **License:** MIT
- **Finetuned from model:** Salesforce/codegen-350M-mono
## Uses

### Direct Use

The model can be used for:

- Translating natural language prompts into functional Python code.
- Assisting in code autocompletion or boilerplate generation.
- Supporting educational and prototyping environments.
### Downstream Use

The model can be integrated into:

- Developer tools (IDE plugins or assistants).
- Chatbots for code assistance or educational coding tutors.
- LLM pipelines for multi-step reasoning or coding workflows (a minimal integration sketch follows this list).
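As an illustration of these integration paths, the snippet below wraps the model in a standard Hugging Face `text-generation` pipeline so it could back a simple code-assistant loop. This is a minimal sketch: the `suggest_code` helper and its prompt handling are illustrative assumptions, not part of the released model.

```python
from transformers import pipeline

# Serve the fine-tuned model through a standard text-generation pipeline.
generator = pipeline("text-generation", model="akshayb/nl-code-gen-python")

def suggest_code(instruction: str) -> str:
    """Hypothetical helper: map a natural language instruction to Python code."""
    result = generator(instruction, max_new_tokens=256, do_sample=False)
    # The pipeline returns the prompt followed by the generated continuation.
    return result[0]["generated_text"]

print(suggest_code("write a python function to check mysql database connection"))
```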
### Out-of-Scope Use

- Generating production-level code without human review.
- Security-critical or real-time applications (e.g., automated code execution).
- Generation of malicious or unsafe code.
## Bias, Risks, and Limitations

- The model may produce incomplete or syntactically incorrect code for ambiguous prompts.
- It can misinterpret vague natural language queries (semantic drift).
- It shows potential bias toward common Python idioms and limited handling of rare libraries or APIs.
## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "akshayb/nl-code-gen-python"

# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Encode a natural language instruction and generate the Python snippet.
prompt = "write a python function to check mysql database connection"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
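Greedy decoding, as used above, is deterministic. For more varied completions, sampling parameters can be passed to `generate`; the values below are illustrative starting points rather than tuned recommendations.

```python
# Sampled generation: a low temperature keeps code plausible while adding variety.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```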
## Training Details

### Training Data
The dataset contains paired natural language descriptions and Python function implementations, collected and cleaned from public code repositories and text-to-code benchmarks (e.g., CodeXGLUE). Preprocessing involved deduplication, tokenization, and removal of incomplete code samples.
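As a rough illustration of that cleaning step, the sketch below deduplicates prompt/code pairs and drops any sample whose code fails to parse. The function and field names are assumptions made for illustration; the actual preprocessing pipeline is not published.

```python
import ast

def clean_samples(samples):
    """Hypothetical cleaning pass: deduplicate and drop unparseable code."""
    seen = set()
    cleaned = []
    for sample in samples:  # assumed schema: {"prompt": str, "code": str}
        key = (sample["prompt"], sample["code"])
        if key in seen:  # deduplication
            continue
        seen.add(key)
        try:
            ast.parse(sample["code"])  # rejects incomplete or invalid Python
        except SyntaxError:
            continue
        cleaned.append(sample)
    return cleaned
```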
## Evaluation

### Metrics

For comparison between the base model and the fine-tuned model, we use the following metrics:
| Metric | Focus | Strength |
|---|---|---|
| BLEU | Token-level similarity | Measures fluency and lexical accuracy |
| CodeBLEU | Lexical + syntactic + semantic | Captures holistic code quality |
| Exact Match | String equality | Strict correctness measure |
| Syntax Match | AST structure | Validates syntactic and logical integrity |
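Of these, Exact Match and a simplified Syntax Match can be computed with the standard library alone, as sketched below. Note that CodeBLEU's actual syntax component scores matching AST subtrees; whole-tree equality is a stricter, illustrative stand-in. BLEU and CodeBLEU themselves require external packages and are not shown.

```python
import ast

def exact_match(prediction: str, reference: str) -> bool:
    """Strict string equality after trimming surrounding whitespace."""
    return prediction.strip() == reference.strip()

def syntax_match(prediction: str, reference: str) -> bool:
    """Simplified proxy: do the two snippets parse to identical ASTs?"""
    try:
        pred = ast.dump(ast.parse(prediction), annotate_fields=False)
        ref = ast.dump(ast.parse(reference), annotate_fields=False)
    except SyntaxError:
        return False
    return pred == ref
```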
## Citation

**BibTeX:**

```bibtex
@misc{akshay2025nlcodegen,
  title={Natural Language to Code Generation (Fine-tuned CodeGen-350M)},
  author={Akshay Bharadwaj},
  year={2025},
  howpublished={\url{https://huggingface.co/akshayb/nl-code-gen-python}}
}
```
### Framework versions

- PEFT 0.7.2.dev0