# CodeGen-ft-python

Generate Python code from natural language prompts.

## Model Details

### Model Description
This model is a fine-tuned variant of Salesforce/codegen-350M-mono, specialized for natural language to code generation in Python. It takes natural language instructions (e.g., “check MySQL database connection”) and generates the corresponding Python code snippet. The model was trained on a curated text-to-code dataset containing diverse programming instructions and function-level examples to improve semantic and syntactic accuracy.
- **Developed by:** Akshay Bharadwaj
- **Model type:** Transformer-based causal language model
- **Language(s) (NLP):** English (prompts) and Python (code outputs)
- **License:** MIT
- **Finetuned from model:** Salesforce/codegen-350M-mono
## Uses

### Direct Use

The model can be used for:

- Translating natural language prompts into functional Python code.
- Assisting in code autocompletion or boilerplate generation.
- Supporting educational and prototyping environments.
### Downstream Use

The model can be integrated into:

- Developer tools (IDE plugins or assistants).
- Chatbots for code assistance or educational coding tutors.
- LLM pipelines for multi-step reasoning or coding workflows (a minimal integration sketch follows this list).
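As an illustration of these integration paths, the snippet below wraps the model in a standard Hugging Face `text-generation` pipeline so it could back a simple code-assistant loop. This is a minimal sketch: the `suggest_code` helper and its prompt handling are illustrative assumptions, not part of the released model.

```python
from transformers import pipeline

# Serve the fine-tuned model through a standard text-generation pipeline.
generator = pipeline("text-generation", model="akshayb/nl-code-gen-python")

def suggest_code(instruction: str) -> str:
    """Hypothetical helper: map a natural language instruction to Python code."""
    result = generator(instruction, max_new_tokens=256, do_sample=False)
    # The pipeline returns the prompt followed by the generated continuation.
    return result[0]["generated_text"]

print(suggest_code("write a python function to check mysql database connection"))
```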
### Out-of-Scope Use

- Generating production-level code without human review.
- Security-critical or real-time applications (e.g., automated code execution).
- Generation of malicious or unsafe code.
## Bias, Risks, and Limitations

- The model may produce incomplete or syntactically incorrect code for ambiguous prompts.
- It can misinterpret vague natural language queries (semantic drift).
- It shows potential bias toward common Python idioms and limited handling of rare libraries or APIs.
## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "akshayb/nl-code-gen-python"

# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Encode a natural language instruction and generate the Python snippet.
prompt = "write a python function to check mysql database connection"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
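Greedy decoding, as used above, is deterministic. For more varied completions, sampling parameters can be passed to `generate`; the values below are illustrative starting points rather than tuned recommendations.

```python
# Sampled generation: a low temperature keeps code plausible while adding variety.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```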
## Training Details

### Training Data
The dataset contains paired natural language descriptions and Python function implementations, collected and cleaned from public code repositories and text-to-code benchmarks (e.g., CodeXGLUE). Preprocessing involved deduplication, tokenization, and removal of incomplete code samples.
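As a rough illustration of that cleaning step, the sketch below deduplicates prompt/code pairs and drops any sample whose code fails to parse. The function and field names are assumptions made for illustration; the actual preprocessing pipeline is not published.

```python
import ast

def clean_samples(samples):
    """Hypothetical cleaning pass: deduplicate and drop unparseable code."""
    seen = set()
    cleaned = []
    for sample in samples:  # assumed schema: {"prompt": str, "code": str}
        key = (sample["prompt"], sample["code"])
        if key in seen:  # deduplication
            continue
        seen.add(key)
        try:
            ast.parse(sample["code"])  # rejects incomplete or invalid Python
        except SyntaxError:
            continue
        cleaned.append(sample)
    return cleaned
```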
## Evaluation

### Metrics

For comparison between the base model and the fine-tuned model, we use the following metrics:
| Metric | Focus | Strength |
|---|---|---|
| BLEU | Token-level similarity | Measures fluency and lexical accuracy |
| CodeBLEU | Lexical + syntactic + semantic | Captures holistic code quality |
| Exact Match | String equality | Strict correctness measure |
| Syntax Match | AST structure | Validates syntactic and logical integrity |
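Of these, Exact Match and a simplified Syntax Match can be computed with the standard library alone, as sketched below. Note that CodeBLEU's actual syntax component scores matching AST subtrees; whole-tree equality is a stricter, illustrative stand-in. BLEU and CodeBLEU themselves require external packages and are not shown.

```python
import ast

def exact_match(prediction: str, reference: str) -> bool:
    """Strict string equality after trimming surrounding whitespace."""
    return prediction.strip() == reference.strip()

def syntax_match(prediction: str, reference: str) -> bool:
    """Simplified proxy: do the two snippets parse to identical ASTs?"""
    try:
        pred = ast.dump(ast.parse(prediction), annotate_fields=False)
        ref = ast.dump(ast.parse(reference), annotate_fields=False)
    except SyntaxError:
        return False
    return pred == ref
```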
## Citation

**BibTeX:**

```bibtex
@misc{akshay2025nlcodegen,
  title={Natural Language to Code Generation (Fine-tuned CodeGen-350M)},
  author={Akshay Bharadwaj},
  year={2025},
  howpublished={\url{https://huggingface.co/akshayb/nl-code-gen-python}}
}
```
### Framework versions

- PEFT 0.7.2.dev0