UniXcoder for Code Vulnerability Detection

Model Summary

This model is a fine-tuned version of Microsoft's UniXcoder for detecting vulnerabilities in C/C++ code. It was trained on the DetectVul/devign dataset and reaches 68.34% accuracy and a 62.14% F1 score. Given a code snippet, it predicts one of two classes: safe (0) or vulnerable (1).

Model Details

Developed by: mahdin70 (Mukit Mahdin)
Finetuned from: microsoft/unixcoder-base
Language(s): English (for code comments & metadata), C/C++
License: MIT
Task: Code vulnerability detection
Dataset Used: DetectVul/devign
Architecture: Transformer-based sequence classification

Uses

Direct Use

This model can be used for static code analysis, security audits, and automatic vulnerability detection in software repositories. It is useful for:

Developers: To analyze their code for potential security flaws.
Security Teams: To scan repositories for known vulnerabilities.
Researchers: To study AI-powered vulnerability detection.

Downstream Use

This model can be integrated into IDE plugins, CI/CD pipelines, or security scanners to provide real-time vulnerability detection.
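
As a rough sketch of the CI/CD case, the script below walks a source tree, passes each C/C++ file to a classifier, and reports a non-zero exit code if anything is flagged. The names `scan_tree`, `exit_code`, and `C_EXTENSIONS` are hypothetical and not part of this model's API. The `classify` argument stands in for a function that wraps the inference code from this card and returns 1 for vulnerable and 0 for safe. Passing whole files is a simplification, since the card describes the input as a code snippet.

```python
from pathlib import Path
from typing import Callable, List

# Hypothetical extension list; adjust to the repository being scanned.
C_EXTENSIONS = {".c", ".h", ".cc", ".cpp", ".hpp"}


def scan_tree(root: str, classify: Callable[[str], int]) -> List[Path]:
    """Return the C/C++ files under `root` that `classify` labels 1 (vulnerable)."""
    flagged = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in C_EXTENSIONS:
            source = path.read_text(encoding="utf-8", errors="replace")
            if classify(source) == 1:
                flagged.append(path)
    return flagged


def exit_code(flagged: List[Path]) -> int:
    """Non-zero when anything was flagged, so a CI job can fail on it."""
    return 1 if flagged else 0
```

In a pipeline step, the result of `exit_code(scan_tree(".", classify))` could be passed to `sys.exit`. Flagged files would still need human review before a build is blocked on them.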

Out-of-Scope Use

The model is not meant to replace human security experts.
It may not generalize well to languages other than C/C++.
False positives/negatives may occur due to dataset limitations.

Bias, Risks, and Limitations

False Positives & False Negatives: The model may flag safe code as vulnerable or miss actual vulnerabilities.
Limited to C/C++: The model was trained on a dataset primarily composed of C and C++ code. It may not perform well on other languages.
Dataset Bias: The training data may not cover all possible vulnerabilities.
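
The trade-off between false positives and false negatives can be made concrete by thresholding the predicted probability of class 1 instead of taking the argmax. The sketch below uses made-up probabilities and labels, not results from this model, and `precision_recall` is a hypothetical helper.

```python
from typing import List, Tuple


def precision_recall(probs: List[float], labels: List[int], threshold: float) -> Tuple[float, float]:
    """Precision and recall for class 1 when flagging every prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative data only: P(vulnerable) for six snippets and their true labels.
probs = [0.95, 0.80, 0.60, 0.45, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

strict = precision_recall(probs, labels, threshold=0.7)   # flags fewer snippets
lenient = precision_recall(probs, labels, threshold=0.4)  # flags more snippets
```

On this toy data the stricter threshold gives higher precision and lower recall, and the lenient one does the reverse. Which side to favor depends on how costly a missed vulnerability is compared with a false alarm.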

Recommendations

Users should not rely solely on the model for security assessments. Instead, it should be used alongside manual code review and static analysis tools.

How to Get Started with the Model

Use the code below to load the model and run inference on a sample code snippet:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the base tokenizer and the fine-tuned classifier
tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModelForSequenceClassification.from_pretrained("mahdin70/unixcoder-code-vulnerability-detector")

# Sample code snippet
code_snippet = """
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input); // Potential buffer overflow
}
"""

# Tokenize the input
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    # Convert logits to class probabilities, then pick the most likely class
    probabilities = torch.softmax(outputs.logits, dim=-1)
    predicted_label = torch.argmax(probabilities, dim=-1).item()  # 0 = safe, 1 = vulnerable

# Output the result
print("Vulnerable Code" if predicted_label == 1 else "Safe Code")
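
Because the tokenizer call above sets `truncation=True` with `max_length=512`, anything past the first 512 tokens of a long snippet is dropped before classification. One possible workaround, not described by the model author, is to split the token ids into overlapping windows, classify each window, and flag the snippet if any window is flagged. The helper below is a hypothetical sketch of the windowing step only. The default window size of 510 assumes the tokenizer adds two special tokens per input, and the stride is an arbitrary choice.

```python
from typing import List


def sliding_windows(token_ids: List[int], size: int = 510, stride: int = 255) -> List[List[int]]:
    """Split token_ids into overlapping windows of at most `size` tokens."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if len(token_ids) <= size:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + size])
        if start + size >= len(token_ids):
            break  # this window reached the end of the input
        start += stride
    return windows
```

Note that the model was fine-tuned on truncated inputs, so predictions on windows that start mid-function may be less reliable than predictions on complete snippets.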

Training Details

Training Data

Dataset: DetectVul/devign
Classes: 0 (Safe), 1 (Vulnerable)
Size: 17483 code snippets

Training Procedure

Optimizer: AdamW
Loss Function: Cross-Entropy Loss
Batch Size: 8
Learning Rate: 2e-5
Epochs: 3
Hardware Used: 2x T4 GPU
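
For reference, the cross-entropy loss for a single two-class example can be computed by hand as shown below. This is a plain-Python illustration of the standard formula, not the training code used for this model, and `cross_entropy` is a hypothetical helper name.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Negative log-probability of the true class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


uninformed = cross_entropy([0.0, 0.0], label=1)  # equal logits give ln(2)
confident = cross_entropy([-2.0, 2.0], label=1)  # correct and confident: small loss
```

A classifier that assigns equal probability to both classes scores ln 2, roughly 0.693 per example, which is a useful baseline when reading the loss values reported under Metrics.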

Metrics

Metric	Score
Train Loss	0.4835
Evaluation Loss	0.6855
Accuracy	68.34%
F1 Score	62.14%
Precision	69.18%
Recall	56.40%
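
F1 is the harmonic mean of precision and recall, so the reported values can be cross-checked with a few lines of Python. The inputs are taken from the table above; `harmonic_f1` is a hypothetical helper, not a library call.

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = harmonic_f1(0.6918, 0.5640)
print(f"{f1:.4f}")  # prints 0.6214, consistent with the reported F1 Score
```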

Environmental Impact

Factor	Value
GPU Used	2x T4 GPU
Training Time	~1 hour


Model size: 0.1B params
Tensor type: F32 (Safetensors)
