---
library_name: transformers
tags:
- C/C++
- Code
- Vulnerability
- Detection
datasets:
- DetectVul/devign
language:
- en
base_model:
- microsoft/unixcoder-base
---

## UniXcoder for Code Vulnerability Detection

## Model Summary

This model is a fine-tuned version of **Microsoft's UniXcoder**, optimized for detecting vulnerabilities in C/C++ code. It is trained on the **DetectVul/devign** dataset and achieves **68.34% accuracy** with an **F1 score of 62.14%**. The model takes a code snippet as input and classifies it as either **safe (0)** or **vulnerable (1)**.

## Model Details

- **Developed by:** mahdin70 (Mukit Mahdin)
- **Finetuned from:** `microsoft/unixcoder-base`
- **Language(s):** English (for code comments & metadata), C/C++
- **License:** MIT
- **Task:** Code vulnerability detection
- **Dataset Used:** `DetectVul/devign`
- **Architecture:** Transformer-based sequence classification

## Uses

### Direct Use

This model can be used for **static code analysis**, security audits, and automatic vulnerability detection in software repositories. It is useful for:

- **Developers**: To check their code for potential security flaws.
- **Security Teams**: To scan repositories for potentially vulnerable code.
- **Researchers**: To study machine-learning-based vulnerability detection.

### Downstream Use

This model can be integrated into **IDE plugins**, **CI/CD pipelines**, or **security scanners** to provide real-time vulnerability detection.
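
As an illustration of the CI/CD case, here is a minimal sketch of a pipeline gate. It is not part of the released model: the names `find_sources`, `scan`, and `main` are hypothetical, the extension list is an assumption, and `classify` stands for any callable that maps source text to `0` (safe) or `1` (vulnerable), for example a thin wrapper around this model's inference code. The sketch classifies whole files for brevity; since the model was trained on code snippets, a real integration would probably split files into smaller units first.

```python
from pathlib import Path
from typing import Callable, Iterable, List

# File extensions treated as C/C++ sources (an assumption; adjust per project)
C_EXTENSIONS = {".c", ".h", ".cc", ".cpp", ".cxx", ".hpp"}


def find_sources(root: Path) -> List[Path]:
    """Return all C/C++ source and header files under root, sorted by path."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in C_EXTENSIONS
    )


def scan(paths: Iterable[Path], classify: Callable[[str], int]) -> List[Path]:
    """Return the files for which classify() reports label 1 (vulnerable)."""
    flagged = []
    for path in paths:
        code = path.read_text(encoding="utf-8", errors="replace")
        if classify(code) == 1:
            flagged.append(path)
    return flagged


def main(root: str, classify: Callable[[str], int]) -> int:
    """Print flagged files and return an exit code (1 if any file was flagged)."""
    flagged = scan(find_sources(Path(root)), classify)
    for path in flagged:
        print(f"possible vulnerability: {path}")
    return 1 if flagged else 0
```

In a pipeline step, calling `sys.exit(main(".", classify))` would fail the job whenever at least one file is flagged. Whether a flag should block a merge or only open a review item is a policy choice, given the error rates reported in this card.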

### Out-of-Scope Use

- The model is **not meant to replace human security experts**.
- It may not generalize well to **languages other than C/C++**.
- False positives and false negatives may occur due to dataset limitations.

## Bias, Risks, and Limitations

- **False Positives & False Negatives:** The model may flag safe code as vulnerable or miss actual vulnerabilities.
- **Limited to C/C++:** The model was trained on a dataset primarily composed of **C and C++ code**. It may not perform well on other languages.
- **Dataset Bias:** The training data may not cover all possible vulnerabilities.

### Recommendations

Users should **not rely solely on the model** for security assessments. Instead, it should be used alongside **manual code review and static analysis tools**.
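
One way to act on this recommendation is to route predictions by confidence instead of taking a hard argmax. The sketch below is an assumption, not part of the released model: the function `triage` and both threshold values are hypothetical and would need tuning on held-out data. It takes the two logits the classifier produces (index 0 = safe, index 1 = vulnerable) and sends uncertain cases to a human reviewer.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(logits: Sequence[float],
           review_threshold: float = 0.3,
           flag_threshold: float = 0.8) -> str:
    """Map [safe_logit, vulnerable_logit] to 'pass', 'review', or 'flag'.

    Both thresholds are illustrative placeholders, not tuned values.
    """
    p_vulnerable = softmax(logits)[1]
    if p_vulnerable >= flag_threshold:
        return "flag"
    if p_vulnerable >= review_threshold:
        return "review"
    return "pass"
```

Lowering `review_threshold` sends more borderline code to manual review, which trades reviewer time for fewer missed vulnerabilities.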

## How to Get Started with the Model

Use the code below to load the model and run inference on a sample code snippet:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the base UniXcoder tokenizer and the fine-tuned classifier
tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModelForSequenceClassification.from_pretrained("mahdin70/unixcoder-code-vulnerability-detector")
model.eval()  # inference mode (disables dropout)

# Sample code snippet
code_snippet = """
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input); // Potential buffer overflow
}
"""

# Tokenize the input (inputs longer than 512 tokens are truncated)
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities and pick the most likely label
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_label = torch.argmax(predictions, dim=1).item()

# Output the result
print("Vulnerable Code" if predicted_label == 1 else "Safe Code")
```
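
Because the example truncates at 512 tokens, anything past that point in a long function is never seen by the model. One possible workaround, sketched here as an assumption and not as the author's method, is to split the token IDs into overlapping windows and classify each window separately. The helper names, the default window of 510 (chosen on the assumption that the tokenizer adds two special tokens), and the "any window flagged" aggregation rule are all hypothetical.

```python
from typing import List, Sequence


def sliding_windows(token_ids: Sequence[int],
                    window: int = 510,
                    stride: int = 255) -> List[List[int]]:
    """Split token_ids into overlapping windows of at most `window` tokens.

    With stride <= window, every token appears in at least one window.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(token_ids) <= window:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # this window reached the end of the sequence
        start += stride
    return windows


def any_vulnerable(window_labels: Sequence[int]) -> int:
    """Hypothetical aggregation: report 1 if any window was labelled 1."""
    return int(any(label == 1 for label in window_labels))
```

Each window would still need the tokenizer's special tokens added before being passed to the model. Note that the "any window" rule is likely to raise the false-positive rate, and the model was not trained on partial functions, so results on windows should be treated with extra caution.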

## Training Details

### Training Data

- **Dataset:** `DetectVul/devign`
- **Classes:** `0 (Safe)`, `1 (Vulnerable)`
- **Size:** 17,483 code snippets

### Training Procedure

- **Optimizer:** AdamW
- **Loss Function:** Cross-Entropy Loss
- **Batch Size:** 8
- **Learning Rate:** 2e-5
- **Epochs:** 3
- **Hardware Used:** 2x T4 GPU
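
To give a sense of scale, the hyperparameters above imply a rough number of optimizer steps. The card does not say whether the batch size of 8 is global or per GPU, whether gradient accumulation was used, or whether all 17,483 snippets were used for training, so the sketch below computes both readings under those stated assumptions. The function `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples: int, batch_size: int, epochs: int,
                    num_devices: int = 1) -> int:
    """Total optimizer steps, assuming no gradient accumulation."""
    effective_batch = batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs


# Reading 1: batch size 8 is the global batch size
print(optimizer_steps(17483, 8, 3))                  # 6558
# Reading 2: batch size 8 per GPU on 2 GPUs (effective batch size 16)
print(optimizer_steps(17483, 8, 3, num_devices=2))   # 3279
```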

### Metrics

| Metric | Score |
|------------|-------------|
| **Train Loss** | 0.4835 |
| **Evaluation Loss** | 0.6855 |
| **Accuracy** | 68.34% |
| **F1 Score** | 62.14% |
| **Precision** | 69.18% |
| **Recall** | 56.40% |
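
The F1 score in the table is the harmonic mean of the listed precision and recall, which can be checked directly. The short sketch below only re-derives the reported number from the table; the helper name `f1_score` is ours.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above
f1 = f1_score(0.6918, 0.5640)
print(f"{f1:.2%}")  # 62.14%
```

The recall of 56.40% means that a little under half of the vulnerable samples in the evaluation data were missed, which is the main reason the model should be paired with other review methods.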

## Environmental Impact

| Factor | Value |
|-----------|----------|
| **GPU Used** | 2x T4 GPU |
| **Training Time** | ~1 hour |