---
library_name: transformers
tags:
- C/C++
- Code
- Vulnerability
- Detection
datasets:
- DetectVul/devign
language:
- en
base_model:
- microsoft/unixcoder-base
---

## UniXcoder for Code Vulnerability Detection

## Model Summary

This model is a fine-tuned version of **Microsoft's UniXcoder**, optimized for detecting vulnerabilities in C/C++ code. It is trained on the **DetectVul/devign** dataset and achieves **68.34% accuracy** with an **F1 score of 62.14%**. The model takes a code snippet as input and classifies it as either **safe (0)** or **vulnerable (1)**.

## Model Details

- **Developed by:** mahdin70 (Mukit Mahdin)
- **Finetuned from:** `microsoft/unixcoder-base`
- **Language(s):** English (for code comments & metadata), C/C++
- **License:** MIT
- **Task:** Code vulnerability detection
- **Dataset Used:** `DetectVul/devign`
- **Architecture:** Transformer-based sequence classification

## Uses

### Direct Use

This model can be used for **static code analysis**, security audits, and automatic vulnerability detection in software repositories. It is useful for:

- **Developers**: To analyze their code for potential security flaws.
- **Security Teams**: To scan repositories for known vulnerabilities.
- **Researchers**: To study vulnerability detection in AI-powered systems.

### Downstream Use

This model can be integrated into **IDE plugins**, **CI/CD pipelines**, or **security scanners** to provide real-time vulnerability detection; a minimal scanning sketch follows the recommendations below.

### Out-of-Scope Use

- The model is **not meant to replace human security experts**.
- It may not generalize well to **languages other than C/C++**.
- False positives/negatives may occur due to dataset limitations.

## Bias, Risks, and Limitations

- **False Positives & False Negatives:** The model may flag safe code as vulnerable or miss actual vulnerabilities.
- **Limited to C/C++:** The model was trained on a dataset composed primarily of **C and C++ code**. It may not perform well on other languages.
- **Dataset Bias:** The training data may not cover all possible vulnerabilities.

### Recommendations

Users should **not rely solely on the model** for security assessments. Instead, it should be used alongside **manual code review and static analysis tools**.
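As a rough illustration of the downstream use described above, the sketch below batch-scans the C/C++ files in a repository and flags those the model classifies as vulnerable. The `scan_directory` helper and the 0.5 score threshold are illustrative assumptions, not an API shipped with this model.

```python
# Hypothetical sketch: batch-scanning C/C++ files in a CI job.
# scan_directory and the 0.5 threshold are illustrative assumptions,
# not an official API shipped with this model.
from pathlib import Path

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "mahdin70/unixcoder-code-vulnerability-detector"
)
model.eval()


def scan_directory(root: str) -> list[tuple[str, float]]:
    """Return (path, vulnerability probability) for flagged C/C++ files under root."""
    flagged = []
    for path in Path(root).rglob("*"):
        if path.suffix not in {".c", ".cc", ".cpp", ".h", ".hpp"}:
            continue
        code = path.read_text(errors="ignore")
        inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        vuln_prob = probs[0, 1].item()  # probability of class 1 (vulnerable)
        if vuln_prob >= 0.5:  # assumed threshold; tune for your false-positive budget
            flagged.append((str(path), vuln_prob))
    return flagged


if __name__ == "__main__":
    for file_path, score in scan_directory("src"):
        print(f"{file_path}: vulnerable with probability {score:.2f}")
```

Flagged files should then go to manual review, in line with the recommendations above.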
## How to Get Started with the Model

Use the code below to load the model and run inference on a sample code snippet:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer from the base model and the fine-tuned classifier
tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModelForSequenceClassification.from_pretrained("mahdin70/unixcoder-code-vulnerability-detector")

# Sample code snippet
code_snippet = """
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input);  // Potential buffer overflow
}
"""

# Tokenize the input
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_label = torch.argmax(predictions, dim=1).item()

# Output the result
print("Vulnerable Code" if predicted_label == 1 else "Safe Code")
```

## Training Details

### Training Data

- **Dataset:** `DetectVul/devign`
- **Classes:** `0 (Safe)`, `1 (Vulnerable)`
- **Size:** 17,483 code snippets

### Training Procedure

- **Optimizer:** AdamW
- **Loss Function:** Cross-Entropy Loss
- **Batch Size:** 8
- **Learning Rate:** 2e-5
- **Epochs:** 3
- **Hardware Used:** 2x T4 GPUs

### Metrics

| Metric              | Score  |
|---------------------|--------|
| **Train Loss**      | 0.4835 |
| **Evaluation Loss** | 0.6855 |
| **Accuracy**        | 68.34% |
| **F1 Score**        | 62.14% |
| **Precision**       | 69.18% |
| **Recall**          | 56.40% |

## Environmental Impact

| Factor            | Value      |
|-------------------|------------|
| **GPU Used**      | 2x T4 GPUs |
| **Training Time** | ~1 hour    |
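For reference, the training procedure above can be approximated with the standard `Trainer` API. The sketch below is a minimal, unofficial reconstruction of the fine-tuning setup, not the author's original script; the `func` and `target` column names are assumptions about the DetectVul/devign schema and may need adjusting.

```python
# Minimal, unofficial sketch of the fine-tuning setup described above.
# Assumptions (not confirmed by the model card): the DetectVul/devign dataset
# exposes a code column named "func" and an integer label column named "target".
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("DetectVul/devign")
tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/unixcoder-base", num_labels=2  # 0 = safe, 1 = vulnerable
)

def preprocess(batch):
    enc = tokenizer(batch["func"], truncation=True, padding="max_length", max_length=512)
    enc["labels"] = batch["target"]
    return enc

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)

# Hyperparameters from the Training Procedure section; AdamW and cross-entropy
# loss are the Trainer defaults for sequence classification.
args = TrainingArguments(
    output_dir="unixcoder-vuln-detector",
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    num_train_epochs=3,
)

eval_split = "validation" if "validation" in tokenized else "test"  # assumed split name
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized[eval_split],
)
trainer.train()
print(trainer.evaluate())
```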