---
license: apache-2.0
datasets:
- Shuu12121/rust-codesearch-dataset-open
- Shuu12121/java-codesearch-dataset-open
language:
- en
pipeline_tag: sentence-similarity
tags:
- code
- code-search
- retrieval
- sentence-similarity
- bert
- transformers
- deep-learning
- machine-learning
- nlp
- programming
- multi-language
- rust
- python
- java
- javascript
- php
- ruby
- go
---


# **CodeModernBERT-Owl**

## **概要 / Overview**

### **🦉 CodeModernBERT-Owl: 高精度なコード検索 & コード理解モデル / A High-Accuracy Code Search & Code Understanding Model**
**CodeModernBERT-Owl** is a **pretrained model** designed from scratch for **code search and code understanding tasks**.  

Compared to previous versions such as **CodeHawks-ModernBERT** and **CodeMorph-ModernBERT**, this model **now supports Rust** and **improves search accuracy** in Python, PHP, Java, JavaScript, Go, and Ruby.  

### **🛠 主な特徴 / Key Features**
✅ **Supports long sequences up to 2048 tokens** (compared to Microsoft's 512-token models)
✅ **Optimized for code search, code understanding, and code clone detection**
✅ **Fine-tuned on GitHub open-source repositories (Java, Rust)**
✅ **Achieves the highest accuracy among the CodeHawks/CodeMorph series**
✅ **Multi-language support**: **Python, PHP, Java, JavaScript, Go, Ruby, and Rust**

---

## **📊 モデルパラメータ / Model Parameters**
| パラメータ / Parameter   | 値 / Value |
|-------------------------|------------|
| **vocab_size**         | 50,000      |
| **hidden_size**        | 768        |
| **num_hidden_layers**  | 12         |
| **num_attention_heads**| 12         |
| **intermediate_size**  | 3,072      |
| **max_position_embeddings** | 2,048 |
| **type_vocab_size**    | 2          |
| **hidden_dropout_prob**| 0.1        |
| **attention_probs_dropout_prob** | 0.1 |
| **local_attention_window** | 128 |
| **rope_theta**         | 160,000    |
| **local_attention_rope_theta** | 10,000 |

---

## **💻 モデルの使用方法 / How to Use**
This model can be easily loaded using the **Hugging Face Transformers** library.  

⚠️ **Requires `transformers >= 4.48.0`**  

🔗 **[Colab Demo](https://github.com/Shun0212/CodeBERTPretrained/blob/main/UseMyCodeMorph_ModernBERT.ipynb)** (replace the model name in the notebook with `"Shuu12121/CodeModernBERT-Owl"`)

### **モデルのロード / Load the Model**
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Owl")
```

### **コード埋め込みの取得 / Get Code Embeddings**
```python
import torch

# Move the model to the target device before calling get_embedding;
# the original snippet omitted this step and would fail on CUDA inputs.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def get_embedding(text, model, tokenizer, device="cuda"):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    # ModernBERT does not use token_type_ids, so drop them if the tokenizer adds them
    if "token_type_ids" in inputs:
        inputs.pop("token_type_ids")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    # Run only the base encoder (model.model) to obtain the hidden states
    outputs = model.model(**inputs)
    # Use the first ([CLS]) token's hidden state as the sequence embedding
    embedding = outputs.last_hidden_state[:, 0, :]
    return embedding

embedding = get_embedding("def my_function(): pass", model, tokenizer, device=device)
print(embedding.shape)  # torch.Size([1, 768])
```
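With embeddings in hand, code search reduces to ranking candidate snippets by cosine similarity against the query embedding. A minimal sketch of that ranking step, using small hand-made tensors in place of real 768-dimensional vectors from `get_embedding` (the helper name is illustrative, not part of the model's API):

```python
import torch
import torch.nn.functional as F

def rank_candidates(query_emb, candidate_embs):
    """Return candidate indices sorted by cosine similarity to the query."""
    # query_emb: (1, hidden), candidate_embs: (n, hidden); broadcasting gives (n,)
    sims = F.cosine_similarity(query_emb, candidate_embs)
    return sims.argsort(descending=True).tolist(), sims

# Toy 4-dimensional vectors standing in for real code embeddings
query = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
candidates = torch.tensor([
    [0.0, 1.0, 0.0, 1.0],  # orthogonal to the query
    [1.0, 0.0, 1.0, 0.1],  # nearly identical to the query
    [0.5, 0.5, 0.5, 0.5],  # partially similar
])

order, sims = rank_candidates(query, candidates)
print(order)  # → [1, 2, 0]: most similar candidate first
```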

---

## **🔍 評価結果 / Evaluation Results**

### **データセット / Dataset**
📌 **Tested on `code_x_glue_ct_code_to_text` with a candidate pool size of 100.**  
📌 **Rust-specific evaluations were conducted using `Shuu12121/rust-codesearch-dataset-open`.**  
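The MRR and Recall@K figures reported below are derived from the 1-based rank of the correct snippet within each 100-candidate pool. A minimal sketch of that computation (the helper name is illustrative, not taken from the actual evaluation code):

```python
def retrieval_metrics(ranks, k=10):
    """Compute MRR and Recall@k from 1-based ranks of the correct candidate."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    recall_at_k = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, recall_at_k

# Toy example: correct snippet ranked 1st, 3rd, and 12th in three pools
mrr, recall = retrieval_metrics([1, 3, 12], k=10)
print(round(mrr, 4), round(recall, 4))  # → 0.4722 0.6667
```

Note that with a candidate pool of 100, Recall@100 is 1.0 by construction, as in the Rust table below.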

---

## **📈 主要な評価指標の比較(同一シード値)/ Key Evaluation Metrics (Same Seed)**
| 言語 / Language | **CodeModernBERT-Owl** | **CodeHawks-ModernBERT** | **Salesforce CodeT5+** | **Microsoft CodeBERT** | **GraphCodeBERT** |
|-----------|-----------------|----------------------|-----------------|------------------|------------------|
| **Python**     | **0.8793**  | 0.8551  | 0.8266  | 0.5243  | 0.5493  |
| **Java**       | **0.8880**  | 0.7971  | 0.8867  | 0.3134  | 0.5879  |
| **JavaScript** | **0.8423**  | 0.7634  | 0.7628  | 0.2694  | 0.5051  |
| **PHP**        | **0.9129**  | 0.8578  | 0.9027  | 0.2642  | 0.6225  |
| **Ruby**       | **0.8038**  | 0.7469  | 0.7568  | 0.3318  | 0.5876  |
| **Go**         | **0.9386**  | 0.9043  | 0.8117  | 0.3262  | 0.4243  |

✅ **Achieves the highest accuracy in all target languages.**
✅ **Significantly improved Java accuracy using additional fine-tuned GitHub data.**
✅ **Outperforms previous models, especially in PHP and Go.**

---

## **📊 Rust (独自データセット) / Rust Performance**
| 指標 / Metric | **CodeModernBERT-Owl** |
|--------------|----------------|
| **MRR**      | 0.7940 |
| **MAP**      | 0.7940 |
| **R-Precision** | 0.7173 |

### **📌 K別評価指標 / Evaluation Metrics by K**
| K  | **Recall@K** | **Precision@K** | **NDCG@K** | **F1@K** | **Success Rate@K** | **Query Coverage@K** |
|----|-------------|---------------|------------|--------|-----------------|-----------------|
| **1**   | 0.7173  | 0.7173  | 0.7173  | 0.7173  | 0.7173  | 0.7173  |
| **5**   | 0.8913  | 0.7852  | 0.8118  | 0.8132  | 0.8913  | 0.8913  |
| **10**  | 0.9333  | 0.7908  | 0.8254  | 0.8230  | 0.9333  | 0.9333  |
| **50**  | 0.9887  | 0.7938  | 0.8383  | 0.8288  | 0.9887  | 0.9887  |
| **100** | 1.0000  | 0.7940  | 0.8401  | 0.8291  | 1.0000  | 1.0000  |

---

## **📝 結論 / Conclusion**
✅ **Top performance in all languages**
✅ **Rust support successfully added through dataset augmentation**
✅ **Further performance improvements possible with better datasets**

---

## **📜 ライセンス / License**
📄 **Apache 2.0**  

## **📧 連絡先 / Contact**
📩 **For any questions, please contact:**  
📧 **[email protected]**